How to build Talend ETL custom components in OOP way – part 1

Talend Open Studio is a very nice open and scalable platform for data integration. This Eclipse-based application is made of components, each of them performing a particular data processing task. Although it is not very well documented by the vendor (but luckily someone else filled the gap), a perspective is provided to let users build new components. The development flow is really primitive, and even the majority of the official out-of-the-box TOS components are written in a procedural coding style. This is not a failure in the true sense of the word, as this is the most common style in ETL environments for historical reasons, dating from when speed was of the essence. But in the new world of Big Data, NoSQL and graph databases, and with modern hardware, this approach is totally inadequate. In the past, I presented a way to build components in a Maven-aided environment. Starting from that, I built an OOP framework that aims to push and help the developer to build components in an object-oriented fashion.

In this first article I would like to show you the basic idea behind my framework and the advantages you could obtain by using it in terms of coding speed and code reliability. In the following articles we’ll go further, analyzing in detail the APIs and a couple of components I made with them. You’ll be surprised at how easy it is to build a component using design patterns!

At any time, please refer to the javadoc. You can also fork or download the source code if you want to jump quickly into the action. The code is quite well commented and should be easy to read.

The idea behind the ETL framework

If you have a look at the source code generated for a Talend job, you’ll see how poor this piece of code is. Components are not instances of classes, there is no local scope for variables inside data flows, no namespaces of course, and rows carrying data are translated to a plethora of Struct classes with only public fields (one for each column of the connection itself). Basically, a Talend component is a set of javajet files that act as templates to generate tons of inline Java code. In this hell, scattered here and there, you find data logic, data processing, data transport between connections and between components, and so on.

I tried to analyze carefully why, in my opinion, this design has huge flaws, and these are my conclusions.

Talend rows are flat

There is nothing wrong with that, as Talend moves data using heterogeneous vectors and vectors are flat by definition. All component input/output interfaces are made of vectors and, again, this is the way the ETL world works.

But what if a component needs to capture more complex relationships between data to do its internal processing duties? What if you have to, for example, download an open graph from Facebook and project it onto a set of database tables? Your component must implement a relation layer and put it inline in the generated code. This is really time-consuming and not very elegant.
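To give an idea of what such a relation layer involves, here is a minimal sketch of an in-memory graph fed by flat (source, target) rows. The class and method names (`GraphModel`, `addRelation`, `neighbours`) are hypothetical and are not taken from Talend or from the framework; the sketch assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an internal graph model fed by flat rows. */
class GraphModel {
    // node id -> node attributes (empty until someone fills them)
    final Map<String, Map<String, Object>> nodes = new LinkedHashMap<>();
    // node id -> ids of the nodes it points to
    final Map<String, List<String>> edges = new LinkedHashMap<>();

    /** Data interface: maps one flat (source, target) row onto the graph. */
    void addRelation(String sourceId, String targetId) {
        nodes.computeIfAbsent(sourceId, id -> new LinkedHashMap<>());
        nodes.computeIfAbsent(targetId, id -> new LinkedHashMap<>());
        edges.computeIfAbsent(sourceId, id -> new ArrayList<>()).add(targetId);
    }

    /** Returns the outgoing neighbours of a node, or an empty list if it has none. */
    List<String> neighbours(String nodeId) {
        return edges.getOrDefault(nodeId, Collections.<String>emptyList());
    }
}
```

Even this toy version is code that every component would otherwise have to carry inline, which is the point of moving it into a shared library.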

  • So the assumption was: even if the input/output stages are vectorized, components may still need a more general way to manage data in their internal processing stages. A framework must provide this abstract data model, so the component designer just needs to build a data interface that maps data from the external stage to the internal abstract representation, while most of the data processing can be standardized and decoupled from Talend vectors. In other words: I need a graph data model.

Talend rows are limited and made at compile-time

A component has a couple of incoming and outgoing connections, each of them with a set of columns determined at compile-time. What if a component needs, let’s say, a buffer table with data for some processing duties? Or extra rows? Or extra column metadata? Again, everything has to be implemented by the component itself, reinventing the wheel thousands of times, adding more inline code and making the generated code even dirtier. Slow. And sad.

  • My idea was: what if my kernel data model framework let you build any number of data structures you may need? You could map a structure one-to-one to some Talend vectors, but it can also stand on its own feet, ready for the component’s needs. And these structures can be built at runtime, of course.

Talend rows have no memory

Suppose we need to build a component that calculates a moving average of a particular column in the incoming set of rows. Even if it’s a trivial task, coding this in Talend is really a pain. Your component must implement buffering logic (to store data somewhere), perhaps a FIFO strategy with a parametric threshold, and a rolling strategy at least. This is a huge amount of code to write for such a trivial payload (the moving average calculation).
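For reference, the buffering, FIFO and rolling logic described above can be sketched in plain Java as follows. `MovingAverage` is a hypothetical helper written for this illustration, not a Talend or framework class; the window size plays the role of the parametric threshold.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: a moving average over the last N values seen. */
class MovingAverage {
    private final int window;                           // parametric threshold: max values kept
    private final Deque<Double> buffer = new ArrayDeque<>();
    private double sum = 0.0;

    MovingAverage(int window) {
        if (window < 1) {
            throw new IllegalArgumentException("window must be >= 1");
        }
        this.window = window;
    }

    /** Stores the new value, evicts the oldest one if needed, returns the current average. */
    double add(double value) {
        buffer.addLast(value);
        sum += value;
        if (buffer.size() > window) {
            sum -= buffer.removeFirst();                // FIFO rolling strategy
        }
        return sum / buffer.size();
    }

    public static void main(String[] args) {
        MovingAverage average = new MovingAverage(3);
        for (double value : new double[] {10, 20, 30, 40}) {
            System.out.println(average.add(value));     // prints 10.0, 15.0, 20.0, 30.0
        }
    }
}
```

In a regular Java project this is a small, testable class; the pain comes from having to emit the same logic as inline generated code from a javajet template.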

  • The conclusion was obvious: the framework data model must be able to store more than one row at once. Developers must not be limited to storing just one row inside the component if they need more. If you don’t need memory, you can always limit the stack size to one and build a data follower, but it cannot be a constraint. And, of course, this must be handled by the framework itself.

Talend rows have no rules

Even if Talend metadata allows you to flag a column as a key, or to set a column default, this is, as I said, just metadata, just a tag. It has no relation to the data in the true sense of the word: it doesn’t check the data.

Just think about it. How could you enforce the concept of a primary key if the current data row has no memory of which rows came before it?
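A minimal sketch of what checking a key on the data itself could look like, assuming the rows are kept in a buffer that remembers the keys it has already seen. `KeyedRowBuffer` and its methods are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: a row buffer that enforces a primary key on the data itself. */
class KeyedRowBuffer {
    private final int keyIndex;                              // position of the key column
    private final Set<Object> seenKeys = new HashSet<>();    // the "memory" of previous rows
    private final List<Object[]> rows = new ArrayList<>();

    KeyedRowBuffer(int keyIndex) {
        this.keyIndex = keyIndex;
    }

    /** Returns true if the row was accepted, false if its key is null or already present. */
    boolean offer(Object... row) {
        Object key = row[keyIndex];
        if (key == null || !seenKeys.add(key)) {
            return false;
        }
        rows.add(row);
        return true;
    }

    int size() {
        return rows.size();
    }
}
```

With this buffer, `offer(1, "Alice")` followed by `offer(1, "Carol")` accepts the first row and rejects the second, something a key flag in the schema metadata cannot do on its own.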

  • Sometimes you need to add constraints to your data, for example a primary key condition, even if you don’t plan to store it in a structure that handles it natively (like an RDBMS). The framework must implement a set of basic building blocks that every data model should have: primary keys, default values, commit and rollback processes and so on. And, this is important, they must apply at the data level, not at the metadata level.

Data transport is not standardized

If you look at the main javajet file of a component, somewhere you will find a loop that cycles over each column of an outgoing row and fills its value from some computed value or from an equivalent input column. If you look at the generated code, you will see the list of fields output by that loop. This is not great, since it mixes data processing steps with data transport. Data transport should instead be abstracted, so that if you need to change something in data processing you won’t touch the component output stage.
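One common way to get this separation is the visitor pattern: the data model only knows how to walk its rows, and a visitor decides where the values go. The sketch below is an illustration under that assumption; every name in it (`RowVisitor`, `Table`, `CsvOutputVisitor`) is hypothetical and not the framework’s actual API. It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical names throughout; this only illustrates the pattern. */
class VisitorSketch {

    /** A visitor receives one row at a time and knows nothing about how it was computed. */
    interface RowVisitor {
        void visit(Map<String, Object> row);
    }

    /** The data model: holds rows and lets any visitor walk over them. */
    static class Table {
        private final List<Map<String, Object>> rows = new ArrayList<>();

        void addRow(Map<String, Object> row) {
            rows.add(row);
        }

        void accept(RowVisitor visitor) {
            for (Map<String, Object> row : rows) {
                visitor.visit(row);
            }
        }
    }

    /** A transport visitor: copies the selected columns to an output, here CSV lines. */
    static class CsvOutputVisitor implements RowVisitor {
        private final List<String> columns;
        final List<String> lines = new ArrayList<>();

        CsvOutputVisitor(List<String> columns) {
            this.columns = columns;
        }

        @Override
        public void visit(Map<String, Object> row) {
            List<String> cells = new ArrayList<>();
            for (String column : columns) {
                cells.add(String.valueOf(row.get(column)));
            }
            lines.add(String.join(",", cells));
        }
    }
}
```

Changing how rows are computed does not touch `CsvOutputVisitor`, and swapping the output format only means writing another `RowVisitor`.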

  • My idea was to promote the use of visitors as a way to move data inside the abstract data model, but also as interfaces to the external world. This way, I was able to keep the data model separate from the data controller.

Javajet coding is bad, inline code is worse

Any experienced Talend component designer knows well how painful javajet coding is. You don’t have a good IDE, code completion, a decent syntax highlighter, or debugging and testing tools. Last but not least, javajet builds code that goes inline in the Java code. At the end of the day: poor code, long development time.

  • The solution comes from an MVC separation of the data model. Thanks to this design, I’m able to build 99% of the component processing stage in a pure Java environment (using Eclipse, Maven, JUnit and other precious tools). The remaining 1% is still javajet and produces inline Java code, but it’s usually a few dozen lines: just the code needed to init the framework, build the needed data structures and define data interfaces. This means that the data controller part is inside the javajet scope, while the data processing stage (usually the most complex part of a component) is outside, in its own safe set of component JARs.

Next time we’ll go into more detail on how the bridge works. For now, I just wanted to point out the limits of the current component design. Just to be clear, this doesn’t mean you are forced to throw your philosophy away and make a blind switch. Simple components probably don’t need this design. But isn’t this the very basic philosophy of the object-oriented paradigm?

How to use Maven in Talend component development

Recently I started doing some really serious Talend Open Studio component development for my company and for a spinoff of mine, and I realized that a more robust build life-cycle would be great to speed up development and avoid repetitive tasks (developing TOS components is really tedious and boring because of them, actually). Since the Talend team doesn’t seem to agree with this vision and the only other solution apparently available doesn’t satisfy my needs, I opted for a do-it-yourself approach and adopted Maven for the development of all my self-made components.

The idea is basically to provide a skeleton hello-world component supplied with a pom.xml file with a bunch of goals useful for packaging and installing the component. As a very quick start, you just need to clone it from my GitHub repository and start playing, following the provided README. It should be really easy to get it up and running in minutes.

Feel free to fork it, too! I strongly encourage using GitHub for TOS component development and upgrading. But, well, this is another story.

Benefits of using Maven in Talend

The benefits of using Maven are many, as the Apache Software Foundation says, and they all apply to TOS component development, too.

First of all, you can easily integrate tests in your build cycle. Too often, TOS developers simply don’t test their components, and the Talend guidelines on the subject are incomplete and not mainly focused on custom components. But a Java-level test suite is mandatory for every serious development and should be planned with care.

Then, you can decouple your TOS custom component folder from your development environment. It goes without saying that it’s always a good idea to keep dev and prod environments as isolated as possible and to let Maven handle the release process. This skeleton component, for example, lets you specify a target deploy directory (usually the TOS custom component folder) where the component will be copied when a Maven install is triggered.

And finally (and that’s nice, don’t you think?) Maven can help you automatically maintain external dependencies and patch your component XML file accordingly. The idea is: you declare the dependencies you need in the pom.xml file, as in any regular Java application, and during the package phase the <IMPORTS> element in the XML is automatically updated. No need to update it manually.

Since I remember well how painful it is to copy and paste all these external libraries into the XML, keep the version numbers up to date and so on, I decided it’s a nice feature to have.

Relevant build phases and parameters

The build life-cycle is quite simple, as just a few phases make sense in TOS component deployment:

  • package – the internal component extra JAR (if present) and the external dependencies are assembled, all needed resources (javajet files, icon, messages…) are compiled and assembled in the target basedir directory, the XML file is patched to match your dependency tree and some metadata is updated (author name, build timestamp and so on). It basically performs all the steps necessary to build the component but doesn’t copy it to the target folder;
  • install – the packaged component is copied to a specified folder, which is assumed to be the TOS custom component folder;
  • clean – the target dir is cleaned up.

Please note that, for several reasons, a TOS component is not a regular Java artifact. That’s why I disabled the default install process (which by default installs the artifact in the local Maven repository). The provided install phase simply copies the component to the target folder.

You can also specify a bunch of optional command-line properties to tune the build process depending on your needs and your environment:

  • skipTests (default: true) to skip/unskip the test phase;
  • component-author (default: Gabriele Baldassarre…ehm…It’s me) to set the component author in the XML descriptor;
  • addMavenDescription (default: true) to add/cut the META-INF/maven stuff in the client JAR;
  • dependencyVersion (default: true) to add/cut the version number suffix for all external dependencies;
  • useSnapshot (default: true) to use or not use snapshot releases for all external dependencies;
  • componentsFolder (default: $HOME$/talend_components) path to your TOS custom component folder, used during the install phase.

Get the plugin

The skeleton plugin (which just prints a HelloWorld string to the console) is available on GitHub. Feel free to clone and fork it as you wish, and just comment on this post or contact me if you need further support.

Talend Open Studio API component development docset for Dash

As anticipated in my previous post, I built a docset that will dramatically speed up Talend Open Studio component development under Mac OS X, specifically for those using Dash Snippet Manager. This docset, made in the Apple standard format, basically exposes the full set of the Talend Open Studio 5.2 API in a more organic and searchable way. To use it, follow these steps:

  1. Install Dash from this App Store link;
  2. In Preferences > Downloads click the “+” icon and paste the following feed URL: http://extras.gabrielebaldassarre.com/dash/Talend_Open_Studio.xml;
  3. Click the “Download” button to install it automatically. After this operation, the window will look similar to the following picture:
  4. Close the Preferences window. A new entry, Talend Open Studio, should be visible among the installed docsets. Now you can explore the full API set or search using the “tos:” prefix (e.g. “tos:IMetadataColumn”):
  5. Namespaces unfortunately follow Obj-C syntax rather than Java syntax, and I have not been able to fix that yet. I think it’s a bug in my docsetutil configuration; I’ll investigate further. However, this should not reduce the usefulness of the package.

Next steps will include a set of code snippets for Dash, devoted to Talend component development. Have a nice day!

Doxygen/javadoc for Talend Open Studio API for 5.2

Talend Open Studio is a great open source platform, based on Eclipse, to develop ETL and data processing workflows, with an easy-to-learn plugin architecture (although not perfect, in my opinion) and a powerful set of APIs for developers. Even if most of these APIs are not meant for ordinary development (they are for the designer part, and useful to customize the GUI perspective), the remaining part is in the classpath of the javajet preprocessor, so it can be used to compile components. It’s open source, but documentation is not easily available, so I decided to compile, host and maintain doxygen/javadoc documentation of the whole Talend Open Studio API. It’s fully indexed, full-text searchable on the server side, and built from the svn repository for the 5.2 branch.

Browse the documentation

You can find it here: Talend Open Studio API for 5.2

Here you can find the doxygen definition file used to build it, in case you need to tweak it and perhaps host it locally for your needs. Please remember that you need to adjust all paths (including DOT_PATH, for graph generation) accordingly!

The next step is the generation of a docset, useful to speed up development under Mac OS X, perhaps using an assistant like Dash.

Build GML graphs for Social Network Analysis in Talend

I’m attending a Coursera-powered class in Social Network Analysis from the University of Michigan. While the course is really stunning, I had to realize that it’s kind of difficult to obtain enough example datasets to study, especially if you don’t have a solid programming background. Since the course is targeted not at programmers but at data analysts, it sounds really weird to me that my classmates have to wait for someone to extract, clean up and eventually share these datasets. However, network data is usually well structured and low-dimensional, so I think that data integration and manipulation software could be an easier way to prepare these datasets. I chose to do this using Talend Open Studio, as it’s probably the best free data integration platform available nowadays. It obviously doesn’t have a component to build datasets in a format readable by, for example, Gephi or Pajek, but Talend is Eclipse-based, so it was easy for me to build a custom component to write .GML files.

In this tutorial I will guide you through using this component to write graph files, then I’ll explain a 120-second method to build an endless set of valid example network datasets, useful for learning, testing, simulations and so on.

Here’s the screenshot of the Talend job that does the magic. Please remember that you need to download my component and copy the unpacked folder into the Talend Open Studio plugins folder before starting!

How it works

A typical Graph Modeling Language file is generally made of two parts:

  • A node list, describing every node of the network. Each node must have at least an ID and a label, and can have zero or more node attributes (usually literals or numbers). If an attribute is missing, it must be included anyway in the graph with a NULL value. This structure is a perfect fit for a tabular data frame, and we propagate it in the first branch (the one labeled row3/out1)
  • An edge list, describing relations between nodes. While this is not technically mandatory, a list of unconnected nodes is quite useless for SNA studies, so it makes sense to always provide it as a reference (or lookup) connection. In the screenshot, it’s the branch named row4/row5. The edge definition vector must have at least two columns describing the source and target nodes, but can have weight coefficients and labels, too.

(Please remember that Talend subjobs can have as many lookup connections as you desire, but just one main connection that sets the pace. I decided to design the component using the nodes data flow as the main one.)
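To make the two-part layout concrete, here is a small sketch that renders a node list and an edge list as GML text. It is not the source of tFileOutputGML: `GmlSketch` and its methods are hypothetical, it only covers the `graph`, `directed`, `node`, `edge`, `id`, `source` and `target` keys plus free attributes, and it writes numbers bare and everything else as a quoted string.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; this is not how tFileOutputGML is implemented. */
class GmlSketch {

    /** Renders one attribute line: numbers are written bare, everything else is quoted. */
    static String attribute(String name, Object value) {
        if (value instanceof Number) {
            return "    " + name + " " + value + "\n";
        }
        // Double quotes cannot appear inside a GML string, so they are written as an entity.
        return "    " + name + " \"" + String.valueOf(value).replace("\"", "&quot;") + "\"\n";
    }

    /** nodes: id -> attributes (including "label"); edges: {source, target} pairs. */
    static String toGml(boolean directed, Map<Integer, Map<String, Object>> nodes, List<int[]> edges) {
        StringBuilder gml = new StringBuilder("graph [\n");
        gml.append("  directed ").append(directed ? 1 : 0).append("\n");
        for (Map.Entry<Integer, Map<String, Object>> node : nodes.entrySet()) {
            gml.append("  node [\n").append(attribute("id", node.getKey()));
            for (Map.Entry<String, Object> attr : node.getValue().entrySet()) {
                gml.append(attribute(attr.getKey(), attr.getValue()));
            }
            gml.append("  ]\n");
        }
        for (int[] edge : edges) {
            gml.append("  edge [\n")
               .append(attribute("source", edge[0]))
               .append(attribute("target", edge[1]))
               .append("  ]\n");
        }
        return gml.append("]\n").toString();
    }
}
```

The node loop corresponds to the main connection and the edge loop to the lookup connection described above.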

It’s easy to see that the tFileOutputGML instance is the heart of the job, right around the middle of the canvas.

First of all, let’s have a look at the tFileOutputGML parameters, since this is the main purpose of this tutorial:

The first parameter lets you set the node attributes that will be included in the GML file. Even complex Java types can be used as node attribute columns; however, they will be flattened to a primitive type (mostly literals or numbers), with perhaps unexpected results. So it’s better to convert them to a primitive type according to your needs before they reach the tFileOutputGML block. You can define the graph direction, node properties (where to look for IDs and labels) and edge properties. Here you can activate or deactivate them and define the location of the 2 to 4 columns that describe them.

You can also set some file and graph properties in the Advanced parameters, but these are fully optional and the default values are fine in most cases.

That’s all you need to save a GML file for your social network analysis. But if you need to generate an example network for testing and educational purposes, keep reading.

Build example networks for Social Network Analysis

In Talend we have a very nice set of functions to generate example datasets in 20 seconds. They are very basic, but enough for our needs. We will use them to generate a fictional network of friends.

Generating nodes

First of all we’ll generate a set of nodes representing persons; this is the upper branch of our job, which starts with the tRowGenerator_1 block. Let’s generate a dataset of, say, 100 nodes with the following columns (the ID column is added in a later step):

  • label, using the getFirstName() function (it returns a random American English first name, no parameters needed). It will become the node label
  • city, using the getUsCity() function (it returns a random town of the United States, no parameters needed). It will become the literal node attribute “city”
  • age, using the random number generation function with two parameters (the lower and upper bounds). It will become the numeric node attribute “age”

The following screenshot should give you a clear idea of what we’re doing.

As you may see, I didn’t define the node ID here. For the sake of clarity, I usually prefer to generate numeric sequences in a separate tMap component. And that’s what I did when I defined the ID column using the following snippet of code.


Numeric.sequence("node_id", 1, 1)
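
For readers who want to reproduce this outside Talend, here is a rough plain-Java stand-in for a named sequence. It assumes that the three arguments are the sequence name, the start value and the step, and that the first call returns the start value; `NamedSequence` is a hypothetical class, not Talend’s implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java stand-in that mimics a named sequence; not Talend's own routine. */
class NamedSequence {
    // Last value handed out for each sequence name.
    private static final Map<String, Integer> LAST = new HashMap<>();

    /** First call for a name returns start; each later call adds step to the previous value. */
    static synchronized int sequence(String name, int start, int step) {
        Integer previous = LAST.get(name);
        int value = (previous == null) ? start : previous + step;
        LAST.put(name, value);
        return value;
    }
}
```

Calling `NamedSequence.sequence("node_id", 1, 1)` once per row yields 1, 2, 3 and so on, which is what the ID column needs.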

You may add other attributes if you need to, for example US State ID, sex (using a random integer number which can be only zero or one), phone numbers and so on. This is however beyond the scope of this tutorial, so let’s stay focused and link this to a tFileOutputGML block, setting basic and advanced parameters as I explained earlier. Don’t forget to fill in the output definition schema, as it represents the node structure you’ll see in your GML file. This means that if you don’t add a column to the output schema, the corresponding attribute will not be saved in the output file. The following screenshot should give you a better idea of what I’m talking about:

That’s all about node definitions. Let’s fill in the edge definitions now.

Generating edges

For this part, we’ll use another instance of the tRowGenerator component, with just two integer columns, source and target, both filled with random integers ranging from 1 to 100. They will simulate the connections between our nodes. To obtain a significant network for SNA, I suggest generating at least two or three times as many edges as nodes. So, since we set the number of nodes to 100, 300 is a nice number to set here. A lower number will give you a smaller giant component and more dangling nodes around the graph, while a higher number will give you much deeper connections between nodes. Just play a bit with these thresholds to get the result that best fits your needs.

Since we’re describing friendship relations, we don’t want self-referring relations (one cannot be a friend of oneself!). So we need to filter out these unwanted rows using a filter and the following simple condition:


input_row.source != input_row.target

Finally, we need to remove duplicate source/target pairs using an aggregation block. In this step, if you need to, you can define a weight column fed by a count operation over the rows. It will obviously be useful if you want to run some tests on weighted networks. The edges data flow is ready and you can link it to our file writer block using a lookup connection.
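The whole edge branch (random pairs, self-loop filter, aggregation with a count) can be sketched in plain Java as follows. This is an illustration only: `EdgeGenerator` is a hypothetical class, the seed parameter is added to make runs repeatable, and pairs are treated as directed, so 1->2 and 2->1 are counted separately. It assumes Java 8 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch of the edge branch: generate, filter self-loops, aggregate duplicates. */
class EdgeGenerator {

    /** Returns "source->target" keys mapped to a weight (how many times the pair was drawn). */
    static Map<String, Integer> generate(int nodeCount, int draws, long seed) {
        Random random = new Random(seed);
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (int i = 0; i < draws; i++) {
            int source = 1 + random.nextInt(nodeCount);     // 1..nodeCount inclusive
            int target = 1 + random.nextInt(nodeCount);
            if (source == target) {
                continue;                                   // drop self-referring relations
            }
            // Aggregation step: one entry per distinct pair, the count becomes the weight.
            weights.merge(source + "->" + target, 1, Integer::sum);
        }
        return weights;
    }
}
```

With the numbers used in this tutorial, `EdgeGenerator.generate(100, 300, 42L)` returns at most 300 distinct pairs, each with its weight.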

That’s almost it; I just added a print-to-console block (tLogRow) after the output stage to demonstrate that a file writer doesn’t have to be the final block of the job. Just remember that only the main connection (node definitions) is propagated.

You can finally run the job and explore the generated network. You can obviously extend the dimensionality of your network by adding more node attributes. You can easily extend Talend’s data generation functions to do so, as they are simple Java static functions; you can find them in the project repository, under system routines.

While this method is probably not suited to more sophisticated analyses (i.e., social networks rarely have a fully random distribution of connections), I think it could still be useful. I hope this tutorial will help you get the most from Talend from this point of view. If you need any kind of support or help, or want to share opinions, criticism or considerations, feel free to use the comment form or to contact me directly. See you next time and have a nice day!