Talend Open Studio is a very nice open and scalable platform for data integration. This Eclipse-based application is made of components, each of them performing a particular data processing task. Although this is not very well documented by the vendor (luckily, someone else filled the gap), a perspective is provided that lets users build new components. The development workflow is really primitive, and even most of the official out-of-the-box TOS components are written in a procedural coding style. This is not a failure in the strict sense of the word, as this has been the most common style in ETL environments for historical reasons, dating from when speed was of the essence. But in the new world of Big Data, NoSQL and graph databases, and with modern hardware, this approach is totally inadequate. In the past, I presented a way to build components in a Maven-aided environment. Starting from that, I built an OOP framework that aims to push and help the developer to build components in an object-oriented style.
In this article I would like to show you the basic idea behind my framework and the advantages you can gain from it in terms of coding speed and code reliability. Then we’re going to explore the framework itself and x-ray a Talend component. You’ll be surprised at how easy it is to build a component using design patterns!
At any time, please refer to the Javadoc. You can also fork or download the source code if you want to jump quickly into the action. The code is quite well commented and should be easy to read.
The idea behind the ETL framework
If you have a look at the source code generated for a Talend job, you’ll see how poor it is. Components are not instances of classes, there is no local variable scope inside data flows, no namespaces of course, and rows carrying data are translated into a plethora of Struct classes with only public fields (one for each column of the connection itself). Basically, a Talend component is a set of javajet files that act as templates to generate tons of inline Java code. In this hell, scattered here and there, you find data logic, data processing, data transport between connections and between components, and so on.
I tried to analyze carefully why, in my opinion, this design has huge drawbacks, and these are my conclusions.
In itself there is nothing wrong with it: Talend moves data using heterogeneous vectors of data, and vectors are flat by definition. All component input/output interfaces are made of vectors and, again, this is the way the ETL world works.
But what if a component needs to capture more complex relationships between data to do its internal processing duties? What if you have to, for example, download an open graph from Facebook and project it onto a set of database tables? Your component must implement a relational layer and put it inline in the generated code. This is really time-consuming and not very elegant.
- So the assumption was: even if the input/output stages are vectorized, components may still need a more general way to manage data in their internal processing stages. A framework must provide this abstract data model, so that the component designer only needs to build a data interface that maps data from the external stage to the internal abstract representation, while most of the data processing can be standardized and decoupled from Talend vectors. In other words: I need a graph data model.
A component has a couple of incoming and outgoing connections, each with a set of columns determined at compile time. What if a component needs, let’s say, a buffer table with data for some processing duties? Or extra rows? Or extra column metadata? Again, everything has to be implemented by the component itself, reinventing the wheel thousands of times, adding more inline code and making the generated code even dirtier. Slow. And sad.
- My idea was: why shouldn’t my kernel data model framework let you build any number of data structures you may need? You could map a structure one-to-one to some Talend vectors, but it can also stand on its own, ready for the component’s needs. And these structures can be created at runtime, of course.
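To make the idea of runtime-built structures concrete, here is a minimal sketch in plain Java. It is not the framework's actual API: the class `RuntimeTable` and its methods are hypothetical names chosen for illustration, and a real implementation would likely add typed accessors and column metadata.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical table whose columns are declared at runtime, not fixed by a Talend schema. */
final class RuntimeTable {
    // Column name -> expected Java type, in declaration order.
    private final Map<String, Class<?>> columns = new LinkedHashMap<>();
    private final List<Map<String, Object>> rows = new ArrayList<>();

    /** Declares a new column; returns this table so that calls can be chained. */
    RuntimeTable addColumn(String name, Class<?> type) {
        Objects.requireNonNull(name, "name");
        Objects.requireNonNull(type, "type");
        if (columns.putIfAbsent(name, type) != null) {
            throw new IllegalArgumentException("Duplicate column: " + name);
        }
        return this;
    }

    /** Appends a row after checking every value against the declared columns. */
    void addRow(Map<String, Object> values) {
        for (Map.Entry<String, Object> entry : values.entrySet()) {
            Class<?> type = columns.get(entry.getKey());
            if (type == null) {
                throw new IllegalArgumentException("Unknown column: " + entry.getKey());
            }
            if (entry.getValue() != null && !type.isInstance(entry.getValue())) {
                throw new IllegalArgumentException("Wrong type for column: " + entry.getKey());
            }
        }
        rows.add(new LinkedHashMap<>(values)); // defensive copy
    }

    int rowCount() {
        return rows.size();
    }

    /** Returns the value stored in the given row and column, or null if unset. */
    Object get(int row, String column) {
        return rows.get(row).get(column);
    }
}
```

A component could create such a table on the fly, for example as a scratch buffer, without any matching Talend connection.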
Suppose we need to build a component that calculates a moving average of a particular column in the incoming set of rows. Even if it’s a trivial task, coding this in Talend is really a pain. Your component must implement buffering logic (to store data somewhere), perhaps a FIFO strategy with a configurable threshold, and at least a rolling strategy. This is a huge amount of code to write for such a trivial payload (the moving average calculation).
- The conclusions were obvious: the framework data model must be able to store more than one row at once. You must not be limited to storing just one row inside the component if you need more. If you don’t need memory, you can always limit the stack size to one and build a data follower, but that must be a choice, not a constraint. And, of course, this must be handled by the framework itself.
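The buffering described above can be sketched in a few lines of plain Java. This is only an illustration under my own assumptions, not the framework's implementation: `MovingAverageBuffer` is a hypothetical name, and the capacity plays the role of the FIFO threshold.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical bounded FIFO buffer that keeps a running moving average. */
final class MovingAverageBuffer {
    private final int capacity;                        // maximum number of buffered values
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;                                // sum of the values currently buffered

    MovingAverageBuffer(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
    }

    /**
     * Adds a value, evicting the oldest one when the buffer is full,
     * and returns the average of the values currently buffered.
     */
    double push(double value) {
        if (window.size() == capacity) {
            sum -= window.removeFirst();               // rolling strategy: drop the oldest value
        }
        window.addLast(value);
        sum += value;
        return sum / window.size();
    }

    int size() {
        return window.size();
    }
}
```

With a capacity of one, the buffer degenerates into the data follower mentioned above: each call simply returns the value just pushed. Note that the running sum can accumulate floating-point error over very long streams, so a production version might recompute it periodically.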
Even if Talend metadata lets you flag a column as a key, or set a column default, this is, as I said, just metadata, just a tag. It has no relation to the data in the strict sense of the word: it doesn’t check the data.
Just think about it. How could you enforce a primary key if the current data row has no memory of the rows that came before it?
- Sometimes you need to add constraints to your data, for example a primary key condition, even if you don’t plan to store it in a structure that handles such constraints natively (like an RDBMS). The framework must implement the set of basic building blocks that every data model should have: primary keys, default values, commit and rollback, and so on. And, importantly, these must apply at the data level, not at the metadata level.
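As a rough sketch of what a data-level constraint means, the following plain Java class rejects duplicate keys at insert time and supports a simple commit and rollback. The name `KeyedRowStore` and its methods are hypothetical and are not taken from the framework; the key is assumed to be a single column identified by its index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical row store enforcing a single-column primary key on the data itself. */
final class KeyedRowStore {
    private final int keyIndex;                                  // position of the key column
    private final Set<Object> committedKeys = new HashSet<>();
    private final List<Object[]> committedRows = new ArrayList<>();
    private final Set<Object> pendingKeys = new HashSet<>();
    private final List<Object[]> pendingRows = new ArrayList<>();

    KeyedRowStore(int keyIndex) {
        if (keyIndex < 0) {
            throw new IllegalArgumentException("keyIndex must be >= 0");
        }
        this.keyIndex = keyIndex;
    }

    /** Stages a row; fails if its key is null or already used by a committed or pending row. */
    void insert(Object... row) {
        Object key = row[keyIndex];
        if (key == null) {
            throw new IllegalStateException("Primary key must not be null");
        }
        if (committedKeys.contains(key) || !pendingKeys.add(key)) {
            throw new IllegalStateException("Duplicate primary key: " + key);
        }
        pendingRows.add(row.clone());
    }

    /** Makes all pending rows permanent. */
    void commit() {
        committedKeys.addAll(pendingKeys);
        committedRows.addAll(pendingRows);
        pendingKeys.clear();
        pendingRows.clear();
    }

    /** Discards all pending rows and frees their keys. */
    void rollback() {
        pendingKeys.clear();
        pendingRows.clear();
    }

    int committedCount() {
        return committedRows.size();
    }
}
```

The point of the sketch is that the store remembers previous rows, so the key check is performed against actual data rather than against a tag on the column.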
If you look at the main javajet file of a component, somewhere you will find a loop that iterates over the columns of the outgoing rows and fills each value from a computed value or from the equivalent input column. If you look at the generated code, you will see the list of fields emitted by that loop. This is not great, since it mixes data processing with data transport. Data transport should be abstracted, so that if you need to change something in data processing you won’t touch the component output stage.
- My idea was to promote the use of visitors as a way to move data inside the abstract data model, and also as interfaces to the external world. This way, I was able to keep the data model separate from the data controller.
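The visitor idea can be illustrated with a small, self-contained Java sketch. All names here (`ModelVisitor`, `SketchTable`, `SketchRow`, `FlatOutputVisitor`) are hypothetical and only stand in for the framework's own types; the output visitor flattens the model into delimited lines, playing the role of an outgoing connection.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical visitor over a tiny table/row data model. */
interface ModelVisitor {
    void visitTable(SketchTable table);
    void visitRow(SketchRow row);
}

/** A row of values; it knows nothing about how it will be transported. */
final class SketchRow {
    final Object[] values;

    SketchRow(Object... values) {
        this.values = values.clone();
    }

    void accept(ModelVisitor visitor) {
        visitor.visitRow(this);
    }
}

/** A named collection of rows that forwards the visitor to each of them. */
final class SketchTable {
    final String name;
    private final List<SketchRow> rows = new ArrayList<>();

    SketchTable(String name) {
        this.name = name;
    }

    SketchTable add(SketchRow row) {
        rows.add(row);
        return this;
    }

    void accept(ModelVisitor visitor) {
        visitor.visitTable(this);
        for (SketchRow row : rows) {
            row.accept(visitor);
        }
    }
}

/** Output-side visitor: turns the model into semicolon-delimited lines. */
final class FlatOutputVisitor implements ModelVisitor {
    final List<String> lines = new ArrayList<>();

    @Override
    public void visitTable(SketchTable table) {
        lines.add("# " + table.name);
    }

    @Override
    public void visitRow(SketchRow row) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < row.values.length; i++) {
            if (i > 0) {
                line.append(';');
            }
            line.append(row.values[i]);
        }
        lines.add(line.toString());
    }
}
```

Changing how data leaves the component then means writing another `ModelVisitor`, without touching the model classes or the processing code that fills them.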
Any experienced Talend component designer knows well how painful javajet coding is. You don’t have a good IDE, code completion, a decent syntax highlighter, or debugging and testing tools. Last but not least, javajet produces code that is inlined into the job’s Java code. At the end of the day: poor code, long development time.
- The solution comes from an MVC separation of the data model. Thanks to this design, I’m able to build 99% of the component processing stage in a pure Java environment (using Eclipse, Maven, JUnit and other precious tools). The remaining 1% is still javajet and produces inline Java code, but it’s usually a few dozen lines: just the code needed to initialize the framework, build the needed data structures and define the data interfaces. This means the data controller part is inside the javajet scope, while the data processing stage (usually the most complex part of a component) lives outside, in its own safe set of component JARs.
On the next page we’ll go into more detail on how the bridge works. For now, I just wanted to point out the limits of the current component design. Just to be clear, this doesn’t mean you are forced to throw out your philosophy and make a blind switch. Simple components probably don’t need this design. But isn’t this the very basic philosophy of the object-oriented paradigm?