## How to build Talend ETL custom components in OOP way – part 3

In the previous article, we saw how to set up a development environment for Talend component creation in a more structured way. This time we’re going to build a Talend component following a more object-oriented approach, using the framework I developed for that purpose, which also speeds up the whole process.

In this part of the tutorial, we’re going to initialize and describe our model. As an example we’re going to use tJiraProjects, a plugin I just refactored. Feel free to browse the code or fork it on GitHub.

## How to build Talend ETL custom components in OOP way – part 2

Last time I pointed out the advantages of using a more rigorous OOP methodology when coding your Talend Open Studio custom components. This is especially true when you need complex transformations inside your component stages: time series, a stack or buffer for your rows, non-relational links and so on. Today, before (finally!) introducing Talend Bridge, the framework for writing Talend components in a true OOP way, we’ll spend a bit of time on how to set up a comfortable development environment. As is often best, I’ll demonstrate it on a real case, explaining my component tJiraProjects step by step. It’s a very simple piece of code that uses the official Jira Java APIs for this popular issue-tracking software. It basically connects to a Jira instance and retrieves a list of projects.

I’ll walk you step by step through building a component with the OOP bridge, but please remember that this article is not intended to teach you how to build custom Talend components. There’s a great tutorial on that if you need to refresh your memory.

So, before getting into the real action, let’s prepare the environment.

## How to build Talend ETL custom components in OOP way – part 1

Talend Open Studio is a very nice open and scalable platform for data integration. This Eclipse-based application is made of components, each of them performing a particular data processing task. Although it is not very well documented by the vendor (luckily someone else filled the gap), a perspective is provided that lets the user build new components. The development flow is really primitive, and even most of the official out-of-the-box TOS components are written in a procedural coding style. This is not a failure in the strict sense of the word, as this is the most common style in ETL environments for historical reasons, dating from when speed was of the essence. But in the new world of Big Data, NoSQL and graph databases, and with modern hardware, this approach is totally inadequate. In the past, I presented a way to build components in a Maven-aided environment. Starting from that, I built an OOP framework that aims to push and help the developer to build components in an object-oriented flavour.

In this first article I would like to show you the basic idea behind my framework and the advantages you can obtain from it in terms of coding speed and code reliability. In the following articles we’ll go further, analyzing in detail the APIs and a couple of components I made with them. You’ll be surprised at how easy it is to build a component using design patterns!

## How to use Maven in Talend component development

Recently I started doing some really serious Talend Open Studio component development for my company and for a spinoff of mine, and I realized that a more robust build life-cycle would be great to speed up development and avoid repetitive tasks (these are what make developing TOS components really tedious and boring). Since the Talend team doesn’t seem to agree with this vision, and the only other solution apparently available doesn’t satisfy my needs, I opted for a do-it-yourself approach and adopted Maven for the development of all my self-made components.

The idea is basically to provide a skeleton hello-world component supplied with a pom.xml file that has a bunch of goals useful for packaging and installing the component. As a very quick start, you just need to clone it from my GitHub repository and start playing, following the information in the provided README. It should be really easy to get it up and running in minutes.

Feel free to fork it, too! I strongly encourage using GitHub for developing and upgrading TOS components. But, well, that is another story.

#### Benefits of using Maven in Talend

The benefits of using Maven are many, as the Apache Software Foundation says, and they all apply to TOS component development, too.

First of all, you can easily integrate tests in your build cycle. Too often, TOS developers simply don’t test their components, and the Talend guidelines on the subject are incomplete and not mainly focused on custom components. But a Java-level test suite is mandatory for every serious development effort and should be planned with care.

Then, you can decouple your TOS custom component folder from your development environment. It goes without saying that it’s always a good idea to keep dev and prod environments as isolated as possible and to let Maven handle the release process. This skeleton component, for example, lets you specify a target deploy directory (usually the TOS custom component folder) where the component will be copied when a Maven install is triggered.

And finally (and that’s nice, don’t you think?) Maven can help you automatically maintain external dependencies and patch your component XML file accordingly. The idea is: you declare the dependencies you need in the pom.xml file, as in any regular Java application, and during the package phase the <IMPORTS> element in the XML is automatically updated. No need to update it manually.
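To make the idea concrete, here is a minimal Java sketch of what such a patching step could look like. It is not the code used by the skeleton's build: the class and method names are hypothetical, the `NAME`/`MODULE`/`REQUIRED` attributes follow the usual layout of a TOS component descriptor as I assume it, and the jar name is a made-up example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;

class ImportsSketch {

    /** Renders an IMPORTS element with one IMPORT per jar file name (hypothetical helper). */
    static String renderImports(List<String> jarNames) {
        StringBuilder sb = new StringBuilder("<IMPORTS>\n");
        for (String jar : jarNames) {
            // Assumption: NAME is the jar name without extension and every dependency is required.
            String name = jar.endsWith(".jar") ? jar.substring(0, jar.length() - 4) : jar;
            sb.append("  <IMPORT NAME=\"").append(name)
              .append("\" MODULE=\"").append(jar)
              .append("\" REQUIRED=\"true\"/>\n");
        }
        return sb.append("</IMPORTS>").toString();
    }

    /** Replaces the first non-empty-tag IMPORTS element of the descriptor with a fresh one. */
    static String patchDescriptor(String descriptorXml, List<String> jarNames) {
        // (?s) lets '.' span line breaks; a self-closing <IMPORTS/> would not be matched.
        return descriptorXml.replaceFirst(
                "(?s)<IMPORTS>.*?</IMPORTS>",
                Matcher.quoteReplacement(renderImports(jarNames)));
    }

    public static void main(String[] args) {
        String descriptor = "<CODEGENERATION>\n<IMPORTS>\n</IMPORTS>\n</CODEGENERATION>";
        // "example-lib-1.0.jar" is a placeholder dependency, not a real requirement.
        System.out.println(patchDescriptor(descriptor, Arrays.asList("example-lib-1.0.jar")));
    }
}
```

A real build would rather take the jar list from the resolved Maven dependency tree and use a proper XML parser instead of a regular expression; the sketch only shows the shape of the transformation.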

Since I remember well how painful it is to copy and paste all these external libraries into the XML, keep the version numbers up to date and so on, I decided it was a nice feature to have.

#### Relevant build phases and parameters

The build life-cycle is quite simple, as just a few phases make sense in TOS component deployment:

• package – the internal component extra-jar (if present) and the external dependencies are assembled, all the needed resources (javajet files, icon, messages…) are compiled and assembled in the target basedir directory, the XML file is patched to match your dependency tree and some metadata are updated (author name, build timestamp and so on). It basically performs all the steps needed to build the component but doesn’t copy it to the target folder;
• install – the packaged component is copied to a specified folder, which is assumed to be the TOS custom component folder;
• clean – the target dir is cleaned up.

Please note that, for several reasons, a TOS component is not a regular Java artifact. That’s why I disabled the default-install process (which by default installs the artifact in the local Maven repository). The provided install phase simply copies the component to the target folder.

You can also specify a bunch of optional command-line properties to tune the build process depending on your needs and your environment:

• skipTests (default: true) to skip/unskip the test phase;
• component-author (default: Gabriele Baldassarre… ehm… that’s me) to set the component author in the XML descriptor;
• addMavenDescription (default: true) to add/cut the META-INF/maven stuff in the client JAR;
• dependencyVersion (default: true) to add/cut the version number suffix on all external dependencies;
• useSnapshot (default: true) to use or not use snapshot releases for all external dependencies;
• componentsFolder (default: $HOME$/talend_components) path to the TOS custom component folder of your installation, used during the install phase.

#### Get the plugin

The skeleton plugin (which just prints a HelloWorld string to the console) is available on GitHub. Feel free to clone and fork it as you wish, and just comment on this post or contact me if you need further support.

## Talend Open Studio API component development docset for Dash

As anticipated in my previous post, I built a docset that will dramatically speed up Talend Open Studio component development on Mac OS X, specifically for those using Dash Snippet Manager. This docset, made in the standard Apple format, basically exposes the full set of Talend Open Studio API 5.2 in a more organic and searchable way. To use it, follow these steps:

1. Install Dash from this App Store link;
2. In Preferences > Downloads click the “+” icon and paste the following feed URL: http://extras.gabrielebaldassarre.com/dash/Talend_Open_Studio.xml;
3. Click the “Download” button to install it automatically;
4. Close the Preferences window. A new Talend Open Studio entry should be visible among the installed docsets. Now you can explore the full API set or run searches using the “tos:” prefix (e.g. “tos:IMetadataColumn”).

Namespaces unfortunately follow Obj-C syntax rather than the Java one, and I have not been able to fix that yet. I think it’s a bug in my docsetutil configuration; I’ll investigate further. However, this should not reduce the usefulness of the package.

Next steps will include a set of code snippets for Dash, devoted to Talend component development. Have a nice day!

## Doxygen/javadoc for Talend Open Studio API for 5.2

Talend Open Studio is a great open source, Eclipse-based platform for developing ETL and data processing workflows, with an easy-to-learn plugin architecture (although not perfect, in my opinion) and a powerful set of APIs for developers. Even if most of this set of APIs is not meant for ordinary development (it’s for the designer part, and useful to customize the GUI perspective), the remaining part is in the classpath of the javajet preprocessor, so it can be used to compile components. It’s open source, but documentation is not easily available, so I decided to compile, host and maintain doxygen/javadoc documentation for the whole set of Talend Open Studio APIs. It’s fully indexed and full-text searchable on the server side, and it’s built from the SVN repository for the 5.2 branch.

### Browse the documentation

You can find it here: Talend Open Studio API for 5.2

Here you can find the doxygen definition file used to generate it, in case you need to tweak it and perhaps host it locally for your needs. Please remember that you need to adjust all paths (including DOT_PATH, for graph generation) accordingly!

The next step is the generation of a docset, useful to speed up development on Mac OS X, perhaps using an assistant like Dash.

## Build GML graphs for Social Network Analysis in Talend

I’m attending a Coursera-powered class in Social Network Analysis from the University of Michigan. While the course is really stunning, I had to realize that it’s rather difficult to obtain enough example datasets to study, especially if you don’t have a solid programming background. Since the course is targeted not at programmers but at data analysts, it seems really weird to me that my classmates have to wait for someone to extract, clean up and eventually share these datasets. However, network data is usually well structured and low-dimensional, so I think that a data integration and manipulation tool could be an easier way to prepare these datasets. I chose to do this using Talend Open Studio, as it’s probably the best free data integration platform available nowadays. It obviously doesn’t have a component to build datasets in a format readable by, for example, Gephi or Pajek, but Talend is Eclipse-based, so it was easy for me to build a custom component to write .GML files.

In this tutorial I will guide you through using this component to write graph files, then I’ll explain a 120-second method to build an endless set of valid example network datasets, useful for learning, testing, simulations and so on.
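For readers who have never seen a .GML file, the following plain Java sketch shows the kind of output such a component has to produce. It is independent of Talend and is not the component's actual code: the class name, the method names and the sample labels are all hypothetical, and only the most basic GML keys (`directed`, `node`, `id`, `label`, `edge`, `source`, `target`) are covered.

```java
import java.util.Arrays;
import java.util.List;

class GmlSketch {

    /** Wraps a label in double quotes, encoding embedded quotes as an entity. */
    static String quote(String label) {
        return "\"" + label.replace("\"", "&quot;") + "\"";
    }

    /**
     * Renders a graph as GML text. Node ids are the positions in the labels list;
     * each edge is a {source, target} pair of those ids.
     */
    static String toGml(List<String> labels, List<int[]> edges, boolean directed) {
        StringBuilder sb = new StringBuilder();
        sb.append("graph [\n");
        sb.append("  directed ").append(directed ? 1 : 0).append("\n");
        for (int i = 0; i < labels.size(); i++) {
            sb.append("  node [\n");
            sb.append("    id ").append(i).append("\n");
            sb.append("    label ").append(quote(labels.get(i))).append("\n");
            sb.append("  ]\n");
        }
        for (int[] edge : edges) {
            sb.append("  edge [\n");
            sb.append("    source ").append(edge[0]).append("\n");
            sb.append("    target ").append(edge[1]).append("\n");
            sb.append("  ]\n");
        }
        sb.append("]\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data: three people connected in a chain.
        List<String> labels = Arrays.asList("alice", "bob", "carol");
        List<int[]> edges = Arrays.asList(new int[] {0, 1}, new int[] {1, 2});
        System.out.print(toGml(labels, edges, false));
    }
}
```

Inside a Talend job the node and edge lists would come from incoming rows rather than from hard-coded lists, but the structure of the resulting file stays the same.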