In this step-by-step tutorial I’ll show you how to use Talend Open Studio and the Twitter Components Pack to connect to Twitter, run a simple REST query and build a trivial relevance report on top of it. There are tons of similar Talend tutorials out there, but none of them focuses on my Twitter components pack, which lets you run queries and parse results without writing a single line of custom code. So let’s dive into this 101 crash course on how to download tweets and build a real-world analysis on them.
According to a survey run in a well-established LinkedIn group, Big Data professionals attribute the failure of most Big Data projects to a lack of knowledge on how to set up, deploy and manage a Hadoop cluster. Although I find this view a bit too simplistic, and I favor one where change management matters too, the bootstrap difficulties for IT departments are indeed huge. Major Hadoop distribution vendors, like Cloudera and IBM, are aware of this problem and provide sophisticated installers and managers, while offering quick-start VMs for developers at the same time. However, this is not enough for those enterprises that need to keep their production environments under control, nor for those developers who need something more reliable than a single-node virtual cluster. Finally, it’s important to understand how Hadoop works under the hood, but of course without wasting time on configuration details.
Vagrant definitely comes to the rescue. Using a Cloudera Vagrant box, you’ll be able to build a complete, scalable cluster that (with some minor tuning) could even be used in production environments. In this tutorial we’re going to build a cluster based on Cloudera’s Hadoop distribution in minutes and without any user intervention.
Talend Open Studio is a handy ETL tool with amazing extension capabilities and a complete set of tools for building new custom components, as I showed in several past posts. Talend offers an automatic way to install components through its official marketplace. However, that place is not famous for its UX or for being attractive to developers and, as a matter of fact, the components hosted there are often poor and outdated. Here I’m going to show a general way to install custom components which works for both Talend Exchange components and third-party hosted ones. For hard-core developers, I’ll also show how to compile a component starting from source code.
In the majority of Talend tutorials related to database operations I found little or no use of RDBMS prepared statements. To build or parametrize queries, most Talend users and developers seem to prefer a pure string concatenation approach. But this is definitely a bad habit, since it exposes you to some important security flaws (SQL injection above all) and doesn’t make use of the caching mechanisms of modern RDBMSs. Although the guys at Talend really don’t make your life easier because of some choices in the DB I/O components, it’s still possible to design a job which makes full use of PreparedStatements. In this tutorial I’m going to introduce a technique for some common use cases, while hardening security and speeding up debugging at the same time.
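To make the difference concrete, here is a minimal plain-JDBC sketch, not the Talend job design from the tutorial itself. The `customers` table, its `id` and `name` columns and the class and method names are all hypothetical, chosen only for illustration. The first method shows what string concatenation produces with a hostile input; the second binds the same value through a `PreparedStatement`, so the SQL text never changes.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PreparedStatementSketch {

    // Anti-pattern: the user-supplied value becomes part of the SQL text itself.
    // (Table and column names are hypothetical.)
    static String concatenatedQuery(String name) {
        return "SELECT id FROM customers WHERE name = '" + name + "'";
    }

    // Parameterized form: the SQL text is constant and the value is bound separately,
    // which also lets the database reuse the parsed statement.
    static final String PARAMETERIZED_QUERY = "SELECT id FROM customers WHERE name = ?";

    // Returns true if at least one row matches; requires an open JDBC connection.
    static boolean customerExists(Connection conn, String name) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(PARAMETERIZED_QUERY)) {
            ps.setString(1, name); // JDBC parameter indexes start at 1
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    public static void main(String[] args) {
        String hostile = "x' OR '1'='1";
        // The concatenated query now contains an extra OR clause injected by the input.
        System.out.println(concatenatedQuery(hostile));
        // The parameterized query text is unaffected by whatever value gets bound later.
        System.out.println(PARAMETERIZED_QUERY);
    }
}
```

Running `main` needs no database: it only prints the two query strings, which is enough to see that the concatenated one has been altered by the input. `customerExists` needs a real `Connection` from whatever JDBC driver you use.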
With the new year, Packt Publishing has reinforced its offering of reference guides on Open Source Analytics and Business Intelligence tools with this brand new Talend Open Studio Cookbook. Thanks to the publisher, which gave me a very early copy of the book to review, I had the time to read it twice and get a good understanding of the bundled code. I’m now able to write a complete review, focusing especially on target readers’ needs and on the differences from the other Talend book in Packt’s catalog. That Talend for DI primer gave us a first idea of what a professionally made reference guide on Talend would look like, but this one is a completely different matter and takes the subject from a different perspective: the coder side of the Moon.
This review is organized in three sections. In the first part, I’m going to take a deep look at the book’s content and presentation. In the middle part, I will focus on potential readers’ expectations and gains. Finally, in the last section I’m going to summarize my conclusions.
There are plenty of scenarios where one would benefit from combining Talend Open Studio and R. The former is perfect for even complex ETL tasks, which by their very nature involve massive data I/O, manipulation, federation and governance, but it completely lacks any kind of serious statistical tool.
On the other hand, R is an absolute standard for statisticians, with a huge number of external packages for practically any kind of analysis one could imagine, but even simple data operations must be hand-coded. The R language is a very expressive and extensible data language, but one would perhaps prefer to spend time reasoning on the predictive model, rather than writing code to get the data out of the database. This is particularly true in data exploitation scenarios, but also in rapid prototyping and, generally speaking, in the whole business world.
As if that were not enough, R is basically a data language plus a command line executor. This is historically common for statistical software (just think of SAS), so it’s not a flaw in itself. But in a real-life Business Intelligence life-cycle, you probably have a corporate standard, a service bus, a protocol for data transfer and so on. A better interface to R is really advisable.
This is possible using a custom optional component I made for Talend. In this tutorial I’ll show you how to use R to build a simple predictive model with data coming from Talend and how to get the results back into Talend itself, so you can keep all your good ETL habits.
Talend Open Studio is a very nice open and scalable platform for data integration. This Eclipse-based application is made of components, each of them performing a particular data processing task. Although not very well documented by the vendor (but luckily someone else filled the gap), a perspective is provided to let the user build new components. The development flow is really primitive, as even the majority of the official out-of-the-box TOS components are written in a procedural coding style. This is not a failure in the strict sense of the word, as this is the most common style in ETL environments for historical reasons, dating back to when speed was of the essence. But in the new world of Big Data, NoSQL and graph databases, and with modern hardware, this approach is totally inadequate. In the past, I presented a way to build components in a Maven-aided environment. Starting from that, I built an OOP framework that aims to both force and help the developer to write components in an object-oriented style.
In this article I would like to show you the basic idea behind my framework and the advantages you could obtain from it in terms of coding speed and code reliability. Then, we’re going to explore the framework itself and x-ray a Talend component. You’ll be surprised at how easy it is to build a component using design patterns!
Recently I started doing some really serious Talend Open Studio component development for my company and for a spinoff of mine, and I realized that a more robust build life-cycle would be great to speed up development and avoid repetitive tasks (developing TOS components is really tedious and boring for exactly these reasons). Since the Talend team doesn’t seem to agree with this vision and the only other solution apparently available doesn’t satisfy my needs, I opted for a do-it-yourself approach and adopted Maven in Talend development for all my self-made components.
The idea is basically to provide a skeleton hello-world component supplied with a pom.xml file with a bunch of goals useful for packaging and installing the component. As a very quick start, you just need to clone it from my GitHub repository and start playing, following the information in the provided README. It should be really easy to get it up and running in minutes if you follow this simple tutorial.
As anticipated in my previous post, I built a docset that will dramatically speed up Talend Open Studio component development on Mac OS X, specifically for those using the Dash snippet manager. This docset, made in Apple’s standard format, basically exposes the full set of the Talend Open Studio 5.2 API in a more organic and searchable way. To use it, follow these steps:
- Install Dash from this App Store link;
- In Preferences > Downloads click the “+” icon and paste the following feed URL: https://s3.amazonaws.com/extra-openanalytics/dash/Talend_Open_Studio.xml;
- Click the “Download” button to install it automatically. After this operation, the window will look similar to the following picture:
- Close the Preferences window. A new Talend Open Studio entry should be visible among the installed docsets. Now you can explore the full API set or run searches using the “tos:” prefix (e.g. “tos:IMetadataColumn”):
- Namespaces unfortunately follow the Obj-C syntax rather than the Java one, and I have not been able to fix it yet. I think it’s a bug in my docsetutil configuration; I’ll investigate further. However, this should not diminish the usefulness of the package.
Next steps will include a set of code snippets for Dash, devoted to Talend component development. Have a nice day!
Talend Open Studio is a great open source, Eclipse-based platform to develop ETL and data processing workflows, with an easy-to-learn plugin architecture (although not perfect, in my opinion) and a powerful set of APIs for developers. Even if most of this set of APIs is not meant for ordinary development (it’s for the designer part, and useful to customize the GUI perspective), the remaining part is in the classpath of the javajet preprocessor, so it can be used to compile components. It’s open source, but the documentation is not easily available, so I decided to compile, host and maintain a doxygen/javadoc documentation of the whole Talend Open Studio API. It’s fully indexed, supports server-side full-text search and is built from the svn repository for the 5.2 branch.
Browse the documentation
You can find it here: Talend Open Studio API for 5.2
Here you can find the doxygen definition file used to build it, in case you need to tweak it and perhaps host the documentation locally for your own needs. Please remember that you need to adjust all paths (including DOT_PATH, for graph generation) accordingly!
The next step is the generation of a docset, useful to speed up development under Mac OS X, perhaps using an assistant like Dash.