According to a survey run in a well-established LinkedIn group, Big Data professionals attribute the failure of most Big Data projects to a lack of knowledge about how to set up, deploy and manage a Hadoop cluster. I find this view a bit too simplistic, since I believe change management matters as well, but the bootstrap difficulties for IT departments are indeed huge. Major Hadoop distribution vendors, like Cloudera and IBM, are aware of this problem: they provide sophisticated installers and managers, and they also offer quick-start VMs for developers. First of all, this is not enough for enterprises that need to keep their production environments under control, and it’s not enough either for developers who need something more reliable than a single-node virtual cluster. Finally, it’s important to understand how Hadoop works under the hood, but of course without wasting time on configuration details.
Vagrant definitely comes to the rescue. Using a Cloudera Vagrant box, you’ll be able to build a complete, scalable cluster that (with some minor tuning) could even be used in production environments. In this tutorial we’re going to build a cluster based on Cloudera’s Hadoop distribution in minutes and without user intervention.
Talend Open Studio is a handy ETL tool with amazing extension capabilities and a complete set of tools for building new custom components, as I showed in several past posts. Talend offers an automatic way to install components through its official marketplace. However, that place is not famous for its UX or for being attractive to developers and, as a matter of fact, the components hosted there are often poor and outdated. Here I’m going to show a general way to install custom components that works for both Talend Exchange components and third-party hosted ones. For hard-core developers, I’ll also show how to compile a component starting from source code.
In most of the Talend tutorials on database operations I have found, there is little or no use of RDBMS prepared statements. To build or parametrize queries, most Talend users and developers seem to prefer plain string concatenation. This is a bad habit: it opens the door to serious security flaws such as SQL injection, and it doesn’t take advantage of the statement caching mechanisms of modern RDBMSs. Although the folks at Talend don’t make your life easier because of some design choices in the DB I/O components, it’s still possible to design a job that makes full use of PreparedStatements. In this tutorial I’m going to introduce a technique for some common use cases, hardening security and speeding up debugging at the same time.
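To make the difference concrete, here is a minimal plain-JDBC sketch of the two approaches. It is not the Talend job design from the tutorial: the `customers` table, its columns and the class and method names are hypothetical and chosen only for illustration. The `main` method needs no database; it just prints the two SQL texts so you can see what a hostile value does to the concatenated one.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PreparedStatementSketch {

    // Anti-pattern: the value is spliced into the SQL text, so whatever the
    // caller passes becomes part of the statement itself.
    // (Hypothetical table and columns, for illustration only.)
    static String concatenatedQuery(String name) {
        return "SELECT id, name FROM customers WHERE name = '" + name + "'";
    }

    // Parametrized form: the SQL text is constant and the value travels
    // separately, bound to the '?' placeholder.
    static final String PARAMETRIZED_QUERY =
            "SELECT id, name FROM customers WHERE name = ?";

    // Counts the matching rows using a bound parameter.
    // The caller is assumed to supply an open JDBC connection.
    static int countMatches(Connection conn, String name) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(PARAMETRIZED_QUERY)) {
            ps.setString(1, name); // bound as data, never parsed as SQL
            try (ResultSet rs = ps.executeQuery()) {
                int count = 0;
                while (rs.next()) {
                    count++;
                }
                return count;
            }
        }
    }

    public static void main(String[] args) {
        String hostile = "x' OR '1'='1";
        // The hostile value rewrites the WHERE clause of the concatenated query.
        System.out.println(concatenatedQuery(hostile));
        // The parametrized text stays the same whatever the value is.
        System.out.println(PARAMETRIZED_QUERY);
    }
}
```

Because the parametrized text never changes, the database can usually reuse the parsed statement across executions, and a failing query is easier to debug since the SQL and the bound values can be inspected separately.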