Automatic deployment of data science (and more) with Travis CI and Git

Being a data scientist often means wasting a lot of time on repetitive tasks, and deploying to production environments is one of them. Deployment is part of the typical (boring) development life cycle:

  1. Write some code;
  2. Test all your code;
  3. Commit to your version control system (i.e. Git);
  4. Upload on production server;
  5. Test it again, you know, just in case;
  6. Wonder how the hell that nasty bug managed to sneak past all the tests;
  7. Go to 1 and restart.

This would ideally be done manually on each code update… again, again and again. So the computer engineer in me started thinking about a way to automate this tedious process. In this tutorial I’m going to explain a solution that can be used in a very wide range of applications. In fact, I use it to deploy my R scripts and packages into production, but also for the WordPress theme of the website you’re looking at, to automatically build my Markdown docs into GitHub Pages, and so on. The applications are endless.

In this tutorial I’m going to host the code on a public GitHub repository and use the free version of Travis CI. This means that, in theory, your source code must be open source for this trick to work, unless you’re going to use the paid version of Travis CI, which allows building private repositories. But… well… if you don’t believe in karma, there’s a 100% black-hat trick that lets you use private repos on the free Travis CI, but it violates the Travis CI terms and conditions, so don’t do that. Really, don’t.

Anyway, what we’re going to do here is take the old-school development life cycle and turn it into a modern continuous integration life cycle:

  1. Write some code;
  2. Commit on your version control system;
  3. Sit down and watch the continuous integration server build the code, run the tests and deploy the fresh code to the production environment;
  4. Good job! You deserve a beer! Then, go to 1 and improve your code.

There are very good continuous integration solutions out there, like Capistrano, but their setup is often overkill for the typical user, and small projects don’t need that much power after all. As a matter of fact, mere mortals need an easier solution, something doable in 15 minutes without much hassle. So, start the timer now, you’re on your way!
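
To make the “15 minutes” claim concrete, here is a minimal sketch of the kind of `.travis.yml` that sits at the heart of this setup. The test script, deploy path and environment variables below are hypothetical placeholders, not the exact configuration from the full post:

```yaml
language: r                      # Travis CI has built-in support for R projects

script:
  - Rscript tests/run_tests.R    # hypothetical test entry point

deploy:
  provider: script               # run an arbitrary command as the deploy step
  skip_cleanup: true             # keep the build artifacts around for deployment
  script: rsync -az --delete . "$DEPLOY_USER@$DEPLOY_HOST:/srv/myapp/"
  on:
    branch: master               # deploy only when the master branch is green
```

`DEPLOY_USER` and `DEPLOY_HOST` would live as encrypted environment variables in the Travis CI repository settings, so no credentials ever touch the public repo.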
Read more

How to set up Cloudera Hadoop in 10 minutes using Vagrant

According to a survey run in a well-established LinkedIn group, Big Data professionals attribute the failure of most Big Data projects to a lack of knowledge on how to set up, deploy and manage a Hadoop cluster. Although I consider this view a bit too simplistic (I’d argue that change management matters just as much), the bootstrap difficulties for IT departments are indeed huge and steep. Major Hadoop distribution companies, like Cloudera and IBM, are aware of this flaw and provide sophisticated installers and managers, while also offering quick-start VMs for developers. Still, this is not enough for enterprises that need to keep their production environments under control, nor for developers who need something more reliable than a single-node virtual cluster. Finally, it’s important to understand how Hadoop works under the hood, but of course without wasting time on configuration details.

Vagrant definitely comes to the rescue. Using a Cloudera Vagrant box, you’ll be able to build a complete, scalable cluster that (with some minor tuning) could even be used in production environments. In this tutorial we’re going to build a cluster based on the Cloudera Hadoop Distribution in minutes and without user intervention.
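
To give a taste of how little interaction is involved, the whole bootstrap boils down to a handful of commands. The box name below is a hypothetical placeholder; the full tutorial points at an actual Cloudera-ready box:

```sh
vagrant init cloudera/quickstart   # hypothetical box name: generates a Vagrantfile
vagrant up                         # download, boot and provision the VM(s)
vagrant ssh                        # log into a node...
hdfs dfs -ls /                     # ...and check from inside that HDFS is alive
```

Everything else, from fetching the distribution to wiring the services together, happens during `vagrant up` with no user intervention.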

Read more