Automatic deployment of data science (and more) with Travis CI and Git

Being a data scientist often means wasting a lot of time on repetitive tasks, and deployment into production environments is one of them. It is part of the typical (boring) development life cycle:

  1. Write some code;
  2. Test all your code;
  3. Commit to your version control system (i.e. git);
  4. Upload to the production server;
  5. Test it again, you know, just in case;
  6. Wonder how the hell that nasty bug was able to sneak past all the tests;
  7. Go to 1 and restart.

This is typically done manually on each update to the code… again, again and again. So, the computer engineer in me started thinking about a way to automate this tedious process. In this tutorial I’m going to explain a solution which can be used in a very wide range of applications. In fact, I use it to deploy my R scripts and packages in production, but also to deploy the WordPress theme of the website you’re looking at, to automatically build my markdown docs into GitHub Pages, and so on. The applications are endless.

In this tutorial I’m going to host the code on a public GitHub repository and use the free version of Travis CI. This means that, in theory, your source code must be open source for this trick to work for you, unless you’re going to use the paid version of Travis CI, which allows building from private repositories. But… well… if you don’t believe in Karma, there’s a 100% black hat solution which lets you use private repos on free Travis CI. It is not allowed by the Travis CI terms and conditions, so don’t do that. Really, don’t.

Anyway, what we’re going to do here is take the old school development life cycle and turn it into a modern continuous integration life cycle:

  1. Write some code;
  2. Commit to your version control system;
  3. Sit down and watch the continuous integration server build the code, run the tests and deploy the fresh code to the production environment;
  4. Good job! You deserve a beer! Then, go to 1 and improve your code.

There are very good continuous integration and deployment solutions out there, like Capistrano, but their setup is often overkill for the typical user, and we don’t need so much power for small projects, after all. As a matter of fact, mere mortals need an easier solution, something doable in 15 minutes without much hassle. So, start the timer now, you’re on your way!

Too many tags! Topic modeling from multi values tags in data using LDA

Modern web platforms and CMSs tend to let users attach collections of tags, i.e. attributes describing the content (a corpus, a dataset, a collection…) which can take multiple values. This is not only true for web data (think about surveys where users can answer with multiple values). So, multi-valued tag collections usually decorate even discrete datasets, and the values are free-form by definition. Thus, at the end of the day, we end up with thousands of different tags, most of them with very low frequency and almost useless for explaining the data and the relations among observations. My idea is to classify the tags into topics, for a better understanding of the data and for dimensionality reduction, thanks to topic modeling solutions like LDA fitted with Gibbs sampling.
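To give a rough feeling of the idea, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python, where each observation is just its list of tags. This is not the code from the full post: the function names (`lda_gibbs`, `top_tags`), the hyperparameters `alpha` and `beta`, the number of iterations and the toy tags are all assumptions made up for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; each doc is a list of tags.

    alpha and beta are illustrative Dirichlet priors, not tuned values.
    """
    rng = random.Random(seed)
    vocab = sorted({tag for doc in docs for tag in doc})
    tag_id = {tag: i for i, tag in enumerate(vocab)}
    n_tags = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * n_topics for _ in docs]           # tags of doc d in topic k
    topic_tag = [[0] * n_tags for _ in range(n_topics)]  # times tag w is in topic k
    topic_total = [0] * n_topics                         # total tags in topic k

    # Start from a random topic for every tag occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for tag in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_tag[k][tag_id[tag]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, tag in enumerate(doc):
                w = tag_id[tag]
                k = assignments[d][i]
                # Take the current occurrence out of the counts.
                doc_topic[d][k] -= 1
                topic_tag[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_tag[t][w] + beta)
                    / (topic_total[t] + n_tags * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic and put the occurrence back.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_tag[k][w] += 1
                topic_total[k] += 1

    return vocab, doc_topic, topic_tag


def top_tags(vocab, topic_tag, n=3):
    """Return the n most frequent tags of each topic."""
    return [
        [vocab[w] for w in sorted(range(len(vocab)), key=lambda w: -row[w])[:n]]
        for row in topic_tag
    ]


# Toy tag collections, invented for this example.
docs = [
    ["r", "statistics", "regression"],
    ["statistics", "regression", "r", "plots"],
    ["seo", "analytics", "wordpress"],
    ["wordpress", "seo", "blogging", "analytics"],
]
vocab, doc_topic, topic_tag = lda_gibbs(docs, n_topics=2)
for k, tags in enumerate(top_tags(vocab, topic_tag)):
    print(f"topic {k}: {tags}")
```

Each row of `doc_topic` can then replace the original sparse tag columns, which is where the dimensionality reduction comes from: every observation is described by a handful of topic counts instead of thousands of rare tags.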

GTM for Bloggers: How to track outbound links clicks with Google Tag Manager

A solid way to build SEO around your blog is obviously to link to sources and to other in-depth articles for your readers to explore. This is an ethical, yet convenient, way to manage a blog and, last but not least, makes Google not evil. On the other hand, being able to measure how much and which kind of traffic you’re driving to your sources is crucial for your content strategy, as it could give you insights about which kinds of topics your readers are interested in and which you should write about in the first place. That’s why the third step of our learning path is about how to track outbound link clicks and how to efficiently report them in Google Analytics.
