Too many tags! Topic modeling from multi values tags in data using LDA

Modern web platforms and CMSs tend to let users attach collections of tags to content, i.e. attributes describing the content (a corpus, a dataset, a collection…) that can take multiple values. This is not limited to web data (think of surveys where respondents can give multiple answers). So multi-valued tag collections often decorate even discrete datasets, and the values are free-form by definition. At the end of the day, we end up with thousands of different tags, most of them with very low frequency and almost useless for explaining the data and the relations among observations. My idea is to group the tags into topics, both to understand the data better and to reduce dimensionality, using topic modeling techniques such as LDA fitted with Gibbs sampling.
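
To make the idea concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, where each record's tag list plays the role of a document. It is written in plain Python with the standard library only and is not the implementation used in the post. The function name `lda_gibbs`, the toy `records`, and the values of `alpha`, `beta` and `n_iter` are all assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; each doc is a list of tags.

    alpha and beta are symmetric Dirichlet priors (illustrative defaults).
    """
    rng = random.Random(seed)
    vocab = sorted({tag for doc in docs for tag in doc})
    tag_id = {tag: i for i, tag in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # tags of doc d assigned to topic k
    topic_tag = [[0] * n_vocab for _ in range(n_topics)]  # times tag w is assigned to topic k
    topic_total = [0] * n_topics                          # total tags assigned to topic k
    assignments = []                                      # current topic of every tag occurrence

    # Start from a random topic for every tag occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for tag in doc:
            k = rng.randrange(n_topics)
            w = tag_id[tag]
            doc_topic[d][k] += 1
            topic_tag[k][w] += 1
            topic_total[k] += 1
            z_doc.append(k)
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, tag in enumerate(doc):
                w = tag_id[tag]
                k = assignments[d][i]
                # Take the current occurrence out of the counts.
                doc_topic[d][k] -= 1
                topic_tag[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised full conditional over topics for this occurrence.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_tag[t][w] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Put it back under the newly sampled topic.
                doc_topic[d][k] += 1
                topic_tag[k][w] += 1
                topic_total[k] += 1
                assignments[d][i] = k

    # The three most frequent tags of each topic, as a rough label.
    top_tags = [
        [vocab[w] for w in sorted(range(n_vocab), key=lambda w: -topic_tag[k][w])[:3]]
        for k in range(n_topics)
    ]
    return doc_topic, topic_tag, top_tags


# Toy data: every record carries a free-form list of tags.
records = [
    ["r", "statistics", "ggplot2"],
    ["r", "rstudio", "statistics"],
    ["statistics", "ggplot2", "r"],
    ["ubuntu", "server", "linux"],
    ["linux", "server", "nginx"],
    ["ubuntu", "linux", "nginx"],
]

doc_topic, topic_tag, top_tags = lda_gibbs(records, n_topics=2)
for k, tags in enumerate(top_tags):
    print(f"topic {k}: {tags}")
```

Each row of `doc_topic` can then stand in for the original tag list: a record is described by a handful of topic counts instead of thousands of sparse tag indicators, which is the dimensionality reduction mentioned above. On real data the number of topics and the priors would need tuning, and a tested library would be a better choice than this sketch.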

How to install RStudio Server in a production-ready Ubuntu environment

RStudio is a full-featured programming environment for coding in R and, since it is free, it is entirely in scope for our Open Analytics duties. The best-known version of this nice piece of software is the desktop one, which is available for Windows, Mac and Linux and is not uncommon as a personal analytics solution, especially where SAS is way too expensive. Sometimes people use RStudio to work locally with R for developing, prototyping and testing, and then deploy the .R files to a remote (heavy) server, which runs them with stand-alone R for better performance.

The question is: why do that when one could use RStudio's bigger brother, RStudio Server?
