Modern web platforms and CMSs increasingly let users attach collections of tags, i.e. attributes describing the content (a corpus, a dataset, a collection…) that can take multiple values. This is not limited to web data: think of surveys where respondents can give several answers to the same question. As a result, multi-valued tag collections often decorate even discrete datasets, and their values are free-form by definition. At the end of the day we end up with thousands of distinct tags, most of them with very low frequency and of little use for explaining the data or the relations among observations. My idea is to classify the tags into topics, both to understand the data better and to reduce dimensionality, using topic models such as LDA fitted with Gibbs sampling.
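To make the idea concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python, treating each tagged item as a "document" whose "words" are its tags. It is an illustration under stated assumptions rather than the implementation behind this work: the function name `lda_gibbs`, the toy `items` list, the number of topics, and the hyperparameters `alpha` and `beta` are all hypothetical choices made for the example.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; each doc is a list of tags."""
    rng = random.Random(seed)
    vocab_size = len({tag for doc in docs for tag in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tags of doc d in topic k
    topic_tag = [Counter() for _ in range(n_topics)]   # count of each tag in topic k
    topic_total = [0] * n_topics                       # total tags in topic k
    assignments = []

    # Start from a random topic assignment for every tag occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for tag in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_tag[k][tag] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, tag in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_tag[k][tag] -= 1
                topic_total[k] -= 1
                # Resample from the conditional distribution over topics.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_tag[j][tag] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_tag[k][tag] += 1
                topic_total[k] += 1

    return doc_topic, topic_tag


# Toy data (hypothetical): each item carries a free-form list of tags.
items = [
    ["python", "pandas", "statistics"],
    ["statistics", "regression", "python"],
    ["football", "league", "goal"],
    ["goal", "football", "transfer"],
    ["pandas", "regression", "python"],
    ["league", "transfer", "football"],
]

doc_topic, topic_tag = lda_gibbs(items, n_topics=2)

for k, counts in enumerate(topic_tag):
    print(f"topic {k}:", [tag for tag, _ in counts.most_common(3)])
```

Each row of `doc_topic` holds an item's tag counts per topic, so it can stand in for thousands of sparse tag indicators as a handful of features. On real data a tested library implementation would be a safer choice than this sketch, and the number of topics would need tuning.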
Data scientist, open source Business Intelligence evangelist, Head of Analytics at MRM//McCann.