Modern web platforms and CMSs tend to let users attach collections of tags, i.e. attributes describing the content (a corpus, a dataset, a collection…) that can take multiple values. This is not limited to web data (think of surveys where respondents can give multiple answers). So multi-valued tag collections often decorate even discrete datasets, and their values are free-form by definition. As a result, we end up with thousands of different tags, most of them with very low frequency and almost useless for explaining the data and the relations among observations. My idea is to group the tags into topics, both to understand the data better and to reduce dimensionality, using topic modeling solutions like LDA with Gibbs sampling.

For this tutorial, I’m going to use the BoardGameGeek dataset, a collection that describes more than 94,000 board games and their expansions, taken from the whole of human history. The dataset has several columns that host various kinds of tags: categories, mechanics, implementations, families, designers and so on. Multiple tags can be added for a given property, and they are stored in the table in comma-separated form. In the end, we have more than 5,000 distinct tags in the data, most of them with very low frequency (and explaining very little variance).

This tutorial makes extensive use of the tidyverse collection, so a basic understanding of its programming patterns is recommended. Here’s how to load the data and all the needed dependencies:

Preprocessing the data

Before diving into the topic modeling stage, the data need some cleaning. To start, I removed game expansions from the database: they almost always share the same tags as the main game, so they add no variance to the data while hugely increasing the computation time. Then I decided to remove all games that have received fewer than 5 ratings from the community. This helps me wipe out unpublished games, vaporware and, in general, all those forgotten games that nobody has tagged yet or that have been tagged poorly.
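As a rough illustration of the two filters described above, here is a minimal sketch on a toy table. The column names is_expansion and users_rated are my assumptions, not the real column names of the dataset, and the rows are invented. It is written in base R so it runs without extra packages; the dplyr equivalent is shown in a comment.

```r
# Toy stand-in for the games table; column names and rows are assumptions.
games <- data.frame(
  name         = c("Base Game", "Base Game: Expansion", "Obscure Prototype", "Classic"),
  is_expansion = c(FALSE, TRUE, FALSE, FALSE),
  users_rated  = c(1200, 300, 2, 5),
  stringsAsFactors = FALSE
)

# Drop expansions, then drop games with fewer than 5 community ratings.
# dplyr equivalent: games %>% filter(!is_expansion, users_rated >= 5)
games_clean <- games[!games$is_expansion & games$users_rated >= 5, ]

print(games_clean$name)
```

Note that a game with exactly 5 ratings is kept, since the rule removes only those with fewer than 5.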

Since tag columns can host more than one tag (comma-separated), I created a set of dummy variables from all the tags I found in the columns category, mechanic, family and implementation. This is done with the nice cSplit_e function from the splitstackshape package. Other tag collections exist in the data, for example the column expansions, but they are so sparse and the quality of their tags is so low (in terms of frequency distribution) that they would add more noise than value, so I decided to ignore them.
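To make the dummy-variable step concrete, here is a small base R sketch of the same idea on invented data. It is not the cSplit_e call used in the post, just a hand-rolled equivalent: split each comma-separated cell, collect the distinct tags, and build one 0/1 column per tag. The games data frame and the category_ prefix are illustrative assumptions.

```r
# Toy data: hypothetical games with comma-separated category tags.
games <- data.frame(
  name     = c("Game A", "Game B", "Game C"),
  category = c("Abstract Strategy", "Negotiation,Economic", "Abstract Strategy"),
  stringsAsFactors = FALSE
)

# Split each cell into a character vector of tags and trim stray spaces.
tag_list <- lapply(strsplit(games$category, ",", fixed = TRUE), trimws)

# All distinct tags found in the column.
all_tags <- sort(unique(unlist(tag_list)))

# One 0/1 column per tag: rows are games, columns are tags.
dummies <- t(vapply(
  tag_list,
  function(tags) as.integer(all_tags %in% tags),
  integer(length(all_tags))
))
dimnames(dummies) <- list(games$name, paste0("category_", all_tags))

print(dummies)
```

Repeating this for each tag column and binding the results side by side gives the wide, sparse matrix discussed next.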

I ended up with quite an impressive set of 5841 dummy variables, very sparsely distributed. And this is where the problems start to arise…

Why topic modeling?

To answer this question, let’s have a look at the frequencies of all the tags in one of the most important attributes: the game category.

BoardGameGeek dataset - Top 20 games category

Looking at the relative frequencies, we see that, apart from the top one or two elements, the tags spread out along a very long tail, with most of them (those outside the Top 20) each appearing in less than 1% of games. If you put terms in columns and games in rows, you end up with a very sparse document-term matrix where the “documents” are the games and the “terms” are the tags themselves.
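To put numbers on "long tail" and "very sparse", the sketch below computes the share of games carrying each tag and the fraction of empty cells in a game-by-tag matrix. The matrix is a made-up toy, so the figures it prints say nothing about the real dataset.

```r
# Toy 0/1 game-by-tag matrix (rows = games, columns = tags); values are made up.
dummies <- matrix(
  c(1, 0, 0, 0,
    1, 1, 0, 0,
    1, 0, 0, 0,
    0, 0, 1, 0,
    1, 0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("game", 1:5), c("tagA", "tagB", "tagC", "tagD"))
)

# Share of games carrying each tag, most frequent first.
tag_freq <- sort(colMeans(dummies), decreasing = TRUE)

# Fraction of zero cells in the matrix.
sparsity <- mean(dummies == 0)

print(tag_freq)
print(sparsity)
```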

The problem is that with such a sparse matrix, typical dimensionality reduction simply does not work. I trained different MCA models and ended up with the primary dimension explaining only ~2% of the total variance, the secondary dimension adding ~0.005%, and so on. There is no clear cutoff, and thus no accurate dimensionality reduction.

But we must consider that tags are filled in by users, so synonyms are really common. It is therefore natural to expect relations among at least some of them. As I said, the shape of this data closely resembles the document-term matrix you would have when working on a corpus of text documents. So it is reasonable to try models usually applied to text mining problems, such as topic modeling.

The point is: does a topic model exist for a very sparse document-term matrix? Yes, it does!

Topic modeling: Build the model with LDA and Gibbs sampling

If you read the aforementioned paper, you’ll see that Latent Dirichlet Allocation is presented as the best model so far for classifying short and sparse text with hidden topics (i.e. without knowing the topics in advance). LDA requires you to specify the number of topics you want… but what is the optimal number of topics for a given dataset?

To answer this question, I could build a very robust set of indexes, but it would take days to compute, and I’m not sure they would work in this case, as they tend to overestimate the number of topics in some circumstances. Since I need minimization more than accuracy (remember, I’m pursuing dimensionality reduction), I decided not to follow this path.

I think it’s much easier to use this function made by tinker, which basically performs a trial-and-error simulation of various LDA models and selects the number of topics that maximizes the harmonic mean of the log-likelihood. After running the code chunk below on my data I got 37 as the optimal number of topics, but thanks to the plot I was able to see that 25 was very close. So I decided to keep 25 instead.
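The selection criterion can be sketched in a few lines. This is not the function referenced above, only my own minimal version of the idea: take the log-likelihood values saved for each candidate number of topics, compute the log of the harmonic mean of the corresponding likelihoods (using the log-sum-exp trick so nothing underflows), and pick the candidate with the highest score. The log-likelihood values below are invented for illustration.

```r
# Log of the harmonic mean of likelihoods, computed from log-likelihoods.
# HM = n / sum(1 / L_i), so log(HM) = log(n) - log(sum(exp(-log L_i))).
log_harmonic_mean <- function(log_liks) {
  neg <- -log_liks
  m <- max(neg)  # shift by the maximum for numerical stability
  log(length(log_liks)) - (m + log(sum(exp(neg - m))))
}

# Made-up log-likelihood draws for three candidate topic counts.
candidates <- list(
  "10" = c(-5210, -5205, -5200),
  "25" = c(-5010, -5004, -5001),
  "37" = c(-5005, -5000, -4998)
)

scores <- sapply(candidates, log_harmonic_mean)
best_k <- as.integer(names(scores)[which.max(scores)])

print(scores)
print(best_k)
```

In this toy example the scores for 25 and 37 are close while 10 is far behind, which is the kind of picture that can justify choosing the smaller of two near-equivalent values.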

Now we’re ready to train the model using Gibbs sampling. If you want to try this at home, you could take a nap in the meantime, as it takes a while (if you have more than 30 topics and a mid-size dataset, you had better go to bed instead!).
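For readers who want to see the shape of the training call, here is a minimal sketch with the topicmodels package on a tiny invented game-by-tag matrix. The matrix, the choice of k = 2 and the control values (seed, burnin, iter) are assumptions chosen so the example runs in a moment; they are not the settings used for the real data.

```r
library(topicmodels)

# Toy game-by-tag count matrix; every row needs at least one non-zero entry.
tags <- c("chess", "abstract", "grid", "economy", "trading", "farming")
dtm <- matrix(
  c(1L, 1L, 1L, 0L, 0L, 0L,
    1L, 1L, 0L, 0L, 0L, 0L,
    0L, 1L, 1L, 0L, 0L, 0L,
    0L, 0L, 0L, 1L, 1L, 1L,
    0L, 0L, 0L, 1L, 1L, 0L,
    0L, 0L, 0L, 0L, 1L, 1L),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("game", 1:6), tags)
)

# Fit LDA with Gibbs sampling; k and the control values are illustrative.
fit <- LDA(
  slam::as.simple_triplet_matrix(dtm),
  k = 2,
  method = "Gibbs",
  control = list(seed = 1234, burnin = 100, iter = 500)
)

# Posterior distributions: terms per topic (beta) and topics per game (gamma).
post  <- posterior(fit)
beta  <- post$terms   # topics x tags, each row sums to 1
gamma <- post$topics  # games x topics, each row sums to 1

print(round(gamma, 2))
```

On a real dataset the same call applies, only with the full matrix and a larger k, which is where the long running time comes from.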

Here, I’ve also sliced a Top 10 term-topic matrix, ranked by the index beta, which represents the probability of a given term (tag) being generated by each topic. The same can be done with the index gamma, which gives the probability of a given document (game) being assigned to each topic. As a practical example, let’s see the top 10 tags that you will most probably meet in topic 24.
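Slicing the top terms of a topic boils down to sorting one row of the beta matrix. The sketch below shows this on an invented beta matrix with two topics and four tags; top_terms is a hypothetical helper of mine, not a function from any package.

```r
# Toy beta matrix: one row per topic, one column per tag, rows sum to 1.
beta <- rbind(
  topic1 = c(chess = 0.50, abstract = 0.30, grid = 0.15, economy = 0.05),
  topic2 = c(chess = 0.05, abstract = 0.10, grid = 0.15, economy = 0.70)
)

# Hypothetical helper: the n most probable tags for one topic.
top_terms <- function(beta, topic, n = 10) {
  head(sort(beta[topic, ], decreasing = TRUE), n)
}

print(top_terms(beta, "topic1", n = 2))
```

The same helper applied to a gamma matrix transposed to topics by games would list the games most strongly associated with a topic.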


Visual Inspection of Results

Inspecting the results is crucial to understand whether the topics are realistic and reliable, and this must be done manually (another good reason to keep the number of topics low!). The tag and document plots, inspected on their own, are probably difficult to interpret, so I’m going to build an integrated plot. Staying with topic 24:

BGG dataset - Topic Modeling - Summary of Topic 24

Now it’s clear! This is the cluster of abstract strategy games that were probably inspired by Chess! This iconic game is not in the Top 10 documents (games), but I expect it to fall into topic 24 as well, so let’s have a look at where it is:

Nice to see that, of the 65 Chess games (and variants) present in the database, 53 have been correctly assigned to topic 24. If this is not a good sign, nothing else will be! 🙂
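A check like this one can be sketched as follows: assign each game to its most probable topic from the gamma matrix, then tabulate the assigned topics of the games whose name contains "Chess". The gamma values and game names here are invented, so the counts are not those of the real dataset.

```r
# Toy gamma matrix: rows = games, columns = topics; each row sums to 1.
gamma <- rbind(
  "Chess"           = c(0.10, 0.80, 0.10),
  "Chess Variant A" = c(0.20, 0.70, 0.10),
  "Chess Variant B" = c(0.60, 0.30, 0.10),
  "Trading Game"    = c(0.10, 0.10, 0.80)
)
colnames(gamma) <- c("1", "2", "3")

# Assign each game to the topic with the highest probability.
assigned <- as.integer(colnames(gamma)[max.col(gamma, ties.method = "first")])
names(assigned) <- rownames(gamma)

# Where did the games with "Chess" in their name end up?
is_chess <- grepl("Chess", names(assigned), fixed = TRUE)
print(table(assigned[is_chess]))
```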

Repeat these steps for each topic and you’ll be able to assign a label to all of them and add the topic column to the original dataset. Luckily enough, in this case it seems that all the topics are reliable.

Distributions of results and further explorations

If you remember, the category distribution (the most important tag collection) has a short head and a long tail, with no tag expressing a relevant amount of variance. So let’s see the frequency plot for topics instead.

BGG Dataset - Topic Modeling - Topic Plot

Aaaah… much better, isn’t it? A finite set of discrete topics, all carrying a relevant weight in expressing the variance in the data! A perfect starting point for further analysis.

Let’s say, for example, we want to study the correlation between rating and weight (difficulty). It’s natural to use topics to slice data here.

BGG dataset - Topic Modeling - Correlation Plot between rating and weight

This suggests not only that BGG users prefer difficult games (something already known, actually), but that this is especially true for Corporate Strategy Games. This is realistic, since this topic contains games like Puerto Rico and Le Havre, those German games famous for being strategically rich but very hard to play. Kinda cool, isn’t it?
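Slicing a correlation by topic is a one-liner once the topic column is in the data. The sketch below uses invented ratings, weights and topic labels, and the column names topic, weight and rating are assumptions; it only shows the mechanics of computing one correlation per topic.

```r
# Toy data; topic labels and values are invented for illustration.
games <- data.frame(
  topic  = rep(c("A", "B"), each = 4),
  weight = c(1, 2, 3, 4, 1, 2, 3, 4),
  rating = c(5, 6, 7, 8, 7, 6, 7, 6)
)

# Pearson correlation between rating and weight within each topic.
cor_by_topic <- sapply(
  split(games, games$topic),
  function(d) cor(d$rating, d$weight)
)

print(round(cor_by_topic, 3))
```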

Relations between topics

While I assigned only one topic to each game (the one most likely to have generated its terms), the Gibbs matrix actually holds the probability for all terms and all topics. This makes sense, as documents can be a variable mixture of all the topics (each with a different probability). This lets me explore the relations between topics based on term (tag) probabilities and plot them in a very impressive network graph:
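The post does not spell out how the relations between topics are measured, so the following is only one possible approach, chosen by me for illustration: compute the cosine similarity between the term distributions of each pair of topics and keep the pairs above a threshold as edges of the network. The beta values and the 0.9 threshold are arbitrary assumptions.

```r
# Toy beta matrix: rows = topics, columns = tags; values are made up.
beta <- rbind(
  topic1 = c(0.50, 0.30, 0.15, 0.05),
  topic2 = c(0.45, 0.35, 0.10, 0.10),
  topic3 = c(0.05, 0.05, 0.10, 0.80)
)

# Cosine similarity between every pair of topic rows.
norms <- sqrt(rowSums(beta^2))
similarity <- (beta %*% t(beta)) / (norms %o% norms)

# Keep each pair once (upper triangle) if it exceeds an arbitrary threshold.
edges <- which(similarity > 0.9 & upper.tri(similarity), arr.ind = TRUE)

print(round(similarity, 3))
print(edges)
```

The resulting edge list (here only the pair of topics 1 and 2) is the kind of input a graph plotting package expects.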

BGG Dataset - Topic Model - Topics Relations


In this Kernel I introduced an LDA approach to the topic modeling problem for data with several discrete, many-to-many collections of tags. Although this model is usually used in text mining, if you have enough tags you can successfully find a finite number of topics, even if the document-term matrix is very sparse. After some labeling through manual inspection of the results, it seems that the topics are quite reliable and the model is robust. Thus, topics can be used for better inspection and, last but not least, for dimensionality reduction. Thanks for reading this post, and don’t forget that all the code presented here can be run, modified and discussed on Kaggle. Aloha!
