With the growing popularity of micro-blogging, there has also been increasing interest in analyzing such texts with machine learning methods. The first challenge, as the name already indicates, is that the texts are very short and riddled with slang and abbreviations. In a way, this is very similar to our situation, where a movie is described by a very limited set of keywords and not all of those keywords have a counterpart in ordinary language. However, in contrast to the movie keywords, a micro-blog post still follows a sentence structure, because it is written in a natural language.
But regardless of the actual domain, short text needs a different approach to topic discovery, because there is very little data to describe topics precisely and consistently. For simplicity, we use NMF[1301.3527] with a sparsity constraint to learn the topics: the approach is well understood, and the absence of negative weights makes the output human-readable.
The sparsity is beneficial for several reasons. First, a topic can usually be described by very few keywords, which means those keywords should have a high impact while all other keywords should have no impact at all. Second, we assume that a lot of fine-grained topics exist in the data, and storing them as sparse weight vectors makes it easier to disentangle patterns and increases the expressive power of the learned representation.
Therefore, we train the NMF model with a sparsity of 0.96 and set the number of topics to 20% of the number of raw input features, i.e. 200 topics for our 1,000 keywords. For the training, we use the co-occurrence matrix and not the data itself. Furthermore, we set all weights in the learned model that are below 1e-6 to zero. With these settings, the trained model has on average ~10 positive weights per learned topic, but there are some anomalies where a topic has more than 50 entries. We plotted the top-k features for each learned topic and found the revealed topics to be reasonable; together with a coverage of 975 of 1,000 keywords (97.5%), this suggests that the disentangling worked. In other words, if we ignore the anomalies, we learned about 180 fine-grained topics that cover about 98% of all input features. On average, ~9 of the 200 topics are active per sample, which means a sparsity of ~96%.
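The setup described above can be sketched roughly as follows. This is not the cited sparse NMF method: as a stand-in, it uses plain multiplicative updates with an L1 penalty on the topic matrix, written in NumPy. The function names, the penalty strength `l1`, the iteration count and the random toy data are all assumptions made for illustration; only the 20% topic ratio and the 1e-6 cut-off are taken from the text.

```python
import numpy as np

def cooccurrence(X):
    """Keyword-by-keyword co-occurrence counts from a binary (samples x keywords) matrix."""
    X = np.asarray(X, dtype=float)
    return X.T @ X

def sparse_nmf(V, n_topics, l1=0.5, n_iter=300, seed=0, eps=1e-9):
    """Factor V ~ W @ H with W, H >= 0 and an L1 penalty on H.

    Stand-in for the cited method: multiplicative updates, where the
    l1 term in the denominator shrinks small entries of H towards zero.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_topics))
    H = rng.random((n_topics, V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + l1 + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Rescale so the penalty cannot be dodged by inflating W;
        # the product W @ H is unchanged by this step.
        scale = np.linalg.norm(W, axis=0) + eps
        W /= scale
        H *= scale[:, None]
    return W, H

def prune_and_summarize(H, threshold=1e-6):
    """Zero out weights below the threshold and report simple statistics."""
    H = np.where(H < threshold, 0.0, H)
    active = H > 0
    stats = {
        "weights_per_topic": active.sum(axis=1),  # positive weights in each topic
        "coverage": active.any(axis=0).mean(),    # share of keywords used by any topic
    }
    return H, stats

# Toy data (assumption): 200 samples, 50 keywords, a few keywords per sample.
rng = np.random.default_rng(42)
X = (rng.random((200, 50)) < 0.08).astype(float)

n_topics = X.shape[1] // 5  # 20% of the input features, as in the text
W, H = sparse_nmf(cooccurrence(X), n_topics)
H, stats = prune_and_summarize(H)

print("mean positive weights per topic:", stats["weights_per_topic"].mean())
print("keyword coverage:", stats["coverage"])
```

Each row of `H` is one topic over the keywords. Multiplicative updates rarely produce exact zeros, which is one reason a small cut-off is useful in a sketch like this before counting the positive weights per topic.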
Despite the success of the model, there are some major drawbacks. First, without an orthogonality constraint, the learned topics might still overlap too much, which means the model is wasting capacity. Second, when data is limited, the captured aspects of a theme might be too broad. For instance, the ‘zombie’ theme is only covered by a single topic that is very general:
– werewolf, undead, blood, zombie, vampires
Clearly, the topic encodes the pairs (undead, zombie) and (blood, vampires), which are the most obvious themes, but other combinations are completely ignored.
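One simple way to check the overlap concern raised above is to compare the learned topics pairwise. The sketch below computes the cosine similarity between topic weight vectors; values close to 1 indicate two topics that encode nearly the same keywords. The function name `topic_overlap` and the toy topic matrix are hypothetical, and cosine similarity is only one possible overlap measure, not something the original model was evaluated with.

```python
import numpy as np

def topic_overlap(H, eps=1e-12):
    """Pairwise cosine similarity between topic weight vectors (rows of H)."""
    H = np.asarray(H, dtype=float)
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    Hn = H / np.maximum(norms, eps)   # guard against all-zero topics
    S = Hn @ Hn.T
    np.fill_diagonal(S, 0.0)          # ignore self-similarity
    return S

# Hypothetical toy topics over 5 keywords: the first two nearly coincide,
# the third uses a disjoint set of keywords.
H = np.array([
    [1.0, 0.8, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.7],
])

S = topic_overlap(H)
i, j = np.unravel_index(S.argmax(), S.shape)
print(np.round(S, 2))
print("most overlapping pair of topics:", int(i), int(j))
```

With non-negative weights, two topics are orthogonal exactly when they share no keyword, so the mean off-diagonal value of `S` could also serve as a rough indicator of how much capacity is spent on redundant topics.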
In a nutshell, forcing the model to learn topics that are very sparse on the one hand but diverse on the other seems to be the right step, but more work is needed to control the capacity and the diversity of the model.