One good thing about machine learning is that it is interdisciplinary: it draws on ideas from physics, mathematics, neuroscience and computer science. Stated differently, there are few limits as long as you can express your idea in a formal mathematical way. Combined with powerful tools that train your models without requiring you to do all the hard work, like derivations, manually, it becomes much easier to focus on the actual problem.
In a recent post, we let our thoughts wander by considering different ways to describe movies. Well-known approaches include TF-IDF and collaborative filtering, among many others. However, in contrast to classical ranking or information retrieval systems, our first goal is not to derive a supervised model, but a model that can sufficiently describe the underlying data.
We stumbled upon a new possible approach while implementing a “predict the next word” feature for our front-end. The idea is simple: if you enter some keywords (a search query), the system should suggest words that fit the query context. For example, a query like ‘alien sf’ could be extended with “creature” or “space”. Such a model can be trained in numerous ways; popular choices are neural probabilistic or log-bilinear models.
The idea is pretty simple. We consider a vocabulary V that consists of textual words. In the original space, each word is represented as a one-hot encoding; during training, the model learns a dense feature vector for each word. We then define a context length, for example n=3. All we need to do is convert some text data into our vocabulary by replacing each word with its corresponding ID.
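The preprocessing step can be sketched in a few lines. The corpus and variable names here are illustrative, not from our actual pipeline:

```python
# Build a vocabulary from a toy corpus and map each word to an integer ID.
corpus = "my lifeboat is full of eels".split()

# vocabulary: word -> ID (sorted for a deterministic assignment)
vocab = {word: idx for idx, word in enumerate(sorted(set(corpus)))}

# replace each word in the text with its vocabulary ID
ids = [vocab[word] for word in corpus]
print(vocab)  # {'eels': 0, 'full': 1, 'is': 2, 'lifeboat': 3, 'my': 4, 'of': 5}
print(ids)    # [4, 3, 2, 1, 5, 0]
```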
A training sample is generated by splitting a sentence like “my lifeboat is full of eels” into a context and the word to predict, for example context “my lifeboat is” and target word “full”. With the IDs from the vocabulary, each training sample becomes a tuple (ID1, ID2, ID3, ID4), where positions 1..3 are the context and position 4 is the word to predict.
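Generating these tuples is a simple sliding-window operation. The ID sequence below stands in for a hypothetical vocabulary mapping of the example sentence:

```python
def make_samples(ids, n=3):
    # Each sample is a tuple (ID1, ..., IDn, ID_target): n context IDs
    # followed by the ID of the word right after the window.
    return [tuple(ids[i:i + n + 1]) for i in range(len(ids) - n)]

# "my lifeboat is full of eels" encoded with a hypothetical vocabulary
ids = [4, 3, 2, 1, 5, 0]
print(make_samples(ids))  # [(4, 3, 2, 1), (3, 2, 1, 5), (2, 1, 5, 0)]
```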
With a trained model, we can then predict which word of the vocabulary best matches a given test context. What all these approaches have in common is that they are supposed to generalize to contexts that have not been seen before by exploiting the learned semantics of the words in the vocabulary.
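To make the prediction step concrete, here is a toy sketch of the log-bilinear idea: each word has an embedding, the context embeddings are combined linearly (one transformation per context position), and the result is scored against every word’s embedding via a softmax. The weights below are random, i.e. untrained; this only illustrates the mechanics, not our actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n = 6, 8, 3               # vocabulary size, embedding dim, context length
R = rng.normal(size=(V, d))     # one embedding per vocabulary word
C = rng.normal(size=(n, d, d))  # one combination matrix per context position

def predict(context_ids):
    # Combined context representation: sum of position-transformed embeddings.
    h = sum(C[j] @ R[w] for j, w in enumerate(context_ids))
    scores = R @ h                # similarity of h to each vocabulary word
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()    # softmax over the vocabulary

p = predict([4, 3, 2])  # hypothetical IDs for the context "my lifeboat is"
print(p.argmax())       # ID of the most probable next word (meaningless until trained)
```

Training would adjust R and C so that the probability of the observed target word is maximized for each (context, target) sample.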
After training a basic model on our data, we thought about how to use the learned similarities between words for our semantic clustering. We will elaborate on that in the next post.