How to Handle Out-of-Vocabulary Words Pragmatically

Using pre-trained word embeddings can be a huge advantage if you don’t have enough data to train a proper embedding, or if you are afraid of overfitting to the task at hand when training it jointly with your model. For English you can find quite a lot of publicly available word embedding models, but the situation is different for other languages like French, German or Spanish. There are some, like the ones provided with fastText, but depending on the language it can be challenging. So, let’s assume that you are lucky and there is a pre-trained model; then the next challenge waits just around the corner. The problem is that for specific tasks there are always words that are not present in the vocabulary. There are some dirty solutions, like mapping all those words to a fixed token like ‘UNK’ or using random vectors, but none of these approaches is really satisfying.
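Just to make the naive fallback concrete, here is a minimal sketch, assuming a plain Python dict that maps words to numpy vectors (the words and vectors are made up for the example):

import numpy as np

# hypothetical lookup: word -> pre-trained vector (e.g. loaded from a text file)
embeddings = {"coffee": np.random.rand(300), "tea": np.random.rand(300)}
unk_vector = np.zeros(300)  # one fixed vector shared by all OOV words

def lookup(word):
    # every OOV word collapses onto the same 'UNK' vector,
    # so all information about the unseen word is lost
    return embeddings.get(word, unk_vector)

print(lookup("coffee")[:5])   # known word -> its pre-trained vector
print(lookup("covfefe")[:5])  # OOV word -> generic UNK vector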

In the case of fastText there is a clever, built-in way to handle OOV words: n-grams. The general idea of n-grams is to also consider the internal structure of a word. For instance, with n=3 and word=’where’: [<wh, whe, her, ere, re>]. The boundary symbols < and > are used to distinguish the word “her” from the n-gram her, since the former is encoded as “<her>”, which leads to [<he, her, er>]. For a new word, the sum of its n-gram vectors is used to encode it, which means that as long as you have seen the n-grams, you can encode any new word you require. For a sufficiently large text corpus, it is very likely that a large portion, or even all, of the required n-grams are present.
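To see the decomposition in action, here is a minimal sketch of the n-gram extraction with boundary symbols in plain Python (fastText itself uses a range of n, by default 3 to 6, and hashes the n-grams internally; this only illustrates the idea):

def char_ngrams(word, n=3):
    # add boundary symbols so that the word "her" (<her>)
    # is distinguished from the n-gram "her" inside "where"
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(char_ngrams("her"))    # ['<he', 'her', 'er>']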

Since most implementations, fastText included, use the hashing trick, you cannot directly export the n-gram-to-vector mapping. However, there is a command to query the n-grams for a given word:

$ fasttext print-ngrams my_model "gibberish"
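If you use the official Python bindings instead of the command line, the same information should be available via get_subwords, which returns the n-grams together with their (hashed) row indices in the input matrix; a rough sketch, assuming a model file my_model.bin:

import fasttext

model = fasttext.load_model("my_model.bin")  # hypothetical model file

# returns the subwords (including the full word if it is in the vocabulary)
# and the corresponding row indices of the input matrix
subwords, indices = model.get_subwords("gibberish")
for ngram, idx in zip(subwords, indices):
    print(ngram, model.get_input_vector(idx)[:5])  # first few dimensions only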

There is an open pull request (#289) to export all n-grams for a list of words, but right now you have to call fasttext once per word, which is very inefficient for huge models. Without knowing about the pull request, we did exactly the same: first we slightly modified the code to accept a list of words from stdin, and then we performed a sort with duplicate elimination to get a distinct list of all n-grams that are present in the model.
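Without touching the C++ code, a similar result can be obtained through the Python bindings by collecting the subwords of every word in a list and removing duplicates; a sketch under the assumption that words.txt contains one word per line and my_model.bin is the trained model:

import fasttext

model = fasttext.load_model("my_model.bin")  # hypothetical paths
all_ngrams = set()

with open("words.txt") as f:
    for line in f:
        word = line.strip()
        if not word:
            continue
        subwords, _ = model.get_subwords(word)
        all_ngrams.update(subwords)  # the set takes care of duplicate elimination

with open("ngrams.txt", "w") as out:
    for ngram in sorted(all_ngrams):
        out.write(ngram + "\n")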

Now we have a pre-trained model for the fixed vocabulary and, additionally, a way to map OOV words into the same vector space as long as their n-grams are known. We also evaluated the mapping with slightly modified or related words that are not in the vocabulary, with very good results.
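A quick way to sanity-check such a mapping is to compare the vector of an OOV variant against its in-vocabulary counterpart with cosine similarity; in the Python bindings, get_word_vector already composes the vector from the n-grams if the word itself is unknown (the words below are just examples):

import numpy as np
import fasttext

model = fasttext.load_model("my_model.bin")  # hypothetical model file

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

v_known = model.get_word_vector("coffee")
v_typo  = model.get_word_vector("coffeee")  # misspelled, most likely OOV

# should be close to 1 if the shared n-grams carry most of the signal
print(cosine(v_known, v_typo))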

Bottom line, n-grams are certainly not a silver bullet, but they help a lot if you work with data that is dynamically changing and contains spelling errors, variations and/or made-up words. Furthermore, publicly available models often already deliver solid results, which takes the burden off you to train a model yourself, which might overfit the problem at hand or not be satisfying at all because you don’t have enough data. In case a model does not come with n-gram support, there is also a good chance to transfer the knowledge encoded in its vectors into n-grams by finding an appropriate loss function that preserves this knowledge in the sum of n-grams; a rough sketch of this idea follows below.
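The following sketch (not our actual implementation, just an illustration of the idea) learns one vector per n-gram so that the sum over a word’s n-grams approximates the given pre-trained word vector, using a simple squared loss and plain gradient descent:

import numpy as np

# hypothetical inputs: pre-trained word vectors and a fixed n-gram inventory
word_vectors = {"where": np.random.rand(100), "here": np.random.rand(100)}

def char_ngrams(word, n=3):
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

ngrams = sorted({g for w in word_vectors for g in char_ngrams(w)})
G = {g: np.zeros(100) for g in ngrams}  # n-gram vectors to be learned

lr = 0.05
for epoch in range(200):
    for word, target in word_vectors.items():
        grams = char_ngrams(word)
        pred = sum(G[g] for g in grams)  # sum of n-gram vectors
        grad = 2.0 * (pred - target)     # gradient of ||pred - target||^2
        for g in grams:
            G[g] -= lr * grad            # every n-gram receives the same gradient

# afterwards, an OOV word can be encoded as the sum of its (known) n-gram vectors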
