Inspired by the recent advances in multi-task learning we also wanted to investigate the capabilities of those methods for simpler tasks. As stated by researchers, learning your own word embedding can easily lead to overfitting which is why we also wanted to use pre-trained embeddings. However, to address out-of-vocab words, it is mandatory that the model can also embed words that were unknown during training. For instance, fastText comes with n-gram support that allows exactly this and furthermore, it also provides models for other
languages than English.
However, with a vocab of 2M words and 300 dimensions, those models can be a bit cumbersome to work with. You require approximately 8 GB RAM just to hold the data in memory which is no problem for modern servers, but at your machine at home, it might be quite a lot since you also need to run other services. A further question is about the interface between the data and your favorite language. There is a very nice API for python, but it requires to load the full binary model and also introduces some minor overhead.
So, we decided to analyze the binary format of the model, to see if we can somehow represent the model more compactly. The good thing is that the format is quite simple: there is a dictionary, additional model parameters and at the end of the file there is the input and the output matrix.
For the pre-trained models, we have 2M n-gram buckets and 2M words, and each row has 300 dimensions (float32). A matrix is represented by a fixed 16 octet header: int64 rows, int64 cols and each vector is a sequence of 300 float32 values. Between the input and output matrix is a single octet that indicates if quantization has been used. With this numbers it is easy to go to the correct offset in the file to read the data.
The size of a matrix is: rows*cols*sizeof(float32) which is 2M * 300 * 4 ~ 2.2 GB. Actually, the input matrix is a concatenation of the vocabulary words and the buckets for the n-grams. Thus, the index 0…2M references the words and 2M…4M references the buckets which is why 2M is added to the index returned by the hash function. So, we have three matrices with a total size of ~7 GB.
Now, we come to the coding part which is pretty easy with python. You just need to calculate the total size of the data + 1 octet as a quantization flag and seek to this position relative from the end: lseek(fd, -total_size, 2). After, you parsed the header, you can simply read each vector by vec = unpack('300f', fd.read(300 * sizeof(float32))) and store it in a numpy array. It is also possible to use float16 for the array since the loss of precision should not be noteable for similarity lookups. Since there is no direct lookup for each n-gram, we also need to port the hash function from c++ to python which is used to retrieve the row ID for each n-gram. Example:
h = 2166136261
for i in xrange(len(str)):
h = h ^ int(ord(str[i]))
h = h * 16777619
h = h & 0xffffffff
return h % bucket_size
Because we cannot model a int32_t type, we need to ensure the boundaries with the AND mask. And bucket_size is 2M. To retrieve the embedding of a specific ngram, we use the hash function:
v_emb = W_ngrams[hash("%wh")]
To convert a new word into a lookup vector, we first generate all n-grams with a size between 3 and 6. Then we convert each n-gram to a row ID and sum all vectors up and dividing it by the length of the n-gram list, which is nothing more than the average.
id_list = [hash(x) for x in ngrams("%where%")]
emb = np.sum(W_ngrams[id_list], 0) / len(id_list)
It is not complicated to wrap the whole procedure into a single class that outputs an embedding vector for an arbitrary word. So, instead of fiddling with 7 GB we just have 2.2 GB as a single numpy array, in case we need to generate vectors for unknown words.
But there is a minor issue we need to address. The representation of words from the vocabulary is a combination of the actual embedding vector plus the sum of its n-grams. In other words, to perform a similarity lookup, we require both matrices to perform it. Since the accumulated representation does not change and the actual embedding vector is never use stand-alone, we decided to perform the pre-processing step and store the result as a separate matrix. Now, we can use this matrix to directly output word embeddings and/or to perform nearest neighbor queries.
W_words = [..]
W_grams = [..]
id_list = ngrams(vocab)
W_words = np.vstack((W_words, W_grams[id_list]))
W_word = np.sum(W_words, 0) / W.shape
Finally, we have two matrices: (1) the accumulated vocabulary matrix, which can be compared to the output of a word2vec with n-gram support and (2) one for the 2M n-grams which can be used with the hash function for lookups.
Bottom line, it is very helpful to have access to pre-trained embeddings which were trained on a large-scale corpus since very often, your data at hand is not sufficient to train ‘unbiased’ embeddings that generalize well. Furthermore, the n-gram support allows you to embed also oov words which will definitely occur and neither random inits nor UNK token are really appropriate to address them.