For documents, word-level embeddings work pretty well, because the vocabulary does not contain too many special cases. However, for tweets or (short) texts with a rather dynamic structure, words might not be the right unit, because it is not possible to generalize to unknown words. That includes the case where those words are pretty similar to existing ones, but not quite the same. The issue can be tackled with recurrent networks that operate on characters, but the problem is that their sequential processing cannot be parallelized easily. The recent paper “Attention Is All You Need” describes an approach that uses a positional encoding to represent the order of a sequence without the need for a recurrent network. Thus, the computation can be parallelized more easily and therefore performs better. However, we still have the problem that a word-level model cannot generalize to unknown words, except by mapping them to a ‘UNK’ token, which is not very useful in our case.
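For reference, the encoding proposed in that paper uses sine and cosine functions of different frequencies per dimension. A minimal sketch of the published formula (the sizes `max_len` and `d_model` below are chosen arbitrarily for illustration):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Positional encoding from "Attention Is All You Need":
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]                      # (max_len, 1)
    div = 10000.0 ** (np.arange(0, d_model, 2) / d_model)  # one frequency per dim pair
    pe[:, 0::2] = np.sin(pos / div)                        # even dimensions
    pe[:, 1::2] = np.cos(pos / div)                        # odd dimensions
    return pe

pe = sinusoidal_encoding(50, 16)
print(pe.shape)  # (50, 16)
```

The lookup we build below follows the same idea (deterministic weights per position and dimension), just with a simpler choice of frequencies.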
To be more precise, we try to model titles of TV shows, series, etc. in order to predict the category, and there it is imperative that we can generalize to slight variations of those titles, which are likely to occur due to spelling errors and the different formats used by different TV channels. Adapting the procedure to use characters instead of words is straightforward. We just need to build a lookup map with the values calculated from the position and the dimension index, up to the maximal sequence length present in the training data. Then we can easily calculate the encoding of arbitrary inputs with this formula:
char_id_list  # sentence as a sequence of character IDs
embed = np.sum(pos_enc_map[np.arange(len(char_id_list))] * char_embed[char_id_list], axis=0)
and one possible lookup could be calculated by:
result = np.zeros((max_seq_len, dim_len))
for pos in range(max_seq_len):
    a = np.arange(dim_len) + 1.
    if pos % 2 == 0:
        t = np.cos(pos / a)
    else:
        t = np.sin(pos / a)
    result[pos] = t
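Putting the lookup and the summation formula together, a self-contained sketch might look as follows. The sizes, the ord-based character-to-ID mapping, and the randomly initialized character embeddings are all placeholders for illustration; in practice the embeddings would be learned:

```python
import numpy as np

max_seq_len, dim_len, vocab_size = 64, 32, 128  # sizes chosen arbitrarily

# positional lookup, computed as above
pos_enc_map = np.zeros((max_seq_len, dim_len))
for pos in range(max_seq_len):
    a = np.arange(dim_len) + 1.
    pos_enc_map[pos] = np.cos(pos / a) if pos % 2 == 0 else np.sin(pos / a)

# character embeddings, randomly initialized here purely for illustration
rng = np.random.RandomState(0)
char_embed = rng.randn(vocab_size, dim_len)

title = "breaking bad"                               # example input
char_id_list = [ord(c) % vocab_size for c in title]  # hypothetical char-to-id mapping

# sum the position-weighted character embeddings into one fixed-size vector
embed = np.sum(pos_enc_map[np.arange(len(char_id_list))] * char_embed[char_id_list], axis=0)
print(embed.shape)  # (32,)
```

Note that the result is a single vector of dimension `dim_len`, regardless of the input length.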
The lookup uses different weights for different positions, but also treats each dimension differently, which adds diversity to the encoding. Moreover, since the encoding does not depend on any previous state, a token that occurs in multiple strings contributes the same values each time, which helps preserve the semantic similarity of those strings.
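To illustrate that claim, one can compare the encoding of a title with that of a slightly misspelled variant. The sketch below rebuilds the lookup from above so it is self-contained; the embeddings are again randomly initialized and the titles are just examples:

```python
import numpy as np

max_seq_len, dim_len, vocab_size = 64, 32, 128  # sizes chosen arbitrarily

# positional lookup, rebuilt as above
pos_enc_map = np.zeros((max_seq_len, dim_len))
for pos in range(max_seq_len):
    a = np.arange(dim_len) + 1.
    pos_enc_map[pos] = np.cos(pos / a) if pos % 2 == 0 else np.sin(pos / a)

rng = np.random.RandomState(0)
char_embed = rng.randn(vocab_size, dim_len)  # random embeddings, illustration only

def encode(text):
    ids = [ord(c) % vocab_size for c in text]  # hypothetical char-to-id mapping
    return np.sum(pos_enc_map[np.arange(len(ids))] * char_embed[ids], axis=0)

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# a one-character typo keeps most position/character terms identical ...
sim_close = cos_sim(encode("breaking bad"), encode("breaking bat"))
# ... while an unrelated title shares almost none of them
sim_far = cos_sim(encode("breaking bad"), encode("the simpsons"))
print(sim_close, sim_far)
```

Since the near-duplicate shares almost all of its position-weighted terms with the original, its encoding stays much closer than that of an unrelated title, which is exactly the robustness we need for varying title formats.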
Bottom line, a positional encoding allows us to take the order of a sequence into consideration without using a recurrent network, which is very beneficial in many cases. Furthermore, the character-level encoding allows us to classify text sequences we have never seen before, which is very important due to the minor variations of titles.