Character-Level Positional Encoding

For documents, word-level embeddings work pretty well, because the vocabulary does not contain too many special cases. However, for tweets or (short) texts with a rather dynamic structure, words might not be appropriate, because it is not possible to generalize to unknown words, including words that are pretty similar to existing ones but not quite the same. The issue can be tackled with recurrent nets that work on characters, but then the processing cannot be parallelized easily. The recent paper “Attention Is All You Need” describes an approach that uses a positional encoding to encode sequences of words without the need for a recurrent net. Thus, the process can be parallelized more easily and therefore performs better. However, we still have the problem that we cannot generalize to unknown words, except with a ‘UNK’ token, which is not very useful in our case.

To be more precise, we try to model titles of TV shows, series, etc. in order to predict the category, and there it is imperative that we can generalize to slight variations of those titles, which are likely to occur due to spelling errors and differing formats across TV channels. Adapting the procedure to use characters instead of words is straightforward. We just need to build a lookup map whose values are calculated from the position and the index of the dimension, up to the maximal sequence length present in the training data. Then, we can easily calculate the encoding of arbitrary inputs with this formula:

 char_id_list  # sentence as a sequence of character ids
 # pos_enc_map: (max_seq_len, dim_len) lookup, char_embed: (vocab_size, dim_len) embedding matrix
 embed = np.sum(pos_enc_map[np.arange(len(char_id_list))] * char_embed[char_id_list], axis=0)

and one possible lookup could be calculated by:

import numpy as np

# max_seq_len is the maximal sequence length, dim_len the dimensionality of the encoding
result = np.zeros((max_seq_len, dim_len))
for pos in range(max_seq_len):
    a = np.arange(dim_len) + 1.  # per-dimension scale factor
    # alternate between cosine and sine depending on the position
    if pos % 2 == 0:
        t = np.cos(pos / a)
    else:
        t = np.sin(pos / a)
    result[pos] = t

The lookup uses different weights for different positions, but also treats each dimension differently, which brings more diversity into the encoding. Moreover, since the encoding depends only on the position and not on previous states, the same token at the same position always gets the same encoding, which helps to preserve the similarity of tokens that are present in multiple strings.
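
To put everything together, here is a minimal sketch of the whole pipeline. The toy vocabulary, the char_to_id mapping and the randomly initialized char_embed matrix are only placeholders for illustration and not part of the actual model:

import numpy as np

max_seq_len, dim_len = 100, 16

# positional lookup, built as shown above
pos_enc_map = np.zeros((max_seq_len, dim_len))
for pos in range(max_seq_len):
    a = np.arange(dim_len) + 1.
    pos_enc_map[pos] = np.cos(pos / a) if pos % 2 == 0 else np.sin(pos / a)

# placeholder character vocabulary and randomly initialized character embeddings
vocab = sorted(set("abcdefghijklmnopqrstuvwxyz 0123456789"))
char_to_id = {c: i for i, c in enumerate(vocab)}
char_embed = np.random.randn(len(vocab), dim_len) * 0.1

def encode(title):
    # map the title to a sequence of character ids
    char_id_list = [char_to_id[c] for c in title.lower() if c in char_to_id]
    # combine the positional lookup with the character embeddings and sum over the sequence
    return np.sum(pos_enc_map[np.arange(len(char_id_list))] * char_embed[char_id_list], axis=0)

print(encode("some tv show title"))   # fixed-size vector of length dim_len
print(encode("some tv showw title"))  # a slight spelling variation is still encoded, no 'UNK' needed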

Bottom line, a positional encoding allows us to take the order of a sequence into consideration without using a recurrent net, which is very beneficial in many cases. Furthermore, the character-level encoding allows us to classify text sequences we have never seen before, which is very important given the minor variations of titles.
