Fighting One-Time Words With Synsets

A problem present in all keyword-based feature systems is that many words are used exactly once. In our data set, the portion of such one-time words is about 30%. A naive approach would be to map each of those words to the ‘closest’ more frequent word in the data set. However, with limited data, it is doubtful that this works. A more sophisticated approach would be to determine synonyms for all keywords and then merge each one-time word into its matching synset (the set of synonyms for a specific word), if one exists.

For example, the keyword ‘bookworm’ could be mapped to a synset consisting of {booklover, scholar, savant, intellectual}. Another example is ‘cathedral’, which could be mapped to {house-of-god, temple, minster}. With this approach, we could either replace a one-time word with a more frequent word from the same synset that occurs in the data set, or use the synsets directly as features.
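The second option, using synsets directly as features, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SYNSETS` table is hand-written from the two examples above (a real system would take it from a thesaurus), and the movie titles, keywords, and the `synset:` feature prefix are made up for the example.

```python
from collections import Counter

# Hypothetical synsets, written by hand from the examples in the text.
SYNSETS = {
    "bookworm": {"bookworm", "booklover", "scholar", "savant", "intellectual"},
    "cathedral": {"cathedral", "house-of-god", "temple", "minster"},
}


def build_word_index(synsets):
    """Map every member word to the name of the synset it belongs to."""
    index = {}
    for name, members in synsets.items():
        for word in members:
            index.setdefault(word, name)
    return index


def merge_one_time_words(movies, synsets):
    """Replace keywords that occur in exactly one movie by a synset feature.

    Keywords without a matching synset, and keywords used more than once,
    are kept unchanged.
    """
    counts = Counter(w for kws in movies.values() for w in set(kws))
    index = build_word_index(synsets)
    merged = {}
    for title, kws in movies.items():
        features = set()
        for word in kws:
            if counts[word] == 1 and word in index:
                features.add("synset:" + index[word])
            else:
                features.add(word)
        merged[title] = features
    return merged


# Made-up toy data: 'temple', 'cathedral' and 'bookworm' are one-time words.
movies = {
    "movie_a": ["temple", "monk", "war"],
    "movie_b": ["cathedral", "war"],
    "movie_c": ["bookworm", "library"],
}
merged = merge_one_time_words(movies, SYNSETS)
```

In this toy data, ‘monk’ and ‘library’ are also one-time words, but since no synset matches them they stay as they are. That is the "subset" limitation of the approach: only one-time words with a known synset can be rescued.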

In short, with synsets we are able to utilize a subset of the one-time keywords as features to learn better models. We expect this to improve our models considerably, because rare keywords are usually very valuable, as we demonstrate with a closing example. Consider two movies that are both heavy on a “religious” theme but contain only one-time keywords. With the synset for ‘cathedral’, we can relate all words in the set and infer that a specific theme is present. For instance, ‘temple’ and ‘cathedral’ are both mapped to the same high-level concept. Without the additional data, the keywords could neither be used as features nor be related to each other.
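To make the closing example concrete, here is a small sketch that compares two such movies before and after the mapping. The Jaccard overlap is our choice for the illustration, not something the approach depends on, and the keyword lists and the `SYNSET_OF` lookup table are invented for the example.

```python
# Hypothetical lookup: each word points to the name of its synset.
SYNSET_OF = {
    "cathedral": "cathedral",
    "temple": "cathedral",
    "minster": "cathedral",
    "house-of-god": "cathedral",
}


def jaccard(a, b):
    """Share of common items among all items of two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def to_synset_features(keywords, synset_of):
    """Replace each keyword by its synset name, if it has one."""
    return {synset_of.get(word, word) for word in keywords}


# Two made-up movies whose keywords are all one-time words.
movie_x = ["temple", "pilgrimage"]
movie_y = ["cathedral", "choir"]

raw_overlap = jaccard(movie_x, movie_y)
mapped_overlap = jaccard(
    to_synset_features(movie_x, SYNSET_OF),
    to_synset_features(movie_y, SYNSET_OF),
)
```

On the raw keywords the two movies share nothing, so the overlap is zero. After the mapping they share the ‘cathedral’ concept, which is one of three distinct features, so the overlap rises to one third.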
