It is no secret that data augmentation requires some creativity outside the domains of images and, to some degree, audio. For instance, if we have a picture of a cat, we can alter some pixels without changing the perceived content much. If we remove, replace, or insert words in a document, however, strange things are likely to happen, and the impact is more severe for input spaces with few distinct values. For example, if we describe a movie with only a handful of descriptive keywords, omitting some of them can have harsh consequences. In a nutshell, most straightforward approaches do not work for sparse text data.
However, we can shift the problem to the label space, as we demonstrate with an example. We argued that common genres like “action” or “drama” are too diverse to be useful for most classification tasks. Instead, we focus on minor genres like sci-fi, horror, fantasy, and so on. The drawback is that we now have far fewer data samples than before, since only a small fraction of movies is annotated with such a sub-genre. But where sub-genres are available, we can often cheat by “reversing” the label back into the main genre space.
Here is an example: an arbitrary movie has “action” as its main genre, but “sci-fi-action” as its sub-genre. From this we can infer that at least a minor science-fiction theme is present, which justifies adding “science-fiction” to the main genres. With this reverse mapping from sub-genres to main genres, we augmented our training set with about 6,000 new samples.
Of course, the mapping from sub-genres to main genres is not always as straightforward as in the example and sometimes requires human intervention, for instance alien-film -> science-fiction or sword-sorcery -> fantasy, but the invested time is definitely not wasted.
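The reversal trick above can be sketched in a few lines of Python. This is a minimal sketch, not the actual pipeline: the record layout (`main_genres`, `sub_genres` lists per movie) and the helper `augment_labels` are assumptions, and the hand-curated mapping only contains the examples mentioned in the text.

```python
# Hand-curated mapping from sub-genres to main genres. Entries like
# "alien-film" -> "science-fiction" require human judgment.
SUB_TO_MAIN = {
    "sci-fi-action": "science-fiction",
    "alien-film": "science-fiction",
    "sword-sorcery": "fantasy",
}

def augment_labels(movies):
    """Add main genres inferred from sub-genres to each movie's label list.

    Returns the number of labels added (hypothetical helper, not from the
    original write-up).
    """
    added = 0
    for movie in movies:
        for sub in movie["sub_genres"]:
            main = SUB_TO_MAIN.get(sub)
            if main and main not in movie["main_genres"]:
                movie["main_genres"].append(main)
                added += 1
    return added

# Toy record mirroring the example from the text.
movies = [
    {"title": "Some Action Movie",
     "main_genres": ["action"],
     "sub_genres": ["sci-fi-action"]},
]
augment_labels(movies)
print(movies[0]["main_genres"])  # ['action', 'science-fiction']
```

Running the helper a second time adds nothing, since the membership check keeps the augmentation idempotent.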