Recently, our feature database got corrupted. We are not sure how this happened, but since we had backups, the damage was minimal. The upside was that we could use the fresh start to check the data for consistency.
In the past we wrote about the value of metadata and noted that metadata is very likely neither complete nor free of errors. Whenever we found an obvious problem, we corrected it, but this time we wanted to spend some more time cleaning the data. The aim was a higher level of unification: in other words, better near-duplicate detection and at least a simple “stemming” method. Ordinary stemming does not work here, since keywords might be concatenated by some separator, like ‘post-apocalypse’, and then we would only stem the last part of the keyword. Since character n-grams can help us solve some of these problems, we used this encoding to clean the data, in combination with a heuristic strategy to split keywords.
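To make the idea concrete, here is a minimal sketch of the two building blocks described above: splitting concatenated keywords at a separator before comparing them, and using character n-gram overlap (Jaccard similarity here, as one plausible choice; the post does not name a specific measure) to flag near-duplicates. The threshold and helper names are illustrative assumptions, not the actual values we use.

```python
from itertools import combinations

def ngrams(word, n=3):
    """Character n-grams of a keyword, padded with '#' markers at both ends."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(max(1, len(padded) - n + 1))}

def similarity(a, b, n=3):
    """Jaccard similarity of the two keywords' n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def split_keyword(keyword, separators=("-", "_")):
    """Heuristic: split concatenated keywords like 'post-apocalypse' first,
    so stemming or matching is not limited to the last part."""
    for sep in separators:
        if sep in keyword:
            return keyword.split(sep)
    return [keyword]

keywords = ["zombie", "zombies", "werewolf", "werewolves", "post-apocalypse"]
parts = sorted({p for kw in keywords for p in split_keyword(kw)})
# Keep pairs whose n-gram overlap is high enough (0.5 is an arbitrary cutoff).
near_dupes = [(a, b) for a, b in combinations(parts, 2) if similarity(a, b) >= 0.5]
print(near_dupes)
```

With this toy cutoff, the singular/plural pairs surface while unrelated keywords stay apart, which matches the kind of unification described above.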
The procedure helped us to systematically get rid of singular and plural forms, like ‘mummy’ and ‘mummies’ or ‘werewolf’ and ‘werewolves’. It also helped us to find spelling mistakes and specializations of words, like ‘concorde-aircraft’.
However, the extra value of the procedure can also be seen when we train a model on the cleaned data. Without any pre-processing, a model would distinguish between ‘zombie’ and ‘zombies’, with the consequence that if not both features are among the top-k features, the expressive power of the model would be limited. The same would happen if a keyword is a specialization of some other keyword but occurs with low frequency: the feature wouldn’t be included, and semantic relations from other items to this item based on the specialized keyword would not be possible.
Because we stored the output of the cleaning as simple rules, like A -> B, which means that value A is substituted with value B, it is much easier for us to keep the data consistent. Of course, some manual work is still required, but most of it can be done automatically.
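A rule table like this is straightforward to apply. The sketch below shows one possible way to do it, assuming rules are kept as a plain dictionary; the rule entries here are made-up examples in the spirit of the pairs mentioned earlier, not our actual rule set.

```python
# Hypothetical rule table: each entry A -> B means value A is replaced by B.
rules = {
    "zombies": "zombie",
    "mummies": "mummy",
    "werewolves": "werewolf",
}

def normalize(keywords, rules):
    """Apply the substitution rules and drop duplicates that emerge,
    preserving the original keyword order."""
    seen, result = set(), []
    for kw in keywords:
        canonical = rules.get(kw, kw)  # keywords without a rule pass through
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result

print(normalize(["zombies", "mummy", "mummies", "vampire"], rules))
# -> ['zombie', 'mummy', 'vampire']
```

Re-running such a pass after every import is what keeps the data consistent with little manual effort.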
How much the procedure will actually improve our models remains to be seen, but the fact that more items now contain a subset of the most frequent keywords is definitely positive, since it increases the number of items we can use for training.