While modernizing our features, we found a very nasty bug in the processing of the raw features. So far, this is not surprising, since we all know that bugs love to live in software. Strictly speaking, the problem is not in the code itself but in a regular expression. REs are wonderful if you have to work with data that has regularities you want to extract, but they can torment you once they get very complex.
This is a perfect example of something that works like a charm one day but not at all the next. Whenever we have to rely on somebody else and there is no standard format for the data, we are doomed. Why? Because from one minute to the next, the format might change. Maybe only a little, and the RE still matches, but often the result is not what we expect. Even worse, maybe something is returned, but only fragments of it, and thus no trivial error checking can be done.
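To make the failure mode concrete, here is a minimal sketch in Python. It is not our actual feature code: the `Runtime` field, the sample strings, and both function names are made up for illustration. It shows how a pattern written for one format can keep matching after a small format change and quietly hand back a fragment.

```python
import re

# Hypothetical pattern, written for an assumed original format "Runtime: 117 min".
RUNTIME_RE = re.compile(r"Runtime:\s*(\d+)")

# A stricter variant that also requires the unit directly after the number.
RUNTIME_STRICT_RE = re.compile(r"Runtime:\s*(\d+)\s*min\b")


def extract_runtime(raw):
    """Return the first number after 'Runtime:', or None if nothing matches."""
    match = RUNTIME_RE.search(raw)
    return int(match.group(1)) if match else None


def extract_runtime_strict(raw):
    """Return the runtime in minutes, or raise ValueError on an unexpected format."""
    match = RUNTIME_STRICT_RE.search(raw)
    if match is None:
        raise ValueError(f"unexpected runtime format: {raw!r}")
    return int(match.group(1))


# Made-up sample records (assumptions, not real data).
old_format = "Title: Example | Runtime: 117 min"
changed_format = "Title: Example | Runtime: 1 h 57 min"  # format changed a little
missing_format = "Title: Example | Length: 117 min"      # field was renamed

print(extract_runtime(old_format))      # 117
print(extract_runtime(changed_format))  # 1, a fragment that looks like valid data
print(extract_runtime(missing_format))  # None, the easy case to detect
```

The renamed field is harmless because `None` is trivial to check for. The dangerous case is the middle one: the loose pattern still matches and returns `1`, which looks like a perfectly valid value. The stricter variant raises `ValueError` for the changed format instead, so the problem at least becomes visible, although it is of course just as brittle when the format changes again.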
On the bright side, the problem explains some oddities in our trained models, and the bug fix hopefully improves the precision of future models. On the dark side, it shows that a single problem in the feature processing can have a huge impact.
That is why we asked ourselves: wouldn’t it be wonderful if the Web consisted of JSON objects and not free text? Sure, this would not fix all problems on Earth, but at least we could process data more easily.