Over the last few weeks, we tried lots of new stuff. It is like a puzzle with dozens of pieces. With each new piece, we sometimes get a better look at the big picture, but other times we have to go back a step. Yes, it would be much easier to focus on a single problem, for instance, to create a factor model based on existing ratings, but that is not how it works in our case. One problem leads to another, and even the little problems are interwoven with other little problems.
For instance, with the data from the electronic program guide, EPG for short, we have channels that provide very good details about the movies they will show, but for other channels, there is no guarantee that there is even a summary. In other words, heterogeneity is the major problem. In theory, the EPG data can even contain categorizations, like genres, but in practice, they are rarely used. Even matching a movie to some metadata is hard work, because a year is not always present. Furthermore, title mapping can be error-prone because of alternative titles for foreign movies.
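To make the matching problem a bit more concrete, here is a minimal sketch of how a title lookup with alternative titles and an optional year could look. This is not our actual pipeline; the names `normalize_title`, `match_movie` and the catalog layout (`title`, `alt_titles`, `year`) are assumptions made up for this illustration.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"[^a-z0-9]+", " ", without_accents.lower()).strip()


def match_movie(epg_title, epg_year, catalog):
    """Return the single catalog entry matching the EPG title, or None.

    catalog is a list of dicts with the hypothetical keys
    'title', 'alt_titles' (optional) and 'year'.
    """
    key = normalize_title(epg_title)
    candidates = []
    for movie in catalog:
        known_titles = [movie["title"], *movie.get("alt_titles", [])]
        if key in {normalize_title(t) for t in known_titles}:
            candidates.append(movie)

    # The year is often missing in the EPG data; use it only if present.
    if epg_year is not None:
        candidates = [m for m in candidates if m.get("year") == epg_year]

    # An ambiguous title without a year cannot be resolved automatically.
    return candidates[0] if len(candidates) == 1 else None


# Illustrative catalog entries, not taken from our database.
catalog = [
    {"title": "Léon", "alt_titles": ["Leon - Der Profi"], "year": 1994},
    {"title": "Solaris", "year": 1972},
    {"title": "Solaris", "year": 2002},
]
```

With this sketch, a foreign title like "LEON - DER PROFI" resolves through the alternative title, while "Solaris" without a year stays unresolved because two entries share the title. That second case is exactly where the missing year hurts.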
Long story short, the EPG data alone is not sufficient to set up even a lightweight information retrieval system. The problem is that some movies have a full description and even a list of actors, while others do not even provide a year or the country they were made in. Thus, our system would only index movies with existing metadata, while skeleton movies would be completely ignored. That is why we spent so much time unifying all movies with a common set of metadata.
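As a rough sketch of what "a common set of metadata" could mean in code, assume every movie is represented by the same record type and missing fields are filled from a matched reference entry. The class `MovieRecord`, its fields, and the helpers `is_skeleton` and `unify` are hypothetical names for this illustration, not our real schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MovieRecord:
    # Only the title is guaranteed to be present in the EPG data.
    title: str
    year: Optional[int] = None
    country: Optional[str] = None
    summary: str = ""
    actors: List[str] = field(default_factory=list)


def is_skeleton(record):
    """A skeleton movie carries a title but nothing useful to index."""
    return record.year is None and not record.summary and not record.actors


def unify(epg, reference):
    """Keep what the EPG provides and fill the gaps from the reference."""
    return MovieRecord(
        title=epg.title,
        year=epg.year if epg.year is not None else reference.year,
        country=epg.country or reference.country,
        summary=epg.summary or reference.summary,
        actors=epg.actors or list(reference.actors),
    )


# Made-up records for illustration only.
epg_entry = MovieRecord(title="Example Movie")
reference_entry = MovieRecord(
    title="Example Movie",
    year=1999,
    country="DE",
    summary="A placeholder summary.",
    actors=["Actor One", "Actor Two"],
)
merged = unify(epg_entry, reference_entry)
```

The point of the design is that an indexer only ever sees `MovieRecord` objects with the same fields, so a skeleton entry from the EPG becomes searchable as soon as a reference entry has been matched to it.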
With the experience we gathered over the years, the whole process works pretty well, but with just a title from the EPG data, we are still doomed to fail. Nevertheless, as dark as this all sounds, with some manual work, our database of movie/series mappings covers a lot of what is shown on free television. But since the EPG data is the only steady source, we have some reservations about relying on a single site for gathering metadata. The reason is simple: if the site is gone, we cannot index new items, and all the work so far is pretty much useless.
But one thing is for sure: we will clutch at this straw!