
Adaptable Data Processing And Presentation

When we began to rewrite our TV recommender application from scratch, a few patterns kept recurring that are not new, but extremely useful. The basic pipeline looks like this: we have chunked id/key/value data that is converted into a coherent CSV file, which is then typified, evaluated and imported into some back-end.

The chunked key-value data is nothing special and we only need to keep track of the current ID to combine all pairs into one item. However, streaming makes our life a lot easier, because then we can treat the encoding as a pipeline with adaptable stages. In the Python world, we use iterators, which can easily be chained to model a pipeline. To complete the process, we need two operations: (1) filter out items that match a certain criterion and (2) map items by modifying existing values or adding new ones.
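
To make this concrete, here is a minimal, self-contained sketch of both operations on a stream of dict items, written with plain generator expressions; the toy data and the two helper functions are placeholders for the example, not part of the actual code base.

raw_items = [{"id": 1, "aired": False}, {"id": 2, "aired": True}]  # toy data

def is_expired(item):
    return item["aired"]              # placeholder criterion

def add_score(item):
    item["score"] = 0.5               # placeholder mapping
    return item

items = iter(raw_items)
items = (i for i in items if not is_expired(i))   # (1) filter: drop unwanted items
items = (add_score(i) for i in items)             # (2) map: adjust or extend items
print(list(items))                                # [{'id': 1, 'aired': False, 'score': 0.5}]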

In this setting, ML procedures can often be seen as an additional mapping step. For instance, we could use such a step to build the data that a particular model requires, but we could also apply existing models directly to the raw data: to estimate the preference for movies, to determine an “interaction” score for series, or to annotate items with tags/concepts. And last but not least, it is also possible to build data for ML step-wise, or to model dependencies from one stage to the next.
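
As an illustration, such a scoring stage could be just another callable that the pipeline applies to every item. The class below is only a stand-in for a stage like PrefModelMap; the item keys and the model interface are assumptions made for the sketch, not the real implementation.

class PrefScoreMap:
    """Sketch of a map stage that annotates movie items with a preference score."""
    def __init__(self, model):
        self.model = model                       # assumed: any object with a predict() method

    def __call__(self, item):
        # only score movies that come with usable meta data
        if item.get("type") == "movie" and "meta" in item:
            item["pref_score"] = self.model.predict(item["meta"])
        return item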

With this in mind, a simple pipeline could look like this:

import sys

csv = ChunkedReader(sys.stdin)
pipe = Pipeline(iter(csv))
pipe.add(TypifyMap())     # convert strings to proper data types
pipe.add(ExpireFilter())  # filter out expired items
pipe.add(DedupFilter())   # ignore duplicates
pipe.add(PrefModelMap())  # score movies
pipe.add(IndexMap())      # tokenize relevant data

The chunked data is read from stdin and, as soon as a complete item is available, it is passed to the next stage: in this case, a module that converts the strings into more useful data types, like datetime() for timestamps, integers for IDs and lists for keys that have multiple values. Next, we want to skip items that have already aired, or whose primary ID has already been seen in the stream of items. Then we use an ML model to access the metadata of movie items, if available, and assign a preference score to them. Finally, all values of the relevant keys of the item are tokenized, unified and indexed to make the content searchable.
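
For completeness, the reader in front of the pipeline could be sketched like this; it assumes one id/key/value triple per tab-separated line, which is a guess at the actual chunk format, and it keeps all values in lists.

class ChunkedReader:
    """Sketch: group consecutive id<TAB>key<TAB>value lines into one dict per ID."""
    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        current_id, item = None, {}
        for line in self.stream:
            item_id, key, value = line.rstrip("\n").split("\t")
            if item_id != current_id:              # a new ID closes the previous item
                if current_id is not None:
                    yield item
                current_id, item = item_id, {"id": item_id}
            item.setdefault(key, []).append(value)
        if current_id is not None:                 # emit the last item
            yield item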

What is important is that the sequence of stages might be reorganized internally, since filters need to be executed first to decide whether any mapping should be done at all. In other words, if any filter rejects an item, the pipeline skips it altogether and requests a new item from the stream.
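
A Pipeline that applies this filter-first policy could be sketched as follows; the convention that a filter returns True to reject an item, and that stages are recognized by their class name, are assumptions of the sketch rather than the actual contract.

class Pipeline:
    """Sketch: run all filters first; only surviving items reach the map stages."""
    def __init__(self, source):
        self.source = source
        self.filters, self.maps = [], []

    def add(self, stage):
        # assumption: a stage counts as a filter if its class name ends with "Filter"
        if stage.__class__.__name__.endswith("Filter"):
            self.filters.append(stage)
        else:
            self.maps.append(stage)

    def __iter__(self):
        for item in self.source:
            if any(f(item) for f in self.filters):
                continue                  # rejected: skip mapping, fetch the next item
            for m in self.maps:
                item = m(item)
            yield item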

So, the basic ingredients are pretty simple: we have a stream of items (the iterator), filters to skip items, and maps to adjust items. At the end of the pipeline a transformed item emerges that can be stored in some back-end. The beauty is that we can add arbitrary stages with new functionality without modifying existing modules, as long as the design-by-contract is not violated.

Later, when we use dialogs to present aggregated data to the user, we can apply the same principle. For instance, a user might request all western movies with a positive score, which is nothing more than two filters: one to reject non-western movies and one to reject items with a negative score. Another example is when a user clicks on a tag to get a list of all items that share this tag, and there are dozens of other examples like that.
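
With the same building blocks, such a view is just another short pipeline. GenreFilter, ScoreFilter and the back-end iterator below are hypothetical names used only for the example.

pipe = Pipeline(backend_items)            # backend_items: an assumed iterator over stored items
pipe.add(GenreFilter(keep="western"))     # reject everything that is not a western
pipe.add(ScoreFilter(min_score=0.0))      # reject items with a negative score
western_hits = list(pipe)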

The pattern can be described like this: we need a basic iterator to access the data, probably with a strong filter to reduce the volume, and then we adapt the data with combinations of filters and maps to access only the relevant parts of it. This approach has the advantage that we can implement basic building blocks that are reused all over the place, and new functionality can be built by combining elementary blocks.

However, despite the flexibility we gain with such an approach, it is very challenging to integrate it into a GUI, since many views require customized widgets to preserve this flexibility. Furthermore, recommenders should also “learn” the optimal layout with respect to the preferences of users, which introduces dynamics that further increase the complexity of the graphical interface.

Bottom line, the success of a recommender system largely depends on the implementation of the graphical user interface and requires an application that is both intuitive and flexible to avoid frustrating users. Hopefully we can find some time to elaborate on this in a new post.
