
Adaptable Data Processing And Presentation

When we began to rewrite our TV recommender application from scratch, a few patterns kept recurring that are not new, but extremely useful. The basic pipeline looks like this: we have chunked id/key/value data that is converted into a coherent CSV file, which is then typified, evaluated and imported into some back-end.

The chunked key-value data is nothing special and we only need to keep track of the current ID to combine all pairs into one item. However, streaming makes our life a lot easier, because then we can treat the encoding as a pipeline with adaptable stages. In the Pythonic world, we use iterators, which can be easily chained to model a pipeline. To complete the process, we need two operations: (1) filter out items that do not match certain criteria and (2) map items by modifying existing values or adding new ones.
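To get an idea, here is a toy example -not from our actual code base- that uses nothing but plain iterators and the built-in filter/map functions:

items = iter([{"id": 1, "score": 0.8}, {"id": 2, "score": -0.3}])

# filter: keep only items that satisfy a predicate
kept = filter(lambda item: item["score"] > 0, items)

# map: enrich the surviving items with a new key
enriched = map(lambda item: {**item, "liked": True}, kept)

print(list(enriched)) # [{'id': 1, 'score': 0.8, 'liked': True}]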

In this setting, ML procedures can often be seen as an additional mapping step. For instance, we could use such a step to build the data that is required for a particular model, but we could also apply the raw data directly to existing models: to estimate the preference for movies, to determine an “interaction” score for series or to annotate items with tags/concepts. And last but not least, it is also possible to build data for ML step-wise, or to model dependencies from one stage to the next.

With this in mind, a simple pipeline could look like this:

import sys

csv = ChunkedReader(sys.stdin)
pipe = Pipeline(iter(csv))
pipe.add(TypifyMap())     # convert strings to proper data types
pipe.add(ExpireFilter())  # filter out expired items
pipe.add(DedupFilter())   # ignore duplicates
pipe.add(PrefModelMap())  # score movies
pipe.add(IndexMap())      # tokenize relevant data

The chunked data is read from stdin and as soon as an item is complete, it is passed to the next stage; in this case, a module that converts the strings into more useful data types, like datetime() for timestamps, integers for IDs and lists for keys that have multiple values. Next, we want to skip items that have already been aired, or whose primary ID has already been seen in the stream of items. Then, we use an ML model to access the meta data of movie items -if available- and assign a preference score to them. Finally, all values of relevant keys of the item are tokenized, unified and indexed to make the content searchable.
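To make the contract a bit more concrete, two of the stages from the listing could be sketched like below; the accept()/apply() convention is only an assumption for illustration and not necessarily the interface of the real modules:

from datetime import datetime

class TypifyMap:
    """Map stage: convert raw strings into proper data types."""
    def apply(self, item):
        item["id"] = int(item["id"])
        item["aired"] = datetime.fromisoformat(item["aired"])
        return item

class ExpireFilter:
    """Filter stage: reject items that have already been aired."""
    def __init__(self, now=None):
        self.now = now or datetime.now()

    def accept(self, item):
        return item["aired"] >= self.now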

What is important is that the sequence of the stages might be internally reorganized, since filters need to be executed first to decide whether any mapping should be done at all. In other words, if any filter rejects an item, the pipeline skips it altogether and requests a new item from the stream.
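A minimal pipeline core that enforces this ordering could look roughly like the following sketch, again based on the assumed accept()/apply() convention:

class Pipeline:
    """Applies all filter stages first; only surviving items are mapped."""
    def __init__(self, items):
        self.items = items
        self.filters, self.maps = [], []

    def add(self, stage):
        # stages with an accept() method act as filters, the rest as maps
        if hasattr(stage, "accept"):
            self.filters.append(stage)
        else:
            self.maps.append(stage)

    def __iter__(self):
        for item in self.items:
            # if any filter rejects the item, skip it and request the next one
            if not all(f.accept(item) for f in self.filters):
                continue
            # maps adjust the surviving item, stage by stage
            for m in self.maps:
                item = m.apply(item)
            yield item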

So, the basic ingredients are pretty simple: we have a stream of items -the iterator- and we have filters to skip items and maps to adjust items. At the end of the pipeline, a transformed item emerges that can be stored in some back-end. The beauty is that we can add arbitrary stages with new functionality without modifying existing modules, as long as the design-by-contract is not violated.

Later, when we use dialogs to present aggregated data to the user, we can apply the same principle. For instance, a user might request all western movies with a positive score, which is nothing more than two filters: one to reject non-western movies and one to reject items with a non-positive score. Another example is when a user clicks on a tag to get a list of all items that share this tag, and there are dozens of other examples like that.
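Expressed with the same building blocks, such a request boils down to two small filters; the names and keys are again only illustrative:

class GenreFilter:
    """Keep only items of a given genre."""
    def __init__(self, genre):
        self.genre = genre

    def accept(self, item):
        return self.genre in item.get("genres", [])

class ScoreFilter:
    """Keep only items whose score exceeds a threshold."""
    def __init__(self, min_score=0.0):
        self.min_score = min_score

    def accept(self, item):
        return item.get("score", 0.0) > self.min_score

# "all western movies with a positive score"
# pipe.add(GenreFilter("western"))
# pipe.add(ScoreFilter(0.0))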

The pattern can be described like this: we need a basic iterator to access the data, possibly with a strong filter to reduce the volume, and then we adapt the data with combinations of filters and maps to access only the relevant parts of it. This approach has the advantage that we can implement basic building blocks that are reused all over the place, and new functionality can be built by combining these elementary blocks.

However, despite the flexibility we gain with such an approach, it is very challenging to integrate it into a GUI, since many views require customized widgets to preserve this flexibility. Furthermore, recommenders should also “learn” the optimal layout with respect to the preferences of users, which introduces dynamics that further increase the complexity of the graphical interface.

Bottom line, the success of a recommender system mostly depends on the implementation of the graphical user interface and requires an application that is both intuitive and flexible to avoid frustrating users. Hopefully we can find some time to elaborate on this in a future post.

Batch Normalization: The Untold Story

With all the success of BN, it is amazing and disappointing at the same time that there are so many fantastic results but so little practical advice on how to actually implement the whole pipeline. No doubt, BN can be implemented pretty easily in the training part of the network, but that is not the whole story. Furthermore, there are at least two ways to use BN during training. First, keep a running average of the mean/std values per layer, which can later be used for unseen data. Second, calculate the mean/std values for each mini-batch and then run a separate step at the end of the training to fix the statistics for the data.

Method ONE is self-contained, since it does not involve a second step to finalize the statistics, but it introduces additional hyper-parameters, like the decay value and an epsilon to ensure numerical stability. Furthermore, for some architectures we encountered numerical problems which led to NaN values.

Method TWO is more straightforward because the mean/std are determined for each mini-batch and no parameter updates are involved. However, this also forces us to run an extra pass over the whole data to fix the statistics, per layer, for the mean and the standard deviation.
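To illustrate the difference between the two methods during training, here is a rough sketch in NumPy; the learnable scale/shift parameters are omitted, the formulation sticks to the mean/std notation used here, and the decay/epsilon values are only placeholders:

import numpy as np

eps, decay = 1e-5, 0.9  # placeholder values, not tuned

def bn_train_method_one(x, run_mean, run_std):
    # Method ONE: normalize with the batch statistics and keep a running average
    mu, sigma = x.mean(axis=0), x.std(axis=0)
    run_mean = decay * run_mean + (1 - decay) * mu
    run_std = decay * run_std + (1 - decay) * sigma
    return (x - mu) / (sigma + eps), run_mean, run_std

def bn_train_method_two(x):
    # Method TWO: use only the per-batch statistics; fix them in a later pass
    mu, sigma = x.mean(axis=0), x.std(axis=0)
    return (x - mu) / (sigma + eps)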

We are stressing all this because posts regarding the use of BN for unseen/test data have been pretty frequent lately. All in all, despite the fact that BN is not very complex, a guideline for beginners would still make sense to avoid common pitfalls and frustration. And since BN is becoming standard in the repertoire of training neural networks, researchers need a solid understanding of it. Like the detail that with BN a batch size of one is no longer possible, because the input data does not suffice to estimate the required statistics. The same might be true for smaller batch sizes like 4 or 8 when the original batch size was much larger.

With all this in mind, a possible pipeline could look like this (a code sketch of such a BN layer follows the outline):

= Training =
– set up a network with separate BN layers as described in the previous post, each followed by an activation layer
– sample random mini-batches and feed them to the network, which will use per-batch values for mean/std
– continue until the network has converged

= Postproc =
– disable all weight updates and switch all BN layers to accumulate mean/std values by using a moving average with a decay of, say, 0.9
mean = decay * mean + (1 - decay) * mean_batch # initialized with 0
std = decay * std + (1 - decay) * std_batch # initialized with 1
– feed the whole training data in mini-batches to the network to update the statistics; the output can be ignored
– freeze the mean/std values for each BN layer and store them along with the network parameters

= Test =
– load the network weights from disk and set all BN layers to test mode, which uses the fixed mean/std values from Postproc; no updates of the values are performed
– feed new data to the network and do something with the output
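To wrap it up, here is a minimal sketch -NumPy again, scale/shift omitted- of how a single BN layer could handle the three phases; the mode names and the interface are of course assumptions and not taken from any particular framework:

import numpy as np

class BatchNormLayer:
    """Sketch of the normalization part of a BN layer across the three phases."""
    def __init__(self, dim, eps=1e-5, decay=0.9):
        self.eps, self.decay = eps, decay
        self.mean, self.std = np.zeros(dim), np.ones(dim)
        self.mode = "train"  # "train" | "postproc" | "test"

    def forward(self, x):
        if self.mode == "train":
            # Training: normalize with the statistics of the current mini-batch
            mu, sigma = x.mean(axis=0), x.std(axis=0)
        elif self.mode == "postproc":
            # Postproc: accumulate the statistics with a moving average;
            # the output of this pass can be ignored
            mu, sigma = x.mean(axis=0), x.std(axis=0)
            self.mean = self.decay * self.mean + (1 - self.decay) * mu
            self.std = self.decay * self.std + (1 - self.decay) * sigma
        else:
            # Test: use the frozen mean/std values, no updates are performed
            mu, sigma = self.mean, self.std
        return (x - mu) / (sigma + self.eps)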

The annoying parts are the additional passes of the data through the network, but once this is done, the fprop for unseen data is business as usual. However, in contrast to non-BN networks, the processing overhead is noticeable, but it is usually worth it because of the benefits BN comes with.

Bottom line, BN is definitely a very powerful tool to tackle existing problems, but it should be noted that it does not come for free. First, BN introduces some overhead during training in terms of space, extra parameters, and time for the data normalization. Second, layers in the network need to be adjusted because BN is applied before the activation, and third, a post-processing step is often required.