
The Model Has Been Trained … Now What?

With the increasing popularity of data science, whatever that term actually means, a lot of software functionality is based on machine learning. And yes, we all know machine learning is a lot of fun, provided you get your model to learn something useful. In the academic community the major focus is to find something new, to enhance existing approaches, or to simply beat the existing state-of-the-art score. In industry, however, once a model is trained to solve an actual problem, it needs to be deployed and maintained somewhere.

Suddenly, we might no longer have access to huge GPU/CPU clusters, which means the final model might need to run on a device with very limited computational power, or on commodity hardware. Not to mention the versioning of models and the need to re-deploy the actual parameters at some point. Here we need to change our point of view from research/science to production/engineering.

In the case of Python, pickling the whole model is pretty easy, but it requires that a compatible version of the code is used for de-serialization[1]. For long-term storage, it is a much better idea to store the actual model parameters in a way that does not depend on the actual implementation. For instance, if you trained an Elman recurrent network, you have three parameters:
(1) the embedding matrix
(2) the “recurrent” matrix
(3) the bias
which are nothing more than plain (numpy) arrays that can even be stored in JSON as a list (of lists). To utilize the model, it is straightforward, in any language, to implement the forward propagation, or to use an existing implementation that just requires initializing the parameters. In major languages like Java or C++, for example, initializing arrays from JSON data is no big deal. Of course there are many other options, but JSON is a very convenient data transport format.
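To make this a bit more concrete, here is a minimal sketch of such a language-independent round trip in numpy. The file name, the shapes and the tanh non-linearity are just assumptions for illustration, not a fixed recipe:

import json
import numpy as np

# store the three Elman parameters as plain lists in JSON (shapes are made up)
params = {
    "embedding": np.random.randn(50, 10).tolist(),   # vocab_size x hidden
    "recurrent": np.random.randn(10, 10).tolist(),   # hidden x hidden
    "bias": np.zeros(10).tolist(),
}
with open("elman.json", "w") as f:
    json.dump(params, f)

# restore the parameters without any reference to the training code
with open("elman.json") as f:
    p = {k: np.array(v) for k, v in json.load(f).items()}

def forward(token_ids, p):
    # plain forward propagation over a sequence of token IDs (tanh assumed)
    h = np.zeros(p["recurrent"].shape[0])
    for t in token_ids:
        h = np.tanh(p["embedding"][t] + np.dot(h, p["recurrent"]) + p["bias"])
    return h

h_final = forward([3, 7, 1], p)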

And since the storage of the parameters is not coupled to any code, a model setup is possible in almost any environment with sufficient resources.

Sure, we are aware that restoring a 100-layer network from a JSON file can be burdensome, but it is nevertheless necessary to transfer model parameters in a unified, language-independent way. So far we have discussed some details of storage and deployment, but what about using the model in real applications?

We want to consider a broader context, not only images: for instance, the implementation of a search/retrieval system. In contrast to experiments, it is mandatory that we get a result from a model within reasonable time. In other words, nobody says that training is easy, but if a good model needs too much time to reach a decision, it is useless for real-world applications. For instance, if the output of the model is rather large, we need to think about ways to retrieve results efficiently. As an example: if the final output is 4096 dims (float32) and we have 100K ‘documents’, we need ~1.5 GB just to store the results, without a single bit of meta data.
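The arithmetic behind the ~1.5 GB figure is a quick back-of-the-envelope check:

n_docs, dims, bytes_per_float = 100000, 4096, 4     # float32
gib = n_docs * dims * bytes_per_float / float(2 ** 30)
print("%.2f GiB" % gib)                              # ~1.53 GiB, results only, no meta data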

But even for smaller models, like word2vec, we might have 50K words, each represented by 100 dims, which makes it a non-trivial task to match an entered sequence of words against, say, an existing list of movie titles and to rank the results in real time and multi-threaded.
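As a rough sketch of what such a matching step could look like, assuming a 50K x 100 word embedding and pre-computed, L2-normalized title vectors (all names and shapes here are placeholders):

import numpy as np

emb = np.random.randn(50000, 100).astype(np.float32)      # word vectors, 50K x 100
titles = np.random.randn(20000, 100).astype(np.float32)   # pre-computed title vectors
titles /= np.linalg.norm(titles, axis=1, keepdims=True)

def rank_titles(query_word_ids, topk=10):
    q = emb[query_word_ids].mean(axis=0)   # average the query word vectors
    q /= np.linalg.norm(q)
    scores = titles.dot(q)                 # cosine similarity against all titles
    return np.argsort(-scores)[:topk]

best = rank_titles([42, 1337])             # IDs of the entered query words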

We know we are being a bit unfair here, but we have the feeling that way too often most of the energy is put into beating some score, which means the information in papers is often not even sufficient to reproduce the results, and the models that actually provide new insights are sometimes not usable because they require lots of computational power.

Bottom line, on the one hand, without good models we have nothing to deploy, but on the other hand, machine learning is so much more than just training a good model and dumping it some place for other folks to run. As machine learning engineers, we have the responsibility to be part of the whole development process and not just the cool part where we get to play with new neural network architectures while others do the “boring” part.

[1] deeplearning.net/software/theano/tutorial/loading_and_saving.html


Converting Theano to Numpy

It is an open secret that we like Theano. It is flexible, powerful and, once you have mastered some hurdles, it allows you to easily test a variety of loss functions and network architectures. However, once the model is trained, Theano can be a bit of a burden when it comes to the fprop-only part. In other words, if we just want to get predictions or feature representations, the setup and compilation overhead might be too much. The alternative is to convert the flow graph into numpy, which has the advantage that there are fewer dependencies and less overhead for the actual predictions with the model. Frankly, what we describe is neither rocket science nor new, but it is also not common practice, so we decided to summarize the method in this post.

To convert the graph notation to numpy, we make use of the __call__ interface of python classes. The idea is to call an instance of a class as a function with a parameter:

import numpy as np

class Layer(object):
    pass  # minimal base class, only used to group the layer types

class Input(Layer):
    def __init__(self):
        self.prev = None  # no previous layer

    def __call__(self, value):
        return value  # identity: just hand the input through

class Projection(Layer):
    def __init__(self, prev, W, bias):
        self.W, self.bias = W, bias
        self.prev = prev  # previous layer

    def __call__(self, value):
        val = self.prev(value)  # ask the previous layer for its output
        return np.dot(val, self.W) + self.bias

We illustrate the method with a 1-layer linear network:

inp = Input()
lin = Projection(inp, W=np.random.randn(100, 20), bias=np.zeros(20))
X = np.random.randn(5, 100)  # a batch of five input vectors
out = lin(X)

The notation of fprop might be confusing here, since the call travels backwards from the last layer to the input layer. So, let’s see what is happening here:

lin(X) is equivalent to lin.__call__(X), and inside this method the output of the previous layer is requested via self.prev(value). This continues until the input layer returns the actual value, which is the stop condition. The approach is not restricted to a 1-layer network and can be used for arbitrarily large networks.

With this idea, all we have to do is split the layer setup and the computation, which are combined in Theano. For instance, a projection layer in Theano:

import theano.tensor as T

class Projection(Layer):
    def __init__(self, input, W, bias):
        self.output = T.dot(input, W) + bias  # symbolic graph is built at construction time

now looks like this with numpy:

class ProjectionNP(LayerNP):  # LayerNP: the numpy counterpart of the base class
    def __init__(self, input, W, bias):  # setup
        self.prev = input
        self.W, self.bias = W, bias

    def __call__(self, value):  # computation
        val = self.prev(value)
        return np.dot(val, self.W) + self.bias

In other words, converting any Theano layer is pretty straightforward and mostly takes time to type, not to think (much).

The storage of such a model is just a list with all layers, and we can extract the output of any layer by simply calling the layer object with the input:

net = [inp, pro, bn, relu]
net[-1](X) # relu
net[-3](X) # projection
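For completeness, the bn and relu objects from the example could be implemented in the same __call__ style. The class names and the batch-norm inference formula below are our own sketch; the statistics (mean, var, gamma, beta) are assumed to come from the trained model:

class ReLU(Layer):
    def __init__(self, prev):
        self.prev = prev

    def __call__(self, value):
        return np.maximum(0, self.prev(value))

class BatchNorm(Layer):
    def __init__(self, prev, mean, var, gamma, beta, eps=1e-5):
        self.prev = prev
        self.mean, self.var = mean, var          # statistics learned at training time
        self.gamma, self.beta, self.eps = gamma, beta, eps

    def __call__(self, value):
        val = self.prev(value)
        return self.gamma * (val - self.mean) / np.sqrt(self.var + self.eps) + self.beta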

Let’s summarize the advantages again: First, except for numpy there are no other dependencies, and numpy is pretty portable and introduces little overhead. Second, we do not need to compile any functions since we are working with real data and not symbolic variables. The latter is especially important if an “app” is started frequently but the interaction time is rather short, because then a constant setup overhead very likely hurts user satisfaction.

Bottom line, the method we described here is especially useful for smaller models and environments with limited resources, which might include apps that are frequently started and thus should have a low setup time.

Model Expectations

In the previous post we talked about unbalanced labels and the consequences a strong regularization might have if only a few errors remain but the whole dataset is repeatedly fed to the model. At the end of the day, the careful selection actually helped to improve the model. However, there are still some “errors” that are very annoying.

The human brain is remarkable because it can perform an inference step with a minimal amount of power and information, thanks to its long-term memory. For instance, the title of a movie and a minimal description often suffice to decide whether we are interested in the movie or not. With the ongoing success of Deep Learning, a lot of people transfer this expectation to the output of machine learning models.

Why is this problematic? In the case of content-based methods, the model can only be as good as the features, and a 100-layer network won’t change that fact, because even the largest network needs to see the whole picture to learn a useful representation. Thus, if a movie is described by a few keywords, maybe some themes, flags and genres, the achievable quality of the model and its inferences is bounded by the quality and completeness of the features.

This view is supported if we analyze the errors made by trained models on unseen movies. As humans, we look at the title and already have a vague expectation of whether we will like it and which category it belongs to. Of course we could be wrong, but the point is that in most cases at least one descriptive feature is missing, and thus the prediction, if we only consider the features, is perfectly right, but totally unusable for us.

The problem is the lack of understanding on the side of the users: when one or two missing keywords can fool the whole system, but even a child could make the right prediction, what is the point of using machine learning at all? Well, in most cases such models work flawlessly and really help users to make the right decisions, but it is obvious that appreciation suffers if you have to tell your users that you cannot fix a problem that seems trivial to them, yet is impossible for a multi-core CPU system.

Bottom line, it is old hat that content-based systems are only as good as the input data and that collaborative systems are more powerful but require lots of data before they make useful predictions. In our case, we plan to circumvent the problem by using different modalities that are projected into a single feature space.

Classification: Linear vs. Embedding

Whenever we are juggling very high-dimensional but also very sparse data, linear models are a good way to start. Why? Because they are fast to train and to evaluate, have a minimal footprint, and are often sufficient to deliver good performance, since data in high-dim spaces is more likely to be separable. However, the simplicity also comes with a price.

In a linear model, each feature has a scalar weight, like barbecue=1.1, cooking=0.3, racing=-0.7, car=-0.9 and a bias is used as a threshold. To predict if something belongs to the category “food”, we calculate:

y_hat = x_1 * barbecue + x_2 * cooking + x_3 * racing + x_4 * car + bias

where x_i is in {0, 1}, depending on whether the feature is present or not. If y_hat is positive, the answer is yes, otherwise no. Obviously the answer SHOULD be positive if a sufficient number of food-related features are present, and in the case of a mixture the answer resembles a majority vote. The problem is that linear models completely ignore the context of features, which means the meaning of a feature cannot be adjusted depending on its present neighbors.
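In numpy, this prediction is a single dot product. The weights are the toy values from above; the bias is an assumed threshold:

import numpy as np

w = np.array([1.1, 0.3, -0.7, -0.9])   # barbecue, cooking, racing, car
bias = -0.2                             # assumed threshold for "food"

x = np.array([1, 1, 0, 0])              # item has "barbecue" and "cooking"
y_hat = x.dot(w) + bias                 # 1.1 + 0.3 - 0.2 = 1.2 > 0 -> "food"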

This is where embedding models jump in to fill the gap. In such a model, each feature is represented by a vector instead of a scalar, and an item is represented as the average of all its feature vectors. For the labels, we also use a vector, as in linear models, but it lives in the embedding space rather than in the original feature space. Then we can make a prediction by transforming the item and calculating, for each label:

y_i = f(dot(mean_item, label_i) + bias_i)

What is the difference? Let’s assume that we encode the vectors with 20 dims, which means we have much more representational power to encode relations between features. For instance, if a certain feature usually belongs to the non-food category, but is strongly related when combined with specific other features, a linear model is likely to have trouble capturing the correlation. More precisely, its weight can be either negative, around zero, or positive. If it is positive, the feature must be related to the category, which is usually not the case; if it is negative, it never contributes to a positive prediction; and in the last case, where it is very small regardless of sign, it does not contribute at all. To be frank, we simplified the situation a lot, but the point is still valid.
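A minimal sketch of the embedding prediction with 20-dim vectors; the matrices are random placeholders and the sigmoid is just one possible choice for f:

import numpy as np

n_features, n_labels, dims = 10000, 50, 20
emb = 0.01 * np.random.randn(n_features, dims)   # one vector per feature
label_vecs = 0.01 * np.random.randn(n_labels, dims)
label_bias = np.zeros(n_labels)

def predict(active_features):
    mean_item = emb[active_features].mean(axis=0)       # average of the feature vectors
    scores = label_vecs.dot(mean_item) + label_bias     # dot(mean_item, label_i) + bias_i
    return 1.0 / (1.0 + np.exp(-scores))                # f: sigmoid, as an example

probs = predict([12, 873, 4711])                        # only the active features matter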

In the case of an embedding, we can address this issue because we have more than a single dimension to encode relations with other features. Let’s see again what a prediction looks like:

dot(mean_item, label_i) + bias_i

followed by some function f that we ignore here, since it only rescales the output value. We can also express this as:

sum_j(mean_item[j] * label_i[j]) + bias_i

Stated differently, we could say that each dim casts a weighted vote; the votes are summed up and thresholded against the label bias. The more positive votes we get, the higher the chance that (sum_j + bias_i) > 0, which means the final prediction is positive.

To come back to our context example, with the embedding it is possible that some dimensions are sensitive to correlations. For instance, dim “k” might be negative for a specific feature but positive for non-food features, which cancels the contribution due to the averaging. However, if it is also negative for food-related features, and also negative in the label, the specific feature strengthens the prediction because of the context. Of course, it is not that simple for real-world data, but the argument remains the same: with a vector, the model is able to learn more powerful representations.

Bottom line, linear models are very useful, but depending on the problem they are too simple. Given sparse data, embedding models are still very efficient, because the complexity does not depend on |features| but on |features > 0|, with a sparsity of ~99.9%. With the more powerful representations, it is also likely that the embedding can be re-used, for instance to learn preference-based models or to cluster the data.

Understand First, Model Next

With all the lasting hype about Deep Learning, it seems that all problems can be easily solved with just enough data, a couple of GPUs and a (very) deep neural network. Frankly, this approach works quite well for a variety of classification problems, especially in the domains of language and images, but mostly because, despite the possibly large number of categories, classification is a very well understood problem.

For instance, word2vec does not use a deep network but a shallow model that works amazingly well. And even if support vector machines have gone out of fashion, there are some problems that can be solved very efficiently thanks to the freedom to use (almost) arbitrary kernel functions. Another example is text classification, where even a linear model, in combination with a very high-dim input space, is able to achieve very good accuracy.

The journey to build a “complete” model of the human brain is deeply fascinating, but we should not forget that sometimes all we need to solve a problem is a hammer and not a drill. In other words, if the problem is easy, a simple model should be preferred over one that is, theoretically, able to solve all problems but requires a huge amount of fine-tuning and training time, and even then the outcome might not be (much) better than the simple model. This is a bit like Occam’s Razor.

Bottom line, in machine learning we should be more concerned with a deeper understanding of the problem, which allows us to select and/or build a more efficient model, than with solving everything with a single method just because it is a hot topic. Since Deep Learning is used by the crowd, we are getting more and more the impression that DL is treated as a grail that solves all problems without the need for a solid understanding of the underlying problem. The impression is supported by frequent posts in DL groups where newbies ask for a complete solution to their problem. To be clear, asking for help is a good thing, but as the famous saying goes, if you cannot (re-)build something, you do not understand it.

The problem is that if somebody gives you a “perfect” solution, it works as long as your problem remains the same. But if your problem evolves over time, you have to modify the solution, and for that you need to understand both the problem and the solution in order to adjust the model. With the black-box approach often used in DL, data in and labels out, encoding prior knowledge or whatever else is necessary to solve the evolved problem requires a much deeper understanding of the whole pipeline.

Something New, Something Blue

One good thing about machine learning is that it is often interdisciplinary, meaning that it uses ideas from physics, mathematics or even neuroscience and computer science. Stated differently, there are no real limits if you can express your idea in some formal mathematical way. Combined with a powerful tool that trains your models without the need to do all the hard work, derivations and such, manually, it is much easier to focus on the actual problem.

In a recent post, we let our thoughts wander by considering different ways to describe movies. Well-known approaches are TF-IDF or collaborative filtering, and of course many others. However, in contrast to classical ranking or information retrieval systems, our first goal is not to derive a supervised model, but a model that is able to sufficiently describe the underlying data.

We stumbled upon a new possible approach while we implemented a “predict the next word” feature for our front-end. The idea is simple: if you enter some keywords, a search query, the system should be able to suggest words that match the query context. For example, a query like ‘alien sf’ could be extended with “creature” or “space”. The training of such a model can be done in numerous ways; popular choices are ‘Neural Probabilistic’ or ‘Log-Bilinear’ models.

The idea is pretty simple. We consider a vocabulary V that consists of textual words. Each word will be represented by some learned feature, while in the original space each word is represented as a one-hot encoding. Then we have to define a context length, n=3 for example. All we need to do next is to convert some text data into our vocabulary by replacing each word with its corresponding ID from the vocabulary.

A training sample is generated by splitting a sentence like “my lifeboat is full of eels” into a context and the word to predict. Example: context “my lifeboat is”, word “full”. With the IDs from the vocabulary, we get a tuple (ID1, ID2, ID3, ID4) for each training sample, where positions 1..3 are the context and 4 is the word to predict.
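A minimal sketch of generating such tuples with a context length of n=3; the vocabulary construction and tokenization are kept deliberately simple and are only for illustration:

text = "my lifeboat is full of eels"
words = text.split()

vocab = {w: i for i, w in enumerate(sorted(set(words)))}   # word -> ID
ids = [vocab[w] for w in words]

n = 3  # context length
samples = [tuple(ids[i:i + n + 1]) for i in range(len(ids) - n)]
# each sample: (ID1, ID2, ID3, ID4) = (context words, word to predict)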

With a trained model, we can then predict which word of the vocabulary best matches a given test context. What all these approaches have in common is that they are supposed to generalize to contexts that have not been seen before, by using the learned semantics of the words in the vocabulary.

After we trained a basic model with our data, we thought about how we could use the model for our semantic clustering by using the learned similarities between the words. In the next post, we will elaborate on that.