# Ridiculously Sparse

Despite our fascination with unsupervised learning, we are currently facing a problem that is riddled with labels. The problem could be summarized as ‘describing very high-dimensional, sparse input data with thousands of labels and very limited training data’, or in short ‘extreme multilabel learning’, where both the data and the labels are sparse. There was a recent workshop at NIPS about exactly this problem, which allowed us to study existing approaches and see if we can borrow and/or adjust the methods discussed there.

For example, a typical sample is described by 2,000 dimensions, bag-of-words style, with tf-idf values and an average sparsity of 99.25% (15 non-zero elements). The total number of labels is about 1,000 and on average a sample has about 4 labels, which is a sparsity of 99.60%. Our current data set has about 100,000 examples. Of course this is not as extreme as for large websites, but it is nevertheless very challenging, because rare features in particular are often imperative for accurate predictions of some labels.
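As a quick sanity check of these sparsity figures, a minimal sketch using the average counts quoted above:

```python
# average sample: 15 non-zero entries out of 2,000 dimensions
dim_x, nnz = 2000, 15
input_sparsity = 1.0 - nnz / dim_x

# average label vector: about 4 active labels out of about 1,000
num_labels, active = 1000, 4
label_sparsity = 1.0 - active / num_labels

print(round(input_sparsity * 100, 2), round(label_sparsity * 100, 2))  # 99.25 99.6
```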

Before we dive into the details, let’s see what sparsity means for the optimization. We start with a linear model:

```python
loss = np.maximum(0, 1 - y_hat * y) + 0.5 * l2_decay * np.sum(W**2)
```

with y in {-1, 1}, y_hat = x * W + b and l2_decay = 2e-4. The input x is a 2,000-dim real vector. The parameters are: [W, b]. Whenever the hinge is active, i.e. 1 - y * y_hat > 0, the gradient is:

```
g(W) = -y * x + l2_decay * W
g(b) = -y
```

To minimize the loss, vanilla gradient descent is defined as:

```
W(t+1) = W(t) - lrate * g(W(t)) = W(t) - lrate * (-y * x + l2_decay * W(t))
b(t+1) = b(t) - lrate * g(b(t)) = b(t) - lrate * (-y)
```
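A toy example of one such step, assuming the hinge is active (all numbers are made up):

```python
import numpy as np

lrate, l2_decay = 0.1, 2e-4
x = np.array([0.0, 0.5, 0.0, 1.2])  # sparse toy input
W = np.zeros_like(x)                # weights start at zero
b = 0.0
y = 1.0                             # target in {-1, 1}

y_hat = float(x @ W + b)
if 1.0 - y * y_hat > 0:             # margin violated -> non-zero gradient
    W -= lrate * (-y * x + l2_decay * W)
    b -= lrate * (-y)

# W moved only where x is non-zero: [0.0, 0.05, 0.0, 0.12]; b is now 0.1
```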

With the sparsity of the input x in mind, we can easily see that the gradient g(W)[i] evaluates to zero whenever x[i] is 0, if we ignore the regularization term! Furthermore, we need an optimizer like AdaGrad that provides a learning rate for each feature, in order to emphasize less frequent features. Why is this important?

The sparsity in the input data means that we can update the parameters very efficiently, because only very few actual parameter values are involved. However, this scheme is ruined by the regularization term, because it is applied to all parameters. As a quick solution, it has been suggested [arxiv:1505.06449] to update only the parameters for non-zero entries in the input data. This step is necessary to ensure that we never lose the sparsity we once gained.
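Putting both ideas together, a sparse update that touches only the weights of non-zero inputs could look like this (a sketch; the dict-of-non-zeros representation and the function name `sparse_hinge_update` are assumptions for illustration):

```python
import numpy as np

def sparse_hinge_update(W, b, x, y, lrate=0.1, l2_decay=2e-4):
    """One hinge-loss SGD step; x is a dict {position: value} of non-zeros.

    Regularization is applied lazily, only to the active weights, so
    the weights of features we never saw stay exactly zero.
    """
    y_hat = b + sum(val * W[pos] for pos, val in x.items())
    if 1.0 - y * y_hat > 0:           # margin violated
        for pos, val in x.items():    # |x > 0| steps, not |x|
            W[pos] -= lrate * (-y * val + l2_decay * W[pos])
        b -= lrate * (-y)
    return W, b

W, b = np.zeros(2000), 0.0
x = {17: 0.5, 402: 1.2}               # toy sample with 2 non-zero entries
W, b = sparse_hinge_update(W, b, x, y=1.0)
# only W[17] and W[402] changed; the other 1,998 weights are still zero
```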

After the model is trained, predictions can be done very efficiently, since only very few dot products need to be calculated. In case the data is stored as a dictionary {pos_1: val_1, pos_2: val_2, …}, which allows a very compact representation due to the sparsity, a prediction looks like this (Python-style):

```python
y_hat = b
for pos, val in x.items():
    y_hat += val * W[pos]
```

With x as the input vector, W as the weight vector and b as the bias. That means we only need |x > 0| steps plus one addition for the bias to get the prediction, and with a sparsity of 99.25% most predictions can be done with fewer than 20 calculations. Furthermore, if we have more linear models, we can evaluate all of them with a single pass over x:

```python
y_hat = b.copy()  # b is now a vector, so copy it instead of aliasing it
for pos, val in x.items():
    for j in range(num_labels):
        y_hat[j] += val * W[j, pos]
```

The only difference is that b is now a vector of 1x(num_labels) and W is a matrix of (num_labels)x(dim_x). At the end, y_hat holds the raw prediction for each label.

So, let’s summarize what we got so far: the sparsity of the input data allows us to train a linear model very quickly, even for very high-dimensional data, and also to evaluate predictions in few steps. In other words, with the adjusted regularization, the whole training depends only on |x > 0| and not on |x|, which means ~20 steps instead of 2,000 steps per sample.
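The saving can be made concrete with a quick sketch (random toy data): the dense and the sparse prediction agree, but the sparse one touches only the 15 non-zero entries instead of all 2,000.

```python
import numpy as np

dim_x = 2000
rng = np.random.default_rng(0)
W = rng.normal(size=dim_x)
b = 0.5

# a sample with 15 non-zero entries, stored as {position: value}
positions = rng.choice(dim_x, size=15, replace=False)
x_sparse = {int(p): float(v) for p, v in zip(positions, rng.random(15))}

x_dense = np.zeros(dim_x)
for pos, val in x_sparse.items():
    x_dense[pos] = val

dense_pred = float(x_dense @ W + b)                                   # 2,000 multiply-adds
sparse_pred = b + sum(val * W[pos] for pos, val in x_sparse.items())  # 15 multiply-adds

assert np.isclose(dense_pred, sparse_pred)
```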

Finally, let’s analyze how the weights evolve over time for a specific problem. The aim is to predict the genres (mystery, horror, scifi, western) of movie data, where a movie can have more than one genre. There are two groups of features, common and rare, where the mean frequency of the rare ones is about 1/15 of the mean of the common ones. For example, if a common feature has a frequency of 1,500, a rare feature has one of about 100, just to get a feeling for the magnitudes. If we now shuffle the data set and update on each sample exactly once during an epoch, a rare feature gets only about 1/15 as many updates as a common feature. This is partly addressed by AdaGrad, which gives a higher learning rate to parameters with a smaller accumulated gradient history. Stated differently, if we assume that both gradients have a similar magnitude, the common feature gets many more updates, which means its accumulated gradient is about 15 times larger than that of the rare one. For AdaGrad:

```python
history_grad += current_grad**2
param_lrate = lrate / np.sqrt(history_grad + eps)  # per-parameter learning rate
```

which means the learning rate for rare features is much higher than the one for common features.
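The effect can be illustrated with toy accumulators, assuming both features see gradients of similar magnitude but the common one is updated 15 times as often:

```python
import numpy as np

lrate, eps = 0.1, 1e-8
grad = 1.0  # similar gradient magnitude for both features

history_common = 15 * grad**2  # 15 updates for the common feature
history_rare = 1 * grad**2     # 1 update for the rare feature

lrate_common = lrate / np.sqrt(history_common + eps)
lrate_rare = lrate / np.sqrt(history_rare + eps)

# the rare feature's learning rate is sqrt(15), roughly 3.9x, larger
ratio = lrate_rare / lrate_common
```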

Bottom line, we can summarize all this with:

– keep rare features to ensure the performance of difficult labels

– keep the sparsity of the model parameters over time

– use individual learning rates for each parameter

– ensure the time/space complexity depends on |x>0| and not |x|

Equipped with this, we can easily train models for data with thousands of dimensions, as long as the data is sufficiently sparse (> 99%).

But it is a completely different story to train a model with hundreds of labels, because then we have to decide how to model the label space: for instance one-vs-rest, an embedding scheme like wsabie, or some other way to utilize the correlation of labels. Not to forget that in the case of multilabel data, we also have to decide at what magnitude a label is assigned to a sample.