Tagged: gating

Top-K-Gating With Theano

In a recently published paper [arxiv:1701.06538], the authors use a mixture of experts, which by itself is not new. The twist is to activate only a small subset of those experts, which cannot be done with an ordinary softmax, since the output of a softmax is always -slightly- positive. The idea is to set the values of all non-top-k experts to a large negative value before applying the softmax operation, so that only the true top-k experts survive. The result is that the output at the masked positions is practically zero, since exp(-10) is roughly 4.5e-5.
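To see the effect with some made-up numbers: masking the smallest of three scores with -10 before the softmax drives its share to practically zero, while the remaining mass is redistributed over the top-2:

    softmax([2.0, 1.0, 0.5])  ≈ [0.63, 0.23, 0.14]
    softmax([2.0, 1.0, -10])  ≈ [0.73, 0.27, 0.00]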

With numpy, where x is a vector, this is straightforward:

import numpy as np

def keep_topk(x, k, neg=-10):
    # number of entries to mask out
    rest = x.shape[0] - k
    # indices of the |x|-k smallest values
    idx = np.argsort(x)[0:rest]
    x[idx] = neg  # note: modifies x in place
    return x
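As a quick usage sketch (the softmax helper here is our own, and keep_topk modifies x in place):

def softmax(x):
    # shift by the max for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([0.5, 2.0, 1.0, -0.3])
gates = softmax(keep_topk(x, k=2))
# gates ≈ [0.00, 0.73, 0.27, 0.00] -- only the top-2 experts get weight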

We just sort the values of x, take the indices of the |x|-k smallest positions and set those values to -10.

But since we want to use all the nice features of Theano, we need to port the code to the tensor world. Frankly, this is no big deal either, but it requires a tiny adaptation, since we cannot assign values to tensors directly.

import theano.tensor as T

def keep_topk(x, k, neg=-10):
    # number of entries to mask out (symbolic)
    rest = x.shape[0] - k
    # indices of the |x|-k smallest values
    idx = T.argsort(x)[0:rest]
    # tensors are immutable, so set_subtensor returns a new tensor
    return T.set_subtensor(x[idx], neg)
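To check that the port behaves like the numpy version, we can compile it into a function. A minimal sketch (note that T.nnet.softmax expects a matrix, so we temporarily add a leading axis with dimshuffle):

import theano
import theano.tensor as T

x = T.vector('x')
# mask everything but the top-2 scores, then normalize
gates = T.nnet.softmax(keep_topk(x, 2).dimshuffle('x', 0))[0]
f = theano.function([x], gates)
print(f([0.5, 2.0, 1.0, -0.3]))  # ≈ [0.00, 0.73, 0.27, 0.00]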

And that’s it.

The reason we spent some time on the porting is that we also had the idea to use soft attention to model the final prediction as a decision of a small set of experts. The experts might have different opinions, and with the gating we can blend different confidence levels with different outputs.
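A rough sketch of that blending, with hypothetical names and shapes (E stacks the expert outputs row-wise; nothing here is from the paper):

import theano
import theano.tensor as T

s = T.vector('s')   # (n_experts,) gating scores
E = T.matrix('E')   # (n_experts, output_dim) expert outputs
# sparse gates: only the top-2 experts get non-zero weight
g = T.nnet.softmax(keep_topk(s, 2).dimshuffle('x', 0))[0]
# final prediction: gate-weighted sum of the expert outputs
y = T.dot(g, E)
predict = theano.function([s, E], y)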