# PyTorch – Combining Dense And Sparse Gradients

If you train a vanilla neural network, gradients are usually dense. However, in PyTorch the embedding layer supports the `sparse=True` option to speed up learning for larger vocabularies; we wrote about it before[1]. The only optimizers that can handle both dense and sparse gradients are SGD and, not to forget, Adagrad. It should be noted, though, that with momentum enabled, SGD (at least before version 1.0.0) seemed to slow down after some epochs because of the accumulator. Thus, we usually stayed with Adagrad, which was a good choice.

Nevertheless, it is not satisfying to be tied to one optimizer for all our problems. But what happens if you mix both types of gradients? Adam complains that it can't handle sparse gradients, and SparseAdam complains that it can't handle dense ones. The solution may be obvious, but it is not as straightforward as it should be. In a discussion[2], a quite elegant solution is proposed that avoids juggling multiple optimizers by hand.

But first, let's describe the problem precisely. Say we have a simple network with an Embedding layer (sparse gradients) and a Linear layer (dense gradients), and the forward() function combines both. If we want to use Adam as the optimizer, we need one optimizer for the sparse layers (Embedding) and one for the rest. We omit bias values for brevity.
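Such a network might look like the following sketch. The layer sizes are made up for illustration; only the `sparse=True` flag on the embedding and the `emb`/`out` attribute names matter for the rest of the post:

```python
import torch
import torch.nn as nn

class Network(nn.Module):
    """Toy model: a sparse Embedding followed by a dense Linear layer."""
    def __init__(self, vocab_size=100, emb_dim=16, out_dim=1):
        super().__init__()
        # sparse=True makes backward() produce sparse gradients for emb.weight
        self.emb = nn.Embedding(vocab_size, emb_dim, sparse=True)
        self.out = nn.Linear(emb_dim, out_dim, bias=False)

    def forward(self, x):
        # mean over the embedded tokens, then project to the output
        return self.out(self.emb(x))
```

After a backward pass, `net.emb.weight.grad` is a sparse tensor while `net.out.weight.grad` is dense, which is exactly the mix that trips up a single Adam instance.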

The learning step looks like this:

```
net = Network()

opt_sparse = torch.optim.SparseAdam([net.emb.weight], lr=3e-4)
opt_dense = torch.optim.Adam([net.out.weight], lr=3e-4)

y_hat = net(torch.LongTensor([1, 2, 3]))
loss = ((y - y_hat)**2).mean()

opt_sparse.zero_grad()
opt_dense.zero_grad()
loss.backward()
opt_sparse.step()
opt_dense.step()
```

It’s not hard to follow, but it could probably be a bit more intuitive. The catch is that we can’t use net.parameters(), at least not without filtering out the parameters that require a sparse gradient. In the end, we need two parameter lists and a separate optimizer for each type: sparse and dense.

At least from a software engineering point of view, as also noted in [2], we can improve the situation a little.

```
class MultipleOptimizer:
    def __init__(self, *op):
        self.optimizers = op

    def zero_grad(self):
        for op in self.optimizers:
            op.zero_grad()

    def step(self):
        for op in self.optimizers:
            op.step()
```

With this helper, the new training code becomes:

```
net = Network()

opt_sparse = torch.optim.SparseAdam([net.emb.weight], lr=3e-4)
opt_dense = torch.optim.Adam([net.out.weight], lr=3e-4)
opt = MultipleOptimizer(opt_sparse, opt_dense)

y_hat = net(torch.LongTensor([1, 2, 3]))
loss = ((y - y_hat)**2).mean()

opt.zero_grad()
loss.backward()
opt.step()
```

And we can take this approach a little further by letting it figure out which layers lead to sparse gradients and which lead to dense ones.
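One way to sketch that idea, under the assumption that sparse gradients come only from `nn.Embedding` modules created with `sparse=True` (the helper name `build_optimizer` and the default learning rate are ours, not part of any PyTorch API):

```python
import torch
import torch.nn as nn

class MultipleOptimizer:
    # same helper as above, repeated so the sketch is self-contained
    def __init__(self, *op):
        self.optimizers = op

    def zero_grad(self):
        for op in self.optimizers:
            op.zero_grad()

    def step(self):
        for op in self.optimizers:
            op.step()

def build_optimizer(net, lr=3e-4):
    """Split the parameters of `net` into sparse and dense groups and
    wrap a SparseAdam/Adam pair in a single MultipleOptimizer."""
    sparse_params, dense_params = [], []
    for module in net.modules():
        if isinstance(module, nn.Embedding) and module.sparse:
            sparse_params.extend(module.parameters())
        else:
            # recurse=False: child modules are visited by net.modules() anyway
            dense_params.extend(module.parameters(recurse=False))
    opts = []
    if sparse_params:
        opts.append(torch.optim.SparseAdam(sparse_params, lr=lr))
    if dense_params:
        opts.append(torch.optim.Adam(dense_params, lr=lr))
    return MultipleOptimizer(*opts)
```

Now a single `opt = build_optimizer(net)` replaces the manual bookkeeping of the two parameter lists, and the training loop stays exactly as shown above.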

Surely there is more work to do, but our aim was to give a simple recipe for handling sparse and dense gradients without the hassle of calling multiple optimizers by hand.

[1] raberrytv.wordpress.com/2017/06/28/efficient-embedding-models-with-pytorch/

[2] discuss.pytorch.org/t/two-optimizers-for-one-model/11085