Since we have a rather unusual setup, ordinary classification often delivers a performance that is not satisfying. We tried to address the issue with a large-nargin approach that uses a triplet loss to model local neighborhoods, but even this approach fails to solve our problems for some kind of data. To put it simple, some samples with a rare combination of features might not find their place in the learned embedding space and thus, probably end up at some “random” place. To guide the way of those little rascals, we introduced a memory component like in [arxiv:1703.03129].
First, the memory is filled uniformly with samples with different labels. After that for each query, we find the nearest neighbor with the same label, but also one with a different label. The idea is to ensure that memory slots with different labels are well separated in the embedding space. Since we do not backprop through the memory slots, we adjust the embedding parameters of the query to ensure the margin. This is done by a simple triplet loss:
max(0, margin + dot(query, nn_neg) - dot(query, nn_pos))
where all data is unit-normed.
The memory slot with the matching label is then updated:
mem[idx_pos] = l2_norm(query + mem[idx_pos])
where idx_pos is the position of the memory slot.
The argumentation why this helps to improve the performance is similar to the one in the paper: The additional memory helps to remember combinations of features that are rarely seen and thus is often able to infer the correct label even if the embedding has not been “settled” at all.
Furthermore, the memory can also help to improve the embedding space by concatenating embeddings of samples and slots which leads to a cleaner separation of class boundaries: h_new = hidden||nn(mem, hidden).
Still, there are quite a few questions that need to be addressed by further research:
– The more a memory slot gets updated, the more likely it is that it will be closer to an arbitrary query. This will likely lead to lots of orphaned memory slots. How can we ensure that distinct feature combinations won’t get mixed into those slots?
– Shall we introduce some some non-determinism to introduce some noise to improve the utilization of the memory (to better preserve some rare patterns)?
– Shall the memory be either a circular buffer or shall we average memory slots to convert to a stable state?
As long as there is no way to learn in an reliably, but unsupervised way from sparse input data, we believe that external memory is a promising path to pursuit. However, even with this adjustment there are still lots of challenges that need to be addressed before we can come up with a working solution.