Despite the simplicity, Nearest Neighbors (NN for short), is still a very powerful baseline for classification. Except for the number of neighbors, there are no parameters to tune or learn, but the accuracy depends solely on the quality of the feature representation. For domains with highly entangled representations, the results might be very disappointing. Thus, it seems reasonable to combine some neural net encoder to disentangle the input data first and then to jointly optimize the NN loss on this representation.
We experimented with a soft NN loss function  a while and we also stumbled again about NCA  (Neighborhood Component Analysis). Especially the note on the last slide  “This does not force all points of the same class to be similar.” caught our attention for several reasons. First, in case of clustering or softmax classifiers a single template is learned for the whole class which can be burdensome if the a single class contains a lot of variance. In this case, the network needs a sufficient depth to disentangle those aspects to learn the final template prototype that allows a linear separation of classes.
However, in case of NCA which is a soft version of NN, some arbitrary sample just need to be closer to some other sample with the same class, but not all of them. And a bit further away from a sample with a different label . So, instead of wasting energy to collapsing all samples from the same class into a single point, we just need to ensure that the probability mass of all samples with the same label is maximal.
k, i = torch.arange(len(batch)), 0
dists = ((batch[i] - batch)**2).sum(dim=1)
dists[i] = 1e9 # mask out 'self'
probs = torch.softmax(-dists**2, dim=1).view(-1)
probs_leave_one_out = probs[0, (labels == labels[i]) & (k != i)]
loss = -torch.log(probs_leave_one_out.sum())
In words: We select the zero-th sample from the batch, calculating the distances to all other examples, we mask the i-th out since dist(x, x) is always zero to avoid a contribution to the softmax and then we further only consider probabilities from the same class and we also leave out the i-th example.
The classical softmax would push the whole mass to a single sample, while we are happy if the whole mass belongs to samples with same label class. This makes learning much easier, since it does not matter if a single positive sample is nearby or a dozen, as long as the negative samples are a bit more unlikely neighbors according to the softmax output.
To illustrate this with an numerical example: Let us assume that we have 6 examples where 3 are positive and three are negative. Then the all the following softmax outputs would all lead to a loss of zero
#1 [0.98, 0.02, 0.02, ~0, ~0, ~0]
#2 [0.33, 0.33, 0.33, ~0, ~0, ~0]
#3 [0.50, 0.25, 0.25, ~0, ~0, ~0]
since the sum of the positive samples is always close to one.
As stated in  this means the manifold can be fractured since there might bedifferent regions for the same label that are not connected. But frankly we do not care, as long as any new example is mapped into a region where at least one correct example, with respect to the label, is present and any sample with a different label is just a bit more unlikely.
A drawback of NCA is that the optimal solution requires all training examples which does not scale due to the quadratic complexity to calculate all distances. The problem is that if there exist a neighbor for some input x_i that would preserve the local neighborhood of x_i, but there is no such neighbor in the (mini-)batch, the neighborhood will be either distorted or even destroyed. A solution for the problem is memory + momentum encoders, like , where it is possible to scan much more samples to find nearest neighbors that might preserve the local neighborhood. For those samples, the gradient is not derived which means it is only propagated through the mini-batch data and the momentum samples are considered as constants.
Bottom line, the strength of NCA that it is often able to decompose categories into sub-categories, even if the labels are already fine-grained. This works, because the network is not forced to summarize a category with a single prototype template, at the cost that an inference step now requires to determine the k nearest neighbors from memory.
 CSC2535 Lecture 11: “Non-linear Dimensionality Reduction”, Hinton – Toronto
 Improving Generalization via Scalable Neighborhood Component Analysis