Why contrastive loss for unsupervised learning?
Unsupervised learning holds out the possibility of acquiring transferable representations without direct human supervision. Contrastive-learning-based optimisation has been highly successful among unsupervised methods. To distinguish positive from negative samples, contrastive learning methods share a common loss design: a softmax over the feature similarities, scaled by a temperature. This contrastive loss plays an important role in how well unsupervised contrastive learning performs.
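The loss described above can be sketched for a single anchor as follows. This is a minimal illustration, not a reference implementation: the function name `contrastive_loss`, its parameter names, and the example similarity values are assumptions made here, and the similarities are taken to be already computed (for example as dot products of normalised features).

```python
import math

def contrastive_loss(sim_pos, sim_negs, temperature=0.1):
    """Softmax-based contrastive loss for one anchor (illustrative sketch).

    sim_pos:  similarity between the anchor and its positive sample.
    sim_negs: similarities between the anchor and each negative sample.
    """
    # Scale every similarity by the temperature; the positive comes first.
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Subtract the largest logit before exponentiating, for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log of the softmax probability assigned to the positive sample.
    return log_denominator - logits[0]

# Hypothetical similarities: one positive and three negatives.
loss = contrastive_loss(0.9, [0.4, 0.1, -0.3], temperature=0.2)
print(loss)
```

The loss shrinks as the positive similarity rises relative to the negatives, and the temperature controls how sharply the softmax separates them.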
Contrastive-learning-based embeddings have uniformity and hardness-aware properties.
A large number of recent empirical works learn representations under a unit l2-norm constraint, effectively restricting the output space to the unit hypersphere. Having the features live on the unit hypersphere leads to several desirable properties. Fixed-norm vectors are known to improve training stability in modern machine learning, where dot products are ubiquitous. In addition, if the features of a class are sufficiently well clustered, they are linearly separable from the rest of the feature space, which is a common criterion for evaluating representation quality. Uniformity favours a feature distribution that preserves maximal information, i.e., the uniform distribution on the unit hypersphere.
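To make the notion of uniformity concrete, the sketch below projects vectors onto the unit hypersphere and scores how evenly they are spread. The score follows the pairwise Gaussian-potential form used in the second reference (the log of the mean of exp(-t * squared distance) over all pairs, with t = 2 as a common choice). The helper names and the toy points are assumptions introduced here for illustration.

```python
import math

def l2_normalize(v):
    """Project a vector onto the unit hypersphere (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def uniformity(features, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    Lower values indicate a more uniform spread on the hypersphere.
    """
    total, count = 0.0, 0
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            total += math.exp(-t * sq_dist)
            count += 1
    return math.log(total / count)

# Toy data: four points spread evenly on the unit circle vs. four bunched together.
spread = [l2_normalize(v) for v in [(1, 0), (0, 1), (-1, 0), (0, -1)]]
bunched = [l2_normalize(v) for v in [(1, 0.0), (1, 0.1), (1, 0.2), (1, 0.3)]]

print(uniformity(spread), uniformity(bunched))
```

The evenly spread points receive the lower score, which matches the intuition that a uniform distribution on the hypersphere keeps samples as far apart as possible.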
The contrastive loss is a hardness-aware loss function: it automatically concentrates on optimising the hard negative samples, penalising each according to its hardness. With a low temperature, the loss penalises the hardest negative samples much more heavily, so the local structure of each sample tends to be more separated and the embedding distribution tends to be more uniform. With a large temperature, on the other hand, the loss is less sensitive to hard negative samples, and the hardness-aware property disappears as the temperature approaches infinity.
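The hardness-aware behaviour can be seen from the gradient of the softmax loss: the share of the penalty falling on each negative is proportional to exp(similarity / temperature), normalised over the negatives. The sketch below computes that share for a handful of negatives. The function name `negative_penalty_weights` and the similarity values are hypothetical and chosen only to illustrate the effect.

```python
import math

def negative_penalty_weights(sim_negs, temperature):
    """Relative share of the gradient penalty assigned to each negative sample."""
    scaled = [s / temperature for s in sim_negs]
    # Subtract the largest value before exponentiating, for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarities to the anchor; 0.8 is the hardest negative.
sim_negs = [0.8, 0.5, 0.0, -0.5]

low = negative_penalty_weights(sim_negs, 0.07)    # small temperature
high = negative_penalty_weights(sim_negs, 100.0)  # very large temperature

print(low)
print(high)
```

With the small temperature almost all of the penalty lands on the hardest negative, while with the very large temperature the shares are nearly equal, so hardness no longer matters.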
We find that although uniformity is a key indicator of the performance of contrastive models, excessive pursuit of uniformity may break the underlying semantic structure. If the contrastive loss is equipped with a very small temperature, it gives very large penalties to the nearest neighbours, which are very likely to share similar semantic content with the anchor point. The temperature therefore directly controls how strongly such semantically similar samples are pushed away from the anchor.
We recognise that there is a uniformity-tolerance dilemma in unsupervised contrastive learning. On the one hand, we want the features to be distributed uniformly enough to be easily distinguished from one another. On the other hand, we want the contrastive loss to be tolerant of semantically similar samples. We observe that embeddings trained with τ = 0.07 are more uniformly distributed, whereas embeddings trained with τ = 0.2 present a more reasonable distribution that is locally clustered and globally separated.
References:
- Understanding the Behaviour of Contrastive Loss — https://arxiv.org/pdf/2012.09740.pdf
- Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere — https://proceedings.mlr.press/v119/wang20k/wang20k.pdf