DetCon: The Self-supervised Contrastive Detection Method
DeepMind

A new self-supervised objective, contrastive detection, tasks representations with identifying object-level features across augmentations.

Object-based regions are identified with an approximate, automatic segmentation algorithm based on pixel affinity (bottom). These masks are carried through two stochastic data augmentations and a convolutional feature extractor, creating groups of feature vectors in each view (middle). The contrastive detection objective then pulls together pooled feature vectors from the same mask (across views) and pushes apart features from different masks and different images (top).

Highlights
+ SOTA object detection and instance segmentation results (on COCO) and semantic segmentation results (on PASCAL) when pretrained in a self-supervised regime on ImageNet, while requiring up to 5× fewer pretraining epochs than SimCLR.
+ It also outperforms supervised ImageNet pretraining on these transfer tasks.
+ DetCon(SimCLR) converges much faster to SOTA: 200 pretraining epochs are sufficient to surpass supervised transfer to COCO, and 500 for PASCAL.
+ Scaling up the backbone (ResNet-101, ResNet-152, and ResNet-200) brings steady gains in accuracy on downstream tasks.
+ Despite only being trained on ImageNet, DetCon(BYOL) matches the performance of Facebook's SEER model, which used a higher-capacity RegNet architecture and was pretrained on 1 billion Instagram images.
+ For the first time, a ResNet-50 with self-supervised pretraining on COCO outperforms supervised pretraining when transferring to PASCAL.
+ The performance of DetCon strongly correlates with mask quality: the better the masks used during the self-supervised pretraining stage, the better the accuracy on downstream tasks.

Method details
The authors propose two variants, DetConS and DetConB, based on two recent self-supervised baselines, SimCLR and BYOL respectively, each with a ResNet-50 backbone.
They adopt the data augmentation procedure and network architecture from these methods and apply the proposed contrastive detection loss to each.

Each image is randomly augmented twice, resulting in two images: x, x'.
In addition, they compute for each image a set of masks that segment the image into different components.
These masks can be computed using efficient, off-the-shelf, unsupervised segmentation algorithms. In particular, the authors use the Felzenszwalb-Huttenlocher algorithm, a classic segmentation procedure that iteratively merges regions using pixel-based affinity. It requires no training and is available in scikit-image. If available, human-annotated segmentations can be used instead of automatically generated ones. Each mask (represented as a binary image) is transformed using the same cropping and resizing as the underlying RGB image, resulting in two sets of masks {m}, {m'} that are aligned with the augmented images x, x'.
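To make the mask-generation step concrete, here is a minimal sketch using the Felzenszwalb-Huttenlocher implementation from scikit-image; the scale, sigma, and min_size values below are illustrative choices, not the paper's settings:

```python
import numpy as np
from skimage import data
from skimage.segmentation import felzenszwalb

image = data.astronaut()  # any RGB image, shape (H, W, 3)

# Label map: each pixel is assigned an integer region id.
segments = felzenszwalb(image, scale=100, sigma=0.8, min_size=50)

# Turn the label map into a set of binary masks {m}, one per region.
masks = [(segments == region_id) for region_id in np.unique(segments)]
print(f"{len(masks)} masks of shape {masks[0].shape}")
```

Each binary mask would then be cropped and resized with the same parameters as its RGB image so that masks stay aligned with the augmented views.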

For every mask m associated with an image, the authors compute a mask-pooled hidden vector (similar to regular average pooling, but applied only to the spatial locations belonging to that mask).
A 2-layer MLP is then used as a projection head on top of the mask-pooled hidden vectors. Note that if you replace mask-pooling with a single global average pooling, you recover exactly the SimCLR or BYOL architecture.
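A rough PyTorch sketch of the mask-pooling and projection step; the function names, dimensions, and dummy inputs are illustrative assumptions, not the official implementation:

```python
import torch
import torch.nn as nn

def mask_pool(features, masks):
    """Average-pool feature vectors over each mask's spatial support.

    features: (B, C, H, W) backbone feature map.
    masks:    (B, K, H, W) binary masks, K per image, at feature resolution.
    returns:  (B, K, C) one pooled vector per mask.
    """
    masks = masks.float()
    # Sum features over each mask's pixels, then divide by the mask area.
    pooled = torch.einsum('bchw,bkhw->bkc', features, masks)
    area = masks.sum(dim=(2, 3)).clamp(min=1.0).unsqueeze(-1)
    return pooled / area

# 2-layer MLP projection head, as in SimCLR/BYOL (sizes are illustrative).
projector = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 128))

features = torch.randn(8, 2048, 7, 7)      # e.g. ResNet-50 final feature map
masks = (torch.rand(8, 16, 7, 7) > 0.5)    # 16 dummy masks per image
z = projector(mask_pool(features, masks))  # (8, 16, 128) mask latents
```

With a single all-ones mask per image, mask_pool reduces to global average pooling, which is exactly the SimCLR/BYOL setup mentioned above.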

A standard contrastive loss based on cross-entropy is used for learning. The positive pair consists of the latent representations of the same mask from the augmented views x and x'. Latent representations of different masks from the same image and from other images in the batch serve as negative samples. Moreover, negative masks are allowed to overlap with the positive one.
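A simplified sketch of this loss, assuming mask latents are aligned by index across the two views and treating every other mask latent in the batch as a negative; the temperature value is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def contrastive_detection_loss(z, z_prime, temperature=0.1):
    """z, z_prime: (B, K, D) mask latents from the two augmented views."""
    b, k, d = z.shape
    z = F.normalize(z.reshape(b * k, d), dim=1)
    z_prime = F.normalize(z_prime.reshape(b * k, d), dim=1)

    # Similarity of every mask latent in view 1 to every latent in view 2.
    logits = z @ z_prime.t() / temperature  # (B*K, B*K)

    # The positive for mask i is the same mask in the other view (index i);
    # all other rows (other masks, other images) act as negatives.
    targets = torch.arange(b * k, device=z.device)
    return F.cross_entropy(logits, targets)
```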