Written by Suyash Agarwal
For almost a decade, convolutional neural networks have dominated computer vision research around the globe. However, a new method has been proposed that harnesses the power of transformers to make sense of images. Transformers were initially designed for natural language processing tasks, with a primary focus on neural machine translation. The paper, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer et al. from Google Research, proposes an architecture called the Vision Transformer (ViT) to process image data using transformers. In this article, I will try to explain how it works.
Problem with CNNs
Before we dive into the method proposed for vision transformers, it is imperative to analyze the drawbacks and fundamental flaws of convolutional neural networks. For starters, CNNs fail to encode relative spatial information: they are concerned with detecting certain features and do not consider how those features are positioned with respect to each other.
Fig 1: Both images are similar for a CNN as it does not consider the relative positioning of facial features
In the above image, both will be classified as a face because a CNN only checks whether certain features are present in the input image and does not care about their arrangement relative to each other.
Another major flaw of CNNs lies in their pooling layers. A pooling layer discards a lot of valuable information, such as the precise location of the most active feature detector. In other words, it fails to convey where exactly the detected feature sits in the image.
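A tiny numpy sketch makes this concrete: after 2x2 max pooling, two feature maps whose strongest activation sits at different positions inside the same pooling window become indistinguishable. (The `max_pool_2x2` helper is illustrative, not from the paper.)

```python
import numpy as np

def max_pool_2x2(x):
    # Split the map into 2x2 windows and keep only each window's maximum,
    # discarding where inside the window that maximum occurred.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# The "detected feature" (value 9) is at opposite corners of the window...
a = np.array([[9, 0],
              [0, 0]])
b = np.array([[0, 0],
              [0, 9]])

# ...yet both maps pool to the identical output [[9]]: position is lost.
print(max_pool_2x2(a), max_pool_2x2(b))
```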
Transformers in Brief
Transformers, in essence, rely on the concept of self-attention. Let’s break that term into two parts: self and attention. Attention is a set of trainable weights that models the importance of each part of an input sentence. Given a sentence as input, the model looks at each word and compares its position with the positions of all the words in the same sentence, including itself, hence the term self-attention. A score is calculated from these positional clues, which is then used to encode the semantics, or meaning, of the sentence more effectively.
Fig 2: Screenshot is taken from Tensor2Tensor implementation of Transformer
In the above example, we can see the attention unit in the transformer comparing the position of the word ‘it’ to every other word in the sentence, including itself. The different colors represent multiple attention units (heads) working independently and simultaneously to find different patterns of these connections. Once the scores are calculated from these comparisons, they are passed through simple feed-forward layers followed by normalization. The attention weights themselves are learned during training.
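The comparison just described, every token scored against every other token, including itself, can be sketched in a few lines of numpy. This is a single attention head with randomly initialized (untrained) weights, purely to show the shape of the computation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project each token into query, key, and value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Score every token's query against every key (including its own),
    # then normalize the scores into attention weights per token pair.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    # Each output token is an attention-weighted mixture of value vectors.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one updated vector per token
```

A real transformer runs several such heads in parallel (the different colors in the figure) and learns the projection matrices during training.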
If you want a more in-depth explanation of self-attention and transformers, be sure to check out Jay Alammar’s excellent blogpost The Illustrated Transformer or you can also check out the original paper Attention Is All You Need by Ashish Vaswani et al.
Fig 3: Screenshot of the model taken from the research paper
Just as regular transformers use words to learn about sentences, the Vision Transformer uses pixels to achieve a similar result for images. However, there is a catch. In contrast to words, individual pixels convey no meaning by themselves, which is one of the reasons vision research shifted to convolutional filters that operate on groups of pixels. Therefore, the authors divide the whole image into small patches, the visual equivalent of words. Each patch is flattened and mapped through a linear projection matrix, then fed into the transformer along with its position in the image, as seen in the figure above. In their implementation, they chose patches of size 16x16, hence the poetic title of their paper.
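The patch-and-project step can be sketched in numpy. The 16x16 patch size is from the paper; the 224x224 input and 768-dim embedding width are assumed for illustration, and all weights here are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
P = 16                                    # patch size from the paper
img = rng.normal(size=(224, 224, 3))      # one RGB image (assumed resolution)
n = 224 // P                              # 14 patches per side

# (224, 224, 3) -> (196, 768): carve out 14x14 patches, flatten each
# 16x16x3 patch into a single 768-value row.
patches = (img.reshape(n, P, n, P, 3)
              .transpose(0, 2, 1, 3, 4)
              .reshape(n * n, P * P * 3))

d_model = 768                             # embedding width (assumed)
W = rng.normal(size=(P * P * 3, d_model)) # stands in for the learned projection
cls = rng.normal(size=(1, d_model))       # stands in for the learned [class] token
pos = rng.normal(size=(n * n + 1, d_model))  # learned position embeddings

# Project patches, prepend the [class] token, add position information.
tokens = np.vstack([cls, patches @ W]) + pos
print(tokens.shape)  # (197, 768): the sequence the encoder consumes
```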
Now, these embedded patches pass through alternating layers of multi-headed self-attention, multi-layer perceptrons (simple feed-forward neural networks), and layer normalization, just as in a regular transformer. A classification head is attached to the end of the transformer encoder to predict the final classes. As with any pre-trained convolutional model, one can take the pre-trained encoder base and attach a custom MLP head to fine-tune the model for a particular classification task.
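One such encoder block can be sketched as follows. This is a simplified single-head version with random weights, ReLU instead of the paper's GELU, and assumed dimensions, meant only to show the layer ordering and residual connections:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's features to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Single-head self-attention (the paper uses multiple heads).
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return (A @ V) @ Wo

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2    # ReLU here; the paper uses GELU

def encoder_block(x, params):
    # Pre-norm transformer block: norm -> attention -> residual,
    # then norm -> MLP -> residual.
    x = x + attention(layer_norm(x), *params["attn"])
    x = x + mlp(layer_norm(x), *params["mlp"])
    return x

rng = np.random.default_rng(0)
d = 64                                   # toy embedding width
params = {
    "attn": [rng.normal(size=(d, d)) * 0.1 for _ in range(4)],
    "mlp":  [rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1],
}
x = rng.normal(size=(197, d))            # [class] token + 196 patch tokens
out = encoder_block(x, params)
print(out.shape)  # same shape in, same shape out, so blocks can be stacked
```

The full model stacks many of these blocks and reads the final [class] token into the classification head.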
Fig 4: What attention looks like in images. A screenshot is taken from the paper
The authors trained this model on various standard datasets such as ImageNet and CIFAR-10/100, as well as JFT-300M, a private dataset owned by Google containing 300 million high-resolution images. Their model achieved approximately the same accuracy as other state-of-the-art convolutional models (even slightly higher in many cases), but with a significantly reduced training time (around 75% less) and fewer hardware resources.
Another advantage of ViT is that it can learn higher-level relationships very early on because it uses global attention instead of local attention: even in the first layer, a patch can attend to patches that are very far away, which a convolutional neural network’s limited receptive field does not allow. Apart from being efficient to train, it also keeps getting better as the training data is increased.
So does this mean that CNNs are outdated and ViT is the new normal? Certainly not! While CNNs have their share of disadvantages, they are still very effective for tasks like object detection and image classification. ResNet and EfficientNet, state-of-the-art convolutional architectures, still reign supreme for many such tasks. However, transformers have been a breakthrough in natural language processing tasks such as language translation, and they show considerable promise in computer vision. Only time will tell what’s in store for us in this ever-evolving field of research.
Thanks for reading this article. If you liked it, don’t forget to share it, and leave a comment below if you have any suggestions.