Transformers in Computer Vision: Farewell Convolutions

Written by Victor Perez

This article aims to introduce/refresh the main ideas behind Transformers and to present the latest advancements in using these models for Computer Vision applications.

After reading this article you will know…

  • … why Transformers outperformed SOTA models in NLP tasks.

  • … how the Transformer model works at a glance.

  • … what the main limitations of convolutional models are.

  • … how Transformers can overcome limitations in convolutional models.

  • … how novel works use Transformers for Computer Vision tasks.

Long-term Dependencies and Efficiency Tradeoffs

In NLP, the goal of neural language models is to create embeddings that encode as much information as possible about the semantics of a word in a text. These semantics are not limited to the definition of a word. In fact, many words are meaningless by themselves if we do not know the context they belong to: in the sentence “Transformers are cool because they are efficient”, an embedding for the word “they” will be meaningless if it does not take into account that it refers to “Transformers”.

Optimal models should be able to represent these dependencies between words even when dealing with large texts where these words can be distant. We say that a model with such ability can encode long-term dependencies.

The following sections present the problems that SOTA NLP models (before Transformers) faced when trying to model long-term dependencies efficiently.

Problems with RNNs

LSTMs and GRUs, in particular, were popular RNNs able to encode rich semantics from the words in a text. They work in a sequential manner, processing one token at a time and keeping a “memory” of the tokens that the model has already seen, in order to add some of their semantics to later words that require them.

These RNNs are able to keep tokens “in memory” thanks to their “gates”, components that use Neural Networks to learn what information should be kept, what information should be dropped and what information should be updated each time a new token is processed (see this post for more insights on this).

The architecture of these models makes them robust to exploding and vanishing gradients, a common problem in RNNs, which enables them to keep track of quite long dependencies between elements in the sequence. However, processing tokens sequentially and relying on keeping their information in memory is not suitable when dependencies are really distant.

This sequential nature also makes them difficult to scale or parallelize efficiently. Each step is conditioned on the model having already processed the previous elements of the sequence, i.e., only one embedding can be computed at a time.
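The sequential bottleneck described above can be sketched with a plain tanh RNN. This is a simplification: LSTMs and GRUs add gates on top of this loop, and the names `rnn_forward`, `w_in` and `w_rec` are hypothetical, chosen only for illustration.

```python
import numpy as np

def rnn_forward(tokens, w_in, w_rec):
    """Run a plain tanh RNN over `tokens` (shape: steps x features)."""
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for token in tokens:
        # Step t needs the hidden state of step t-1, so the loop
        # cannot be run in parallel over the sequence.
        hidden = np.tanh(token @ w_in + hidden @ w_rec)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))        # 6 tokens with 4 features each
w_in = rng.normal(size=(4, 8)) * 0.1    # input-to-hidden weights
w_rec = rng.normal(size=(8, 8)) * 0.1   # hidden-to-hidden weights

states = rnn_forward(tokens, w_in, w_rec)
print(states.shape)  # (6, 8)
```

Information from the first token can only reach the last embedding by surviving every intermediate update of `hidden`, which is why very distant dependencies are hard to keep.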

Problems with CNNs

Convolutions have also been popular in NLP tasks due to their efficiency and scalability when trained using GPUs. In the same way that 2D convolutions can extract features from an image, these models use 1D filters to extract information from texts, which are represented as a 1D sequence.

The receptive field of these kinds of CNNs depends on the size of their filters and the number of convolutional layers used. Increasing the value of these hyperparameters increases the complexity of the model, which can produce vanishing gradients or even models that are impossible to train. Residual connections and dilated convolutions have also been used to increase the receptive field of these models, but the way convolutions operate over texts always presents limitations and tradeoffs in the receptive field that they can capture.
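To make this tradeoff concrete, the following sketch computes the receptive field of a stack of 1D convolutions. It assumes stride-1 layers, for which each layer adds (kernel size - 1) x dilation tokens to the receptive field; `receptive_field` is a hypothetical helper, not part of any library.

```python
def receptive_field(kernel_sizes, dilations=None):
    """Receptive field (in tokens) of stacked stride-1 1D convolutions."""
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    field = 1
    for kernel, dilation in zip(kernel_sizes, dilations):
        # Each layer widens the field by (kernel - 1) * dilation tokens.
        field += (kernel - 1) * dilation
    return field

print(receptive_field([3, 3, 3, 3]))                # 9
print(receptive_field([3, 3, 3, 3], [1, 2, 4, 8]))  # 31
```

Four regular layers with kernels of size 3 only see 9 tokens at once. Dilations help, but covering a long text still requires either many layers or very large kernels.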


Transformers appeared in 2017 as a simple and scalable way to obtain SOTA results in language translation. They were soon applied to other NLP tasks, setting a new SOTA on several benchmarks (such as GLUE, SQuAD or SWAG).

It is common to train large versions of these models and fine-tune them for different tasks, so they are useful even when data is scarce. The performance of these models, even with billions of parameters, does not seem to saturate. The larger the model, the more accurate the results and the more interesting the emergent knowledge that the model exhibits (see GPT-3).

The Transformer model

Given an input text with N words, for each word (W) Transformers create N weights, one for every word (Wn) in the text. The value of each weight depends on how much the word in the context (Wn) is needed to represent the semantics of the word that is being processed (W). The following image represents this idea. Note that the transparency of the blue lines represents the value of the assigned attention weights.

Image by Author

Here, the top row represents the words that are being processed and the lower row represents the words used as context (note that the words are the same, but they are treated differently depending on whether they are being processed or used to process another word). See that “they”, “cool” or “efficient” in the top row have high weights pointing to “Transformers”, since that is indeed the word that they are referencing.

These weights are then used to combine the values of the words in the context and produce an updated embedding for each word (W) that now contains information about the words (Wn) in the context that are important for that word in particular (W).

Under the hood, in order to compute these updated embeddings, transformers use self-attention, a highly efficient technique that makes it possible to update the embeddings of every word in the input text in parallel.


Starting with the embedding of a word (W) from the input text, we need to find a way to measure the importance of every other word embedding (Wn) in the same text (importance with respect to W) and to merge their information to create an updated embedding (W’).

Self-attention linearly projects each word embedding in the input text into three different spaces, producing three new representations known as query, key, and value. These new embeddings are used to obtain a score that represents the dependency between W and every Wn (high positive scores if W depends on Wn and high negative scores if W is not correlated with Wn). This score is then used to combine information from the different Wn word embeddings, creating the updated embedding W’ for the word W.

The following diagram shows how attention scores would be computed between two words:

Image by Author

In the image, blue lines represent the flow of the information from the first word (W) and brown lines the flow of information of the second (Wn).

Each word embedding gets multiplied by a Key and a Query matrix, resulting in the query and key representations of each word. To compute the score between W and Wn, the query embedding of W (W_q) is “sent” to the key embedding of Wn (Wn_k) and both tensors are multiplied (using the dot product). The resulting value of the dot product is the score between them, and it represents how dependent W is on Wn.

Note that we can use the second word as W and the first word as Wn as well. That way we would compute a score that represents how much the second word depends on the first. We can even use the same word as W and Wn to compute how important the word itself is for its own definition!

Self-attention will compute attention scores between every pair of words in the text. The scores will be softmaxed, converting them into weights with a range between 0 and 1.

The following diagram represents how the final word embedding for each word is obtained using these weights:

Image by Author

See that each word embedding is now multiplied by a third matrix, generating its value representation. This tensor is used to compute the final embedding of each word. For each word W, the weights computed for every word Wn in the text are multiplied by their corresponding value representations (Wn_v) and the results are added together. The result of this weighted sum is the updated embedding of the word W (represented as e1 and e2 in the diagram).
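The steps above (query, key and value projections, dot-product scores, softmax and weighted sum) can be sketched in a few lines of NumPy. This is a minimal single-head sketch with random matrices in place of learned ones. The division by the square root of the key size is an assumption borrowed from the common "scaled dot-product" variant and is not described in the text above; all names are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: every row becomes weights in [0, 1] that sum to 1."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(embeddings, w_query, w_key, w_value):
    """Update every embedding with a weighted sum of all value vectors."""
    queries = embeddings @ w_query   # one query per word (W_q)
    keys = embeddings @ w_key        # one key per word (Wn_k)
    values = embeddings @ w_value    # one value per word (Wn_v)
    # Dot product between every query and every key -> N x N scores.
    # The scaling factor is an assumption (see the paragraph above).
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores)
    return weights @ values, weights

rng = np.random.default_rng(0)
num_words, dim = 5, 8
embeddings = rng.normal(size=(num_words, dim))
w_query, w_key, w_value = (rng.normal(size=(dim, dim)) for _ in range(3))

updated, weights = self_attention(embeddings, w_query, w_key, w_value)
print(updated.shape)         # (5, 8)
print(weights.sum(axis=-1))  # every row sums to 1
```

Note that all N x N scores come out of a single matrix product, which is what allows every embedding to be updated in parallel.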

For readers who want a more in-depth understanding of self-attention and of the Transformer model, I recommend taking a look at the great post by Jay Alammar, since this was a very shallow description of the most important parts of this technique.

Convolutional Inductive Biases

Convolutional models have dominated the field of Computer Vision for years with tremendous success. Convolutions can be efficiently parallelized using GPUs and they provide suitable inductive biases when extracting features from images. Convolutional operations impose two important spatial constraints that facilitate the learning of visual features:

  • Thanks to weight sharing, the features extracted from a convolutional layer are translation invariant: they are not sensitive to the global position of a feature and instead determine whether the feature is present or not.

  • Thanks to the nature of the convolutional operator, the features extracted from a convolutional layer are locality sensitive: every operation only takes into account a local region of the image.
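Both constraints can be checked on a toy example. The sketch below uses a 1D signal and a hand-picked kernel instead of an image and a learned filter, which is an assumption made to keep the example short.

```python
import numpy as np

kernel = np.array([1.0, -2.0, 1.0])  # the same weights are reused at every position
signal = np.zeros(12)
signal[3:6] = [1.0, 3.0, 2.0]        # a small "feature" placed at positions 3 to 5

response = np.correlate(signal, kernel, mode="same")
shifted_response = np.correlate(np.roll(signal, 4), kernel, mode="same")

# Weight sharing: moving the feature moves the response with it, and the
# strongest activation (is the feature present?) stays the same.
print(np.allclose(np.roll(response, 4), shifted_response))  # True
print(np.isclose(np.abs(response).max(), np.abs(shifted_response).max()))  # True

# Locality: changing a distant input value leaves nearby outputs untouched.
distant = signal.copy()
distant[11] = 5.0
distant_response = np.correlate(distant, kernel, mode="same")
print(np.allclose(response[:10], distant_response[:10]))  # True
```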

Convolutional inductive biases, though, lack a global understanding of the image itself. They are great at extracting visual features but they are not able to model the dependencies between them.

For example, a convolutional layer of a model trained to recognize faces can encode information about whether the features “eyes”, “nose” or “mouth” are present in the input image, but these representations will not contain relations such as “eyes above nose” or “mouth below nose”, because each convolutional kernel will not be large enough to process several of these features at once.

Large receptive fields are required in order to track long-range dependencies within an image, which in practice involves using large kernels or long sequences of convolutional layers at the cost of losing efficiency and making the model extremely complex, even impossible to train.

Does this tradeoff sound familiar? :)

Transformers in Computer Vision

In the same way that Transformers leverage self-attention to model long-range dependencies in a text, novel works have presented techniques that use self-attention to overcome the limitations of convolutional inductive biases in an efficient way. These works have already shown promising results in multiple Computer Vision benchmarks in fields such as Object Detection, Video Classification, Image Classification, and Image Generation. Some of these architectures are able to match or outperform SOTA results even when getting rid of convolutional layers and relying solely on self-attention.

The visual representations generated from self-attention components do not contain the spatial constraints imposed by convolutions. Instead, they are able to learn the most suitable inductive biases depending on the task and on the stage where the layer is placed within the pipeline. It has been shown how self-attention used in the early stages of a model can learn to behave similarly to a convolution.

This great tweet by Gabriel Ilharco shares a good selection of recent work in this field.

Self-attention layers

Self-attention layers in Computer Vision take a feature map as input. The goal is to compute attention weights between every pair of features, resulting in an updated feature map where each position has information about any other feature within the same image. These layers can directly replace convolutions or be combined with them, and they are able to attend to a larger receptive field than regular convolutions, hence being able to model dependencies between spatially distant features.

The most basic approach (used by Non-local Networks and Attention Augmented Convolutional Networks) consists of flattening the spatial dimensions of the input feature map into a sequence of features with shape HW x F, where HW represents the flattened spatial dimensions and F the depth of the feature map, and applying self-attention directly over the sequence to obtain the updated representations.

The computational cost of this self-attention layer can be prohibitive for high-resolution inputs, since the number of attention scores grows quadratically with HW, so it is only suitable for small spatial dimensions. Some works have already presented ways to overcome this problem, such as Axial-DeepLab, which computes attention along the two spatial axes sequentially instead of dealing directly with the whole image, making the operation more efficient. Other, simpler solutions include processing patches of the feature map instead of the whole spatial dimensions, at the cost of having smaller receptive fields (this is done in Stand-Alone Self-Attention in Vision Models). These smaller receptive fields, though, can still be much larger than convolutional kernels.

Models that combine these kinds of layers with convolutional layers obtain their best results when self-attention is used in the later layers of the model. In fact, in On the Relationship between Self-Attention and Convolutional Layers, it is shown that the inductive biases learned by self-attention layers used early in the model resemble the ones that convolutions have by default.
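The flattening step and the cost problem described above can be illustrated with a short sketch. The counts below only consider the number of attention scores. The axial count assumes that every position attends to its own row and its own column, which is a simplified reading of attending along the two spatial axes; `attention_score_counts` is a hypothetical helper.

```python
import numpy as np

def attention_score_counts(height, width):
    """Number of attention scores for full vs. axial self-attention."""
    positions = height * width
    full = positions * positions          # every position attends to every position
    axial = positions * (height + width)  # every position attends to its row and column
    return full, axial

feature_map = np.random.default_rng(0).normal(size=(14, 14, 32))  # H x W x F
sequence = feature_map.reshape(-1, feature_map.shape[-1])         # HW x F
print(sequence.shape)                     # (196, 32)

print(attention_score_counts(14, 14))     # (38416, 5488)
print(attention_score_counts(128, 128))   # (268435456, 4194304)
```

For a 14 x 14 feature map the full attention matrix is still small, but at 128 x 128 it already holds hundreds of millions of scores per layer, which is why full self-attention is usually reserved for low-resolution feature maps.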

Vision Transformers

Instead of including self-attention within convolutional pipelines, other works have proposed relying solely on self-attention layers and leveraging the original encoder-decoder architecture presented for Transformers, adapting it to Computer Vision tasks. When using a large number of parameters and when trained with lots of data, these models produce results similar to or better than SOTA in tasks such as Image Classification or Object Detection, with models that are much simpler and faster to train.

The following is a quick summary of three important papers that use this Transformer architecture for Computer Vision tasks:

1) Image Transformer

This work presented a new SOTA for Image Generation on ImageNet and showed great results on super-resolution tasks.

They propose to treat Image Generation as an autoregressive problem where each new pixel is generated by only taking into account previously known pixel values within the image. In each feature generation, self-attention takes into account a flattened patch of m features as context and produces a representation for the unknown pixel value. In order for these pixel values to be suitable as input for self-attention layers, each RGB value is converted into a tensor of d dimensions using 1D convolutions, and the m features of the context patch are flattened to be 1 dimensional. The following image represents the proposed model:

Self-attention architecture from Figure 1, Section 3.2 in the original paper.

Here, q represents the pixel embedding to be updated. It gets multiplied with all the other embeddings from pixels in memory (represented as m) using query and key matrices (Wq and Wk), generating a score that is then softmaxed and used as weights for the sum of the value vectors obtained with the matrix Wv. The resulting embedding is added to the original q embedding to obtain the final result.

In this figure, p represents the positional encodings added to each input embedding. This encoding is generated from the coordinates of each pixel.

Note that by using self-attention, multiple pixel values can be predicted at once and in parallel (since we already know the original pixel values of the input image), and the patch used to compute self-attention can handle a larger receptive field than a convolutional layer. At evaluation time, though, image generation depends on each pixel having the values of its neighbors available, so it can only be performed one step at a time.

2) DETR

DEtection TRansformer presents a simple model that achieves accuracy and performance on par with SOTA Object Detection methods. The structure of the proposed model can be seen in the image below:

DETR architecture from figure 2, section 3.2 in the original paper.

It uses self-attention with visual features extracted from a convolutional backbone. The feature maps computed in the backbone module are flattened over their spatial dimensions, i.e., if the feature map has shape (h x w x d), the flattened result will have shape (hw x d). A learnable positional encoding is added to each element of the flattened sequence and the resulting sequence is fed into the encoder.

The encoder uses multiple self-attention blocks to combine the information between the different embeddings. The processed embeddings are passed to a decoder module that uses learnable embeddings as queries (object queries), which are able to attend to all the computed visual features. For each object query, the decoder generates an output embedding that encodes all the information needed to perform the object detection. Each output is fed into a fully connected layer that outputs a five-dimensional tensor with elements c and b, where c represents the predicted class for that element and b the coordinates of the bounding box (1D and 4D respectively). The value of c can be assigned to a “no object” token, which represents an object query that did not find any meaningful detection, in which case the coordinates are not taken into account.

This model is able to compute multiple detections for a single image in parallel. The number of objects that it can detect, though, is limited to the number of object queries used. The authors of the paper claim that the model outperforms SOTA models in images with large objects. They hypothesize that this is due to the larger receptive field that self-attention provides to the model.
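The way the per-query outputs turn into detections can be sketched as follows. The text describes c as a single value. This sketch assumes, as is common for classifiers, that c is read as the argmax over per-class scores that include one extra “no object” entry. The box format (center x, center y, width, height) and the function name `decode_detections` are also assumptions made for illustration.

```python
import numpy as np

def decode_detections(class_logits, boxes, no_object_index):
    """Keep only the object queries whose top class is not 'no object'."""
    classes = class_logits.argmax(axis=-1)  # c: one predicted class per query
    keep = classes != no_object_index       # drop queries that found nothing
    return classes[keep], boxes[keep]

# 4 object queries, 3 real classes plus a "no object" slot at index 3.
class_logits = np.array([
    [2.0, 0.1, 0.3, 0.5],
    [0.2, 0.1, 0.3, 3.0],
    [0.1, 0.2, 2.5, 0.4],
    [0.0, 0.3, 0.1, 1.7],
])
# b: 4 box coordinates per query (hypothetical normalized values).
boxes = np.array([
    [0.50, 0.50, 0.20, 0.30],
    [0.10, 0.10, 0.05, 0.05],
    [0.70, 0.40, 0.25, 0.50],
    [0.90, 0.90, 0.10, 0.10],
])

classes, kept_boxes = decode_detections(class_logits, boxes, no_object_index=3)
print(classes)           # [0 2]
print(kept_boxes.shape)  # (2, 4)
```

With 4 object queries the model can report at most 4 objects, which is the limitation mentioned above.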

3) Vision Transformer (ViT)

This work reports SOTA-level results on Image Recognition with a model that relies fully on self-attention. The following is a representation of the presented model:

ViT architecture from figure 1, section 3.1 in the original paper.

The input sequence consists of flattened vectors of pixel values, each extracted from a patch of size PxP. Each flattened element is fed into a linear projection layer that produces what they call the “patch embeddings”. An extra learnable embedding is attached to the beginning of the sequence. This embedding, after being updated by self-attention, will be used to predict the class of the input image. A learnable positional embedding is also added to each of these embeddings.
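The construction of this input sequence can be sketched with NumPy. Random values stand in for the learned projection, class embedding and positional embeddings, and the sizes (a 32 x 32 image, P = 8, 64-dimensional embeddings) are hypothetical values picked to keep the example small.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an H x W x C image into flattened, non-overlapping P x P patches."""
    height, width, channels = image.shape
    assert height % patch_size == 0 and width % patch_size == 0
    rows, cols = height // patch_size, width // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (rows, cols, P, P, C)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
patch_size, dim = 8, 64

patches = patchify(image, patch_size)                       # (16, 192)
w_proj = rng.normal(size=(patches.shape[1], dim)) * 0.02    # linear projection
class_token = np.zeros((1, dim))                            # extra learnable embedding
pos_embedding = rng.normal(size=(patches.shape[0] + 1, dim)) * 0.02

patch_embeddings = patches @ w_proj
tokens = np.concatenate([class_token, patch_embeddings], axis=0) + pos_embedding
print(patches.shape)  # (16, 192)
print(tokens.shape)   # (17, 64)
```

The first row of `tokens` is the position whose updated embedding is later used to predict the class of the image.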

The classification is performed by just stacking an MLP Head on top of the Transformer, at the position of the extra learnable embedding that we added to the sequence. A hybrid architecture is also presented in this work. Instead of using projected image patches as input to the transformer, they use feature maps from the early stages of a ResNet. By training Transformers and this CNN backbone end-to-end, they achieve their best performances.

Positional encodings

Since Transformers need to learn the inductive biases for the task they are being trained for, it is always beneficial to help that learning process by all means. Any inductive bias that we can include in the inputs of the model will facilitate its learning and improve the results.

When updating features with Transformers, the order of the input sequence is lost. This order is difficult or even impossible for the Transformer to learn by itself, so a positional representation is added to the input embeddings of the model. This positional encoding can be learned or it can be sampled from a fixed function, and the place where it is added can vary, although it is usually added just to the input embeddings, right before they are fed into the model.

In Computer Vision, these embeddings can represent either the position of a feature in a 1-dimensional flattened sequence or the 2-dimensional position of a feature.
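As an example of a positional encoding sampled from a fixed function, the sketch below builds the sinusoidal encoding used in the original Transformer paper. The text above does not specify which function is used, so this choice is an assumption, and the sizes are hypothetical.

```python
import numpy as np

def sinusoidal_encoding(num_positions, dim):
    """Fixed sine/cosine positional encoding (dim is assumed to be even)."""
    positions = np.arange(num_positions)[:, None]                 # (N, 1)
    frequencies = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))   # (dim / 2,)
    angles = positions * frequencies                              # (N, dim / 2)
    encoding = np.zeros((num_positions, dim))
    encoding[:, 0::2] = np.sin(angles)  # even channels
    encoding[:, 1::2] = np.cos(angles)  # odd channels
    return encoding

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 16))                 # 10 input embeddings
inputs = embeddings + sinusoidal_encoding(10, 16)      # added before the first layer
print(inputs.shape)  # (10, 16)
```

Every position receives a different vector, so two identical embeddings placed at different positions no longer look the same to the model.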

In this field, relative positional encodings have been found to work really well. They consist of learnable embeddings that learn to encode relative distances between features instead of encoding their global positions.
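A minimal sketch of this idea for a 2-dimensional grid of features is shown below. It assumes one embedding table per axis, indexed by the row and column offsets between two features, and it sums both lookups. Published models differ in how they use these embeddings, so this is only one possible arrangement, and all names are hypothetical.

```python
import numpy as np

def relative_offsets(height, width):
    """Row and column offsets between every pair of positions in an H x W grid."""
    rows, cols = np.divmod(np.arange(height * width), width)
    d_row = rows[:, None] - rows[None, :]  # values in [-(H - 1), H - 1]
    d_col = cols[:, None] - cols[None, :]  # values in [-(W - 1), W - 1]
    return d_row, d_col

height, width, dim = 3, 4, 8
rng = np.random.default_rng(0)
row_table = rng.normal(size=(2 * height - 1, dim))  # one embedding per row offset
col_table = rng.normal(size=(2 * width - 1, dim))   # one embedding per column offset

d_row, d_col = relative_offsets(height, width)
# Shift the offsets so that they become valid, non-negative table indices.
relative = row_table[d_row + height - 1] + col_table[d_col + width - 1]
print(relative.shape)  # (12, 12, 8)

# Pairs separated by the same offset share the same embedding,
# regardless of where they are in the grid.
print(np.allclose(relative[0, 5], relative[6, 11]))  # True
```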


Conclusions

  • Transformers solve a problem that is not limited to NLP: long-term dependencies are also important for Computer Vision tasks.

  • The Transformer model is a simple yet scalable approach that can be applied to any kind of data, as long as it is modeled as a sequence of embeddings.

  • Convolutions are translation invariant, locality-sensitive, and lack a global understanding of images.

  • Transformers can be used in convolutional pipelines to produce global representations of images.

  • Transformers can be used for Computer Vision, even when getting rid of regular convolutional pipelines, producing SOTA results.