Written by Satvik Tripathi, Founder & Head, Techvik
The arrival of powerful and adaptable deep learning frameworks in the past few years has made implementing convolution layers in a deep learning model an extremely simple task, often achievable in just a few lines of code.
However, understanding convolutions, especially for the very first time, can often seem a bit intricate, with terms like kernels, filters, channels, and so on all stacked onto each other. Yet convolutions as a concept are fascinatingly powerful and highly extensible. In this post, I’ll break down the mechanism of the convolution operation step by step, relate it to the standard fully connected network, and examine how convolutions build up a strong visual hierarchy, making them a compelling feature extractor for images.
2D Convolutions: The Operation
The 2D convolution is a surprisingly straightforward operation at its root: you begin with a kernel, which is simply a small matrix of weights. This kernel "slides" over the 2D input data, performing an element-wise multiplication with the part of the input it is currently on, and then summing the results into a single output pixel.
The kernel repeats this process for every location it slides over, transforming a 2D matrix of features into another 2D matrix of features. The output features are essentially the weighted sums (with the weights being the kernel values themselves) of the input features located at roughly the same position as the output pixel on the input layer.
Whether or not an input feature falls within this "roughly the same place" is directly determined by whether or not it is in the kernel region where the output was generated. This means that the size of the kernel explicitly dictates how many (or few) input features are combined in the creation of a new output feature.
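To make the sliding process concrete, here is a minimal NumPy sketch of the operation described above, with no padding and a stride of 1. The function name conv2d and the toy 5×5 input are my own choices for illustration. Note that, like most deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the weighted sums."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The patch of the input the kernel is currently sitting on.
            patch = image[i:i + kh, j:j + kw]
            # Element-wise multiply, then sum into a single output pixel.
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 input
kernel = np.ones((3, 3))                          # toy 3x3 kernel
output = conv2d(image, kernel)
print(output.shape)  # (3, 3)
```

Each output pixel depends only on the 3×3 patch under the kernel, which is exactly the "roughly the same place" behavior described above.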
This is all in pretty stark contrast to a fully connected layer. In the example above, we have 5×5=25 input features and 3×3=9 output features. If this were a standard fully connected layer, you would have a weight matrix of 25×9 = 225 parameters, with every output feature being the weighted sum of every input feature. Convolutions allow us to do this transformation with only 9 parameters, with each output feature, instead of "looking" at every input feature, only "looking" at input features coming from roughly the same place. Take note of this, as it will be crucial to our further discussion.
Some commonly used techniques
Before we go on, it's certainly worth looking at two techniques that are commonplace in convolution layers: Padding and Strides.
Padding: If you look at the animation above, you'll observe that during the sliding process, the edges are effectively trimmed off, converting the 5×5 feature matrix to a 3×3 one. The pixels on the edge are never at the center of the kernel, because there is nothing for the kernel to extend to beyond the edge. This is not ideal, as we often want the output size to equal the input size.
Padding does something pretty clever to fix this: it pads the edges with extra, "fake" pixels (usually of value 0, hence the often-used term "zero padding"). This way, as the kernel slides, the original edge pixels can sit at its center while it extends onto the fake pixels beyond the edge, producing an output of the same size as the input.
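As a small illustration of zero padding, the sketch below pads a toy 5×5 input with a one-pixel border using np.pad and then slides a 3×3 kernel over it. The input and kernel values are made up for the example.

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 input
kernel = np.ones((3, 3))                          # toy 3x3 kernel

# Surround the image with a one-pixel border of zeros: 5x5 -> 7x7.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

# Slide the 3x3 kernel over the padded image (stride 1).
out_size = padded.shape[0] - kernel.shape[0] + 1
output = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        output[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)

print(padded.shape)  # (7, 7)
print(output.shape)  # (5, 5)
```

With the border in place, the original corner pixel sits at the center of the kernel for the first output position, and the output keeps the 5×5 size of the input.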
Striding: Sometimes when running a convolution layer, you want an output smaller than the input. This is common in convolutional neural networks, where the size of the spatial dimensions is reduced as the number of channels is increased. One way to achieve this is with a pooling layer (e.g. taking the average/max of every 2×2 grid to halve each spatial dimension). Another way is to use a stride:
The idea of the stride is to skip some of the slide locations of the kernel. A stride of 1 means picking slides one pixel apart, so essentially every single slide, acting as a standard convolution. A stride of 2 means picking slides 2 pixels apart, skipping every other slide in the process and downsizing by roughly a factor of 2. A stride of 3 means skipping every 2 slides, downsizing by roughly a factor of 3, and so on.
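Here is a rough NumPy sketch of a strided convolution, assuming no padding. The helper name conv2d_strided and the 7×7 toy input are hypothetical. It also checks that a stride of 2 gives the same result as running a standard convolution and keeping every other output.

```python
import numpy as np

def conv2d_strided(image, kernel, stride=1):
    """Convolution without padding, visiting kernel positions `stride` pixels apart."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the current slide
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))

full = conv2d_strided(image, kernel, stride=1)     # every slide
strided = conv2d_strided(image, kernel, stride=2)  # every other slide

print(full.shape)     # (5, 5)
print(strided.shape)  # (3, 3)
print(np.array_equal(strided, full[::2, ::2]))  # True
```

The output size along each dimension follows floor((n − k) / s) + 1 for input size n, kernel size k and stride s, which is where the "roughly a factor of 2" comes from.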
More modern networks, such as the ResNet architectures, totally forgo pooling layers in their internal layers, in favor of strided convolutions when they need to reduce their output sizes.
The multi-channel version
Of course, the diagrams above only deal with the case where the image has a single input channel. In practice, most input images have 3 channels, and that number only increases the deeper you go into a network. It's pretty easy to think of channels, in general, as each being a "view" of the image as a whole, emphasizing some aspects and de-emphasizing others.
Most of the time, we deal with RGB images with three channels. (Credit: Andre Mouton)
This is where a key distinction between terms comes in handy: whereas in the 1-channel case the terms filter and kernel are interchangeable, in the general case they are actually quite different. Each filter is in fact a collection of kernels, with one kernel for every input channel to the layer, and each kernel being unique.
Each filter in a convolution layer generates one and only one output channel, and they do it as follows:
Each kernel of the filter “slides” over its respective input channel, generating a processed version of that channel. Some kernels may have stronger weights than others, to give more emphasis to certain input channels (e.g. a filter may have a red-channel kernel with stronger weights than the others, and hence respond more to differences in the red channel features than in the others).
Each of the per-channel processed versions is then added together to create one channel. The kernels of a filter each generate one unique version of each channel, and the filter as a whole generates one overall output channel.
Finally, there’s the bias term. Each output filter has exactly one bias term, which gets added to the output channel computed so far to produce the final output channel.
And with the single filter case down, the case for any number of filters is identical: each filter processes the input with its own, different set of kernels and a scalar bias, using the process described above, to produce a single output channel. The channels are then concatenated together to produce the overall output, with the number of output channels being the number of filters. A nonlinearity is then normally applied before passing the result as input to another convolution layer, which then repeats this process.
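The whole multi-channel procedure can be sketched in a few lines of NumPy. This is only an illustrative version under my own assumptions: the function name conv_layer, the (channels, height, width) layout and the random toy values are not from any particular framework, and there is no padding or striding.

```python
import numpy as np

def conv_layer(x, filters, biases):
    """x: (C_in, H, W); filters: (C_out, C_in, kH, kW); biases: (C_out,)."""
    c_out, c_in, kh, kw = filters.shape
    out_h = x.shape[1] - kh + 1
    out_w = x.shape[2] - kw + 1
    out = np.zeros((c_out, out_h, out_w))
    for f in range(c_out):  # one output channel per filter
        for i in range(out_h):
            for j in range(out_w):
                patch = x[:, i:i + kh, j:j + kw]  # same window in every input channel
                # Summing over all axes applies each kernel to its own channel
                # and adds the per-channel results together; then add the bias.
                out[f, i, j] = np.sum(patch * filters[f]) + biases[f]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 5, 5))           # e.g. an RGB input, 5x5
filters = rng.normal(size=(4, 3, 3, 3))  # 4 filters, each with 3 kernels of 3x3
biases = rng.normal(size=4)              # one scalar bias per filter

y = conv_layer(x, filters, biases)
print(y.shape)  # (4, 3, 3)
```

Four filters give four output channels, regardless of how many input channels there were.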
2D Convolutions: The Intuition
Convolutions are still linear transforms
Even with the mechanics of the convolution layer down, it can still be hard to relate it back to a standard feed-forward network, and it still doesn’t explain why convolutions scale to and work so much better for image data.
Suppose we have a 4×4 input, and we want to transform it into a 2×2 grid. If we were using a feedforward network, we’d reshape the 4×4 input into a vector of length 16, and pass it through a densely connected layer with 16 inputs and 4 outputs. One could visualize the weight matrix W for a layer:
All in all, some 64 parameters
And although the convolution kernel operation may seem a bit strange at first, it is still a linear transformation with an equivalent transformation matrix. If we were to use a kernel K of size 3 on the reshaped 4×4 input to get a 2×2 output, the equivalent transformation matrix would be:
There’s really just 9 parameters here.
(Note: while the above matrix is an equivalent transformation matrix, the actual operation is usually implemented as a very different matrix multiplication)
The convolution then, as a whole, is still a linear transformation, but at the same time, it’s also a dramatically different kind of transformation. For a matrix with 64 elements, there are just 9 parameters that themselves are reused several times. Each output node only gets to see a select number of inputs (the ones inside the kernel). There is no interaction with any of the other inputs, as the weights to them are set to 0.
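To see this equivalence in numbers, the sketch below builds the 4×16 transformation matrix for a 3×3 kernel on a flattened 4×4 input and checks that it gives the same 2×2 output as sliding the kernel directly. The helper conv_matrix and the kernel values 1 to 9 are made up for illustration.

```python
import numpy as np

def conv_matrix(kernel, in_size):
    """Build the dense matrix equivalent to a no-padding, stride-1 convolution."""
    k = kernel.shape[0]
    out_size = in_size - k + 1
    W = np.zeros((out_size * out_size, in_size * in_size))
    for i in range(out_size):
        for j in range(out_size):
            row = i * out_size + j  # index of the output pixel
            for a in range(k):
                for b in range(k):
                    col = (i + a) * in_size + (j + b)  # index of the input pixel
                    W[row, col] = kernel[a, b]
    return W

kernel = np.arange(1, 10, dtype=float).reshape(3, 3)
W = conv_matrix(kernel, in_size=4)

x = np.arange(16, dtype=float).reshape(4, 4)
y_matrix = (W @ x.reshape(-1)).reshape(2, 2)
y_direct = np.array([[np.sum(x[i:i + 3, j:j + 3] * kernel) for j in range(2)]
                     for i in range(2)])

print(W.shape)                         # (4, 16)
print(np.count_nonzero(W))             # 36
print(np.allclose(y_matrix, y_direct)) # True
```

Of the 64 entries in W, 28 are fixed at zero and the remaining 36 are just the same 9 kernel values repeated once per output pixel.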
It’s useful to see the convolution operation as a hard prior on the weight matrix. In this context, by prior, I mean predefined network parameters. For example, when you use a pre-trained model for image classification, you use the pre-trained network parameters as your prior, as a feature extractor for your final densely connected layer.
In that sense, there’s a direct parallel in why both are so efficient (compared to their alternatives). Transfer learning is more efficient by orders of magnitude than random initialization, because you only really need to optimize the parameters of the final fully connected layer, which means you can get fantastic performance with only a few dozen images per class.
Here, you don’t need to optimize all 64 parameters, because we set most of them to zero (and they’ll stay that way), and the rest we convert to shared parameters, resulting in only 9 actual parameters to optimize. This efficiency matters, because when you move from the 784 inputs of MNIST to real-world 224×224×3 images, that's over 150,000 inputs. A dense layer attempting to halve the input to 75,000 outputs would still require over 10 billion parameters. For comparison, the entirety of ResNet-50 has some 25 million parameters.
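The arithmetic behind those numbers is quick to check. The variable names below are just for illustration, and biases are ignored:

```python
mnist_inputs = 28 * 28            # 784
image_inputs = 224 * 224 * 3      # 150528
dense_units = 75_000              # roughly half of the inputs

# One weight for every (input, output) pair in a dense layer.
dense_params = image_inputs * dense_units

print(mnist_inputs)   # 784
print(image_inputs)   # 150528
print(dense_params)   # 11289600000
```

That is about 11.3 billion weights for a single dense layer.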
So fixing some parameters to 0, and tying parameters increases efficiency, but unlike the transfer learning case, where we know the prior is good because it works on a large general set of images, how do we know this is any good?
The answer lies in the feature combinations the prior leads the parameters to learn.
Early on in this article, we discussed that:
Kernels combine pixels only from a small, local area to form an output. That is, the output feature only “sees” input features from a small local area.
The kernel is applied globally across the whole image to produce a matrix of outputs.
So with backpropagation coming in all the way from the classification nodes of the network, the kernels have the interesting task of learning weights to produce features only from a set of local inputs. Additionally, because the kernel itself is applied across the entire image, the features the kernel learns must be general enough to come from any part of the image.
If this were any other kind of data, e.g. categorical data of app installs, this would’ve been a disaster, for just because your number of app installs and app type columns are next to each other doesn’t mean they have any “local, shared features” in common with app install dates and time used. Sure, the four may have an underlying higher-level feature (e.g. which apps people want most) that can be found, but that gives us no reason to believe the parameters for the first two are exactly the same as the parameters for the latter two. The four could’ve been in any (consistent) order and still be valid!
Pixels, however, always appear in a consistent order, and nearby pixels influence a pixel: e.g. if all nearby pixels are red, it’s pretty likely the pixel is also red. If there are deviations, that’s an interesting anomaly that could be converted into a feature, and all this can be detected by comparing a pixel with its neighbors, the other pixels in its locality.
And this idea is really what a lot of earlier computer vision feature extraction methods were based around. For instance, for edge detection, one can use a Sobel edge detection filter, a kernel with fixed parameters, operating just like the standard one-channel convolution:
Applying a vertical edge detector kernel
For a grid containing no edge (e.g. the background sky), most of the pixels have the same value, so the overall output of the kernel at that point is 0. For a grid with a vertical edge, there is a difference between the pixels to the left and right of the edge, and the kernel computes that difference to be non-zero, activating and revealing the edge. The kernel only works on a 3×3 grid at a time, detecting anomalies on a local scale, yet when applied across the entire image, it is enough to detect a certain feature on a global scale, anywhere in the image!
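As a small sketch of this behavior, the code below applies the standard horizontal-gradient Sobel kernel, which responds to vertical edges, to a made-up image that is dark on the left and bright on the right. The helper conv2d and the toy image are my own. As in most deep learning code, the kernel is not flipped.

```python
import numpy as np

# Sobel kernel for horizontal gradients, i.e. a vertical edge detector.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def conv2d(image, kernel):
    """Standard one-channel convolution: no padding, stride 1, no kernel flip."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: columns 0-2 are dark (0), columns 3-5 are bright (1).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

response = conv2d(image, sobel_x)
print(response)
# [[0. 4. 4. 0.]
#  [0. 4. 4. 0.]
#  [0. 4. 4. 0.]]
```

The flat regions on either side produce 0, while every window that straddles the edge produces a non-zero response, wherever along the edge it sits.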
So the key difference we make with deep learning is to ask this question: Can useful kernels be learned? For early layers operating on raw pixels, we could reasonably expect feature detectors of fairly low-level features, like edges, lines, etc.
There’s an entire branch of deep learning research focused on making neural network models interpretable. One of the most powerful tools to come out of that is Feature Visualization using optimization. The idea at core is simple: optimize an image (usually initialized with random noise) to activate a filter as strongly as possible. This does make intuitive sense: if the optimized image is completely filled with edges, that’s strong evidence that’s what the filter itself is looking for and is activated by. Using this, we can peek into the learned filters, and the results are stunning:
Feature visualization for 3 different channels from the 1st convolution layer of GoogLeNet.
Notice that while they detect different types of edges, they’re still low-level edge detectors.
One important thing to notice here is that convolved images are still images. The output of a small grid of pixels from the top left of an image will still be on the top left. So you can run another convolution layer on top of another (such as the two on the left) to extract deeper features, which we visualize.
Yet, however deep our feature detectors get, without any further changes they’ll still be operating on very small patches of the image. No matter how deep your detectors are, you can’t detect faces from a 3×3 grid. And this is where the idea of the receptive field comes in.
An essential design choice of any CNN architecture is that the input sizes grow smaller and smaller from the start to the end of the network, while the number of channels grows deeper. This, as mentioned earlier, is often done through strides or pooling layers. Locality determines what inputs from the previous layer the outputs get to see. The receptive field determines what area of the original input to the entire network the output gets to see.
The idea of a strided convolution is that we only process slides a fixed distance apart, and skip the ones in the middle. From a different point of view, we only keep outputs a fixed distance apart and remove the rest.
3×3 convolution, stride 2
We then apply a nonlinearity to the output and, as usual, stack another new convolution layer on top. And this is where things get interesting. Even if we were to apply a kernel of the same size (3×3), having the same local area, to the output of the strided convolution, the kernel would have a larger effective receptive field:
This is because the output of the strided layer still represents the same image. It is not so much cropping as it is resizing; the only difference is that every single pixel in the output is a “representative” of a larger area (whose other pixels were discarded) from the same rough location in the original input. So when the next layer’s kernel operates on the output, it’s operating on pixels collected from a larger area.
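One way to put numbers on this is the usual receptive-field recurrence for stacked convolutions: each layer widens the field by (kernel size − 1) times the spacing, in input pixels, between neighboring outputs of the previous layer. The function below is a small sketch of that bookkeeping. Its name and the (kernel_size, stride) tuple format are my own choices.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, from first to last layer.

    Returns the width, in original input pixels, that one output pixel sees.
    """
    rf = 1    # a raw input pixel sees only itself
    jump = 1  # distance in input pixels between neighboring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Two plain 3x3 convolutions stacked on top of each other.
print(receptive_field([(3, 1), (3, 1)]))  # 5

# A 3x3 convolution with stride 2, followed by a plain 3x3 convolution.
print(receptive_field([(3, 2), (3, 1)]))  # 7
```

The second kernel is the same 3×3 size in both cases, but after the strided layer it covers a 7×7 area of the original input instead of 5×5.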
(Note: if you’re familiar with dilated convolutions, note that the above is not a dilated convolution. Both are methods of increasing the receptive field, but dilated convolutions are a single layer, while this takes place on a regular convolution following a strided convolution, with nonlinearity in between)
This expansion of the receptive field allows the convolution layers to combine the low-level features (lines, edges), into higher-level features (curves, textures), as we see in the mixed3a layer.
Followed by a pooling/strided layer, the network continues to create detectors for even higher level features (parts, patterns), as we see for mixed4a.
The repeated reduction in image size across the network results in, by the 5th block of convolutions, input sizes of just 7×7, compared to inputs of 224×224. At this point, every single pixel represents a grid of 32×32 pixels, which is huge.
Compared to earlier layers, where an activation meant detecting an edge, here, activation on the tiny 7×7 grid is one for a very high-level feature, such as for birds.
The network as a whole progresses from a small number of filters (64 in the case of GoogLeNet), detecting low-level features, to a very large number of filters (1024 in the final convolution), each looking for an extremely specific high-level feature. After a final pooling layer, which collapses each 7×7 grid into a single pixel, each channel is a feature detector with a receptive field equivalent to the entire image.
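That final collapse can be sketched in NumPy as follows, assuming the pooling is a global average over each channel (a max would work the same way, shape-wise). The random feature map simply stands in for the real activations.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(1024, 7, 7))  # stand-in for 1024 channels of 7x7 activations

# Collapse each 7x7 grid into a single value per channel.
pooled = features.mean(axis=(1, 2))

print(pooled.shape)  # (1024,)
```

What remains is one number per channel, each summarizing how strongly that channel's feature was detected anywhere in the image.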
Feature visualization of channels from each of the major collections of convolution blocks, showing a progressive increase in complexity
Compared to what a standard feedforward network would have done, the output here is really nothing short of awe-inspiring. A standard feedforward network would have produced abstract feature vectors, from combinations of every single pixel in the image, requiring intractable amounts of data to train.
The CNN, with the priors imposed on it, starts by learning very low-level feature detectors, and as its receptive field expands across the layers, learns to combine those low-level features into progressively higher-level features; not an abstract combination of every single pixel, but rather, a strong visual hierarchy of concepts.
By detecting low-level features, and using them to detect higher-level features as it progresses up its visual hierarchy, it is eventually able to detect entire visual concepts such as faces, birds, trees, etc., and that’s what makes CNNs so powerful, yet efficient, with image data.
A final note on adversarial attacks
With the visual hierarchy CNNs build, it is pretty reasonable to assume that their vision systems are similar to humans’. And they’re really great with real-world images, but they also fail in ways that strongly suggest their vision systems aren’t entirely human-like. The major problem: adversarial examples, examples that have been specifically modified to fool the model.
To a human, both images are obviously pandas. To the model, not so much.
Adversarial examples would be a non-issue if the only tampered ones that caused the models to fail were ones that even humans would notice. The problem is, the models are susceptible to attacks by samples which have only been tampered with ever so slightly, and would clearly not fool any human. This opens the door for models to silently fail, which can be pretty dangerous for a wide range of applications from self-driving cars to healthcare.
Robustness against adversarial attacks is currently a highly active area of research, the subject of many papers and even competitions, and solutions will certainly improve CNN architectures to become safer and more reliable.
CNNs were the models that allowed computer vision to scale from simple applications to powering sophisticated products and services, ranging from face detection in your photo gallery to making better medical diagnoses. They might be the key method in computer vision going forward, or some other new breakthrough might just be around the corner. Regardless, one thing is for sure: they’re nothing short of amazing, at the heart of many present-day innovative applications, and are most certainly worth deeply understanding.
Hope you enjoyed this article! If you’d like to stay connected, you’ll find me on LinkedIn here. If you have a question, comments are welcome! I find them to be useful to my own learning process as well.
If you want to continue on the path of getting a deeper understanding of Neural Networks and Deep Learning, here are some resources that might help you:
Stanford CS330: Deep Multi-Task and Meta-Learning
Feature Visualization — How neural networks build up their understanding of images (of note: the feature visualizations here were produced with the Lucid library, an open-source implementation of the techniques from this journal article)