Convolutional Neural Networks: Basic Theory in a Nutshell

Convolutional Neural Networks: Basic Theory in a Nutshell


Majority of the tutorials I've seen on convolutional neural networks either focus on providing a basic analogy or going straight into describing terminology. Therefore, I aim to start with an overview of the stages involved in CNN's (Convolutional Neural Networks) and then provide an analogy, as well as a small glossary of key and external resources for further assistance. Make sure to utilize the glossary to understand key terms used throughout the blog post to help understand the material and continue onwards to a few other articles or video's mentioned in my resources section! By the way, don't expect to completely understand CNN's straight away, as they ain't all too simple!

Note that I'll be providing tangible/practical code in another one of my problem-solution blog post (where I take a problem I've had and explain my final derived solution, along with how I've overcome some major hurdles).

Key Stages?

  • Convolutional Layers (extract features from the input)
    • Filters (matrices of weights) convolve over the input to produce feature maps
      • When the filter and input are similar, a high number is produced
    • Applying ReLu functions to increase non-linearity
  • Pooling/Down Sampling (combine clusters of neurons together) to reduce dimensionality
    • Flattens clusters of neurons into one-long vectors
  • Fully Connected Layers (connect a neuron to all those in the next layer)


You probably won't understand the above descriptions straight away, however a worded example feel more intuitive (to ease the confusion)!

So humor me and imagine the following scenario:

  • You're given a few hundred paintings and need to identify which picture corresponds to which shape
  • None of the paintings are too precise, and the grid they were painted onto is HUGE (so there's no use just trying every single all combinations in a fully connected artificial neural network)
  • You notice that each picture/painting is composed of smaller, more subtle, strokes (lines), which form curves, which themselves create each shape's final outline

This might sound insane (are we like 2?), but more complicated and meaningful problems can be solved in the exact same way!

The process of segregating/labelling images begins with realizing that you can't comprehend a full picture at once, so you must break each down into smaller 2x2 squares. You can move across, with a stride of 2 pixels at a time and compare these squares against a few filters. The filters themselves are just another grid, which resembles unique features which may be present in the original input image. Example basic features are like lines in different directions:

  • Horizontal
  • Vertical
  • Diagonal from left to right (upwards)
  • Diagonal from right to left (downwards)

Through convolving from one mini-image (receptive field) to the next, documenting how similar each receptive and filter grids are, a smaller down sampled image can form (this is pooling). The aim of this gradual comparative process is to form a more abstract, higher level image composed of lines instead of individual pixels!

Now the enlightening idea is that you can apply the same down sampling process used to extract lines from pixels, to find curves in lines, and then shapes formed from the curves! Each of these stages form separate layers of your neural network and are separated by activation functions (for accentuating non-linearity) and fully connected layers (to join together the different patters and eventually produce a final output)!


CNNConvolutional Neural Network
Kernel/ChannelA matrix of weights used to produce a feature map by convolving over the input (note that multiple can be used to preserve spacial depth to a higher extent)
FilterA set of kernel/channel\'s
ConvolvingMoving through a broken down version of the input, summing the input values in each section and multiplying them by the filter/kernel
Activation/Feature MapOriginal input processed by filters to accentuate certain features (effectively performs operations like edge detection, blur, ect)
StrideThe number of divisions of the input to scroll across at a time
Receptive FieldPart of the input which the filter\'s scrolling over
PaddingAdding zeroes (i.e. zero-padding) to the input or dropping part of it (valid-padding), to mitigate the impact of the stride not perfectly dividing the input (e.g. there is a remainder)

Places to LEARN MORE?

I've come across several amazing blogs and videos describing how convolutional neural networks work, so here's a rough list of the one's I feel are the most valuable!

Cover image sourced from here


Now that you've heard me ramble, I'd like to thank you for taking the time to read through my blog (or skipping to the end).