Why?

There are so many models to try out, but training a neural network is slow when you have a substantial amount of data. Here I'll step you through the full model life cycle, starting with a list of potential models, then explaining what differentiates them, and finally how one chosen model works.

The contenders

ResNext152
ResNext50_32x4d
Inception V2
VGG16
SqueezeNet 1_1
MobileNetV2

Choosing a Model

When choosing a neural network model it's important to consider each model's performance as well as its computational cost.

Consider where your model will be used (e.g. on mobile), as this may impact your choice of model (e.g. a lightweight model may be required). The second consideration is training time. If you have a powerful graphics card like the RTX 2080 Ti with a large amount of VRAM (e.g. 11 GB+), training time likely won't be a large concern. However, if you have a large dataset and a GPU with less VRAM, a model designed to run quickly will also train far quicker! MobileNet balances both factors quite well, and so is the focus of this blog post.

For completeness, note that there's a classic pattern of diminishing returns: larger amounts of resource usage on average lead to only small performance improvements. One reason is that simply adding more layers can cause vanishing/exploding gradients, so older networks like Inception tended to perform worse than modern models.

MobileNetV2

A drop-in replacement for standard CNNs

A "factorized" version of a regular convolution can be created by splitting the convolution into two parts:

Filtering (depthwise convolution)
Finding new linear combinations of features (pointwise convolution)

The first step, the depthwise convolution, applies a single convolutional kernel to each input channel, without mixing the channels together. The second step, the pointwise convolution, is a 1x1 convolution which combines the filtered channels into new features.
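To make the two steps concrete, here's a minimal pure-Python sketch of them. The function names, the toy image, the kernels and the weights are all made up for illustration (no padding, stride 1), so treat it as a sketch of the idea rather than how a real framework implements it.

```python
def depthwise_conv(image, kernels):
    """Apply one k x k kernel per input channel (no padding, stride 1).

    image:   list of channels, each a 2D list [height][width]
    kernels: one 2D k x k kernel per channel
    """
    out = []
    for channel, kernel in zip(image, kernels):
        k = len(kernel)
        height, width = len(channel), len(channel[0])
        filtered = [
            [
                sum(
                    channel[row + i][col + j] * kernel[i][j]
                    for i in range(k)
                    for j in range(k)
                )
                for col in range(width - k + 1)
            ]
            for row in range(height - k + 1)
        ]
        out.append(filtered)  # channels are never mixed in this step
    return out


def pointwise_conv(image, weights):
    """1x1 convolution: each output channel is a weighted sum of input channels.

    weights: one list per output channel, holding one weight per input channel
    """
    height, width = len(image[0]), len(image[0][0])
    return [
        [
            [
                sum(w * image[c][row][col] for c, w in enumerate(channel_weights))
                for col in range(width)
            ]
            for row in range(height)
        ]
        for channel_weights in weights
    ]


# Toy input: 2 channels of 3x3 pixels (made-up numbers).
image = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
# One 2x2 kernel per channel (made-up weights).
kernels = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
# Two output channels: the sum and the difference of the filtered channels.
weights = [[1, 1], [1, -1]]

filtered = depthwise_conv(image, kernels)
combined = pointwise_conv(filtered, weights)
print(filtered)
print(combined)
```

Notice that the depthwise step keeps the number of channels fixed, and only the pointwise step decides how many output channels there are.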

The reduction in computation can be seen by counting multiply-adds:

Standard convolutions cost $height_{input} * width_{input} * depth_{input} * {no\_kernels} * {kernel\_size}^2$
Depthwise separable convolutions cost $height_{input} * width_{input} * depth_{input} * ({kernel\_size}^2 + {no\_kernels})$

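Thus if a kernel size of 3 (standard) is used, the ratio $\frac{{no\_kernels} * {kernel\_size}^2}{{kernel\_size}^2 + {no\_kernels}}$ approaches 9 as the number of kernels grows, so these convolutions need roughly 8-9x less computation!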
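As a quick sanity check of that figure, here's a small sketch that plugs numbers into the two costs, taking the separable cost as $height_{input} * width_{input} * depth_{input} * ({kernel\_size}^2 + {no\_kernels})$. The layer dimensions are hypothetical and picked only for illustration.

```python
def standard_conv_cost(height, width, depth_in, no_kernels, kernel_size):
    """Multiply-adds for a standard convolution layer."""
    return height * width * depth_in * no_kernels * kernel_size ** 2


def separable_conv_cost(height, width, depth_in, no_kernels, kernel_size):
    """Multiply-adds for a depthwise convolution followed by a 1x1 pointwise one."""
    return height * width * depth_in * (kernel_size ** 2 + no_kernels)


# Hypothetical layer: 56x56 feature map, 64 channels in, 128 kernels, 3x3 kernel.
standard = standard_conv_cost(56, 56, 64, 128, 3)
separable = separable_conv_cost(56, 56, 64, 128, 3)
ratio = standard / separable
print(f"standard: {standard:,}  separable: {separable:,}  ratio: {ratio:.2f}x")
```

For this layer the ratio lands between 8 and 9. It can never exceed ${kernel\_size}^2 = 9$, and layers with few kernels save less.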

Linear bottleneck layers

Let me preface this by clarifying that the channels mentioned here are the depth slices of a feature map (like the R, G and B planes of an image).

There are two observations at play:

When the region of interest still has a non-zero volume after a ReLU transform, ReLU acts as a linear transformation on it
ReLU can preserve the input information completely, but only if that information lies in a low-dimensional subspace of the input space

Using linear bottleneck layers (convolutional layers without ReLUs) allows greater preservation of information. This is because linear functions don't collapse any channel, unlike ReLUs, which zero out every negative value. Note that the paper describing this emphasizes that collapsing a channel is fine when that information is likely stored elsewhere in another channel. This is explained further later on.
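Here's a toy illustration of what "collapsing a channel" means. The activations and the stand-in linear layer (a simple scaling) are made up for the example and are not taken from the paper.

```python
def relu_channel(values):
    """Element-wise ReLU: negative values become zero."""
    return [max(0.0, v) for v in values]


def linear_channel(values, scale=0.5):
    """A stand-in linear layer: scaling is invertible, so nothing is lost."""
    return [scale * v for v in values]


# Two different (made-up) sets of activations for the same single channel.
channel_a = [-1.0, -2.0, -3.0]
channel_b = [-4.0, -0.5, -9.0]

# After ReLU both collapse to all zeros, so they can no longer be told apart.
collapsed = relu_channel(channel_a) == relu_channel(channel_b)
# After the linear layer they remain distinct (and recoverable by dividing by scale).
still_distinct = linear_channel(channel_a) != linear_channel(channel_b)
print(collapsed, still_distinct)
```

With only a handful of channels, losing one like this hurts. With many channels, the same information has a better chance of surviving somewhere else.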

Inverted residuals

Very much like ResNets, MobileNetV2 uses residual blocks to improve gradient flow! Here though, the shortcut connections run between the narrow linear bottleneck layers rather than between the wide layers, hence the name inverted residuals. This design choice is more efficient, as the shortcuts carry, and the additions operate on, smaller tensors.
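To get a feel for the saving, here's a rough sketch comparing the size of a tensor at the bottleneck with the size of the expanded tensor inside the block. The feature map size, channel count and 6x expansion are hypothetical values, and the helper names are my own.

```python
def tensor_elements(height, width, channels):
    """Number of values in a height x width x channels feature map."""
    return height * width * channels


def add_shortcut(block_input, block_output):
    """Residual connection: element-wise sum of a block's input and output."""
    return [x + y for x, y in zip(block_input, block_output)]


# Hypothetical block: 56x56 feature map, 24 bottleneck channels, expanded 6x inside.
bottleneck = tensor_elements(56, 56, 24)
expanded = tensor_elements(56, 56, 24 * 6)
print(bottleneck, expanded, expanded // bottleneck)

# The shortcut adds the narrow input to the narrow output (flattened toy values).
shortcut_sum = add_shortcut([1.0, 2.0, 3.0], [0.5, -1.0, 0.25])
print(shortcut_sum)
```

Linking the bottlenecks means the tensor that has to be kept around for the addition is the small one, not the expanded one.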

ReLU6

Throughout the MobileNetV2 implementation, ReLU6 is used instead of a regular ReLU wherever a non-linearity is applied. ReLU6 is a modified version of the original Rectified Linear Unit which caps activations at 6, i.e. $min(max(0, x), 6)$, stopping them from growing too large. This makes ReLU6 more robust, particularly under low-precision computation. However, the 6 itself is a fairly arbitrary choice of value.
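In code the difference is a single extra clamp. This is a minimal scalar sketch rather than a framework implementation:

```python
def relu(x):
    """Regular ReLU: zero for negative inputs, unbounded above."""
    return max(0.0, x)


def relu6(x):
    """ReLU6: the same as ReLU, but the output is capped at 6."""
    return min(max(0.0, x), 6.0)


for x in (-3.0, 2.5, 6.0, 10.0):
    print(f"x={x:5}  relu={relu(x):5}  relu6={relu6(x):5}")
```

The two only disagree for inputs above 6, where ReLU keeps growing and ReLU6 stays flat.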

Layers

1x1 convolution with ReLU6
Depthwise convolution (with 3x3 kernel), also with ReLU6
Pointwise 1x1 convolution (finding linear combinations between features)

The first stage with 1x1 convolutions effectively increases the number of channels. An expansion ratio is used to represent the desired increase in channels (the ratio of the inner size to the size of the input bottleneck). As there are now more channels present, it is fine to use ReLU after this expansion layer. The idea is that with a large number of channels, if one channel is collapsed, that same info is likely still within another channel as well. Hence, ReLU can be used.

The final pointwise 1x1 convolution does the opposite (it decreases the output dimensions). No ReLU is used here: reducing dimensionality can already destroy information, and a ReLU on so few channels would destroy even more.
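Putting the three stages together, here's a small sketch that tracks how the channel count changes through one block and how many convolution weights each stage needs. The function name and the 24-channel, 6x expansion example are assumptions for illustration, and biases and batch normalization are ignored.

```python
def bottleneck_block_stages(channels_in, channels_out, expansion_ratio, kernel_size=3):
    """Describe the three stages of one block.

    Returns tuples of (stage, in_channels, out_channels, activation, weights).
    Only convolution weights are counted (no biases, no batch norm).
    """
    inner = channels_in * expansion_ratio
    k = kernel_size
    return [
        # Expansion: a 1x1 convolution widens the bottleneck.
        ("expand 1x1", channels_in, inner, "ReLU6", channels_in * inner),
        # Filtering: one k x k kernel per channel, channel count unchanged.
        (f"depthwise {k}x{k}", inner, inner, "ReLU6", inner * k * k),
        # Projection: a 1x1 convolution narrows back down, with no activation.
        ("project 1x1", inner, channels_out, "none (linear)", inner * channels_out),
    ]


# Hypothetical block: 24 channels in and out, expansion ratio of 6.
stages = bottleneck_block_stages(24, 24, 6)
for name, c_in, c_out, activation, weights in stages:
    print(f"{name:14} {c_in:4} -> {c_out:4}  {activation:14} {weights:6} weights")

# A shortcut can only be added when the input and output shapes match.
can_add_shortcut = stages[0][1] == stages[-1][2]
print("shortcut possible:", can_add_shortcut)
```

The wide part of the block is the cheap depthwise stage, while the two 1x1 convolutions carry most of the weights.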

Note that more hyperparameters are present in the actual model, but for simplicity, I'm leaving them out. To learn about MobileNetV2 in more detail, check out its paper.

A final note

Now that you know how a basic lightweight model works, you may be interested in where it could come in handy. I previously mentioned reduced training time, but a model which can run on any device, without the need for an internet connection or decent hardware, can be useful in several scenarios. One such scenario is my snake species classification app, which tells you the snake species in a photo (e.g. if you've been bitten). Please feel free to use the GitHub repo with the code as a reference for how you can do the same with your model!

Resources?

Cover image sourced from here

THANKS FOR READING!

Now that you've heard me ramble, I'd like to thank you for taking the time to read through my blog (or skipping to the end). If you liked this article, then check out how to improve your model and how to overcome several hurdles while creating a project of your own.

The Data Science Swiss Army Knife