How To Optimize A Neural Network


In this article, we will learn how to optimize or prune a neural network so that it can run efficiently on an edge device without sacrificing performance.


Performance

It measures the degree to which a machine executes its task. It is usually evaluated with respect to device performance, inference speed, or energy consumption.


A metric is a measurable quantity or attribute.
Performance Metrics
  1. Inference Time
    It should be reduced to increase performance

  2. Model size
    A smaller model takes less time and energy to load.

  3. Accuracy
    It should be kept high but not at the cost of other metrics.
Some other metrics are:
  1. Power
    Optimize the system for a longer operating time.

  2. System Size
    Optimize the system to occupy less physical volume.

  3. System Cost
    Optimize the system to reduce deployment cost.

Software Optimization

It involves changing your code or model to improve the performance of your application. This includes techniques and algorithms that reduce the computational complexity of the model so that it suits edge computing.

Hardware Optimization

It may be as simple as moving to a different hardware platform or as involved as designing custom hardware to accelerate a specific application.


Latency

It is the time taken to run inference on an image before the result for that image is produced. When data points arrive one at a time, low-latency solutions are used: inference is made as soon as data is available. An example of a low-latency device is one that processes one image per unit of time, i.e. 1 fps.


Throughput

It is the quantity of data processed in a single time period. When a huge number of data points is available, the inference is performed as a batch, and high-throughput solutions are used. An example of a high-throughput device is one processing 5 images per unit time, i.e. 5 fps.
  1. In the non-batch case, throughput = 1 / latency
  2. In the batch case, latency = batch size / throughput
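The two relationships above can be sketched directly in code; the numbers used below are illustrative assumptions, not measurements:

```python
# Sketch of the latency/throughput relationships described above.

def throughput_non_batch(latency_s):
    """Non-batch case: throughput = 1 / latency."""
    return 1.0 / latency_s

def latency_batch(batch_size, throughput_fps):
    """Batch case: latency = batch size / throughput."""
    return batch_size / throughput_fps

print(throughput_non_batch(0.2))   # 0.2 s per image -> 5.0 fps
print(latency_batch(32, 5.0))      # 32 images at 5 fps -> 6.4 s
```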

Ways to optimize our model


1. Reducing the size of the model

This decreases the loading time of the model and involves eliminating unimportant or redundant parameters from our network. It results in:
  1. Model loads faster
  2. Less space is required to store model
  3. Reduction in model parameters
  4. Model compiles faster
Methods to reduce model size
  1. Quantization
    Here, high-precision weights are converted into low-precision weights.

  2. Knowledge Distillation
    Here, the knowledge of a larger model is transferred to a smaller model.

  3. Model Compression
    Here, fewer unique weights are stored compared to the original model.
  • FP32 uses 4 bytes
  • FP16 uses 2 bytes
  • INT8 uses 1 byte
  • INT1 is a packed data type (several values fit in one byte)
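The byte sizes listed above can be checked with NumPy's dtype metadata:

```python
import numpy as np

# Byte sizes of the numeric types listed above.
print(np.dtype(np.float32).itemsize)  # 4 bytes
print(np.dtype(np.float16).itemsize)  # 2 bytes
print(np.dtype(np.int8).itemsize)     # 1 byte
```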

2. Reduce the number of operations

This decreases the inference time by minimizing the number of operations or computations required to run the network. It can be achieved with more efficient layers and the elimination of neural connections. It results in:
  1. More efficient operations
  2. Reduction in System Energy Consumption
  3. Reduction in inference time
Methods to reduce the number of operations
  1. More efficient operations
    Here we convert convolution layers into separable convolution layers

  2. Downsampling
    Using max or average pooling we reduce the number of parameters

  3. Model Pruning
    Here we remove redundant weights
  1. FLOP means Floating Point Operation.
  2. FLOPS means Floating Point Operations per Second.
  3. MAC means Multiply-Accumulate, i.e. a multiplication followed by an addition.
  4. One MAC = Two FLOPs
  5. Fewer FLOPs mean a faster model

Pooling Layers

These are sub-sampling layers, which reduce the amount of data or parameters transferred from layer to layer. Average pooling and max pooling are the two most commonly used pooling layers.
  1. Max Pooling takes a maximum of all values in a Kernel.
  2. Average Pooling takes the average of all values in Kernel.
Suppose a kernel is represented by a 3x3 matrix given by [23, 56, 76, 224, 177, 223, 122, 23, 34]. So:
  1. the result of Max Pooling will be 224.
  2. the result of Average Pooling will be approximately 106 (958 / 9 ≈ 106.4).
The disadvantage of Pooling is that we are losing a lot of information about an image.
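The pooling example above can be verified with NumPy:

```python
import numpy as np

# The 3x3 kernel values from the example above.
values = np.array([23, 56, 76, 224, 177, 223, 122, 23, 34]).reshape(3, 3)

max_pool = values.max()    # maximum of all values in the kernel
avg_pool = values.mean()   # average of all values in the kernel
print(max_pool, avg_pool)  # 224 and roughly 106.4
```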

Separable Convolution Layers

It's a convolution layer that splits the normal convolution layer into two parts, one depth-wise and one point-wise. This decreases the number of FLOPs or operations necessary to execute the model.

1. Depthwise Convolution

In this, each filter convolves over only one channel of the input image, i.e. the number of filters is always equal to the number of channels in the input.
Output shape = (Input shape - Kernel shape) + 1

2. Pointwise Convolution

This reduces the depth of an image by using a 1x1 kernel whose depth equals the input depth. For each such kernel, the output height equals the input height, the output width equals the input width, and the output depth is 1.
Output shape = (Input shape - Kernel shape) + 1
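The FLOP savings of a separable convolution can be seen by counting operations with the 2-FLOPs-per-MAC convention used in this article. The shapes below (an output of 8x8, 3x3 kernels, 16 input channels, 32 filters) are illustrative assumptions:

```python
# Rough FLOPs comparison: standard convolution vs. depthwise separable
# convolution, using FLOPs = 2 x kernels x kernel volume x output area.

def conv_flops(out_h, out_w, k_h, k_w, in_ch, n_kernels):
    return 2 * n_kernels * (k_h * k_w * in_ch) * out_h * out_w

def separable_flops(out_h, out_w, k_h, k_w, in_ch, n_kernels):
    depthwise = 2 * in_ch * (k_h * k_w) * out_h * out_w        # one filter per channel
    pointwise = 2 * n_kernels * (1 * 1 * in_ch) * out_h * out_w  # 1x1 kernels
    return depthwise + pointwise

standard = conv_flops(8, 8, 3, 3, 16, 32)
separable = separable_flops(8, 8, 3, 3, 16, 32)
print(standard, separable)  # 589824 vs 83968: far fewer FLOPs
```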


Model Pruning

It is a model compression technique in which redundant network parameters are removed while attempting to retain the initial network accuracy (or other metrics).
Steps to Prune
  1. Rank weights (either layerwise or across the whole network)
    If we rank layerwise, we prune an equal fraction of weights from each layer. If we do not know how the individual layers should be treated, we rank and prune across the whole network.
  2. Remove Weights (Weights are removed by setting them to zero)
  3. Retrain your model (Fine-tune your model to prevent a drop in accuracy)
To remove a neuron entirely, we set all weights leading to that neuron to zero.
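The rank-and-remove steps above can be sketched as magnitude pruning in NumPy (the weight matrix and pruning fraction are illustrative; the retraining step is not shown):

```python
import numpy as np

def prune(weights, fraction):
    """Set the smallest `fraction` of weights (by magnitude) to zero."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * fraction)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # rank step
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0      # remove step
    return pruned

w = np.array([[0.5, -0.05], [0.01, -0.8]])
print(prune(w, 0.5))   # the two smallest-magnitude weights become zero
```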

Model Compression

It refers to a series of algorithms that allow us to reduce the amount of memory required to store the model and also compact the number of parameters of our models.
Model optimization techniques are:
  1. Quantization
  2. Weight Sharing


Quantization

It is a method of mapping values from a larger set to a smaller set. We begin with a continuous (and possibly infinite) range of possible values and map it to a smaller set of (finite) values. In other words, quantization constrains continuous quantities to a small set of discrete numbers.
  1. Weight Quantization
    In this, only the weights are quantized. It uses floating-point arithmetic.
    In weight quantization, new values are calculated as follows:
    a. OldRange = OldMax - OldMin
    b. newRange = newMax - newMin
    c. newValue = ((oldValue - oldMin) * newRange / oldRange) + newMin
  2. Weight and Activation Quantization
    In this, both weights and activations are quantized. It uses integer arithmetic.
Weight and activation quantization reduces computational complexity and hence memory usage.
  1. A ternary neural network has 3 weight values, i.e. -1, 0, and 1.
  2. A binary neural network has 2 weight values, i.e. -1 and 1.
  3. An INT8-quantized network has 256 possible weight values, which means 8 bits are required to represent each weight.
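The min-max weight-quantization formula above can be sketched as follows, mapping floating-point weights to the INT8 range [-128, 127]. This is an illustrative sketch, not a production quantizer:

```python
import numpy as np

def quantize(weights, new_min=-128, new_max=127):
    """newValue = ((oldValue - oldMin) * newRange / oldRange) + newMin"""
    old_min, old_max = weights.min(), weights.max()
    old_range = old_max - old_min
    new_range = new_max - new_min
    scaled = (weights - old_min) * new_range / old_range + new_min
    return np.round(scaled).astype(np.int8)

w = np.array([-1.0, 0.0, 0.5, 1.0])
print(quantize(w))   # oldMin maps to -128, oldMax maps to 127
```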

Weight Sharing

The goal here is to store the same value in multiple weights. This decreases the amount of uniquely stored weights to conserve memory and reduce the scale of the model.
Weight sharing techniques are:
  1. K-Means
    After training, the number of unique weight values can be precisely controlled using K-means clustering. It is used in Deep Compression.

  2. Hashing Weights
    It is used in HashedNets. It groups weights into buckets before training and allows weights to be shared across the entire network or within individual layers.
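The K-means approach above can be sketched with a tiny hand-rolled clustering loop (the weight matrix and cluster count are illustrative assumptions): each weight is replaced by its cluster centroid, so only a few unique values remain to be stored.

```python
import numpy as np

def share_weights(weights, n_clusters=4, n_iter=20):
    """Cluster weights with k-means and snap each weight to its centroid."""
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centroids[c] = flat[labels == c].mean()
    labels = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids[labels].reshape(weights.shape)

w = np.random.default_rng(0).normal(size=(8, 8))
shared = share_weights(w, n_clusters=4)
print(len(np.unique(shared)))   # at most 4 unique values remain
```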

Knowledge Distillation

It is a process where the knowledge gained by a large and accurate model (the teacher model) is transferred to a smaller model (the student model). The resulting model is less expensive to run.
Steps of Knowledge Distillation
  1. Train a teacher model
  2. Get soft labels
  3. Train the student model on the soft labels
Here the student minimizes a loss function where the target is the soft labels from the teacher model.
The following can be used as output from the teacher model to train the student model:
  1. The soft output from the final softmax layer in the teacher network.
  2. The output from the third layer in a teacher network containing 6 layers.
  3. The hard output from the final softmax layer in the teacher network.
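The soft-label step above can be sketched by applying a temperature-scaled softmax to the teacher's logits; the logits and temperature value below are illustrative assumptions:

```python
import numpy as np

def soft_labels(logits, temperature=4.0):
    """Softmax over temperature-scaled logits; higher T gives softer labels."""
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    return exp / exp.sum()

teacher_logits = np.array([8.0, 2.0, -1.0])
print(soft_labels(teacher_logits, temperature=1.0))  # nearly one-hot
print(soft_labels(teacher_logits, temperature=4.0))  # softer distribution
```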


Important Formulas

  1. The shape of a convolutional layer = Height x Width x Depth
  2. Output shape = (Input Shape - Kernel Shape) +1
  3. Inference time = Total FLOPs / Speed of Hardware
  4. For Convolutional Layers, FLOPs = 2 x Number of Kernel x Kernel Shape x Output Height x Output Width
  5. For Fully Connected Layers, FLOPs = 2 x Input Size x Output Size
  6. For Pooling Layers,
    1. FLOPs = Height x Width x Depth of the image
    2. With a stride, FLOPs = (Height / Stride) x (Width / Stride) x Depth
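The formulas above translate directly into small helper functions (the sample numbers are the 64-kernel, 7x7x9, 19x19-output network from the first solved example below):

```python
# FLOP counters for the formulas above, using the 2-FLOPs-per-MAC convention.

def conv_flops(n_kernels, k_h, k_w, k_d, out_h, out_w):
    """Convolutional layers: 2 x kernels x kernel shape x output H x output W."""
    return 2 * n_kernels * (k_h * k_w * k_d) * out_h * out_w

def fc_flops(in_size, out_size):
    """Fully connected layers: 2 x input size x output size."""
    return 2 * in_size * out_size

def inference_time(total_flops, hardware_flops_per_s):
    """Inference time = total FLOPs / speed of hardware."""
    return total_flops / hardware_flops_per_s

flops = conv_flops(64, 7, 7, 9, 19, 19)
print(flops)                          # 20377728
print(inference_time(flops, 4e12))    # about 5.09e-06 s at 4 TeraFLOPS
```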


Examples

1. Given a convolutional network with input shape 25x25x9 and 64 kernels of shape 7x7x9, which produce an output of shape 19x19x64, calculate the inference time if the speed of the hardware is 4 TeraFLOPS.
MACs   =  64 x 7 x 7 x 9 x 19 x 19
       =  10188864
FLOPs  =  10188864 x 2
       =  20377728
Inference Time = 20377728 / (4 x 10^12) ≈ 5.09 x 10^-6 seconds
2. Suppose you have to design an image classification application. Input to the network is 28x28 MNIST images. There are two Conv2D layers, two average pooling layers, and two fully connected layers. Each Conv2D layer has five 3x3 filters. Each pooling layer has a 2x2 filter and a stride of 2. The first fully connected layer has 128 neurons and the second has 10 neurons. The system is designed in the following format:
Input--->Conv2d-->Pooling Layer-->Conv2d-->Pooling Layer-->FC->FC
Calculate the number of FLOPs required by each layer.
Layer 1: Conv2D
Output shape = (28 - 3) + 1 = 26, so we get an output of 26x26x5
FLOPs = 2 x 5 x (3x3x1) x 26 x 26 = 60840
Layer 2: Average Pooling 2D
Output shape = 13x13x5
FLOPs = 13 x 13 x 2 x 2 x 5 = 3380
Layer 3: Conv2D
Output shape = (13 - 3) + 1 = 11, so we get 11x11x5
FLOPs = 2 x 5 x (3x3x5) x 11 x 11 = 54450
Layer 4: Average Pooling 2D
Output shape = 5x5x5
FLOPs = 5 x 5 x 2 x 2 x 5 = 500
Layer 5: Fully Connected
FLOPs = 2 x (5x5x5) x 128 = 32000
Layer 6: Fully Connected
FLOPs = 2 x 128 x 10 = 2560
Total FLOPs required:
= 60840 + 3380 + 54450 + 500 + 32000 + 2560
= 153730


In this article, we learned how to reduce or prune a neural network without sacrificing performance and efficiency, so that we can get the most out of an AI application on an edge device.