How To Optimize A Neural Network


In this article, we will learn how to optimize or prune a neural network so that it can run efficiently on an edge device without sacrificing performance.


Performance

It measures the degree to which a machine executes its task. It is usually evaluated with respect to device performance, inference speed, or energy consumption.


A metric is a measurable quantity or attribute.
Performance Metrics
  1. Inference Time
    It should be reduced to increase performance

  2. Model size
    A smaller model takes less time and energy to load.

  3. Accuracy
    It should be kept high but not at the cost of other metrics.
Some other metrics are:
  1. Power
    Optimize the system for a longer operating time.

  2. System Size
    Optimize the system to occupy less physical volume.

  3. System Cost
    Optimize the system to reduce deployment cost.

Software Optimization

It involves changing your code or model to improve the performance of your application. This includes techniques and algorithms that reduce the computational complexity of the model so that it suits edge computing.

Hardware Optimization

It may be as simple as moving to a different hardware platform or as involved as designing custom hardware to accelerate a specific application.


Latency

It is the time taken to run inference on an image before the result for that image is produced. When data points arrive one at a time, low-latency solutions are used: inference is made as soon as data is available. An example of a low-latency device is one that processes one image per unit of time, i.e. 1 fps.


Throughput

It is the quantity of data processed in a single time period. When a huge number of data points is available, the inference is performed as a batch, and high-throughput solutions are used. An example of a high-throughput device is one processing 5 images per unit time, i.e. 5 fps.
  1. In the non-batch case, throughput = 1 / latency
  2. In the batch case, latency = batch size / throughput
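The two relationships above can be sketched directly in code; the numbers used below are illustrative assumptions, not measurements:

```python
# Sketch of the latency/throughput relationships described above.

def throughput_non_batch(latency_s):
    """Non-batch case: throughput = 1 / latency."""
    return 1.0 / latency_s

def latency_batch(batch_size, throughput_fps):
    """Batch case: latency = batch size / throughput."""
    return batch_size / throughput_fps

print(throughput_non_batch(0.2))   # 0.2 s per image -> 5.0 fps
print(latency_batch(32, 5.0))      # 32 images at 5 fps -> 6.4 s
```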

Ways to optimize our model


1. Reducing the size of the model

This decreases the loading time of the model and involves eliminating unimportant or redundant parameters from our network. It results in:
  1. Model loads faster
  2. Less space is required to store model
  3. Reduction in model parameters
  4. Model compiles faster
Methods to reduce model size
  1. Quantization
    Here, high-precision weights are converted into low-precision weights.

  2. Knowledge Distillation
    Here, the knowledge of a larger model is transferred to a smaller model.

  3. Model Compression
    Here, fewer unique weights are stored compared to the original model.
  • FP32 uses 4 bytes
  • FP16 uses 2 bytes
  • INT8 uses 1 byte
  • INT1 is a packed data type (several values fit in one byte)
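The byte sizes listed above can be checked with NumPy's dtype metadata:

```python
import numpy as np

# Byte sizes of the numeric types listed above.
print(np.dtype(np.float32).itemsize)  # 4 bytes
print(np.dtype(np.float16).itemsize)  # 2 bytes
print(np.dtype(np.int8).itemsize)     # 1 byte
```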

2. Reduce the number of operations

This decreases the inference time by minimizing the number of operations or computations required to run the network. It can be achieved with more efficient layers and the elimination of neural connections. It results in:
  1. More efficient operations
  2. Reduction in System Energy Consumption
  3. Reduction in inference time
Methods to reduce the number of operations
  1. More efficient operations
    Here we convert convolution layers into separable convolution layers

  2. Downsampling
    Using max or average pooling we reduce the number of parameters

  3. Model Pruning
    Here we remove redundant weights
  1. FLOP means Floating Point Operation.
  2. FLOPS means Floating Point Operations per Second.
  3. MAC means Multiply-Accumulate, i.e. a multiplication followed by an addition.
  4. One MAC = Two FLOPs
  5. Fewer FLOPs mean a faster model

Pooling Layers

These are sub-sampling layers, which reduce the amount of data or parameters transferred from layer to layer. Average pooling and max pooling are the two most commonly used pooling layers.
  1. Max Pooling takes a maximum of all values in a Kernel.
  2. Average Pooling takes the average of all values in Kernel.
Suppose a kernel is represented by a 3x3 matrix given by [23, 56, 76, 224, 177, 223, 122, 23, 34]. So:
  1. the result of Max Pooling will be 224.
  2. the result of Average Pooling will be approximately 106 (958 / 9 ≈ 106.4).
The disadvantage of Pooling is that we are losing a lot of information about an image.
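The pooling example above can be verified with NumPy:

```python
import numpy as np

# The 3x3 kernel values from the example above.
values = np.array([23, 56, 76, 224, 177, 223, 122, 23, 34]).reshape(3, 3)

max_pool = values.max()    # maximum of all values in the kernel
avg_pool = values.mean()   # average of all values in the kernel
print(max_pool, avg_pool)  # 224 and roughly 106.4
```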

Separable Convolution Layers

It's a convolution layer that splits the normal convolution layer into two parts, one depth-wise and one point-wise. This decreases the number of FLOPs or operations necessary to execute the model.

1. Depthwise Convolution

In this, each filter convolves over only one channel of the input image, i.e. the number of filters is always equal to the number of channels in the input.
Output shape = (Input shape - Kernel shape) + 1

2. Pointwise Convolution

This reduces the depth of an image by using a 1x1 kernel whose depth equals the input depth. For each such kernel, the output height equals the input height, the output width equals the input width, and the output depth is 1.
Output shape = (Input shape - Kernel shape) + 1
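The FLOP savings of a separable convolution can be seen by counting operations with the 2-FLOPs-per-MAC convention used in this article. The shapes below (an output of 8x8, 3x3 kernels, 16 input channels, 32 filters) are illustrative assumptions:

```python
# Rough FLOPs comparison: standard convolution vs. depthwise separable
# convolution, using FLOPs = 2 x kernels x kernel volume x output area.

def conv_flops(out_h, out_w, k_h, k_w, in_ch, n_kernels):
    return 2 * n_kernels * (k_h * k_w * in_ch) * out_h * out_w

def separable_flops(out_h, out_w, k_h, k_w, in_ch, n_kernels):
    depthwise = 2 * in_ch * (k_h * k_w) * out_h * out_w        # one filter per channel
    pointwise = 2 * n_kernels * (1 * 1 * in_ch) * out_h * out_w  # 1x1 kernels
    return depthwise + pointwise

standard = conv_flops(8, 8, 3, 3, 16, 32)
separable = separable_flops(8, 8, 3, 3, 16, 32)
print(standard, separable)  # 589824 vs 83968: far fewer FLOPs
```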


Model Pruning

It is a model compression technique in which redundant network parameters are removed while attempting to retain the initial network accuracy (or other metrics).
Steps to Prune
  1. Rank weights (either layerwise or across the whole network)
    If we rank layerwise, we prune an equal fraction of weights from each layer. If we do not know how the individual layers should be treated, we rank and prune across the whole network.
  2. Remove Weights (Weights are removed by setting them to zero)
  3. Retrain your model (Fine-tune your model to prevent a drop in accuracy)
To remove a neuron entirely, we set all weights leading to that neuron to zero.
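The rank-and-remove steps above can be sketched as magnitude pruning in NumPy (the weight matrix and pruning fraction are illustrative; the retraining step is not shown):

```python
import numpy as np

def prune(weights, fraction):
    """Set the smallest `fraction` of weights (by magnitude) to zero."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * fraction)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # rank step
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0      # remove step
    return pruned

w = np.array([[0.5, -0.05], [0.01, -0.8]])
print(prune(w, 0.5))   # the two smallest-magnitude weights become zero
```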

Model Compression

It refers to a series of algorithms that allow us to reduce the amount of memory required to store the model and also compact the number of parameters of our models.
Model optimization techniques are:
  1. Quantization
  2. Weight Sharing


Quantization

It is a method of mapping values from a larger set to a smaller set. We begin with a continuous (and possibly infinite) range of possible values and map it to a smaller set of (finite) values. In other words, quantization constrains continuous quantities to a small set of discrete numbers.
  1. Weight Quantization
    In this, only the weights are quantized. It uses floating-point arithmetic.
    In weight quantization, new values are calculated as follows:
    a. OldRange = OldMax - OldMin
    b. newRange = newMax - newMin
    c. newValue = ((oldValue - oldMin) * newRange / oldRange) + newMin
  2. Weight and Activation Quantization
    In this, both weights and activations are quantized. It uses integer arithmetic.
Weight and activation quantization reduces computational complexity and hence memory usage.
  1. A ternary neural network has 3 weight values, i.e. -1, 0, and 1.
  2. A binary neural network has 2 weight values, i.e. -1 and 1.
  3. An INT8-quantized network has 256 possible weight values, which means 8 bits are required to represent each weight.
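The min-max weight-quantization formula above can be sketched as follows, mapping floating-point weights to the INT8 range [-128, 127]. This is an illustrative sketch, not a production quantizer:

```python
import numpy as np

def quantize(weights, new_min=-128, new_max=127):
    """newValue = ((oldValue - oldMin) * newRange / oldRange) + newMin"""
    old_min, old_max = weights.min(), weights.max()
    old_range = old_max - old_min
    new_range = new_max - new_min
    scaled = (weights - old_min) * new_range / old_range + new_min
    return np.round(scaled).astype(np.int8)

w = np.array([-1.0, 0.0, 0.5, 1.0])
print(quantize(w))   # oldMin maps to -128, oldMax maps to 127
```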

Weight Sharing

The goal here is to store the same value in multiple weights. This decreases the amount of uniquely stored weights to conserve memory and reduce the scale of the model.
Weight sharing techniques are:
  1. K-Means
    After training, the number of unique weight values can be precisely controlled using K-means clustering. It is used in Deep Compression.

  2. Hashing Weights
    It is used in HashedNets. It groups weights into buckets before training and allows weights to be shared across the entire network or within individual layers.
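The K-means approach above can be sketched with a tiny hand-rolled clustering loop (the weight matrix and cluster count are illustrative assumptions): each weight is replaced by its cluster centroid, so only a few unique values remain to be stored.

```python
import numpy as np

def share_weights(weights, n_clusters=4, n_iter=20):
    """Cluster weights with k-means and snap each weight to its centroid."""
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centroids[c] = flat[labels == c].mean()
    labels = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids[labels].reshape(weights.shape)

w = np.random.default_rng(0).normal(size=(8, 8))
shared = share_weights(w, n_clusters=4)
print(len(np.unique(shared)))   # at most 4 unique values remain
```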

Knowledge Distillation

It is a process where the knowledge gained by a large and accurate model (the teacher model) is transferred to a smaller model (the student model). The resulting model is less expensive to run.
Steps of Knowledge Distillation
  1. Train a teacher model
  2. Get soft labels
  3. Train the student model on the soft labels
Here the student minimizes a loss function where the target is the soft labels from the teacher model.
The following can be used as output from the teacher model to train the student model:
  1. The soft output from the final softmax layer in the teacher network.
  2. The output from the third layer in a teacher network containing 6 layers.
  3. The hard output from the final softmax layer in the teacher network.
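The soft-label step above can be sketched by applying a temperature-scaled softmax to the teacher's logits; the logits and temperature value below are illustrative assumptions:

```python
import numpy as np

def soft_labels(logits, temperature=4.0):
    """Softmax over temperature-scaled logits; higher T gives softer labels."""
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    return exp / exp.sum()

teacher_logits = np.array([8.0, 2.0, -1.0])
print(soft_labels(teacher_logits, temperature=1.0))  # nearly one-hot
print(soft_labels(teacher_logits, temperature=4.0))  # softer distribution
```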


Important Formulas

  1. The shape of a convolutional layer = Height x Width x Depth
  2. Output shape = (Input Shape - Kernel Shape) +1
  3. Inference time = Total FLOPs / Speed of Hardware
  4. For Convolutional Layers, FLOPs = 2 x Number of Kernel x Kernel Shape x Output Height x Output Width
  5. For Fully Connected Layers, FLOPs = 2 x Input Size x Output Size
  6. For Pooling Layers,
    1. FLOPs = Height x Width x Depth of the image
    2. With a stride, FLOPs = (Height / Stride) x (Width / Stride) x Depth
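The formulas above translate directly into small helper functions (the sample numbers are the 64-kernel, 7x7x9, 19x19-output network from the first solved example below):

```python
# FLOP counters for the formulas above, using the 2-FLOPs-per-MAC convention.

def conv_flops(n_kernels, k_h, k_w, k_d, out_h, out_w):
    """Convolutional layers: 2 x kernels x kernel shape x output H x output W."""
    return 2 * n_kernels * (k_h * k_w * k_d) * out_h * out_w

def fc_flops(in_size, out_size):
    """Fully connected layers: 2 x input size x output size."""
    return 2 * in_size * out_size

def inference_time(total_flops, hardware_flops_per_s):
    """Inference time = total FLOPs / speed of hardware."""
    return total_flops / hardware_flops_per_s

flops = conv_flops(64, 7, 7, 9, 19, 19)
print(flops)                          # 20377728
print(inference_time(flops, 4e12))    # about 5.09e-06 s at 4 TeraFLOPS
```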


Examples

1. Given a convolutional network with input shape 25x25x9 and 64 kernels of shape 7x7x9, which produce an output of shape 19x19x64, calculate the inference time if the speed of the hardware is 4 TeraFLOPS.
MACs   =  64 x 7 x 7 x 9 x 19 x 19
       =  10188864
FLOPs  =  10188864 x 2
       =  20377728
Inference Time = 20377728 / (4 x 10^12) ≈ 5.09 x 10^-6 seconds
2. Suppose you have to design an image classification application. Input to the network is 28x28 MNIST images. There are two Conv2D layers, two average pooling layers, and two fully connected layers. Each Conv2D layer has five 3x3 filters. Each pooling layer has a 2x2 filter and a stride of 2. The first fully connected layer has 128 neurons and the second has 10 neurons. The system is designed in the following format:
Input--->Conv2d-->Pooling Layer-->Conv2d-->Pooling Layer-->FC->FC
Calculate the number of FLOPs required by each layer.
Layer 1: Conv2D
Output shape = (28 - 3) + 1 = 26, so we get an output of 26x26x5
FLOPs = 2 x 5 x (3x3x1) x 26 x 26 = 60840
Layer 2: Average Pooling 2D
Output shape = 13x13x5
FLOPs = 13 x 13 x 2 x 2 x 5 = 3380
Layer 3: Conv2D
Output shape = (13 - 3) + 1 = 11, so we get 11x11x5
FLOPs = 2 x 5 x (3x3x5) x 11 x 11 = 54450
Layer 4: Average Pooling 2D
Output shape = 5x5x5
FLOPs = 5 x 5 x 2 x 2 x 5 = 500
Layer 5: Fully Connected
FLOPs = 2 x (5x5x5) x 128 = 32000
Layer 6: Fully Connected
FLOPs = 2 x 128 x 10 = 2560
Total FLOPs required:
= 60840 + 3380 + 54450 + 500 + 32000 + 2560
= 153730


In this article, we learned how to reduce or prune a neural network without sacrificing performance and efficiency, so that we can get the most out of an AI application on an edge device.