Hailo Quantization Scheme

1. Introduction

Quantization is the process of converting the full-precision representation of the weights, biases, and activations of a Neural Network (NN) to a lower-bit representation. The quantization scheme of Hailo’s DataFlow Compiler uses a 4/8/16-bit int representation for the weights (see Figure 1a), and an 8/16-bit uint representation for the activations (see Figure 1b).

Choosing the right quantization scheme for a NN is not a ”one size fits all” decision, and one needs to consider some trade-offs, mainly accuracy vs. memory consumption. We recommend starting with the basic quantization scheme, then taking incremental steps of adjusting and measuring the performance.

2. Prerequisites

Before running quantization, one needs to have the Hailo Network representation (HN file) and the native weights (NPZ file), as obtained after parsing the Checkpoint/ONNX/Frozen-graph.

Another critical component is a calibration set of representative input images, typically 32-96 items. It is critically important to apply the same pre-processing that the network was trained with to the calibration set.
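For illustration, here is a minimal sketch of building such a calibration array, assuming a hypothetical preprocess() function that stands in for the exact pre-processing used during training, and a hypothetical calib_images/ directory:

import glob
import numpy as np
from PIL import Image

def preprocess(img):
    # Placeholder: replace with the network's actual training pre-processing
    # (resize, normalization, channel ordering, etc.)
    img = img.resize((224, 224))
    return np.asarray(img, dtype=np.float32) / 255.0

# Collect 64 pre-processed images into an array of shape (N, H, W, C)
paths = sorted(glob.glob('calib_images/*.jpg'))[:64]
calib_set = np.stack([preprocess(Image.open(p).convert('RGB')) for p in paths])
np.save('calib_set.npy', calib_set)  # saved for later use by the quantization API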

3. Basic API

3.1 SDK Client Runner

Create a runner and load the HN and NPZ files to it:

from hailo_sdk_client import ClientRunner

runner = ClientRunner(hn=open('net.hn', 'r').read())
runner.load_params('net.npz')

Listing 1: Runner Creation

3.2 Quantization API (run_quantization)

The basic commands for quantizing a network are run_quantization and run_quantization_from_np. Here is a basic invocation:

from hailo_sdk_client import run_quantization_from_np

run_quantization_from_np(runner, np_array, …)

Listing 2: run_quantization_from_np

np_array is a numpy.ndarray of shape (N,H,W,C) containing pre-processed images. This command has many additional flags for more advanced quantization; here is a brief description of the main ones:

  1. calib_num_batch and batch_size – The number of images used for quantization (calibration) is the product of these two.
  2. Iterative Bias Correction (IBC)¹ – Checks for a systematic error between the native and quantized activations, and corrects it in the bias.
    This boils down to adding ibc=True to the run_quantization command.
  3. Equalization² – Equalizes channels with different dynamic ranges on the same layer.
    This boils down to adding equalize=True to the run_quantization command.
  4. Quantization-aware Fine-Tune (QFT)³ – Performs additional training, this time on the quantized graph.

QFT has several next-level details that one can choose from; the basic invocation is finetune=True in the run_quantization command. For the full description please see the dedicated application note. An example invocation combining some of these flags is sketched below.
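As a rough illustration, a more advanced invocation could look like the following sketch. The keyword names follow the flags described above, but exact spellings and defaults may differ between SDK versions, and 'calib_set.npy' is a hypothetical file of pre-processed images:

import numpy as np
from hailo_sdk_client import ClientRunner, run_quantization_from_np

runner = ClientRunner(hn=open('net.hn', 'r').read())
runner.load_params('net.npz')

calib_set = np.load('calib_set.npy')  # pre-processed images, shape (N, H, W, C)

run_quantization_from_np(
    runner,
    calib_set,
    batch_size=8,        # images per calibration batch
    calib_num_batch=8,   # total calibration images = batch_size * calib_num_batch
    equalize=True,       # channel equalization
    ibc=True,            # Iterative Bias Correction
)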


¹Finkelstein, A., Almog, U. and Grobman, M., 2019. Fighting Quantization Bias With Bias.
²Meller, E., Finkelstein, A., Almog, U. and Grobman, M., 2019. Same, Same But Different – Recovering Neural Network Quantization Error Through Weight Factorization.
³QFT requires a larger calibration set (2048 images by default) and access to a GPU.

3.3 Quantization Script

The aforementioned quantization techniques are, roughly speaking, applied to the whole network. When finer-grained control over which method to apply to which layer is needed, we can use a quantization script. The quantization script is written in an additional file with the .alls suffix. A few of the main features enabled through quantization scripts are:

  1. 16-bit bias (double precision)
  2. Weights and/or activations clipping
  3. 4-bit weights, or 16-bit weights and activations
  4. Quantization groups – This allows using a different scale factor per group; group counts of 1-4 are allowed. Currently supported only for conv layers.

An example of a quantization script applying 16-bit bias to all supported layers:

quantization_param({conv*}, bias_mode=double_scale_initialization)

Listing 3: quantization.alls

An example of enabling a 16-bit representation for the weights and activations. Currently, this option must be applied to the whole network.

quantization_param({conv*}, precision_mode=a16_w16)

Listing 5: quantization.alls

The quantization script can either be loaded directly into the runner, or passed as a parameter to the run_quantization command. If the former is used, it needs to be loaded before the run_quantization command. A combined example is sketched after Listing 6.

runner.load_quantization_script('quantization.alls')

Listing 6: load quantization.alls
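Putting the two together, a minimal ordering sketch (assuming the same runner and calibration array as above) would be:

# Load the quantization script into the runner first, then quantize
runner.load_quantization_script('quantization.alls')
run_quantization_from_np(runner, np_array)  # np_array: pre-processed images, (N, H, W, C)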

3.4 Evaluation

For evaluating the quantization quality, we can define a simple evaluation function that takes as input the runner, the emulation target (usually SdkNumeric() or SdkNative()), and the image(s):

import numpy as np
import tensorflow as tf

def eval(runner, target, images):
    results = []
    with tf.Graph().as_default():
        network_input = tf.placeholder(dtype=tf.float32)
        sdk_export = runner.get_tf_graph(target, network_input)
        output_tensors = sdk_export.output_tensors
        with sdk_export.session.as_default() as sess:
            sess.run(tf.local_variables_initializer())
            for img in images:
                feed = {network_input: np.expand_dims(img, axis=0)}
                logits = sess.run(output_tensors, feed_dict=feed)
                results.append(logits)
    return results

Listing 7: base eval function
If deeper analysis is needed, we would use an enhanced eval function.
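As a usage sketch, one could run the same images through both emulation targets and compare the outputs with a simple cosine metric. The import location of SdkNative/SdkNumeric and the 'calib_set.npy' file are assumptions here:

import numpy as np
from hailo_sdk_client import SdkNative, SdkNumeric  # assumed import location

images = np.load('calib_set.npy')[:8]            # a few pre-processed images

native_out = eval(runner, SdkNative(), images)   # full-precision emulation
quant_out = eval(runner, SdkNumeric(), images)   # quantized emulation

for n, q in zip(native_out, quant_out):
    # Assuming a single output tensor per image
    n = np.ravel(np.asarray(n, dtype=np.float32))
    q = np.ravel(np.asarray(q, dtype=np.float32))
    cos = float(np.dot(n, q) / (np.linalg.norm(n) * np.linalg.norm(q) + 1e-12))
    print('cosine similarity:', cos)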

4. The Flow

4.1 Before you begin

Normally we assess the quantization quality against the results of the Native graph, using the same metric. At first, it is easier to run the quantized version on the emulator rather than the real device, to eliminate issues that might arise from the physical connection. Take a look at Figure 2 for the overall picture.

First, we need to establish the pre- and post-processing functions that the network was trained with. Second, we need to decide on the metric by which we measure the network performance (e.g. cosine, top-1, mAP, etc.).

When evaluating the quantized model (SdkNumeric()) against the Native model (SdkNative()), we also want the exact same images to be fed into both models. After you have established a working quantization setup on the emulator, you should move on to exercising it on the device itself.

We also note that the pre- or post-processing implementations running on the device can be different, as a result of stricter requirements on the run-time environment.

4.2 Getting Started

As said earlier, the recommended approach is incremental: after each step, evaluate the quantization performance and take note of the figures. The first step should be the basic scheme; the next steps are listed below. The reasoning is to start from the methods that are most likely to improve quantization quality and are easiest to use.

  1. Apply 16-bit bias on the whole network⁴ – If applicable, this will always improve the accuracy. It also increases the effectiveness of IBC. If there is not enough memory to apply it on the whole net, try to exclude layers with a high number of output channels.
  2. Apply Equalization
  3. Apply IBC – this might take some time, depending on the number of layers. If you have access to a GPU, it can be expedited.
  4. Apply QFT. NOTE: QFT is inclusive of IBC, so there is no need to add IBC if QFT has been applied.
  5. 16-bit weights and activations⁴

⁴This option requires additional memory resources.

Figure 2: Quantization Analysis flowchart

5. Advanced Analysis

On certain occasions, even after executing the basic flow, the quantized network still performs poorly. We can divide this task into two parts:

    • Find which layer(s) induce the errors⁵
    • Understand the nature of the error

Since the emulator back-end is TensorFlow, this analysis is basically a TensorFlow project. To be effective, we need the following:

  1. A pre-processing function, or a pre-processed set of images; in some cases even one image is enough to begin with.
  2. A post-processing function.
  3. A high-level understanding of what/how it is "bad" – Are the results far off, or just slightly off? Is it a particular branch/class that fails?

Following Figure 2, we first compare the Native results to those of the emulator (SdkNumeric()).

5.1 Find the erroneous layers

Let’s start by finding which layers induce most of the error. We often find that there is a small set of layers, usually with something in common, that adds most of the error. The best option to start with is a per-layer analysis. For this, we can usually do a visual inspection of the post-processed results to see whether they improve.

During this analysis, we made a hidden assumption that there is no meaningful interaction between the errors. To validate this, we should now run an image through the net with the identified layers (Lerr) in Native form, and check that the error is indeed minimal (visually, or numerically 1-2%).


⁵The assumption is that there is a small set of layers that produces most of the error.

5.2 Understand the nature of the error

Now that we know which layers are the offending ones, we would like to collect some more statistics on the activations of these layers. We run the network in Native mode and keep the intermediate tensors. These operations might be heavy and require a GPU to complete.

Here’s a short snippet of how one might collect the inter-layer tensors from a TensorFlow session, using the emulator:

import numpy as np
import tensorflow as tf
from hailo_sdk_client import ClientRunner

# hn, fps, alls, and image are assumed to be defined by the caller
runner = ClientRunner(hn=open(hn).read())
input_tensor = tf.placeholder(dtype=tf.float32)
g = runner.get_tf_graph(SdkNative(), input_tensor, fps=fps,
                        allocator_script_filename=alls)
inter_layers = {x: y for x, y in zip(g.all_layers_names, g.all_layers)}

# Running the session
feed_dict = {input_tensor: np.expand_dims(image, axis=0)}
output = g.session.run(inter_layers, feed_dict=feed_dict)

Listing 8: Collect inter-layer tensors

In the above snippet, output contains all the intermediate tensor values, as well as the logits. We recommend saving the results as an npy file so that they can be analyzed offline.

The first thing to do is to plot a histogram of the activations, channel-wise if possible, to get a feeling for the distribution.
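A minimal plotting sketch is shown below; the 'inter_layers.npy' file and the 'conv37' layer name are hypothetical, standing in for the intermediate tensors saved from Listing 8:

import numpy as np
import matplotlib.pyplot as plt

# Load the dict of intermediate tensors saved earlier (requires allow_pickle)
inter = np.load('inter_layers.npy', allow_pickle=True).item()
act = inter['conv37']                      # native activations, shape (1, H, W, C)

plt.figure()
for c in range(min(act.shape[-1], 16)):    # limit the number of channels for readability
    plt.hist(act[..., c].ravel(), bins=100, histtype='step')
plt.xlabel('activation value')
plt.ylabel('count')
plt.title('Per-channel activation distribution')
plt.show()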

In Figure 3, the X-axis is the quantized activation, the Y-axis is the native activation, and the different colors account for different output channels. Figure 3a shows a ’healthy’ looking distribution. We can see that:
    1. The dynamic range is 0-3, which is good.
    2. All the channels have a similar distribution, meaning no channel is under-represented.
    3. The correlation between the quantized and native values is very good.

Figure 3b shows a quite different picture:
    1. The dynamic range is huge, 5000+ on the native side.
    2. One channel has a few outlier activations that thin out the representation of the bulk of the activations.
    3. The correlation is not all that great; we can see the colors are smudged in the lower-right corner of the graph.

Some things to consider when looking at such a plot:
• If there is a large dynamic range caused by a few outliers, we should consider clipping the activations.
• If there is high variance between channels, or a high dynamic range, we should check whether there is a BN (Batch Normalization) after this layer; if not, adding one would improve the net.
• For classification branches, make sure the Sigmoid is included; it natively trims large negative values.
• When the zero-point is not at zero, QFT/IBC or equalization could help.

6. FAQ

6.1 Fail to Quantize errors

• run_quantization "unsupported required slope" error

Make sure that the weights and the calibration set are real, and not randomized.

• run_quantization OutOfRange error

This usually means that the total number of images that run_quantization was asked to use for the calibration (batch_size × calib_num_batch) is larger than the number of images in the npy file or the directory.

• run_quantization limvals_out small range

In this case limvals_out has a very small range. This means that, on the given calibration set, this output branch (conv37 in the example) is not being activated. Check whether the calibration set is representative enough, so that it activates (all of) the output branches.

• Out of memory crash/killed

This can happen when running the run_quantization command with the heavier algorithms such as QFT or IBC on a large network, on a machine without a GPU.

• run_quantization FT out of data

The QFT algorithm needs 2048 images by default. If the calibration image source (npy or folder) has fewer than that, we will hit this error. Either supply more images, or reduce the number of images used by QFT.

6.2 Bad Accuracy Errors

• MobileNet (MN) – The nature of the layers in MN-based networks means their quantization benefits greatly from applying IBC and Equalization.
• YOLO – In the YOLO architecture a single output branch incorporates 3 elements – classes, confidence, and boxes – with different dynamic ranges, which makes the quantization more challenging. Splitting them into different layers on the HW improves this.
For an output branch of a conv1x1 (e.g. YOLO) we can also output the activations in 16 bits. This significantly helps in reducing under-representation issues. To do that, one needs to apply the following quantization script: quantization_param({output_layer*}, precision_mode=a16_w16)
• 4-bit weights – Any net that has been quantized with one or more layers at 4-bit weights must use QFT and weights clipping.
• Bad accuracy in C/C++ – When creating an application in C/C++, the HailoRT API gives us the tools to specify the data type and format that we are handing over (in the TX direction) or expecting to receive (in the RX direction).
There are 2 structs that hold this information:
– hailo_stream_info_t – Holds the data format, type, etc. that the NN expects to get (input) or return (output).
– input_stream_params – Holds the data format, type, etc. in which the user environment sends or expects to receive the data.

As a user, you do not have the option to directly control the way the NN expects to get the data; it only serves as a means for optimization. For example, you can choose to receive the data without de-quantization applied, and take responsibility for doing that yourself, perhaps in a smarter way.