Quantization#

Note

The tensor data type used by common neural network models is generally ``float32``, and for specific scenarios the industry needs to convert the model to a low-precision / low-bit type such as ``int8``. This whole process is called quantization.

flowchart LR FM[Float Model] -- processing --> QM[Quantized Model]

A float model is usually the starting point; after the intermediate quantization processing it finally becomes a quantized model#

  • Quantization can convert 32-bit floating-point numbers into 8-bit or even 4-bit fixed-point numbers, which reduces runtime memory and cache requirements; in addition, since most hardware has specific optimizations for fixed-point operations, running speed also improves significantly. Compared with ordinary models, **quantized models have the advantages of smaller memory and bandwidth usage, lower power consumption and faster inference speed.**

  • Some computing devices only support fixed-point operations. For a model to run normally on these devices, we need to quantize it.

“Pursuing the ultimate inference speed at the cost of numerical precision” intuitively implies a large accuracy drop, but after a series of sophisticated quantization steps the drop can become minimal while the model remains usable for normal deployment.

Users do not need to understand the implementation details behind these steps to use quantization for basic needs. For interested users we provide more background on the basic principles; please refer to Explanation of the quantization scheme principles.

Users who are already familiar with the basic principles can jump directly to MegEngine quantization steps to see the basic usage.

Warning

Please do not confuse “quantization” with “mixed precision”; for the latter, please refer to the Automatic mixed precision (AMP) document.

Introduction to the basic quantization process#

Currently there are two main quantization techniques used in industry, both of which are supported in MegEngine:

  • Post-Training Quantization (PTQ);

  • Quantization-Aware Training (QAT).

Post-training quantization is a general technique for converting a trained floating-point model into a low-precision / low-bit model. A common approach is to process the weights and activation values of the model and convert them to a lower-precision type. The conversion requires some statistics about the weights and activations of the model to be quantized, such as the scale and zero_point. Although the precision conversion happens after training, in order to obtain these statistics we still need to insert observers (Observer) during model training, that is, during the forward computation.
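
For intuition, the sketch below shows how a scale and zero_point map float values onto an int8 grid. This is a minimal NumPy illustration of the general idea, not MegEngine's internal implementation; the function names and the [-128, 127] range are assumptions made for the example.

import numpy as np

def quantize_int8(x, scale, zero_point):
    # q = round(x / scale) + zero_point, clamped to the representable int8 range
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_int8(q, scale, zero_point):
    # recover an approximation of the original float values
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(8).astype(np.float32)
scale = (x.max() - x.min()) / 255                   # derived from observed min/max statistics
zero_point = int(-128 - np.round(x.min() / scale))  # shifts x.min() onto -128
x_restored = dequantize_int8(quantize_int8(x, scale, zero_point), scale, zero_point)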

Using post-training quantization alone can cause the quantized model to drop points (that is, its prediction accuracy decreases); in severe cases the quantized model becomes unusable. A feasible approach is to use small batches of data to calibrate the Observer before quantization (Calibration), which is also called calibration-based post-training quantization.

Another feasible improvement is quantization-aware training: insert some fake-quantization (FakeQuantize) operators into the floating-point model as a transformation. During training, the FakeQuantize operator simulates quantization based on the information collected by the Observer: the value is first converted to the lower-precision type (truncating its precision) and then converted back to the original type. This lets the model “adapt in advance” to the quantization operation during training and alleviates the accuracy drop of post-training quantization.
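
Conceptually, a fake-quantize step is a quantize-then-dequantize round trip that stays in floating point, so training can continue while the precision loss is already felt. The following is only an illustrative sketch under the same assumptions as the snippet above, not the actual FakeQuantize operator:

import numpy as np

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # simulate int8 precision loss while keeping float32 values,
    # so the rest of training proceeds as usual
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale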

The FakeQuantize operators introduce considerable training overhead. To save total time, the more general model quantization workflow is:

  1. Design and train the Float model following the usual training process (equivalent to obtaining a pre-trained model);

  2. Insert Observer and FakeQuantize operators to obtain the Quantized-Float model (QFloat model for short), and perform quantization-aware training;

  3. After training and quantization, obtain the real Quantized model (Q model for short), which is the low-bit model finally used for inference.

flowchart LR FM[Float Model] --> |train| PFM[Pre-trained Float Model] PFM --> |Observer| PQFM[Pre-trained QFloat Model] PFM --> |FakeQuantize| PQFM PQFM --> |QAT| FQFM[Fine-tuned QFloat Model] FQFM --> |PTQ| QM[Q Model]

Quantization-aware training (QAT) here can be viewed as fine-tuning the pre-trained QFloat model while performing calibration at the same time#

Note

Depending on the actual scenario, the quantization process can be flexibly adapted, for example:

  • If pre-training overhead is not a concern, the QFloat model can be constructed directly to simplify the overall process, then trained and quantized:

    flowchart LR FM[Float Model] --> |Observer| QFM[QFloat Model] FM[Float Model] --> |FakeQuantize| QFM[QFloat Model] QFM --> |QAT| TQFM[trained QFloat Model] TQFM --> |PTQ| QM[Q Model]
  • When constructing the QFloat model, if the FakeQuantize operator is not inserted, the training overhead can be reduced accordingly and the speed can be increased.

    However, this is no longer equivalent to quantization-aware training; only calibration on the calibration data is performed, and the model may drop points severely:

    flowchart LR PFM[Pre-trained Float Model] --> |Observer| PQFM[Pre-trained QFloat Model] PFM[Pre-trained Float Model] -.- |FakeQuantize| PQFM[Pre-trained QFloat Model] PQFM --> |Calibration| CQFM[Calibrated QFloat Model] CQFM --> |PTQ| QM[Q Model]

For the different scenarios above, MegEngine provides a unified set of interfaces that can be flexibly configured.

MegEngine quantization steps#

In MegEngine, the top-level quantization interfaces are QConfig, which configures how to quantize, and quantize_qat and quantize in the model conversion module. By configuring the Observer and FakeQuantize operators used in QConfig, we can customize the quantization scheme. For further instructions, please refer to Quantization configuration QConfig description below. The following shows the steps required for the QAT quantization process:

import megengine.quantization as Q

model = ...  # the pre-trained float model to be quantized

Q.quantize_qat(model, qconfig=Q.ema_fakequant_qconfig)  # ema_fakequant_qconfig is a built-in QConfig preset for QAT

for _ in range(...):
    train(model)

Q.quantize(model)
  1. Use Module to define the model structure and train it as a normal floating-point model to obtain a pre-trained model;

  2. Use quantize_qat to convert the Float model into the QFloat model. This step sets up the Observer and FakeQuantize operators according to the quantization configuration QConfig (for the common QConfig presets see the quantization API reference; the EMA algorithm is used here);

  3. Continue training (fine-tuning) the QFloat model; at this time the Observers collect statistics and FakeQuantize performs fake quantization;

  4. Use quantize to convert the QFloat model into the Q model. This step is also called “real quantization” (as opposed to fake quantization). At this point the network can no longer be trained; its operators are converted to low-bit implementations and the model can be used for deployment.

flowchart LR PFM[Pre-trained Float Model] --> |quantize_qat| QFM[Pre-trained QFloat Model] QFM --> |train| FQFM[Fine-tuned QFloat Model] FQFM --> |quantize| QM[Q Model]

This is the standard quantization workflow; in practice it can be adapted flexibly#

See also

  • We can also use the calibration-based post-training quantization scheme, which requires a calibration dataset (refer to the code demonstration, and see the sketch after this list);

  • The quantized MegEngine model can be directly exported for inference deployment; refer to Export serialized model file (Dump).
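
As referenced above, a rough sketch of the calibration path might look like the following. It assumes the preset calibration_qconfig and the enable_observer / disable_observer switches described later in this document; calib_dataloader is a hypothetical loader over a small calibration dataset.

import megengine as mge
import megengine.quantization as Q

model = ...                                           # pre-trained Float model
Q.quantize_qat(model, qconfig=Q.calibration_qconfig)  # Observer only, no FakeQuantize

model.eval()              # make sure parameters are not updated
Q.enable_observer(model)  # let Observers collect statistics again

for image, _ in calib_dataloader:  # a few small calibration batches
    model(mge.tensor(image))

Q.disable_observer(model)
Q.quantize(model)         # convert to the real quantized (Q) model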

The complete MegEngine model quantization code demonstration can be found at official/quantization.

Note

From a macro point of view, quantization is a conversion between models, but under the hood it is all about processing Modules.

Corresponding to the Float, QFloat and Q models, Modules in MegEngine are organized into three kinds:

  1. The default Module, which performs normal floating-point operations (also called Float Module);

  2. qat.QATModule, which carries Observer and FakeQuantize operators;

  3. quantized.QuantizedModule, which cannot be trained and is dedicated to deployment.

For the more common quantizable operators, implementations of the same name exist at all three levels, for example:

  • module.Linear, module.qat.Linear and module.quantized.Linear

  • module.Conv2d, module.qat.Conv2d and module.quantized.Conv2d

Users do not need to handle these Modules themselves: by calling the model conversion interfaces quantize_qat and quantize, the framework completes the batch replacement of the corresponding operators. Interested users can read the corresponding source code; the conversion logic is introduced in more detail in the model conversion module section below.
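
As an illustration of this batch replacement (a sketch only; the tiny network and the random data are made up for the example), the sub-modules become instances of the same-named qat / quantized classes after conversion:

import numpy as np

import megengine as mge
import megengine.module as M
import megengine.module.qat as qat
import megengine.module.quantized as quantized
import megengine.quantization as Q

class TinyNet(M.Module):
    def __init__(self):
        super().__init__()
        self.fc = M.Linear(8, 4)

    def forward(self, x):
        return self.fc(x)

net = TinyNet()
Q.quantize_qat(net, qconfig=Q.ema_fakequant_qconfig)
print(isinstance(net.fc, qat.Linear))        # True: Float Module replaced by QATModule

net(mge.tensor(np.random.randn(2, 8).astype("float32")))  # let the Observers see some data
Q.quantize(net)
print(isinstance(net.fc, quantized.Linear))  # True: QATModule replaced by QuantizedModule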

Quantization configuration QConfig description#

QConfig consists of two parts, Observer and FakeQuantize. Users can either use a preset configuration or customize their own, as described below.

flowchart LR FM[Float Model] --> QC{QConfig} QC -.- |Observer| QFM[QFloat Model] QC -.- |FakeQuantize| QFM[QFloat Model]

Use preset configuration#

MegEngine provides presets such as ema_fakequant_qconfig, which can be passed as the qconfig argument of quantize_qat:

>>> import megengine.quantization as Q
>>> Q.quantize_qat(model, qconfig=Q.ema_fakequant_qconfig)

In fact, this is equivalent to using the following QConfig (taken from the source code) for quantization-aware training:

ema_fakequant_qconfig = QConfig(
    weight_observer=partial(MinMaxObserver, dtype="qint8", narrow_range=True),
    act_observer=partial(ExponentialMovingAverageObserver, dtype="qint8", narrow_range=False),
    weight_fake_quant=partial(FakeQuantize, dtype="qint8", narrow_range=True),
    act_fake_quant=partial(FakeQuantize, dtype="qint8", narrow_range=False),
)

Two Observers are used here for statistics, and FakeQuantize uses the default operator.

If you only do post-training quantization, i.e. calibration, FakeQuantize is not needed and the ``fake_quant`` attributes are set to None:

calibration_qconfig = QConfig(
    weight_observer=partial(MinMaxObserver, dtype="qint8", narrow_range=True),
    act_observer=partial(HistogramObserver, dtype="qint8", narrow_range=False),
    weight_fake_quant=None,
    act_fake_quant=None,
)

See also

  • The calibration_qconfig shown here is also a QConfig preset that can be used directly;

  • All available QConfig presets can be found in the quantization API reference.

Custom Observer and FakeQuantize#

In addition to using the preset configurations, users can also choose Observer and FakeQuantize flexibly according to their needs and build their own QConfig (see the sketch after the list below).

See also

  • Observer examples: MinMaxObserver / HistogramObserver / ExponentialMovingAverageObserver

  • FakeQuantize examples: FakeQuantize / TQT / LSQ

  • All available Observer and FakeQuantize classes are listed on the quantization API reference page.
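
As mentioned above, a custom configuration simply assembles these classes into a QConfig. The sketch below mixes components from the two presets shown earlier (HistogramObserver for activations, the default FakeQuantize operator); it assumes these classes are importable from megengine.quantization as listed in the API reference.

from functools import partial

from megengine.quantization import (
    FakeQuantize,
    HistogramObserver,
    MinMaxObserver,
    QConfig,
)

my_qconfig = QConfig(
    weight_observer=partial(MinMaxObserver, dtype="qint8", narrow_range=True),
    act_observer=partial(HistogramObserver, dtype="qint8", narrow_range=False),
    weight_fake_quant=partial(FakeQuantize, dtype="qint8", narrow_range=True),
    act_fake_quant=partial(FakeQuantize, dtype="qint8", narrow_range=False),
)

# used exactly like a preset:
# quantize_qat(model, qconfig=my_qconfig)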

Note

In actual use, it may be necessary to let the Observers collect statistics and update parameters during training but stop updating during inference. Both Observer and FakeQuantize support enable and disable methods, and calling a Module's train and eval methods automatically calls the corresponding Observer.enable/disable.

Generally, during data calibration, ``net.eval()`` is executed first to ensure that the network parameters are not updated, and then the enable_observer function is called to manually turn the Observers' statistics collection back on inside the Module (that is, turn everything off globally first, then re-enable the specific parts):

from megengine.quantization import enable_observer

def calculate_scale(data, target):
    model.eval()  # all Observers in the model are disabled now
    enable_observer(model)  # recursively re-enable statistics collection
    ...

Note that these switches are applied recursively. Similar interfaces include disable_observer, enable_fake_quant and disable_fake_quant, which can be found in quantize-operation.
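
For example, one illustrative use of these recursive switches is to evaluate a QAT model with and without simulated quantization in order to estimate how much accuracy the quantization itself costs (evaluate is a hypothetical helper):

from megengine.quantization import disable_fake_quant, enable_fake_quant

model.eval()                 # also stops Observer statistics updates

disable_fake_quant(model)    # recursively turn off FakeQuantize in all sub-modules
float_acc = evaluate(model)  # hypothetical evaluation helper

enable_fake_quant(model)     # turn simulated quantization back on
qat_acc = evaluate(model)
print("accuracy drop caused by quantization:", float_acc - qat_acc)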

Model conversion module and related base classes#

QConfig provides a set of interfaces describing how to quantize, but to use them the network's Modules must be able to attach Observers to the weights and activation values and perform FakeQuantize during forward. The job of the conversion module is to replace ordinary Modules with QATModules that support these functions, and then to replace QATModules with QuantizedModules that cannot be trained and are dedicated to deployment.

These three kinds of Modules correspond to the three kinds of models and can be replaced with same-named Modules of different implementations through the conversion interfaces.

flowchart LR M[module.Conv2d] -- quantize_qat --> QATM[module.qat.Conv2d] -- quantize --> QM[module.quantized.Conv2d]

Taking Conv2d as an example: from Module to QATModule and then to QuantizedModule#

At the same time, considering that quantization is highly related to operator fusion (Fuse), a technique commonly used in inference optimization, MegEngine provides a series of pre-fused Modules such as ConvRelu2d, ConvBn2d and ConvBnRelu2d. Using the fused operators explicitly keeps the process more controllable, and their corresponding QuantizedModule versions directly call the underlying implementation of the fused operator; otherwise the framework would have to automatically match and fuse according to the network structure. The disadvantage of this design is that users need to modify the original network structure and build the network with the fused Modules. The advantage is that users control how the network is transformed more directly. For example, some Conv operators need to be fused and others do not; rather than providing a lengthy whitelist, we prefer explicit control in the network structure. Some operators that are converted by default can also be excluded with the disable_quantize method (examples below).

In addition, auxiliary Modules dedicated to quantization, such as QuantStub and DequantStub, are also provided.
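
A minimal sketch of how these stubs are typically placed, marking where the quantized domain begins and ends in a network (the network itself is made up for illustration):

import megengine.module as M

class QuantReadyNet(M.Module):
    def __init__(self):
        super().__init__()
        self.quant = M.QuantStub()      # brings float input into the quantized domain
        self.conv_bn_relu = M.ConvBnRelu2d(3, 16, 3, 1, padding=1, bias=False)
        self.dequant = M.DequantStub()  # brings the output back to float

    def forward(self, x):
        x = self.quant(x)
        x = self.conv_bn_relu(x)
        return self.dequant(x)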

The principle of conversion is simple: replace the quantizable sub-Modules of a parent Module with the corresponding new Modules. However, some quantizable Modules themselves contain quantizable sub-Modules; for example, ConvBn contains a Conv2d and a BatchNorm2d. The conversion process does not further convert these child Modules, because after the parent Module is replaced its forward computation is completely different and no longer depends on them.

Note

If you need to keep part of a Module and its sub-modules in the Float state without conversion, you can use disable_quantize. For example, if you find that quantizing the fc layer causes the model to drop points, you can turn off quantization for that layer:

>>> model.fc.disable_quantize()

This interface can also be used as a decorator to facilitate the processing of multiple Modules.

Warning

If binary (or higher-arity) ElementWise operators such as addition and multiplication appear in the network structure, the scales of the multiple inputs are generally inconsistent, so a special quantized operator that specifies the output scale must be used. In actual use, you only need to replace these operations with Elemwise, for example ``self.add_relu = Elemwise("FUSE_ADD_RELU")``.

The currently supported quantized Elemwise operators can be found in dnn/scripts/opr_param_defs.py

pdef('ElemwiseMultiType').add_enum(
    'Mode',
    # ...
    Doc('QFUSE_ADD_RELU = 7', 'Fused elemwise add two quantized int8 followed'
        ' by ReLU and typecvt to specified dtype'),
    # ...
)

Note: when using the Elemwise operator in a model to be quantized, the mode does not need the ``Q`` prefix (e.g. use ``FUSE_ADD_RELU``, not ``QFUSE_ADD_RELU``).

In addition, because the conversion process modifies the original network structure, model saving and loading cannot be applied directly to the converted network. When loading parameters that were saved from the converted network, you need to call the conversion interface first to obtain the converted network, and only then can load_state_dict load the parameters.
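
In other words, the conversion must happen before loading; a sketch (the checkpoint path and model class are placeholders):

import megengine as mge
import megengine.quantization as Q

model = ResNet18()                                      # rebuild the original Float model
Q.quantize_qat(model, qconfig=Q.ema_fakequant_qconfig)  # convert first ...
model.load_state_dict(mge.load("qat_checkpoint.pkl"))   # ... then load the QFloat parameters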

ResNet example explanation#

Let’s take ResNet18 as an example to walk through the complete quantization process, which consists of the following steps:

  1. Modify the network structure, using the fused ConvBn2d, ConvBnRelu2d and Elemwise Modules to replace the original Modules. Pre-train the model in the normal way and save checkpoints during training;

  2. Call quantize_qat to convert the model, and perform quantization-aware training fine-tuning (or calibration, depending on the QConfig);

  3. Call quantize to convert to a quantized model, and export the model for subsequent model deployment.

See also

The code is simplified here. For the complete MegEngine official quantization example code, see: official/quantization

Train Float model#

In the model structure we replace some of the sub-modules, substituting fused quantizable Modules for the original separate ``Conv``, ``BN`` and ``ReLU``.

Comparison of the ``BasicBlock`` module before (first block) and after (second block) modification:

import megengine.functional as F
import megengine.module as M

# Before modification: separate Conv / BN / ReLU operators.
class BasicBlock(M.Module):
    def __init__(self, in_channels, channels, stride=1, dilation=1):
        super().__init__()
        self.conv1 = M.Conv2d(in_channels, channels, 3, stride, padding=dilation, bias=False)
        self.bn1 = M.BatchNorm2d(channels)
        self.conv2 = M.Conv2d(channels, channels, 3, 1, padding=1, bias=False)
        self.bn2 = M.BatchNorm2d(channels)
        self.downsample = (
            M.Identity()
            if in_channels == channels and stride == 1
            else M.Sequential(
                M.Conv2d(in_channels, channels, 1, stride, bias=False),
                M.BatchNorm2d(channels),
            )
        )

    def forward(self, x):
        identity = x
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))
        identity = self.downsample(identity)
        x = F.relu(x + identity)
        return x


# After modification: fused ConvBnRelu2d / ConvBn2d / Elemwise Modules.
class BasicBlock(M.Module):
    def __init__(self, in_channels, channels, stride=1, dilation=1):
        super().__init__()
        self.conv_bn_relu1 = M.ConvBnRelu2d(in_channels, channels, 3, stride, padding=dilation, bias=False)
        self.conv_bn2 = M.ConvBn2d(channels, channels, 3, 1, padding=1, bias=False)
        self.downsample = (
            M.Identity()
            if in_channels == channels and stride == 1
            else M.ConvBn2d(in_channels, channels, 1, stride, bias=False)
        )
        self.add_relu = M.Elemwise("FUSE_ADD_RELU")

    def forward(self, x):
        identity = x
        x = self.conv_bn_relu1(x)
        x = self.conv_bn2(x)
        identity = self.downsample(identity)
        x = self.add_relu(x, identity)
        return x

The model is then trained for a number of iterations and checkpoints are saved; details are omitted here:

for step in range(0, total_steps):
    # Linear learning rate decay
    epoch = step // steps_per_epoch
    learning_rate = adjust_learning_rate(step, epoch)

    image, label = next(train_queue)
    image = tensor(image.astype("float32"))
    label = tensor(label.astype("int32"))

    n = image.shape[0]

    loss, acc1, acc5 = train_func(image, label, net, gm)  # traced
    optimizer.step().clear_grad()

    # Save checkpoints

Convert to QFloat model#

Call quantize_qat to convert the network to a QFloat model:

from megengine.quantization import ema_fakequant_qconfig, quantize_qat

model = ResNet18()

# QAT
quantize_qat(model, ema_fakequant_qconfig)

# Or Calibration:
# quantize_qat(model, calibration_qconfig)

Load the checkpoint saved from the pre-trained Float model and continue fine-tuning / calibration with the same training code as above:

if args.checkpoint:
    logger.info("Load pretrained weights from %s", args.checkpoint)
    ckpt = mge.load(args.checkpoint)
    ckpt = ckpt["state_dict"] if "state_dict" in ckpt else ckpt
    model.load_state_dict(ckpt, strict=False)

# Fine-tune / Calibrate with new traced train_func
# Save checkpoints

Finally, the checkpoints of the QFloat model also need to be saved at this point, so that the QFloat model can be loaded and converted during testing and inference.
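
For example, saving might look like this (the path and dict layout are illustrative, matching the loading code above):

import megengine as mge

# save the QFloat (QAT) parameters so they can be reloaded and converted later
mge.save({"state_dict": model.state_dict()}, "qat_checkpoint.pkl")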

Warning

  • You need to convert the original Float model to a QFloat model before loading checkpoints;

  • If both training stages are executed in the same script, separate traced training functions must be used, because the model's parameters have changed and the function needs to be recompiled.

Convert to Q model#

Converting the QFloat model into the Q model and exporting it involves the following steps:

import numpy as np

import megengine.functional as F
from megengine import jit
from megengine.quantization import quantize

@jit.trace(capture_as_const=True)
def infer_func(processed_img):
    model.eval()
    logits = model(processed_img)
    probs = F.softmax(logits)
    return probs

quantize(model)  # convert the QFloat (QAT) model into the Q model

processed_img = transform.apply(image)[np.newaxis, :]  # preprocess a sample image
processed_img = processed_img.astype("int8")
probs = infer_func(processed_img)  # run one inference to trace the graph

infer_func.dump(output_file, arg_names=["data"])  # export the serialized model
  1. Define the trace function and enable ``capture_as_const`` in order to export the model;

  2. Call quantize to convert QAT model to Quantized model;

  3. Prepare data and perform an inference, call dump to export the model.

At this point, we have obtained a quantized model that can be used for deployment.

See also