Quantization#
Note
The tensor data type used by common neural network models is generally ``float32``, and for specific scenarios the industry needs to convert models to a low-precision / low-bit type such as ``int8``. This whole process is called quantization.
Quantization converts 32-bit floating-point numbers into 8-bit or even 4-bit fixed-point numbers, which greatly reduces runtime memory and cache requirements; in addition, since most hardware has dedicated optimizations for fixed-point operations, running speed also improves significantly. Compared with ordinary models, **quantized models have smaller memory and bandwidth usage, lower power consumption and faster inference speed.**
Some computing devices only support fixed-point operations. For a model to run on these devices at all, it must be quantized.
“Giving up precision of numerical representation in pursuit of the ultimate inference speed” intuitively suggests a large accuracy drop, but after a series of carefully designed quantization steps the drop can be made minimal while still supporting normal deployment and use.
Users do not need to understand the implementation details behind quantization in order to meet basic quantization needs. For interested users we provide more background on the basic principles; please refer to Explanation of the principle of the quantization scheme.
Users who are already familiar with the basic principles can jump directly to MegEngine quantization steps to see the basic usage.
Warning
Please do not confuse “quantization” with “mixed precision”; for the latter, please refer to the Automatic mixed precision (AMP) document.
Introduction to the basic quantization process#
There are currently two main quantization techniques used in industry, both of which are supported in MegEngine:
Post-Training Quantization (PTQ);
Quantization-Aware Training (QAT).
Post-training quantization is a general technique for converting a trained floating-point model into a low-precision / low-bit model. A common approach is to process the weights and activations of the model and convert them to a lower-precision type. The conversion requires some statistics about the weights and activations of the model to be quantized, such as the scale and zero_point. Although the precision conversion happens after training, in order to obtain these statistics we still need to insert observers (Observer) into the model's forward computation.
Plain post-training quantization can cause the quantized model to drop points (i.e. its prediction accuracy decreases); in severe cases the quantized model becomes unusable. A feasible remedy is to use small batches of data to calibrate the Observers before quantization (Calibration); this is also called calibration-based post-training quantization.
Another feasible improvement is quantization-aware training, which inserts fake-quantization (FakeQuantize) operators into the floating-point model. During training, the FakeQuantize operator simulates quantization based on the information collected by the Observer: a value is first converted to the lower-precision type, losing precision through truncation, and then converted back to the original type. This lets the model “adapt in advance” to the quantization operation during training and alleviates the accuracy drop of post-training quantization.
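To make the simulation concrete, the following is a minimal numerical sketch (not MegEngine's internal implementation) of what a symmetric int8 fake-quantization step does: round to the integer grid defined by a scale, clamp to the int8 range, then convert back to float32 so the rest of the network keeps computing in floating point:

import numpy as np

def fake_quantize(x, scale, qmin=-128, qmax=127):
    # quantize: map to the low-bit integer grid and clamp to the representable range
    q = np.clip(np.round(x / scale), qmin, qmax)
    # dequantize: convert back to float32, now carrying int8-level precision loss
    return (q * scale).astype("float32")

x = np.array([0.031, -1.27, 0.754], dtype="float32")
print(fake_quantize(x, scale=0.01))  # -> [ 0.03 -1.27  0.75]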
However, the inserted FakeQuantize operators introduce considerable training overhead. To save total training time, the more common model quantization workflow is:
Design and train the Float model following the usual training procedure (equivalent to obtaining a pre-trained model);
Insert Observer and FakeQuantize operators to obtain the Quantized-Float model (QFloat model for short), and perform quantization-aware training;
After training and quantization, obtain the real Quantized model (Q model for short), i.e. the low-bit model finally used for inference.
Note
Depending on the actual scenario, the quantization process can be adjusted flexibly. For example:
If training overhead is not a concern, the QFloat model can be constructed directly to simplify the overall process, then trained and quantized:
flowchart LR
    FM[Float Model] --> |Observer| QFM[QFloat Model]
    FM[Float Model] --> |FakeQuantize| QFM[QFloat Model]
    QFM --> |QAT| TQFM[trained QFloat Model]
    TQFM --> |PTQ| QM[Q Model]

When constructing the QFloat model, if the FakeQuantize operators are not inserted, the training overhead is reduced accordingly and training is faster.
However, this is not equivalent to quantization-aware training: if only data calibration (Calibration) is performed, the model may drop points severely:
flowchart LR
    PFM[Pre-trained Float Model] --> |Observer| PQFM[Pre-trained QFloat Model]
    PFM[Pre-trained Float Model] -.- |FakeQuantize| PQFM[Pre-trained QFloat Model]
    PQFM --> |Calibration| CQFM[Calibrated QFloat Model]
    CQFM --> |PTQ| QM[Q Model]
For these different scenarios, MegEngine provides a unified set of interfaces that can be configured flexibly.
MegEngine quantization steps#
In MegEngine, the top-level quantization interfaces are the quantization configuration QConfig, which specifies the Observer and FakeQuantize operators to use (and thus lets us customize the quantization scheme), and the model conversion functions quantize_qat and quantize. For further instructions, please refer to Quantization configuration QConfig description. The following shows the steps required for the QAT quantization process:
import megengine.quantization as Q
model = ... # The pre-trained float model that needs to be quantified
Q.quantize_qat(model, qconfig=Q.ema_fakequant_qconfig) # EMA is a built-in QConfig for QAT
for _ in range(...):
train(model)
Q.quantize(model)
Use Module to define the model structure, and train it as a normal floating-point model to obtain the pre-trained model;
Use quantize_qat to convert the Float model into a QFloat model. This step sets up the Observer and FakeQuantize operators according to the quantization configuration QConfig (for commonly used presets see the QConfig list; the EMA algorithm is used here);
Continue training (fine-tuning) the QFloat model; during this stage the Observers collect statistics and FakeQuantize performs pseudo-quantization;
Use quantize to convert the QFloat model into the Q model. This step is also called “true quantization” (as opposed to pseudo-quantization). After it the network can no longer be trained; its operators are converted to low-bit computation and the model can be used for deployment.
See also
We can also use the calibration-based post-training quantization scheme, which requires a calibration dataset (see the code demonstration);
The quantized MegEngine model can be exported directly for inference deployment; refer to Export serialized model file (Dump).
The complete MegEngine model quantization code demonstration can be found at official/quantization.
Note
From a macro point of view, quantization is a conversion between model levels, but in its details it is all about processing Modules.
Corresponding to the Float, QFloat and Q models, the Modules in MegEngine fall into three categories:
Module (also called Float Module): the default, performing normal floating-point operations;
qat.QATModule: equipped with Observer and FakeQuantize operators;
quantized.QuantizedModule: cannot be trained and is used specifically for deployment.
For the more common quantizable operators, same-named implementations exist at all three levels, for example:
module.Linear, module.qat.Linear and module.quantized.Linear
module.Conv2d, module.qat.Conv2d and module.quantized.Conv2d
Users do not need to deal with these Modules directly: by calling the model conversion interfaces quantize_qat and quantize, the framework replaces the corresponding operators in batch. Interested users can read the corresponding source code; the conversion logic is described in more detail in the Model conversion module section below.
Quantization configuration QConfig description#
QConfig consists of two parts, Observer and FakeQuantize. Users can either 1. use a preset configuration or 2. customize their own.
Use preset configuration#
MegEngine provides presets such as ema_fakequant_qconfig, which can be passed as the qconfig argument of quantize_qat:
>>> import megengine.quantization as Q
>>> Q.quantize_qat(model, qconfig=Q.ema_fakequant_qconfig)
In fact, this is equivalent to using the following QConfig (shown here as its source code) for quantization-aware training:
ema_fakequant_qconfig = QConfig(
weight_observer=partial(MinMaxObserver, dtype="qint8", narrow_range=True),
act_observer=partial(ExponentialMovingAverageObserver, dtype="qint8", narrow_range=False),
weight_fake_quant=partial(FakeQuantize, dtype="qint8", narrow_range=True),
act_fake_quant=partial(FakeQuantize, dtype="qint8", narrow_range=False),
)
Two kinds of Observers are used here to collect statistics, and FakeQuantize uses the default operator.
If you only do post-training quantization with calibration, FakeQuantize is not needed, so the ``fake_quant`` attributes are set to None:
calibration_qconfig = QConfig(
weight_observer=partial(MinMaxObserver, dtype="qint8", narrow_range=True),
act_observer=partial(HistogramObserver, dtype="qint8", narrow_range=False),
weight_fake_quant=None,
act_fake_quant=None,
)
See also
The calibration_qconfig here is also a preset QConfig that can be used directly;
All available QConfig presets can be found in the quantization API reference.
Custom Observer and FakeQuantize#
In addition to using the preset configuration, users can also choose Observer and FakeQuantize flexibly according to their needs to implement their own QConfig.
See also
Observer examples: MinMaxObserver / HistogramObserver / ExponentialMovingAverageObserver …
FakeQuantize examples: FakeQuantize / TQT / LSQ …
All available Observer and FakeQuantize classes are listed on the quantization API reference page.
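For example, a custom QConfig could keep the preset Observers but swap in the TQT fake-quantization algorithm. This is only a sketch: it assumes TQT can be imported from megengine.quantization alongside the classes used in the preset source above and accepts the same dtype/narrow_range arguments; check the quantization API reference of your version.

from functools import partial

from megengine.quantization import (
    QConfig,
    MinMaxObserver,
    ExponentialMovingAverageObserver,
    TQT,  # assumed importable here, like the Observer / FakeQuantize classes above
)

# same Observers as ema_fakequant_qconfig, but TQT instead of the default FakeQuantize
tqt_qconfig = QConfig(
    weight_observer=partial(MinMaxObserver, dtype="qint8", narrow_range=True),
    act_observer=partial(ExponentialMovingAverageObserver, dtype="qint8", narrow_range=False),
    weight_fake_quant=partial(TQT, dtype="qint8", narrow_range=True),
    act_fake_quant=partial(TQT, dtype="qint8", narrow_range=False),
)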
Note
In actual use, it is often necessary to let the Observers collect statistics and update parameters during training, but stop updating during inference. Both Observer and FakeQuantize support enable and disable methods, and the model's train and eval methods automatically call the corresponding Observer.enable/disable.
Generally, during data calibration, ``net.eval()`` is executed first to ensure that the network parameters are not updated (this also disables all Observers), and then enable_observer is called to manually re-enable the statistics collection of the Observers in the Module (i.e. turn everything off globally first, then turn on the specific parts):
def calculate_scale(data, target):
model.eval() # all model observers are disabled now
enable_observer(model)
...
Note that these switches are applied recursively. Similar interfaces include disable_observer, enable_fake_quant, disable_fake_quant, etc., which can be found in the quantize operation reference.
Model conversion module and related base classes#
QConfig specifies how the model should be quantized; to use it, the Modules of the network need to be able to attach Observers to weights and activations and perform FakeQuantize during the forward pass. The job of the conversion module is to replace ordinary Modules with QATModules that support these quantization functions, and then with QuantizedModules that cannot be trained and are dedicated to deployment.
These three kinds of Modules correspond to the three kinds of models, and the conversion interfaces replace them with same-named modules of different implementations.
At the same time, considering that quantization is closely related to the operator fusion (Fuse) technique commonly used in inference optimization, MegEngine provides a series of pre-fused Modules, such as ConvRelu2d, ConvBn2d and ConvBnRelu2d. Using the fused operators explicitly makes the process more controllable, and their corresponding QuantizedModule versions call the underlying fused-operator implementations directly; otherwise the framework would have to match and fuse operators automatically based on the network structure. The disadvantage of this design is that users must modify the original network structure and build the network with the fused Modules. The advantage is that users control how the network is transformed much more directly. For example, when some Conv operators should be fused and others should not, rather than maintaining a lengthy whitelist we prefer explicit control in the network structure; operators that would be converted by default can also be excluded with the disable_quantize method (there is an example below).
In addition, auxiliary modules dedicated to quantization, such as QuantStub and DequantStub, are also provided.
The principle of conversion is very simple: replace the quantizable (Quantable) sub-Modules of the parent Module with the corresponding new Modules. Some Quantable Modules themselves contain Quantable sub-Modules; for example, ConvBn contains a Conv2d and a BatchNorm2d. The conversion process does not further convert these child modules, because after the parent module is replaced its forward computation is completely different and no longer depends on them.
Note
If you need to keep part of a Module and its sub-modules in the Float state, you can exclude them from conversion with disable_quantize. For example, if you find that quantizing the fc layer makes the model drop points, you can turn off quantization for this layer:
>>> model.fc.disable_quantize()
This interface can also be used as a decorator to facilitate the processing of multiple Modules.
Warning
If the network structure involves binary (or higher-arity) Elementwise operators such as addition and multiplication, a special quantization operator must be used and the output scale must be specified, because the scales of the multiple inputs are not consistent. In practice, you only need to replace these operations with Elemwise, e.g. ``self.add_relu = Elemwise("FUSE_ADD_RELU")``.
The currently supported quantized Elemwise operators can be found in dnn/scripts/opr_param_defs.py:
pdef('ElemwiseMultiType').add_enum(
'Mode',
# ...
Doc('QFUSE_ADD_RELU = 7', 'Fused elemwise add two quantized int8 followed'
' by ReLU and typecvt to specified dtype'),
# ...
)
Note: when using the Elemwise operator in a model to be quantized, the mode name does not need the ``Q`` prefix.
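A minimal sketch of the replacement described above, using the same module names as on this page (the surrounding block is hypothetical):

import megengine.module as M

class AddReluBlock(M.Module):
    def __init__(self):
        super().__init__()
        # quantization-friendly replacement for "F.relu(x + y)";
        # note that the mode string has no "Q" prefix
        self.add_relu = M.Elemwise("FUSE_ADD_RELU")

    def forward(self, x, y):
        return self.add_relu(x, y)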
In addition, because the conversion modifies the original network structure, model saving and loading cannot be applied directly across the conversion. To load parameters saved from a converted network, you must first call the conversion interface to obtain the converted network, and only then load the parameters with load_state_dict.
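As a minimal sketch of this ordering (the checkpoint file name is a placeholder; ResNet18 refers to the quantizable model used in the example below, and the same pattern appears in the fine-tuning code there):

import megengine as mge
from megengine.quantization import ema_fakequant_qconfig, quantize_qat

model = ResNet18()                               # Float network structure (from the example below)
quantize_qat(model, ema_fakequant_qconfig)       # convert to QFloat before loading
checkpoint = mge.load("qat_checkpoint.pkl")      # placeholder path to the saved QFloat checkpoints
model.load_state_dict(checkpoint, strict=False)  # only now do the parameters match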
ResNet example explanation#
Let's take ResNet18 as an example to walk through the complete quantization process, which consists of the following steps:
Modify the network structure, replacing the original Modules with the fused ConvBn2d, ConvBnRelu2d and Elemwise modules; pre-train the model in the normal way and save checkpoints during training;
Call quantize_qat to convert the model, then fine-tune with quantization-aware training (or calibrate, depending on the QConfig);
Call quantize to convert it into a quantized model, and export the model for subsequent deployment.
See also
The code here is simplified; for the complete official MegEngine quantization example, see: official/quantization
Train Float model#
We modify some of the sub-modules in the model structure, replacing the original separate ``Conv``, ``BN`` and ``ReLU`` with the fused, quantizable modules.
Model structure before modification: official/vision/classification/resnet/model.py
Modified model structure: official/quantization/models/resnet.py
Comparison of the ``BasicBlock`` module before and after the modification:
import megengine.functional as F
import megengine.module as M

class BasicBlock(M.Module):
    def __init__(self, in_channels, channels, stride=1):
        super().__init__()
        self.conv1 = M.Conv2d(in_channels, channels, 3, stride, padding=1, bias=False)
        self.bn1 = M.BatchNorm2d(channels)
        self.conv2 = M.Conv2d(channels, channels, 3, 1, padding=1, bias=False)
        self.bn2 = M.BatchNorm2d(channels)
        self.downsample = (
            M.Identity()
            if in_channels == channels and stride == 1
            else M.Sequential(
                M.Conv2d(in_channels, channels, 1, stride, bias=False),
                M.BatchNorm2d(channels),
            )
        )

    def forward(self, x):
        identity = x
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))
        identity = self.downsample(identity)
        x = F.relu(x + identity)
        return x
class BasicBlock(M.Module):
    def __init__(self, in_channels, channels, stride=1):
        super().__init__()
        self.conv_bn_relu1 = M.ConvBnRelu2d(in_channels, channels, 3, stride, padding=1, bias=False)
        self.conv_bn2 = M.ConvBn2d(channels, channels, 3, 1, padding=1, bias=False)
        self.downsample = (
            M.Identity()
            if in_channels == channels and stride == 1
            else M.ConvBn2d(in_channels, channels, 1, stride, bias=False)
        )
        self.add_relu = M.Elemwise("FUSE_ADD_RELU")

    def forward(self, x):
        identity = x
        x = self.conv_bn_relu1(x)
        x = self.conv_bn2(x)
        identity = self.downsample(identity)
        x = self.add_relu(x, identity)
        return x
The model is then trained for a number of iterations and checkpoints are saved; the details are omitted here:
for step in range(0, total_steps):
# Linear learning rate decay
epoch = step // steps_per_epoch
learning_rate = adjust_learning_rate(step, epoch)
image, label = next(train_queue)
image = tensor(image.astype("float32"))
label = tensor(label.astype("int32"))
n = image.shape[0]
loss, acc1, acc5 = train_func(image, label, net, gm) # traced
optimizer.step().clear_grad()
# Save checkpoints
Convert to QFloat model#
Call quantize_qat to convert the network into a QFloat model:
from megengine.quantization import ema_fakequant_qconfig, quantize_qat
model = ResNet18()
# QAT
quantize_qat(model, ema_fakequant_qconfig)
# Or Calibration:
# quantize_qat(model, calibration_qconfig)
Read the checkpoints saved by the pre-trained Float model, and continue to use the same code above for fine-tuning/calibration.
if args.checkpoint:
logger.info("Load pretrained weights from %s", args.checkpoint)
ckpt = mge.load(args.checkpoint)
ckpt = ckpt["state_dict"] if "state_dict" in ckpt else ckpt
model.load_state_dict(ckpt, strict=False)
# Fine-tune / Calibrate with new traced train_func
# Save checkpoints
Finally, the checkpoints of the QFloat model also need to be saved at this point, so that the QFloat model can later be loaded and converted for testing and inference.
Warning
You need to convert the original Float model into a QFloat model before loading the QFloat checkpoints;
If both training stages run in the same script, different traced train functions are required, because the model's parameters have changed and the function needs to be re-compiled.
See also
Finetune- official/quantization/finetune.py
Calibration- official/quantization/calibration.py
Convert to Q model#
Converting the QFloat model into a Q model and exporting it involves the following steps:
import numpy as np

import megengine.functional as F
from megengine import jit
from megengine.quantization import quantize

@jit.trace(capture_as_const=True)
def infer_func(processed_img):
    model.eval()
    logits = model(processed_img)
    probs = F.softmax(logits)
    return probs

quantize(model)

processed_img = transform.apply(image)[np.newaxis, :]
processed_img = processed_img.astype("int8")
probs = infer_func(processed_img)

infer_func.dump(output_file, arg_names=["data"])
Define the trace function and enable ``capture_as_const`` so that the model can be exported;
Call quantize to convert the QAT (QFloat) model into a Quantized model;
Prepare the data, run one inference pass, then call dump to export the model.
At this point we have obtained a quantized model that can be used for deployment.
See also
Inference and dump- official/quantization/inference.py