Automatic mixed precision (AMP)#

Mixed precision training refers to using different numerical precisions for different parts of the network during training: operators where speed matters most (such as conv and matmul) use a lower precision (such as float16) for a noticeable performance gain, while operators that are more sensitive to precision (such as log and softmax) keep a higher precision (such as float32).

Thanks to NVIDIA TensorCores (which require a Volta, Turing, or Ampere architecture GPU), mixed precision training usually gives a substantial speedup (2-3x) in overall training speed for networks dominated by conv and matmul.

Interface introduction#

In MegEngine, the autocast interface can be used to automatically convert the data types of the relevant ops in the network:

import numpy as np
import megengine as mge
from megengine import amp
from megengine.hub import load
net = load("megengine/models", "resnet18", pretrained=False)
inp = mge.tensor(np.random.normal(size=(1, 3, 224, 224)), dtype="float32")

with amp.autocast():    # use the autocast context manager
    oup = net(inp)
print(oup.dtype)

In the above example, autocast is used as a context manager. It can also be used as a decorator:

@amp.autocast()
def train_func(inp):    # use autocast as a decorator
    oup = net(inp)
    return oup

oup = train_func(inp)
print(oup.dtype)

Or use the global switch:

amp.enabled = True
oup = net(inp)
amp.enabled = False
print(oup.dtype)
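
Since amp.enabled is a global flag, code that flips it may want to restore whatever value it had before, so that unrelated code is not affected. Below is a small sketch of that pattern; the try/finally structure is ordinary Python rather than a dedicated AMP API:

prev = amp.enabled      # remember the previous global state
amp.enabled = True
try:
    oup = net(inp)      # executed with mixed precision
finally:
    amp.enabled = prev  # restore the flag for the rest of the program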

After autocast is turned on, the intermediate results of the network become float16, and so do the corresponding gradients. Since float16 has a smaller representable range than float32, particularly small values (such as the loss and the gradients) cannot be expressed accurately in float16. In this case gradient scaling is generally needed: the loss of the network is scaled up, so that the gradients of the intermediate results produced during backpropagation are scaled up by the same factor, which reduces the loss of precision. When the gradients reach the parameters, which are still float32, they are scaled back down, so parameter updates are not affected.
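
To see why scaling is needed, consider a tiny gradient value: it underflows to zero in float16, but survives if it is scaled up first and only scaled back down in float32. The following is a minimal NumPy illustration (not part of the original example); 1024 is just an arbitrary scale factor:

import numpy as np

grad = 1e-8                       # a tiny gradient, representable in float32
print(np.float16(grad))           # 0.0 -- underflows in float16, the value is lost
scaled = np.float16(grad * 1024)  # scale up before casting, as gradient scaling does
print(scaled)                     # ~1.02e-05 -- representable in float16
print(np.float32(scaled) / 1024)  # ~1.0e-08 -- recovered after unscaling in float32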

In MegEngine, gradient scaling is done through the GradScaler interface.

import megengine.functional as F
from megengine.autodiff import GradManager
from megengine.optimizer import SGD

gm = GradManager().attach(net.parameters())
opt = SGD(net.parameters(), lr=0.01)
scaler = amp.GradScaler()           # use GradScaler for gradient scaling

image = mge.tensor(np.random.normal(size=(1, 3, 224, 224)), dtype="float32")
label = mge.tensor(np.zeros(1), dtype="int32")

@amp.autocast()
def train_step(image, label):
    with gm:
        logits = net(image)
        loss = F.nn.cross_entropy(logits, label)
        scaler.backward(gm, loss)   # modify the backward behavior via GradScaler
    opt.step().clear_grad()
    return loss

train_step(image, label)

In the above example, replacing ``gm.backward(loss)`` with ``scaler.backward(gm, loss)`` enables automatic scaling of the loss and gradients. This actually involves three steps:

  • Modify GradManager.backward so that the gradients propagated back from the loss are multiplied by a constant scale_factor;

  • Call GradScaler.unscale on GradManager.attached_tensors to multiply their gradients by the reciprocal of scale_factor;

  • Call GradScaler.update to update the internal statistics and adjust scale_factor when appropriate.

So if you need finer-grained control, for example to accumulate gradients over multiple iterations, you can use the following equivalent form:

@amp.autocast()
def train_step(image, label):
    with gm:
        logits = net(image)
        loss = F.nn.cross_entropy(logits, label)
        gm.backward(loss, dy=mge.tensor(scaler.scale_factor))   # corresponds to step 1
    # custom operations on the gradients can be inserted here
    scaler.unscale(gm.attached_tensors())                       # corresponds to step 2
    scaler.update()                                             # corresponds to step 3
    opt.step().clear_grad()
    return loss

train_step(image, label)
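
For example, the manual form makes it easy to accumulate gradients over several micro-batches before a single parameter update. The sketch below assumes that gradients keep accumulating in the attached tensors until clear_grad is called; num_accum and the images/labels variables are illustrative and not part of the original example:

num_accum = 4   # hypothetical number of micro-batches to accumulate

@amp.autocast()
def train_step_accum(images, labels):
    for image, label in zip(images, labels):
        with gm:
            logits = net(image)
            loss = F.nn.cross_entropy(logits, label) / num_accum
            gm.backward(loss, dy=mge.tensor(scaler.scale_factor))  # step 1, once per micro-batch
    scaler.unscale(gm.attached_tensors())                          # step 2, once after accumulation
    scaler.update()                                                # step 3
    opt.step().clear_grad()
    return loss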

The two approaches above can be loosely described as automatic transmission and manual transmission, respectively.

With the above interfaces, mixed precision training can be enabled by modifying only the training code, without touching the model code, greatly improving the training speed of the network.