Model performance data generation and analysis (Profiler)¶

Note

Due to implementation constraints, the Profiler interfaces for dynamic and static graphs (see dynamic-and-static-graph) are not consistent and have different focuses, so they are described separately below.

Performance analysis under dynamic graph¶

Suppose we have written some dynamic graph code, and the training part looks like this:

import megengine.functional as F

def train_step(data, label, *, optimizer, gm, model):
    with gm:
        logits = model(data)
        loss = F.loss.cross_entropy(logits, label)
        gm.backward(loss)
        optimizer.step().clear_grad()
    return loss

Generate performance data¶

Warning

Enabling the Profiler slows the model down (by about 8%).

To generate performance data with the Profiler, use either of the following two approaches:

Use the megengine.utils.profiler.profile decorator (profile is an alias of Profiler)
Use a Profiler instance in a with statement

The sample code is as follows:

import megengine.functional as F
from megengine.utils.profiler import profile, Profiler

# Decorator style
@profile()
def train_step(data, label, *, optimizer, gm, model):
    with gm:
        logits = model(data)
        loss = F.loss.cross_entropy(logits, label)
        gm.backward(loss)
        optimizer.step().clear_grad()
    return loss

# with-statement style
profiler = Profiler()
def train_step(data, label, *, optimizer, gm, model):
    with profiler:
        with gm:
            logits = model(data)
            loss = F.loss.cross_entropy(logits, label)
            gm.backward(loss)
            optimizer.step().clear_grad()
    return loss

With either approach, each time execution enters the corresponding code block, MegEngine profiles the code inside it separately.

At the end of the program (to be precise, when the Profiler is destructed), a JSON file is generated in the running directory for the subsequent performance analysis.

Parameter Description¶

The constructor of Profiler supports the following parameters:

path: The storage path of the profile data. The default is the profile folder under the current path.
format: The format of the output data. The default is chrome_timeline.json, a standard format supported by Chrome that displays the profiling results as a timeline. Another option is memory_flow.svg, which shows memory usage in a time × address space.
formats: If you need more than one output format, list them in the formats parameter.
sample_rate: If this value n is not zero, GPU memory information is collected every n ops, so that a GPU memory usage curve can be drawn when analyzing the data. The default is 0.
profile_device: Whether to record the time spent on the GPU. The default is True.

Analyze performance data¶

You can use the Perfetto tool to load the JSON file generated in the previous step:

Open the Perfetto webpage;
Click the "Open trace file" button to load the data;
Expand the content.

At this point, you can see several threads in the window, and each thread displays its call-stack history in chronological order. The horizontal axis is time, and the left and right edges of each colored block mark the start and end times of an event. The vertical axis shows the thread to which the event belongs (channel is the main Python thread). For example, when self.conv1(x) is executed in the model source code, a corresponding conv1 block appears on the channel thread, and the matching conv1 blocks on the other threads appear later. The worker's main job is to launch kernels, while the actual computation happens on the gpu thread. The event density on the gpu thread is significantly higher than on the channel and worker threads.

Note

Generally speaking, the busier the GPU thread, the higher the GPU utilization of the model.
Frequent use of operations such as Tensor.shape and Tensor.numpy may force data synchronization and reduce GPU utilization.
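The "busier thread" observation above can be turned into a rough number. The following is a minimal sketch, not part of MegEngine: it assumes the trace contains Chrome Trace Event Format complete events (ph equal to "X", with ts and dur in microseconds), and the helper name busy_fraction_per_thread and the sample events are hypothetical. Traces that use begin/end ("B"/"E") pairs would need to be paired up first.

```python
from collections import defaultdict


def busy_fraction_per_thread(events):
    """Return {(pid, tid): busy_time / trace_span} for complete ("X") events.

    Overlapping events on the same thread are merged so that nested
    blocks are not counted twice.
    """
    intervals = defaultdict(list)
    for ev in events:
        if ev.get("ph") == "X":
            start = ev["ts"]
            intervals[(ev["pid"], ev["tid"])].append((start, start + ev["dur"]))
    if not intervals:
        return {}

    # The span of the whole trace, shared by all threads.
    trace_start = min(s for ivs in intervals.values() for s, _ in ivs)
    trace_end = max(e for ivs in intervals.values() for _, e in ivs)
    span = trace_end - trace_start
    if span <= 0:
        return {key: 0.0 for key in intervals}

    result = {}
    for key, ivs in intervals.items():
        ivs.sort()
        busy = 0
        cur_start, cur_end = ivs[0]
        for s, e in ivs[1:]:
            if s > cur_end:          # gap: close the current merged interval
                busy += cur_end - cur_start
                cur_start, cur_end = s, e
            else:                    # overlap: extend the merged interval
                cur_end = max(cur_end, e)
        busy += cur_end - cur_start
        result[key] = busy / span
    return result


# Hypothetical events: tid 0 stands in for "channel", tid 2 for "gpu".
sample_events = [
    {"ph": "X", "pid": 1, "tid": 0, "name": "conv1", "ts": 0, "dur": 10},
    {"ph": "X", "pid": 1, "tid": 2, "name": "conv1", "ts": 20, "dur": 40},
    {"ph": "X", "pid": 1, "tid": 2, "name": "relu", "ts": 50, "dur": 30},
    {"ph": "X", "pid": 1, "tid": 2, "name": "fc", "ts": 90, "dur": 10},
]
fractions = busy_fraction_per_thread(sample_events)
for key, value in sorted(fractions.items()):
    print(key, f"{value:.0%}")
```

A low fraction on the GPU thread next to a busy channel thread suggests that the host side, rather than the device, is limiting throughput.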

The following operations will be presented in the form of color blocks by default in the Performance interface:

By observing the duration of each event, you can assess the performance bottlenecks of the model. There are also some curves above the timeline. These curves share the same time axis as the events below and show how the corresponding data changes over time.
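Event durations can also be ranked outside the browser. The sketch below is an illustration under stated assumptions rather than an official tool: it assumes the generated file follows the Chrome Trace Event Format (either a bare list of events or an object with a traceEvents key) and handles both complete ("X") events and begin/end ("B"/"E") pairs. The function names and the sample trace are hypothetical.

```python
import json
from collections import defaultdict


def extract_events(trace):
    """Accept both layouts allowed by the Chrome Trace Event Format."""
    if isinstance(trace, dict):
        return trace.get("traceEvents", [])
    return trace


def total_duration_by_name(events):
    """Sum inclusive durations per event name, longest first.

    Nested events are counted inclusively, so a parent block also
    contains the time of its children.
    """
    totals = defaultdict(float)
    open_stacks = defaultdict(list)  # (pid, tid) -> stack of (name, ts)
    for ev in sorted(events, key=lambda e: e.get("ts", 0)):
        ph = ev.get("ph")
        key = (ev.get("pid"), ev.get("tid"))
        if ph == "X":
            totals[ev["name"]] += ev.get("dur", 0)
        elif ph == "B":
            open_stacks[key].append((ev["name"], ev["ts"]))
        elif ph == "E" and open_stacks[key]:
            name, start = open_stacks[key].pop()
            totals[name] += ev["ts"] - start
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical trace text; a real one would be read from the generated file
# with json.load(open(path)).
sample_text = json.dumps({"traceEvents": [
    {"ph": "B", "pid": 1, "tid": 1, "name": "forward", "ts": 0},
    {"ph": "X", "pid": 1, "tid": 2, "name": "conv1", "ts": 5, "dur": 30},
    {"ph": "E", "pid": 1, "tid": 1, "name": "forward", "ts": 50},
    {"ph": "X", "pid": 1, "tid": 2, "name": "conv1", "ts": 60, "dur": 25},
    {"ph": "X", "pid": 1, "tid": 2, "name": "relu", "ts": 90, "dur": 4},
]})
ranking = total_duration_by_name(extract_events(json.loads(sample_text)))
for name, microseconds in ranking:
    print(f"{name}: {microseconds:.0f} us")
```

The names at the top of the ranking are the first candidates to inspect on the timeline.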

Performance analysis under static graphs¶

Suppose we have written some static graph code, and the training part looks like this:

import megengine.functional as F
from megengine.jit import trace

@trace(symbolic=True)
def train_step(data, label, *, optimizer, gm, model):
    with gm:
        logits = model(data)
        loss = F.loss.cross_entropy(logits, label)
        gm.backward(loss)
        optimizer.step().clear_grad()
    return loss

Generate performance data¶

You only need to pass profiling=True to the trace interface, and then call the get_profile method to obtain the performance data.

The modified code is as follows:

import json

import megengine.functional as F
from megengine.jit import trace

@trace(symbolic=True, profiling=True)
def train_step(data, label, *, optimizer, gm, model):
    with gm:
        logits = model(data)
        loss = F.loss.cross_entropy(logits, label)
        gm.backward(loss)
        optimizer.step().clear_grad()
    return loss

... # training code that calls train_step()

# Get the performance data
prof_result = train_step.get_profile()

# Save the result in JSON format
with open("profiling.json", "w") as fout:
    json.dump(prof_result, fout, indent=2)

In this way we will get a JSON file, which can be used for the following performance analysis.

Analyze performance data¶

The JSON file saved in the previous step can be analyzed with the profile_analyze.py script that MegEngine provides in the tools directory. Sample commands are as follows:

# Print detailed help information
python3 -m megengine.tools.profile_analyze -h

# Print the 5 slowest operators
python3 -m megengine.tools.profile_analyze ./profiling.json -t 5

# Print the 5 operator types with the largest total time
python3 -m megengine.tools.profile_analyze ./profiling.json -t 5 --aggregate-by type --aggregate sum

# Print ConvolutionForward operators that take more than 0.1 ms, sorted by memory
python3 -m megengine.tools.profile_analyze ./profiling.json -t 5 --order-by memory --min-time 1e-4 --type ConvolutionForward

The output is a table. The meaning of each column is as follows:

device self time: The running time of the operator on the computing device (such as a GPU)
cumulative: The accumulated time of all previous operators
operator info: Basic information about the operator
computation: The number of floating-point operations required by the operator
FLOPS: The number of floating-point operations the operator performs per second, obtained by dividing computation by device self time and converting the unit
memory: The size of the storage (such as GPU memory) used by the operator
bandwidth: The bandwidth of the operator, obtained by dividing memory by device self time and converting the unit
in_shapes: The shapes of the operator's input tensors
out_shapes: The shapes of the operator's output tensors
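The two derived columns, FLOPS and bandwidth, are plain divisions, which the following worked example illustrates. It is only a sketch: the units (operation count, bytes, seconds) and the sample numbers are assumptions made for illustration, and the actual script may report and convert units differently.

```python
def flops(computation, device_self_time_s):
    """Floating-point operations per second: computation / device self time."""
    return computation / device_self_time_s


def bandwidth(memory_bytes, device_self_time_s):
    """Bytes per second: memory / device self time."""
    return memory_bytes / device_self_time_s


# Hypothetical operator: 2e9 floating-point operations, 4e6 bytes, 1 ms on device.
op_flops = flops(2e9, 1e-3)
op_bandwidth = bandwidth(4e6, 1e-3)

# Unit conversion for display.
print(f"{op_flops / 1e12:.2f} TFLOPS")
print(f"{op_bandwidth / 1e9:.2f} GB/s")
```

Comparing these values with the peak compute throughput and memory bandwidth of the device indicates whether an operator is closer to being compute-bound or memory-bound.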
