Model performance data generation and analysis (Profiler)#
Note
Due to implementation constraints, the Profiler interface is not consistent between :ref:`dynamic and static graphs <dynamic-and-static-graph>`, and the focus of each differs; the two cases are described separately below.
Performance analysis under dynamic graph#
Suppose we have written some dynamic graph code, and the training part is as follows:
def train_step(data, label, *, optimizer, gm, model):
    with gm:
        logits = model(data)
        loss = F.loss.cross_entropy(logits, label)
        gm.backward(loss)
    optimizer.step().clear_grad()
    return loss
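For reference, the ``model``, ``gm`` and ``optimizer`` objects used above could be set up as in the minimal sketch below; the toy linear model and the hyperparameters are purely illustrative:

import megengine.functional as F
import megengine.module as M
import megengine.optimizer as optim
from megengine.autodiff import GradManager

# Hypothetical toy model; replace with your own network.
model = M.Linear(32, 10)

# Attach the parameters to a GradManager so that gm.backward() can record gradients.
gm = GradManager().attach(model.parameters())

# Any MegEngine optimizer works here; SGD with this learning rate is just an example.
optimizer = optim.SGD(model.parameters(), lr=0.01)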
Generate performance data#
Warning
Mounting the Profiler will slow down the running speed of the model (about 8%).
If you want to use the Profiler to generate performance data, there are two ways to write it (choose either one):
Use the :py:data:`~megengine.utils.profiler.profile` decorator (``profile`` is an alias of ``Profiler``)
Use the ``with Profiler()`` syntax
The sample code is as follows:
from megengine.utils.profiler import profile, Profiler

# Decorator style
@profile()
def train_step(data, label, *, optimizer, gm, model):
    with gm:
        logits = model(data)
        loss = F.loss.cross_entropy(logits, label)
        gm.backward(loss)
    optimizer.step().clear_grad()
    return loss

# with-statement style
profiler = Profiler()

def train_step(data, label, *, optimizer, gm, model):
    with profiler:
        with gm:
            logits = model(data)
            loss = F.loss.cross_entropy(logits, label)
            gm.backward(loss)
        optimizer.step().clear_grad()
    return loss
In this way, every time the corresponding code block is entered, MegEngine performs a separate round of profiling for the code in that region.
At the end of the program (more precisely, when the Profiler is destructed), a JSON file is generated in the working directory for the subsequent performance analysis.
Parameter Description#
The constructor of ``Profiler`` supports the following parameters (a combined usage sketch follows the list):
path
    The storage path of the profiling data; the default is the ``profile`` folder under the current path.
format
    The format of the output data; the default is ``chrome_timeline.json``, a standard format supported by Chrome that displays the profiling results as a timeline. Another option is ``memory_flow.svg``, which shows memory usage in a time x address-space form.
formats
    If more than one output format is needed, list them in the ``formats`` parameter.
sample_rate
    If this item is not zero, GPU memory information is sampled every n ops, and a memory occupancy curve can be drawn when analyzing the data. The default is 0.
profile_device
    Whether to record GPU time; the default is True.
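Putting a few of these parameters together, a minimal sketch (the values below are illustrative, not recommendations):

from megengine.utils.profiler import Profiler

profiler = Profiler(
    path="./profile",                                     # directory for the generated files
    formats=["chrome_timeline.json", "memory_flow.svg"],  # request both output formats
    sample_rate=10,                                       # sample memory info every 10 ops
    profile_device=True,                                  # also record GPU time
)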
Analyze performance data#
You can use the Perfetto tool to load the JSON file generated in the previous step:
Open Perfetto webpage;
Click the ``Open trace file`` button to load the data;
Expand the content.
At this point, you can see several threads in the window, and each thread displays its historical call stack in chronological order. The abscissa is the time axis; the left and right edges of each color block mark the start and end time of the event. The ordinate indicates the thread to which the event belongs (channel is the main Python thread). For example, when we execute ``self.conv1(x)`` in the model source code, a corresponding ``conv1`` block appears on the channel thread, and the matching ``conv1`` blocks on the other threads lag slightly behind. The worker's main job is to dispatch kernels; the real computation happens on the GPU thread. The event density on the GPU thread is significantly higher than on the channel and worker threads.
Note
Generally speaking, the busier the GPU thread, the higher the GPU utilization of the model.
Frequent use of ``Tensor.shape`` and ``Tensor.numpy`` operations may cause data synchronization and reduce GPU utilization (see the sketch below).
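As a rough illustration of the second point, a minimal sketch of keeping such host-side reads out of the hot path; the training loop and logging interval are assumed, not part of the Profiler API:

for step, (data, label) in enumerate(dataloader):  # dataloader is assumed to exist
    loss = train_step(data, label, optimizer=optimizer, gm=gm, model=model)
    # loss.numpy() forces a device-to-host synchronization, so only call it
    # when a value is actually needed, e.g. for occasional logging.
    if step % 100 == 0:
        print(f"step {step}: loss = {float(loss.numpy()):.4f}")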
The following operations are presented as color blocks by default in the Performance interface:
GradManager.backward
Optimizer.step
Optimizer.clear_grad
Module.forward
By observing the duration of these events, you can assess the performance bottleneck of the model. There are also some curves above the timeline; they share the same time axis as the events below and show how the corresponding data changes over time.
Performance analysis under static graphs#
Suppose we have written some static graph code, and the training part is as follows:
@trace(symbolic=True)
def train_step(data, label, *, optimizer, gm, model):
    with gm:
        logits = model(data)
        loss = F.loss.cross_entropy(logits, label)
        gm.backward(loss)
    optimizer.step().clear_grad()
    return loss
Generate performance data#
You only need to pass ``profiling=True`` to the trace interface, and then call the ``get_profile`` method to obtain the performance data.
The modified code is as follows:
import json

@trace(symbolic=True, profiling=True)
def train_step(data, label, *, optimizer, gm, model):
    with gm:
        logits = model(data)
        loss = F.loss.cross_entropy(logits, label)
        gm.backward(loss)
    optimizer.step().clear_grad()
    return loss

...  # training code that calls train_step()

# obtain the performance data
prof_result = train_step.get_profile()

# save the result in JSON format
with open("profiling.json", "w") as fout:
    json.dump(prof_result, fout, indent=2)
In this way we get a JSON file, which can be used for the performance analysis below.
Analyze performance data#
The JSON file saved in the previous step can be analyzed with the ``profile_analyze.py`` script provided in MegEngine's ``tools`` directory. Sample commands are as follows:
# print detailed help information
python3 -m megengine.tools.profile_analyze -h
# print the 5 slowest operators
python3 -m megengine.tools.profile_analyze ./profiling.json -t 5
# print the operator types with the top 5 total time
python3 -m megengine.tools.profile_analyze ./profiling.json -t 5 --aggregate-by type --aggregate sum
# print ConvolutionForward operators taking more than 0.1 ms, sorted by memory
python3 -m megengine.tools.profile_analyze ./profiling.json -t 5 --order-by memory --min-time 1e-4 --type ConvolutionForward
The output will be a table; the meaning of each column is as follows (a small worked example follows the list):
device self time
    The running time of the operator on the computing device (such as a GPU).
cumulative
    The accumulated time of all preceding operators.
operator info
    The basic information of the operator.
computation
    The number of floating-point operations required by the operator.
FLOPS
    The number of floating-point operations performed by the operator per second, obtained by dividing ``computation`` by ``device self time`` and converting the unit.
memory
    The size of storage (such as GPU memory) used by the operator.
bandwidth
    The bandwidth of the operator, obtained by dividing ``memory`` by ``device self time`` and converting the unit.
in_shapes
    The shapes of the operator's input tensors.
out_shapes
    The shapes of the operator's output tensors.
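As a quick sanity check on how the two derived columns relate to the others, a small worked example with made-up numbers:

# Illustrative numbers only, not taken from a real profile.
computation = 3.6e9        # floating-point operations performed by the operator
device_self_time = 1.2e-3  # seconds spent on the device
memory = 48e6              # bytes of storage touched by the operator

flops = computation / device_self_time   # 3.0e12 operations/s, i.e. 3 TFLOPS
bandwidth = memory / device_self_time    # 4.0e10 B/s, i.e. 40 GB/s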