Defined in File cg.h
public std::enable_shared_from_this< ComputingGraph >
public mgb::CompNodeDepedentObject
A computing graph manages operators and variables. It can be compiled to create an AsyncExecutable that computes the given variables.
callback to be invoked when some output is ready
Note that the output may be deallocated after the callback returns if no further node depends on it
specify the callback of one output var
specifies what outputs are required in compile(); a callback may be empty, to only ensure that the var is computed
Each graph is assigned a unique, increasing ID; useful for debugging
generate an executable object that, when executed, calls the callbacks on the output values
Note that only the most recently compiled function can be used, since oprs may have internal state
compile multiple graph parts for partial execution
The parts in out_specs correspond to the execution steps of this graph. The returned AsyncExecutable objects should be called in the same order as the parts given here.
The created AsyncExecutable objects would belong to newly generated graphs (not this graph). So functions compiled by compile() and compile_multi_part() can co-exist. All the new graphs would share device memory with this graph.
insert a new operator node; its inputs must already exist in the current graph
the node in the graph (which may be a different node, due to deduplication)
used by OperatorNodeBase to allocate its outputs
get current computing sequence
get information on how a variable is needed in current comp seq
find var node by its ID
Note: this searches recursively in subgraphs, and its complexity is linear in the number of vars (there is no indexing on var node ID)
the VarNode pointer if it is found, or nullptr if no var has an equal ID
get underlying event connector
get an instance for static var value infer manager
get an instance for sequence computing node optimizer
share static device memory with another computing graph
To share memory among all graphs g[0..n-1], the correct way is to call g[i].share_device_memory_with(g[0]) for i in range(1, n).
This method must be called before compiling, and the user must ensure AsyncExecutable objects with shared static device memory would not be executed simultaneously.
set a custom DeviceMemoryAllocator to be used
The given allocator would be used for allocation in all graphs involved in share_device_memory_with() calls related to this graph.
get size of currently allocated static device memory buffer on given computing node
memory size in bytes
clear statically allocated device memory
use count of device memory before clear; a value of 1 indicates the memory would be actually released
set this graph as subgraph of another
This mechanism is used to implement special control operators like loop. Being a subgraph has following consequences:
node ID counter would be shared
when an AsyncExecutable compiled from a subgraph is called, it does not wait for the previous run to finish; instead, when an AsyncExecutable from the parent graph is being waited on, it calls wait() on the AsyncExecutables from the subgraph.
some options would be passed from parent graph to sub graph
Note that the reference to a subgraph should be kept by its owner operator, whose reference is in turn kept by the parent graph.
get number of operators inserted in this graph
record the given async error; operators should call this function rather than throwing an exception directly for errors that occur during computation.
Public Static Functions
assert that the refcnt for ptr is one, and destroy the ptr
pre-allocate static storage used for internal states of computing graphs
This is mainly used to reduce memory usage in single-threaded environments. If a newly compiled function requires a larger memory size than previous ones, MegBrain has to re-allocate the static storage buffer, and the previous buffers are all wasted (they should have been shared with the largest buffer).
If we know the max buffer size for all functions, the buffer can be pre-allocated so it can be shared by all.
A common practice is to call prealloc_static_storage(0) to get the current buffer size at the end of the program, and use this value as the buffer size in the next run.
current buffer size
size: anticipated max size of all buffers, in bytes
provided by impl to support alloc_varnode
get attribute for an operator
graph optimization level:
0: disable
1: level-1: inplace arithmetic transformations during graph construction
2: level-2: level-1, plus global optimization before graph compiling
3: also enable JIT
<0: corresponding level, with result check for debug
max size of allreduce packs, in MB; set this option to zero to disable PackAllReducePass
do not pack the first n allreduces; PackAllReducePass is disabled if allreduce_pack_max_size is zero
set logging level; a larger number means more verbose output:
0: no log info
1: static memory allocation status, WorkspaceLimitGetter summary, optimizer summary
2: optimizer var replace details during graph compiling, duplicated operators
async exec: dispatch on separate threads for different comp_nodes
0: do not perform async dispatch
1: dispatch async if there are more than one comp node with limited queue
mask 0b10: async if there are multiple comp nodes
mask 0b100: always async
force dynamic memory alloc for all vars
whether to perform var sanity check on first run
whether to allocate static memory just after compiling graph
whether to perform only non-computing tasks (such as memory allocation and queue initialization) for the next exec. This is reset to false when the graph is executed.
whether to enable sublinear memory optimization
do not re-profile to select best impl algo when input shape changes (use previous algo)
whether to perform defragmenting when memory allocation for a dynamic var fails
whether to reshape grad var whose wrt shape is statically inferrable but its own shape is dynamic
whether to enable memory swap; since swap performs much worse than sublinear memory optimization, it is recommended to try sublinear first
whether to use CompNodeSeqRecorder to record the execution sequence and directly replay it for later executions.
Level 1 is mainly used to speed up execution (especially for opencl); level 2 is used for reducing memory usage.
Level 1 constraints:
All vars must be statically allocated
Host input/output buffer pointers can not be changed if shape is not changed (this is not checked in execution for efficiency considerations; this is potentially dangerous)
Synchronization can only occur at the end of execution
Not all comp node implementations support recording computing sequence
Only one comp node can be used in the graph
Level 2: besides recording the computing sequence, the dependencies are also moved into the compiled func (see GraphExecutable::ExecDependency). Additional constraints:
Shapes can not change
both fake_next_exec and var_sanity_check_first_run must be disabled
Var shapes must be correctly setup before calling compile()
whether to evaluate var node values as they are inserted
Request that operators should not force update their inputs.
THIS FLAG IS RESERVED FOR INTERNAL USE
When this flag is set, operators like AddUpdate and BatchNorm will still attempt to inplace update their inputs, but failing to do so will not be considered as an error.
add extra dependencies to the comp seq if a specific var is depended on
contains any user data associated with this graph
graph optimization options
whether to enable JIT (JIT is also enabled at graph optimization level 3); this value indicates the JIT level: 1 for basic elemwise oprs, 2 to also include reduce oprs
whether to enable fine-grained TensorRT opr replace
attribute for a specific operator
sequence compile optimization options
whether to enable memory forwarding to optimize mem plans
whether to enable static memory reuse (i.e. using optimized static memory allocation algorithm)
whether to enable comp node optimization (e.g. using copy stream for I/O operators)
Control parameter for sublinear memory optimization.
whether this var is not needed at all
whether computing value is needed (i.e. either dev_value, or shape, or host_value)
whether this var can be empty
number of requests for directly computing by passing an empty callback
number of operators that need device value of this var
last dev value reader in the computing sequence
number of operators that need shape of this var, which can not be statically inferred
number of operators that need host value of this var, which can not be statically inferred
number of operators in dev_value and host_value that allow this var to be empty