Class ComputingGraph

Inheritance Relationships

Base Types

Class Documentation

class mgb::cg::ComputingGraph : public std::enable_shared_from_this<ComputingGraph>, public mgb::CompNodeDepedentObject

Computing graph.

A computing graph manages operators and variables. It can be compiled to create an AsyncExecutable that computes given variables.

Public Types

using Callback = thin_function<void(DeviceTensorND&)>

callback to be invoked when some output is ready

note that the output may be deallocated after the call returns if no further node depends on the output

using OutputSpecItem = std::pair<SymbolVar, Callback>

specify the callback of one output var

using OutputSpec = std::vector<OutputSpecItem>

specifies what outputs are required in compile(); the callback can be empty, to only ensure that the var is computed
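
As a rough end-to-end sketch of how these types fit together (assuming typical MegBrain headers and the opr::Host2DeviceCopy opr; exact include paths may differ between versions):

    #include "megbrain/graph.h"
    #include "megbrain/opr/io.h"

    using namespace mgb;

    void run_example() {
        auto graph = cg::ComputingGraph::make();

        // host data enters the graph through a Host2DeviceCopy opr
        auto host_x = std::make_shared<HostTensorND>(
                CompNode::load("xpu0"), TensorShape{4}, dtype::Float32());
        // ... fill host_x->ptr<float>() with input data ...
        auto x = opr::Host2DeviceCopy::make(*graph, host_x);
        auto y = x + x;

        // Callback: copy the device value to the host when it is ready;
        // the device buffer may be deallocated after the callback returns
        HostTensorND host_y;
        cg::ComputingGraph::OutputSpec out_spec{
                {y, [&](DeviceTensorND& dv) { host_y.copy_from(dv).sync(); }}};

        auto func = graph->compile(out_spec);
        func->execute();
        func->wait();
    }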

Public Functions

ComputingGraph()
~ComputingGraph() = default
size_t id() const

graph ID

Each graph would be assigned a unique increasing ID; useful for debugging

size_t next_node_id() = 0
std::unique_ptr<AsyncExecutable> compile(const OutputSpec &out_spec) = 0

generate an executable object that, when executed, would call the callbacks on the output values

Note that only the most recently compiled function can be used, since oprs may have internal state

SmallVector<std::unique_ptr<AsyncExecutable>> compile_multi_part(const SmallVector<OutputSpec> &out_specs) = 0

compile multiple graph parts for partial execution

The parts in out_specs correspond to the execution steps of this graph. The returned AsyncExecutable objects should be called in the same order as the parts given here.

The created AsyncExecutable objects would belong to newly generated graphs (not this graph). So functions compiled by compile() and compile_multi_part() can co-exist. All the new graphs would share device memory with this graph.
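
A sketch of two-part compilation under these rules (part1_out and part2_out are placeholder SymbolVars, on_part2 a user-supplied Callback):

    SmallVector<cg::ComputingGraph::OutputSpec> out_specs;
    out_specs.push_back({{part1_out, {}}});        // empty callback: just compute it
    out_specs.push_back({{part2_out, on_part2}});  // deliver the final output
    auto funcs = graph->compile_multi_part(out_specs);
    for (auto&& f : funcs)
        f->execute();  // must be called in the order of out_specs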

OperatorNodeBase *insert_opr(std::unique_ptr<OperatorNodeBase> opr) = 0

insert a new operator node; its inputs must already exist in the current graph

Return

the node in the graph (maybe another node due to deduplication)

template<typename ...Args>
VarNode *alloc_varnode(Args&&... args)

used by OperatorNodeBase to allocate its outputs

void free_varnode(VarNode *var)
AsyncExecutable *current_comp_seq() = 0

get current computing sequence

const VarReceiverInfo &var_receiver_in_current_comp_seq(const VarNode *var) const = 0

get information on how a variable is needed in current comp seq

std::string get_mem_allocation_info() const = 0
VarNode *find_var_by_id(size_t id) const = 0

find var node by its ID

Note: this searches recursively in subgraphs, and its complexity is linear in the number of vars (there is no indexing on var node ID)

Return

VarNode pointer if it is found, or nullptr if no var is found to have equal ID

SyncEventConnecter &event()

get underlying event connector

const SyncEventConnecter &event() const
Options &options()
const Options &options() const
static_infer::StaticInferManager &static_infer_manager() = 0

get the manager for static var value inference

SeqCompNodeOptimizer &seq_comp_node_optimizer() = 0

get the sequence computing node optimizer

void share_device_memory_with(ComputingGraph &other) = 0

share static device memory with another computing graph

To share memory for all graphs g[0..n-1], the correct way is to call g[i].share_device_memory_with(g[0]) for i in range(1, n).

This method must be called before compiling, and the user must ensure AsyncExecutable objects with shared static device memory would not be executed simultaneously.
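
The pattern above, sketched in C++ for n graphs that all share with g[0]:

    std::vector<std::shared_ptr<cg::ComputingGraph>> g(n);
    for (auto&& p : g)
        p = cg::ComputingGraph::make();
    for (size_t i = 1; i < n; ++i)
        g[i]->share_device_memory_with(*g[0]);  // must precede any compile()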

void set_device_memory_allocator(std::shared_ptr<DeviceMemoryAllocator> allocator) = 0

set a custom DeviceMemoryAllocator to be used

The given allocator would be used for allocation in all graphs involved in share_device_memory_with() calls related to this graph.

size_t get_device_memory_size(CompNode cn) = 0

get size of currently allocated static device memory buffer on given computing node

Return

memory size in bytes

size_t clear_device_memory() = 0

clear statically allocated device memory

Return

use count of device memory before clear; a value of 1 indicates the memory would be actually released

void set_as_subgraph(ComputingGraph &par_graph) = 0

set this graph as subgraph of another

This mechanism is used to implement special control operators such as loop. Being a subgraph has the following consequences:

  1. node ID counter would be shared

  2. when an AsyncExecutable compiled from the subgraph is called, it would not wait for the previous run to finish; instead, when the AsyncExecutable from the parent graph is being waited on, it would call wait() on the AsyncExecutables from the subgraph.

  3. some options would be passed from the parent graph to the subgraph

Note that the reference to a subgraph should be kept by its owner operator, whose reference is in turn kept by the parent graph.

size_t nr_oprs_in_graph() const = 0

get number of operators inserted in this graph

void record_async_error(std::unique_ptr<MegBrainError> async_exc) = 0

record given async error; this function should be called, rather than throwing the exception directly, for errors that occur during computation

Public Static Functions

std::shared_ptr<ComputingGraph> make()
void assert_destroy(std::shared_ptr<ComputingGraph> &ptr)

assert that the refcount of ptr is one, and destroy the ptr

size_t prealloc_static_storage(size_t size)

pre-allocate static storage used for internal states of computing graphs

This is mainly used to reduce memory usage in single-threaded environments. If a newly compiled function requires a larger memory size than previous ones, megbrain has to re-allocate the static storage buffer, and the previous buffers are all wasted (since they could have been shared with the largest buffer).

If we know the max buffer size for all functions, the buffer can be pre-allocated so it can be shared by all.

A common practice is to call prealloc_static_storage(0) to get the current buffer size at the end of the program, and use this value as the buffer size in the next run, as sketched below.

Return

current buffer size

Parameters
  • size: anticipated max size of all buffers, in bytes
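
A sketch of that practice; save_size() and load_size() are hypothetical persistence helpers, not part of the API:

    // end of this run: query how much static storage was actually used
    size_t used = cg::ComputingGraph::prealloc_static_storage(0);
    save_size(used);  // hypothetical: persist the value for the next run

    // start of the next run: pre-allocate so all functions share one buffer
    cg::ComputingGraph::prealloc_static_storage(load_size());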

Protected Functions

void *alloc_varnode_storage() = 0

provided by impl to support alloc_varnode

void free_varnode_storage(void *ptr) = 0
struct Options

Public Functions

const OprAttribute &get_opr_attribute(OperatorNodeBase *opr) const

get attribute for an operator

Public Members

struct mgb::cg::ComputingGraph::Options::OprAttribute opr_attribute
struct mgb::cg::ComputingGraph::Options::SeqOpt seq_opt
mgb::cg::ComputingGraph::Options::GraphOpt graph_opt
int16_t graph_opt_level = 2

graph optimization level:

  • 0: disable

  • 1: level-1: inplace arith transformations during graph construction

  • 2: level-2: level-1, plus global optimization before graph compiling

  • 3: also enable JIT

  • <0: corresponding level, with result check for debug
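
Options are plain members, typically set right after creating the graph and before compile(); for example (a sketch):

    auto graph = cg::ComputingGraph::make();
    graph->options().graph_opt_level = 2;  // level-1 plus global optimization
    graph->options().log_level = 0;        // no log info
    // ... build vars, then graph->compile(...)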

int16_t allreduce_pack_max_size = 0

max size of allreduce packs in MB; set this option to zero to disable PackAllReducePass

int16_t allreduce_pack_ignore_first = 2

do not pack the first n allreduces; PackAllReducePass is disabled if allreduce_pack_max_size is zero

uint16_t log_level = 1

set logging level; a larger number means more verbose:

  • 0: no log info

  • 1: static memory allocation status, WorkspaceLimitGetter summary, optimizer summary

  • 2: optimizer var replace details during graph compiling, duplicated operators

uint16_t async_exec_level = 1

async exec: dispatch on separate threads for different comp nodes:

  • 0: do not perform async dispatch

  • 1: dispatch async if there is more than one comp node, with limited queue

  • mask 0b10: async if there are multiple comp nodes

  • mask 0b100: always async

bool force_dynamic_alloc = false

force dynamic memory alloc for all vars

bool var_sanity_check_first_run = true

whether to perform var sanity check on first run

bool allocate_static_mem_after_graph_compile = false

whether to allocate static memory just after compiling graph

bool fake_next_exec = false

whether only to perform non-computing tasks (like memory allocation and queue initialization) for next exec. This would be reset to false when the graph is executed.

bool enable_sublinear_memory_opt = false

whether to enable sublinear memory optimization

struct mgb::cg::ComputingGraph::Options::SublinearMemConfig sublinear_mem_config
bool no_profiling_on_shape_change = false

do not re-profile to select best impl algo when input shape changes (use previous algo)

bool enable_var_mem_defragment = true

whether to perform defragmenting when memory allocation for a dynamic var fails

bool enable_grad_var_static_reshape = false

whether to reshape grad var whose wrt shape is statically inferrable but its own shape is dynamic

bool enable_memory_swap = false

whether to enable memory swap; since swap's performance is much worse than sublinear memory optimization, it is recommended to try sublinear first

uint8_t comp_node_seq_record_level = 0

whether to use CompNodeSeqRecorder to record the execution sequence and directly replay it for later executions.

Level 1 is mainly used to speed up execution (especially for opencl); level 2 is used for reducing memory usage.

Level 1 constraints:

  1. All vars must be statically allocated

  2. Host input/output buffer pointers can not be changed if shape is not changed (this is not checked in execution for efficiency considerations; this is potentially dangerous)

  3. Synchronization can only occur at the end of execution

  4. Not all comp node implementations support recording computing sequence

  5. Only one comp node can be used in the graph

Level 2: besides recording the computing sequence, the dependencies are also moved into the compiled func (see GraphExecutable::ExecDependency). Additional constraints:

  1. Shapes can not change

  2. both fake_next_exec and var_sanity_check_first_run must be disabled

  3. Var shapes must be correctly set up before calling compile()
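
A sketch of enabling level-2 recording under the constraints above:

    auto graph = cg::ComputingGraph::make();
    auto&& opt = graph->options();
    opt.comp_node_seq_record_level = 2;
    opt.fake_next_exec = false;              // required at level 2
    opt.var_sanity_check_first_run = false;  // required at level 2
    // var shapes must be fully set up before compile() and can not change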

bool eager_evaluation = false

whether to evaluate var node values as they are inserted

bool imperative_proxy_graph = false
bool no_force_inplace = false

Request that operators should not force update their inputs.

THIS FLAG IS RESERVED FOR INTERNAL USE

When this flag is set, operators like AddUpdate and BatchNorm will still attempt to inplace update their inputs, but failing to do so will not be considered as an error.

ThinHashMap<VarNode*, VarNodeArray> extra_vardeps

extra dependencies to add to the computing sequence when a specific var is depended on
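
For instance, to have dep executed whenever var is needed by the computing sequence (a sketch; var and dep are placeholder SymbolVars):

    graph->options().extra_vardeps[var.node()].push_back(dep.node());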

UserDataContainer user_data

contains any user data associated with this graph

struct GraphOpt : public mgb::cg::GraphCommonOptimizeOptions

graph optimization options

Public Members

uint8_t jit = 0

whether to enable JIT; JIT would also be enabled at O3. This value indicates the JIT level: 1 for basic elemwise oprs; 2 to also include reduce oprs

bool tensorrt = false

whether to enable fine-grained TensorRT opr replace

struct OprAttribute

attribute for a specific operator

struct SeqOpt

sequence compile optimization options

Public Members

bool enable_mem_plan_opt = true

whether to enable memory forwarding to optimize mem plans

bool enable_mem_reuse_alloc = true

whether to enable static memory reuse (i.e. using optimized static memory allocation algorithm)

bool enable_seq_comp_node_opt = true

whether to enable comp node optimization (e.g. using copy stream for I/O operators)

struct SublinearMemConfig

Control parameter for sublinear memory optimization.

Public Members

int thresh_nr_try = 10
int genetic_nr_iter = 0
int genetic_pool_size = 20
int lb_memory = 0
int num_worker = sys::get_cpu_count() / 2
struct VarReceiverInfo

Public Functions

bool empty() const

whether this var is not needed at all

bool value_needed() const

whether the computed value is needed (i.e. any of dev_value, shape or host_value is requested)

bool is_empty_allowed() const

whether this var can be empty

std::string to_string() const

Public Members

size_t nr_direct_comp_req = 0

number of requests for directly computing by passing an empty callback

size_t dev_value = 0

number of operators that need device value of this var

OperatorNodeBase *last_dev_value_reader = nullptr

last dev value reader in the computing sequence

size_t shape = 0

number of operators that need shape of this var, which can not be statically inferred

size_t host_value = 0

number of operators that need host value of this var, which can not be statically inferred

size_t allow_empty_value = 0

number of operators in dev_value and host_value that allow this var to be empty