Class CompNode

Class Documentation

class mgb::CompNode

abstraction of a streaming computing resource on localhost (a thread on CPU, a cuda stream, etc.)

Note that most of the operations are asynchronous with respect to the caller thread

Public Types

enum DeviceType

computing device type


enumerator UNSPEC = 0

for “xpu” comp node that would be mapped to an available comp node on the current system

enumerator CUDA = 1
enumerator CPU = 2
enumerator CAMBRICON = 3
enumerator ROCM = 8
enumerator ATLAS = 9
enumerator MULTITHREAD
enumerator MAX_DEVICE_ID
enum Flag


enumerator SUPPORT_RECORDER = 1 << 0

Whether computing recorder is supported on this comp node (i.e. whether non-zero comp_node_seq_record_level is allowed)


Whether dynamic memory allocation is supported in seq recorder. If this flag is not set, ComputingSequence::do_execute() would skip the warm up and allow seq recorder to start immediately

enumerator QUEUE_LIMITED = 1 << 2

Whether the capacity of the asynchronous execution queue on this comp node is limited. If this flag is set, tasks on multiple comp nodes would be dispatched from multiple cpu threads.



enumerator HAS_COPY_STREAM = 1 << 3

Whether this comp node supports copy stream, so computation and I/O can be parallelized

enumerator EVENT_DTOR_UNSAFE = 1 << 4

Destructing an event is unsafe if the comp node is not synchronized; setting this flag would cause computing sequence to sync the comp node in its dtor.

enumerator SUPPORT_NO_THREAD = 1 << 5

CompNode is available even if there is no thread support, i.e. MGB_HAVE_THREAD=0. Usually this means that execution on the CompNode is synchronous, i.e. behaves like cpu:default

enumerator SUPPORT_UNIFIED_ADDRESS = 1 << 6

Whether this comp node supports unified address, i.e. CPU and CUDA support unified address.
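The Flag values are single bits, so a set of flags can be stored in one integer and tested with a bitwise AND. The following is a minimal standalone sketch, not the MegEngine header: the enum copies only the values listed above, and bits() and has_flag() are hypothetical helpers introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Standalone copy of the flag values listed above (illustration only).
enum class Flag : std::uint32_t {
    SUPPORT_RECORDER = 1u << 0,
    QUEUE_LIMITED = 1u << 2,
    HAS_COPY_STREAM = 1u << 3,
    EVENT_DTOR_UNSAFE = 1u << 4,
    SUPPORT_NO_THREAD = 1u << 5,
    SUPPORT_UNIFIED_ADDRESS = 1u << 6,
};

// Hypothetical helper: raw bit value of a flag.
constexpr std::uint32_t bits(Flag f) {
    return static_cast<std::uint32_t>(f);
}

// Hypothetical helper: true if `flag` is present in `mask`.
constexpr bool has_flag(std::uint32_t mask, Flag flag) {
    return (mask & bits(flag)) != 0;
}

// Usage: build a mask from two flags and query it.
inline void flag_example() {
    const std::uint32_t mask =
            bits(Flag::SUPPORT_RECORDER) | bits(Flag::HAS_COPY_STREAM);
    assert(has_flag(mask, Flag::HAS_COPY_STREAM));
    assert(!has_flag(mask, Flag::QUEUE_LIMITED));
}
```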

using UnorderedSet = ThinHashSet<CompNode>
template<typename T>
using UnorderedMap = ThinHashMap<CompNode, T>

Public Functions

CompNode() = default
void *alloc_device(size_t size) const

allocate memory on this computing node

Note: allocation of device memory is synchronous with the host, meaning that the memory can be used immediately; however deallocation is asynchronous to ensure that the memory can be used by already-launched kernels on the computing node.

Exception should be raised if allocation fails.

void free_device(void *ptr) const

deallocate device buffer; see alloc_device() for more details
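The asymmetry between synchronous allocation and asynchronous deallocation can be pictured with a toy allocator. This is a sketch under stated assumptions: host malloc stands in for device memory, and DeferredAllocator, on_stream_drained() and live_blocks() are hypothetical names that do not appear in MegEngine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Toy model: allocation completes immediately, release is deferred.
class DeferredAllocator {
public:
    // Synchronous: the returned pointer is usable right away; throws on failure.
    void* alloc_device(std::size_t size) {
        assert(size > 0);
        void* p = std::malloc(size);
        if (!p) throw std::bad_alloc();
        ++m_live;
        return p;
    }

    // Asynchronous: only queues the pointer; already-launched work may still use it.
    void free_device(void* ptr) { m_pending.push_back(ptr); }

    // Hypothetical hook, called once all previously launched work has finished.
    void on_stream_drained() {
        for (void* p : m_pending) {
            std::free(p);
            --m_live;
        }
        m_pending.clear();
    }

    std::size_t live_blocks() const { return m_live; }

    ~DeferredAllocator() { on_stream_drained(); }

private:
    std::vector<void*> m_pending;
    std::size_t m_live = 0;
};
```

In this model a freed block stays counted as live until the stream has drained, which is the property the note on alloc_device() describes.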

void *alloc_host(size_t size) const

allocate memory on host that is associated with the device, which may accelerate I/O

Both allocation and deallocation on host are synchronous.

void free_host(void *ptr) const
void copy_to_host(void *host_ptr, const void *device_ptr, size_t size) const

copy from underlying device to host

void copy_to_device(void *device_ptr, const void *host_ptr, size_t size) const

copy from host to underlying device

void peer_copy_to(CompNode dest_node, void *dest, const void *src, size_t size) const

copy from this device to another device; would use the computing resource on dest_node

  • src: source memory that must be allocated on this device

size_t get_mem_addr_alignment() const

get alignment requirement in bytes; guaranteed to be power of 2

size_t get_mem_padding() const

get the size of the padding that must be reserved at the end of a memory chunk; guaranteed to be power of 2
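Because the alignment is a power of two, rounding a size up to it needs only a mask. The sketch below shows that arithmetic. How MegEngine combines alignment and padding internally is not stated here, so padded_size() is an assumption, and all three function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// True if x is a non-zero power of two.
constexpr bool is_pow2(std::size_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

// Round `size` up to the next multiple of `alignment` (must be a power of two).
inline std::size_t align_up(std::size_t size, std::size_t alignment) {
    assert(is_pow2(alignment));
    return (size + alignment - 1) & ~(alignment - 1);
}

// Assumed layout: aligned payload followed by the reserved trailing padding.
inline std::size_t padded_size(
        std::size_t size, std::size_t alignment, std::size_t padding) {
    return align_up(size, alignment) + padding;
}
```

For example, a 257-byte request with a 256-byte alignment rounds up to 512 bytes.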

std::unique_ptr<Event> create_event(size_t flags = 0) const
void device_wait_event(Event &event) const

wait for an event created on another CompNode

void sync() const

block host thread to wait for all previous operations on this computing node to finish

MemNode mem_node() const

get id of underlying memory node; comp nodes that share the same mem node can access memory allocated by each other.

bool operator==(const CompNode &rhs) const
bool operator!=(const CompNode &rhs) const
bool valid() const
std::pair<size_t, size_t> get_mem_status_bytes() const

get total and free memory on the computing device in bytes
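A caller would typically turn the returned pair into a usage figure. The helper below is a hypothetical illustration and assumes the pair is ordered (total, free), following the wording of the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: fraction of memory in use, given {total, free} in bytes.
// The (total, free) ordering is an assumption taken from the description text.
inline double used_fraction(std::pair<std::size_t, std::size_t> status) {
    const std::size_t total = status.first;
    const std::size_t free_bytes = status.second;
    assert(free_bytes <= total);
    if (total == 0) {
        return 0.0;
    }
    return static_cast<double>(total - free_bytes) / static_cast<double>(total);
}
```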

CompNode change_stream(int dest_stream) const

change to another stream on the same memory node

std::string to_string() const

get string representation of physical device

std::string to_string_logical() const

get string representation of logical device

uint64_t get_uid()
Locator locator() const

get the physical locator that created this comp node

Locator locator_logical() const

get the logical locator that created this comp node

void activate() const

see CompNodeEnv::activate

DeviceType device_type() const

get device type of this comp node

MGB_WARN_UNUSED_RESULT std::unique_ptr<MegBrainError> check_async_error() const

check for error on the asynchronous computing stream

This is used for devices with limited error handling such as CUDA.

It returns a MegBrainError with error messages rather than throwing an exception directly; it returns nullptr if there is no error.
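The return-an-error-object convention can be sketched in isolation. AsyncError and ErrorSlot below are hypothetical stand-ins, not MegBrainError or any MegEngine class. Keeping only the first failure and clearing it on check() are assumptions of this sketch.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Stand-in error type for illustration.
struct AsyncError {
    std::string message;
};

// Records a failure seen on an asynchronous stream instead of throwing there.
class ErrorSlot {
public:
    // Keep only the first reported failure (an assumption of this sketch).
    void report(std::string msg) {
        assert(!msg.empty());
        if (!m_pending) {
            m_pending = std::make_unique<AsyncError>(AsyncError{std::move(msg)});
        }
    }

    // Hand the pending error to the caller, or nullptr if there is none.
    // Moving out of the unique_ptr leaves the slot empty.
    [[nodiscard]] std::unique_ptr<AsyncError> check() {
        return std::move(m_pending);
    }

private:
    std::unique_ptr<AsyncError> m_pending;
};
```

The caller decides what to do with a non-null result, for example logging it or converting it into an exception on the host thread.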

std::unique_ptr<CompNodeSeqRecorder> create_seq_recorder(cg::ComputingGraph *cg)

create a CompNodeSeqRecorder associated with this computing node

Note: the implementation must be thread safe: simultaneous calls to create_seq_recorder() must block until existing CompNodeSeqRecorder objects are either destructed or stopped.


Returns the recorder object; nullptr is returned if recording is not supported

void add_callback(megdnn::thin_function<void()> &&cb)

insert callback into current compute stream. The callback is called after all currently enqueued items in the stream have completed, and later tasks in the stream must wait for the callback to finish.
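The ordering guarantee of add_callback() matches that of a FIFO queue: the callback runs after everything enqueued before it and before everything enqueued after it. The toy stream below illustrates only that ordering. ToyStream and enqueue() are hypothetical, std::function replaces megdnn::thin_function, and the queue is drained on the calling thread rather than asynchronously.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Toy in-order stream: tasks and callbacks share one FIFO queue.
class ToyStream {
public:
    void enqueue(std::function<void()> task) {
        assert(static_cast<bool>(task));
        m_queue.push_back(std::move(task));
    }

    // A callback is just another queue entry, so it inherits FIFO ordering.
    void add_callback(std::function<void()>&& cb) {
        assert(static_cast<bool>(cb));
        m_queue.push_back(std::move(cb));
    }

    // Run everything that has been enqueued, in order, on the calling thread.
    void sync() {
        while (!m_queue.empty()) {
            std::function<void()> task = std::move(m_queue.front());
            m_queue.pop_front();
            task();
        }
    }

private:
    std::deque<std::function<void()>> m_queue;
};
```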

bool contain_flag(Flag flag)
CompNode(ImplBase *impl)

Public Static Functions

void finalize()

manually destroy all comp node resources

CompNode load(const std::string &id)

load a computing node from logical locator ID



CompNode load(const Locator &locator)

create a CompNode object from logical locator

CompNode load(const Locator &locator_physical, const Locator &locator_logical)
void try_coalesce_all_free_memory()

release consecutive free chunks on all devices to defragment; see DevMemAlloc::try_coalesce_free

void set_prealloc_config(size_t alignment, size_t min_req, size_t max_overhead, double growth_factor, DeviceType device_type)
void sync_all()

synchronize all computing nodes

bool contain_flag(DeviceType device_type, Flag flag)
void foreach(thin_function<void(CompNode)> callback)

apply function to each initialized comp node

size_t get_device_count(DeviceType type, bool warn = true)

get total number of specific devices on this system

CompNode default_cpu()

get default CPU comp node

bool enable_affinity_for_cpu(bool flag)

set whether to enable affinity setting for CPU comp nodes

If enabled, computation on cpux would be bound to the x’th CPU.

This is disabled by default.

(implemented in comp_node/cpu/comp_node.cpp)


Returns the original setting.

Public Static Attributes

constexpr size_t NR_DEVICE_TYPE = static_cast<size_t>(DeviceType::MAX_DEVICE_ID)
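NR_DEVICE_TYPE is convenient for sizing per-device-type tables. The sketch below copies the DeviceType enumerators exactly as listed in this document. If the real header declares further enumerators, the implicit values would differ. PerTypeCounter is a hypothetical class used only to show the indexing.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Standalone copy of the enumerators listed in this document (illustration only).
enum class DeviceType {
    UNSPEC = 0,
    CUDA = 1,
    CPU = 2,
    CAMBRICON = 3,
    ROCM = 8,
    ATLAS = 9,
    MULTITHREAD,    // implicit value: previous enumerator + 1
    MAX_DEVICE_ID,  // implicit value: previous enumerator + 1
};

constexpr std::size_t NR_DEVICE_TYPE =
        static_cast<std::size_t>(DeviceType::MAX_DEVICE_ID);

// Hypothetical table with one counter slot per device type.
class PerTypeCounter {
public:
    void add(DeviceType type) { ++m_count[index(type)]; }
    std::size_t get(DeviceType type) const { return m_count[index(type)]; }

private:
    static std::size_t index(DeviceType type) {
        const std::size_t i = static_cast<std::size_t>(type);
        assert(i < NR_DEVICE_TYPE);
        return i;
    }
    std::array<std::size_t, NR_DEVICE_TYPE> m_count{};
};
```

Because the enumerator values are sparse, some slots of such a table are never used.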

Protected Attributes

ImplBase *m_impl = nullptr

implementations are allocated statically, so no memory management is needed


friend class CompNodeEnv
friend struct HashTrait<CompNode>
friend class CompNodeImplHelper
class Event : public NonCopyableObj

event associated with a CompNode, used for cross-device synchronization

Public Types

enum Flags


enumerator NEED_TIMER = 1

Public Functions

~Event() = default
void record() = 0

record this event on the comp node that creates it

Note that if an event is recorded multiple times, then subsequent calls would overwrite its internal state and other methods that examine the status would only examine the completion of the most recent call to record().
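The "most recent record() wins" rule can be modelled with two counters. ToyEvent is a hypothetical single-threaded model, not the MegEngine Event class. In particular, record() returning a ticket and the complete() method are inventions of this sketch that stand in for the stream reaching the recorded point.

```cpp
#include <cassert>
#include <cstdint>

// Toy event: every record() starts a new generation, and finished()
// looks only at the latest generation.
class ToyEvent {
public:
    // Returns a ticket identifying this particular record() call.
    std::uint64_t record() { return ++m_recorded; }

    // Hypothetical: called when the stream has passed the point of `ticket`.
    void complete(std::uint64_t ticket) {
        if (ticket > m_completed) {
            m_completed = ticket;
        }
    }

    // Only the most recent record() is examined; the event must have been recorded.
    bool finished() const {
        assert(m_recorded != 0);
        return m_completed >= m_recorded;
    }

private:
    std::uint64_t m_recorded = 0;
    std::uint64_t m_completed = 0;
};
```

After a second record(), completion of the first one no longer makes finished() return true.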

bool finished() = 0

whether this event has finished; it must have been recorded

void host_wait() = 0

block the host thread (caller thread) to wait for this event

double elapsed_time_until(Event &end) = 0

get elapsed time in seconds from this to another event; the events must be finished

void device_wait_by(CompNode cn) = 0

record an action on another comp node so it would wait for this event

CompNode comp_node() const = 0

get the comp node to which this event is associated

size_t create_flags() const

flags when this event is created

Public Static Functions

void set_cpu_sync_level(int level)

set CPU resource usage level when performing synchronization

  • level: CPU waiting level:

    0. condition var (the default)

    1. busy wait with yield

    2. busy wait
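The three levels trade CPU usage for wake-up latency. The sketch below shows one plausible shape for each strategy using the standard library. Waitable is a hypothetical type, and the mapping of levels to code is an assumption based only on the level names above.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical waitable flag with three host-side waiting strategies.
struct Waitable {
    std::mutex mtx;
    std::condition_variable cv;
    std::atomic<bool> done{false};

    // Mark completion and wake any thread sleeping on the condition variable.
    void signal() {
        {
            std::lock_guard<std::mutex> lk(mtx);
            done.store(true);
        }
        cv.notify_all();
    }

    void wait(int level) {
        assert(level >= 0 && level <= 2);
        if (level == 0) {
            // Level 0: sleep on a condition variable (lowest CPU usage).
            std::unique_lock<std::mutex> lk(mtx);
            cv.wait(lk, [this] { return done.load(); });
        } else if (level == 1) {
            // Level 1: spin, but yield the time slice between checks.
            while (!done.load()) {
                std::this_thread::yield();
            }
        } else {
            // Level 2: pure spin (lowest latency, occupies a full core).
            while (!done.load()) {
            }
        }
    }
};
```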

Protected Functions

Event(size_t create_flags)

Protected Attributes

size_t const m_create_flags

flags when this event is created

Protected Static Attributes

int sm_cpu_sync_level
class EventPool

pool of events that can be reused

Public Functions

EventPool(CompNode cn, size_t flags = 0)
CompNode::Event *alloc()
void free(CompNode::Event *ev)
void assert_all_freed()

assert that all allocated events have been freed
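The alloc/free/assert_all_freed interface follows the usual object-pool pattern: freed objects go onto a free list and are handed out again before new ones are created. ReusePool below is a generic, hypothetical sketch of that pattern. It is not thread safe and makes no claim about how EventPool is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Generic reuse pool: owns every object it creates and recycles freed ones.
template <typename T>
class ReusePool {
public:
    // Reuse a freed object if one exists, otherwise create a new one.
    T* alloc() {
        if (m_free.empty()) {
            m_storage.push_back(std::make_unique<T>());
            m_free.push_back(m_storage.back().get());
        }
        T* obj = m_free.back();
        m_free.pop_back();
        return obj;
    }

    // Return an object to the pool; it stays owned by the pool.
    void free(T* obj) {
        assert(obj != nullptr);
        m_free.push_back(obj);
    }

    // True when every object handed out has been returned.
    bool all_freed() const { return m_free.size() == m_storage.size(); }

    std::size_t nr_created() const { return m_storage.size(); }

private:
    std::vector<std::unique_ptr<T>> m_storage;
    std::vector<T*> m_free;
};
```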

class ImplBase : public NonCopyableObj, public DynTypeObj

Public Types

typedef void (*free_func_t)(ImplBase *self, void *ptr)

Public Functions

void *alloc_device(size_t size) = 0
void *alloc_host(size_t size) = 0
void copy_to_host(void *host_ptr, const void *device_ptr, size_t size) = 0
void copy_to_device(void *device_ptr, const void *host_ptr, size_t size) = 0
void peer_copy_to(Impl *dest_impl, void *dest, const void *src, size_t size) = 0
size_t get_mem_addr_alignment() = 0
size_t get_mem_padding()
std::unique_ptr<Event> create_event(size_t flags) = 0
void sync() = 0
MemNode mem_node() = 0
std::pair<size_t, size_t> get_mem_status_bytes() = 0
Locator locator() = 0
Locator locator_logical() = 0
std::unique_ptr<CompNodeSeqRecorder> create_seq_recorder(cg::ComputingGraph *cg)
void add_callback(megdnn::thin_function<void()>&&)
uint64_t get_uid()

Public Members

const free_func_t free_device

memory free might be called after finalize(); so we should not rely on virtual function for this
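Storing the free routine as a plain function pointer means the call does not go through the object's vtable. The sketch below shows only that dispatch mechanism. Impl and count_free are hypothetical, and the typedef is written to mirror free_func_t.

```cpp
#include <cassert>

struct Impl;

// Same shape as free_func_t above: a plain function taking the impl and the pointer.
using free_func_t = void (*)(Impl* self, void* ptr);

struct Impl {
    const free_func_t free_device;  // fixed at construction, no virtual dispatch
    int nr_freed = 0;

    explicit Impl(free_func_t fd) : free_device(fd) {}
};

// Hypothetical free routine that just counts calls.
inline void count_free(Impl* self, void* /*ptr*/) {
    assert(self != nullptr);
    ++self->nr_freed;
}
```

A caller invokes it as impl.free_device(&impl, ptr), passing the implementation explicitly as the first argument.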

const free_func_t free_host

Protected Functions

ImplBase(free_func_t fd, free_func_t fh)
~ImplBase() = default
struct Locator

an identifier to specify a computing node

Note: logical locator is directly parsed from a string identifier given by user; it should be translated to physical locator by calling to_physical() before actual use.

Unless explicitly specified otherwise, all locators are physical locators.

Public Functions

Locator to_physical() const

get corresponding physical Locator

DeviceType::UNSPEC would be resolved, and device map would be applied on device number

std::string to_string() const

get string description of this locator that can be parsed again

bool operator==(const Locator &rhs) const

Public Members

DeviceType type = DeviceType::UNSPEC
int device = -1

corresponding to a physical computing device; memories between different devices are not shared.

device == -1 means logical default device (maps to 0 by default, and can be changed by set_device_map)

int stream = 0
int nr_threads
union mgb::CompNode::Locator::[anonymous] [anonymous]

multiple streams can execute on one computing device and share memory; when the comp node type is multithread, this field also stands for nr_threads

Public Static Functions

Locator parse(const std::string &id)

parse a string identifier

currently supported ID format: (gpu|cpu)<n>[:m] where n is the device number, possibly with m as the stream id.
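The documented format can be matched with a small regular expression. The sketch below handles only the (gpu|cpu)<n>[:m] form quoted above and is not the parser used by Locator::parse(). ParsedId and parse_id() are hypothetical names, and the six-digit limit is an arbitrary choice to keep std::stoi in range.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical result type for the documented "(gpu|cpu)<n>[:m]" format.
struct ParsedId {
    std::string type;  // "gpu" or "cpu"
    int device = -1;   // n
    int stream = 0;    // m, or 0 when the ":m" part is omitted
};

// Returns true and fills `out` if `id` matches the documented format.
inline bool parse_id(const std::string& id, ParsedId& out) {
    static const std::regex pattern(R"(^(gpu|cpu)(\d{1,6})(?::(\d{1,6}))?$)");
    std::smatch m;
    if (!std::regex_match(id, m, pattern)) {
        return false;
    }
    out.type = m[1].str();
    out.device = std::stoi(m[2].str());
    out.stream = m[3].matched ? std::stoi(m[3].str()) : 0;
    return true;
}

// Usage: "cpu2:1" means device 2, stream 1.
inline void parse_example() {
    ParsedId p;
    const bool ok = parse_id("cpu2:1", p);
    assert(ok && p.device == 2 && p.stream == 1);
    (void)ok;
}
```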

void set_device_map(DeviceType type, int from, int to)

set mapping between device numbers of a device type

void set_unspec_device_type(DeviceType type)

set the actual device type to be used for DeviceType::UNSPEC

Public Static Attributes

constexpr int DEVICE_CPU_DEFAULT = -1024

special device number for the “cpu default” comp node, which dispatches all tasks in the caller thread

constexpr int DEVICE_MULTITHREAD_DEFAULT = -1025

special device number for the “multithread_default” comp node, which dispatches all tasks to a thread pool, with the caller thread acting as the main thread of the pool

struct Stream

predefined special streams

Public Static Attributes

constexpr int COPY = -1
constexpr int REMOTE_SEND = -2
constexpr int LOOP_SWAP = -3