Tensor memory layout

Warning

  • This part of the content covers low-level details; in most scenarios users do not need to understand the design behind them. If you want to become a core developer of MegEngine, it will be helpful to understand these underlying details. For more information, please refer to the developer guide;

  • The relevant code is implemented in dnn/include/megdnn/basic_types.h (megdnn::TensorLayout).

See also

NumPy’s explanation of ndarray memory layout: Internal memory layout of an ndarray

How Tensor values are stored in memory

An instance of the Tensor class consists of a one-dimensional contiguous segment of computer memory.

Combined with the Tensor indexing mechanism, each index can be mapped to the value at the corresponding element position in the memory block; the range of valid indices is specified by the Tensor's shape attribute. The number of bytes occupied by each element, and how those bytes are interpreted, are specified by the Tensor's data type (dtype) attribute.

A segment of memory is inherently one-dimensional and contiguous, and there are many different schemes for arranging the items of an N-dimensional Tensor within it. Depending on the order of elements, these schemes can be divided into two styles: row-major order and column-major order. Take the simplest two-dimensional case as an example:

../../../_images/Row_and_column_major_order.svg

The figure above shows the same elements indexed in row-major order and column-major order respectively:

  • Here \(a_{11} \ldots a_{33}\) denote the values of the nine elements;

  • There is an obvious relationship between offset and index.

In this way, the elements of the two-dimensional Tensor are each mapped to a position in a one-dimensionally contiguous memory block:

Offset   Access    Value
0        a[0][0]   a11
1        a[0][1]   a12
2        a[0][2]   a13
3        a[1][0]   a21
4        a[1][1]   a22
5        a[1][2]   a23
6        a[2][0]   a31
7        a[2][1]   a32
8        a[2][2]   a33
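NumPy (whose layout model MegEngine follows here) makes the two orders easy to compare: ravel(order="C") walks the array in row-major order, while ravel(order="F") walks it in column-major order. A small illustration, with 1 … 9 standing in for \(a_{11} \ldots a_{33}\):

>>> import numpy as np
>>> a = np.arange(1, 10).reshape(3, 3)  # 1..9 stand for a11..a33
>>> a.ravel(order="C")  # row-major: row after row
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a.ravel(order="F")  # column-major: column after column
array([1, 4, 7, 2, 5, 8, 3, 6, 9])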

MegEngine is as flexible as NumPy and supports arbitrary strided indexing schemes. This brings us to the concept of strides.

Tensor stride

See also

NumPy’s ndarray has a strides attribute (the same concept exists in MegEngine, but no interface is provided).

Note

The strides of a Tensor form a tuple that tells us how many bytes to step in each dimension when traversing the Tensor's elements. It can also be understood as the extent of memory represented by one unit of index along a given axis, i.e. how many bytes must be skipped in memory to move to the next position along that axis. Users usually do not need to modify this attribute.

Take the 2-dimensional case as an example

Imagine such a Tensor composed of 32-bit (4-byte) integer elements:

>>> x = megengine.tensor([[0, 1, 2, 3, 4],
...                       [5, 6, 7, 8, 9]], dtype="int32")

The elements of the Tensor are stored in memory one after another (known as a contiguous memory block), occupying 40 bytes in total. We must skip 4 bytes to move to the next column, but 20 bytes to reach the same position in the next row. The strides of x are therefore (20, 4).
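MegEngine does not expose this attribute, but we can inspect the same layout through an equivalent NumPy array:

>>> import numpy as np
>>> a = np.array([[0, 1, 2, 3, 4],
...               [5, 6, 7, 8, 9]], dtype="int32")
>>> a.strides  # (bytes to the next row, bytes to the next column)
(20, 4)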

We use \(s^{\text{row}}\) to denote the strides obtained in row-major order; then \(s_0^{\text{row}} = 4 \times 5 = 20\) and \(s_1^{\text{row}} = 4\).

Calculated with the help of \(s^{\text{row}}\), the byte offset of the element at position [1][2] (whose value is 7) is \(1 \times 20 + 2 \times 4 = 28\).

Generalize

In the more general case, for an N-dimensional Tensor, the row-major strides \(s^{\text{row}}\) are computed from its shape as follows:

\[s_{k}^{\text {row }}=\text { itemsize } \prod_{j=k+1}^{N-1} d_{j}\]

where \(\text{itemsize}\) depends on the dtype, and \(d_{j} = \text{self.shape}[j]\).

The byte offset of the element with index \(T[n_0, n_1, \ldots, n_{N-1}]\) is then:

\[n_{\text {offset }}=\sum_{k=0}^{N-1} s_{k} n_{k}\]
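Both formulas can be checked with a minimal Python sketch (row_major_strides is an illustrative helper, not a MegEngine API):

>>> def row_major_strides(shape, itemsize):
...     strides, acc = [], itemsize
...     for d in reversed(shape):  # s_k = itemsize * prod(shape[k+1:])
...         strides.append(acc)
...         acc *= d
...     return tuple(reversed(strides))
>>> row_major_strides((2, 5), itemsize=4)
(20, 4)
>>> sum(s * n for s, n in zip((20, 4), (1, 2)))  # offset of element [1][2]
28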

Use of stride concept

See also

For some Tensor operations that change the shape, we can modify the strides instead of actually copying memory.
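Transposition is a classic example: it only swaps the strides and leaves memory untouched, as NumPy (whose stride model MegEngine shares) demonstrates:

>>> import numpy as np
>>> a = np.arange(10, dtype="int32").reshape(2, 5)
>>> a.strides
(20, 4)
>>> a.T.strides  # transpose swaps the strides; memory is untouched
(4, 20)
>>> np.shares_memory(a, a.T)  # no copy was made
True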

Format introduction

In deep learning frameworks, as shown in the figure below, a neural network feature map is generally a 4-dimensional array. For a computer, however, data storage can only be linear, so the way data is laid out in memory significantly affects computing performance. Targeting the characteristics of GPUs, MegEngine uses formats such as NCHW, NHWC, NCHW4, NCHW32, NCHW64 and CHWN4.

To better explain the specific meaning of the different formats, the following figure shows the logical structure of a Tensor with 1152 elements, where N, H, W and C are:

  • N: Batch. Indicates the batch of pictures, here 2;

  • H: Height. Indicates the height of the picture, here 3;

  • W: Width. Indicates the width of the picture, here 3;

  • C: Channel. Indicates the number of channels of the picture, here 64.

../../../_images/format_logical_construction.svg

NCHW and NHWC

  1. Arrangement method

For a computer, data storage can only be linear; among the formats, NCHW and NHWC are the most commonly used. The figure below shows the physical storage structures of NCHW and NHWC:

../../../_images/format_NCHW_NHWC.svg

For NCHW, the W dimension is stored first, followed by H, C and N in turn, so the elements are stored in order from 0000 to 1151;

For NHWC, the C dimension is stored first, so 0000, 0009 and so on up to 1143 are stored first, and storage then continues along W, H and N, storing 0001, 0010, etc.
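The element numbering above follows from the linear offset of each layout, assuming elements are numbered in NCHW logical order as in the figure. A sketch with illustrative helpers (not a MegEngine API) makes the difference concrete:

>>> def nchw_offset(n, c, h, w, C, H, W):
...     return ((n * C + c) * H + h) * W + w
>>> def nhwc_offset(n, c, h, w, C, H, W):
...     return ((n * H + h) * W + w) * C + c
>>> nchw_offset(0, 1, 0, 0, C=64, H=3, W=3)  # element 0009 in the figure...
9
>>> nhwc_offset(0, 1, 0, 0, C=64, H=3, W=3)  # ...is stored right after 0000 in NHWC
1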

  2. Characteristics

  • For “NCHW”, the pixel values of the same channel are arranged contiguously, which is more suitable for operations that compute each channel separately, such as “MaxPooling”.

  • For “NHWC”, the elements at the same position in different channels are stored adjacently, so it is more suitable for operations that combine the same pixel across channels, such as “Conv”.

NCHWX

[Batch, Channels/X, Height, Width, X], where X = 4, 32 or 64

  1. Arrangement method

As a typical convolutional neural network deepens, the length and width of its feature maps gradually decrease after downsampling, while the number of channels grows with the number of convolution filters; feature maps with 128 or 256 channels are common. These deep feature maps, and convolution layers with a large number of filters, require a lot of computation. In order to make full use of the limited matrix computation units, the channel dimension needs to be split. MegEngine splits the Channel dimension into Channel/4, Channel/32 and Channel/64 according to the characteristics of different data structures. The following figure shows the physical storage structure of NCHWX.

../../../_images/format_NCHWX.svg

NCHWX stores the Channel dimension first; the variants differ in how many channels are stored first, which is determined by X. NCHW4 first stores 4 channels (here 0000, 0009, 0018 and 0027), then continues along W, H, C and N, storing 0001, 0010 and so on. NCHW32 and NCHW64 are similar, except that 32 and 64 channels respectively are stored first before storage continues along W, H, C and N.
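Assuming the data starts in NCHW, the NCHW4 repacking can be sketched in NumPy with a reshape plus a transpose (an illustration of the layout, not MegEngine's conversion API):

>>> import numpy as np
>>> N, C, H, W = 2, 64, 3, 3
>>> x = np.arange(N * C * H * W).reshape(N, C, H, W)  # element numbers as in the figure
>>> nchw4 = x.reshape(N, C // 4, 4, H, W).transpose(0, 1, 3, 4, 2)
>>> nchw4.shape  # [N, C/4, H, W, 4]: groups of 4 channels are innermost
(2, 16, 3, 3, 4)
>>> nchw4[0, 0, 0, 0]  # the first four channels of pixel (0, 0)
array([ 0,  9, 18, 27])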

  2. Characteristics

  • Better adapted to SIMT: for the int8 data type, NCHW4 can use CUDA's dp4a instructions, while NCHW32 and NCHW64 target the int8 and int4 data types respectively and make better use of CUDA's TensorCore computation units;

  • It is more cache-friendly and reduces cache misses;

  • It is easy to pad, reducing boundary branch judgments and keeping the code logic simple.

CHWN4

CHWN4 was introduced in order to better adapt to CUDA's dp4a and TensorCore processing units.

  1. Arrangement method

../../../_images/format_CHWN4.svg

CHWN4 stores the Channel dimension first, 4 values at a time: after 0000, 0009, 0018 and 0027, it stores 0576 to 0603 directly along the N dimension, and then stores 0001, 0010 and so on along the W and H dimensions.
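The same reshape-and-transpose trick sketches CHWN4 (again an illustration under the same NCHW starting assumption, not MegEngine's conversion API); note how the N dimension now comes right after the spatial ones:

>>> import numpy as np
>>> N, C, H, W = 2, 64, 3, 3
>>> x = np.arange(N * C * H * W).reshape(N, C, H, W)
>>> chwn4 = x.reshape(N, C // 4, 4, H, W).transpose(1, 3, 4, 0, 2)
>>> chwn4.shape  # [C/4, H, W, N, 4]
(16, 3, 3, 2, 4)
>>> chwn4[0, 0, 0]  # 0000..0027 for batch 0, then 0576..0603 for batch 1
array([[  0,   9,  18,  27],
       [576, 585, 594, 603]])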

  2. Characteristics

  • Compared with NCHWX, it can make better use of the dp4a and TensorCore processing units, without requiring layout conversion;

  • In addition, it retains the advantages of being cache-friendly and easy to pad.