Use Data to build the input pipeline#
The data sub-package in MegEngine provides primitives for processing data (datasets). Among them, megengine.data.DataLoader is used to load batches of data: it is essentially an iterable object that returns batch_size samples at a time from the dataset described by a Dataset. In short, a Dataset tells the DataLoader how to load a single sample into memory, while the DataLoader is responsible for fetching data in batches according to the given configuration, ready for subsequent training and testing.
>>> from megengine.data import DataLoader
>>> dataset = CustomDataset()
>>> dataloader = DataLoader(dataset)
>>> for batch_data in dataloader:
...     pass
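Here CustomDataset stands in for any user-defined dataset. For the snippet above to actually run, the class must implement __getitem__ and __len__; a minimal sketch (the random array below is purely illustrative) could be:

import numpy as np

from megengine.data.dataset import Dataset

class CustomDataset(Dataset):
    def __init__(self):
        # 100 illustrative samples of 32 float32 features each
        self.data = np.random.random((100, 32)).astype("float32")

    def __getitem__(self, idx):
        # return the single sample at the given index
        return self.data[idx]

    def __len__(self):
        return len(self.data)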
Some details are hidden in the above introduction. In fact, DataLoader orchestrates all the logic related to data loading, including but not limited to: getting the indices of the next batch, loading data into memory, collating the batch, multi-process loading, and so on. The functionality corresponding to these steps is implemented by other components of the data module (such as Sampler, Transform, etc.).
If you want to construct the input pipeline correctly and efficiently, we recommend reading this chapter in full.
See also
The design and functionality of the main components are similar to those of the torch.utils.data module provided by PyTorch (https://pytorch.org/docs/stable/data.html).
Use DataLoader to load data#
The signature of the DataLoader class is as follows:

DataLoader(dataset, sampler=None, transform=None, collator=None,
           num_workers=0, timeout=0, timeout_event=raise_timeout_error,
           divide=False)
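As a quick illustration of how the components described below plug into these parameters, a sketch of an explicitly configured loader might look like this (the dataset object and the batch size of 4 are placeholders):

from megengine.data import DataLoader, SequentialSampler
from megengine.data.transform import ToMode

dataloader = DataLoader(
    dataset,                                           # Dataset: how to load one sample
    sampler=SequentialSampler(dataset, batch_size=4),  # which indices form the next batch
    transform=ToMode("CHW"),                           # per-sample processing
)                                                      # collator is left at its default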
These options can be flexibly configured when the DataLoader is initialized; we will focus on each of them in the sections below.
Note
Taking model training as an example, the input data pipeline in MegEngine is:
Create a Dataset object;
Create Sampler, Transform, and Collator objects as needed;
Create a DataLoader object;
Iterate over the DataLoader object and load the data into the model in batches for training.
Each time we ask the DataLoader for a batch of data, the DataLoader obtains the indices of the next batch from the Sampler, loads the corresponding samples into memory one by one via the __getitem__ method provided by the Dataset, optionally applies the specified Transform to the loaded data, and finally uses the Collator to organize the individual samples into a batch.
The above describes the single-process case. The DataLoader also supports multi-process loading to speed up data loading and processing: the num_workers parameter controls how many worker subprocesses are spawned. Generally, more workers means faster loading and processing; however, if the number of workers greatly exceeds the number of CPUs in the system, the subprocesses may compete for CPU resources, which reduces efficiency instead. As a rule of thumb, we recommend setting the number of workers according to the number of CPUs in the system. For example, on a machine with 64 CPUs and 8 GPUs, each GPU is expected to correspond to 8 CPUs, so setting the number of workers to around 8 is a good choice.
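Continuing the example above, a minimal sketch of a multi-process loader on such a machine might be (the dataset and sampler objects are assumed to have been created already):

# spawn 8 worker subprocesses to load and process samples in parallel,
# matching the 8-CPUs-per-GPU budget discussed above
dataloader = DataLoader(dataset, sampler=sampler, num_workers=8)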
Similarly, model validation and testing can each use their own DataLoader to handle data loading.
Warning
If you do not customize the components above, be aware of the default processing logic of the DataLoader.
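As a sketch of what those defaults imply (we assume here that, with no sampler given, the DataLoader samples sequentially with batch_size=1; consult the API reference of your version to confirm):

# nothing customized: each iteration is assumed to yield a single
# sample, visited in sequential order
dataloader = DataLoader(dataset)
for batch in dataloader:
    pass  # batch holds one sample under the assumed defaults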
Example: Load image classification data#
Below we take the basic process of loading image classification data as a simple example.
It is assumed that the image data is placed in the same directory according to certain rules (the dataset homepage usually describes the directory organization and file naming rules). To create the corresponding data loader, you first need a class that inherits from Dataset. Although MegEngine provides the ArrayDataset implementation for NumPy ndarray data, the more standard practice is to create a custom dataset:

import os

import cv2
import numpy as np

from megengine.data.dataset import Dataset

class CustomImageDataset(Dataset):
    def __init__(self, image_folder):
        # get all mapping indices
        self.image_folder = image_folder
        self.image_list = os.listdir(image_folder)

    # get the sample
    def __getitem__(self, idx):
        # get the index
        image_file = self.image_list[idx]
        # get the data
        # in this case we load image data and convert it to ndarray
        image = cv2.imread(os.path.join(self.image_folder, image_file), cv2.IMREAD_COLOR)
        image = np.array(image)
        # get the label
        # in this case the label was noted in the name of the image file
        # ie: 1_image_28457.png where 1 is the label
        # and the number at the end is just an id
        target = int(image_file.split("_")[0])
        return image, target

    def __len__(self):
        return len(self.image_list)
To get a sample image, you can create a dataset object and pass the sample index to the __getitem__ method; the image array and the corresponding label will then be returned:

dataset = CustomImageDataset("/path/to/image/folder")
image, target = dataset.__getitem__(0)  # equivalent to dataset[0]
We now have a CustomImageDataset class that can return a single sample and its label. However, a Dataset by itself cannot provide automatic batching, shuffling, parallel loading, and so on. For that we must next create a DataLoader, which "wraps" this class and, via its other configuration parameters, returns whole batches of samples from the dataset according to our requirements.
from megengine.data.transform import ToMode
from megengine.data import DataLoader, RandomSampler

dataset = CustomImageDataset("/path/to/image/folder")

# you can implement the function to randomly split your dataset
train_set, val_set, test_set = random_split(dataset)

# B is your batch size, e.g. 128
train_dataloader = DataLoader(
    train_set,
    sampler=RandomSampler(train_set, batch_size=B),
    transform=ToMode("CHW"),
)
Note that in the code above we also used a Sampler to determine the order of data loading (sampling) and a Transform to process the loaded data. These are not all of the configurable items; we will introduce them in more detail in subsequent sections.
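Following the same pattern, the validation set split off above could get its own loader. A sketch, where sequential sampling replaces random sampling since shuffling is unnecessary for evaluation:

from megengine.data import SequentialSampler

val_dataloader = DataLoader(
    val_set,
    sampler=SequentialSampler(val_set, batch_size=B),  # no shuffling needed for evaluation
    transform=ToMode("CHW"),
)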
Now that the data loader is created, we are ready to train! For example:

for epoch in range(epochs):
    for images, targets in train_dataloader:
        # now 'images' is a batch containing B samples
        # and 'targets' is a batch containing the B targets
        # (of the images in 'images' with the same index)
        # remember to convert the data to tensors
        images = megengine.Tensor(images)
        targets = megengine.Tensor(targets)

        # train function
        # ...
After the batched data has been successfully obtained, how to train and test the model is not covered here.
See also
A complete model training and testing tutorial based on the MNIST and CIFAR10 datasets is provided in the MegEngine getting-started section;
More reference code can be found in the official model library MegEngine/Models.