Use Data to build the input pipeline#
The data sub-package in MegEngine provides primitives for processing data (datasets). Among them, megengine.data.DataLoader is used to load batches of data: it is essentially an iterable object that returns batch_size samples at a time from the dataset described by a Dataset. In short, a Dataset tells the DataLoader how to load a single sample into memory, while the DataLoader is responsible for fetching data in batches according to the given configuration, ready for subsequent training and testing.
>>> from megengine.data import DataLoader
>>> dataset = CustomDataset()
>>> dataloader = DataLoader(dataset)
>>> for batch_data in dataloader:
...     pass
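Here CustomDataset stands in for any user-defined dataset. For the snippet above to actually run, the class must implement __getitem__ and __len__; a minimal sketch (the random array below is purely illustrative) could be:

import numpy as np

from megengine.data.dataset import Dataset

class CustomDataset(Dataset):
    def __init__(self):
        # 100 illustrative samples of 32 float32 features each
        self.data = np.random.random((100, 32)).astype("float32")

    def __getitem__(self, idx):
        # return the single sample at the given index
        return self.data[idx]

    def __len__(self):
        return len(self.data)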
Some details are hidden in the above introduction. In fact, DataLoader orchestrates all the logic related to data loading, including but not limited to: getting the indices of the next batch, loading data into memory, collating the batch, multi-process loading, and so on. The functionality corresponding to these steps is implemented by other components of the data module (such as Sampler, Transform, etc.).
If you want to construct the input pipeline correctly and efficiently, we recommend reading this chapter in full.
See also
The design and functionality of the main components are similar to those of the torch.utils.data module provided by PyTorch (https://pytorch.org/docs/stable/data.html).
Use DataLoader to load data#
The signature of the DataLoader class is as follows:

DataLoader(dataset, sampler=None, transform=None, collator=None,
           num_workers=0, timeout=0, timeout_event=raise_timeout_error,
           divide=False)
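As a quick illustration of how the components described below plug into these parameters, a sketch of an explicitly configured loader might look like this (the dataset object and the batch size of 4 are placeholders):

from megengine.data import DataLoader, SequentialSampler
from megengine.data.transform import ToMode

dataloader = DataLoader(
    dataset,                                           # Dataset: how to load one sample
    sampler=SequentialSampler(dataset, batch_size=4),  # which indices form the next batch
    transform=ToMode("CHW"),                           # per-sample processing
)                                                      # collator is left at its default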
These options can be flexibly configured when the DataLoader is initialized; we will focus on each of them in the sections below.
Note
Taking model training as an example, the input data pipeline in MegEngine is:
Create a Dataset object;
Create Sampler, Transform, and Collator objects as needed;
Create a DataLoader object;
Iterate over the DataLoader object and load the data into the model in batches for training.
Each time we ask the DataLoader for a batch of data, the DataLoader obtains the indices of the next batch from the Sampler, loads the corresponding samples into memory one by one via the __getitem__ method provided by the Dataset, optionally applies the specified Transform to the loaded data, and finally uses the Collator to organize the individual samples into a batch.
The above describes the single-process case. The DataLoader also supports multi-process loading to speed up data loading and processing: the num_workers parameter controls how many worker subprocesses are spawned. Generally, more workers means faster loading and processing; however, if the number of workers greatly exceeds the number of CPUs in the system, the subprocesses may compete for CPU resources, which reduces efficiency instead. As a rule of thumb, we recommend setting the number of workers according to the number of CPUs in the system. For example, on a machine with 64 CPUs and 8 GPUs, each GPU is expected to correspond to 8 CPUs, so setting the number of workers to around 8 is a good choice.
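Continuing the example above, a minimal sketch of a multi-process loader on such a machine might be (the dataset and sampler objects are assumed to have been created already):

# spawn 8 worker subprocesses to load and process samples in parallel,
# matching the 8-CPUs-per-GPU budget discussed above
dataloader = DataLoader(dataset, sampler=sampler, num_workers=8)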
Similarly, model validation and testing can each use their own DataLoader to handle data loading.
Warning
If you do not customize the components above, be aware of the default processing logic of the DataLoader.
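As a sketch of what those defaults imply (we assume here that, with no sampler given, the DataLoader samples sequentially with batch_size=1; consult the API reference of your version to confirm):

# nothing customized: each iteration is assumed to yield a single
# sample, visited in sequential order
dataloader = DataLoader(dataset)
for batch in dataloader:
    pass  # batch holds one sample under the assumed defaults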
Example: Load image classification data#
Below we take the basic process of loading image classification data as a simple example.
It is assumed that the image data is placed in the same directory according to certain rules (the dataset homepage usually describes the directory organization and file naming rules). To create the corresponding data loader, you first need a class that inherits from Dataset. Although MegEngine provides the ArrayDataset implementation for NumPy ndarray data, the more standard practice is to create a custom dataset:

import os

import cv2
import numpy as np

from megengine.data.dataset import Dataset

class CustomImageDataset(Dataset):
    def __init__(self, image_folder):
        # get all mapping indices
        self.image_folder = image_folder
        self.image_list = os.listdir(image_folder)

    # get the sample
    def __getitem__(self, idx):
        # get the index
        image_file = self.image_list[idx]
        # get the data
        # in this case we load image data and convert it to ndarray
        image = cv2.imread(os.path.join(self.image_folder, image_file), cv2.IMREAD_COLOR)
        image = np.array(image)
        # get the label
        # in this case the label was noted in the name of the image file
        # ie: 1_image_28457.png where 1 is the label
        # and the number at the end is just an id
        target = int(image_file.split("_")[0])
        return image, target

    def __len__(self):
        return len(self.image_list)
To get a sample image, you can create a dataset object and pass the sample index to the __getitem__ method; the image array and the corresponding label will then be returned:

dataset = CustomImageDataset("/path/to/image/folder")
image, target = dataset.__getitem__(0)  # equivalent to dataset[0]
We now have a CustomImageDataset class that can return a single sample and its label. However, a Dataset by itself cannot provide automatic batching, shuffling, parallel loading, and so on. For that we must next create a DataLoader, which "wraps" this class and, via its other configuration parameters, returns whole batches of samples from the dataset according to our requirements.
from megengine.data.transform import ToMode
from megengine.data import DataLoader, RandomSampler

dataset = CustomImageDataset("/path/to/image/folder")

# you can implement the function to randomly split your dataset
train_set, val_set, test_set = random_split(dataset)

# B is your batch size, e.g. 128
train_dataloader = DataLoader(
    train_set,
    sampler=RandomSampler(train_set, batch_size=B),
    transform=ToMode("CHW"),
)
Note that in the code above we also used a Sampler to determine the order of data loading (sampling) and a Transform to process the loaded data. These are not all of the configurable items; we will introduce them in more detail in subsequent sections.
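Following the same pattern, the validation set split off above could get its own loader. A sketch, where sequential sampling replaces random sampling since shuffling is unnecessary for evaluation:

from megengine.data import SequentialSampler

val_dataloader = DataLoader(
    val_set,
    sampler=SequentialSampler(val_set, batch_size=B),  # no shuffling needed for evaluation
    transform=ToMode("CHW"),
)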
Now that the data loader is created, we are ready to train! For example:

for epoch in range(epochs):
    for images, targets in train_dataloader:
        # now 'images' is a batch containing B samples
        # and 'targets' is a batch containing the B targets
        # (of the images in 'images' with the same index)
        # remember to convert the data to tensors
        images = megengine.Tensor(images)
        targets = megengine.Tensor(targets)

        # train function
        # ...
After the batched data has been successfully obtained, how to train and test the model is not covered here.
See also
A complete model training and testing tutorial based on the MNIST and CIFAR10 datasets is provided in the MegEngine getting-started section;
More reference code can be found in the official model library MegEngine/Models.