Use Dataset to define a dataset

There are all kinds of datasets in the world, and they are distributed in different places and in different formats (such as png, HDF5, npy, etc.). Some datasets may be stored and organized in a way that has become a de facto standard, but not all data starts out in a format supported by MegEngine. In many cases we need to write scripts, with the help of other libraries or frameworks, to process the raw data and convert it into dataset objects that can be used in MegEngine.

In How to create a Tensor, we mentioned that ndarray is the most widely supported format in the Python data science community, so the dataset-related operations in MegEngine all take NumPy ndarray as the processing object (here we assume the user has already converted the raw data into ndarray format by some means, so that it can be wrapped into a dataset). The data format is not converted automatically anywhere in this process, so keep in mind: if you want to perform Tensor computation later, you need to convert the NumPy ndarray to a MegEngine Tensor.
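The conversion itself is a single call. Below is a minimal sketch (assuming only NumPy and MegEngine are installed; the array contents are arbitrary):

import numpy as np
import megengine

# An ndarray produced by whatever preprocessing pipeline was used
array = np.arange(6, dtype="float32").reshape(2, 3)

# Wrap it into a MegEngine Tensor once Tensor computation is needed
tensor = megengine.Tensor(array)
print(tensor.shape, tensor.dtype)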

See also

It would be helpful to search for questions like “How to load xxx type data with NumPy?” through a search engine.

Note

  • DataLoader must be given the dataset parameter when it is initialized; the dataset object passed in tells DataLoader how to load each sample (a minimal sketch is given after this list);

  • MegEngine already implements interfaces for some mainstream datasets (such as PascalVOC, ImageNet, etc.) that take care of downloading, splitting, and similar work for the user. But sometimes these implementations do not meet our needs, or we want to use a dataset we collected and annotated ourselves, so before using DataLoader we usually need to wrap the dataset to be used by hand.
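As a quick illustration of the first point, the sketch below builds a small ArrayDataset and hands it to DataLoader through the dataset parameter (the array contents and the sampler settings are placeholders; samplers are covered in their own document):

import numpy as np
from megengine.data import DataLoader, SequentialSampler
from megengine.data.dataset import ArrayDataset

# Any dataset object is passed to DataLoader via the dataset parameter
dataset = ArrayDataset(np.random.random((10, 3)), np.arange(10))
sampler = SequentialSampler(dataset, batch_size=4)
dataloader = DataLoader(dataset, sampler=sampler)

for batch_data, batch_label in dataloader:
    print(batch_data.shape, batch_label.shape)  # expected something like (4, 3) (4,)
    break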

Dataset types

According to how samples are accessed, MegEngine datasets can be divided into two kinds, Map-style and Iterable-style:

Type                      Map-style                                     Iterable-style
Abstract base class [1]   Dataset / ArrayDataset                        StreamDataset
Access method             Supports random access                        Can only be iterated sequentially
Applicable scenarios      NumPy arrays, dictionaries, disk files [2]    Generators, streaming data from the network

[1] These classes cannot be instantiated and used directly; to use the corresponding type of dataset, you must inherit it with a custom subclass and implement the necessary protocols.

[2] In general, a Map-style dataset should be used whenever possible: it provides a way to query the size of the dataset, it is easier to shuffle, and it can easily be loaded in parallel. However, a MapDataset is not suitable when the input data arrives as part of a stream, such as an audio or video source, or when each data point is a subset of a file too large to fit in memory and therefore has to be loaded incrementally during training. Although these situations can be handled by adding more complex logic to a Map-style dataset or by doing extra preprocessing of the input, that also costs more preparation time. A more natural solution is to use the Iterable-style StreamDataset as the input.

Map-style

Dataset (also called MapDataset)

Dataset is the abstract base class of all datasets in MegEngine. The corresponding dataset type is Map-style, i.e. a mapping from index/key to data sample with random access capability. For example, dataset[idx] reads the idx-th image and its corresponding label from a folder on disk. The __getitem__() and __len__() protocols must be implemented when using it.

The following code shows how to generate a dataset consisting of the five numbers 0 to 4 (without labels):

from megengine.data.dataset import Dataset

class CustomMapDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)
>>> data = list(range(0, 5))
>>> map_dataset = CustomMapDataset(data)
>>> print(len(map_dataset))
>>> print(map_dataset[2])
5
2

Warning

  • Please note that to avoid OOM (Out Of Memory) caused by trying to load a large dataset into memory all at once, we recommend implementing the actual data-reading operation in the __getitem__ method rather than in the __init__ method; the latter should only record the index/key side of the mapping (for example a list of file names or paths), which can greatly reduce memory usage. A minimal sketch is given after this list; for a complete example, refer to Example: Load image classification data;

  • If the data is so large that even meta information such as indexes cannot be loaded into memory, you need to consider a streaming way of obtaining data;

  • The situation is not always the same, though. If the dataset is relatively small and can stay resident in memory, we can consider loading the entire dataset when initializing the object, which reduces how often data is repeatedly read from the hard disk or elsewhere into memory. For example, the same sample is used for training multiple times across different epochs, and in that case reading it directly from memory is more efficient.
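Below is a minimal sketch of the lazy-reading recommendation above (the class name, the use of cv2 as the image reader, and the path/label layout are assumptions made only for illustration): __init__ keeps nothing but paths and labels, while the image itself is read in __getitem__:

import cv2  # any image-reading library works here; cv2 is only an assumption
from megengine.data.dataset import Dataset

class LazyImageDataset(Dataset):
    def __init__(self, image_paths, labels):
        # Keep only the index/key information (paths and labels) in memory
        self.image_paths = image_paths
        self.labels = labels

    def __getitem__(self, idx):
        # Read the image from disk only when this sample is actually requested
        image = cv2.imread(self.image_paths[idx])
        return image, self.labels[idx]

    def __len__(self):
        return len(self.image_paths)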

ArrayDataset

ArrayDataset is a further encapsulation of the Dataset class for NumPy ndarray data; there is no need to implement the __getitem__() and __len__() protocols yourself.

The following code shows how to generate a random dataset of RGB images with 100 samples, each sample being 32 x 32 pixels (the labels are random values). This is the (N, C, H, W) format we often encounter when processing images:

import numpy as np
from megengine.data.dataset import ArrayDataset

data = np.random.random((100, 3, 32, 32))
target = np.random.random((100, 1))
dataset = ArrayDataset(data, target)
>>> print(len(dataset))
>>> print(type(dataset[0]), len(dataset[0]))
>>> print(dataset[0][0].shape)
100
<class 'tuple'> 2
(3, 32, 32)

ConcatDataset

A dataset composed of multiple datasets. ConcatDataset is used to combine multiple map-style datasets into a single new dataset.

The following code shows how to combine multiple ArrayDataset instances into one ConcatDataset:

import numpy as np
from megengine.data.dataset import ArrayDataset, ConcatDataset

data1 = np.random.randint(0, 255, size=(100, 3, 32, 32), dtype=np.uint8)
data2 = np.random.randint(0, 255, size=(100, 3, 32, 32), dtype=np.uint8)
label1 = np.random.randint(0, 10, size=(100,), dtype=int)
label2 = np.random.randint(0, 10, size=(100,), dtype=int)

dataset1 = ArrayDataset(data1, label1)
dataset2 = ArrayDataset(data2, label2)
dataset = ConcatDataset([dataset1, dataset2])
>>> print(len(dataset))
>>> print(type(dataset[0]), len(dataset[0]))
>>> print(dataset[0][0].shape)
200
<class 'tuple'> 2
(3, 32, 32)

Iterable-style

StreamDataset (also called IterableDataset)

An Iterable-style dataset is suitable for streaming data, that is, data that is accessed iteratively. For example, iter(dataset) can return a stream of data read from a database or a remote server, or even logs generated in real time; DataLoader then keeps calling next to fetch data.

This type of data set is particularly suitable for:

  • Random reads are too expensive, or the data is too large to support random access;

  • The batch size depends on how the data arrives, that is, whether the current batch is complete can only be determined from the streamed data itself.

The __iter__() protocol must be implemented when using it.

The following code shows how to generate a dataset consisting of the five numbers 0 to 4 (without labels):

from megengine.data.dataset import StreamDataset

class CustomIterableDataset(StreamDataset):
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)
>>> data = list(range(0, 5))
>>> iter_dataset = CustomIterableDataset(data)
>>> it = iter(iter_dataset)
>>> print(type(it))
<class 'list_iterator'>
>>> print(next(it))
0
>>> print(next(it))
1

Obviously, a streaming dataset supports neither random access through an index value nor querying its length:

>>> iter_dataset[0]
AssertionError: can not get item from StreamDataset by index
>>> len(iter_dataset)
AssertionError: StreamDataset does not have length

This example does not reflect a real use case for StreamDataset, but it is convenient for comparison with MapDataset.
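For a slightly more realistic flavor, the following sketch streams samples line by line from a large text file that would not fit in memory (the class name, file path, and parsing logic are assumptions made only for illustration):

from megengine.data.dataset import StreamDataset

class LogLineDataset(StreamDataset):
    def __init__(self, path):
        # Only the path is recorded; the file is never loaded as a whole
        self.path = path

    def __iter__(self):
        # Samples are produced incrementally while the file is being read
        with open(self.path) as f:
            for line in f:
                yield line.strip()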

Why design streaming datasets

Map-style

>>> for data in map_dataset:
...     print(data)
0
1
2
3
4

Iterable-style

>>> for data in iter_dataset:
...     print(data)
0
1
2
3
4

From the example above we can see that when the same original list is used to generate the two types of datasets and both are iterated over, the Map-style and Iterable-style datasets return the same results, so what is the difference? From a high-level perspective, every time DataLoader returns a batch from a Map-style dataset, it first samples data indexes to obtain a batch of indexes idx, and then uses map_dataset[idx] to fetch the batch data. In contrast, for an Iterable-style dataset, DataLoader keeps calling next(it) to obtain the next sample in order until a complete batch has been collected. This is why we say that Iterable-style datasets are better suited to providing data sequentially.
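In rough code form, the two fetching strategies look like this (a sketch of the idea only, reusing map_dataset and iter_dataset from the examples above; DataLoader's real implementation is more involved, and batch_size here is just an illustrative value):

import random

batch_size = 2  # illustrative value

# Map-style: sample a batch of indexes first, then index into the dataset
batch_indexes = random.sample(range(len(map_dataset)), batch_size)  # in reality a Sampler produces these
batch = [map_dataset[idx] for idx in batch_indexes]

# Iterable-style: keep pulling the next sample until the batch is full
it = iter(iter_dataset)
batch = [next(it) for _ in range(batch_size)]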

See also

Refer to Use Sampler to define sampling rules to learn how to obtain a batch of indexes of length B from a dataset with N samples.

Use the implemented dataset interfaces

In the dataset submodule, besides the abstract base classes to be implemented by user-defined subclasses, MegEngine also provides interfaces that encapsulate some mainstream datasets, such as the MNIST dataset, which is often used for teaching and practice purposes:

>>> from megengine.data.dataset import MNIST
>>> train_set = MNIST(root="path/to/data/", train=True, download=False)
>>> test_set = MNIST(root="path/to/data/", train=False, download=False)

With the help of the encapsulated interface, we can quickly obtain the training set train_set and the test set test_set of the MNIST dataset, where the download parameter controls whether to download the dataset from its official address. For more details, please refer to the API documentation.

Warning

These datasets are all downloaded from their own official sites; MegEngine does not provide mirroring or acceleration services.

Note

  • Some datasets do not provide a download interface for the original data because of their license agreements (for example ImageNet) and must be downloaded manually;

  • The download speed is affected by the network environment and bandwidth; users can also choose to download the original dataset with other scripts or tools;

  • The source code of these dataset interfaces is a very good reference and will be very helpful for users learning how to design a dataset interface.

How to add a new dataset

At present, MegEngine provides interfaces for some common mainstream datasets, and users are welcome to contribute more interface implementations.

However, we have not yet provided clear design specifications and requirements, so it is recommended that users try to communicate with official maintainers first.