Dataset

class Dataset

An abstract base class for all map-style datasets.

Abstract methods

All subclasses should override these two methods:

__getitem__(): fetch a data sample for a given key.
__len__(): return the size of the dataset.

Both methods play a role in the data pipeline; see the description below.
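As a minimal sketch of such a subclass: `SquaresDataset` and its `size` parameter are hypothetical names chosen for illustration, and the import path `megengine.data.dataset.Dataset` is assumed. The fallback to `object` is only there so the sketch also runs where MegEngine is not installed. Plain Python integers are used as samples to keep the example dependency-free; real datasets usually return numpy.ndarray elements.

```python
try:
    from megengine.data.dataset import Dataset  # assumed import path
except ImportError:
    # Fallback so the sketch still runs where MegEngine is not installed.
    Dataset = object


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: sample i is the pair (i, i * i)."""

    def __init__(self, size):
        super().__init__()
        self.size = size

    def __getitem__(self, index):
        # Fetch the sample for the given key; reject out-of-range keys.
        if not 0 <= index < self.size:
            raise IndexError(index)
        return index, index * index

    def __len__(self):
        # Report the dataset size, which samplers rely on.
        return self.size


dataset = SquaresDataset(5)
print(len(dataset))  # 5
print(dataset[3])    # (3, 9)
```

Raising IndexError for out-of-range keys is a design choice of this sketch, not a documented requirement; it mirrors how Python sequences behave.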

Dataset in the Data Pipeline

Usually a dataset works with DataLoader, Sampler, Collator and other components.

For example, the sampler generates the indices of each batch in advance according to the size of the dataset (by calling __len__). When the dataloader needs to yield a batch of data, it passes those indices to the __getitem__ method one by one, then collates the fetched samples into a batch.
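The flow described above can be sketched in plain Python. The functions `batch_indices`, `collate` and `iterate_batches` are hypothetical stand-ins for the sampler, collator and dataloader roles; they are not MegEngine's actual implementation and only illustrate when __len__ and __getitem__ are called.

```python
import random


def batch_indices(dataset_size, batch_size, shuffle=False, seed=0):
    """Sampler stand-in: plan all batches of indices from the dataset size."""
    indices = list(range(dataset_size))
    if shuffle:
        random.Random(seed).shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, dataset_size, batch_size)]


def collate(samples):
    """Collator stand-in: turn a list of (data, label) pairs into two lists."""
    data, labels = zip(*samples)
    return list(data), list(labels)


def iterate_batches(dataset, batch_size):
    """DataLoader stand-in: fetch samples by index, then collate each batch."""
    for indices in batch_indices(len(dataset), batch_size):  # uses __len__
        samples = [dataset[i] for i in indices]  # uses __getitem__
        yield collate(samples)


# A plain list of (data, label) pairs already supports len() and indexing,
# so it can play the role of a map-style dataset in this sketch.
toy_dataset = [(i, i % 2) for i in range(5)]
batches = list(iterate_batches(toy_dataset, batch_size=2))
print(batches[0])  # ([0, 1], [0, 1])
```

Because the indices are planned from the dataset size alone, the dataset only has to answer two questions: how many samples it holds and what the sample for a given key is.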

For more details, reading "Use Dataset to define a data set" is highly recommended.
It may also be helpful to read the implementations of MNIST, CIFAR10 and other existing subclasses.

Warning

By default, all elements in a dataset are numpy.ndarray objects. If you want to perform Tensor operations on them, it is better to do the conversion explicitly, for example:

from megengine import Tensor
import megengine.functional

dataset = MyCustomDataset()  # an instance of a Dataset subclass
data, label = dataset[0]  # equivalent to dataset.__getitem__(0)
data = Tensor(data, dtype="float32")  # convert to MegEngine Tensor explicitly

megengine.functional.ops(data)  # "ops" is a placeholder for any Tensor operation

Applying Tensor ops directly to an ndarray is undefined behavior.
