Dataset
- class Dataset
An abstract base class for all map-style datasets.
Abstract methods
All subclasses should override these two methods:
__getitem__()
: fetch a data sample for a given key.
__len__()
: return the size of the dataset.
Each of them plays a role in the data pipeline, as described below.
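The two methods above can be sketched with a small in-memory dataset. This is a minimal sketch, not MegEngine's implementation: the `Dataset` class below is a stand-in abstract base class built with `abc` so that the example runs without MegEngine installed, and `SquaresDataset` is a hypothetical name. In real code the subclass would derive from MegEngine's own `Dataset` instead.

```python
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Stand-in for the abstract base class (an assumption, not MegEngine's code)."""

    @abstractmethod
    def __getitem__(self, index):
        """Fetch a data sample for a given key."""

    @abstractmethod
    def __len__(self):
        """Return the size of the dataset."""


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: sample i is the pair (i, i * i)."""

    def __init__(self, size):
        self._size = size

    def __getitem__(self, index):
        # Reject keys outside the dataset so callers get a clear error.
        if not 0 <= index < self._size:
            raise IndexError(index)
        return index, index * index

    def __len__(self):
        return self._size


dataset = SquaresDataset(5)
size = len(dataset)        # calls SquaresDataset.__len__
data, label = dataset[3]   # calls SquaresDataset.__getitem__(3)
```

Because both methods are declared abstract in this sketch, a subclass that omits either one cannot be instantiated, which surfaces the mistake early.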
Dataset in the Data Pipeline
Usually a dataset works together with DataLoader, Sampler, Collator and other components.
For example, the sampler generates the indices of each batch in advance according to the size of the dataset (by calling __len__). When the dataloader needs to yield a batch of data, it passes those indices to the __getitem__ method and then collates the fetched samples into a batch.
Reading "Use Dataset to define a data set" is highly recommended for more details.
It might also be helpful to read the implementations of MNIST, CIFAR10 and other existing subclasses.
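The interaction between these components can be approximated in plain Python. This is only a sketch of the idea, not how DataLoader is actually implemented: `ListDataset`, `sequential_batches`, `collate` and `iterate` are hypothetical helpers standing in for a dataset, a sampler, a collator and a dataloader respectively.

```python
class ListDataset:
    """Hypothetical map-style dataset over a list of (data, label) pairs."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __getitem__(self, index):
        return self._samples[index]

    def __len__(self):
        return len(self._samples)


def sequential_batches(dataset, batch_size):
    """Sampler stand-in: plan the indices of every batch from len(dataset)."""
    indices = list(range(len(dataset)))
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


def collate(samples):
    """Collator stand-in: turn a list of (data, label) pairs into two lists."""
    data, labels = zip(*samples)
    return list(data), list(labels)


def iterate(dataset, batch_size):
    """Dataloader stand-in: fetch each planned index, then collate the batch."""
    for batch_indices in sequential_batches(dataset, batch_size):
        yield collate([dataset[i] for i in batch_indices])


dataset = ListDataset([(0.0, "a"), (1.0, "b"), (2.0, "c"), (3.0, "d"), (4.0, "e")])
batches = list(iterate(dataset, batch_size=2))
```

In this sketch the last batch holds a single sample, because five samples do not divide evenly by a batch size of two. Whether to keep or drop such a batch is a decision left to the sampler stand-in here.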
Warning
By default, all elements in a dataset are numpy.ndarray objects. This means that if you want to do Tensor operations, it is better to do the conversion explicitly, such as:
    import megengine
    from megengine import Tensor

    dataset = MyCustomDataset()            # a subclass of Dataset
    data, label = dataset[0]               # equivalent to dataset.__getitem__(0)
    data = Tensor(data, dtype="float32")   # convert to a MegEngine Tensor explicitly
    megengine.functional.ops(data)         # "ops" stands for whichever Tensor operation you need
Applying Tensor operations directly to an ndarray is undefined behavior.