Use Sampler to define sampling rules

After the Use Dataset to define a data set step, the ``DataLoader`` knows how to load data from the dataset into memory. However, beyond fetching individual samples, generating each batch of data has its own requirements, such as the batch size and the sampling rules, all of which need corresponding configuration. With a ``Sampler`` you can customize the sampling rules for each batch of data. How batches are merged is covered in Use Collator to define a merge strategy; this section focuses on the concept and use of ``Sampler``.

To be precise, the responsibility of a sampler is to determine the order of data acquisition, providing the ``DataLoader`` with an iterable sequence of indices for each batch of data.

>>> dataloader = DataLoader(dataset, sampler=RandomSampler(dataset))

In MegEngine, ``Sampler`` is the abstract base class of all samplers. In most cases users do not need to implement a custom sampler, because MegEngine already implements the common ones, such as the ``RandomSampler`` used in the sample code above.

Note

Since datasets can be divided into two types, Map-style and Iterable, samplers are likewise divided into two types: ``MapSampler`` for map-style datasets and ``StreamSampler`` for iterable (stream) datasets.

How to use MapSampler

MapSampler class signature is as follows:

MapSampler(dataset, batch_size=1, drop_last=False,
           num_samples=None, world_size=None, rank=None, seed=None)

Among them, ``dataset`` is used to obtain information about the data set, the ``batch_size`` parameter specifies the size of each batch, the ``drop_last`` parameter sets whether to discard the last incomplete batch, and the ``num_samples``, ``world_size``, ``rank`` and ``seed`` parameters are used in distributed training scenarios.
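
The distributed parameters can be sketched as follows; this assumes, based on the signature above, that each rank samples only its own ``1/world_size`` share of the data (the exact partitioning is an implementation detail):

import numpy as np
from megengine.data import SequentialSampler
from megengine.data.dataset import ArrayDataset

dataset = ArrayDataset(np.random.random((100, 3, 32, 32)))

# Hypothetical two-process setup: rank 0 of world_size 2 samples its own share,
# i.e. about 100 / 2 = 50 samples, giving 5 batches of 10 indices (assumption).
sampler_rank0 = SequentialSampler(dataset, batch_size=10, world_size=2, rank=0)
print(len(list(sampler_rank0)))  # expected: 5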

Warning

MapSampler will not actually read the data into memory and return the sampled data, because that would bring a relatively large memory overhead. Instead, it obtains the data set length through the ``__len__`` protocol implemented in the Dataset, forms the integer index list ``[0, 1, ...]``, and samples that index list with the ``sample`` method implemented by each subclass, returning an iterable list that stores the indices corresponding to each batch of data. Only when the ``DataLoader`` is iterated will the data be loaded according to these indices.
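
The following is a minimal sketch of this principle in plain Python; it only illustrates the idea and is not MegEngine's actual implementation:

def sequential_batches(dataset_len, batch_size, drop_last=False):
    # Build the integer index list [0, 1, ..., dataset_len - 1] ...
    indices = list(range(dataset_len))
    # ... and group it into batches of indices; no sample data is loaded here.
    batches = [indices[i:i + batch_size]
               for i in range(0, dataset_len, batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the last incomplete batch
    return batches

print([len(b) for b in sequential_batches(100, 30)])  # [30, 30, 30, 10]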

Below we use the most common samplers provided in MegEngine to demonstrate the related concepts.

First, randomly generate an image data set with shape ``(N, C, H, W)``, corresponding to the number of samples, number of channels, height and width respectively.

import numpy as np
from megengine.data.dataset import ArrayDataset

image_data = np.random.random((100, 3, 32, 32)) # (N, C, H, W)
image_dataset = ArrayDataset(image_data)

If you are not sure what the above code does, please refer to Use Dataset to define a data set.

Sequential sampling

Use :py:class:`~.SequentialSampler` to sample the data set sequentially:

>>> from megengine.data import SequentialSampler
>>> sampler = SequentialSampler(image_dataset, batch_size=10)
>>> print(len(list(sampler)))
10

As shown above, for a data set containing 100 samples, sampling with a ``batch_size`` of 10 yields 10 batches of sequential indices.

We can print out the value of each batch of index:

>>> for batch_id, indices in enumerate(sampler):
...     print(batch_id, indices)
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
2 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
3 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
4 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
5 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
6 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
7 [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
8 [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
9 [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

If you change ``batch_size`` to 30, you will get 4 batches of sequential indices, and the last batch will have a length of 10:

>>> sampler = SequentialSampler(image_dataset, batch_size=30)
>>> for batch_id, indices in enumerate(sampler):
...     print(batch_id, len(indices))
0 30
1 30
2 30
3 10

We can set ``drop_last=True`` to drop the last incomplete batch of indices:

>>> sampler = SequentialSampler(image_dataset, 30, drop_last=True)
>>> for batch_id, indices in enumerate(sampler):
...     print(batch_id, len(indices))
0 30
1 30
2 30

Note

By default, if the user does not configure a sampler for the ``DataLoader`` of a map-style dataset, the following configuration will be used:

>>> SequentialSampler(dataset, batch_size=1, drop_last=False)

Obviously, when ``batch_size`` is 1, this is equivalent to traversing each sample in the data set one by one.

Random sampling without replacement

Use :py:class:`~.RandomSampler` to sample the data set randomly without replacement:

>>> from megengine.data import RandomSampler
>>> sampler = RandomSampler(image_dataset, batch_size=10)
>>> for batch_id, indices in enumerate(sampler):
...     print(batch_id, indices)
0 [78, 20, 74, 6, 45, 65, 99, 67, 88, 57]
1 [81, 0, 94, 98, 71, 30, 66, 10, 85, 56]
2 [51, 87, 62, 42, 7, 75, 11, 12, 39, 95]
3 [73, 15, 77, 72, 89, 13, 55, 26, 49, 33]
4 [9, 8, 64, 3, 37, 2, 70, 29, 34, 47]
5 [22, 18, 93, 4, 40, 92, 79, 36, 84, 25]
6 [83, 90, 68, 58, 50, 48, 32, 54, 35, 1]
7 [14, 44, 17, 63, 60, 97, 96, 23, 52, 38]
8 [80, 59, 53, 19, 46, 43, 24, 61, 16, 5]
9 [86, 82, 31, 76, 28, 91, 27, 21, 69, 41]

See also

Random sampling without replacement is also known as simple random sampling. For details, refer to `Simple random sample <https://en.wikipedia.org/wiki/Simple_random_sample>`_.
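
The ``seed`` parameter from the ``MapSampler`` signature can also be passed here to make the shuffling reproducible; a small sketch, assuming identical seeds produce identical orderings:

>>> sampler_a = RandomSampler(image_dataset, batch_size=10, seed=42)
>>> sampler_b = RandomSampler(image_dataset, batch_size=10, seed=42)
>>> print(list(sampler_a) == list(sampler_b))  # expected: True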

Random sampling with replacement

Use :py:class:`~.ReplacementSampler` to sample the data set randomly with replacement:

>>> from megengine.data import ReplacementSampler
>>> sampler = ReplacementSampler(image_dataset, batch_size=10)
>>> for batch_id, indices in enumerate(sampler):
...     print(batch_id, indices)
0 [58, 29, 42, 79, 91, 73, 86, 46, 85, 23]
1 [42, 33, 61, 8, 22, 10, 98, 56, 59, 96]
2 [38, 72, 26, 0, 40, 33, 30, 59, 1, 25]
3 [71, 95, 89, 88, 29, 97, 97, 46, 42, 0]
4 [42, 22, 28, 82, 49, 52, 88, 68, 46, 66]
5 [47, 62, 26, 17, 68, 31, 70, 69, 26, 4]
6 [43, 18, 17, 91, 99, 96, 91, 7, 24, 39]
7 [50, 55, 86, 65, 93, 38, 39, 4, 6, 60]
8 [92, 82, 61, 36, 67, 56, 24, 18, 70, 60]
9 [91, 63, 95, 99, 19, 47, 9, 9, 68, 37]

Unlimited sampling

Usually, a data set can only be divided into a limited number of batches with a given ``batch_size``, which means the number of data batches that can be sampled is limited. If you want to reuse the data, the most common approach is to loop over multiple ``epochs``, repeatedly traversing the data set:

>>> for epoch in range(epochs):
...     for batch_data in dataloader:
...         pass  # training logic goes here

The ``epochs`` here is a relatively common hyperparameter in machine learning algorithms.

However, in some cases we want to sample from the data without limit, so MegEngine provides the :py:class:`~.Infinite` wrapper:

>>> from megengine.data import Infinite
>>> sampler = Infinite(SequentialSampler(image_dataset, batch_size=10))
>>> sample_queue = iter(sampler)
>>> for step in range(20):
...     indices = next(sample_queue)
...     print(step, indices)
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
2 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
3 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
4 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
5 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
6 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
7 [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
8 [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
9 [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
10 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
11 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
12 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
13 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
14 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
15 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
16 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
17 [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
18 [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
19 [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

``Infinite`` can wrap any existing ``MapSampler``, yielding an infinitely iterable sequence of batch indices.

Its implementation principle: when the current batch index list cannot be iterated further, one full traversal of the data has been completed; at that moment it immediately calls the original sampler again to form a new batch index list for the next ``next`` call.
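
A minimal sketch of this idea in plain Python (an illustration only, not MegEngine's actual implementation):

class InfiniteWrapper:
    """Restart a finite sampler whenever it is exhausted (illustrative only)."""

    def __init__(self, sampler):
        self.sampler = sampler
        self.iterator = iter(sampler)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self.iterator)
        except StopIteration:
            # One full traversal is done: rebuild the iterator and keep going.
            self.iterator = iter(self.sampler)
            return next(self.iterator)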

See also

An example of ``DataLoader`` loading ImageNet data through an infinite sampler can be found in the official ResNet training code official/vision/classification/resnet/train.py.

Custom MapSampler example
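
As a minimal sketch, a sampler that visits even indices first and odd indices afterwards could look like this. It assumes, per the warning above, that subclasses implement the ``sample`` method to produce the full index sequence, which the base class then slices into batches; ``EvenOddSampler`` is a hypothetical name and ``MapSampler`` is assumed to be importable from ``megengine.data``:

import numpy as np
from megengine.data import MapSampler
from megengine.data.dataset import ArrayDataset

class EvenOddSampler(MapSampler):
    """Hypothetical sampler: visit even indices first, then odd ones."""

    def __init__(self, dataset, batch_size=1, drop_last=False):
        super().__init__(dataset, batch_size=batch_size, drop_last=drop_last)
        self.length = len(dataset)  # relies on the Dataset __len__ protocol

    def sample(self):
        indices = list(range(self.length))
        return indices[::2] + indices[1::2]  # evens first, then odds

sampler = EvenOddSampler(ArrayDataset(np.random.random((10, 1))), batch_size=5)
for batch_id, indices in enumerate(sampler):
    print(batch_id, indices)
# expected: 0 [0, 2, 4, 6, 8]
#           1 [1, 3, 5, 7, 9]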

How to use StreamSampler

The content of this part is waiting to be added…