Use Sampler to define sampling rules¶
After the Use Dataset to define a data set step, `DataLoader` knows how to load data from the dataset into memory. However, besides fetching a single sample from the dataset, generating each batch of data also requires configuration, such as the size of each batch and the sampling rules. With `Sampler`, you can customize the sampling rules for each batch of data. Another section introduces how to Use Collator to define a merge strategy; this section focuses on the concept and use of `Sampler`.
To be precise, the sampler's responsibility is to determine the order of data acquisition, so as to provide the `DataLoader` with an iterable sequence of batch indices:
>>> sampler = RandomSampler(dataset, batch_size=10)
>>> dataloader = DataLoader(dataset, sampler=sampler)
In MegEngine, `Sampler` is the abstract base class of all samplers. In most cases users do not need to implement a custom sampler, because MegEngine already provides the common ones, such as the `RandomSampler` used in the sample code above.
Note
Since Dataset types can be divided into two categories, Map-style and Iterable-style, `Sampler` is likewise divided into two types:

- `MapSampler`: a sampler for Map-style datasets. By sampling method there are: sequential sampling (the default) / random sampling without replacement / random sampling with replacement. We can also use the `Infinite` wrapper to achieve unlimited sampling on top of these classes. If you want to implement your own `MapSampler`, you need to inherit this class and implement the `sample` method yourself (a sketch is given at the end of this section).
- `StreamSampler`: a sampler for Iterable-style datasets.
How to use MapSampler¶
The `MapSampler` class signature is as follows:

MapSampler(dataset, batch_size=1, drop_last=False, num_samples=None, world_size=None, rank=None, seed=None)
Among them, `dataset` is used to obtain dataset information, the `batch_size` parameter specifies the size of each batch of data, the `drop_last` parameter sets whether to discard the last incomplete batch, and the `num_samples`, `world_size`, `rank` and `seed` parameters are used in distributed training scenarios.
Warning
`MapSampler` does not actually read the data into memory and return the sampled data, since that would incur a relatively large memory overhead. Instead, it obtains the length of the dataset through the `__len__` protocol implemented in the `Dataset`, forms the integer index list `[0, 1, ...]`, and samples that index list with the `sample` method implemented by the subclass, returning an iterable list that stores the indices of each batch of data obtained by sampling. Only when the `DataLoader` is iterated will the data actually be loaded according to these indices.
Below we use the most common samplers provided in MegEngine to demonstrate the related concepts.

First, randomly generate an image dataset with shape `(N, C, H, W)`, corresponding to the number of samples, the number of channels, the height and the width respectively.
import numpy as np
from megengine.data.dataset import ArrayDataset
image_data = np.random.random((100, 3, 32, 32)) # (N, C, H, W)
image_dataset = ArrayDataset(image_data)
If you are not sure what the above code does, please refer to Use Dataset to define a data set.
Sequential sampling¶
Use `SequentialSampler` to sample the dataset sequentially:
>>> from megengine.data import SequentialSampler
>>> sampler = SequentialSampler(image_dataset, batch_size=10)
>>> print(len(list(sampler)))
10
As shown above, sampling a dataset of 100 samples with a `batch_size` of 10 yields 10 batches of sequential indices.

We can print the value of each batch of indices:
>>> for batch_id, indices in enumerate(sampler):
... print(batch_id, indices)
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
2 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
3 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
4 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
5 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
6 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
7 [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
8 [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
9 [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
If you change `batch_size` to 30, you will get 4 batches of sequential indices, and the last batch will have a length of 10:
>>> sampler = SequentialSampler(image_dataset, batch_size=30)
>>> for batch_id, indices in enumerate(sampler):
... print(batch_id, len(indices))
0 30
1 30
2 30
3 10
We can set `drop_last=True` to discard the last incomplete batch of indices:
>>> sampler = SequentialSampler(image_dataset, 30, drop_last=True)
>>> for batch_id, indices in enumerate(sampler):
... print(batch_id, len(indices))
0 30
1 30
2 30
Note
By default, if the user does not configure a sampler for the `DataLoader` of a Map-style dataset, the following configuration is used:
>>> SequentialSampler(dataset, batch_size=1, drop_last=False)
Obviously, when `batch_size` is 1, this is equivalent to traversing each sample in the dataset one by one.
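As a quick check of this default behavior (using the `image_dataset` defined above), each batch then contains exactly one index:

>>> sampler = SequentialSampler(image_dataset)  # batch_size=1, drop_last=False
>>> print(next(iter(sampler)))
[0]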
Random sampling without replacement¶
Use `RandomSampler` to sample the dataset randomly without replacement:
>>> from megengine.data import RandomSampler
>>> sampler = RandomSampler(image_dataset, batch_size=10)
>>> for batch_id, indices in enumerate(sampler):
... print(batch_id, indices)
0 [78, 20, 74, 6, 45, 65, 99, 67, 88, 57]
1 [81, 0, 94, 98, 71, 30, 66, 10, 85, 56]
2 [51, 87, 62, 42, 7, 75, 11, 12, 39, 95]
3 [73, 15, 77, 72, 89, 13, 55, 26, 49, 33]
4 [9, 8, 64, 3, 37, 2, 70, 29, 34, 47]
5 [22, 18, 93, 4, 40, 92, 79, 36, 84, 25]
6 [83, 90, 68, 58, 50, 48, 32, 54, 35, 1]
7 [14, 44, 17, 63, 60, 97, 96, 23, 52, 38]
8 [80, 59, 53, 19, 46, 43, 24, 61, 16, 5]
9 [86, 82, 31, 76, 28, 91, 27, 21, 69, 41]
See also
Random sampling without replacement is also known as simple random sampling; refer to Simple random sample <https://en.wikipedia.org/wiki/Simple_random_sample>.
Random sampling with replacement¶
Use `ReplacementSampler` to sample the dataset randomly with replacement:
>>> from megengine.data import ReplacementSampler
>>> sampler = ReplacementSampler(image_dataset, batch_size=10)
>>> for batch_id, indices in enumerate(sampler):
... print(batch_id, indices)
0 [58, 29, 42, 79, 91, 73, 86, 46, 85, 23]
1 [42, 33, 61, 8, 22, 10, 98, 56, 59, 96]
2 [38, 72, 26, 0, 40, 33, 30, 59, 1, 25]
3 [71, 95, 89, 88, 29, 97, 97, 46, 42, 0]
4 [42, 22, 28, 82, 49, 52, 88, 68, 46, 66]
5 [47, 62, 26, 17, 68, 31, 70, 69, 26, 4]
6 [43, 18, 17, 91, 99, 96, 91, 7, 24, 39]
7 [50, 55, 86, 65, 93, 38, 39, 4, 6, 60]
8 [92, 82, 61, 36, 67, 56, 24, 18, 70, 60]
9 [91, 63, 95, 99, 19, 47, 9, 9, 68, 37]
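Because the sampling is with replacement, the same index can appear several times within a single pass (for example, index 42 shows up in multiple batches above). A quick way to confirm this with the `image_dataset` defined earlier:

>>> indices = [i for batch in ReplacementSampler(image_dataset, batch_size=10) for i in batch]
>>> print(len(indices) > len(set(indices)))  # almost surely True: some of the 100 draws collide
True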
Unlimited sampling¶
Usually, with a given `batch_size`, a dataset can only be divided into a finite number of batches, which means the number of data batches that can be sampled is limited. If you want to reuse the data, the most common approach is to traverse the dataset repeatedly over multiple epochs:

>>> for epoch in range(epochs):
...     for batch_data in dataloader:
...         pass  # training logic goes here

Here `epochs` is a fairly common hyperparameter in machine learning algorithms.
However, in some cases we want to sample batches from the data without limit, so MegEngine provides the `Infinite` wrapper:
>>> from megengine.data import Infinite
>>> sampler = Infinite(SequentialSampler(image_dataset, batch_size=10))
>>> sample_queue = iter(sampler)
>>> for step in range(20):
... indice = next(sample_queue)
... print(step, indice)
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
1 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
2 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
3 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
4 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
5 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
6 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
7 [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
8 [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
9 [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
10 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
11 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
12 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
13 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
14 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
15 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
16 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
17 [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
18 [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
19 [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
`Infinite` can wrap any existing `MapSampler` to obtain an infinitely iterable batch index list.

Its implementation principle: when the current batch index list is found to be exhausted, one full traversal of the data has been completed; at that moment the wrapper immediately invokes the original sampler again to form a new batch index list for the next `next` call.
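The idea fits in a few lines. Below is a minimal sketch of such a wrapper, not MegEngine's actual implementation; the class name `InfiniteSamplerSketch` is hypothetical:

class InfiniteSamplerSketch:
    """Minimal sketch: restart the wrapped sampler whenever it is exhausted."""

    def __init__(self, sampler):
        self.sampler = sampler
        self.iterator = iter(sampler)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            # next batch of indices from the current pass
            return next(self.iterator)
        except StopIteration:
            # one full traversal finished: rebuild the batch index list
            self.iterator = iter(self.sampler)
            return next(self.iterator)

Wrapping `SequentialSampler(image_dataset, batch_size=10)` with this sketch would reproduce the repeating index stream shown above.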
See also
An example of `DataLoader` loading ImageNet data through an infinite sampler can be found in the official ResNet training code official/vision/classification/resnet/train.py.
Custom MapSampler example¶
See also
- GroupedRandomSampler: official/vision/detection/tools/utils.py#L67-
- InferenceSampler: official/vision/detection/tools/utils.py#L106-
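As a complement to the examples above, here is a minimal sketch of a custom sampler. It assumes, consistent with the built-in samplers, that `MapSampler` is importable from `megengine.data`, that `self.num_samples` holds the number of samples to index, and that the base class groups the sequence returned by `sample` into batches of `batch_size`; the class name `EvenFirstSampler` is hypothetical:

from megengine.data import MapSampler  # assumed importable, like the built-in samplers

class EvenFirstSampler(MapSampler):
    """Hypothetical sampler: yield all even indices first, then the odd ones."""

    def sample(self):
        # Return the full index sequence; the base class is assumed to
        # chunk it into batches of batch_size, as for the built-in samplers.
        evens = list(range(0, self.num_samples, 2))
        odds = list(range(1, self.num_samples, 2))
        return evens + odds

sampler = EvenFirstSampler(image_dataset, batch_size=10)
# Under the assumptions above, the first batch would be [0, 2, 4, ..., 18].
print(next(iter(sampler)))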
How to use StreamSampler¶
The content of this part has yet to be added…