Use Collator to define a merge strategy#
Note
Throughout the process of using DataLoader to obtain batch data, Collator is responsible for merging individual samples into the final batch.
>>> dataloader = DataLoader(dataset, collator=...)
Users generally do not need to implement their own Collator; the default merge strategy handles most batch-merging situations. When the default strategy cannot handle a particular case, users can supply a Collator they implement themselves. See Custom Collator.
Warning
Collator applies only to Map-style datasets; for Iterable-style datasets, batch data must be merged sample by sample.
Default merge strategy#
After the preceding processing steps, Collator usually receives a list:
If the __getitem__ method of your Dataset subclass returns a single element, Collator receives a plain list;
If the __getitem__ method of your Dataset subclass returns a tuple, Collator receives a list of tuples.
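To make the two cases concrete, the following sketch builds such a list by hand. ToyDataset and the hard-coded indices are hypothetical stand-ins for a real Dataset subclass and for the indices a Sampler would supply; no MegEngine API is called here.

```python
import numpy as np


class ToyDataset:
    """Hypothetical map-style dataset: each item is an (image, label) tuple."""

    def __init__(self, num_samples):
        self.images = [np.zeros((3, 4, 4), dtype=np.float32) for _ in range(num_samples)]
        self.labels = list(range(num_samples))

    def __getitem__(self, index):
        # Returning a tuple means the collator will see a list of tuples.
        return self.images[index], self.labels[index]

    def __len__(self):
        return len(self.images)


dataset = ToyDataset(8)
indices = [0, 3, 5]  # stand-in for one batch of indices from a Sampler
sample_list = [dataset[i] for i in indices]

print(type(sample_list).__name__, type(sample_list[0]).__name__)
print(len(sample_list), sample_list[0][0].shape)
```

If __getitem__ returned only the image, sample_list would instead be a plain list of arrays.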
MegEngine uses Collator as its default implementation; calling its apply method merges the list of samples into batch data:
>>> from megengine.data import Collator
>>> collator = Collator()
Its implementation uses the numpy.stack function to merge all samples in the list along the first dimension (axis=0).
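As a rough illustration of that stacking step (plain NumPy only, not MegEngine's actual source), numpy.stack with axis=0 adds a new leading dimension whose length equals the number of samples:

```python
import numpy as np

# Three samples that all share the shape (2, 3); sample i is filled with the value i.
samples = [np.full((2, 3), fill_value=i, dtype=np.float32) for i in range(3)]

# Stacking along axis=0 creates a new leading "batch" dimension.
batch = np.stack(samples, axis=0)

print(batch.shape)      # (3, 2, 3)
print(batch[1, 0, 0])   # 1.0
```

Note that numpy.stack raises a ValueError when the samples do not all have the same shape, which is one situation where a plain stack is not enough.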
See also
MegEngine also provides a similar stack function, but it only applies to Tensor data.
Warning
The default Collator supports the following data types: NumPy ndarrays, numbers, Unicode strings, bytes, dicts, and lists. The input must contain at least one of these data types; otherwise, users need to use a Collator they define themselves.
Collator effect demonstration#
Suppose each sample has shape \((C, H, W)\) and batch_size is set to \(N\) in the Sampler. The main purpose of Collator is then to merge the resulting list of samples into a batch of shape \((N, C, H, W)\).
We can simulate such image_list data and use Collator to get batch_image:
>>> import numpy as np
>>> N, C, H, W = 5, 3, 32, 32
>>> image_list = []
>>> for i in range(N):
...     image_list.append(np.random.random((C, H, W)))
>>> print(len(image_list), image_list[0].shape)
5 (3, 32, 32)
>>> batch_image = collator.apply(image_list)
>>> batch_image.shape
(5, 3, 32, 32)
If the samples are labeled, Collator needs to merge the list of (image, label) tuples into one large (batch_image, batch_label) tuple. This is what we usually get when we iterate over the DataLoader.
In the following sample code, each element of sample_list is a tuple (assuming every label is the integer 1):
>>> sample_list = []
>>> for i in range(N):
...     sample_list.append((np.random.random((C, H, W)), 1))
>>> type(sample_list[0])
tuple
>>> print(sample_list[0][0].shape, type(sample_list[0][1]))
(3, 32, 32) <class 'int'>
The default Collator provided by MegEngine also handles this case well:
>>> batch_image, batch_label = collator.apply(sample_list)
>>> print(batch_image.shape, batch_label.shape)
(5, 3, 32, 32) (5,)
Warning
Note that batch_label has now been converted into an ndarray.
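The tuple case can be approximated in plain NumPy by regrouping the list of tuples and merging each field separately. This is only a sketch of the observable behaviour, not MegEngine's implementation; merge_tuples is a hypothetical helper name.

```python
import numpy as np


def merge_tuples(sample_list):
    """Merge a list of (image, label) tuples into (batch_image, batch_label)."""
    # zip(*...) regroups [(img0, lbl0), (img1, lbl1), ...]
    # into (img0, img1, ...) and (lbl0, lbl1, ...).
    images, labels = zip(*sample_list)
    batch_image = np.stack(images, axis=0)  # shape (N, C, H, W)
    batch_label = np.array(labels)          # shape (N,); Python ints become an ndarray
    return batch_image, batch_label


N, C, H, W = 5, 3, 32, 32
sample_list = [(np.random.random((C, H, W)), 1) for _ in range(N)]
batch_image, batch_label = merge_tuples(sample_list)

print(batch_image.shape, batch_label.shape)  # (5, 3, 32, 32) (5,)
print(type(batch_label).__name__)            # ndarray
```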
Custom Collator#
When the default stack merge strategy cannot meet our needs, we need to consider customizing the Collator:
Inherit from the Collator class and implement the apply method in the subclass;
The apply method we implement will be called by the DataLoader.
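The two steps above can be sketched as follows for one case that a plain stack cannot handle: sequences of different lengths. PadCollator, its pad_value parameter, and the sample data are hypothetical names chosen for illustration. The fallback stub exists only so that the sketch also runs where MegEngine is not installed.

```python
import numpy as np

try:
    from megengine.data import Collator
except ImportError:
    # Fallback stub (assumption): lets the sketch run without MegEngine installed.
    class Collator:
        def apply(self, inputs):
            raise NotImplementedError


class PadCollator(Collator):
    """Hypothetical collator that pads 1-D sequences to a common length."""

    def __init__(self, pad_value=0):
        super().__init__()
        self.pad_value = pad_value

    def apply(self, inputs):
        # inputs is a list of (sequence, label) tuples with varying sequence lengths.
        sequences, labels = zip(*inputs)
        max_len = max(len(seq) for seq in sequences)
        batch = np.full((len(sequences), max_len), self.pad_value, dtype=np.float32)
        for row, seq in enumerate(sequences):
            batch[row, : len(seq)] = seq  # copy the real values, leave the rest padded
        return batch, np.array(labels)


samples = [
    (np.array([1.0, 2.0, 3.0]), 0),
    (np.array([4.0]), 1),
    (np.array([5.0, 6.0]), 0),
]
batch_seq, batch_label = PadCollator().apply(samples)
print(batch_seq.shape, batch_label.shape)  # (3, 3) (3,)
```

An instance of such a class would then be passed through the collator argument, as in DataLoader(dataset, collator=PadCollator()).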
See also
official/vision/keypoints/dataset.py#L167 - HeatmapCollator
official/vision/detection/tools/utils.py#L125 - DetectionPadCollator