Use Collator to define a merge strategy

Note

Throughout the process of obtaining batch data from a DataLoader, the Collator is responsible for merging individual samples into the final batch.

>>> dataloader = DataLoader(dataset, collator=...)

In general, users do not need to implement their own Collator: the default merge strategy handles most batching situations. When the default merge strategy cannot handle a particular case, users can supply a Collator they implement themselves. Refer to Custom Collator.

Warning

Collator applies only to Map-style datasets, because the batch data of Iterable-style datasets must be merged one by one.

Default merge strategy

After the previous processing steps, Collator usually receives a list:

  • If the __getitem__ method of your Dataset subclass returns a single element, Collator receives a plain list;

  • If the __getitem__ method of your Dataset subclass returns a tuple, Collator receives a list of tuples.
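To make the two cases concrete, here is a minimal sketch of how such a list comes about. ToyDataset is a hypothetical stand-in written in plain Python (it does not inherit from MegEngine's Dataset class), and the index list imitates what a sampler might produce; both are assumptions for illustration only.

```python
import numpy as np


class ToyDataset:
    """Hypothetical map-style dataset; a stand-in, not MegEngine's Dataset."""

    def __init__(self, n):
        self.images = np.zeros((n, 3, 8, 8))
        self.labels = np.arange(n)

    def __getitem__(self, idx):
        # Returning a tuple means the collator will see a list of tuples.
        return self.images[idx], int(self.labels[idx])

    def __len__(self):
        return len(self.images)


dataset = ToyDataset(10)
indices = [0, 3, 7]  # assumed indices for one batch
sample_list = [dataset[i] for i in indices]
print(type(sample_list), type(sample_list[0]), len(sample_list))
```

If __getitem__ returned only the image, sample_list would be a plain list of arrays instead.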

MegEngine uses the Collator class as its default implementation. Calling its apply method merges a list of samples into batch data:

>>> from megengine.data import Collator
>>> collator = Collator()

Its implementation uses the numpy.stack function to merge all the samples in the list along a new first dimension (axis=0).
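The stacking step on its own can be sketched with NumPy alone. This only illustrates what numpy.stack does with axis=0; it is not MegEngine's actual implementation, and the array shapes are arbitrary.

```python
import numpy as np

# Two samples with identical shape (3, 4)
samples = [np.zeros((3, 4)), np.ones((3, 4))]

# Stacking along axis=0 adds a new leading "batch" dimension
batch = np.stack(samples, axis=0)
print(batch.shape)
```

Note that numpy.stack requires every array in the list to have the same shape.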

See also

A similar stack function is also provided in MegEngine, but it is only applicable to Tensor data.

Warning

The default Collator supports the following data types: NumPy ndarrays, numbers, Unicode strings, bytes, dicts, and lists. The input must consist of these data types; otherwise users need to define their own Collator.

Collator effect demonstration

Suppose each sample is an image of shape \((C, H, W)\), and batch_size is specified as \(N\) in the Sampler. The main purpose of Collator is then to merge the obtained sample list into a batch of shape \((N, C, H, W)\).

We can simulate such image_list data, and use Collator to get batch_image:

>>> import numpy as np
>>> N, C, H, W = 5, 3, 32, 32
>>> image_list = []
>>> for i in range(N):
...     image_list.append(np.random.random((C, H, W)))
>>> print(len(image_list), image_list[0].shape)
5 (3, 32, 32)
>>> batch_image = collator.apply(image_list)
>>> batch_image.shape
(5, 3, 32, 32)

If the samples are labeled, Collator needs to merge the list of (image, label) tuples into one (batch_image, batch_label) tuple. This is what we usually get when we iterate over the DataLoader.

In the following sample code, each element in sample_list is a tuple (assuming that all labels are represented by the integer 1):

>>> sample_list = []
>>> for i in range(N):
...     sample_list.append((np.random.random((C, H, W)), 1))
>>> type(sample_list[0])
<class 'tuple'>
>>> print(sample_list[0][0].shape, type(sample_list[0][1]))
(3, 32, 32) <class 'int'>

The default Collator provided by MegEngine also handles this situation well:

>>> batch_image, batch_label = collator.apply(sample_list)
>>> print(batch_image.shape, batch_label.shape)
(5, 3, 32, 32) (5,)

Warning

Note that batch_label has now been converted into an ndarray, even though each individual label was a Python int.
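The tuple case can be pictured as "transpose, then stack each field". The sketch below uses only NumPy; collate_tuples is a hypothetical helper written for illustration and should not be read as the code MegEngine actually runs.

```python
import numpy as np


def collate_tuples(samples):
    """Illustrative only: regroup a list of tuples by field, then stack each field."""
    fields = zip(*samples)  # (img, label), (img, label), ... -> (imgs...), (labels...)
    return tuple(
        np.stack([np.asarray(value) for value in field], axis=0)
        for field in fields
    )


N, C, H, W = 5, 3, 32, 32
sample_list = [(np.random.random((C, H, W)), 1) for _ in range(N)]

batch_image, batch_label = collate_tuples(sample_list)
print(batch_image.shape, batch_label.shape)
print(type(batch_label))
```

Because each integer label is wrapped with np.asarray before stacking, the labels end up in a one-dimensional ndarray of length N, which matches the behavior described in the warning above.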

Custom Collator

When the default stack merge strategy cannot meet our needs, we need to consider customizing the Collator:

  • A custom Collator must inherit the Collator class and implement the apply method in the subclass;

  • The apply method we implement will be called by the DataLoader.
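As one possible example, consider samples of different lengths, which numpy.stack cannot merge directly. The sketch below zero-pads them to a common length. PadCollator and the zero-padding strategy are hypothetical choices made for illustration. The fallback base class exists only so the snippet runs where MegEngine is not installed.

```python
import numpy as np

try:
    from megengine.data import Collator  # same import as used earlier in this page
except ImportError:
    class Collator:  # stand-in so the sketch runs without MegEngine installed
        def apply(self, inputs):
            raise NotImplementedError


class PadCollator(Collator):
    """Hypothetical Collator that zero-pads 1-D samples to a common length."""

    def apply(self, inputs):
        # inputs: list of 1-D ndarrays whose lengths may differ
        max_len = max(len(x) for x in inputs)
        batch = np.zeros((len(inputs), max_len), dtype=inputs[0].dtype)
        for i, x in enumerate(inputs):
            batch[i, : len(x)] = x  # copy the sample, leave the rest as zeros
        return batch


samples = [np.array([1, 2, 3]), np.array([4]), np.array([5, 6])]
padded = PadCollator().apply(samples)
print(padded.shape)
```

An instance of such a subclass would then be passed to the DataLoader through the collator argument shown at the top of this page.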