Stephen Cow Chau
2 min read · Jun 16, 2022


Thank you for your feedback. I have been going back to the PyTorch source code to review the behavior, and here are my thoughts:

ref: https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader

https://github.com/pytorch/pytorch/blob/master/torch/utils/data/_utils/fetch.py

Summary points:

- I agree your approach would work, especially in a controlled sampling manner [which is, sampling in a certain sequence, but dropping certain items throughout the sequence]

- I would also care about the randomness of the sampling result, as I would want the model to see a different composition of data in each batch

- If I am just trying to subsample randomly, I might just go for "SubsetRandomSampler" (see the first sketch after this list)

- One of the items on my TODO list (which, I believe, someone might have already done and put up as a library for others to reuse) is to sample NLP data by sentence length, but grouping the lengths into buckets (e.g. bucket 1 = length > 100, bucket 2 = length > 80...) and randomly shuffling the sample indices within each bucket (see the second sketch after this list). The original intention is to be able to reduce zero padding per batch when variable-length sentences land in a single batch.
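For illustration, here is how SubsetRandomSampler could be wired up for plain random subsampling; the toy dataset and the choice of 60 indices are just assumptions for the demo:

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Hypothetical toy dataset: 100 samples of 8 features each, binary labels
dataset = TensorDataset(torch.randn(100, 8), torch.randint(0, 2, (100,)))

# Keep only 60 of the 100 indices, drawn in a fresh random order every epoch
sampler = SubsetRandomSampler(list(range(60)))
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

for features, labels in loader:
    pass  # each epoch sees the same 60 samples, shuffled differently
```

And here is a minimal sketch of the bucketing idea from my TODO list. It assumes the sentence lengths are known up front; the class name BucketBatchSampler and the bucket boundaries are my own choices, not an existing PyTorch API:

```python
import random
from torch.utils.data import Sampler

class BucketBatchSampler(Sampler):
    """Hypothetical batch sampler: group sample indices into length buckets,
    shuffle within each bucket, and yield batches bucket by bucket so that
    sentences of similar length land in the same batch (less zero padding)."""

    def __init__(self, lengths, batch_size, boundaries=(80, 100)):
        self.batch_size = batch_size
        self.buckets = {}
        for idx, length in enumerate(lengths):
            # bucket 0 is at or below the first boundary, bucket 1 between
            # the two boundaries, bucket 2 above the last one, and so on
            key = sum(length > b for b in boundaries)
            self.buckets.setdefault(key, []).append(idx)

    def __iter__(self):
        for indices in self.buckets.values():
            random.shuffle(indices)  # keep randomness within each bucket
            for i in range(0, len(indices), self.batch_size):
                yield indices[i:i + self.batch_size]

    def __len__(self):
        # Ceiling division per bucket, since each bucket batches separately
        return sum(-(-len(v) // self.batch_size) for v in self.buckets.values())

# Usage: pass it as batch_sampler so the DataLoader does no batching itself
# loader = DataLoader(dataset, batch_sampler=BucketBatchSampler(lengths, 32))
```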

Other details:

1. For iterable-style datasets, I believe all the sampling behavior should be governed by the Dataset's __iter__(). The DataLoader source code does have a __iter__(), but it seems to call the dataset's __iter__() through fetching the dataset. Also, I believe the multi-worker scenario needs to be handled inside the dataset's __iter__(), which I don't have experience doing (see the first sketch after this list).

2. For map-style datasets, on the other hand, the DataLoader uses a sampler/batch sampler that provides lists of indices, and given those indices, it fetches the data through the dataset's __getitem__(). An interesting observation here: when auto collation is turned off, the fetcher calls "self.dataset[possibly_batched_index]", which implies that if the __getitem__() call does not support indexing by a list of indices (which a PyTorch tensor or NumPy array does support), this might break (see the second sketch after this list).

3. I haven't mentioned this enough in the original post, but I think the Sampler likely needs to manage the last batch when the dataset length divided by the batch size has a remainder; otherwise the batch tensor created might crash the model's forward pass (I remember I tried to manage it before passing into the model's forward function, OR just configured the DataLoader to drop the last batch; see the third sketch after this list).
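On point 1, the pattern I have seen for handling the multi-worker case inside __iter__() (this follows the sharding example in the PyTorch IterableDataset docs) is to query get_worker_info() and give each worker a disjoint slice, since otherwise every worker replays the full stream and the data gets duplicated:

```python
import math
from torch.utils.data import IterableDataset, get_worker_info

class RangeIterableDataset(IterableDataset):
    """Toy iterable-style dataset: all sampling logic lives in __iter__()."""

    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        worker_info = get_worker_info()
        if worker_info is None:
            # Single-process loading: iterate the full range
            iter_start, iter_end = self.start, self.end
        else:
            # Multi-worker loading: carve out a disjoint shard per worker
            per_worker = math.ceil((self.end - self.start) / worker_info.num_workers)
            iter_start = self.start + worker_info.id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return iter(range(iter_start, iter_end))

# With num_workers=2, each worker yields its own half of the range
# loader = DataLoader(RangeIterableDataset(0, 10), num_workers=2)
```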
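On point 2, to make the observation concrete, here is a small demo (the class name is mine) of the auto-collation-off path: with batch_size=None and a sampler that yields lists of indices, the fetcher hands the whole list to __getitem__() in one call, so the underlying storage must support fancy indexing:

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler

class TensorBackedDataset:
    """Map-style dataset whose __getitem__ also accepts a list of indices,
    which works here only because tensors support indexing by a list."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # index is an int normally, but a list of ints when auto collation
        # is off and the sampler yields whole batches of indices
        return self.data[index]

dataset = TensorBackedDataset(torch.arange(10))

# batch_size=None turns auto collation off; the fetcher then runs
# "self.dataset[possibly_batched_index]" with the whole list at once
loader = DataLoader(
    dataset,
    batch_size=None,
    sampler=BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False),
)
for batch in loader:
    print(batch)  # tensor([0, 1, 2, 3]), tensor([4, 5, 6, 7]), tensor([8, 9])
```

If self.data were a plain Python list instead, self.data[index] would raise a TypeError on the list of indices, which is exactly the breakage I mean.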
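On point 3, for the lazy route, drop_last is just a DataLoader flag; the 100-sample dataset below is only there to make the arithmetic visible:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 8))  # hypothetical 100-sample dataset

# 100 / 32 leaves a remainder batch of 4; drop_last=True discards it so every
# batch reaching the model's forward has exactly 32 samples
loader = DataLoader(dataset, batch_size=32, drop_last=True)
print(len(loader))  # 3 batches, not 4
```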
