PyTorch Dataset

May 5, 2021


Table of Contents

· Part 1: Dataset from torch.utils.data.Dataset
· Part 2: Dataset from IterableDataset
· Part 3: notes
· Part 4: consistent data loader
· Part 5: reference

Part 1: Dataset from torch.utils.data.Dataset

Before PyTorch 1.2 the only available dataset class was the original “map-style” dataset. This simply requires the user to inherit from the torch.utils.data.Dataset class and implement the __len__ and __getitem__ methods, where __getitem__ receives an index which is mapped to some item in your dataset.

This quote is from the blog post How to Build a Streaming DataLoader with PyTorch, and it summarizes the PyTorch Dataset class well.
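To make the two required methods concrete, here is a minimal sketch of a map-style dataset. The class name SquaresDataset and its contents are made up for illustration; only Dataset and DataLoader come from PyTorch.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: item i is the pair (i, i**2)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of items the DataLoader can index into.
        return self.size

    def __getitem__(self, index):
        # Map an integer index to one item of the dataset.
        return torch.tensor([index, index ** 2])


dataset = SquaresDataset(5)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

# Collect the batches; the last one is smaller because 5 is not divisible by 2.
batches = [batch for batch in loader]
```

Because the dataset has a length and supports indexing, the loader can decide by itself which indices go into each batch.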

Part 2: Dataset from IterableDataset

IterableDataset is particularly suitable for streamed data, where it is difficult to read everything from the stream into a container. The following example is also from the How to Build a Streaming DataLoader with PyTorch blog post:

One interesting detail in this example is how the stream handle is managed. Normally, when we open a stream (a text file, a video file, etc.), we are expected to close it as well.

Note, however, that a data loader built on an IterableDataset does not support shuffling.

from torch.utils.data import IterableDataset, Dataset
from torch.utils.data import DataLoader

class IterableTextFile(IterableDataset):
    def __init__(self, file_name):
        self.file_name = file_name

    def __iter__(self):
        # The file is opened lazily and closed by the with-block once iteration ends.
        with open(self.file_name, "r") as fid:
            for line in fid:
                # Split each line into space-separated tokens and yield them one by one.
                tokens = line.strip('\n').split(' ')
                yield from tokens

if __name__ == "__main__":
    file_name = "abc.txt"
    dataset = IterableTextFile(file_name)
    loader = DataLoader(dataset, batch_size=2, shuffle=False)
    for batch in loader:
        print(batch)

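The shuffling restriction mentioned above can be checked directly. The sketch below uses a made-up CountingStream class instead of a file so that it runs without any data on disk; it is an illustration, not part of the original blog example.

```python
from torch.utils.data import IterableDataset, DataLoader


class CountingStream(IterableDataset):
    """Hypothetical stream that yields 0..n-1 and defines no __len__ or __getitem__."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        yield from range(self.n)


stream = CountingStream(5)

# Sequential batching works: items arrive in stream order.
batches = [batch.tolist() for batch in DataLoader(stream, batch_size=2)]

# Asking for shuffling is rejected as soon as the loader is constructed.
try:
    DataLoader(stream, batch_size=2, shuffle=True)
    shuffle_error = None
except ValueError as err:
    shuffle_error = err
```

The loader cannot shuffle because it never sees indices, only whatever __iter__ yields next.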

Part 3: notes

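  • With the default collate function, each item returned by the dataset must be a type the DataLoader knows how to batch, for example a tensor, a NumPy array, a number, a string, a dictionary, or a list. It will fail if the item is an instance of a user-defined class.
  • shuffle=True means that the item order is reshuffled at every epoch, as the following example and its output show: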
import numpy as np
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    def __getitem__(self, index):
        # return np.random.randint(0, 1000, 3)
        return np.array([index, index, index])

    def __len__(self):
        return 4

if __name__ == "__main__":
    dataset = RandomDataset()
    dataloader = DataLoader(dataset, batch_size=2, num_workers=4, shuffle=True)
    for epoch in range(2):
        for batch in dataloader:
            print(batch)

Output of one run (the first two batches are epoch 0, the last two are epoch 1):

tensor([[2, 2, 2],
        [0, 0, 0]])
tensor([[3, 3, 3],
        [1, 1, 1]])
tensor([[1, 1, 1],
        [0, 0, 0]])
tensor([[2, 2, 2],
        [3, 3, 3]])
  • PyTorch provides a number of pre-loaded datasets.

PyTorch domain libraries provide a number of pre-loaded datasets (such as FashionMNIST) that subclass torch.utils.data.Dataset and implement functions specific to the particular data. They can be used to prototype and benchmark your model. You can find them here: Image Datasets, Text Datasets, and Audio Datasets

  • Check an individual batch from a data loader
dataiter = iter(train_loader)
data = next(dataiter)
features, labels = data

Calling iter(train_loader) creates a fresh iterator, so this operation does not consume train_loader; you can use it freely to check the properties of individual batches.
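The first note in this part (items must be types that the default collation understands) can be illustrated with a small sketch. Point, PointDataset and keep_as_list are hypothetical names introduced only for this example; collate_fn is the standard DataLoader parameter for overriding how items are combined into a batch.

```python
from torch.utils.data import Dataset, DataLoader


class Point:
    """Hypothetical user-defined item type."""

    def __init__(self, x, y):
        self.x, self.y = x, y


class PointDataset(Dataset):
    def __len__(self):
        return 4

    def __getitem__(self, index):
        return Point(index, -index)


dataset = PointDataset()

# The default collation does not know how to batch Point objects.
try:
    next(iter(DataLoader(dataset, batch_size=2)))
    default_failed = False
except TypeError:
    default_failed = True


def keep_as_list(items):
    # A custom collate_fn receives the list of items and decides how to batch them.
    return items


batch = next(iter(DataLoader(dataset, batch_size=2, collate_fn=keep_as_list)))
```

With the custom collate_fn the batch is simply a Python list of Point objects; converting the fields to tensors inside collate_fn would be another option.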

Part 4: consistent data loader

The following example shows how we can get consistent random outputs from a data loader that uses several workers:

import numpy as np
from torch.utils.data import Dataset, DataLoader

initial_seed = 43

def worker_init_fn(worker_id):
    # Re-seed NumPy in each worker: first word of the inherited state + worker id.
    np.random.seed(np.random.get_state()[1][0] + worker_id)

class RandomDataset(Dataset):
    def __getitem__(self, index):
        return np.random.randint(0, 1000, 3)

    def __len__(self):
        return 4

if __name__ == "__main__":
    dataset = RandomDataset()
    dataloader = DataLoader(dataset, batch_size=2, num_workers=4,
                            worker_init_fn=worker_init_fn)
    for epoch in range(3):
        for batch in dataloader:
            print(batch)

This example is from the article Using PyTorch + NumPy? You’re making a mistake.
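To see what worker_init_fn changes, the sketch below simulates workers inside a single process: before each "worker" draws, the same parent NumPy state is restored, which mimics workers that start from a copy of the parent's state. The simulate_workers helper is an assumption made purely for illustration; real workers are separate processes started by the DataLoader.

```python
import numpy as np


def worker_init_fn(worker_id):
    # Same re-seeding rule as in the example above.
    np.random.seed(np.random.get_state()[1][0] + worker_id)


def simulate_workers(num_workers, reseed):
    """Hypothetical helper: every simulated worker starts from a copy of the parent's NumPy state."""
    parent_state = np.random.get_state()
    draws = []
    for worker_id in range(num_workers):
        np.random.set_state(parent_state)  # each worker begins with the same state
        if reseed:
            worker_init_fn(worker_id)
        draws.append(np.random.randint(0, 1000, 3).tolist())
    np.random.set_state(parent_state)  # leave the parent state untouched
    return draws


np.random.seed(43)
without_fix = simulate_workers(4, reseed=False)  # all workers draw the same numbers
with_fix = simulate_workers(4, reseed=True)      # each worker gets its own stream
```

Without re-seeding, all four simulated workers return identical "random" items, which is the duplication problem the article describes.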

I tried the solution mentioned there, but it did not work. My final solution is as follows:

(1) set random seed



(2) use deterministic algorithms

torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

(3) data_loader parameters

num_workers=0  # load data in the main process (no worker subprocesses)

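(4) random seeding in the augmentation code

cur_state = random.getstate()  # snapshot of the global random module state
rg = random.Random()           # separate generator, independent of the global state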
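Steps (1) to (3) can be combined into one small sketch. The seeding calls and the helper name seed_everything are assumptions, since the seeding code for step (1) is not shown above; the seed value 0 is arbitrary.

```python
import random

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


def seed_everything(seed):
    """Hypothetical helper: seed the Python, NumPy and PyTorch generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # cuDNN flags from step (2); they only matter when running on a GPU.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


class RandomDataset(Dataset):
    def __len__(self):
        return 4

    def __getitem__(self, index):
        return np.random.randint(0, 1000, 3)


def run(seed):
    seed_everything(seed)
    # num_workers=0 keeps every random draw in the main process.
    loader = DataLoader(RandomDataset(), batch_size=2, num_workers=0, shuffle=True)
    return [batch.tolist() for batch in loader]


first = run(0)
second = run(0)  # same seed, so the same batches are expected
```

Running with the same seed twice should give identical batches, which is a quick way to check that no unseeded source of randomness is left in the pipeline.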

Part 5: reference