PyTorch Dataset

ifeelfree
3 min read · May 5, 2021

Table of Contents

· Part 1: Dataset from torch.utils.data
· Part 2: Dataset from IterableDataset
· Part 3: notes
· Part 4: consistent data loader
· Part 5: reference

Part 1: Dataset from torch.utils.data

Before PyTorch 1.2 the only available dataset class was the original “map-style” dataset. This simply requires the user to inherit from the torch.utils.data.Dataset class and implement the __len__ and __getitem__ methods, where __getitem__ receives an index which is mapped to some item in your dataset.

This is quoted from the blog post How to Build a Streaming DataLoader with PyTorch, and it summarizes the PyTorch Dataset class well.
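To make the quoted description concrete, here is a minimal sketch of a map-style dataset. The class name SquaresDataset and its contents are made up for illustration; only Dataset, DataLoader, __len__ and __getitem__ come from the text above.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: item i is the pair (i, i * i)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of items; the DataLoader uses it to know which indices exist.
        return self.size

    def __getitem__(self, index):
        # Map an integer index to one sample.
        x = torch.tensor(index, dtype=torch.float32)
        return x, x * x


dataset = SquaresDataset(6)
loader = DataLoader(dataset, batch_size=3, shuffle=False)

# Each batch is a pair of 1-D tensors built from three consecutive samples.
batches = [(x.tolist(), y.tolist()) for x, y in loader]
print(batches)
```

Because the dataset knows its length and can be indexed in any order, the DataLoader is free to shuffle or sample the indices itself.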

Part 2: Dataset from IterableDataset

IterableDataset is particularly suitable for streamed files, where it is difficult to read the whole stream into a container first. The following example is also from the How to Build a Streaming DataLoader with PyTorch blog:

One interesting detail in this example is how the file handle is managed. Normally, when we open a stream (a text file, a video file, etc.) we are supposed to close it as well. Here the file is opened in a with block inside __iter__, so it is closed automatically once iteration finishes.

The drawback is that the data loader cannot shuffle an IterableDataset: passing shuffle=True raises an error.

from torch.utils.data import IterableDataset, DataLoader


class IterableTextFile(IterableDataset):
    def __init__(self, file_name):
        self.file_name = file_name

    def __iter__(self):
        # The file is opened lazily and closed when the with block exits.
        with open(self.file_name, "r") as fid:
            for line in fid:
                # Yield the space-separated tokens of each line one by one.
                tokens = line.strip('\n').split(' ')
                yield from tokens


if __name__ == "__main__":
    file_name = "abc.txt"
    dataset = IterableTextFile(file_name)
    loader = DataLoader(dataset, batch_size=2, shuffle=False)
    for batch in loader:
        print(batch)

See also the REPRODUCIBILITY page of the PyTorch documentation.
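The shuffling limitation mentioned above can be softened with a bounded shuffle buffer. The sketch below is not from the cited blog: ShuffleBuffer, its buffer_size and its seed parameter are hypothetical names, and a plain range stands in for a real stream. It only approximates a full shuffle, since an item can never move further ahead than the buffer allows.

```python
import random

from torch.utils.data import DataLoader, IterableDataset


class ShuffleBuffer(IterableDataset):
    """Hypothetical wrapper: approximate shuffling with a bounded buffer."""

    def __init__(self, source, buffer_size, seed=0):
        self.source = source            # any iterable, e.g. another IterableDataset
        self.buffer_size = buffer_size
        self.seed = seed                # fixed seed, so every pass gives the same order

    def __iter__(self):
        rng = random.Random(self.seed)
        buffer = []
        for item in self.source:
            if len(buffer) < self.buffer_size:
                buffer.append(item)
                continue
            # Buffer is full: emit a random element and reuse its slot.
            idx = rng.randrange(self.buffer_size)
            yield buffer[idx]
            buffer[idx] = item
        # Stream exhausted: emit whatever is left in random order.
        rng.shuffle(buffer)
        yield from buffer


shuffled = ShuffleBuffer(range(10), buffer_size=4)

# The DataLoader itself still refuses to shuffle an IterableDataset.
try:
    DataLoader(shuffled, batch_size=5, shuffle=True)
    rejected = False
except ValueError:
    rejected = True

loader = DataLoader(shuffled, batch_size=5)
batches = [batch.tolist() for batch in loader]
print(rejected, batches)
```

Only buffer_size items are held in memory at once, which keeps the streaming property that motivated IterableDataset in the first place.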

Part 3: notes

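  • Each item returned by the dataset must be a type that the default collate function can batch: tensors, NumPy arrays, numbers, strings, and dictionaries, lists, or tuples of these. Batching fails if the item is a user-defined class object, unless you pass your own collate_fn.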
  • shuffle=True means that the items are reshuffled at every epoch, so each epoch sees a different order:

import numpy as np
from torch.utils.data import Dataset, DataLoader


class RandomDataset(Dataset):
    def __getitem__(self, index):
        # return np.random.randint(0, 1000, 3)
        return np.array([index, index, index])

    def __len__(self):
        return 4


if __name__ == "__main__":
    dataset = RandomDataset()
    dataloader = DataLoader(dataset, batch_size=2, num_workers=4, shuffle=True)
    for epoch in range(2):
        for batch in dataloader:
            print(batch)

Output (the order changes from run to run):

tensor([[2, 2, 2],
        [0, 0, 0]])
tensor([[3, 3, 3],
        [1, 1, 1]])
tensor([[1, 1, 1],
        [0, 0, 0]])
tensor([[2, 2, 2],
        [3, 3, 3]])

  • PyTorch provides a number of pre-loaded datasets.

PyTorch domain libraries provide a number of pre-loaded datasets (such as FashionMNIST) that subclass torch.utils.data.Dataset and implement functions specific to the particular data. They can be used to prototype and benchmark your model. You can find them here: Image Datasets, Text Datasets, and Audio Datasets

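  • Check an individual batch from the data loader:

dataiter = iter(train_loader)
data = next(dataiter)  # use next(); dataiter.next() is unavailable in recent PyTorch releases
features, labels = data

Each call to iter(train_loader) creates a fresh iterator, so this operation does not change train_loader. You can use it freely to inspect the properties of individual batches.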
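The first note above, about item types, can be illustrated with a short sketch. Sample, SampleDataset and keep_as_list are hypothetical names chosen for this example; the only PyTorch feature relied on is the collate_fn argument of DataLoader.

```python
from dataclasses import dataclass

from torch.utils.data import DataLoader, Dataset


@dataclass
class Sample:
    """Hypothetical user-defined item type."""
    index: int
    name: str


class SampleDataset(Dataset):
    def __len__(self):
        return 4

    def __getitem__(self, index):
        return Sample(index=index, name=f"item-{index}")


def keep_as_list(batch):
    # `batch` is the list of items fetched for one batch; return it unchanged.
    return batch


dataset = SampleDataset()

# The default collate function does not know how to merge Sample objects.
try:
    next(iter(DataLoader(dataset, batch_size=2)))
    default_failed = False
except TypeError:
    default_failed = True

# A custom collate_fn decides how a list of items becomes a batch.
loader = DataLoader(dataset, batch_size=2, collate_fn=keep_as_list)
first_batch = next(iter(loader))
print(default_failed, first_batch)
```

Returning the list unchanged is the simplest possible choice; a real collate_fn would usually convert the fields into tensors.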

Part 4: consistent data loader

The following example shows how to get consistent random outputs from a data loader that uses several workers:

import numpy as np
from torch.utils.data import Dataset, DataLoader

initial_seed = 43
np.random.seed(initial_seed)


def worker_init_fn(worker_id):
    # Offset the seed taken from the current NumPy state so each worker gets its own stream.
    np.random.seed(np.random.get_state()[1][0] + worker_id)


class RandomDataset(Dataset):
    def __getitem__(self, index):
        return np.random.randint(0, 1000, 3)

    def __len__(self):
        return 4


dataset = RandomDataset()
dataloader = DataLoader(dataset, batch_size=2, num_workers=4,
                        worker_init_fn=worker_init_fn)
for epoch in range(3):
    # Reseed at the start of every epoch so that epochs differ but runs repeat.
    np.random.seed(initial_seed + epoch)
    for batch in dataloader:
        print(batch)

This is from the blog post "Using PyTorch + NumPy? You’re making a mistake."

I tried the solution mentioned there, but it did not work. My final solution is as follows:

(1) set random seed

random.seed(random_seed)
torch.cuda.manual_seed(random_seed)
torch.manual_seed(random_seed)
np.random.seed(random_seed)

(2) use deterministic algorithm

torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

(3) data_loader parameters

num_workers=0  # load data in the main process, no worker subprocesses

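(4) give the augmentation code its own random generator

cur_state = random.getstate()
rg = random.Random()    # private generator used only by the augmentation code
rg.setstate(cur_state)  # start it from the current global state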
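The four steps above can be gathered into one small script. This is only a sketch under assumptions: seed_everything, NoisyDataset and run_once are hypothetical names, and the dataset is a stand-in that draws from NumPy's global generator like the RandomDataset of this part.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


def seed_everything(seed):
    """Hypothetical helper gathering steps (1) and (2)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)  # ignored when CUDA is not available
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


class NoisyDataset(Dataset):
    """Stand-in dataset whose items depend on NumPy's global generator."""

    def __len__(self):
        return 4

    def __getitem__(self, index):
        return np.random.randint(0, 1000, 3)


def run_once(seed):
    seed_everything(seed)
    # Step (4): a private generator for augmentation, started from the global state.
    augment_rng = random.Random()
    augment_rng.setstate(random.getstate())
    # Step (3): num_workers=0 keeps all data loading in the main process.
    loader = DataLoader(NoisyDataset(), batch_size=2, shuffle=True, num_workers=0)
    batches = [batch.tolist() for batch in loader]
    return batches, augment_rng.random()


first = run_once(43)
second = run_once(43)
print(first == second)
```

Running the loader twice with the same seed yields the same batches and the same augmentation draw, which is the property this part is after. The price of step (3) is that data loading is no longer parallel.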

Part 5: reference
