Table of Contents
· Part 1: Dataset from torch.utils.data
· Part 2: Dataset from IterableDataset
· Part 3: notes
· Part 4: consistent data loader
Part 1: Dataset from torch.utils.data
Before PyTorch 1.2 the only available dataset class was the original “map-style” dataset. This simply requires the user to inherit from the torch.utils.data.Dataset class and implement the __len__ and __getitem__ methods, where __getitem__ receives an index which is mapped to some item in your dataset.
This passage is from the blog post How to Build a Streaming DataLoader with PyTorch, and it summarizes the PyTorch Dataset class well.
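As a minimal sketch of the description above (not taken from the blog post), the following map-style dataset implements only __len__ and __getitem__. The class name SquaresDataset and its contents are made up for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: item i is the pair (i, i**2)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of items; the DataLoader uses it to generate valid indices.
        return self.size

    def __getitem__(self, index):
        # Map an integer index to one item of the dataset.
        return torch.tensor([index, index ** 2])


if __name__ == "__main__":
    dataset = SquaresDataset(6)
    loader = DataLoader(dataset, batch_size=3, shuffle=False)
    for batch in loader:
        print(batch)
```

Each batch here is a tensor of shape (3, 2), because the default collate function stacks the per-item tensors.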
Part 2: Dataset from IterableDataset
IterableDataset is particularly suitable for streamed data, where it is difficult or impractical to read the whole stream into a container first. The example below is also from the How to Build a Streaming DataLoader with PyTorch blog post.
One interesting detail in this example is how the file handle is managed. Normally, when we open a stream (a text file, a video file, etc.), we are also supposed to close it. Here the file is opened in a with block inside __iter__, so it is closed when iteration ends.
Note, however, that the DataLoader cannot shuffle an IterableDataset, so shuffle has to stay False.
from torch.utils.data import IterableDataset, Dataset
from torch.utils.data import DataLoader


class IterableTextFile(IterableDataset):
    def __init__(self, file_name):
        self.file_name = file_name

    def __iter__(self):
        with open(self.file_name, "r") as fid:
            for line in fid:
                tokens = line.strip('\n').split(' ')
                yield from tokens


if __name__ == "__main__":
    file_name = "abc.txt"
    dataset = IterableTextFile(file_name)
    loader = DataLoader(dataset, batch_size=2, shuffle=False)
    for batch in loader:
        print(batch)
See also the REPRODUCIBILITY page of the PyTorch documentation.
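Since shuffle=True is not available for an IterableDataset, one common workaround is to shuffle approximately inside the stream itself. The sketch below is not from the blog post. ShuffleBuffer, buffer_size and seed are hypothetical names, and the fixed seed means the order repeats on every pass.

```python
import random

from torch.utils.data import IterableDataset, DataLoader


class ShuffleBuffer(IterableDataset):
    """Hypothetical wrapper: approximate shuffling with a fixed-size buffer."""

    def __init__(self, source, buffer_size, seed=0):
        self.source = source          # any iterable, e.g. another IterableDataset
        self.buffer_size = buffer_size
        self.seed = seed

    def __iter__(self):
        rng = random.Random(self.seed)  # private generator, same order on every pass
        buffer = []
        for item in self.source:
            if len(buffer) < self.buffer_size:
                buffer.append(item)
                continue
            # Buffer is full: emit a random element and put the new item in its slot.
            pos = rng.randrange(self.buffer_size)
            yield buffer[pos]
            buffer[pos] = item
        # Stream exhausted: emit whatever is left in random order.
        rng.shuffle(buffer)
        yield from buffer


if __name__ == "__main__":
    dataset = ShuffleBuffer(range(10), buffer_size=4)
    loader = DataLoader(dataset, batch_size=2)  # shuffle is left unspecified
    for batch in loader:
        print(batch)
```

A larger buffer_size gives an order closer to a true shuffle, at the cost of holding more items in memory.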
Part 3: notes
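- each item returned by the dataset must be a type that the default collate function knows how to batch, for example a tensor, a NumPy array, a number, a string, a dictionary, or a list. Batching fails if the item is an object of a user-defined class, unless you supply your own collate_fn.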
- shuffle=True means that the items are visited in a different order at each epoch.
import numpy as np
from torch.utils.data import Dataset, DataLoader


class RandomDataset(Dataset):
    def __getitem__(self, index):
        # return np.random.randint(0, 1000, 3)
        return np.array([index, index, index])

    def __len__(self):
        return 4


if __name__ == "__main__":
    dataset = RandomDataset()
    dataloader = DataLoader(dataset, batch_size=2, num_workers=4, shuffle=True)
    for epoch in range(2):
        for batch in dataloader:
            print(batch)

Output:

tensor([[2, 2, 2],
        [0, 0, 0]])
tensor([[3, 3, 3],
        [1, 1, 1]])
tensor([[1, 1, 1],
        [0, 0, 0]])
tensor([[2, 2, 2],
        [3, 3, 3]])
- PyTorch provides a number of pre-loaded datasets.
PyTorch domain libraries provide a number of pre-loaded datasets (such as FashionMNIST) that subclass torch.utils.data.Dataset and implement functions specific to the particular data. They can be used to prototype and benchmark your model. You can find them here: Image Datasets, Text Datasets, and Audio Datasets
- inspect a single batch from the DataLoader
# train_loader is an existing DataLoader
dataiter = iter(train_loader)
data = next(dataiter)  # dataiter.next() no longer works in recent PyTorch versions
features, labels = data
This operation does not modify train_loader, so you can use it freely to check the properties of individual batches.
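To illustrate the first note above: with the default collate function, batching objects of a user-defined class is expected to raise a TypeError. One way around it, sketched here under my own assumptions (Sample, SampleDataset and collate_samples are hypothetical names), is to pass a collate_fn that turns a list of such objects into tensors.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class Sample:
    """Hypothetical user-defined item type."""

    def __init__(self, feature, label):
        self.feature = feature
        self.label = label


class SampleDataset(Dataset):
    def __len__(self):
        return 4

    def __getitem__(self, index):
        # Return a custom object instead of a tensor or array.
        return Sample(feature=float(index), label=index % 2)


def collate_samples(batch):
    # batch is a plain Python list of Sample objects.
    features = torch.tensor([s.feature for s in batch])
    labels = torch.tensor([s.label for s in batch])
    return features, labels


if __name__ == "__main__":
    loader = DataLoader(SampleDataset(), batch_size=2, collate_fn=collate_samples)
    for features, labels in loader:
        print(features, labels)
```

The dataset itself stays unchanged. Only the step that merges individual items into a batch is replaced.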
Part 4: consistent data loader
The following example shows how to get consistent random outputs from a data loader:
import numpy as np
from torch.utils.data import Dataset, DataLoader

initial_seed = 43
np.random.seed(initial_seed)


def worker_init_fn(worker_id):
    # Give each worker a distinct seed derived from the current NumPy state.
    np.random.seed(np.random.get_state()[1][0] + worker_id)


class RandomDataset(Dataset):
    def __getitem__(self, index):
        # np.random.seed(None)
        return np.random.randint(0, 1000, 3)

    def __len__(self):
        return 4


if __name__ == "__main__":
    dataset = RandomDataset()
    dataloader = DataLoader(dataset, batch_size=2, num_workers=4,
                            worker_init_fn=worker_init_fn)

    for epoch in range(3):
        # Reseed the main process before each epoch (the workers are meant to derive their seeds from it).
        np.random.seed(initial_seed + epoch)
        for batch in dataloader:
            print(batch)
This example is from the blog post Using PyTorch + NumPy? You’re making a mistake.
I tried the solution described there, but it did not work for me. My final solution is as follows:
(1) set the random seeds
import random
import numpy as np
import torch
random.seed(random_seed)
torch.cuda.manual_seed(random_seed)
torch.manual_seed(random_seed)
np.random.seed(random_seed)
(2) use deterministic algorithms
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
(3) DataLoader parameters
num_workers=0  # load data in the main process, with no worker subprocesses
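(4) give the augmentation code its own random generator
cur_state = random.getstate()  # snapshot the state of the global generator
rg = random.Random()           # separate generator for the augmentation code
rg.setstate(cur_state)         # start it from the same state as the global one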
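One way to read step (4) is that the augmentation code draws from its own generator, so its random calls do not advance the global random stream. The sketch below illustrates that reading only. The augment function and the make_private_rng helper are hypothetical and not part of my actual pipeline.

```python
import random


def make_private_rng():
    # Copy the current state of the global generator into a separate instance,
    # so later draws from it leave the global generator untouched.
    rg = random.Random()
    rg.setstate(random.getstate())
    return rg


def augment(values, rng):
    """Hypothetical augmentation: add integer jitter drawn from the given generator."""
    return [v + rng.randint(-2, 2) for v in values]


if __name__ == "__main__":
    random.seed(43)
    rg = make_private_rng()
    before = random.getstate()
    print(augment([10, 20, 30], rg))
    # The augmentation draws came from rg, so the global state is unchanged.
    print(random.getstate() == before)
```

Because rg starts from the seeded global state, rerunning the script with the same seed gives the same augmented values.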