Vanilla PyTorch Dataset pipeline #119

gitttt-1234 · 2024-11-14T14:54:22Z

We want to explore implementing a simple torch.utils.data.Dataset-based data pipeline for sleap-nn. Our initial implementation used IterDataPipe, which is being deprecated while new multi-threaded implementations are still in progress. We then transitioned to LitData, which significantly improved training speed. However, LitData comes with a drawback: generating the bin files increases the disk footprint.

We want to give users the option of using either a torch.utils.data.Dataset-based data pipeline with a custom caching implementation (Ref: Ultralytics BaseDataset) or the current LitData pipeline. Additionally, we will benchmark the performance of this new pipeline to evaluate its feasibility as a replacement for LitData and to optimize resource utilization.

PR1:

- Vanilla torch.utils.data.Dataset implementation with no caching.
- Individual torch dataset classes for each model type.
  - Reuse the function pieces used for LitData get_chunks and streaming_datasets.

import torch
from torch.utils.data import DataLoader, Dataset

# process_lf, apply_normalization, apply_sizematcher, apply_augmentation and
# get_centroids stand for the processing functions reused from the LitData
# pipeline; train_labels is the training labels object.


class BaseClass(Dataset):
    # will also include caching logic in PR2

    def __init__(self, labels):
        super().__init__()
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # each model-specific subclass implements its own processing
        raise NotImplementedError


class CentroidDataset(BaseClass):

    def __init__(self, labels, args=None):
        super().__init__(labels)
        self.args = args  # model-specific options (optional)

    def __getitem__(self, idx):
        lf = self.labels[idx]
        sample = process_lf(lf)  # returns dict with `image` and `instances` keys

        sample["image"] = apply_normalization(sample["image"])

        sample = apply_sizematcher(sample)

        sample = apply_augmentation(sample)  # apply augmentation before cropping

        sample = get_centroids(sample)

        return sample


train_dataset = CentroidDataset(train_labels)
train_dataloader = DataLoader(train_dataset)
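
The sketch above depends on sleap-nn helpers that are not shown here. To check the overall pattern (a map-style dataset returning a dict per sample, batched by the default DataLoader collate), here is a self-contained toy version. `ToyCentroidDataset` and the fake `frames` list are hypothetical stand-ins, and the two inline steps only imitate normalization and centroid computation; they are not the sleap-nn implementations.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyCentroidDataset(Dataset):
    """Map-style dataset over (image, keypoints) pairs; a stand-in for labeled frames."""

    def __init__(self, frames):
        super().__init__()
        self.frames = frames

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, idx):
        image, points = self.frames[idx]
        # Stand-in for apply_normalization: scale uint8 pixels to [0, 1].
        image = image.to(torch.float32) / 255.0
        # Stand-in for get_centroids: mean of the instance's keypoints.
        centroid = points.mean(dim=0)
        return {"image": image, "centroid": centroid}


# Four fake grayscale frames (1 x 8 x 8), each with three (x, y) keypoints.
keypoints = torch.tensor([[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]])
frames = [
    (torch.full((1, 8, 8), 255, dtype=torch.uint8), keypoints.clone())
    for _ in range(4)
]

dataset = ToyCentroidDataset(frames)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

# The default collate function stacks each dict entry along a new batch axis.
batch = next(iter(loader))
print(batch["image"].shape, batch["centroid"].shape)
```

Because every sample is a dict of equally shaped tensors, no custom `collate_fn` is needed here; frames with a variable number of instances would need padding or a custom collate function.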

PR2:

- Implement caching (something similar to Ultralytics [BaseDataset](https://github.com/ultralytics/ultralytics/blob/1a5c35366ef4577b00c35f9e8c5d5d0f05a61859/ultralytics/data/base.py#L189)).
- Benchmark the torch Dataset implementation with caching and compare it with LitData and IterDataPipes.
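
As a rough illustration of what the caching step could look like, here is a minimal in-memory sketch. It assumes the whole dataset fits in RAM. `CachedDataset`, `load_fn` and `load_calls` are hypothetical names used only for this example; they are not part of sleap-nn or Ultralytics. Samples are loaded once in `__init__` rather than lazily in `__getitem__`, because with `num_workers > 0` each DataLoader worker holds its own copy of the dataset, so a cache filled lazily inside one worker would not be visible to the others.

```python
import torch
from torch.utils.data import Dataset


class CachedDataset(Dataset):
    """Map-style dataset that can pre-load every sample into memory."""

    def __init__(self, items, load_fn, cache=True):
        super().__init__()
        self.items = items
        self.load_fn = load_fn      # expensive step, e.g. decoding a video frame
        self.load_calls = 0         # counter kept only to make the effect visible
        self.use_cache = cache
        # Fill the cache up front so it exists before any worker copies the dataset.
        self.cache = [self._load(i) for i in range(len(items))] if cache else None

    def _load(self, idx):
        self.load_calls += 1
        return self.load_fn(self.items[idx])

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        sample = self.cache[idx] if self.use_cache else self._load(idx)
        # Return a copy so in-place augmentation cannot corrupt the cached tensor.
        return sample.clone()


def fake_decode(i):
    """Stand-in for reading and decoding frame `i`."""
    return torch.full((1, 4, 4), float(i))


cached = CachedDataset(list(range(3)), fake_decode, cache=True)
first, second = cached[1], cached[1]    # served from memory, no extra loads

uncached = CachedDataset(list(range(3)), fake_decode, cache=False)
_ = uncached[0], uncached[0]            # decoded again on every access
```

Only the deterministic part of the pipeline (loading and decoding) is cached here; random augmentation would still run per access after the cached sample is fetched. A disk-backed variant would trade the RAM cost for extra files, which is the same trade-off that motivates benchmarking against LitData.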

gitttt-1234 mentioned this issue Nov 15, 2024

Add torch Dataset classes #120

Merged

gitttt-1234 linked a pull request Dec 11, 2024 that will close this issue

Add caching to Torch Datasets pipeline #123

Open
