Vanilla PyTorch Dataset pipeline #119

Open
gitttt-1234 opened this issue Nov 14, 2024 · 0 comments · May be fixed by #123

Comments

@gitttt-1234
Contributor

We want to explore implementing a simple torch.utils.data.Dataset-based data pipeline for sleap-nn. Our initial implementation used IterDataPipe, which is being deprecated while its multi-threaded replacements are still under development. We then transitioned to LitData, which significantly improved training speed. However, LitData has a drawback: generating its bin files increases the disk footprint.

We want to let users choose between a torch.utils.data.Dataset-based data pipeline with a custom caching implementation (Ref: Ultralytics BaseDataset) and the current LitData pipeline. Additionally, we will benchmark the performance of the new pipeline to evaluate whether it can replace LitData and reduce resource usage.
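
As a rough illustration of what a custom caching layer on a map-style dataset could look like, here is a minimal sketch. It is not the planned sleap-nn implementation and not Ultralytics' code: `CachedDataset`, `load_fn`, and the `cache` flag are hypothetical names, and the cache is a plain in-memory dict filled lazily on first access.

```python
try:
    from torch.utils.data import Dataset
except ImportError:  # keep the sketch runnable without PyTorch installed
    Dataset = object


class CachedDataset(Dataset):
    """Hypothetical map-style dataset with an optional in-memory cache."""

    def __init__(self, items, load_fn, cache=False):
        super().__init__()
        self.items = items      # e.g. a sequence of labeled frames
        self.load_fn = load_fn  # the expensive per-item loading step
        self.cache = {} if cache else None

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        # Serve from the cache when this index has been loaded before.
        if self.cache is not None and idx in self.cache:
            return self.cache[idx]
        sample = self.load_fn(self.items[idx])
        if self.cache is not None:
            self.cache[idx] = sample
        return sample


calls = []  # records how often the expensive loader actually runs


def slow_load(item):
    calls.append(item)
    return {"image": item * 2}


ds = CachedDataset([1, 2, 3], slow_load, cache=True)
first = ds[0]
second = ds[0]  # second access hits the cache; slow_load is not called again
```

Two caveats apply to a design like this. With `num_workers > 0`, each DataLoader worker process holds its own copy of the dataset, so a lazily filled cache is per worker. And because the cached object itself is returned, any step that modifies a sample in place (for example augmentation) should work on a copy.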

PR1:

  • Vanilla torch.utils.data.Dataset implementation with no caching
  • Have individual torch dataset classes for each model type.
import torch


class BaseClass(torch.utils.data.Dataset):
    # Caching logic will also be added here in PR2.

    def __init__(self, labels):
        super().__init__()
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Implemented by each model-specific subclass.
        raise NotImplementedError


class CentroidDataset(BaseClass):

    def __init__(self, labels, args=None):
        super().__init__(labels)
        self.args = args  # model-specific options

    def __getitem__(self, idx):
        lf = self.labels[idx]
        # process_lf and the apply_*/get_* helpers below are placeholders
        # for the processing steps to be implemented.
        sample = process_lf(lf)  # returns dict with `image` and `instances` keys
        sample["image"] = apply_normalization(sample["image"])
        sample = apply_sizematcher(sample)
        sample = apply_augmentation(sample)  # apply augmentation before cropping
        sample = get_centroids(sample)
        return sample


train_dataset = CentroidDataset(train_labels)
train_dataloader = torch.utils.data.DataLoader(train_dataset)
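
For the planned benchmark against LitData, one simple starting point is to time how many samples per second a dataset yields when indexed sequentially. The sketch below is only an assumption about how that could be measured: `measure_throughput` and `ToyDataset` are hypothetical names, and the toy dataset stands in for a real one such as `CentroidDataset`.

```python
import time


def measure_throughput(dataset, n_epochs=1):
    """Return samples per second when indexing `dataset` sequentially."""
    n_samples = 0
    start = time.perf_counter()
    for _ in range(n_epochs):
        for idx in range(len(dataset)):
            dataset[idx]  # trigger the full per-sample processing
            n_samples += 1
    elapsed = time.perf_counter() - start
    return n_samples / elapsed if elapsed > 0 else float("inf")


class ToyDataset:
    """Stand-in map-style dataset that does a little work per sample."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return {"image": [idx * k for k in range(100)]}


rate = measure_throughput(ToyDataset(50), n_epochs=2)
print(f"{rate:.0f} samples/s")
```

This measures single-process dataset indexing only. A fuller comparison would iterate the DataLoader itself, with the batch size and number of workers used in training, and would also record the disk footprint of each pipeline.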

PR2

@gitttt-1234 gitttt-1234 linked a pull request Dec 11, 2024 that will close this issue