
Releases: pytorch/data

v0.10.1

13 Dec 18:22
a4bed25

What's Changed

This release introduces 3 major changes:

  1. Introducing torchdata.nodes, a library of extensible and composable iterators that lets you chain together common dataloading and pre-proc operations! This initial release includes the following features, with more on the way:

  2. This release drops support for DataPipes and DataLoader2. Release v0.9 was the last stable release which includes them. Please see this issue for more details.

  3. PyTorch's official conda channel is deprecated. TorchData has removed its conda builds as well. TorchData will be available for installation through pip, on PyPI and download.pytorch.org.
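
To make the idea in item 1 concrete, here is a minimal pure-Python sketch of composable iterators chained into a map-then-batch pipeline. The helper names (`map_node`, `batch_node`) are hypothetical stand-ins chosen for illustration; they are not the `torchdata.nodes` API, whose actual classes are described in the library documentation.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_node(source: Iterable[T], fn: Callable[[T], U]) -> Iterator[U]:
    """Hypothetical stage: lazily apply fn to every item of source."""
    for item in source:
        yield fn(item)


def batch_node(source: Iterable[T], batch_size: int, drop_last: bool = False) -> Iterator[List[T]]:
    """Hypothetical stage: group consecutive items into lists of batch_size."""
    batch: List[T] = []
    for item in source:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final partial batch unless the caller asked to drop it.
    if batch and not drop_last:
        yield batch


# Chain the stages: square each sample, then batch in groups of 4.
pipeline = batch_node(map_node(range(10), lambda x: x * x), batch_size=4)
batches = list(pipeline)
print(batches)  # [[0, 1, 4, 9], [16, 25, 36, 49], [64, 81]]
```

Because each stage only consumes an iterable and yields items, stages can be reordered or swapped without touching their neighbours.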

Full Changelog: v0.9.0...v0.10.1

@divyanshk @ramanishsingh @andrewkho

TorchData v0.9.0

21 Oct 22:06
d4bb3e6

What's Changed

This was a relatively small release compared to previous ones. Notably, it will be the last stable release to feature DataPipes and DataLoader2!

New Contributors

Full Changelog: https://github.com/pytorch/data/commits/v0.9.0

TorchData 0.8.0

08 Aug 20:34
265d317


Highlights

We are excited to announce the release of TorchData 0.8.0. This is the first release of StatefulDataLoader, a drop-in replacement for torch.utils.data.DataLoader that offers state_dict/load_state_dict methods for handling mid-epoch checkpointing.
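
The state_dict/load_state_dict contract for mid-epoch checkpointing can be sketched in pure Python as follows. `ResumableLoader` is a hypothetical toy class written for illustration only; it is not how StatefulDataLoader is implemented, and it ignores workers, shuffling, and samplers.

```python
from typing import Any, Dict, Iterator, List, Sequence


class ResumableLoader:
    """Toy batch loader that can be checkpointed mid-epoch (illustration only)."""

    def __init__(self, data: Sequence[Any], batch_size: int) -> None:
        self.data = data
        self.batch_size = batch_size
        self._next_index = 0  # index of the first sample not yet yielded

    def __iter__(self) -> Iterator[List[Any]]:
        # Continue from the saved position (0 for a fresh loader).
        while self._next_index < len(self.data):
            start = self._next_index
            self._next_index = min(start + self.batch_size, len(self.data))
            yield list(self.data[start:self._next_index])
        self._next_index = 0  # epoch finished: the next epoch starts over

    def state_dict(self) -> Dict[str, int]:
        return {"next_index": self._next_index}

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self._next_index = state["next_index"]


loader = ResumableLoader(list(range(10)), batch_size=3)
it = iter(loader)
first = next(it)                   # consume one batch
checkpoint = loader.state_dict()   # snapshot taken mid-epoch

# A new loader restored from the snapshot yields only the unseen batches.
resumed = ResumableLoader(list(range(10)), batch_size=3)
resumed.load_state_dict(checkpoint)
remaining = list(resumed)
```

The point of the contract is that the snapshot records how far iteration has progressed, so a restarted job does not replay samples it has already consumed.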

Deprecations

⚠️ June 2024 Status Update: Removing DataPipes and DataLoader V2

We are re-focusing the torchdata repo to be an iterative enhancement of torch.utils.data.DataLoader. We do not plan on continuing development or maintaining the DataPipes and DataLoaderV2 solutions, and they will be removed from the torchdata repo. We'll also be revisiting the DataPipes references in pytorch/pytorch. In release torchdata==0.8.0 (July 2024) they will be marked as deprecated, and in 0.9.0 (Oct 2024) they will be deleted. Existing users are advised to pin to torchdata==0.8.0 or an older version until they are able to migrate away. Subsequent releases will not include DataPipes or DataLoaderV2. The old version of this README is available here. Please reach out if you have suggestions or comments (please use #1196 for feedback).

Full Changelog: https://github.com/pytorch/data/commits/v0.8.0

TorchData 0.7.1

15 Nov 23:23


Current status
⚠️ As of July 2023, we have paused active development on TorchData and have paused new releases. We have learnt a lot from building it and hearing from users, but also believe we need to re-evaluate the technical design and approach given how much the industry has changed since we began the project. During the rest of 2023 we will be re-evaluating our plans in this space. Please reach out if you have suggestions or comments (please use #1196 for feedback).

This is a patch release, which is compatible with PyTorch 2.1.1. There are no new features added.

TorchData 0.7.0

12 Oct 20:24
c5f2204

Current status

⚠️ As of July 2023, we have paused active development on TorchData and have paused new releases. We have learnt a lot from building it and hearing from users, but also believe we need to re-evaluate the technical design and approach given how much the industry has changed since we began the project. During the rest of 2023 we will be re-evaluating our plans in this space. Please reach out if you have suggestions or comments (please use #1196 for feedback).

Bug Fixes

  • MPRS request/response cycle for workers (40dd648)
  • Sequential reading service checkpointing (8d452cf)
  • Cancel future object and always run callback in FullSync during shutdown (#1171)
  • DataPipe, Ensures Prefetcher shuts down properly (#1166)
  • DataPipe, Fix FullSync shutdown hanging issue while paused (#1153)
  • DataPipe, Fix a word in WebDS DataPipe (#1156)
  • DataPipe, Add handler argument to iopath DataPipes (#1154)
  • Prevent in_memory_cache from yielding from source_dp when it's fully cached (#1160)
  • Fix pin_memory to support single-element batch (#1158)
  • DataLoader2, Removing delegation for 'pause', 'limit', and 'resume' (#1067)
  • DataLoader2, Handle MapDataPipe by converting to IterDataPipe internally by default (#1146)

New Features

  • Implement InProcessReadingService (#1139)
  • Enable miniepoch for MultiProcessingReadingService (#1170)
  • DataPipe, Implement pause/resume for FullSync (#1130)
  • DataLoader2, Saving and restoring initial seed generator (#998)
  • Add ThreadPoolMapper (#1052)

TorchData 0.6.1 Release Notes

08 May 20:15

TorchData 0.6.1 Beta Release Notes

Highlights

This minor release is aligned with PyTorch 2.0.1 and primarily fixes bugs that were introduced in the 0.6.0 release. We sincerely thank our users and contributors for spotting various bugs and helping us fix them.

Bug Fixes

DataLoader2

  • Properly clean up processes and queues for MPRS and Fix pause for prefetch (#1096)
  • Fix DataLoader2 seed = 0 bug (#1098)
    • Previously, if seed = 0 was passed into DataLoader2, the seed value in DataLoader2 would not be set and the seed would be unused. This change fixes that and allows seed = 0 to be used normally.
  • Fix worker_init_fn to update DataPipe graph and move worker prefetch to the end of Worker pipeline (#1100)
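
One common way a "seed = 0 is ignored" bug arises is a truthiness check on the seed. The sketch below illustrates that failure mode and its usual fix; it is an assumption about the general pattern, not the actual DataLoader2 code, and `DEFAULT_SEED`, `resolve_seed_buggy`, and `resolve_seed_fixed` are hypothetical names.

```python
from typing import Optional

DEFAULT_SEED = 1234  # hypothetical fallback used when no seed is supplied


def resolve_seed_buggy(seed: Optional[int]) -> int:
    # Bug: 0 is falsy, so an explicit seed of 0 is treated as "not provided".
    return seed if seed else DEFAULT_SEED


def resolve_seed_fixed(seed: Optional[int]) -> int:
    # Fix: only fall back when the caller really passed None.
    return seed if seed is not None else DEFAULT_SEED


print(resolve_seed_buggy(0))  # 1234 (the requested seed is silently dropped)
print(resolve_seed_fixed(0))  # 0
```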

DataPipe

  • Fix pin_memory_fn to support namedtuple (#1086)
  • Fix typo for portalocker at import time (#1099)

Improvements

DataPipe

  • Skip FullSync operation when world_size == 1 (#1065)

Docs

  • Add long project description to setup.py for display on PyPI (#1094)

Beta Usage Note

This library is currently in the Beta stage and does not yet have a fully stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback. As always, we welcome new contributors to our repo.

TorchData 0.6.0 Release Notes

15 Mar 19:38

TorchData 0.6.0 Beta Release Notes

Highlights

We are excited to announce the release of TorchData 0.6.0. This release is composed of about 130 commits since 0.5.0, made by 27 contributors. We want to sincerely thank our community for continuously improving TorchData.

TorchData 0.6.0 updates are primarily focused on DataLoader2. We graduate some of its APIs from the prototype stage and introduce additional features. Highlights include:

  • Graduation of MultiProcessingReadingService from prototype to beta
    • This is the default ReadingService that we expect most users to use; it closely aligns with the functionalities of old DataLoader with improvements
    • With this graduation, we expect the APIs and behaviors to be mostly stable going forward. We will continue to add new features as they become ready.
  • Introduction of Sequential ReadingService
    • Enables the usage of multiple ReadingServices at the same time
  • Addition of a comprehensive tutorial for DataLoader2 and its subcomponents

Backwards Incompatible Change

DataLoader2

  • Officially graduate PrototypeMultiProcessingReadingService to MultiProcessingReadingService (#1009)
    • The APIs of MultiProcessingReadingService as well as the internal implementation have changed. Overall, this should provide a better user experience.
    • Please refer to our documentation for details.

0.5.0: it previously took the following arguments:
MultiProcessingReadingService(
    num_workers: int = 0,
    pin_memory: bool = False,
    timeout: float = 0,
    worker_init_fn: Optional[Callable[[int], None]] = None,
    multiprocessing_context=None,
    prefetch_factor: Optional[int] = None,
    persistent_workers: bool = False,
)

0.6.0: the new version takes these arguments:
MultiProcessingReadingService(
    num_workers: int = 0,
    multiprocessing_context: Optional[str] = None,
    worker_prefetch_cnt: int = 10,
    main_prefetch_cnt: int = 10,
    worker_init_fn: Optional[Callable[[DataPipe, WorkerInfo], DataPipe]] = None,
    worker_reset_fn: Optional[Callable[[DataPipe, WorkerInfo, SeedGenerator], DataPipe]] = None,
)
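
Note that worker_init_fn changed shape: in 0.5.0 it received a worker id and returned nothing, whereas in 0.6.0 it receives the DataPipe and a WorkerInfo and returns a DataPipe. The sketch below shows a callback of the new shape using stand-in types so that it runs without torchdata; the `WorkerInfo` dataclass and its field names are assumptions, a plain list stands in for the DataPipe, and `shard_for_worker` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkerInfo:
    """Stand-in for the library's WorkerInfo; the field names are assumptions."""
    num_workers: int
    worker_id: int


def shard_for_worker(datapipe: List[int], worker_info: WorkerInfo) -> List[int]:
    """Callback of the new shape: take the pipeline plus worker info, return a pipeline.

    Here the returned pipeline keeps every num_workers-th element,
    offset by this worker's id.
    """
    return datapipe[worker_info.worker_id::worker_info.num_workers]


shards = [
    shard_for_worker(list(range(10)), WorkerInfo(num_workers=3, worker_id=i))
    for i in range(3)
]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Returning the pipeline, rather than mutating global state, is what lets a callback of this shape rewrite the per-worker graph.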
      

  • Deep copy ReadingService during DataLoader2 initialization (#746)
    • Within DataLoader2, a deep copy of the passed-in ReadingService object is created during initialization and will be subsequently used.
    • This prevents multiple DataLoader2s from accidentally sharing states when the same ReadingService object is passed into them.

0.5.0: Previously, a ReadingService object that was used in multiple DataLoader2 instances shared its state among them.
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> rs = MultiProcessingReadingService(num_workers=2)
>>> dl1 = DataLoader2(dp, reading_service=rs)
>>> dl2 = DataLoader2(dp, reading_service=rs)
>>> next(iter(dl1))
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}")
# Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
>>> next(iter(dl2))
# Note that we are still examining `dl1.reading_service` below
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl2`: {len(dl1.reading_service._worker_processes)}")
# Number of processes that exist in `dl1`'s RS after initializing `dl2`: 4

0.6.0: DataLoader2 now deep copies the ReadingService object during initialization, and the ReadingService state is no longer shared.
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> rs = MultiProcessingReadingService(num_workers=2)
>>> dl1 = DataLoader2(dp, reading_service=rs)
>>> dl2 = DataLoader2(dp, reading_service=rs)
>>> next(iter(dl1))
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}")
# Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
>>> next(iter(dl2))
# Note that we are still examining `dl1.reading_service` below
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl2`: {len(dl1.reading_service._worker_processes)}")
# Number of processes that exist in `dl1`'s RS after initializing `dl2`: 2
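
The effect of the deep copy can be reproduced without torchdata. In the sketch below, `ToyReadingService` and `ToyLoader` are hypothetical stand-ins written for illustration; only `copy.deepcopy` is real, and the worker bookkeeping is greatly simplified.

```python
import copy
from typing import List


class ToyReadingService:
    """Hypothetical stand-in that records the workers it has started."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers
        self.worker_processes: List[str] = []

    def start(self) -> None:
        self.worker_processes.extend(f"worker-{i}" for i in range(self.num_workers))


class ToyLoader:
    def __init__(self, reading_service: ToyReadingService, deep_copy: bool) -> None:
        # With deep_copy=True each loader owns a private service object.
        self.reading_service = copy.deepcopy(reading_service) if deep_copy else reading_service


def workers_seen_by_first_loader(deep_copy: bool) -> int:
    rs = ToyReadingService(num_workers=2)
    dl1 = ToyLoader(rs, deep_copy)
    dl2 = ToyLoader(rs, deep_copy)
    dl1.reading_service.start()
    dl2.reading_service.start()
    return len(dl1.reading_service.worker_processes)


print(workers_seen_by_first_loader(deep_copy=False))  # 4 (state is shared)
print(workers_seen_by_first_loader(deep_copy=True))   # 2 (state is isolated)
```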
      

Deprecations

DataPipe

In PyTorch Core

  • Remove previously deprecated FileLoaderDataPipe (#89794)
  • Mark imports from torch.utils.data.datapipes.iter.grouping as deprecated (#94527)

TorchData

  • Remove certain deprecated functional APIs as previously scheduled (#890)

Releng

  • Drop support for Python 3.7 as aligned with PyTorch core library (#974)

New Features

DataLoader2

  • Add graph function to list DataPipes from DataPipe graphs (#888)
  • Add functions to set seeds to DataPipe graphs (#894)
  • Add worker_init_fn and worker_reset_fn to MultiProcessingReadingService (#907)
  • Add round robin sharding to support non-replicable DataPipe for MultiProcessing (#919)
    • Guarantee that DataPipes execute reset_iterator when all loops have received reset request in the dispatching process (#994)
  • Add initial support for randomness control within DataLoader2 (#801)
  • Add support for Sequential ReadingService (commit)
  • Enable SequentialReadingService to support MultiProcessing + Distributed (#985)
  • Add limit, pause, resume operations to halt DataPipes in DataLoader2 (#879)
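
As a rough illustration of the round robin sharding mentioned in the list above, the following sketch distributes items from a single source that cannot be replicated across workers. `round_robin_dispatch` is a hypothetical helper, and the real dispatching process uses inter-process queues rather than in-memory lists.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def round_robin_dispatch(source: Iterable[T], num_workers: int) -> List[List[T]]:
    """Send item i of a single, non-replicated source to worker i % num_workers."""
    queues: List[List[T]] = [[] for _ in range(num_workers)]
    for i, item in enumerate(source):
        queues[i % num_workers].append(item)
    return queues


print(round_robin_dispatch(range(7), 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```

The source is read exactly once in one place, which is why this scheme suits DataPipes that must not be copied into every worker.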

DataPipe

  • Add ShardExpander IterDataPipe (#405)
  • Add RoundRobinDemux IterDataPipe (#903)
  • Implement PinMemory IterDataPipe (#1014)

Releng

  • Add conda Python 3.11 builds (#1010)
  • Enable Python 3.11 conda builds for Mac/Windows (#1026)
  • Update C++ standard to 17 (#1051)

Improvements

DataLoader2

In PyTorch Core

  • Fix apply_sharding to accept one sharding_filter per branch (#90769)

TorchData

  • Consolidate checkpoint contract with checkpoint component (#867)
  • Update load_state_dict() signature to align with TorchSnapshot (#887)
  • Apply sharding based on priority and combine DistInfo and ExtraInfo (used to store distributed metadata) (#916)
  • Prevent reset iteration message from being sent to workers twice (#917)
  • Add support to keep non-replicable DataPipe in the main process (#950)
  • Safeguard DataLoader2Iterator's __getattr__ method (#1004)
  • Forward worker exceptions and have DataLoader2 exit with them (#1003)
  • Attach traceback to Exception and test dispatching process (#1036)

DataPipe

In PyTorch Core

  • Add auto-completion to DataPipes in REPLs (e.g. Jupyter notebook) (#86960)
  • Add group support to sharding_filter (#88424)
  • Add keep_key option to Grouper (#92532)

TorchData

  • Add a masks option to filter files in S3 DataPipe (#880)
  • Make HeaderIterDataPipe with limit=None a no-op (#908)
  • Update fsspec DataPipe to be compatible with the latest version of fsspec (#957)
  • Expand the possible input options for HuggingFace DataPipe (#952)
  • Improve exception handling/skipping in online DataPipes (#968)
  • Allow the option to place key in output in MapKeyZipper (#1042)
  • Allow single key option for Slicer (#1041)

Releng

  • Add pure Python platform-agnostic wheel (#988)

Bug Fixes

DataLoader2

In PyTorch Core

  • Change serialization wrapper implementation to be an iterator (#87459)

DataPipe

In PyTorch Core

  • Fix type checking to accept both Iter and Map DataPipe (#87285)
  • Fix: Make __len__ of datapipes dynamic (#88302)
  • Properly cleanup unclosed files within generator function (#89973)
    ...

TorchData 0.5.1 Beta Release, small bug fix release

16 Dec 15:19

This is a minor release that updates the PyTorch dependency from 1.13.0 to 1.13.1. Please check the release notes of the TorchData 0.5.0 major release for more details.

TorchData 0.5.0 Release Notes

27 Oct 17:08


  • Highlights
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Improvements
  • Bug Fixes
  • Performance
  • Documentation
  • Future Plans
  • Beta Usage Note

Highlights

We are excited to announce the release of TorchData 0.5.0. This release is composed of about 236 commits since 0.4.1, including ones from PyTorch Core since 1.12.1, made by more than 35 contributors. We want to sincerely thank our community for continuously improving TorchData.

TorchData 0.5.0 updates are focused on consolidating the DataLoader2 and ReadingService APIs and benchmarking. Highlights include:

  • Added support to load data from more cloud storage providers, now covering AWS, Google Cloud Storage, and Azure. Detailed tutorial can be found here
  • Consolidated API for DataLoader2 and provided a few ReadingServices, with detailed documentation now available here
  • Provided more comprehensive DataPipe operations, e.g., random_split, repeat, set_length, and prefetch.
  • Provided pre-compiled torchdata binaries for arm64 Apple Silicon

Backwards Incompatible Change

DataPipe

Changed the returned value of MapDataPipe.shuffle to an IterDataPipe (pytorch/pytorch#83202)

IterDataPipe is used to preserve data order

MapDataPipe.shuffle
0.4.1:
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False

0.5.0:
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
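
One way to picture why a shuffled map-style dataset is exposed as an iterable: the shuffle fixes an order in which to visit the indices, and the consumer simply walks that order. The sketch below is a conceptual illustration only, not the PyTorch implementation, and `shuffled_iter` is a hypothetical helper.

```python
import random
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def shuffled_iter(data: Sequence[T], seed: int) -> Iterator[T]:
    """Yield the items of an indexable dataset in a seeded random order."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    for i in indices:
        yield data[i]


items = list(range(10))
first_pass = list(shuffled_iter(items, seed=0))
second_pass = list(shuffled_iter(items, seed=0))

# Same seed, same order; the result is iterable but no longer indexable.
print(first_pass == second_pass)                          # True
print(hasattr(shuffled_iter(items, 0), "__getitem__"))    # False
```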
      

on_disk_cache no longer accepts generator functions for the filepath_fn argument (#810)

on_disk_cache
0.4.1:
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)

0.5.0:
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
# AssertionError
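
A check of this kind can be written with the standard library's `inspect.isgeneratorfunction`. The sketch below shows the general technique under that assumption; `check_filepath_fn` is a hypothetical helper and is not taken from the torchdata source.

```python
import inspect
from typing import Callable


def check_filepath_fn(fn: Callable) -> None:
    """Reject generator functions, similar in spirit to the behavior described above."""
    assert not inspect.isgeneratorfunction(fn), "filepath_fn must not be a generator function"


def filepath_gen_fn(url: str):
    # Generator function: rejected by the check.
    yield from [url + f"/{i}" for i in range(3)]


def filepath_fn(url: str) -> str:
    # Plain function returning a single path: accepted by the check.
    return url + "/0"


check_filepath_fn(filepath_fn)  # passes silently

try:
    check_filepath_fn(filepath_gen_fn)
except AssertionError as exc:
    print(f"rejected: {exc}")
```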
      

DataLoader2

Imposed single iterator constraint on DataLoader2 (#700)

DataLoader2 with a single iterator
0.4.1:
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # No reset here
>>> print(next(it2))
1
>>> print(next(it1))
2

0.5.0:
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
>>> print(next(it2))
0
>>> print(next(it1))
# Raises exception, since it1 is no longer valid
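
The single iterator constraint can be approximated in a few lines of plain Python: every call to `iter()` bumps a counter, and an iterator refuses to continue once the counter has moved past the value it was created with. `SingleIteratorLoader` is a hypothetical toy class for illustration, not the DataLoader2 implementation.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


class SingleIteratorLoader:
    """Toy loader where creating a new iterator invalidates the previous one."""

    def __init__(self, source: Iterable[T]) -> None:
        self._source = source  # must be re-iterable, e.g. a range or list
        self._valid_id = 0     # id of the only iterator allowed to run

    def __iter__(self) -> Iterator[T]:
        self._valid_id += 1
        return self._generate(self._valid_id)

    def _generate(self, iterator_id: int) -> Iterator[T]:
        for item in self._source:
            if iterator_id != self._valid_id:
                raise RuntimeError("This iterator has been invalidated by a newer one.")
            yield item


dl = SingleIteratorLoader(range(10))
it1 = iter(dl)
print(next(it1))  # 0
it2 = iter(dl)    # creating it2 invalidates it1
print(next(it2))  # 0
try:
    next(it1)
except RuntimeError as exc:
    print(f"it1 failed: {exc}")
```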
      

Deep copy DataPipe during DataLoader2 initialization or restoration (#786, #833)

Previously, if a DataPipe is being passed to multiple DataLoaders, the DataPipe's state can be altered by any of those DataLoaders. In some cases, that may raise an exception due to the single iterator constraint; in other cases, some behaviors can be changed due to the adapters (e.g. shuffling) of another DataLoader.

Deep copy DataPipe during DataLoader2 constructor
0.4.1:
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
...     print(x, y)
# RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe...

0.5.0:
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
...     print(x, y)
0 0
1 1
2 2
3 3
4 4
      

Deprecations

DataLoader2

Deprecated traverse function and only_datapipe argument (pytorch/pytorch#85667)

Please use traverse_dps instead; its behavior is the same as traverse with only_datapipe=True. (#793)

DataPipe traverse function
0.4.1:
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)

0.5.0:
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
FutureWarning: `traverse` function and only_datapipe argument will be removed after 1.13.
      

New Features

DataPipe

  • Added AIStore DataPipe (#545, #667)
  • Added support for IterDataPipe to trace DataFrames operations (pytorch/pytorch#71931,
  • Added support for DataFrameMakerIterDataPipe to accept dtype_generator to solve unserializable dtype (#537)
  • Added graph snapshotting by counting number of successful yields for IterDataPipe (pytorch/pytorch#79479, pytorch/pytorch#79657)
  • Implemented drop operation for IterDataPipe to drop column(s) (#725)
  • Implemented FullSyncIterDataPipe to synchronize distributed shards (#713)
  • Implemented slice and flatten operations for IterDataPipe (#730)
  • Implemented repeat operation for IterDataPipe (#748)
  • Added LengthSetterIterDataPipe (#747)
  • Added RandomSplitter (without buffer) (#724)
  • Added padded_tokens to max_token_bucketize to bucketize samples based on total padded token length (#789)
  • Implemented thread based PrefetcherIterDataPipe (#770, #818, #826, #842)

DataLoader2

  • Added CacheTimeout Adapter to redefine cache timeout of the DataPipe graph (#571)
  • Added DistributedReadingService to support uneven data sharding (#727)
  • Added PrototypeMultiProcessingReadingService
    • Added prefetching (#826)
    • Fixed process termination (#837)
    • Enabled deterministic training in distributed/non-distributed environment (#827)
    • Handled empty queue exception properly (#785)

Releng

  • Provided pre-compiled torchdata binaries for arm64 Apple Silicon (#692)

Improvements

DataPipe

  • Fixed error message coming from single iterator constraint (pytorch/pytorch#79547)
  • Enabled profiler record context in __next__ for IterDataPipe (pytorch/pytorch#79757)
  • Raised warning for unpicklable local function (pytorch/pytorch#80232, #547)
  • Cleaned up opened streams on a best-effort basis (#560, pytorch/pytorch#78952)
  • Used streaming reading mode for unseekable streams in TarArchiveLoader (#653)
  • Improved GDrive 'content-disposition' error message (#654)
  • Added as_tuple argument for CSVParserIterDataPipe to convert output from list to tuple (#646)
  • Raised Error when HTTPReader gets a 404 response (#160) (#569)
  • Added default no-op behavior for flatmap (https://...

TorchData 0.4.1 Beta Release, small bug fix release

05 Aug 20:47

TorchData 0.4.1 Release Notes

Bug fixes

Documentation

Releng

  • Provided pre-compiled torchdata binaries for arm64 Apple Silicon (#692)
    • Python [3.8~3.10]