Releases: pytorch/data
v0.10.1
What's Changed
This release introduces 3 major changes:
- Introducing torchdata.nodes, a library of extensible and composable iterators that lets you chain together common data loading and pre-processing operations! This initial release includes the following features, with more on the way (see the sketch after this list):
  - Multi-threaded parallelism, and experimental support for Free-Threaded (No-GIL) Python, in addition to the typical multi-process parallelism.
    - Note: FT Python support is experimental, requires Python 3.13t and torch>=2.5.0, and is currently only tested on Linux.
  - Multi-dataset weighted sampling
  - State management through state_dict/load_state_dict methods
  - Near feature parity with torch.utils.data.DataLoader, with full support for existing torch.utils.data.Dataset (IterableDataset and persistent_workers coming soon!).
  - Refer to the torchdata.nodes docs for more details.
- This release drops support for DataPipes and DataLoader2. Release v0.9 was the last stable release to include them. Please see this issue for more details.
- PyTorch's official conda channel is deprecated, and TorchData has removed its conda builds as well. TorchData will remain available for installation through pip, on PyPI and download.pytorch.org.
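Below is a minimal sketch of chaining torchdata.nodes iterators with mid-epoch checkpointing. The class names (IterableWrapper, ParallelMapper, Batcher, Loader) follow the torchdata.nodes docs, but treat the exact constructor arguments as an assumption and verify them against the documentation for your version.
from torchdata.nodes import Batcher, IterableWrapper, Loader, ParallelMapper

# Wrap any iterable as the root node of the pipeline.
node = IterableWrapper(range(10))
# Apply a function to each item; method="thread" uses multi-threaded parallelism,
# method="process" uses the typical multi-process parallelism.
node = ParallelMapper(node, map_fn=lambda x: x * 2, num_workers=4, method="thread")
# Group items into batches of 3.
node = Batcher(node, batch_size=3)

# Loader turns the pipeline into a reusable iterable and exposes
# state_dict()/load_state_dict() for checkpointing.
loader = Loader(node)
for batch in loader:
    print(batch)

snapshot = loader.state_dict()    # capture progress mid-epoch
loader.load_state_dict(snapshot)  # resume from that point later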
Full Changelog: v0.9.0...v0.10.1
TorchData v0.9.0
What's Changed
This was a relatively small release compared to previous ones. Notably, this will be the last stable release to feature DataPipes and DataLoader2!
- Drop Python 3.8 support
- Make DistributedSampler stateful by @ramanishsingh in #1315
New Contributors
- @jovianjaison made their first contribution in #1314
- @ramanishsingh made their first contribution in #1315
Full Changelog: https://github.com/pytorch/data/commits/v0.9.0
TorchData 0.8.0
Highlights
We are excited to announce the release of TorchData 0.8.0. This is the first release of StatefulDataLoader, a drop-in replacement for torch.utils.data.DataLoader that offers state_dict/load_state_dict methods for handling mid-epoch checkpointing.
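A minimal sketch of mid-epoch checkpointing with StatefulDataLoader; the import path and the state_dict/load_state_dict methods come from the torchdata docs, while the list dataset below is just a stand-in for a real dataset.
from torchdata.stateful_dataloader import StatefulDataLoader

dataset = list(range(100))  # any map-style dataset works here
dl = StatefulDataLoader(dataset, batch_size=4, num_workers=2)

it = iter(dl)
for _ in range(5):
    next(it)

# Capture the loader's position (including worker and RNG state) mid-epoch...
snapshot = dl.state_dict()

# ...and restore it later to resume from the sixth batch instead of the first.
restored = StatefulDataLoader(dataset, batch_size=4, num_workers=2)
restored.load_state_dict(snapshot)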
Deprecations
We are re-focusing the torchdata repo to be an iterative enhancement of torch.utils.data.DataLoader. We do not plan on continuing development or maintaining the [DataPipes] and [DataLoaderV2] solutions, and they will be removed from the torchdata repo. We'll also be revisiting the DataPipes references in pytorch/pytorch. In release torchdata==0.8.0 (July 2024) they will be marked as deprecated, and in 0.9.0 (Oct 2024) they will be deleted. Existing users are advised to pin to torchdata==0.8.0 or an older version until they are able to migrate away. Subsequent releases will not include DataPipes or DataLoaderV2. The old version of this README is available here. Please reach out if you have suggestions or comments (please use #1196 for feedback).
Full Changelog: https://github.com/pytorch/data/commits/v0.8.0
TorchData 0.7.1
Current status
This is a patch release, which is compatible with PyTorch 2.1.1. There are no new features added.
TorchData 0.7.0
Current status
Bug Fixes
- MPRS request/response cycle for workers (40dd648)
- Sequential reading service checkpointing (8d452cf)
- Cancel future object and always run callback in FullSync during shutdown (#1171)
- DataPipe, Ensures Prefetcher shuts down properly (#1166)
- DataPipe, Fix FullSync shutdown hanging issue while paused (#1153)
- DataPipe, Fix a word in WebDS DataPipe (#1156)
- DataPipe, Add handler argument to iopath DataPipes (#1154)
- Prevent in_memory_cache from yielding from source_dp when it's fully cached (#1160)
- Fix pin_memory to support single-element batch (#1158)
- DataLoader2, Removing delegation for 'pause', 'limit', and 'resume' (#1067)
- DataLoader2, Handle MapDataPipe by converting to IterDataPipe internally by default (#1146)
New Features
TorchData 0.6.1 Beta Release Notes
Highlights
This minor release is aligned with PyTorch 2.0.1 and primarily fixes bugs that are introduced in the 0.6.0 release. We sincerely thank our users and contributors for spotting various bugs and helping us to fix them.
Bug Fixes
DataLoader2
- Properly clean up processes and queues for MPRS and Fix pause for prefetch (#1096)
- Fix DataLoader2 seed = 0 bug (#1098)
  - Previously, if seed = 0 was passed into DataLoader2, the seed value in DataLoader2 would not be set and the seed would be unused. This change fixes that and allows seed = 0 to be used normally.
- Fix worker_init_fn to update DataPipe graph and move worker prefetch to the end of the worker pipeline (#1100)
DataPipe
Improvements
DataPipe
- Skip FullSync operation when world_size == 1 (#1065)
Docs
- Add long project description to setup.py for display on PyPI (#1094)
Beta Usage Note
This library is currently in the Beta stage and does not yet have a fully stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback. As always, we welcome new contributors to our repo.
TorchData 0.6.0 Beta Release Notes
Highlights
We are excited to announce the release of TorchData 0.6.0. This release is composed of about 130 commits since 0.5.0, made by 27 contributors. We want to sincerely thank our community for continuously improving TorchData.
TorchData 0.6.0 updates are primarily focused on DataLoader2. We graduate some of its APIs from the prototype stage and introduce additional features. Highlights include:
- Graduation of MultiProcessingReadingService from prototype to beta
  - This is the default ReadingService that we expect most users to use; it closely aligns with the functionalities of the old DataLoader, with improvements
  - With this graduation, we expect the APIs and behaviors to be mostly stable going forward. We will continue to add new features as they become ready.
- Introduction of Sequential ReadingService (see the sketch after this list)
  - Enables the usage of multiple ReadingServices at the same time
- Adding comprehensive tutorial of DataLoader2 and its subcomponents
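A hedged sketch of composing two ReadingServices with SequentialReadingService. The class names come from torchdata.dataloader2; the composition order shown (distributed sharding first, then multi-process workers) follows the tutorial, but verify the details against the DataLoader2 docs. DistributedReadingService also assumes torch.distributed has been initialized.
from torchdata.dataloader2 import (
    DataLoader2,
    DistributedReadingService,
    MultiProcessingReadingService,
    SequentialReadingService,
)
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(1000)).shuffle().sharding_filter()

# Chain the two ReadingServices: shard across ranks, then spawn workers per rank.
rs = SequentialReadingService(
    DistributedReadingService(),
    MultiProcessingReadingService(num_workers=2),
)

dl = DataLoader2(dp, reading_service=rs)
for batch in dl:
    pass
dl.shutdown()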
Backwards Incompatible Change
DataLoader2
- Officially graduate PrototypeMultiProcessingReadingService to MultiProcessingReadingService (#1009)
  - The APIs of MultiProcessingReadingService as well as the internal implementation have changed. Overall, this should provide a better user experience.
  - Please refer to our documentation for details.
In 0.5.0, MultiProcessingReadingService took the following arguments:
MultiProcessingReadingService(
num_workers: int = 0,
pin_memory: bool = False,
timeout: float = 0,
worker_init_fn: Optional[Callable[[int], None]] = None,
multiprocessing_context=None,
prefetch_factor: Optional[int] = None,
persistent_workers: bool = False,
)
In 0.6.0, it takes these arguments:
MultiProcessingReadingService(
num_workers: int = 0,
multiprocessing_context: Optional[str] = None,
worker_prefetch_cnt: int = 10,
main_prefetch_cnt: int = 10,
worker_init_fn: Optional[Callable[[DataPipe, WorkerInfo], DataPipe]] = None,
worker_reset_fn: Optional[Callable[[DataPipe, WorkerInfo, SeedGenerator], DataPipe]] = None,
)
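For orientation, here is a minimal sketch of constructing DataLoader2 with the 0.6.0-style MultiProcessingReadingService; the prefetch counts shown are simply the documented defaults, not a tuning recommendation.
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(100)).shuffle().sharding_filter().batch(8)

# worker_prefetch_cnt/main_prefetch_cnt replace the old prefetch_factor knob.
rs = MultiProcessingReadingService(
    num_workers=2,
    worker_prefetch_cnt=10,
    main_prefetch_cnt=10,
)

dl = DataLoader2(dp, reading_service=rs)
for batch in dl:
    pass
dl.shutdown()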
- Deep copy ReadingService during DataLoader2 initialization (#746)
  - Within DataLoader2, a deep copy of the passed-in ReadingService object is created during initialization and will be subsequently used.
  - This prevents multiple DataLoader2s from accidentally sharing state when the same ReadingService object is passed into them.
In 0.5.0, a ReadingService object used in multiple DataLoader2 instances shared state among them:
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> rs = MultiProcessingReadingService(num_workers=2)
>>> dl1 = DataLoader2(dp, reading_service=rs)
>>> dl2 = DataLoader2(dp, reading_service=rs)
>>> next(iter(dl1))
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}")
# Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
>>> next(iter(dl2))
# Note that we are still examining `dl1.reading_service` below
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl2`: {len(dl1.reading_service._worker_processes)}")
# Number of processes that exist in `dl1`'s RS after initializing `dl2`: 4
In 0.6.0, DataLoader2 deep copies the ReadingService object during initialization, so the ReadingService state is no longer shared:
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> rs = MultiProcessingReadingService(num_workers=2)
>>> dl1 = DataLoader2(dp, reading_service=rs)
>>> dl2 = DataLoader2(dp, reading_service=rs)
>>> next(iter(dl1))
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}")
# Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
>>> next(iter(dl2))
# Note that we are still examining `dl1.reading_service` below
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl2`: {len(dl1.reading_service._worker_processes)}")
# Number of processes that exist in `dl1`'s RS after initializing `dl2`: 2
Deprecations
DataPipe
In PyTorch Core
- Remove previously deprecated FileLoaderDataPipe (#89794)
- Mark imports from torch.utils.data.datapipes.iter.grouping as deprecated (#94527)
TorchData
- Remove certain deprecated functional APIs as previously scheduled (#890)
Releng
- Drop support for Python 3.7 as aligned with PyTorch core library (#974)
New Features
DataLoader2
- Add graph function to list DataPipes from DataPipe graphs (#888)
- Add functions to set seeds to DataPipe graphs (#894)
- Add worker_init_fn and worker_reset_fn to MultiProcessingReadingService (#907)
- Add round robin sharding to support non-replicable DataPipe for MultiProcessing (#919)
- Guarantee that DataPipes execute reset_iterator when all loops have received reset request in the dispatching process (#994)
- Add initial support for randomness control within DataLoader2 (#801)
- Add support for Sequential ReadingService (commit)
- Enable SequentialReadingService to support MultiProcessing + Distributed (#985)
- Add limit, pause, resume operations to halt DataPipes in DataLoader2 (#879)
DataPipe
- Add ShardExpander IterDataPipe (#405)
- Add RoundRobinDemux IterDataPipe (#903)
- Implement PinMemory IterDataPipe (#1014)
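A small, hedged sketch of the new PinMemory DataPipe; the functional spelling dp.pin_memory() is assumed to mirror the class name, so verify it against the DataPipe docs before use.
import torch
from torchdata.datapipes.iter import IterableWrapper

# Build a pipeline that ends with tensors, then pin them to speed up
# host-to-device copies (only meaningful when CUDA is available).
dp = IterableWrapper(range(16)).map(lambda i: torch.tensor([i])).batch(4)
if torch.cuda.is_available():
    dp = dp.pin_memory()

for batch in dp:
    print(batch)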
Releng
- Add conda Python 3.11 builds (#1010)
- Enable Python 3.11 conda builds for Mac/Windows (#1026)
- Update C++ standard to 17 (#1051)
Improvements
DataLoader2
In PyTorch Core
- Fix apply_sharding to accept one sharding_filter per branch (#90769)
TorchData
- Consolidate checkpoint contract with checkpoint component (#867)
- Update load_state_dict() signature to align with TorchSnapshot (#887)
- Apply sharding based on priority and combine DistInfo and ExtraInfo (used to store distributed metadata) (#916)
- Prevent reset iteration message from being sent to workers twice (#917)
- Add support to keep non-replicable DataPipe in the main process (#950)
- Safeguard DataLoader2Iterator's __getattr__ method (#1004)
- Forward worker exceptions and have DataLoader2 exit with them (#1003)
- Attach traceback to Exception and test dispatching process (#1036)
DataPipe
In PyTorch Core
- Add auto-completion to DataPipes in REPLs (e.g. Jupyter notebook) (#86960)
- Add group support to sharding_filter (#88424)
- Add keep_key option to Grouper (#92532)
TorchData
- Add a masks option to filter files in S3 DataPipe (#880)
- Make HeaderIterDataPipe with limit=None a no-op (#908)
- Update fsspec DataPipe to be compatible with the latest version of fsspec (#957)
- Expand the possible input options for HuggingFace DataPipe (#952)
- Improve exception handling/skipping in online DataPipes (#968)
- Allow the option to place key in output in MapKeyZipper (#1042)
- Allow single key option for Slicer (#1041)
Releng
- Add pure Python platform-agnostic wheel (#988)
Bug Fixes
DataLoader2
In PyTorch Core
- Change serialization wrapper implementation to be an iterator (#87459)
DataPipe
In PyTorch Core
TorchData 0.5.1 Beta Release, small bug fix release
This is a minor release to update the PyTorch dependency from 1.13.0 to 1.13.1. Please check the release notes of the TorchData 0.5.0 major release for more detail.
TorchData 0.5.0 Release Notes
- Highlights
- Backwards Incompatible Change
- Deprecations
- New Features
- Improvements
- Bug Fixes
- Performance
- Documentation
- Future Plans
- Beta Usage Note
Highlights
We are excited to announce the release of TorchData 0.5.0. This release is composed of about 236 commits since 0.4.1, including ones from PyTorch Core since 1.12.1, made by more than 35 contributors. We want to sincerely thank our community for continuously improving TorchData.
TorchData 0.5.0 updates are focused on consolidating the DataLoader2 and ReadingService APIs and benchmarking. Highlights include:
- Added support to load data from more cloud storage providers, now covering AWS, Google Cloud Storage, and Azure. Detailed tutorial can be found here
- AWS S3 Benchmarking result
- Consolidated API for DataLoader2 and provided a few ReadingServices, with detailed documentation now available here
- Provided more comprehensive DataPipe operations, e.g., random_split, repeat, set_length, and prefetch (see the sketch after this list)
- Provided pre-compiled torchdata binaries for arm64 Apple Silicon
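A hedged sketch of the new random_split operation; the keyword names (total_length, weights, seed) follow the RandomSplitter documentation, but double-check them against your installed version.
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(10))

# Split the pipe into two named partitions; the fixed seed makes the
# split reproducible across runs.
train_dp, valid_dp = dp.random_split(
    total_length=10,
    weights={"train": 0.8, "valid": 0.2},
    seed=0,
)

print(list(train_dp))  # 8 elements
print(list(valid_dp))  # the remaining 2 elements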
Backwards Incompatible Change
DataPipe
Changed the returned value of MapDataPipe.shuffle to an IterDataPipe (pytorch/pytorch#83202). IterDataPipe is used to preserve data order.
In 0.4.1:
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
In 0.5.0:
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
on_disk_cache no longer accepts generator functions for the filepath_fn argument (#810)
In 0.4.1:
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
…     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
In 0.5.0:
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
…     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
# AssertionError
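For contrast, a short sketch of the list-returning form that 0.5.0 expects in place of a generator; the URL is the same placeholder used in the example above.
from torchdata.datapipes.iter import IterableWrapper

url_dp = IterableWrapper(["https://path/to/filename"])

# Return a list (or tuple) of paths instead of yielding them;
# generator functions now fail with an AssertionError.
def filepath_fn(url):
    return [url + f"/{i}" for i in range(3)]

cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_fn)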
DataLoader2
Imposed single iterator constraint on DataLoader2 (#700)
In 0.4.1:
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl) # No reset here
>>> print(next(it2))
1
>>> print(next(it1))
2
In 0.5.0:
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl) # DataLoader2 resets with the creation of a new iterator
>>> print(next(it2))
0
>>> print(next(it1))
# Raises exception, since it1 is no longer valid
Deep copy DataPipe during DataLoader2 initialization or restoration (#786, #833)
Previously, if a DataPipe was passed to multiple DataLoaders, the DataPipe's state could be altered by any of those DataLoaders. In some cases, that could raise an exception due to the single iterator constraint; in other cases, some behaviors could change due to the adapters (e.g. shuffling) of another DataLoader.
In 0.4.1:
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
… print(x, y)
# RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe...
In 0.5.0:
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
… print(x, y)
0 0
1 1
2 2
3 3
4 4
Deprecations
DataLoader2
Deprecated the traverse function and the only_datapipe argument (pytorch/pytorch#85667). Please use traverse_dps, whose behavior is the same as only_datapipe=True. (#793)
In 0.4.1:
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
In 0.5.0:
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
FutureWarning: `traverse` function and only_datapipe argument will be removed after 1.13.
New Features
DataPipe
- Added AIStore DataPipe (#545, #667)
- Added support for IterDataPipe to trace DataFrames operations (pytorch/pytorch#71931, …)
- Added support for DataFrameMakerIterDataPipe to accept dtype_generator to solve unserializable dtype (#537)
- Added graph snapshotting by counting number of successful yields for IterDataPipe (pytorch/pytorch#79479, pytorch/pytorch#79657)
- Implemented drop operation for IterDataPipe to drop column(s) (#725)
- Implemented FullSyncIterDataPipe to synchronize distributed shards (#713)
- Implemented slice and flatten operations for IterDataPipe (#730) (see the sketch after this list)
- Implemented repeat operation for IterDataPipe (#748)
- Added LengthSetterIterDataPipe (#747)
- Added RandomSplitter (without buffer) (#724)
- Added padden_tokens to max_token_bucketize to bucketize samples based on total padded token length (#789)
- Implemented thread-based PrefetcherIterDataPipe (#770, #818, #826, #842)
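A hedged sketch of the new drop, flatten, slice, and repeat operations; the functional names are assumed to mirror the class names listed above, so verify the exact keyword arguments against the DataPipe docs.
from torchdata.datapipes.iter import IterableWrapper

# Rows of (id, features, label); drop/slice/flatten operate on these columns.
rows = [(0, (1.0, 2.0), "a"), (1, (3.0, 4.0), "b")]

no_label = IterableWrapper(rows).drop(2)     # remove the label column
flat = IterableWrapper(rows).flatten(1)      # splice the nested features into the row
ids_only = IterableWrapper(rows).slice([0])  # keep only the id column
twice = IterableWrapper(rows).repeat(2)      # yield every element twice in a row

print(list(no_label))
print(list(flat))
print(list(ids_only))
print(list(twice))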
DataLoader2
- Added CacheTimeout Adapter to redefine cache timeout of the DataPipe graph (#571)
- Added DistributedReadingService to support uneven data sharding (#727)
- Added PrototypeMultiProcessingReadingService
Releng
- Provided pre-compiled torchdata binaries for arm64 Apple Silicon (#692)
Improvements
DataPipe
- Fixed error message coming from single iterator constraint (pytorch/pytorch#79547)
- Enabled profiler record context in __next__ for IterDataPipe (pytorch/pytorch#79757)
- Raised warning for unpicklable local function (pytorch/pytorch#80232, #547)
- Cleaned up opened streams on a best-effort basis (#560, pytorch/pytorch#78952)
- Used streaming reading mode for unseekable streams in TarArchiveLoader (#653)
- Improved GDrive 'content-disposition' error message (#654)
- Added as_tuple argument for CSVParserIterDataPipe to convert output from list to tuple (#646)
- Raised Error when HTTPReader gets a 404 Response (#160) (#569)
- Added default no-op behavior for flatmap (https://...
TorchData 0.4.1 Beta Release, small bug fix release
TorchData 0.4.1 Release Notes
Bug fixes
- Fixed DataPipe working with DataLoader in the distributed environment (pytorch/pytorch#80348, pytorch/pytorch#81071, pytorch/pytorch#81071)
Documentation
Releng
- Provided pre-compiled torchdata binaries for arm64 Apple Silicon (#692)
  - Python [3.8~3.10]