
[DataPipe] file cache #407

Conversation

tmbdev
Copy link
Contributor

@tmbdev tmbdev commented May 13, 2022

This PR adds a file caching filter. The filter receives (fname, stream) pairs; if necessary, it downloads all the data in the stream to a local file derived from the filename. It then passes an (fname, stream) pair on to the next stage.

This is particularly useful with WebDataset, where FileCache can be used to cache shards incrementally as they are downloaded from remote locations, but the filter works with arbitrary (fname, stream) pairs.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 13, 2022
@VitalyFedyunin
Copy link
Contributor

This can be achieved with the existing on_disk_cache DataPipe, PTAL:

        nfiles = 100
        testdata = "hello, world"
        dest = os.path.join(self.temp_dir.name, "testdata")
        with open(dest, "w") as stream:
            stream.write(testdata)

        dp = IterableWrapper([dest] * nfiles)

        def _noop(x):
            return x

        dp = dp.on_disk_cache(filepath_fn=_noop)

        # This could be a download; for the sake of the example
        # I just write text into the file. Returning the filename
        # lets end_caching's filepath_fn locate the cached file.
        def _write(filename):
            with open(filename, 'w') as fh:
                fh.write(testdata)
            return filename
        dp = dp.map(_write)

        dp = dp.end_caching(mode="t", filepath_fn=_noop, timeout=120)
        dp = FileOpener(dp)

        count = 0
        for path, stream in dp:
            data = stream.read()
            count += 1
            assert data == testdata
        assert count == nfiles

@VitalyFedyunin VitalyFedyunin self-requested a review May 19, 2022 19:37
@VitalyFedyunin VitalyFedyunin changed the title file cache [DataPipe] file cache May 19, 2022
@tmbdev
Copy link
Contributor Author

tmbdev commented May 20, 2022

I'm not particularly attached to my implementation, but I think file caching is something that people should be able to add very easily to a pipeline with just "dp.filecache(dirname)".

(In fact, it might be a good idea to just have it default to the environment variable and not cache if the environment variable is unset.)

I haven't seen use cases for the generality that the current cache implementation provides, and I think it will discourage the use of caching. Also, it looks like your caching implementation may mix downloading with caching, whereas the .popen/.filecache combo separates them, making it easier to keep training pipelines location-transparent.
