Avoiding large data duplication #6

Open
Lestropie opened this issue May 20, 2024 · 0 comments
Only to be attempted once the software has reached a certain level of maturity.

Consider a use case where the input to the tool is a very large dataset. If the user specifies separate "input dataset" and "output dataset" paths, then naively, even though the tool only modifies metadata files between input and output, all of the very large data files would nevertheless need to be duplicated from input to output.

There are multiple possible software augmentations that could better facilitate this use case; since some are more dangerous than others, any such behaviour would need to be under full user control.

  1. Output dataset contains softlinks to the data files in the input dataset (a rough sketch is given after this list).
    👍 Safe.
    👎 Sharing the output dataset could cause issues if the softlinks are not properly resolved.
    👎 Implies preservation of both the input and output datasets, which hamstrings the purported utility of the tool.

  2. User-specified input and output datasets are the same dataset.
    👍 Intuitive use case; modify a dataset in place according to the user's wishes.
    ✋ Some risk of dataset corruption. This can be mitigated in a few ways (a sketch of the first two is given at the end of this issue):
      - Always write new metadata files before deleting old ones, so that metadata is hopefully not lost upon unexpected termination.
      - Add a signal handler that makes a final attempt to write critical data to disk if a termination event occurs.
      - With adequate logging of the filesystem manipulations to be applied, it should be possible to resume the tool if it is interrupted partway through a conversion.

  3. Perform filesystem move operations from input to output dataset, rather than copies.
    👎 Risky; could leave the dataset in a highly abnormal state if the operation is interrupted.
    👎 Breaks the expected BIDS App interface, where the input dataset is read-only.

  4. Do a "dummy" run; create empty data files in output dataset
    ✋ Only really useful for debugging / testing; not a viable end user solution.
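
For option 1, populating the output dataset could look roughly like the sketch below. The `link_or_copy()` helper, the file-size threshold, and the choice of relative symlinks are all illustrative assumptions, not a description of the tool's actual behaviour.

```python
import os
import shutil
from pathlib import Path

def link_or_copy(src: Path, dst: Path, size_threshold: int = 1 << 20) -> None:
    """Hypothetical helper: symlink large data files, copy small metadata files.

    The one-megabyte threshold and the use of relative symlinks are assumptions.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    if src.stat().st_size >= size_threshold:
        # A relative link keeps the output dataset usable if both datasets are
        # moved together; sharing the output alone would still break the link.
        dst.symlink_to(os.path.relpath(src, start=dst.parent))
    else:
        shutil.copy2(src, dst)

# Example usage over a whole dataset:
# for src in input_root.rglob("*"):
#     if src.is_file():
#         link_or_copy(src, output_root / src.relative_to(input_root))
```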

Finally, if not otherwise specified, the command would simply duplicate all data files from the input dataset to the output dataset.
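
Returning to the mitigations listed under option 2: the write-before-delete step and the signal handler could look roughly like the following sketch. The function names and the use of an atomic rename via `os.replace()` are assumptions made for illustration, not the tool's actual implementation.

```python
import json
import os
import signal
import tempfile
from pathlib import Path

def write_metadata_atomically(path: Path, metadata: dict) -> None:
    """Write the new metadata file next to the old one, then atomically
    replace it, so an unexpected termination never leaves a partial file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(metadata, f, indent=2)
            f.flush()
            os.fsync(f.fileno())       # make sure the new content is on disk
        os.replace(tmp_name, path)     # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp_name)
        raise

def install_termination_handler(flush_callback) -> None:
    """Make a final attempt to flush critical state if SIGINT or SIGTERM arrives."""
    def _handler(signum, frame):
        flush_callback()
        raise SystemExit(128 + signum)
    signal.signal(signal.SIGINT, _handler)
    signal.signal(signal.SIGTERM, _handler)
```

Combined with a log of the pending filesystem manipulations, these two pieces would address the first two mitigation bullets; the third (resuming an interrupted conversion) would additionally require replaying that log.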
