Avoiding large data duplication #6

Open
Lestropie opened this issue May 20, 2024 · 0 comments
Only to be attempted once the software has reached a certain level of maturity.

Consider a use case where the input to the tool is a very large dataset. If the user specifies separate "input dataset" and "output dataset" paths, then naively, even though the tool only modifies metadata files between input and output, all of the very large data files would nevertheless need to be duplicated from input to output.

There are multiple possible software augmentations that could better facilitate this use case; since some are more dangerous than others, any such behaviour would need to be under full user control.

  1. Output dataset contains softlinks to the data files in the input dataset (a rough sketch is given after this list).
    👍 Safe.
    👎 Sharing the output dataset could cause issues if the softlinks are not properly resolved.
    👎 Implies preservation of both the input and output datasets, which hamstrings the purported utility of the tool.

  2. User-specified input and output datasets are the same dataset.
    👍 Intuitive use case; modify a dataset in place according to the user's wishes.
    ✋ Some risk of dataset corruption. This can be mitigated in a few ways (a sketch of the first two is given at the end of this issue):
      - Always write new metadata files before deleting old ones, so that metadata is hopefully not lost upon unexpected termination.
      - Add a signal handler that makes a final attempt to write critical data to disk if a termination event occurs.
      - With adequate logging of the filesystem manipulations to be applied, it should be possible to resume the tool if it is interrupted partway through a conversion.

  3. Perform filesystem move operations from input to output dataset, rather than copies.
    👎 Risky; could leave the dataset in a highly abnormal state if the operation is interrupted.
    👎 Breaks the expected BIDS App interface, where the input dataset is read-only.

  4. Do a "dummy" run; create empty data files in output dataset
    ✋ Only really useful for debugging / testing; not a viable end user solution.
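
For option 1, populating the output dataset could look roughly like the sketch below. The `link_or_copy()` helper, the file-size threshold, and the choice of relative symlinks are all illustrative assumptions, not a description of the tool's actual behaviour.

```python
import os
import shutil
from pathlib import Path

def link_or_copy(src: Path, dst: Path, size_threshold: int = 1 << 20) -> None:
    """Hypothetical helper: symlink large data files, copy small metadata files.

    The one-megabyte threshold and the use of relative symlinks are assumptions.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    if src.stat().st_size >= size_threshold:
        # A relative link keeps the output dataset usable if both datasets are
        # moved together; sharing the output alone would still break the link.
        dst.symlink_to(os.path.relpath(src, start=dst.parent))
    else:
        shutil.copy2(src, dst)

# Example usage over a whole dataset:
# for src in input_root.rglob("*"):
#     if src.is_file():
#         link_or_copy(src, output_root / src.relative_to(input_root))
```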

Finally, if not otherwise specified, the command would simply duplicate all data files from the input dataset to the output dataset.
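
Returning to the mitigations listed under option 2: the write-before-delete step and the signal handler could look roughly like the following sketch. The function names and the use of an atomic rename via `os.replace()` are assumptions made for illustration, not the tool's actual implementation.

```python
import json
import os
import signal
import tempfile
from pathlib import Path

def write_metadata_atomically(path: Path, metadata: dict) -> None:
    """Write the new metadata file next to the old one, then atomically
    replace it, so an unexpected termination never leaves a partial file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(metadata, f, indent=2)
            f.flush()
            os.fsync(f.fileno())       # make sure the new content is on disk
        os.replace(tmp_name, path)     # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp_name)
        raise

def install_termination_handler(flush_callback) -> None:
    """Make a final attempt to flush critical state if SIGINT or SIGTERM arrives."""
    def _handler(signum, frame):
        flush_callback()
        raise SystemExit(128 + signum)
    signal.signal(signal.SIGINT, _handler)
    signal.signal(signal.SIGTERM, _handler)
```

Combined with a log of the pending filesystem manipulations, these two pieces would address the first two mitigation bullets; the third (resuming an interrupted conversion) would additionally require replaying that log.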
