Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements on dependencies calculations #172

Merged
merged 14 commits into from
Apr 16, 2021
Merged

Conversation

Remi-Gau
Copy link
Collaborator

@Remi-Gau Remi-Gau commented Feb 23, 2021

Metadata treatment

Split get_metadata into two functions:

  • metalist = get_meta_list(filename, pattern) returns list of files satisfying pattern with inheritance principle
  • get_metadata(metafile) construct meta structure from json files in passed meta-files list (from above function)

Each data file in dataset receives new field metafile that contains list of files from get_meta_list.

This will facilitate access to metafiles directly. Individual values can be extracted by user if needed using get_metadata.

Dependencies

Each file structure contains dependencies sub-structure with guaranteed fields:

  • explicit -- list of data files containing "IntendedFor" referencing current file.
  • data -- list of files with same name but different extension. This will combine files that are split in header and data (like in Brainvision), also takes care of bval/bvec files
  • group -- list of files that have same name except extension and suffix. This will group file that logically need each other, like functional mri and events tabular file. It also takes care of fmap magnitude1/2 and phasediff

Additional custom subfields can be added for given modalities.

This allow to treat dwi and fmap with generic parse_using_schema. perf likely also, but I can't certify it.

Issues

  • performance. I feel that retrieval of dependencies can be optimised
  • Tests -- was unable to install moxunit
  • Tests on other datasets

@codecov
Copy link

codecov bot commented Feb 27, 2021

Codecov Report

Merging #172 (d332496) into beliy-derivate (ac3bb7f) will not change coverage.
The diff coverage is 0.00%.

❗ Current head d332496 differs from pull request most recent head 6bfd46f. Consider uploading reports for the commit 6bfd46f to get more accurate results
Impacted file tree graph

@@              Coverage Diff               @@
##           beliy-derivate    #172   +/-   ##
==============================================
  Coverage            0.00%   0.00%           
==============================================
  Files                  25      29    +4     
  Lines                 920     951   +31     
==============================================
- Misses                920     951   +31     
Flag Coverage Δ
unittests 0.00% <0.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
+bids/+internal/append_to_layout.m 0.00% <0.00%> (ø)
+bids/+internal/ends_with.m 0.00% <0.00%> (ø)
+bids/+internal/get_meta_list.m 0.00% <0.00%> (ø)
+bids/+internal/get_metadata.m 0.00% <0.00%> (ø)
+bids/+internal/starts_with.m 0.00% <0.00%> (ø)
+bids/layout.m 0.00% <0.00%> (ø)
+bids/query.m 0.00% <0.00%> (ø)
+bids/derivate.m 0.00% <0.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ac3bb7f...6bfd46f. Read the comment docs.

@Remi-Gau
Copy link
Collaborator Author

This should:

For qMRI some files in the derivatives folder also have an intendedFor and they would be missed by the current code.

I suggest we finish this PR as for now and then tackle #146 that concerns the indexing of nested derivative folders before we come back to finish fixing #157 completely.

@Remi-Gau Remi-Gau requested a review from nbeliy February 28, 2021 11:52
@Remi-Gau
Copy link
Collaborator Author

* [ ]  performance. I feel that retrieval of dependencies can be optimised

I suspect that having to load and read the metadata of each file takes a lot of time.

I suspect that this can indeed be optimized but I would worry about that later.

@Remi-Gau
Copy link
Collaborator Author

@nbeliy

I pretty much took your code as is, fixed the test that were not passing and did a bit of refactoring and commenting.

I also:

  • did a soft reset to ditch the derivatives function from this PR so we can tackle separately in your other PR
  • cherry picked a commit that you had on your copy derivatives branch to update the query function

The code is pretty much the same though: thanks for providing all this.

Let me know what you think.

Note: I feel that the current way of indexing all the dependencies creates a lot of "redundancies" in the layout (example: the phase image of fieldmap will list all the associated imagesin dependencies.group like magnitude and the other way around).

But I think that we can also optimize this later, if needed as this way of indexing things makes fewer assumptions about what file is / should be the "main" one for a given group.

@nbeliy
Copy link
Collaborator

nbeliy commented Mar 1, 2021

For qMRI some files in the derivatives folder also have an intendedFor and they would be missed by the current code.

It will still work if IntendedFor in derivatives reference file in itself, not in the original/main dataset. If it's not, we can imagine a search for files in main dataset at condition that main dataset is referenced in dataset_description.json. But then, it will be difficult to implement, considering that main dataset may be defined as DOI or URL.

I suspect that having to load and read the metadata of each file takes a lot of time.

Conserving full metadata for each file can explode memory usage. From inside we need only IntendedFor, and for user a query with 'target', {'meta1', 'meta2'} can be imagined to retrieve several parameters with one go.

Note: I feel that the current way of indexing all the dependencies creates a lot of "redundancies" in the layout (example: the phase image of fieldmap will list all the associated imagesin dependencies.group like magnitude and the other way around).

Sometimes the main file is not well defined -- in phasediff and 2 magnitude, which one is main? In order to simplify we can exclude .tsv files from looking for group dependencies.

+bids/layout.m Show resolved Hide resolved
+bids/layout.m Outdated Show resolved Hide resolved
+bids/query.m Outdated Show resolved Hide resolved
Copy link
Collaborator

@nbeliy nbeliy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for mee, more abstract and generic.

As for optimisation, the list of candidates for metadata (considering only json) is same for each call. Probably it can be saved somewhere.

If (to be confirmed) data files are scanned in alphabetical order, lokking for group dependencies can be limited in file's neighbours.

Also, we probably need to move get_metadata from internal to util. This function can be usefull outside the layout and query

@Remi-Gau
Copy link
Collaborator Author

Remi-Gau commented Mar 1, 2021

Will try to adress some of your questions. Before merging.

Totally agree with getting get_metadata out of internal, was suggested before and I have been dragging my feet to do it. 🤷‍♂️

Was thinking something along the same line for optimization: will open a separate issue to keep track of that.

@Remi-Gau Remi-Gau merged commit 61fd910 into beliy-derivate Apr 16, 2021
@Remi-Gau Remi-Gau deleted the beliy-dependencies branch April 16, 2021 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants