Uniformize .tsv and .tsv.gz: both to have header + both have "Columns" dictionary (!not list) in sidecar .json #71

yarikoptic · 2024-04-17T03:41:11Z

This issue collates two aspects, but I think that is warranted. If desired, we could split it into two.

BIDS 1.x situation

ATM, .tsv.gz files are not just compressed .tsv files, the way .nii.gz relates to .nii -- they are special in that they do not carry a header as .tsv files do.

The "specialty" extends into side-car .json files

- For .tsv files we carry a .json file in which each entry is a structured record describing a column (BIDS 1: tabular-files).
  - In addition, such .json files for some .tsv files, as in the case of _beh.json, also carry metadata fields such as TaskName alongside the column descriptors. This can lead to collisions (snake_case is just a recommendation for column names) and overall makes the schema for such .json files a concoction of two aspects.
  - _motion.tsv files are an exception to the rule (see more below): they are headerless, _channels.{tsv,json} describe the columns, and _motion.json contains extra metadata.
- For .tsv.gz files we carry a .json file with a dedicated Columns field holding a list of header fields, plus, again optionally, descriptors for each column (same possible collision) (BIDS 1: compressed-tabular-files).
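To make the collision concern concrete, here is a minimal sketch of the "subselect" step a tool currently has to perform on a BIDS 1.x-style .tsv sidecar. The function name `split_sidecar` and the example values are assumptions for illustration only, not part of any BIDS tool.

```python
def split_sidecar(sidecar, header):
    """Separate column descriptors from other metadata in a flat sidecar.

    The only way to tell the two apart is to compare keys with the header.
    """
    columns = {key: value for key, value in sidecar.items() if key in header}
    metadata = {key: value for key, value in sidecar.items() if key not in header}
    return columns, metadata


# Hypothetical _beh.json content: metadata and column descriptors share one namespace.
sidecar = {
    "TaskName": "stroop",
    "trial": {"Description": "Trial number"},
    "response_time": {"Description": "Response time", "Units": "s"},
}
header = ["trial", "response_time"]

columns, metadata = split_sidecar(sidecar, header)
```

If the .tsv happened to contain a column literally named `TaskName`, the same key would have to serve as both a metadata field and a column descriptor, which is the ambiguity described above.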

If I got it right (@effigies can correct), the header was excluded from .tsv.gz as "not readily readable". Maybe some folks also remember further details? IMHO the argument is weak, since it is just a matter of an adequate "file opener" abstraction, as is done e.g. in Python. But even if we set that aspect aside, I think we would benefit from a more harmonious approach, which might only require one extra check in the validator.
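As an aside on the "file opener" point, a minimal sketch of such an abstraction using only the Python standard library could look like the following. The helper names `open_tabular` and `read_header` are made up for this example.

```python
import csv
import gzip
from pathlib import Path


def open_tabular(path):
    """Open a .tsv or .tsv.gz file as text, decompressing only when needed."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode="rt", encoding="utf-8", newline="")
    return open(path, mode="rt", encoding="utf-8", newline="")


def read_header(path):
    """Return the first row of the file as a list of column names."""
    with open_tabular(path) as handle:
        return next(csv.reader(handle, delimiter="\t"))
```

With a helper like this, the caller does not need to care whether the header sits in a plain or a gzip-compressed file.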

BIDS 2 proposal

- Both .tsv and .tsv.gz should carry a header; .gz would only signal compression.
- Both .tsv.gz and .tsv should be supported interchangeably across uses.
  - It would be up to the user to choose the most appropriate form based on the use case.
  - It will be RECOMMENDED to use the .tsv form where immediate user readability is desired (subjects.tsv, sessions.tsv, etc.) unless the file is prohibitive in size (e.g. subjects.tsv for 10000 subjects with 100 columns, or something like that).
- The .json for either .tsv or .tsv.gz MAY describe columns within a Columns field, which would be a dict containing records conforming to the current set of fields we reserve for the .json sidecars of .tsv files, plus one OPTIONAL field (possibly RECOMMENDED for .tsv.gz), Index, which would provide ordering information. bids-validator could easily ensure that it corresponds to the column order in the .tsv or .tsv.gz.
  - I didn't look into the JSON specification/libraries to check whether we could also simply "enforce" that the order of entries corresponds to Index, i.e. whether dicts are ordered, as they are now in Python.
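The "one extra check" for the validator could be sketched as follows. This is only an illustration: the function name `check_column_order` is hypothetical, and treating Index as zero-based is an assumption, since the proposal does not fix that detail.

```python
def check_column_order(header, sidecar):
    """Compare a tabular file's header against the sidecar's Columns dict.

    Returns a list of human-readable problems; an empty list means they agree.
    """
    problems = []
    columns = sidecar.get("Columns", {})
    for name, descriptor in columns.items():
        if name not in header:
            problems.append(f"column {name!r} is described but missing from the header")
            continue
        index = descriptor.get("Index")  # OPTIONAL in the proposal, so it may be absent
        position = header.index(name)  # assumption: Index counts from zero
        if index is not None and index != position:
            problems.append(
                f"column {name!r} has Index {index} but is at position {position}"
            )
    return problems


header = ["onset", "duration", "trial_type"]
sidecar = {
    "TaskName": "stroop",  # metadata lives outside Columns, so it cannot collide
    "Columns": {
        "onset": {"Description": "Event onset", "Units": "s", "Index": 0},
        "duration": {"Description": "Event duration", "Units": "s", "Index": 1},
        "trial_type": {"Description": "Condition label", "Index": 2},
    },
}
problems = check_column_order(header, sidecar)
```

Relying on an explicit Index rather than on key order avoids depending on parser behavior: Python's `json` module keeps keys in file order, but RFC 8259 describes a JSON object as an unordered collection.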

Cons

- "Breaking change", as tools would need to adjust how they use the Columns field. But it might actually even be a simplification in some cases (e.g. those _beh.json files), where tools now have to "subselect" which entries are descriptors and which are metadata.
- Tools which can't read the header directly from .tsv.gz through a file abstraction would need to "build" an index based on the Index field. But it is really not rocket science.
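To illustrate the second point, a tool that cannot read the header itself could rebuild the column order from the sidecar in a few lines. This is a sketch under the assumption that every entry in Columns carries an Index; `header_from_sidecar` is a hypothetical name.

```python
def header_from_sidecar(sidecar):
    """Rebuild the ordered list of column names from the Columns/Index fields."""
    columns = sidecar["Columns"]
    missing = [name for name, descriptor in columns.items() if "Index" not in descriptor]
    if missing:
        raise ValueError(f"no Index given for columns: {missing}")
    # Sort column names by their declared position.
    return sorted(columns, key=lambda name: columns[name]["Index"])
```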

Pros

- Uniform handling of .tsv and .tsv.gz: if someone obtains a long .tsv and decides to compress it, it would be just a matter of exactly that, compression, without changing the content (removing the header). This would allow for simpler/generic code/handling.
- Making it possible for the .json to be a sidecar for both metadata and the description of Columns without any ambiguity or possibility of collision (a jsonschema or linkml model would be much easier to construct).


arnodelorme · 2024-04-17T19:24:12Z

I would vote for uniformity between .tsv and .tsv.gz

effigies · 2024-04-17T19:49:14Z

.tsv.gz was a pragmatically useful choice for working with existing non-BIDS tools (e.g., FSL's PNM) that expected separately-entered column identification and accepted compressed data. One possibility to compromise here would be something like:

| Extension | Headers | Compression | Examples |
|---|---|---|---|
| `.tsv` | First line | None | `events.tsv` |
| `.tsv.gz` | First line | `gzip` | New |
| `.bare.tsv` | Sidecar | None | `motion.tsv` -> `motion.bare.tsv` |
| `.bare.tsv.gz` | Sidecar | `gzip` | `physio.tsv.gz` -> `physio.bare.tsv.gz` |
| `.parquet` | In-file | Optional | New |

We could state that any of these is acceptable (perhaps with a preference in some use cases), assume people will use one that matches the typical use case for their dataset, and make a simple tool available to convert among them.
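As a rough idea of how small such a conversion tool could be, here is a sketch that turns a headered `.tsv` into a headerless gzip-compressed file plus a sidecar. It assumes a non-empty input file and stores the header as a Columns list, as BIDS 1.x does for .tsv.gz; the function name `to_bare_gz` and the file names are hypothetical.

```python
import gzip
import json
from pathlib import Path


def to_bare_gz(src, dst, sidecar_path):
    """Move the header of a plain .tsv into a sidecar and gzip the remaining rows."""
    lines = Path(src).read_text(encoding="utf-8").splitlines(keepends=True)
    header = lines[0].rstrip("\r\n").split("\t")  # assumes at least a header line

    # Write the data rows only, compressed, without touching their content.
    with gzip.open(dst, "wt", encoding="utf-8", newline="") as out:
        out.writelines(lines[1:])

    # Keep any metadata already present in the sidecar and record the header.
    sidecar_path = Path(sidecar_path)
    sidecar = {}
    if sidecar_path.exists():
        sidecar = json.loads(sidecar_path.read_text(encoding="utf-8"))
    sidecar["Columns"] = header
    sidecar_path.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
    return header
```

The reverse direction would read Columns from the sidecar and prepend it as the first line of the decompressed file.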

tsalo · 2024-08-08T17:12:28Z

Just want to note that .bare.tsv and .bare.tsv.gz would conflict with BEP017's .sparse.tsv and .dense.tsv. I mean, I guess we could have .bare.dense.tsv... but 😬

EDIT: Based on the maintainers' discussion with Peer: maybe just assume .dense.tsv means "bare".

yarikoptic · 2024-08-09T20:53:12Z

FWIW, although being a .tsv does not generally mandate having a header (outside of BIDS), I feel there was some overfitting to the tool in making .tsv.gz the one without a header while .tsv had to have one. I second @effigies' suggestion above, except that I think the addition of .parquet should be discussed separately, since it is not directly related to the "flavors of .tsv" discussed here.

Re .sparse: I commented on that PR. IMHO that BEP is already not entirely kosher in using .tsv without a header (if I got it right). Also, it seems that only .bare could be .sparse, so maybe it is actually quite consistent and nice to extend @effigies' table to:

| Extension | Headers | Compression | Examples |
|---|---|---|---|
| `.tsv` | First line | None | `events.tsv` |
| `.tsv.gz` | First line | `gzip` | New |
| `.bare.tsv` | Sidecar | None | `motion.tsv` -> `motion.bare.tsv` |
| `.bare.tsv.gz` | Sidecar | `gzip` | `physio.tsv.gz` -> `physio.bare.tsv.gz` |
| `.sparse.tsv` | Sidecar | None | New (for BEP017: `_relmat.sparse.tsv`) |
| `.sparse.tsv.gz` | Sidecar | `gzip` | New (for BEP017: `_relmat.sparse.tsv.gz`) |
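A reader could dispatch on these extensions with a simple lookup. The sketch below just encodes the table above; the names `FLAVORS` and `describe_flavor` are made up for illustration, and none of this is settled specification.

```python
# (where the header lives, compression) for each extension in the table above
FLAVORS = {
    ".tsv": ("first line", None),
    ".tsv.gz": ("first line", "gzip"),
    ".bare.tsv": ("sidecar", None),
    ".bare.tsv.gz": ("sidecar", "gzip"),
    ".sparse.tsv": ("sidecar", None),
    ".sparse.tsv.gz": ("sidecar", "gzip"),
}


def describe_flavor(filename):
    """Return (header location, compression) for a tabular file name."""
    # Try the longest extensions first so ".bare.tsv.gz" is not mistaken for ".tsv.gz".
    for extension in sorted(FLAVORS, key=len, reverse=True):
        if filename.endswith(extension):
            return FLAVORS[extension]
    raise ValueError(f"not a recognized tabular file: {filename}")
```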

yarikoptic · 2024-10-01T14:17:05Z

During the BEP044 call I was made aware that the situation is more "intricate" in the case of motion files (an example is ds004460). There:

- _motion.tsv is headerless;
- columns are described not in _motion.json but rather in _channels.tsv (with _channels.json describing _channels.tsv);
- _motion.json contains common metadata for the _motion.tsv file.
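For illustration, reading such a pair of files could be sketched as below. It assumes that _channels.tsv has a header with a `name` column and lists channels in the same order as the columns of _motion.tsv; `read_motion` is a hypothetical helper, not an existing API.

```python
import csv


def read_motion(motion_path, channels_path):
    """Label the columns of a headerless _motion.tsv using names from _channels.tsv."""
    # Assumption: _channels.tsv has a "name" column, one row per motion column, in order.
    with open(channels_path, newline="", encoding="utf-8") as handle:
        names = [row["name"] for row in csv.DictReader(handle, delimiter="\t")]

    # _motion.tsv has no header line, so every row is data.
    with open(motion_path, newline="", encoding="utf-8") as handle:
        rows = [dict(zip(names, values)) for values in csv.reader(handle, delimiter="\t")]
    return names, rows
```

Values are returned as strings here; converting them to numbers is left out to keep the sketch short.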

This was referenced Apr 17, 2024:

- [ENH] BEP 020 Eye Tracking bids-standard/bids-specification#1128 (Open)
- Allow for .tsv (in addition to .tsv.gz) for _physio files bids-standard/bids-specification#472 (Closed)
