BCF Character/String type MISSING/EOV encoding #618

Open
andersleung opened this issue Dec 22, 2021 · 1 comment

In BCF, the spec gives no MISSING or EOV encoding for the Character/String type. htslib and GenomicsDB define MISSING and EOV for String/Character as 0x07 and 0x00 respectively. However, htslib seems to apply this in one direction only: it converts 0x07 to . when converting BCF to VCF, but does not convert . to 0x07 when writing VCF as BCF.
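To make the read direction concrete, here is a minimal Python sketch of the BCF-to-VCF conversion as described above. It is not htslib's code: the function name `bcf_chars_to_vcf_text` is hypothetical, and treating every 0x07 byte (rather than only a lone one) as `.` is an assumption.

```python
STR_MISSING = 0x07  # MISSING value for Character/String, as quoted in this issue
STR_EOV = 0x00      # END_OF_VECTOR value for Character/String, as quoted in this issue


def bcf_chars_to_vcf_text(raw: bytes) -> str:
    """Render a BCF Character vector as VCF text (hypothetical helper).

    Stops at the first EOV byte and maps each MISSING byte to '.'.
    """
    out = []
    for b in raw:
        if b == STR_EOV:
            break  # the rest of the vector is padding
        out.append("." if b == STR_MISSING else chr(b))
    return "".join(out)
```

The asymmetry is that the reverse direction, as observed, would store `.` as the literal byte 0x2E instead of 0x07, so a BCF to VCF to BCF round trip does not reproduce the original bytes.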

My question is how a VCF record with missing Characters and missing Strings should be encoded in BCF. If the spec follows htslib, I think a missing Character should be defined as a length-1 String whose only byte is 0x07, and a missing String, being an entirely missing vector of Character, would be [0x07,0x00,0x00,...] because of #617.
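A sketch of the encoding proposed here, in Python. The function names and the `width` parameter are hypothetical, and padding every value to a fixed width with EOV bytes is my reading of what #617 implies, not something the spec states.

```python
STR_MISSING = 0x07  # proposed MISSING byte for Character/String
STR_EOV = 0x00      # proposed END_OF_VECTOR byte for Character/String


def encode_missing_character() -> bytes:
    """A missing Character: a length-1 String whose only byte is MISSING."""
    return bytes([STR_MISSING])


def encode_string(value, width: int) -> bytes:
    """Encode a String padded to `width` bytes (hypothetical helper).

    `value` is a str, or None for a missing String. Assumes 7-bit ASCII
    content and EOV padding up to the fixed width.
    """
    payload = bytes([STR_MISSING]) if value is None else value.encode("ascii")
    if len(payload) > width:
        raise ValueError("value longer than the vector width")
    return payload + bytes([STR_EOV]) * (width - len(payload))
```

For example, `encode_string(None, 4)` gives the bytes `07 00 00 00`, matching the `[0x07,0x00,0x00,...]` form above, while `encode_string("AB", 4)` gives `41 42 00 00`.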

As a separate issue, it is not well defined what the Character type in VCF means. In BCF, Character is one 7-bit ASCII byte, but in VCF, which is UTF-8 encoded, Character could be a byte, a Unicode codepoint, or a grapheme.
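A small Python illustration of why the three readings differ. The sample text is my own choice, not taken from any VCF file: the letter "e" followed by a combining acute accent, which displays as a single grapheme.

```python
# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one grapheme on screen.
text = "e\u0301"

n_bytes = len(text.encode("utf-8"))  # UTF-8 bytes: 1 for 'e' + 2 for U+0301
n_codepoints = len(text)             # Python's len() counts Unicode codepoints
fits_bcf_character = text.isascii() and n_bytes == 1  # one 7-bit ASCII byte?

print(n_bytes, n_codepoints, fits_bcf_character)
```

Here the same "Character" is 3 bytes, 2 codepoints, and 1 grapheme, and it cannot be stored as a BCF Character at all. Counting graphemes is omitted because Python's standard library has no grapheme segmentation; it would need a third-party library.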

jkbonfield added the vcf label Jan 6, 2022

h-2 commented Jan 25, 2022

I second this.

The specification of (partly) empty vectors is really imprecise. See also #593.
