Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added class to URLs in the response #322

Merged
merged 9 commits into from
Apr 24, 2019
Merged

Conversation

cyenyxe
Copy link
Member

@cyenyxe cyenyxe commented Jun 7, 2018

New class attribute for URLs in the htsget response, with supported values 'header' and 'body'.

In the example, I first try writing the class after the body but it was a bit confusing because the URL was the same, only the byte range changed. So I decided to write the class at the end of each object.

@cyenyxe cyenyxe added the htsget label Jun 7, 2018
@jeromekelleher
Copy link
Member

I think the idea is very elegant. class seems a good word to use here, and "header" and "body" seem like the right values as well.

My only issue here is that we may need to include more explanatory text. In #311 I tried to lay out the assumptions that we're making about the file formats, but maybe that was overboard. What do others think?

@jmarshall
Copy link
Member

I was mulling over explanatory text over the weekend, and can write something up if @cyenyxe prefers.

Should this be optional so that existing implementations are still compliant? Leaving class off might mean you could make no assumptions about the contents of that sub-URL. Perhaps it would be best to say that class must be present on all of them or none of them.

@jeromekelleher
Copy link
Member

Should this be optional so that existing implementations are still compliant? Leaving class off might mean you could make no assumptions about the contents of that sub-URL. Perhaps it would be best to say that class must be present on all of them or none of them.

Good question. I think we have to support older implementations that don't know about this, so the burden is on the client. I think you're right: either ALL urls should have a class attribute, or none of them should have it.

@jmarshall
Copy link
Member

jmarshall commented Jun 12, 2018

I've added a suggested draft of some more explanatory text and also an idea about using ETag and other HTTP headers to ensure header and body consistency if the client chooses to reuse already-downloaded headers. (I think the latter was mentioned in passing on the Toronto meeting call.)

I've moved the suggestion of some explanatory text and the Cache-Control text to a separate PR — see #325. You may wish to mine that for its explanatory text.

htsget.md Outdated
`class`
_optional string_
</td><td>
A list of URL classes to include, see below.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add the briefest explanation here too such as "allowing client to request the data header only, or to opt-out of receiving the header in subsequent requests"

htsget.md Outdated
`class`
_string_
</td><td>
For file formats whose specification describes a header and a body, the class indicates which of the two will be retrieved when querying this URL. Either all or none of the URLs in the response must have a class attribute. The allowed values are `header` and `body`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For explicitness, I'd add something like "if class attributes are absent, client should assume data blocks include both header and body, possibly mixed"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a good idea. For instance, the EVA server provides separate URLs for headers and body by default, to facilitate downloading them separately even when the class field isn't available.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if the server doesn't annotate the URLs with class fields, there is no way for the client to know that the EVA server is doing that. The text @mlin suggests just makes explicit the reality that clients can't make assumptions when class fields are missing.

Copy link
Member Author

@cyenyxe cyenyxe Nov 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I misunderstood what @mlin means. Which of these 2 is it?

  1. Each URL could contain header or body
  2. Each URL could contain header and body

If it is the latter, then a client would have to split the contents from each URL in order to build a valid output file. If it is the former, then I will think about a rewording that is completely unambiguous.

Copy link
Member

@jmarshall jmarshall Nov 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If class fields are absent, the client can make no assumptions about the contents of individual ticket URLs or the boundaries between their contents: each URL could contain headers, body records, partial headers or records, or both.¹

(Consider e.g. a request with no referenceName/start/end thus a request for the entire file, and the server just returns a ticket that chops it up into 1 Mb chunks.)

It's implicit explicit (in the diagram of core mechanic section) that the client proceeds as if it's concatenating the contents of all the ticket URLs in order, to get the full file contents. It might choose to avoid redownloading an URL or two because it's already got it (effectively) cached, but that's its business. I don't quite see what you're getting at about the “client [having] to split the contents”…?


¹ Actually perhaps some implementations would like data records or compression blocks not to be split across ticket URLs, but at present I don't think the protocol says anything about that — so such splitting is allowed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, I hadn't considered how they could be mixed due to the fix-sized chunks 😅 Now it's all clear.

I thought it meant that all the URLs could contain header and body, which would make individual blocks correct, but not the response as a whole.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see 😄 — yes: the client can't assume anything about the individual URL contents, but it is safe in expecting that the whole concatenated response is a valid header-body-body-…-body-EOFtrailer stream.

@jmarshall
Copy link
Member

jmarshall commented Sep 24, 2018

[Query parameters]
+class optional string A list of URL classes to include, see below.

The original functionality desired was to enable requesting just the headers (see #311 (comment)). Later we decided it would also be useful to enable clients to avoid re-downloading headers they already have, but this is a comparatively minor piece of functionality.

The elegance of the class: header / body in the JSON urls array is that it enables clients to do the latter (if they wish) while maintaining the invariant that all requests are asking for syntactically-valid BAM/CRAM/VCF/BCF files.

So IMHO class as a query parameter should implement only the originally desired functionality: the only allowable value for class as a query parameter should be header.

@jeromekelleher
Copy link
Member

So IMHO class as a query parameter should implement only the originally desired functionality: the only allowable value for class as a query parameter should be header.

I'd support this in the spirit of us making the smallest possible change that's useful --- particularly if we can get it implemented. If we have an active server I'm happy to put together a client interface.

@mlin
Copy link
Member

mlin commented Sep 25, 2018

Opting-out of the header should be useful for RNA-seq data where the header is liable to include all the transcript IDs mapped to. However, I support the principle of simplifying a prototype implementation.

@jmarshall
Copy link
Member

@mlin: This PR now adds two related facilities. The main point of this PR, “Add[ing] class to URLs in the response“ is that such a request for RNA-seq data would get an htsget ticket back that says

{ … "urls": [ { "url": "abc…", "class": "header" }, { "url": "def…", "class": "body" }, … ], … }

and it would be able to opt out of getting the header (if it wished) by not bothering to download the abc URL. That is functionality that is enabled by the main part of this PR. (This might be unclear, which IIRC is why “more explanatory text” was requested in the PR.)

In addition to this, since September 7th the PR also adds a class= query parameter to directly express the original desired functionality — of making a request for just the headers.

So not only is class=header-or-class-queryparam-not-present the smallest possible useful change, it's also the only query parameter value we'll ever need for this sort of opting-out! The PR adds the functionality @mlin describes by providing the tools for clever clients to cleverly interpret the htsget ticket.

@cyenyxe
Copy link
Member Author

cyenyxe commented Nov 28, 2018

I feel it would be confusing to have a query parameter and response field with the same name, but only allowing a subset of values to be requested. What are your thoughts about using a parameter like header-url=<true|false> instead?

@jmarshall
Copy link
Member

I don't think it is confusing. Describing the query parameter as “Use …?class=header to have the server return the SAM or VCF headers only” or similar will make this clear. (The purpose of this query parameter is to implement the original desire for a way for the client to request SAM/VCF headers only, not to be a general-purpose selector.)

IMHO it is elegant for the protocol to use the same class / header terminology in both places.

@cyenyxe
Copy link
Member Author

cyenyxe commented Nov 30, 2018

I hope the last changes address all the concerns about lack of clarity, please let me know if that is not the case.

@mlin mlin mentioned this pull request Jan 22, 2019
@mlin
Copy link
Member

mlin commented Jan 23, 2019

I've made an initial implementation of this in https://github.com/dnanexus-rnd/htsnexus/pull/29/files and think it looks very nice. Great job @cyenyxe @jmarshall @jeromekelleher! I think we can ask for +1s on this on the call today.

@jmarshall
Copy link
Member

This describes the new class annotations, but it doesn't really say anything about what they're for or how to use them. This is what I think #322 (comment) was getting at.

  • The Diagram of core mechanic section is an opportunity to add explanatory text. We should add something like:

While the blocks must be finally concatenated in the given order, the client may fetch them in parallel and/or reuse cached data from URLs that have previously been downloaded.

When making a series of requests to fetch reads or variants within different regions of the same <id> resource, clients may wish to avoid re-fetching the SAM/CRAM/VCF headers each time, especially if they are large. If the ticket contains class fields, the client may reuse previously downloaded and parsed headers rather than re-fetching the header-class URLs.

  • In the response JSON, class should be listed as optional string.

    IMHO the “If it is absent, clients should assume data blocks include both header and body, possibly mixed in one of the blocks” sentence is not particularly clear and would be better phrased as not assuming:

If class fields are not supplied, no assumptions can be made about which data blocks contain headers, body records, or parts of both.

@jmarshall
Copy link
Member

jmarshall commented Jan 23, 2019

Re the query parameter …?class=header facility:

  • It's somewhat unclear what kind of data headers it's talking about.

  • The text about “not possible to return the data body only” is irrelevant (the spec shouldn't complain about the protocol's design!) and incorrect (the desired effect is totally possible — the way to do it is by interpreting class in the response ticket). It might be worth adding something about “All responses elegantly denote complete SAM/CRAM/VCF files” to the Design principles section, but it's misplaced here.

IMHO we should write this parameter's description something like the following, to facilitate future additions:

format
optional string

Request data in this format. The allowed values for each type of record are: […snip…]

class
optional string

Request different classes of data. By default, i.e., when class is not specified, the response will represent a complete read or variant data stream, encompassing SAM/CRAM/VCF headers, body data records, and EOF marker.

If class is specified, its value MUST be one of the following:

header

Request the SAM/CRAM/VCF headers only.

The server SHOULD respond with an InvalidInput error if any other htsget query parameters other than format are specified at the same time as class=header.

  • Should we specify whether the class=header response has an EOF marker? IMHO we should probably be explicit that it might or might not.

  • Servers might want to return the BAM/etc headers directly rather than returning a ticket and needing an extra round trip. Should this be allowed? If so, we might mention it here and we would need to relax the “All API invocations are made to a configurable HTTP(S) endpoint, receive URL-encoded query string parameters, and return JSON output” text in Protocol essentials.

Copy link
Member

@jmarshall jmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to the design, but text can be improved as per recent comments.

@jeromekelleher
Copy link
Member

I basically agree to everything that @jmarshall has put in above. Perhaps he could open a PR on @cyenyxe's feature branch with these changes so that they will show up here in this PR? (Git history minutia: since both John and Cristina have put in so much here, I think it's probably fair if we have two commits, one authored by each, representing the actual history of the feature.)

@cyenyxe
Copy link
Member Author

cyenyxe commented Jan 23, 2019

Given that it is just a change in wording, and I wasn't too convinced about mine to being with, I am happy with @jmarshall pushing to my branch without a PR. Please let me know if you don't have permissions to do so.

Describe the class=header request query parameter more fully.
Add description of using response URL classes to the "diagram of
core mechanic" section.

[SQUASH INTO PREVIOUS] Add "optional", some wordsmithing, and add
blank lines so formatting works when displayed in the GH repo too.]
@mlin
Copy link
Member

mlin commented Feb 20, 2019

bumping @jmarshall questions

  • Should we specify whether the class=header response has an EOF marker? IMHO we should probably be explicit that it might or might not.
  • Servers might want to return the BAM/etc headers directly rather than returning a ticket and needing an extra round trip. Should this be allowed? If so, we might mention it here and we would need to relax the “All API invocations are made to a configurable HTTP(S) endpoint, receive URL-encoded query string parameters, and return JSON output” text in Protocol essentials.

@jeromekelleher
Copy link
Member

Weighing in here:

Servers might want to return the BAM/etc headers directly rather than returning a ticket and needing an extra round trip. Should this be allowed? If so, we might mention it here and we would need to relax the “All API invocations are made to a configurable HTTP(S) endpoint, receive URL-encoded query string parameters, and return JSON output” text in Protocol essentials.

I think we should always return a ticket, the alternative seems like it would cause a lot of issues.

Should we specify whether the class=header response has an EOF marker? IMHO we should probably be explicit that it might or might not.

Would you mind explaining when you think we would get an EOF marker @jmarshall? It's not clear to me what the tradeoffs are here.

@jmarshall
Copy link
Member

jmarshall commented Feb 20, 2019

Re returning headers directly, @daviesrob's point (made during the meeting) that clients expecting a ticket might be greatly surprised to get something that's not a ticket is a very good one.

Re EOF markers (as in the 28-byte empty BGZF block at the end of BGZFed files, and the CRAM equivalent):

The proposed class=header query text has “By default, i.e., when class is not specified, the response will represent a complete read or variant data stream, encompassing SAM/CRAM/VCF headers, body data records, and EOF marker”, which is perhaps more information than anyone really needs 😄 — it's there to set the scene for which headers/body the class stuff is talking about.

So for formats that have an EOF marker (BAM, BCF, and CRAM; not non-bgzipped SAM or VCF), this text spells out that the complete file datastream will have one.

For a class=header request for say BAM, the response will encompass BAM header and maybe an EOF marker. For a server implemented using HTSlib, it might write the response using hts_open/ sam_hdr_write/ hts_close in which case hts_close will write an EOF marker. Or it might hold the relevant stream open across multiple requests and only use sam_hdr_write, in which case there will be no EOF marker.

Hmmm… I guess if we're always returning a ticket for a class=header request, the server will have no difficulty in always putting a data:,<EOF-marker> URL at the end of the ticket. So we could say the response will be a BAM/etc header that SHOULD be followed by an EOF marker (for formats that have them).

@mlin
Copy link
Member

mlin commented Feb 20, 2019

  • Servers might want to return the BAM/etc headers directly rather than returning a ticket and needing an extra round trip. Should this be allowed? If so, we might mention it here and we would need to relax the “All API invocations are made to a configurable HTTP(S) endpoint, receive URL-encoded query string parameters, and return JSON output” text in Protocol essentials.

We have the way of returning a data: URI inside the ticket, this requires base64 encoding and decoding it but that shouldn't be too bad depending on the header size (and if it's very large, in principle the server can still offload it). Does that satisfy this idea?

  • Should we specify whether the class=header response has an EOF marker? IMHO we should probably be explicit that it might or might not.

I suppose the worry is that the response includes header & EOF marker, and the client later tries to be clever by reusing a large header by concatenation with body responses of subsequent requests, and ends up with an EOF marker in the middle of their file. For BAM and VCF this might actually be OK, in view of how the EOF marker is just gzip of the empty string. The CRAM EOF container might be more likely to cause problems, though.

On the other hand, including the EOF preserves the property that the concatenated response is a fully-formed file of the requested format, which is nice. Perhaps the clever client described above should be expected to be so clever as to detect and strip the EOF bytes from the header response, if they're present.

@jeromekelleher
Copy link
Member

On the other hand, including the EOF preserves the property that the concatenated response is a fully-formed file of the requested format, which is nice. Perhaps the clever client described above should be expected to be so clever as to detect and strip the EOF bytes from the header response, if they're present.

I think this a good general principle, and the clever client should be expected to be clever enough to know about the EOF markers.

`header`
</td><td>

Request the SAM/CRAM/VCF headers only.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmarshall What do you think of adding "(and EOF marker)" here? Useful clarification, thanks for bringing it up earlier.

@mlin
Copy link
Member

mlin commented Apr 18, 2019

The discussion on today's call leaned toward merging this and leaving the EOF as something we can clarify later if it's an issue. Rob Davies reported that the htslib CRAM reader isn't misled by having a CRAM EOF marker in the middle of the file, and even that is something that a carefully written client should be able to avoid.

I will merge this in another day if there are nothing further is raised. Thanks all!

@mlin mlin merged commit 476afba into samtools:master Apr 24, 2019
@jmarshall
Copy link
Member

jmarshall commented Apr 25, 2019

Mike, when merging it would be good to remember things like #322 (comment) that had been agreed on. I thought you accepted my offer during the meeting to prepare this PR accordingly…

@jeromekelleher
Copy link
Member

Mike, when merging it would be good to remember things like #322 (comment) that had been agreed on. I thought you accepted my offer during the meeting to prepare this PR accordingly…

This was my understanding as well. It probably doesn't make much difference for a low-traffic repo like this, but a clean and linear(ish) git history is worth having.

@jmarshall
Copy link
Member

I've pushed a small formatting fix that was in what I was preparing.

Also, we probably ought to bump the spec version number for this.

@jmarshall
Copy link
Member

And a second formatting fix. This is the sort of thing that ought to be caught by previewing the Jekyll formatting while doing the final PR merge or at least by checking the published website afterwards.

@mlin
Copy link
Member

mlin commented Apr 28, 2019

OK, sorry all, I evidently mis-remembered the action assignment. Let me know how I can help clean up. I agree on bumping the spec version number.

@jmarshall
Copy link
Member

It doesn't much matter who does what (as long as attributions are maintained), but merging to master does need to be done with care and attention. You're the only one empowered to push to htsget.md on master and IMHO the checklist should include:

  • Read over the PR conversation to check that nothing has been omitted;

  • Check the kramdown formatting of the PR as merged (which would usually be done by previewing locally with jekyll serve);

  • Check the GitHub repository formatting of the PR as it will be merged (which can be approximated with “View file” from the PR's Files changed tab or by pushing a temporary branch containing the eventual commit(s) to your own fork).

This pretty much means you get to do this locally rather than by using GitHub's UI merge buttons.

Let me know how I can help clean up.

Fortunately nothing else has yet been pushed to master, so you could clean this up by bumping the version number in/via an explicit merge commit so that the whole PR arrives on master as a unit. (This will make for a bubble in the history, but that's probably the lesser of two evils…)

In particular: reset your master to ad374ab and merge --no-ff --edit the 45c9feb branch into it with an appropriate commit message, either bumping the version number in that merge commit or by committing the bump to the branch first.

@jmarshall
Copy link
Member

@mlin: It's not the end of the world for the git history if this is not cleaned up, but you asked for advice on how to clean up. You have only a limited amount of time to do so, as other specifications want to carry on with their work.

In particular, the SAM people want to commit f48b522. If you wished, you could use this to clean up nicely: check out the sam-tp branch, git merge --no-commit master (where master == 45c9feb or that plus a commit bumping the htsget version) into it with an appropriate commit message, either bumping the version number in that merge commit or by committing the bump to your local master first. Then fast-forward master to that merge commit and push it, and you will have committed to origin/master this PR as a tidy branch merge and the small SAM change appropriately timed within its bubble.

mlin added a commit that referenced this pull request May 14, 2019
@mlin
Copy link
Member

mlin commented May 14, 2019

OK, I took your first suggestion as I didn't feel I myself ought to effect merger of the SAM change. Thanks for that!

Policywise, I wouldn't claim to be "the only one empowered to push to htsget.md on master" and when I ask "Let me know how I can help" I wouldn't add "and no one else may do anything to master until I decide." I wouldn't think such rigidity called for in our small, volunteer contributor group. Somebody has to be the default person to coordinate (and follow the helpful checklist), but if it's broke, fix it!

@jeromekelleher
Copy link
Member

Policywise, I wouldn't claim to be "the only one empowered to push to htsget.md on master" and when I ask "Let me know how I can help" I wouldn't add "and no one else may do anything to master until I decide." I wouldn't think such rigidity called for in our small, volunteer contributor group. Somebody has to be the default person to coordinate (and follow the helpful checklist), but if it's broke, fix it!

I agree with @mlin here --- once there's a consensus reached on a PR, anyone with write permissions on the repo should feel free to merge (following the very helpful checklist from @jmarshall above), so we don't need to block on Mike's limited availability. I'm assuming you have write permission on this repo @jmarshall , and I (for one) would be perfectly happy with you taking on the PR merge responsibility if it was a PR you're involved in.

@jmarshall
Copy link
Member

Mike is the sole htsget maintainer listed in MAINTAINERS.md, and LSG leadership has previously castigated people for not respecting this. If you want a different policy, you'll have to promulgate it either in MAINTAINERS.md or on the mailing list.

@jeromekelleher
Copy link
Member

Mike is the sole htsget maintainer listed in MAINTAINERS.md, and LSG leadership has previously castigated people for not respecting this. If you want a different policy, you'll have to promulgate it either in MAINTAINERS.md or on the mailing list.

Fair enough - sounds like a discussion point for the next meeting then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants