Does the format allow multiple rows with the same height and timestamp? #48
What is allowed by the standard?

If we wanted to enforce a unique combination, it could be expressed with `primaryKey`, which would invalidate duplicate combinations: `"primaryKey": ["radar", "datetime", "height"]`. I don't think we want that however, because when creating VPTS files it is hard to guarantee this (see further).

What is causing duplicate records?

We have encountered two causes for duplicate records:

- The resulting VPTS CSV file will have different data for the same timestamp.
- The resulting VPTS CSV file will have the same data for the same timestamp, except for …

How can we fix duplicate records?

Processing from HDF5 to CSV is file-based, making it hard to catch these duplicates. As a result, they are present in the VPTS data. The VPTS data thus presents the full scope of the data that was there. I think any fix should be done in the readers of the data, like CROW or bioRad.
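For reference, the uniqueness constraint discussed above would look roughly like the fragment below in a Frictionless Table Schema. The field list is abbreviated for illustration (the real VPTS schema defines many more fields); only the `primaryKey` line is taken from the comment.

```
{
  "fields": [
    {"name": "radar", "type": "string"},
    {"name": "datetime", "type": "datetime"},
    {"name": "height", "type": "integer"}
  ],
  "primaryKey": ["radar", "datetime", "height"]
}
```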
Noted, thanks! CROW currently cannot deal with those but that's not a problem I guess, just something good to know for me!
I am not sure it should be clarified: the fact that a combination of fields is allowed is the default case, so in general that's not the kind of thing we need to be explicit about.

But in this specific case, and from the "data consumer" standpoint, I found it quite confusing. At least for CROW, we can only display a single value per height and time, so finding multiple ones in the source, without clear guidance about what that represents (in the real world) nor any indication of which one should be discarded, feels a bit weird.

So while circumventing the issue by reading the first one in a specific reader is easy (I'll do it for CROW soon!), I am not sure I agree with "any fix should be done in the readers of the data" if we're talking at the community/standard level.

Let's take an analogy and say we are building a standard for satellite imagery, where each file is a mosaic of multiple pictures taken at different times by different satellites. If the goal of the standard is to provide an image of Earth (each data point is a pixel with 3 dimensions: X, Y and color), it would feel strange to me to have multiple colors for a given X,Y coordinate, resulting in larger files and passing the following message to readers (à la Google Maps): "When you have two colors for the same point, just choose which one you want to show to the user; this happens because when generating the file we don't really want to deal with the case where two 'initial pictures' overlap."

I feel like the "data curation" part (having two competing values for a point and choosing which one to include, or maybe something else, like the mean of the two) is better suited to the "data production" step. The obvious downside is some data loss, because of this curation.

Just my 2 cents, I don't pretend I am right :D
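The reader-side workaround mentioned above (keep the first row seen for each combination) can be sketched with only the Python standard library. The column names follow the VPTS CSV standard; the sample data and the `deduplicate` helper are made up for illustration, not part of any existing reader.

```python
# Sketch: reader-side deduplication of a VPTS CSV, keeping the first row
# per (radar, datetime, height) combination. Sample data is hypothetical.
import csv
import io

sample = """radar,datetime,height,dens
bewid,2024-03-14T00:00:00Z,200,10.5
bewid,2024-03-14T00:00:00Z,200,11.2
bewid,2024-03-14T00:00:00Z,400,3.1
"""


def deduplicate(rows):
    """Yield only the first row seen for each (radar, datetime, height)."""
    seen = set()
    for row in rows:
        key = (row["radar"], row["datetime"], row["height"])
        if key not in seen:
            seen.add(key)
            yield row


rows = list(deduplicate(csv.DictReader(io.StringIO(sample))))
print(len(rows))  # 2: the duplicate 200 m level is dropped
```

A reader could equally average the competing values instead of keeping the first; as the comment argues, which policy is "right" is exactly the curation question that the standard currently leaves open.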
I couldn't find the answer in the documentation page, but I've encountered such a file: https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/daily/bewid/2024/bewid_vpts_20240314.csv
(that causes issues with CROW, see enram/crow#16)
It would be great to clarify this expectation about the file (not just for me here, but also on the documentation page).
I also realize now that it's unclear to me if a single vpts-csv file can cover multiple radars?