Find a way to communicate the ordering of a file back with the existing listing table implementation #13891

Open
zhuqi-lucas opened this issue Dec 24, 2024 · 3 comments
Assignees
Labels
enhancement New feature or request

Comments

@zhuqi-lucas
Contributor

zhuqi-lucas commented Dec 24, 2024

Is your feature request related to a problem or challenge?

This is the follow-up for:
#13874 (review)

We added support for ordering (order by / sort) to DataFrameWriteOptions, but when a user queries a table whose files are already ordered, we can't get the ordering information from the table.

We need to find a way to communicate the ordering of a file back with the existing listing table implementation.

Describe the solution you'd like

It is also conceivable that DataFusion itself could write custom metadata with the ordering into Parquet and other formats that support custom metadata, but that seems like something Iceberg and other table formats could handle.

Describe alternatives you've considered

No response

Additional context

No response

@zhuqi-lucas
Contributor Author

take

@alamb
Contributor

alamb commented Dec 24, 2024

One way to do this would be to write DataFusion-specific metadata into the files (e.g. add something to https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.key_value_metadata).

It would be great if we could avoid making something DataFusion-specific. Maybe someone could do some research and find out how other systems handle this.
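To make the key/value metadata idea concrete, here is a minimal sketch of how an ordering could be serialized into a single metadata string and parsed back when the file is read. Everything in it is an assumption for illustration: the `SortKey` type, the `datafusion.sort_order` key name, and the text encoding are hypothetical and are not an existing DataFusion or Parquet convention. It uses only the standard library; the resulting string would be stored as the value of one key/value metadata entry.

```rust
/// One sort key of a file's ordering (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct SortKey {
    pub column: String,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Hypothetical metadata key; not an existing DataFusion or Parquet convention.
pub const ORDERING_KEY: &str = "datafusion.sort_order";

/// Encode an ordering as text, e.g. "a ASC NULLS LAST,b DESC NULLS FIRST".
/// Limitation of this sketch: column names must not contain commas or whitespace.
pub fn encode_ordering(keys: &[SortKey]) -> String {
    keys.iter()
        .map(|k| {
            format!(
                "{} {} NULLS {}",
                k.column,
                if k.descending { "DESC" } else { "ASC" },
                if k.nulls_first { "FIRST" } else { "LAST" }
            )
        })
        .collect::<Vec<_>>()
        .join(",")
}

/// Decode the text produced by `encode_ordering`.
/// Returns `None` if any part of the value is malformed.
pub fn decode_ordering(value: &str) -> Option<Vec<SortKey>> {
    if value.is_empty() {
        return Some(Vec::new());
    }
    value
        .split(',')
        .map(|part| {
            let tokens: Vec<&str> = part.split_whitespace().collect();
            match tokens.as_slice() {
                [column, dir, "NULLS", nulls] => {
                    let descending = match *dir {
                        "ASC" => false,
                        "DESC" => true,
                        _ => return None,
                    };
                    let nulls_first = match *nulls {
                        "FIRST" => true,
                        "LAST" => false,
                        _ => return None,
                    };
                    Some(SortKey {
                        column: column.to_string(),
                        descending,
                        nulls_first,
                    })
                }
                _ => None,
            }
        })
        .collect()
}
```

A reader that does not recognize the key can simply ignore it, which is what keeps this approach backward compatible. The cost is the one raised above: other systems would not understand the entry.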

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Dec 26, 2024

Parquet has a field for sort columns, but it appears to be row-group-level metadata, so when we set the sort columns on a Parquet file, they are applied to the row group metadata.

apache/arrow-rs#3103

One way to do this would be to write DataFusion-specific metadata into the files (e.g. add something to https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.key_value_metadata).

@alamb This is a good idea for file-level metadata storage. I am also wondering whether we need to add the sort columns to the Parquet file metadata as well, in addition to the row-group-level metadata, so that we can use it in DataFusion.
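As a rough illustration of how a reader might work with only the row-group-level metadata, the sketch below computes the longest prefix of sorting columns shared by every row group. The `SortingColumn` struct here is a local mirror of the Parquet struct of the same name (fields `column_idx`, `descending`, `nulls_first`), defined locally so the example needs only the standard library; `common_sorting_prefix` is a hypothetical helper, not an existing API.

```rust
/// Local mirror of Parquet's per-row-group `SortingColumn` metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Hypothetical helper: the longest prefix of sorting columns that every
/// row group declares. Each slice entry is one row group's sorting columns,
/// or `None` if that row group declares none.
pub fn common_sorting_prefix(row_groups: &[Option<Vec<SortingColumn>>]) -> Vec<SortingColumn> {
    let mut iter = row_groups.iter();
    // With no row groups, or a first row group without sort metadata,
    // nothing can be claimed.
    let mut prefix: Vec<SortingColumn> = match iter.next() {
        Some(Some(cols)) => cols.clone(),
        _ => return Vec::new(),
    };
    for rg in iter {
        let cols = match rg {
            Some(cols) => cols,
            // One row group without sort metadata invalidates the claim.
            None => return Vec::new(),
        };
        // Keep only the leading columns on which both lists agree.
        let shared = prefix
            .iter()
            .zip(cols.iter())
            .take_while(|(a, b)| a == b)
            .count();
        prefix.truncate(shared);
        if prefix.is_empty() {
            break;
        }
    }
    prefix
}
```

Even a non-empty result only says that each row group is sorted the same way internally. It does not by itself show that rows are ordered across row group boundaries, which is one reason a file-level ordering declaration could still be useful.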


2 participants