
Check for Parquet file changes #1

Open
JNuss71 opened this issue Apr 8, 2024 · 2 comments
Assignees
Labels
feature (New feature or request) · nice-to-have (Stuff that would be nice to have) · optimization (Optimizing some existing functionality)

Comments

@JNuss71

JNuss71 commented Apr 8, 2024

Is your feature request related to a problem? Please describe.
Currently, the Parquet performance data is downloaded and processed on a regular schedule regardless of whether the Parquet file has changed. This unnecessarily downloads and processes data that has already been processed.

Describe the solution you'd like
Check whether the Parquet performance data has changed by inspecting the HTTP response headers for either the ETag or the Last-Modified date/time. If it hasn't changed, skip the download and processing step. If the file has been updated since the last time it was processed, download the new Parquet file and process it.
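The conditional-request approach described above might look something like this in Python (a hedged sketch; the function names and stored-state handling are illustrative, not taken from the project's codebase):

```python
import urllib.request
from urllib.error import HTTPError

def conditional_headers(etag=None, last_modified=None):
    """Build conditional-request headers from previously stored values."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def fetch_if_changed(url, etag=None, last_modified=None):
    """Download the file only if the server reports it has changed.

    Returns (data, new_etag, new_last_modified); data is None when the
    server answers 304 Not Modified, meaning processing can be skipped.
    """
    req = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return (resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except HTTPError as err:
        if err.code == 304:  # unchanged since the stored ETag/Last-Modified
            return None, etag, last_modified
        raise
```

One caveat: for S3-hosted objects uploaded via multipart upload, the ETag is not an MD5 of the content, but it still works as an opaque change token, so either header suffices for this purpose.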

@JNuss71 JNuss71 added the feature New feature or request label Apr 8, 2024
@JNuss71 JNuss71 changed the title from "Check for Parquet file changes before downloading and processing" to "Check for Parquet file changes" Apr 8, 2024
@JNuss71 JNuss71 transferred this issue from transitmatters/t-performance-dash Apr 8, 2024
@JNuss71 JNuss71 self-assigned this Apr 8, 2024
@JNuss71 JNuss71 added nice-to-have Stuff that would be nice to have optimization Optimizing some existing functionality labels Apr 8, 2024
@devinmatte
Member

For this we're going to need a way to keep track of the last process time. Since we're taking one file and processing it into hundreds, I don't think the files on S3 will give us a great idea of when we processed it last without checking all of them (as some files won't be updated every run).

@JNuss71
Author

JNuss71 commented May 10, 2024

Is it worth storing this kind of data in DynamoDB, or would that just be an unnecessary introduction of a database?
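If DynamoDB were used, the state is tiny (one item per source file), so a minimal sketch might look like the following. The table schema and attribute names are assumptions, and boto3 is imported lazily so the pure record-building logic works without AWS credentials:

```python
import datetime

def build_state_item(source_key, etag, last_modified):
    """Shape of the per-file state record; attribute names are hypothetical."""
    return {
        "source_key": source_key,  # e.g. the Parquet file's URL or S3 key
        "etag": etag,
        "last_modified": last_modified,
        "processed_at": datetime.datetime.now(
            datetime.timezone.utc
        ).isoformat(),
    }

def save_state(table_name, item):
    """Persist the record to DynamoDB (requires boto3 and AWS credentials)."""
    import boto3
    boto3.resource("dynamodb").Table(table_name).put_item(Item=item)

def load_state(table_name, source_key):
    """Fetch the stored ETag/Last-Modified for a file; None on first run."""
    import boto3
    resp = boto3.resource("dynamodb").Table(table_name).get_item(
        Key={"source_key": source_key}
    )
    return resp.get("Item")
```

An alternative that avoids introducing a database is to keep the same record as a small JSON object in the existing S3 bucket; DynamoDB mainly buys conditional writes if multiple runs could ever overlap.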
