Improvements to publishing pipeline #108

Merged: 14 commits, Jun 26, 2024

@penelopeysm penelopeysm commented Jun 6, 2024

Commit messages have more detail, but the main changes of this PR are:

There is one commit (86746e5) which converts the NI catalog requests to use async HTTP requests. I had to implement this not because of anything wrong with the original implementation, but rather because my Internet connection over here is not quite the same as in the UK... 🥲 However, I am happy to drop it from this PR if this is deemed unnecessary.
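
For illustration, the general shape of such a conversion might look like the sketch below. It is not the code from 86746e5: `fetch_all`, `fake_fetch`, and the `max_concurrency` limit are hypothetical names, and the stand-in fetcher only sleeps instead of making a real HTTP request, so the example runs without a network connection or any particular HTTP client library.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> List[str]:
    """Run `fetch` for every URL concurrently, with a cap on in-flight requests.

    `fetch` is injected so that any async HTTP client could be plugged in;
    which client the PR actually uses is not shown here.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> str:
        # At most `max_concurrency` coroutines can be inside this block at once.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch_one(url) for url in urls))


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a catalog request; sleeps to mimic latency."""
    await asyncio.sleep(0.01)
    return f"payload for {url}"


if __name__ == "__main__":
    print(asyncio.run(fetch_all(["u1", "u2", "u3"], fake_fetch, max_concurrency=2)))
```

Capping concurrency with a semaphore is one way to avoid saturating a slow connection; whether the commit does this is an assumption of the sketch.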

@penelopeysm penelopeysm force-pushed the publishing-improvements branch 3 times, most recently from ee8fbb7 to b31223a Compare June 6, 2024 10:02
@penelopeysm penelopeysm marked this pull request as draft June 17, 2024 12:31
- Also: Introduces `GeometryOutput` and `MetricsOutput` dataclasses to
  represent the output types of the assets that produce geometry and
  metrics respectively. These will hopefully be easier to understand,
  document, and use, compared to raw tuples.

- Also: Updates Belgium DAG to use these new types.

Closes #94
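
As a rough sketch of why named output types can be easier to work with than raw tuples: the class names below come from the commit message, but every field name and type is an assumption made for illustration, not the definitions used in the repository.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class GeometryOutput:
    # All fields are hypothetical; the real dataclass may differ.
    metadata: dict[str, Any]  # description of the geometry level
    geometries: Any           # e.g. a table of geometries
    names: Any                # e.g. a table of region names


@dataclass(frozen=True)
class MetricsOutput:
    # All fields are hypothetical; the real dataclass may differ.
    metadata: list[dict[str, Any]]  # one entry per metric
    metrics: Any                    # e.g. a table of metric values


def count_metrics(output: MetricsOutput) -> int:
    """Named attributes document themselves; compare with `output[0]` on a tuple."""
    return len(output.metadata)
```

With a raw tuple, a consumer has to remember which position holds which item; with a dataclass, `output.metrics` states it directly and type checkers can verify it.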
Partitions are 'cached' between different Dagster runs, so doing this
helps to 'clean up' old partitions that are no longer applicable to the
version of the code being presently worked on.
This commit implements a class method called fix_types, which is called
whenever a list of metadata classes is serialised to a dataframe.
cls.fix_types(df) returns a new df where the types of columns are
properly coerced to what they should be in the resulting parquet file.
This avoids issues with pandas automatically inferring e.g. a None type
for a string column that just happens to all be Nones, and ensures that
dataframes from different countries can be concatenated by the CLI.

Closes #106
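
A minimal sketch of the idea, assuming pandas and a made-up metadata class: `ExampleMetadata`, its fields, and `metadata_to_dataframe` are hypothetical names, and the chosen dtypes are illustrative only.

```python
from dataclasses import asdict, dataclass
from typing import Optional

import pandas as pd


@dataclass
class ExampleMetadata:
    # Hypothetical metadata class; the real classes have different fields.
    name: str
    description: Optional[str]
    year: int

    @classmethod
    def fix_types(cls, df: pd.DataFrame) -> pd.DataFrame:
        """Return a new dataframe with columns coerced to the intended dtypes."""
        # astype() returns a copy, so the input dataframe is left unchanged.
        return df.astype(
            {"name": "string", "description": "string", "year": "int64"}
        )


def metadata_to_dataframe(items: list[ExampleMetadata]) -> pd.DataFrame:
    """Serialise a non-empty list of metadata objects, then coerce the types."""
    if not items:
        raise ValueError("need at least one metadata object")
    df = pd.DataFrame([asdict(item) for item in items])
    return type(items[0]).fix_types(df)


# One "country" where the description happens to be None everywhere,
# and another where it is populated.
be = metadata_to_dataframe([ExampleMetadata("a", None, 2021)])
ni = metadata_to_dataframe([ExampleMetadata("b", "some text", 2011)])
combined = pd.concat([be, ni], ignore_index=True)
```

Without the coercion, the all-`None` column would be left as a generic object column, and its dtype would not match the populated column from the other dataframe.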
This makes it consistent with updates to the sensor

My poor Malaysian internet connection can't handle it otherwise

Just a simple change pending a full refactor of BE
@penelopeysm penelopeysm marked this pull request as ready for review June 24, 2024 10:00
@penelopeysm penelopeysm requested review from sgreenbury and andrewphilipsmith and removed request for sgreenbury June 24, 2024 10:03
docs/new_country.md (outdated, resolved)

@sgreenbury sgreenbury left a comment

Thanks for this @penelopeysm, this looks great to me (including the async addition) and I think the docs are very clear! Just added some small comments in the code and think this is good to merge.

@penelopeysm

CI broke because geopandas 1.0.0 was released 2 days ago 😄

@penelopeysm penelopeysm merged commit 679ad6e into main Jun 26, 2024
8 checks passed
@penelopeysm penelopeysm deleted the publishing-improvements branch June 26, 2024 13:23
penelopeysm added a commit that referenced this pull request Jun 26, 2024
Forgot to change this line as part of #108
@penelopeysm penelopeysm mentioned this pull request Jun 26, 2024