WIP: Add England and Wales census tables #105

Merged on Jul 10, 2024 (20 commits)

Conversation

@andrewphilipsmith (Collaborator) commented Jun 4, 2024

This WIP PR seeks to incorporate census attributes for England and Wales.

This supersedes #40

Current Status:

  • At present, each zip file is downloaded multiple times. This is because each partition is a unique combination of topic summary and geometry level, whereas each zip file contains multiple files covering every geometry level for a given topic summary. I cannot see an easy way to share a cached zip file from which different files are extracted for different partitions, given that Dagster can in theory execute partitions in arbitrary order and even on a cluster of machines. A custom IOManager might be a better solution, but it is unclear whether this is worth the effort at this stage.
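As a rough illustration of the caching idea, here is a minimal sketch of a per-machine download cache keyed on the URL. It is not what this PR implements. The names `cached_download`, `cache_dir` and `fetch` are hypothetical, and it only helps when partitions run on the same machine (or share a filesystem), so it does not address the cluster case described above.

```python
import hashlib
import uuid
from pathlib import Path
from typing import Callable


def cached_download(url: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return a local path for `url`, calling `fetch` only on a cache miss.

    `fetch` is a hypothetical callable that returns the response body as bytes.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache entry on a hash of the URL, so every partition that asks
    # for the same zip file resolves to the same local path.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_dir / f"{key}.zip"
    if not target.exists():
        # Write to a uniquely named temporary file and then rename it, so a
        # concurrent reader never sees a partially written zip file.
        partial = cache_dir / f"{key}.{uuid.uuid4().hex}.part"
        partial.write_bytes(fetch(url))
        partial.replace(target)
    return target
```

Two partitions that start at the same moment may still both download the file; the rename only guarantees that whichever copy ends up in the cache is complete.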

Todo

  • Implement `_derived_metrics()`. This will need a very different implementation from the equivalent NI method: NI's source data is row-based, whereas the EW data is already column-based.

Problems to solve(d)

  • Not all of the information required for the catalogue is available on the bulk downloads webpage. The geometry level is only discoverable after downloading the zip file. We need to update the catalogue based on what we discover in the zip files.
    • We build the partitions on the (incorrect) assumption that every geometry level is available for every source table. Partitions for non-existent combinations of source table and geometry level are then allowed to fail.
  • The zip files can contain multiple CSV files, one for each geometry level. Do we only want the lowest-level geometry, or do we include all of them?
    • All of them - see above
  • What is the appropriate way to handle the "extra geographies" column where data exists? Can these be easily combined with "original release" files?
    • The "extra geographies" are not included at present
    • Decide whether or not to include the extra geometries in the v0.2 release.
  • Some of the downloaded files mistakenly have two consecutive `.` characters in the filename, e.g. census2021-ts002-lsoa..csv. We need to handle this situation gracefully.
    • Done
  • There are multiple metrics in some source tables. We need to create the right (and the right number of) MetricMetaData objects.
    • Still to do. This will be handled by the `_derived_metrics()` method.
  • We need to find a way to map the source metric names to HXL tags (possibly this should be in a separate PR).
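Two of the points above (the doubled dot in some filenames, and geometry levels that may be missing from a zip file) can be sketched as follows. This is an illustration only, not the code merged in this PR. The helper names are hypothetical, and the `census2021-<table>-<level>.csv` naming pattern is an assumption inferred from the single example filename given above.

```python
import re
from pathlib import PurePosixPath
from typing import Iterable, Optional


def normalise_filename(name: str) -> str:
    """Collapse any run of consecutive dots into a single dot."""
    return re.sub(r"\.{2,}", ".", name)


def find_member(names: Iterable[str], table: str, level: str) -> Optional[str]:
    """Return the archive member for `table` at `level`, or None if absent.

    Assumes members are named like 'census2021-<table>-<level>.csv'.
    """
    expected = f"census2021-{table}-{level}.csv"
    for name in names:
        # Compare on the base name only, in case the archive has subfolders.
        base = PurePosixPath(name).name
        if normalise_filename(base).lower() == expected:
            # Return the original name, since that is what the zip file uses.
            return name
    return None
```

With `names` taken from `zipfile.ZipFile.namelist()`, a `None` result corresponds to a partition whose combination of source table and geometry level does not exist.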

For future issues/PRs (eg after v0.2)

  • Handle the "extra geographies" column where data exists
  • Improve mapping of source metric names and HXL tags
  • Remove code that is duplicated with other countries

The `delete` argument in `with TemporaryDirectory(delete=del_temp_dir):` is only available from Python 3.12. We need to be able to support Python 3.11.
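One possible way to get the same behaviour on Python 3.11 is a small context manager built on `tempfile.mkdtemp`. This is a sketch under that assumption, not necessarily the fix adopted in this PR; the name `temporary_directory` is hypothetical. Note that it yields a `Path`, whereas `TemporaryDirectory` yields a `str`.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def temporary_directory(delete: bool = True) -> Iterator[Path]:
    """Create a temporary directory, removing it on exit only if `delete` is true."""
    path = Path(tempfile.mkdtemp())
    try:
        yield path
    finally:
        if delete:
            # Remove the directory and its contents, as TemporaryDirectory would.
            shutil.rmtree(path, ignore_errors=True)
```

It would be used as `with temporary_directory(delete=del_temp_dir) as temp_dir:`, and works on both Python 3.11 and 3.12.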
@andrewphilipsmith andrewphilipsmith marked this pull request as ready for review July 9, 2024 11:20
@andrewphilipsmith andrewphilipsmith merged commit d53468b into main Jul 10, 2024
8 checks passed
@andrewphilipsmith andrewphilipsmith deleted the england_wales_census branch July 10, 2024 13:34