
Document data sources and script generating the required input data #39

Open · 6 tasks
sgreenbury opened this issue Aug 7, 2024 · 6 comments


sgreenbury commented Aug 7, 2024

This issue aims to document and automate where possible the set-up of required data for the pipeline. This will enable the pipeline to be run for other regions (specifically Greater London as a next case).


Hussein-Mahfouz commented Aug 23, 2024

This is the travel time workflow I used in another project:

  1. https://github.com/Hussein-Mahfouz/drt-potential/blob/main/code/routing_prep.R
  2. https://github.com/Hussein-Mahfouz/drt-potential/blob/main/code/routing_r5r.R

Expanded workflow described here: #20 (comment)

@sgreenbury

From OSMOX: "west-yorkshire_epsg_4326.parquet"

sgreenbury changed the title from "Apply pipeline to run for other regions" to "Document data sources and script generating the required input data" on Aug 28, 2024
@sgreenbury

Documentation can be added in the scripts README: https://github.com/Urban-Analytics-Technology-Platform/acbm/blob/main/scripts/README.md

@sgreenbury

Discussion with @Hussein-Mahfouz around how to generalize the boundary and travel time inputs. Currently the geometry ID (e.g. OA or MSOA) is specified independently in each script with different logic paths.

Considerations going forward:

  • Support OA, MSOA, or other zone layers
  • Standardize column names for the boundary and travel time inputs (e.g. relabel the input data on load, then validate during data processing). Aim to use just zone_id, from_id and to_id.
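The relabel-then-validate idea could be sketched as below. This is a minimal pandas sketch: the rename mappings and the `standardize_columns` helper are hypothetical placeholders, and only the target names `zone_id`/`from_id`/`to_id` come from the proposal above.

```python
import pandas as pd

# Hypothetical mappings from source-specific column names to the
# standard names; real input columns depend on the data source
# (e.g. ONS boundary files use OA21CD).
BOUNDARY_RENAME = {"OA21CD": "zone_id"}
TRAVEL_TIME_RENAME = {"from_oa": "from_id", "to_oa": "to_id"}

def standardize_columns(df: pd.DataFrame, rename: dict, required: list) -> pd.DataFrame:
    """Relabel input columns to the standard names, then check they exist."""
    out = df.rename(columns=rename)
    missing = [c for c in required if c not in out.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return out

boundaries = pd.DataFrame({"OA21CD": ["E00000001", "E00000002"]})
boundaries = standardize_columns(boundaries, BOUNDARY_RENAME, ["zone_id"])
```

In the real pipeline the existence check could be replaced by a pandera schema, so every script validates the same contract rather than re-implementing it.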


Hussein-Mahfouz commented Sep 20, 2024

@sgreenbury I've tried to capture all necessary updates in this comment. We will need to make edits to both the library functions and the scripts. We can move this to a separate issue if necessary

General

Preprocessing script

  • Decide on the name of the zone ID column. This should be reflected in the activity_chains, boundaries, and travel_times data. The names should work regardless of layer (OA, MSOA, etc.). It would then be OK to hardcode these columns in the functions/scripts and check their existence using pandera. Initial idea:
    • activity_chains: zone_id
    • boundaries: zone_id
    • travel_times: from_id and to_id
  • Preprocess boundary layer (boundary preprocessing #52): Currently we load the boundary layer for the UK in each script and filter it (see here).
    We should do this once in a preprocessing script. Steps:
    • load in our boundary layer (currently OA or MSOA)
    • if boundary layer has a city column, we can use that to filter, and grab the city name from the config.
    • Otherwise we need to use another layer that can be subsetted to our desired region, and then do a spatial intersection with our boundary layer to crop out the region we want
  • Spatial join: Use the zone_id variable from the config instead of the hardcoded OA21CD here. This needs to be done for these scripts. Alternatively, do this join once at the beginning in a preprocessing script.
  • Preparing travel demand data in 3.2.2_assign_primary_zone_work: (see here)
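The boundary-filtering step above could look roughly like this. A minimal sketch only: the `filter_boundaries` name, the `city` column, and the plain-pandas types are assumptions; the real layer would be a geopandas GeoDataFrame, and the spatial-intersection fallback (e.g. `geopandas.sjoin`) is deliberately not implemented here.

```python
import pandas as pd

def filter_boundaries(boundaries: pd.DataFrame, region: str,
                      city_col: str = "city") -> pd.DataFrame:
    """Subset a national boundary layer to one study region.

    If the layer carries a city/region column, filter on it directly
    (the region name comes from the config). Otherwise the caller must
    fall back to a spatial intersection with a separate region layer
    (e.g. geopandas.sjoin), which is not sketched here.
    """
    if city_col in boundaries.columns:
        return boundaries[boundaries[city_col] == region].reset_index(drop=True)
    raise NotImplementedError("spatial intersection fallback required")

# Toy national layer standing in for the UK OA/MSOA boundaries
uk = pd.DataFrame({"zone_id": ["A", "B", "C"],
                   "city": ["Leeds", "London", "Leeds"]})
study_area = filter_boundaries(uk, "Leeds")
```

Running this once in a preprocessing script and writing `study_area` to disk would let every downstream script load the already-cropped layer.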

Config

  • Add the ["TravDay"] == 3 filter to the config - see here. We then need to remove this filter step from all scripts
  • Replace commute_level here. It could point to boundary_geography in the config
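Moving both values into the config might look like the following sketch. The key names (`nts_day_of_week`, `boundary_geography`) are hypothetical suggestions, not the actual acbm config schema.

```python
import pandas as pd

# Hypothetical config entries -- the real acbm config keys may differ.
config = {
    "nts_day_of_week": 3,         # replaces the hardcoded ["TravDay"] == 3 filter
    "boundary_geography": "MSOA",  # replaces the hardcoded commute_level
}

# Toy NTS trip table; the filter step now reads from config instead of
# being repeated with a literal 3 in every script.
nts_trips = pd.DataFrame({"TravDay": [1, 3, 3, 5],
                          "trip_id": [10, 11, 12, 13]})
nts_trips = nts_trips[nts_trips["TravDay"] == config["nts_day_of_week"]]
```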

Per script

  • scripts/2_match_households_and_individuals.py

  • scripts/3.1_assign_primary_feasible_zones.py

    • add_locations_to_activity_chains(). This currently uses an OA-level centroid layer. It also assumes that the activity_chains layer has an id column matching the centroid layer.
      • Get location from boundary layer instead: Get centroid of boundary layer -> Join using id column. Avoids use of extra layers
      • Decide whether to do this in a preprocessing script. It is done many times
    • Reading osm POI data: We are currently reading a static file here
      • Option 1: Point the user to osmox and tell them to get their POI data from there, and describe where to add it in the acbm directory
      • Option 2: Implement the osmox CLI step in the workflow (see Should we add osmox to the repo? #19). Save the POI layer with a generic name so it works for all scripts (i.e. not west_yorkshire_epsg_4326)
    • get_possible_zones()
      • Add zone id columns to schemas. This should be done after deciding what the zone_id column will be called; it will then be reflected in all schemas
      • _get_possible_zones() internal functions: Remove the hardcoded zone column names and replace them with either a parameter from the config or the generic zone_id name we decide on
  • scripts/3.2.1_assign_primary_zone_edu.py

  • scripts/3.2.2_assign_primary_zone_work.py

    • See section on preprocessing script for preparing commuting flow data
  • scripts/3.2.3_assign_secondary_zone.py

    • Replace hardcoded OA21CD here and here
    • edit modes to use: Should be done here
    • Remove hardcoded OA21CD here and here and here. DONE except for the middle one (tick when completed)
    • Same hardcoding issue with add_location()
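The de-hardcoding pattern for these scripts could be sketched as below. This is illustrative only: the signature of `add_location` here is a stand-in, not the real acbm implementation, and the plain `centroid` column stands in for `boundaries.geometry.centroid` in the actual geopandas layer.

```python
import pandas as pd

ZONE_ID = "zone_id"  # generic name decided once, shared by all scripts

def add_location(df: pd.DataFrame, boundaries: pd.DataFrame,
                 zone_id: str = ZONE_ID) -> pd.DataFrame:
    """Join zone centroids onto a table via a configurable zone id
    column instead of a hardcoded OA21CD."""
    centroids = boundaries[[zone_id, "centroid"]]
    return df.merge(centroids, on=zone_id, how="left")

boundaries = pd.DataFrame({"zone_id": ["A", "B"],
                           "centroid": [(0.0, 0.0), (1.0, 1.0)]})
chains = pd.DataFrame({"person_id": [1, 2, 3],
                       "zone_id": ["A", "B", "A"]})
chains = add_location(chains, boundaries)
```

With the column name defaulted from one shared constant (or read from the config), the same join works unchanged for OA, MSOA, or any other zone layer.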
