86 update run scripts #93

Open · wants to merge 32 commits into base: 65_scottish_reweighting

Conversation

@crispy-wonton (Collaborator) commented Dec 6, 2024

Fixes #86

Description

This PR changes the key run scripts in the repo pipeline so that they can all be run with the same input (the preprocessed and deduplicated EPC dataset), allowing them to be run in any order. This is meant to simplify and clarify how the full pipeline is run.

NB: I have also left comments on the files themselves to help you through the review. See below for a summary of the main and minor changes (not in order of importance).

Main changes:

  • pipeline/README.md - new instructions on how to run the simplified pipeline.
  • pipeline/prepare_features/garden_size.py - new function (taken from a previously reviewed script) to deduplicate garden size.
  • pipeline/prepare_features/land_area.py & pipeline/prepare_features/number_of_households.py - deleted. Functions refactored and moved to pipeline/prepare_features/property_density.py.
  • pipeline/prepare_features/output_areas.py - new function to load and clean the postcode-to-LSOA lookup. An existing function did this already, but it also loaded MSOA, RUC, and LAD columns, which are unnecessary for the use case this new function serves. Since the ONSPD is so big, I thought a separate, smaller function would be more efficient than processing extra columns and then dropping them.
  • pipeline/prepare_features/property_density.py - refactored to simplify the existing property density code and to combine the Scotland and England & Wales property density df generation into a single function.
  • pipeline/prepare_features/protected_areas.py - new function to generate a single df of protected areas from Scotland (world heritage sites) and England and Wales (building conservation areas).
  • pipeline/run_scripts/run_add_features.py & pipeline/run_scripts/run_calculate_garden_size.py - use the new and refactored functions; both scripts are refactored to take the preprocessed, deduplicated EPC dataset as input.
  • pipeline/run_scripts/run_calculate_suitability.py - calculate_suitability.py was refactored into 3 files, including this one. The whole file shows as a diff because it is newly created, but it contains code that has been previously reviewed. I have labelled the parts that have changed since the last review; the rest is unchanged.
  • pipeline/suitability/calculate_suitability.py - refactored for various reasons, as labelled in the file.

Minor changes:

  • getters/get_datasets.py - update documentation of some getters for clarity on the type of land area measurement.
  • analysis/exploratory/aggregate_build_year_per_la.py & analysis/exploratory/subset_epc_north_england.py & pipeline/run_scripts/run_generate_open_dataset.py & pipeline/run_scripts/run_process_garden_size.py - update calls to a renamed function.
  • pipeline/prepare_features/anchor_properties.py & pipeline/prepare_features/grid_capacity.py - update function outputs to polars DataFrames so that they can be joined to the main EPC DataFrame in run_add_features.py.
  • pipeline/prepare_features/lat_lon.py - minor correction to code.
  • pipeline/reweight_epc/prepare_sample.py - correction to code.
  • pipeline/prepare_features/off_gas.py - standardise docstring and column name.
  • pipeline/run_scripts/run_compute_epc_weights.py - updated to use the new ONSPD load-and-transform function from output_areas.py.
  • pipeline/suitability/scoring.py - moved scoring dicts to a separate file. These are unchanged, just moved to their own file.
  • utils/save_utils.py - updated save util to allow saving .parquet or .csv files; see the sketch below.
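
For illustration, a minimal sketch of the kind of extension dispatch this enables (hypothetical function name and signature; the repo's actual util may differ):

from pathlib import Path

import polars as pl


def save_df(df: pl.DataFrame, path: str) -> None:
    # Dispatch on the file extension to save as .parquet or .csv (illustrative only)
    suffix = Path(path).suffix
    if suffix == ".parquet":
        df.write_parquet(path)
    elif suffix == ".csv":
        df.write_csv(path)
    else:
        raise ValueError(f"Unsupported file extension: {suffix}")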

Instructions for Reviewer

In order to test the code in this PR you need to ...
There are instructions at the top of each run script explaining how to run it, as well as instructions in the pipeline/README, if you want to run any of them.
I have run the updated scripts and they seem to work for me.

Please pay special attention to ...

  • whether the new pipeline/README is clear and understandable. I'm not sure about the table format, so let me know if it doesn't work. I wanted to create something that could be easily updated. Let me know if I should change the order or anything else to make it clearer.
  • new functions
  • refactoring of existing functions, to ensure they produce the same output - especially the refactored suitability calculation functions
  • whether all run scripts have been appropriately refactored as required

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

…ctions into property_density.py

- add new function to generate property density for all of GB
…polars DataFrames so they can be joined to EPC
…A and postcode columns and rename existing function
…nstead of weighted EPC

- refactor property density code so that England and Wales and Scotland property density can be added in one function
- add LSOA, MSOA, LAD, and RUC indicator cols to dataset
- select specific required from anchor_properties and grid_capacity dfs to create final clean dataset
…ntain score dicts, and run_calculate_suitability.py run script
… protected areas (either building conservation or WHS)
…ith full dataset and other with deduplicated UPRNs
…duplicated EPC

add try/except block for loading microsoft building footprint files
@crispy-wonton (Collaborator, Author):

Easier to review this file by clicking the three dots in the top right and selecting 'View file'.

@@ -103,12 +102,12 @@ def load_gdf_and_process_poi() -> gpd.GeoDataFrame:
     return anchor_properties


-def identify_anchor_properties_gdf() -> gpd.GeoDataFrame:
+def identify_anchor_properties_df() -> pl.DataFrame:
@crispy-wonton (Collaborator, Author):

I changed the output of this function to a polars.DataFrame object so that it can be joined to the EPC dataset (which is also a polars.DataFrame) in the run_add_features.py script.
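
For context, a minimal sketch of one common way to make that conversion (not necessarily the approach used in this PR): drop the geometry column, since polars has no geometry dtype, and convert via pandas.

import geopandas as gpd
import polars as pl


def gdf_to_polars(gdf: gpd.GeoDataFrame) -> pl.DataFrame:
    # Drop the geometry column and convert the remaining attribute table
    return pl.from_pandas(gdf.drop(columns="geometry"))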

@crispy-wonton (Collaborator, Author):

Deleted this file, refactored the functions and moved them to property_density.py.

@crispy-wonton (Collaborator, Author):

Deleted this file, refactored the functions and moved them to property_density.py.

@crispy-wonton (Collaborator, Author):

Here I combined the functions that generate the in_building_conservation_area flag for UPRNs in England and Wales and the in_world_heritage_site flag for UPRNs in Scotland. I combined the output into a single feature, in_protected_area, to make it easier to use when calculating suitability.
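
As a rough sketch of the combination described here (toy data; the flag names follow the comment, everything else is illustrative):

import polars as pl

ew_df = pl.DataFrame({"uprn": [1, 2], "in_building_conservation_area": [True, False]})
s_df = pl.DataFrame({"uprn": [3], "in_world_heritage_site": [True]})

# Rename each nation's flag to a shared in_protected_area column and stack the frames
protected_df = pl.concat(
    [
        ew_df.select("uprn", pl.col("in_building_conservation_area").alias("in_protected_area")),
        s_df.select("uprn", pl.col("in_world_heritage_site").alias("in_protected_area")),
    ]
)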

@crispy-wonton (Collaborator, Author):

Uses the new and refactored functions; the script is refactored to take the preprocessed, deduplicated EPC dataset as input.

@crispy-wonton (Collaborator, Author):

calculate_suitability.py was refactored into 3 files, including this one. The whole file shows as a diff because it is newly created, but it contains code that has been previously reviewed. I have labelled the parts that have changed since the last review; the rest is unchanged.

from asf_heat_pump_suitability.pipeline.suitability import calculate_suitability


def parse_arguments():
@crispy-wonton (Collaborator, Author):

Parser function updated since last review to take the paths of the different input files.
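
A minimal sketch of the kind of parser described (hypothetical flag set, mirroring the --epc/--year/--quarter arguments mentioned elsewhere in this PR):

import argparse


def parse_arguments() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Calculate heat pump suitability per LSOA.")
    # Paths of the different input files (flag names are illustrative)
    parser.add_argument("--epc", type=str, required=True, help="Path to preprocessed, deduplicated EPC dataset")
    parser.add_argument("--year", "-y", type=int, required=True, help="EPC year")
    parser.add_argument("--quarter", "-q", type=int, required=True, help="EPC quarter")
    return parser.parse_args()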

@crispy-wonton (Collaborator, Author):

Updated to use the new ONSPD load-and-transform function from output_areas.py.

@crispy-wonton (Collaborator, Author):

These are unchanged, just moved to their own file.

@crispy-wonton (Collaborator, Author):

Save util updated to allow saving .parquet or .csv files.

crispy-wonton marked this pull request as ready for review on December 16, 2024 at 18:39.
@lizgzil (Collaborator) left a comment:

This looks really great @crispy-wonton - so much clearer! Well done - must have been a lot of gnarly refactoring! 🎉

I added some comments/questions.

I was wondering, when reading through everything again, whether all the suitability scoring assumptions are captured in a readme somewhere - thinking of the various filters which decide whether an LSOA is included, e.g. >=15 properties, >=4 features, minimum proportion of properties in the LSOA with weights.

I also had a comment on how we utilise the grid capacity feature.

I do find the readme tables slightly hard to parse since they render in a way that means I have to scroll across to read them all - so can't see them at a glance. But the info is there, so don't feel like you need to change these if you find them clearer this way.

I tried running python -i asf_heat_pump_suitability/pipeline/run_scripts/run_add_features.py --epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet -y 2023 -q 4 and got an error at the grid_capacities = grid_capacity.calculate_grid_capacity().select(["lsoa", "heatpump_installation_percentage"]) line. Seems to be something to do with the loading?

File "/Users/elizabethgallagher/Code/asf_heat_pump_suitability/asf_heat_pump_suitability/getters/get_dno_datasets.py", line 289, in generate_ssen_gdf
    sepd_shape = gpd.read_file(config["data_source"]["S_SSEN_sepd_bounds"])
  File "/Users/elizabethgallagher/miniconda3/envs/asf_heat_pump_suitability/lib/python3.10/site-packages/geopandas/io/file.py", line 294, in _read_file
    return _read_file_pyogrio(
  File "/Users/elizabethgallagher/miniconda3/envs/asf_heat_pump_suitability/lib/python3.10/site-packages/geopandas/io/file.py", line 547, in _read_file_pyogrio
    return pyogrio.read_dataframe(path_or_bytes, bbox=bbox, **kwargs)
  File "/Users/elizabethgallagher/miniconda3/envs/asf_heat_pump_suitability/lib/python3.10/site-packages/pyogrio/geopandas.py", line 261, in read_dataframe
    result = read_func(
  File "/Users/elizabethgallagher/miniconda3/envs/asf_heat_pump_suitability/lib/python3.10/site-packages/pyogrio/raw.py", line 196, in read
    return ogr_read(
  File "pyogrio/_io.pyx", line 1239, in pyogrio._io.ogr_read
  File "pyogrio/_io.pyx", line 219, in pyogrio._io.ogr_open
pyogrio.errors.DataSourceError: 404: Send failure: Broken pipe

Perhaps something to do with a package version I have? I need to log off soon, but I'll investigate more tomorrow - but wanted to get my comments out now as they are.

Comment on lines -15 to -21 (deleted code):
-[
-    pl.col("Number of households 2021")
-    .fill_null(0)
-    .alias("Number of households 2021"),
-    pl.col("Land Count (Area in KM2)")
-    .fill_null(0)
-    .alias("Land Count (Area in KM2)"),  # Fill with 1 to avoid division by zero
Collaborator:

how come you didn't need to do this in the new version?

@crispy-wonton (Collaborator, Author) commented Dec 19, 2024:

It's no longer needed because, when calculating property density, instead of calculating it for the whole EPC dataset, I only calculate it where sufficient data is available - i.e. where both household count and land area data are available. Household counts are joined to area data with an inner join. The minimum value for land area is 0.0097 km2, so we never divide anything by 0.

The resulting property density df is then left joined to the full EPC dataset. So where property density information is not available, there will be nulls in the final EPC dataset in this column.
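
To make that concrete, a minimal sketch with toy data (column names are illustrative):

import polars as pl

households_df = pl.DataFrame({"lsoa": ["A", "B"], "n_households": [100, 50]})
area_df = pl.DataFrame({"lsoa": ["A"], "land_area_km2": [2.0]})  # no area data for "B"
epc_df = pl.DataFrame({"uprn": [1, 2], "lsoa": ["A", "B"]})

# Inner join: only LSOAs with both household counts and land area survive,
# so the density denominator is never null or zero
density_df = households_df.join(area_df, how="inner", on="lsoa").with_columns(
    (pl.col("n_households") / pl.col("land_area_km2")).alias("property_density")
)

# Left join onto the full EPC dataset; LSOA "B" gets a null property_density
epc_df = epc_df.join(density_df.select("lsoa", "property_density"), how="left", on="lsoa")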

Collaborator:

This looks good and is clearer. I was confused about where some of the cleaning steps went though - don't you still have the division-by-0 issue?

Comment on lines 96 to 100:
 pl.when(
-    pl.col("heatpump_installation_percentage") > grid_installation_threshold
+    pl.col("heatpump_installation_percentage") >= grid_installation_threshold
 )
-.then(epc_threshold_scores.get(tech_type))
+.then(scoring.epc_threshold_scores.get(tech_type))
 .alias("grid_capacity_score"),
@lizgzil (Collaborator):

I think this needs updating since it uses the epc_threshold_scores dict (even though the values are the same).

So let's have another scores dict in scoring with:

grid_capacity_scores = {
    "ASHP_S": 1,
    "ASHP_N": 0,
    "GSHP_S": 1,
    "GSHP_N": 0,
    "SGL_S": 1,
    "SGL_N": 0,
    "HN_S": 0,
    "HN_N": 0,
}

Also, I like the idea of the 'grid_capacity_score' being relative to the actual heatpump_installation_percentage value, rather than a binary based on a threshold. So perhaps this is better:

((pl.col("heatpump_installation_percentage") / 100) * scoring.grid_capacity_scores.get(tech_type)).alias("grid_capacity_score")

[if that is the right polars syntax!]

What do you think?

@crispy-wonton (Collaborator, Author):

Thanks @lizgzil ! Great spot on the epc_threshold_scores. Weird that we didn't use the grid_capacity_scores dict before, because it already existed - not sure what happened! I've updated it now.

I've also updated the scoring calculation based on your suggestion, which I think is a good idea because it gives more nuance to the final scores. So now, if heatpump_installation_percentage is not null, the score is multiplied by the percentage. For this reason, we no longer require the grid_installation_threshold arg, so I have removed it.
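
For reference, a sketch of the resulting expression (parenthesised so that .alias applies to the whole expression rather than to the integer from the dict; toy data and an abridged scores dict):

import polars as pl

grid_capacity_scores = {"ASHP_S": 1, "ASHP_N": 0}  # abridged
tech_type = "ASHP_S"

df = pl.DataFrame({"heatpump_installation_percentage": [80.0, None]})
df = df.with_columns(
    (
        (pl.col("heatpump_installation_percentage") / 100)
        * grid_capacity_scores.get(tech_type)
    ).alias("grid_capacity_score")
)
# Nulls propagate: rows with a null percentage keep a null score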

@lizgzil (Collaborator) left a comment:

@crispy-wonton - when I ran it again today, the grid_capacity.calculate_grid_capacity() issue I had yesterday didn't come up again - so it must have just been an internet connection thing or something like that! 😌

I ran run_add_features in full (apart from saving) and it all ran, and the final epc_df looked good to me - just noticed some nulls that could be turned to False 👍

-grid_capacities = grid_capacity.calculate_grid_capacity()
+grid_capacities = grid_capacity.calculate_grid_capacity().select(
+    ["lsoa", "heatpump_installation_percentage"]
+)
 epc_df = epc_df.join(grid_capacities, how="left", on="lsoa")
@lizgzil (Collaborator):

Suggested change:
-epc_df = epc_df.join(grid_capacities, how="left", on="lsoa")
+epc_df = epc_df.join(grid_capacities, how="left", on="lsoa").with_columns(
+    pl.col("heatpump_installation_percentage").fill_null(False)
+)

@crispy-wonton (Collaborator, Author):

I think the nulls in heatpump_installation_percentage and has_anchor_property are true nulls, meaning we are missing the data to say what the HP installation percentage is and whether or not there are anchor properties. Therefore I think they should remain as nulls rather than being filled with False.

@shabrf is this correct?

Collaborator:

ah I see!

Collaborator:

For heat pump installations, the null values would be true nulls, as these appear when we don't have geo data for the substation distribution area which covers the LSOA, so the capacity available to that LSOA is unknown. For anchor properties, we can assume the points of interest dataset has good coverage, so null values here would indicate that the LSOA doesn't contain any of the specified anchor properties, so I think we should fill these nulls with False.
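
A sketch of that selective fill (toy data; only has_anchor_property is filled, while the installation percentage keeps its true nulls):

import polars as pl

epc_df = pl.DataFrame(
    {
        "lsoa": ["A", "B"],
        "has_anchor_property": [True, None],
        "heatpump_installation_percentage": [80.0, None],
    }
)

# Fill only the anchor property flag; leave the percentage column's nulls intact
epc_df = epc_df.with_columns(pl.col("has_anchor_property").fill_null(False))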

-anchor_properties_df = anchor_properties.identify_anchor_properties()
+anchor_properties_df = anchor_properties.identify_anchor_properties_df().select(
+    ["lsoa", "has_anchor_property"]
+)
 epc_df = epc_df.join(anchor_properties_df, how="left", on="lsoa")
@lizgzil (Collaborator):

Suggested change:
-epc_df = epc_df.join(anchor_properties_df, how="left", on="lsoa")
+epc_df = epc_df.join(anchor_properties_df, how="left", on="lsoa").with_columns(
+    pl.col("has_anchor_property").fill_null(False)
+)

Comment on lines +27 to +29 (added code):
+households_df = load_transform_df_n_households_ew()
+area_df = load_transform_df_land_area_ew()
+df = households_df.join(area_df, how="inner", on="lsoa")
@crispy-wonton (Collaborator, Author):

@lizgzil here is the inner join between households and area, which means we don't end up with any rows with a 0 value for area. Therefore, we don't need the code to prevent division by zero.

…un instructions to docstring of run_calculate_suitability.py
…es for use of weights and data required for at least 15 properties

- correct typo in pipeline/README
- correct notation in run_calculate_suitability.py
lsoa_df, lsoa_code
)
)
# Must have at least 15 properties to be included in final dataset
@crispy-wonton (Collaborator, Author):

@lizgzil I have now added a note about this to the main README based on your comment. Please check you're happy with it :)

…e checking if in WHS - this solves the issue of UPRNs being duplicated when concatenating england, wales, and scotland
…s after changing output df from pandas to polars
…lative to the actual heatpump_installation_percentage value, rather than a binary based off a threshold
@crispy-wonton (Collaborator, Author):
@lizgzil thank you for your review and comments.

I have addressed your comments and answered your questions. I've made 3 new issues based on your comments to address separately: #94 #95 #96

I have fixed the bug you found in protected_areas.py, updated the READMEs as requested, and changed the way we use the grid capacity feature for scoring.

I left a comment about filling null values with False. I think it's correct that we leave them as nulls, but I have requested @shabrf's input to double-check this.

@lizgzil (Collaborator) left a comment:

Thanks for all this @crispy-wonton - looks great!

Comment on lines 31 to 34
- `--epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet`
- `--year 2023`
- `--quarter 4`

@lizgzil (Collaborator):

thanks!

Comment on README.md, lines 64 to 73 (outdated):
averaged per property and weighted\* before finally aggregating to LSOA level. Note that an LSOA must have data for at
least 15 properties to be included in the final suitability per LSOA dataset.

\*Scores will only be weighted for an LSOA if the proportion of EPC properties in that LSOA that have a weight is above a
specified threshold - the default threshold (and the threshold we have used for our published results) is 50%. Individual
properties do not receive a weight if they are missing data required for weighting.

If the threshold is not met for a given LSOA, suitability scores for that LSOA will be unweighted and labelled as such.
Unweighted scores may not accurately represent the suitability of an LSOA for a given heating technology as a whole and
should therefore be interpreted with caution.
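
A rough sketch of the gating logic described above (hypothetical column names and toy data; not the repo's implementation):

import polars as pl

MIN_PROPERTIES = 15
WEIGHT_PROPORTION_THRESHOLD = 0.5  # default used for the published results

epc_df = pl.DataFrame({"lsoa": ["A"] * 20, "weight": [1.0] * 12 + [None] * 8})

lsoa_stats = epc_df.group_by("lsoa").agg(
    pl.len().alias("n_properties"),
    pl.col("weight").is_not_null().mean().alias("prop_weighted"),
)

# Keep LSOAs with data for >= 15 properties; flag whether scores can be weighted
lsoa_stats = lsoa_stats.filter(pl.col("n_properties") >= MIN_PROPERTIES).with_columns(
    (pl.col("prop_weighted") >= WEIGHT_PROPORTION_THRESHOLD).alias("use_weighted_scores")
)
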
@lizgzil (Collaborator):

thanks! this is really useful to have made a note of
