86 update run scripts #93

Open · wants to merge 32 commits into base: 65_scottish_reweighting

Conversation

@crispy-wonton (Collaborator) commented Dec 6, 2024

Fixes #86

Description

This PR changes the key run scripts in the repo pipeline so that they can all be run with the same input (the preprocessed and deduplicated EPC dataset), allowing them to be run in any order. This is meant to simplify and clarify how the full pipeline is run.

NB: I have also left comments on the files themselves to help you through the review. See below for a summary of the main and minor changes (not in order of importance).

Main changes:

  • pipeline/README.md - new instructions on how to run the simplified pipeline.
  • pipeline/prepare_features/garden_size.py - new function (taken from a previously reviewed script) to deduplicate garden size.
  • pipeline/prepare_features/land_area.py & pipeline/prepare_features/number_of_households.py - deleted. Functions refactored and moved to pipeline/prepare_features/property_density.py.
  • pipeline/prepare_features/output_areas.py - new function to load and clean the postcode-to-LSOA lookup. An existing function did this already, but it also loaded MSOA, RUC, and LAD columns, which are unnecessary for the use case this new function serves. Since the ONSPD is so big, I thought a separate, smaller function would be more efficient than processing extra columns and then dropping them.
  • pipeline/prepare_features/property_density.py - refactored to simplify the existing property density code and to combine the Scotland and England & Wales property density df generation into a single function.
  • pipeline/prepare_features/protected_areas.py - new function to generate a single df of protected areas from Scotland (world heritage sites) and England and Wales (building conservation areas).
  • pipeline/run_scripts/run_add_features.py & pipeline/run_scripts/run_calculate_garden_size.py - use the new and refactored functions; both scripts are refactored to take the preprocessed, deduplicated EPC dataset as input.
  • pipeline/run_scripts/run_calculate_suitability.py - calculate_suitability.py was refactored into 3 files, including this one. The whole file shows as a diff because it is newly created, but it contains code that has been previously reviewed. I have labelled the parts that have changed since the last review; the rest is unchanged.
  • pipeline/suitability/calculate_suitability.py - refactored for various reasons, as labelled in the file.

Minor changes:

  • getters/get_datasets.py - update documentation of some getters for clarity on the type of land area measurement.
  • analysis/exploratory/aggregate_build_year_per_la.py & analysis/exploratory/subset_epc_north_england.py & pipeline/run_scripts/run_generate_open_dataset.py & pipeline/run_scripts/run_process_garden_size.py - update calls to a renamed function.
  • pipeline/prepare_features/anchor_properties.py & pipeline/prepare_features/grid_capacity.py - update function outputs to polars DataFrames so that they can be joined to the main EPC DataFrame in run_add_features.py.
  • pipeline/prepare_features/lat_lon.py - minor correction to code.
  • pipeline/reweight_epc/prepare_sample.py - correction to code.
  • pipeline/prepare_features/off_gas.py - standardise docstring and column name.
  • pipeline/run_scripts/run_compute_epc_weights.py - updated to use the new ONSPD load-and-transform function from output_areas.py.
  • pipeline/suitability/scoring.py - moved scoring dicts to a separate file. These are unchanged, just moved to their own file.
  • utils/save_utils.py - updated save util to allow saving .parquet or .csv files; see the sketch below.
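
For illustration, a minimal sketch of the kind of extension dispatch this enables (hypothetical function name and signature; the repo's actual util may differ):

from pathlib import Path

import polars as pl


def save_df(df: pl.DataFrame, path: str) -> None:
    # Dispatch on the file extension to save as .parquet or .csv (illustrative only)
    suffix = Path(path).suffix
    if suffix == ".parquet":
        df.write_parquet(path)
    elif suffix == ".csv":
        df.write_csv(path)
    else:
        raise ValueError(f"Unsupported file extension: {suffix}")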

Instructions for Reviewer

In order to test the code in this PR you need to ...
There are instructions at the top of each run script explaining how to run it, as well as instructions in the pipeline/README, if you want to run any of them.
I have run the updated scripts and they seem to work for me.

Please pay special attention to ...

  • whether the new pipeline/README is clear and understandable. I'm not sure about the table format, so let me know if it doesn't work. I wanted to create something that could be easily updated. Let me know if I should change the order or anything else to make it clearer.
  • new functions
  • refactoring of existing functions, to ensure they produce the same output - especially the refactored suitability calculation functions
  • whether all run scripts have been appropriately refactored as required

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

…ctions into property_density.py

- add new function to generate property density for all of GB
…polars DataFrames so they can be joined to EPC
…A and postcode columns and rename existing function
…nstead of weighted EPC

- refactor property density code so that England and Wales and Scotland property density can be added in one function
- add LSOA, MSOA, LAD, and RUC indicator cols to dataset
- select specific required from anchor_properties and grid_capacity dfs to create final clean dataset
…ntain score dicts, and run_calculate_suitability.py run script
… protected areas (either building conservation or WHS)
…ith full dataset and other with deduplicated UPRNs
…duplicated EPC

add try/except block for loading microsoft building footprint files
@crispy-wonton (Collaborator, Author):

Easier to review this file by clicking the three dots in the top right and selecting 'View file'.

@@ -103,12 +102,12 @@ def load_gdf_and_process_poi() -> gpd.GeoDataFrame:
     return anchor_properties


-def identify_anchor_properties_gdf() -> gpd.GeoDataFrame:
+def identify_anchor_properties_df() -> pl.DataFrame:
@crispy-wonton (Collaborator, Author):

I changed the output of this function to a polars.DataFrame object so that it can be joined to the EPC dataset (which is also a polars.DataFrame) in the run_add_features.py script.
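
For context, a minimal sketch of one common way to make that conversion (not necessarily the approach used in this PR): drop the geometry column, since polars has no geometry dtype, and convert via pandas.

import geopandas as gpd
import polars as pl


def gdf_to_polars(gdf: gpd.GeoDataFrame) -> pl.DataFrame:
    # Drop the geometry column and convert the remaining attribute table
    return pl.from_pandas(gdf.drop(columns="geometry"))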

@crispy-wonton (Collaborator, Author):

Deleted this file, refactored the functions and moved them to property_density.py.

@crispy-wonton (Collaborator, Author):

Deleted this file, refactored the functions and moved them to property_density.py.

@crispy-wonton (Collaborator, Author):

Here I combined the functions that generate the in_building_conservation_area flag for UPRNs in England and Wales and the in_world_heritage_site flag for UPRNs in Scotland. I combined the output into a single feature, in_protected_area, to make it easier to use when calculating suitability.
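
As a rough sketch of the combination described here (toy data; the flag names follow the comment, everything else is illustrative):

import polars as pl

ew_df = pl.DataFrame({"uprn": [1, 2], "in_building_conservation_area": [True, False]})
s_df = pl.DataFrame({"uprn": [3], "in_world_heritage_site": [True]})

# Rename each nation's flag to a shared in_protected_area column and stack the frames
protected_df = pl.concat(
    [
        ew_df.select("uprn", pl.col("in_building_conservation_area").alias("in_protected_area")),
        s_df.select("uprn", pl.col("in_world_heritage_site").alias("in_protected_area")),
    ]
)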

@crispy-wonton (Collaborator, Author):

Uses the new and refactored functions; the script is refactored to take the preprocessed, deduplicated EPC dataset as input.

@crispy-wonton (Collaborator, Author):

calculate_suitability.py was refactored into 3 files, including this one. The whole file shows as a diff because it is newly created, but it contains code that has been previously reviewed. I have labelled the parts that have changed since the last review; the rest is unchanged.

from asf_heat_pump_suitability.pipeline.suitability import calculate_suitability


def parse_arguments():
@crispy-wonton (Collaborator, Author):

Parser function updated since last review to take the paths of the different input files.
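
A minimal sketch of the kind of parser described (hypothetical flag set, mirroring the --epc/--year/--quarter arguments mentioned elsewhere in this PR):

import argparse


def parse_arguments() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Calculate heat pump suitability per LSOA.")
    # Paths of the different input files (flag names are illustrative)
    parser.add_argument("--epc", type=str, required=True, help="Path to preprocessed, deduplicated EPC dataset")
    parser.add_argument("--year", "-y", type=int, required=True, help="EPC year")
    parser.add_argument("--quarter", "-q", type=int, required=True, help="EPC quarter")
    return parser.parse_args()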

@crispy-wonton (Collaborator, Author):

Updated to use the new ONSPD load-and-transform function from output_areas.py.

@crispy-wonton (Collaborator, Author):

These are unchanged, just moved to their own file.

@crispy-wonton (Collaborator, Author):

Save util updated to allow saving .parquet or .csv files.

crispy-wonton marked this pull request as ready for review on December 16, 2024 at 18:39.
@lizgzil (Collaborator) left a comment:

This looks really great @crispy-wonton - so much clearer! Well done - must have been a lot of gnarly refactoring! 🎉

I added some comments/questions.

I was wondering, when reading through everything again, whether all the suitability scoring assumptions are captured in a readme somewhere - thinking of the various filters which decide whether an LSOA is included, e.g. >=15 properties, >=4 features, minimum proportion of properties in the LSOA with weights.

I also had a comment on how we utilise the grid capacity feature.

I do find the readme tables slightly hard to parse since they render in a way that means I have to scroll across to read them all - so can't see them at a glance. But the info is there, so don't feel like you need to change these if you find them clearer this way.

I tried running python -i asf_heat_pump_suitability/pipeline/run_scripts/run_add_features.py --epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet -y 2023 -q 4 and got an error at the grid_capacities = grid_capacity.calculate_grid_capacity().select(["lsoa", "heatpump_installation_percentage"]) line. Seems to be something to do with the loading?

File "/Users/elizabethgallagher/Code/asf_heat_pump_suitability/asf_heat_pump_suitability/getters/get_dno_datasets.py", line 289, in generate_ssen_gdf
    sepd_shape = gpd.read_file(config["data_source"]["S_SSEN_sepd_bounds"])
  File "/Users/elizabethgallagher/miniconda3/envs/asf_heat_pump_suitability/lib/python3.10/site-packages/geopandas/io/file.py", line 294, in _read_file
    return _read_file_pyogrio(
  File "/Users/elizabethgallagher/miniconda3/envs/asf_heat_pump_suitability/lib/python3.10/site-packages/geopandas/io/file.py", line 547, in _read_file_pyogrio
    return pyogrio.read_dataframe(path_or_bytes, bbox=bbox, **kwargs)
  File "/Users/elizabethgallagher/miniconda3/envs/asf_heat_pump_suitability/lib/python3.10/site-packages/pyogrio/geopandas.py", line 261, in read_dataframe
    result = read_func(
  File "/Users/elizabethgallagher/miniconda3/envs/asf_heat_pump_suitability/lib/python3.10/site-packages/pyogrio/raw.py", line 196, in read
    return ogr_read(
  File "pyogrio/_io.pyx", line 1239, in pyogrio._io.ogr_read
  File "pyogrio/_io.pyx", line 219, in pyogrio._io.ogr_open
pyogrio.errors.DataSourceError: 404: Send failure: Broken pipe

Perhaps something to do with a package version I have? I need to log off soon, but I'll investigate more tomorrow - but wanted to get my comments out now as they are.

Comment on lines -15 to -21 (deleted code):
-[
-    pl.col("Number of households 2021")
-    .fill_null(0)
-    .alias("Number of households 2021"),
-    pl.col("Land Count (Area in KM2)")
-    .fill_null(0)
-    .alias("Land Count (Area in KM2)"),  # Fill with 1 to avoid division by zero
Collaborator:

how come you didn't need to do this in the new version?

@crispy-wonton (Collaborator, Author) commented Dec 19, 2024:

It's no longer needed because, when calculating property density, instead of calculating it for the whole EPC dataset, I only calculate it where sufficient data is available - i.e. where both household count and land area data are available. Household counts are joined to area data with an inner join. The minimum value for land area is 0.0097 km2, so we never divide anything by 0.

The resulting property density df is then left joined to the full EPC dataset. So where property density information is not available, there will be nulls in the final EPC dataset in this column.
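
To make that concrete, a minimal sketch with toy data (column names are illustrative):

import polars as pl

households_df = pl.DataFrame({"lsoa": ["A", "B"], "n_households": [100, 50]})
area_df = pl.DataFrame({"lsoa": ["A"], "land_area_km2": [2.0]})  # no area data for "B"
epc_df = pl.DataFrame({"uprn": [1, 2], "lsoa": ["A", "B"]})

# Inner join: only LSOAs with both household counts and land area survive,
# so the density denominator is never null or zero
density_df = households_df.join(area_df, how="inner", on="lsoa").with_columns(
    (pl.col("n_households") / pl.col("land_area_km2")).alias("property_density")
)

# Left join onto the full EPC dataset; LSOA "B" gets a null property_density
epc_df = epc_df.join(density_df.select("lsoa", "property_density"), how="left", on="lsoa")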

Collaborator:

This looks good and is clearer. I was confused about where some of the cleaning steps went though - don't you still have the division-by-0 issue?

Comment on lines 96 to 100:
 pl.when(
-    pl.col("heatpump_installation_percentage") > grid_installation_threshold
+    pl.col("heatpump_installation_percentage") >= grid_installation_threshold
 )
-.then(epc_threshold_scores.get(tech_type))
+.then(scoring.epc_threshold_scores.get(tech_type))
 .alias("grid_capacity_score"),
@lizgzil (Collaborator):

I think this needs updating since it uses the epc_threshold_scores dict (even though the values are the same).

So let's have another scores dict in scoring with:

grid_capacity_scores = {
    "ASHP_S": 1,
    "ASHP_N": 0,
    "GSHP_S": 1,
    "GSHP_N": 0,
    "SGL_S": 1,
    "SGL_N": 0,
    "HN_S": 0,
    "HN_N": 0,
}

Also, I like the idea of the 'grid_capacity_score' being relative to the actual heatpump_installation_percentage value, rather than a binary based on a threshold. So perhaps this is better:

((pl.col("heatpump_installation_percentage") / 100) * scoring.grid_capacity_scores.get(tech_type)).alias("grid_capacity_score")

[if that is the right polars syntax!]

What do you think?

@crispy-wonton (Collaborator, Author):

Thanks @lizgzil ! Great spot on the epc_threshold_scores. Weird that we didn't use the grid_capacity_scores dict before, because it already existed - not sure what happened! I've updated it now.

I've also updated the scoring calculation based on your suggestion, which I think is a good idea because it gives more nuance to the final scores. So now, if heatpump_installation_percentage is not null, the score is multiplied by the percentage. For this reason, we no longer require the grid_installation_threshold arg, so I have removed it.
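
For reference, a sketch of the resulting expression (parenthesised so that .alias applies to the whole expression rather than to the integer from the dict; toy data and an abridged scores dict):

import polars as pl

grid_capacity_scores = {"ASHP_S": 1, "ASHP_N": 0}  # abridged
tech_type = "ASHP_S"

df = pl.DataFrame({"heatpump_installation_percentage": [80.0, None]})
df = df.with_columns(
    (
        (pl.col("heatpump_installation_percentage") / 100)
        * grid_capacity_scores.get(tech_type)
    ).alias("grid_capacity_score")
)
# Nulls propagate: rows with a null percentage keep a null score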

@lizgzil (Collaborator) left a comment:

@crispy-wonton - when I ran it again today, the grid_capacity.calculate_grid_capacity() issue I had yesterday didn't come up again - so it must have just been an internet connection thing or something like that! 😌

I ran run_add_features in full (apart from saving) and it all ran, and the final epc_df looked good to me - just noticed some nulls that could be turned to False 👍

-grid_capacities = grid_capacity.calculate_grid_capacity()
+grid_capacities = grid_capacity.calculate_grid_capacity().select(
+    ["lsoa", "heatpump_installation_percentage"]
+)
 epc_df = epc_df.join(grid_capacities, how="left", on="lsoa")
@lizgzil (Collaborator):

Suggested change:
-epc_df = epc_df.join(grid_capacities, how="left", on="lsoa")
+epc_df = epc_df.join(grid_capacities, how="left", on="lsoa").with_columns(
+    pl.col("heatpump_installation_percentage").fill_null(False)
+)

@crispy-wonton (Collaborator, Author):

I think the nulls in heatpump_installation_percentage and has_anchor_property are true nulls, meaning we are missing the data to say what the HP installation percentage is and whether or not there are anchor properties. Therefore I think they should remain as nulls rather than being filled with False.

@shabrf is this correct?

Collaborator:

ah I see!

Collaborator:

For heat pump installations, the null values would be true nulls, as these appear when we don't have geo data for the substation distribution area which covers the LSOA, so the capacity available to that LSOA is unknown. For anchor properties, we can assume the points of interest dataset has good coverage, so null values here would indicate that the LSOA doesn't contain any of the specified anchor properties, so I think we should fill these nulls with False.
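
A sketch of that selective fill (toy data; only has_anchor_property is filled, while the installation percentage keeps its true nulls):

import polars as pl

epc_df = pl.DataFrame(
    {
        "lsoa": ["A", "B"],
        "has_anchor_property": [True, None],
        "heatpump_installation_percentage": [80.0, None],
    }
)

# Fill only the anchor property flag; leave the percentage column's nulls intact
epc_df = epc_df.with_columns(pl.col("has_anchor_property").fill_null(False))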

-anchor_properties_df = anchor_properties.identify_anchor_properties()
+anchor_properties_df = anchor_properties.identify_anchor_properties_df().select(
+    ["lsoa", "has_anchor_property"]
+)
 epc_df = epc_df.join(anchor_properties_df, how="left", on="lsoa")
@lizgzil (Collaborator):

Suggested change:
-epc_df = epc_df.join(anchor_properties_df, how="left", on="lsoa")
+epc_df = epc_df.join(anchor_properties_df, how="left", on="lsoa").with_columns(
+    pl.col("has_anchor_property").fill_null(False)
+)

Comment on lines +27 to +29 (added code):
+households_df = load_transform_df_n_households_ew()
+area_df = load_transform_df_land_area_ew()
+df = households_df.join(area_df, how="inner", on="lsoa")
@crispy-wonton (Collaborator, Author):

@lizgzil here is the inner join between households and area, which means we don't end up with any rows with a 0 value for area. Therefore, we don't need the code to prevent division by zero.

…un instructions to docstring of run_calculate_suitability.py
…es for use of weights and data required for at least 15 properties

- correct typo in pipeline/README
- correct notation in run_calculate_suitability.py
lsoa_df, lsoa_code
)
)
# Must have at least 15 properties to be included in final dataset
@crispy-wonton (Collaborator, Author):

@lizgzil I have now added a note about this to the main README based on your comment. Please check you're happy with it :)

…e checking if in WHS - this solves the issue of UPRNs being duplicated when concatenating england, wales, and scotland
…s after changing output df from pandas to polars
…lative to the actual heatpump_installation_percentage value, rather than a binary based off a threshold
@crispy-wonton (Collaborator, Author):
@lizgzil thank you for your review and comments.

I have addressed your comments and answered your questions. I've made 3 new issues based on your comments to address separately: #94 #95 #96

I have fixed the bug you found in protected_areas.py, updated the READMEs as requested, and changed the way we use the grid capacity feature for scoring.

I left a comment about filling null values with False. I think it's correct that we leave them as nulls, but I have requested @shabrf's input to double-check this.

@lizgzil (Collaborator) left a comment:

Thanks for all this @crispy-wonton - looks great!

Comment on lines 31 to 34
- `--epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet`
- `--year 2023`
- `--quarter 4`

@lizgzil (Collaborator):

thanks!

Comment on README.md, lines 64 to 73 (outdated):
averaged per property and weighted\* before finally aggregating to LSOA level. Note that an LSOA must have data for at
least 15 properties to be included in the final suitability per LSOA dataset.

\*Scores will only be weighted for an LSOA if the proportion of EPC properties in that LSOA that have a weight is above a
specified threshold - the default threshold (and the threshold we have used for our published results) is 50%. Individual
properties do not receive a weight if they are missing data required for weighting.

If the threshold is not met for a given LSOA, suitability scores for that LSOA will be unweighted and labelled as such.
Unweighted scores may not accurately represent the suitability of an LSOA for a given heating technology as a whole and
should therefore be interpreted with caution.
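
A rough sketch of the gating logic described above (hypothetical column names and toy data; not the repo's implementation):

import polars as pl

MIN_PROPERTIES = 15
WEIGHT_PROPORTION_THRESHOLD = 0.5  # default used for the published results

epc_df = pl.DataFrame({"lsoa": ["A"] * 20, "weight": [1.0] * 12 + [None] * 8})

lsoa_stats = epc_df.group_by("lsoa").agg(
    pl.len().alias("n_properties"),
    pl.col("weight").is_not_null().mean().alias("prop_weighted"),
)

# Keep LSOAs with data for >= 15 properties; flag whether scores can be weighted
lsoa_stats = lsoa_stats.filter(pl.col("n_properties") >= MIN_PROPERTIES).with_columns(
    (pl.col("prop_weighted") >= WEIGHT_PROPORTION_THRESHOLD).alias("use_weighted_scores")
)
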
@lizgzil (Collaborator):

thanks! this is really useful to have made a note of
