# 86 update run scripts #93

Base: `65_scottish_reweighting`

## Conversation
Commits:

- … coord columns instead of hard-coding
- …ctions into `property_density.py` - add new function to generate property density for all of GB
- …polars DataFrames so they can be joined to EPC
- …A and postcode columns and rename existing function
- …o adding MSOA, RUC indicator, LAD
- …nstead of weighted EPC - refactor property density code so that England and Wales and Scotland property density can be added in one function - add LSOA, MSOA, LAD, and RUC indicator cols to dataset - select specific required columns from anchor_properties and grid_capacity dfs to create final clean dataset
- …ntain score dicts, and `run_calculate_suitability.py` run script
- …n exploratory notebooks and run scripts
- … protected areas (either building conservation or WHS)
- …areas for all nations
- …ith full dataset and other with deduplicated UPRNs
- …duplicated EPC - add try/except block for loading microsoft building footprint files
---
Easier to review this file by clicking the three dots in the top right and selecting 'View file'.
```diff
@@ -103,12 +102,12 @@ def load_gdf_and_process_poi() -> gpd.GeoDataFrame:
     return anchor_properties


-def identify_anchor_properties_gdf() -> gpd.GeoDataFrame:
+def identify_anchor_properties_df() -> pl.DataFrame:
```
I changed the output of this function to a `polars.DataFrame` object so that it can be joined to the EPC dataset (which is also a `polars.DataFrame`) in the `run_add_features.py` script.
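The general pattern is something like the following sketch (illustrative only, with dummy data - not the exact code in this PR): drop the geometry column, convert to plain pandas, then to polars, at which point the result joins cleanly onto the EPC DataFrame.

```python
import geopandas as gpd
import pandas as pd
import polars as pl
from shapely.geometry import Point

# Dummy stand-ins purely for illustration.
anchor_gdf = gpd.GeoDataFrame(
    {"lsoa": ["E01000001"], "has_anchor_property": [True]},
    geometry=[Point(0.0, 51.5)],
)
epc_df = pl.DataFrame({"uprn": [1], "lsoa": ["E01000001"]})

# Drop the geometry and convert to polars so it joins onto EPC.
anchor_df = pl.from_pandas(pd.DataFrame(anchor_gdf.drop(columns="geometry")))
epc_df = epc_df.join(anchor_df, how="left", on="lsoa")
```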
---
Deleted this file, refactored the functions and moved them to `property_density.py`.
---
Deleted this file, refactored the functions and moved them to `property_density.py`.
---
Here I combined the functions to generate the `in_building_conservation_area` flag for UPRNs in England and Wales, and the `in_world_heritage_site` flag for UPRNs in Scotland. I combined the output into one single feature, `in_protected_area`, to make it easier to use in calculating suitability.
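Conceptually the combination amounts to something like this sketch (dummy data and column selections assumed - the real logic lives in `protected_areas.py`):

```python
import polars as pl

# Dummy inputs purely for illustration.
ew_df = pl.DataFrame({"uprn": [1], "in_building_conservation_area": [True]})
scot_df = pl.DataFrame({"uprn": [2], "in_world_heritage_site": [False]})

# Rename each nation's flag to the shared name, then stack into one df.
ew = ew_df.select("uprn", pl.col("in_building_conservation_area").alias("in_protected_area"))
scot = scot_df.select("uprn", pl.col("in_world_heritage_site").alias("in_protected_area"))
protected_areas = pl.concat([ew, scot])
```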
---
Uses new and refactored functions; the script is refactored to take the preprocessed, deduplicated EPC dataset as input.
---
`calculate_suitability.py` refactored into 3 files, including this one. The whole file shows as a diff because it is newly created, but it contains code that has been previously reviewed. I have labelled the parts that have changed since the last review; the rest is unchanged.
```python
from asf_heat_pump_suitability.pipeline.suitability import calculate_suitability


def parse_arguments():
```
Parser function updated since last review to take the paths of the different input files.
---
Updated to use the new ONSPD load transform function from `output_areas.py`.
---
These are unchanged, just moved to their own file.
---
Save util updated to allow saving `.parquet` or `.csv` files.
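For example, the dispatch could look roughly like this (a sketch - the actual util's name and signature in `utils/save_utils.py` may differ):

```python
import polars as pl


def save_df(df: pl.DataFrame, path: str) -> None:
    """Write df to .parquet or .csv based on the file extension."""
    if path.endswith(".parquet"):
        df.write_parquet(path)
    elif path.endswith(".csv"):
        df.write_csv(path)
    else:
        raise ValueError(f"Unsupported file extension: {path}")
```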
---
This looks really great @crispy-wonton - so much clearer! Well done - must have been a lot of gnarly refactoring! 🎉
I added some comments/questions.
I was wondering when reading through everything again whether all the suitability scoring assumptions are captured in a readme somewhere? - thinking of the various filters which decide whether an LSOA is included - e.g. >=15 properties, >=4 features, min proportion of properties in an LSOA with weights.
I also had a comment on how we utilise the grid capacity feature.
I do find the readme tables slightly hard to parse since they render in a way that means I have to scroll across to read them all - so can't see them at a glance. But the info is there, so don't feel like you need to change these if you find them clearer this way.
I tried running `python -i asf_heat_pump_suitability/pipeline/run_scripts/run_add_features.py --epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet -y 2023 -q 4` and got an error at the `grid_capacities = grid_capacity.calculate_grid_capacity().select(["lsoa", "heatpump_installation_percentage"])` line. Seems to be something to do with the loading?
```
  File "/Users/elizabethgallagher/Code/asf_heat_pump_suitability/asf_heat_pump_suitability/getters/get_dno_datasets.py", line 289, in generate_ssen_gdf
    sepd_shape = gpd.read_file(config["data_source"]["S_SSEN_sepd_bounds"])
  File "/Users/elizabethgallagher/miniconda3/envs/asf_heat_pump_suitability/lib/python3.10/site-packages/geopandas/io/file.py", line 294, in _read_file
    return _read_file_pyogrio(
  File "/Users/elizabethgallagher/miniconda3/envs/asf_heat_pump_suitability/lib/python3.10/site-packages/geopandas/io/file.py", line 547, in _read_file_pyogrio
    return pyogrio.read_dataframe(path_or_bytes, bbox=bbox, **kwargs)
  File "/Users/elizabethgallagher/miniconda3/envs/asf_heat_pump_suitability/lib/python3.10/site-packages/pyogrio/geopandas.py", line 261, in read_dataframe
    result = read_func(
  File "/Users/elizabethgallagher/miniconda3/envs/asf_heat_pump_suitability/lib/python3.10/site-packages/pyogrio/raw.py", line 196, in read
    return ogr_read(
  File "pyogrio/_io.pyx", line 1239, in pyogrio._io.ogr_read
  File "pyogrio/_io.pyx", line 219, in pyogrio._io.ogr_open
pyogrio.errors.DataSourceError: 404: Send failure: Broken pipe
```
Perhaps something to do with a package version I have? I need to log off soon, but I'll investigate more tomorrow - but wanted to get my comments out now as they are.
```python
[
    pl.col("Number of households 2021")
    .fill_null(0)
    .alias("Number of households 2021"),
    pl.col("Land Count (Area in KM2)")
    .fill_null(0)
    .alias("Land Count (Area in KM2)"),  # Fill with 1 to avoid division by zero
]
```
how come you didn't need to do this in the new version?
It's no longer needed because when calculating property density, instead of calculating it for the whole EPC dataset, I only calculate it where sufficient data is available - i.e. where both household count and land area data are available. Household counts are joined to area data with an `inner` join. The minimum value for land area is 0.0097 km², so we don't end up dividing anything by 0.

The resulting property density df is then left joined to the full EPC dataset, so where property density information is not available there will be nulls in this column of the final EPC dataset.
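As a rough illustration of that join logic (dummy data; column names assumed):

```python
import polars as pl

# Dummy inputs purely for illustration.
households_df = pl.DataFrame({"lsoa": ["A", "B"], "n_households": [100, 50]})
area_df = pl.DataFrame({"lsoa": ["A"], "land_area_km2": [0.5]})  # no area for "B"
epc_df = pl.DataFrame({"uprn": [1, 2], "lsoa": ["A", "B"]})

# Inner join keeps only LSOAs with both inputs, so area is never 0 or missing.
density_df = households_df.join(area_df, how="inner", on="lsoa").with_columns(
    (pl.col("n_households") / pl.col("land_area_km2")).alias("property_density")
)

# Left join reintroduces nulls for LSOAs without density data ("B" here).
epc_df = epc_df.join(
    density_df.select(["lsoa", "property_density"]), how="left", on="lsoa"
)
```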
---
this looks good and clearer. I was confused where some of the cleaning steps went though - don't you still have the division by 0 issue?
```diff
 pl.when(
-    pl.col("heatpump_installation_percentage") > grid_installation_threshold
+    pl.col("heatpump_installation_percentage") >= grid_installation_threshold
 )
-.then(epc_threshold_scores.get(tech_type))
+.then(scoring.epc_threshold_scores.get(tech_type))
 .alias("grid_capacity_score"),
```
I think this needs updating since it uses the `epc_threshold_scores` scores (even though they are the same). So let's have another scores dict in scoring with:
```python
grid_capacity_scores = {
    "ASHP_S": 1,
    "ASHP_N": 0,
    "GSHP_S": 1,
    "GSHP_N": 0,
    "SGL_S": 1,
    "SGL_N": 0,
    "HN_S": 0,
    "HN_N": 0,
}
```
Also I like the idea of the `grid_capacity_score` being relative to the actual `heatpump_installation_percentage` value, rather than a binary based off a threshold. So perhaps this is better:

```python
(pl.col("heatpump_installation_percentage") / 100) * scoring.grid_capacity_scores.get(tech_type).alias("grid_capacity_score")
```

[if that is the right polars syntax!]

What do you think?
---
Thanks @lizgzil! Great spot on the `epc_threshold_scores`. Weird that we didn't use the `grid_capacity_scores` dict before because it did already exist, so not sure what happened! I've updated it now.

I've also updated the scoring calculation based on your suggestion, which I think is a good idea because it gives more nuance in the final scores. So now, if the `heatpump_installation_percentage` is not null, the score will be multiplied by the %. For this reason, we no longer require the `grid_installation_threshold` arg, so I have removed it.
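The updated expression presumably looks something like this reconstruction (not a verbatim excerpt; note the outer parentheses so `.alias` applies to the whole product rather than to the score lookup):

```python
import polars as pl

tech_type = "ASHP_S"
grid_capacity_scores = {"ASHP_S": 1}  # stand-in for scoring.grid_capacity_scores

expr = (
    (pl.col("heatpump_installation_percentage") / 100)
    * grid_capacity_scores.get(tech_type)
).alias("grid_capacity_score")
```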
---
@crispy-wonton - when I ran it again today the `grid_capacity.calculate_grid_capacity()` issue I had yesterday didn't come up again - so it must have just been an internet connection thing or something like that! 😌

I ran `run_add_features` in full (apart from saving) and it all ran, and the final epc_df looked good to me - just noticed some nulls that could be turned to False 👍
```diff
-grid_capacities = grid_capacity.calculate_grid_capacity()
+grid_capacities = grid_capacity.calculate_grid_capacity().select(
+    ["lsoa", "heatpump_installation_percentage"]
+)
 epc_df = epc_df.join(grid_capacities, how="left", on="lsoa")
```
Suggested change:

```diff
-epc_df = epc_df.join(grid_capacities, how="left", on="lsoa")
+epc_df = epc_df.join(grid_capacities, how="left", on="lsoa").with_columns(
+    pl.col("heatpump_installation_percentage").fill_null(False)
+)
```
---
I think the nulls in `heatpump_installation_percentage` and `has_anchor_property` are true nulls. Meaning I think we are missing the data to say what their HP installation percentage is and whether or not they have anchor properties. Therefore I think they should remain as nulls rather than being filled with False.

@shabrf is this correct?
---
ah I see!
---
For heat pump installations the null values would be true nulls, as these appear when we don't have geo data for the substation distribution area which covers the LSOA, so the capacity available to that LSOA is unknown. For anchor properties, we can assume the points of interest dataset has good coverage, so null values here would indicate that the LSOA doesn't contain any of the specified anchor properties within it, so I think we should fill these nulls with False.
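Following that reasoning, only the anchor property flag gets filled - a sketch with dummy data, using the column names from this thread:

```python
import polars as pl

epc_df = pl.DataFrame(
    {
        "lsoa": ["A", "B"],
        "heatpump_installation_percentage": [80.0, None],  # true nulls: keep
        "has_anchor_property": [True, None],  # null means "none found": fill
    }
)
epc_df = epc_df.with_columns(pl.col("has_anchor_property").fill_null(False))
```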
```diff
-anchor_properties_df = anchor_properties.identify_anchor_properties()
+anchor_properties_df = anchor_properties.identify_anchor_properties_df().select(
+    ["lsoa", "has_anchor_property"]
+)
 epc_df = epc_df.join(anchor_properties_df, how="left", on="lsoa")
```
Suggested change:

```diff
-epc_df = epc_df.join(anchor_properties_df, how="left", on="lsoa")
+epc_df = epc_df.join(anchor_properties_df, how="left", on="lsoa").with_columns(
+    pl.col("has_anchor_property").fill_null(False)
+)
```
---
…arden_size.py Co-authored-by: Liz G <[email protected]>
```python
households_df = load_transform_df_n_households_ew()
area_df = load_transform_df_land_area_ew()
df = households_df.join(area_df, how="inner", on="lsoa")
```
@lizgzil here is the inner join between households and area, which means we don't end up with any rows with a 0 value for area. Therefore, we don't need the code to prevent zero divisions.
…un instructions to docstring of run_calculate_suitability.py
…es for use of weights and data required for at least 15 properties - correct typo in pipeline/README - correct notation in run_calculate_suitability.py
```python
        lsoa_df, lsoa_code
    )
)
# Must have at least 15 properties to be included in final dataset
```
@lizgzil I have now added a note about this to the main `README` based on your comment. Please check you're happy with it :)
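For reference, the filter amounts to something like this (illustrative only; `property_count` is an assumed column name):

```python
import polars as pl

lsoa_df = pl.DataFrame({"lsoa": ["A", "B"], "property_count": [20, 9]})
lsoa_df = lsoa_df.filter(pl.col("property_count") >= 15)  # keeps only "A"
```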
…e checking if in WHS - this solves the issue of UPRNs being duplicated when concatenating england, wales, and scotland
…s after changing output df from pandas to polars
…lative to the actual heatpump_installation_percentage value, rather than a binary based off a threshold
…lculate suitability score
@lizgzil thank you for your review and comments. I have addressed your comments and answered your questions. I've made 3 new issues based on your comments to address separately: #94 #95 #96. I have fixed the bug you found, and I left a comment about filling NaN values with `False`.
---
Thanks for all this @crispy-wonton - looks great!
- `--epc s3://asf-daps/lakehouse/processed/epc/old/deduplicated/processed_dedupl-0.parquet`
- `--year 2023`
- `--quarter 4`
thanks!
From `README.md`:
> averaged per property and weighted\* before finally aggregating to LSOA level. Note that an LSOA must have data for at least 15 properties to be included in the final suitability per LSOA dataset.
>
> \*Scores will only be weighted for an LSOA if the proportion of EPC properties in that LSOA that have a weight is above a specified threshold - the default threshold (and the threshold we have used for our published results) is 50%. Individual properties do not receive a weight if they are missing data required for weighting.
>
> If the threshold is not met for a given LSOA, suitability scores for that LSOA will be unweighted and labelled as such. Unweighted scores may not accurately represent the suitability of an LSOA for a given heating technology as a whole and should therefore be interpreted with caution.
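Expressed as a sketch (dummy data; column names assumed - not the repo's actual code), the 50% rule might look like:

```python
import polars as pl

epc_df = pl.DataFrame(
    {"lsoa": ["A", "A", "B", "B"], "weight": [1.2, None, None, None]}
)

# Share of EPC properties in each LSOA with a weight decides the label.
epc_df = epc_df.with_columns(
    pl.when(pl.col("weight").is_not_null().mean().over("lsoa") >= 0.5)
    .then(pl.lit("weighted"))
    .otherwise(pl.lit("unweighted"))
    .alias("weighting_label")
)
```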
thanks! this is really useful to have made a note of
Fixes #86

## Description
This PR changes the key run scripts in the repo pipeline so that they can all be run with the same inputs (preprocessed & deduplicated EPC dataset), allowing them to be run out of order. This is meant to simplify and clarify the running of the full pipeline.
NB: I have also left comments on the files themselves to help you through the review. See below a summary of the main and minor changes (not in any order of importance).
Main changes:

- `pipeline/README.md` - new instructions on how to run simplified pipeline.
- `pipeline/prepare_features/garden_size.py` - new function (taken from previously reviewed script) to deduplicate garden size.
- `pipeline/prepare_features/land_area.py` & `pipeline/prepare_features/number_of_households.py` - deleted. Functions refactored and moved to `pipeline/prepare_features/property_density.py`.
- `pipeline/prepare_features/output_areas.py` - new function to load and clean postcode to LSOA lookup. We had a function that did this already but it also loaded MSOA, RUC, LAD columns which are unnecessary for the use case in which this new function was applied. Since the ONS PD is so big, I thought it would be more efficient to have a separate smaller function, rather than process more columns and then drop them.
- `pipeline/prepare_features/property_density.py` - refactoring to simplify existing property density code and combine Scotland with England and Wales property density df generation into a single function.
- `pipeline/prepare_features/protected_areas.py` - new function to generate single df of protected areas from Scotland (world heritage sites) and England and Wales (building conservation areas).
- `pipeline/run_scripts/run_add_features.py` & `pipeline/run_scripts/run_calculate_garden_size.py` - use new and refactored functions; scripts refactored to take preprocessed deduplicated EPC as input.
- `pipeline/run_scripts/run_calculate_suitability.py` - `calculate_suitability.py` refactored into 3 files, including this one. The whole file shows as a diff because it is newly created, but it contains code that has been previously reviewed. I have labelled the parts that have changed since the last review; the rest is unchanged.
- `pipeline/suitability/calculate_suitability.py` - refactored for various reasons as labelled on file.

Minor changes:

- `getters/get_datasets.py` - update documentation of some getters for clarity on type of land area measurement.
- `analysis/exploratory/aggregate_build_year_per_la.py`, `analysis/exploratory/subset_epc_north_england.py`, `pipeline/run_scripts/run_generate_open_dataset.py` & `pipeline/run_scripts/run_process_garden_size.py` - update call to renamed function.
- `pipeline/prepare_features/anchor_properties.py` & `pipeline/prepare_features/grid_capacity.py` - update outputs of functions to `polars` dataframes so that outputs can be joined to main EPC dataframe in `run_add_features.py`.
- `pipeline/prepare_features/lat_lon.py` - minor correction to code.
- `pipeline/reweight_epc/prepare_sample.py` - correction to code.
- `pipeline/prepare_features/off_gas.py` - standardise doc string and column name.
- `pipeline/run_scripts/run_compute_epc_weights.py` - updated to use new ONSPD load transform function from `output_areas.py`.
- `pipeline/suitability/scoring.py` - moved scoring dicts to separate file. These are unchanged, just moved to their own file.
- `utils/save_utils.py` - save util updates to allow saving `.parquet` or `.csv` files.

## Instructions for Reviewer
In order to test the code in this PR you need to ...
There are instructions at the top of each run script for how to run the code, as well as instructions in the pipeline/README if you want to run any of them.
I have run the updated scripts and they seem to work for me.
Please pay special attention to ...
Checklist:

- Code refactored out of `notebooks/`
- Ran `pre-commit` and addressed any issues not automatically fixed
- Merged latest changes from `dev`
- Updated `README`s