Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding file for calculating comparison of hn zones as well as plottin… #81

Open
wants to merge 6 commits into
base: dev
Choose a base branch
from

Conversation

helloaidank
Copy link
Collaborator

@helloaidank helloaidank commented Nov 6, 2024


Description

All files are contained within the asf_heat_pump_suitability/analysis/hn_zones directory.

The script comparison_of_hn_zones.py processes and compares DESNZ heat network (HN) to Nesta's heat pump suitability data for Liverpool, and outputs some files for plotting.

Steps taken:

  1. Load DESNZ HN GeoData and perform a spatial join with LSOA polygons.
  2. Process Nesta heat pump suitability scores for Liverpool.
  3. Identify LSOAs in DESNZ HN list not in HP suitability data.
  4. Calculate and log average scores and Mean Absolute Error (MAE).
  5. Save processed data to files.

Outputs:

  • 'liverpool_with_desnz_hn_lsoa.gpkg': GeoDataFrame with LSOA codes added to HN zones.
  • 'liverpool_hp_suitability_lsoas.json': List of unique LSOA codes in Liverpool.
  • 'liverpool_hp_suitability_scores_with_desnz.parquet': Processed suitability scores with DESNZ pilot scores and Mean Absolute Error (MAE).
  • 'script_output.log': Log file with metrics such as avg HN scores and MAE

The script comparison_of_hn_zones.py visualises heat network (HN) and heat pump suitability data for Liverpool.
Steps taken:

  1. Load and preprocess data:
    • Read Parquet and GeoPackage files.
    • Extract unique LSOA geometries.
    • Load LSOA geometries and convert CRS if needed.
    • Filter Liverpool LSOAs.
  2. Plot Liverpool LSOA geometries and save as PNG.
  3. Merge suitability scores with geometries and convert to GeoDataFrame.
  4. Plot and save overlay of DESNZ Pilot Score = 1 on Liverpool heat network zones.
  5. Define and use a function to plot absolute error maps by DESNZ Pilot Score, saving the plots in the output plots directory.

Fixes #80

Instructions for Reviewer

In order to test the code in this PR you need to run the following commands:
python asf_heat_pump_suitability/analysis/hn_zones/comparison_of_hn_zones.py

This should output a few data files to a directory called output_data, please check that they are there. If you could see how long this script will take as we have to scale it to ~20 other local authorities. For me it takes 30s-1 minute to run, which seems OK as we likely won't be running this repeatedly (i.e. it's good enough) but any major efficiency improvements are always welcome.

Then to visualise some of this data, run the following:
python asf_heat_pump_suitability/analysis/hn_zones/plot_comparison_of_hn_zones.py
This will output some visuals to output_plots for the Liverpool LA.

Please pay special attention to whether the outputs/data we are getting seem sensible and that the MAE checks out, this is the primary motivation for this PR.

Two things to be aware of regarding the methodology and visuals:

  • For the DESNZ Liverpool HN zones, a couple of the LSOAs are actually outside the LA of Liverpool and at the moment these are not taken into account in the script. There could be a case to include them in an exception but for the moment I have left them out.
  • The shape file with the UK LSOAs geometries that I do read in include land and water, whereas the DESNZ HN Zones are only for land. In terms of the absolute error, this does not make a difference (I don't think), but it does mean that the visuals for the DESNZ HN Zones look different to the Liverpool LSOA map as shown by the overlay plot.

For the next evolution of this script, we will scale it to the other 20 LAs we have and I think it's probably a good call to put these file paths into a config file. I am also curious as to where you think this should live in the repo!

Thanks so much for taking the time to review! It's much appreciated.

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@helloaidank helloaidank marked this pull request as ready for review November 7, 2024 09:29
Copy link
Collaborator

@crispy-wonton crispy-wonton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @helloaidank.
I've reviewed your comparison_of_hn_zones.py script and left comments. I need to come back and review the plotting script but wanted to publish these comments already so you can start scaling up :)

The code looks great! I do think we need to change the methodology slightly to exclude any small bits of LSOAs that might be skewing the error results because they look so similar for DESNZ pilot and non-pilot zones. And we want to make sure our results are representative of the same/similar locations so that we can assess them. If they still come out like this then maybe the features for HN model need to be improved!
Let me know if you have questions !

def main():
# Define the output directory and set up logging
output_dir = os.path.join(
PROJECT_DIR, "asf_heat_pump_suitability/analysis/hn_zones/output_data/"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest moving the hn_zone dir to outputs/hn_zones and then just save the results directly into that dir :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just on this, the outputs directory is in the .gitignore, so I am unsure if this is the best place to put the hn_zones directory?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine that the outputs go into a folder that's ignored by git. I don't think we want to push the results to github.

I can see you've change it in the new script. To clarify, I just meant to save the output files to the outputs dir. I think the scripts themselves should stay in the analysis folder. :) But perhaps I misunderstood your reasoning. What do you think?

Lastly, you could also save your outputs directly to S3 if you wanted. At least the .json files, if not the plots. We could make a new evaluation dir in the asf-heat-pump-suitability bucket.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, I misunderstood, I thought you meant moving the whole directory to the outputs folder. I will move it over after we've finished this review :).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After weighing it up in my head, I do think it probably makes sense to have a version saved to s3 (so people can play around with the plotting scripts without needing to run the comparison script) but if you are iterating locally, then make sure you save the files locally because it takes ages otherwise hehe and is a bit annoying.

@@ -0,0 +1,350 @@
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, this is great work. The code is clear and neat and I really like the logging. Great job!

Below are the logs from running the script. The results of average HN score for DESNZ pilot 0 and 1 seem very similar and with really high mean error rates. At least 0 has a lower score on avg than 1, but they are so close together and also both very central values at 0.52-0.61. If I saw those results out of context I don't think I would interpret that 0.52 is unsuitable for a HN but 0.61 really is.

It might be worth looking at the underlying data for the suitability features for these LSOAs and try to identify if they have anything in common that we seem to be misclassifying.

Looking at the relevant features for Heat Networks in the model, they are: not a listed building; not in building conservation zone (bear in mind a lot of properties will have a NAN flag here because of the patchiness of the conservation zone data. We assigned those not in a cons zone as NAN because we couldn't know if they were in one or if there was just cons data missing). ; off-gas; property in building with other properties (i.e. a flat); and in an urban area.

One reason that they have such similar scores could be that a lot of properties will score very similarly in a city even with variation between them. E.g. property density doesn't count for HN so there's no scale to density except urban/rural binary, and presumably all the LSOAs in this sample are urban. I'm guessing most properties in Liverpool will be not listed buildings and not off-gas(?). Those 3 features alone give a score of 0.47 (2.25/4.75). But still we would expect those on the higher scores to be around 0.9 if they are flats as well, which would put the score at 0.89 (4.25/4.75).

I think it might be true what you said in one of the stand ups, that small parts of LSOAs that have a tiny overlap with the DESNZ zones might be affecting the results. Maybe we need to set a threshold for the amount of overlap with DESNZ zone required to be considered in the analysis. We can start by setting it at 50% area (or 50% of OAs in an LSOA).

NFO: Loaded Liverpool with DESNZ HN LSOA data.
INFO: Created 133 records
INFO: Processed Nesta HP suitability scores for Liverpool.
INFO: number of LSOAs in DESNZ list not in HP suitability data: 2
INFO: Average HN_N_avg_score_weighted: 0.5608973509933775
INFO: Avg HN_N_avg_score_weighted for DESNZ_pilot_score = 1: 0.6132439024390244
INFO: Avg HN_N_avg_score_weighted for DESNZ_pilot_score = 0: 0.524927374301676
INFO: Mean Absolute Error (MAE) for DESNZ_pilot_score vs HN_N_avg_score_weighted: 0.46865231788079464
INFO: Mean Absolute Error (MAE) for DESNZ_pilot_score = 1: 0.3867560975609756
INFO: Mean Absolute Error (MAE) for DESNZ_pilot_score = 0: 0.524927374301676

Copy link
Collaborator

@crispy-wonton crispy-wonton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the plot comparison code. It looks good and I don't see any issues or bugs. Mainly just a couple of suggestions for documentation.

I got 4 maps output from the script and they look the same as the ones you showed in the sprint review (see file names below). Great job with the heat mapping, it looks really good and the code is really clean and readable and simple! Really nice work!!

image

Comment on lines +33 to +34
INPUT_DIR = os.path.join(
PROJECT_DIR, "asf_heat_pump_suitability/analysis/hn_zones/output_data/"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above - can move this to the main outputs dir :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I meant moving the output_data and output_plots dirs only, not the scripts :)



def merge_data(
liverpool_hp_suitability_scores_pd: pd.DataFrame,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, you could shorten argument names and in-function variable names if you wanted. If you document what they represent well in the docstring, then you can use names like df and gdf here. I tend to do that in my code because I think it makes the code easier to read, because the lines then become a lot shorter.

Of course it's a matter of preference because your descriptive names can make it easier to understand what's going on in more complex pieces of code. So do whatever you prefer :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at least initially I like keeping the names descriptive especially on first iteration as I have a tendency to forget what each variable means! I'm also sure we can factor this down, as it will be for every LA rather than just Liverpool. But thinks like heat network (can abbreviate to hn as I'm very familiar with that by now).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, go with what you find easier to use. As you've said we can shorten them later if necessary.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @crispy-wonton!

Thank you for your comments, I’ve tweaked the methodology so that we are looking at the fraction of area covered in an LSOA by the DESNZ HN zones. We have taken this fraction to be the ‘DESNZ HN pilot score’ and this is what we compare to the Nesta HN score per LSOA.

We also now output a parquet file with the avg Nesta HN s as a function of the DESNZ HN pilot score (fraction of the LSOA that is covered by a DESNZ HN zone).
I’ve also moved this all to the ‘output’ directory as I mentioned, but we probably should put it somewhere else because I keep on getting the git .ignore warning.

In the plotting script, I’ve added a plot of avg Nesta HN vs DESNZ HN pilot zone and also perform and plot a linear regression and show the R^{2} on it.

Thanks for taking another look and for your really useful comments!

Copy link
Collaborator

@crispy-wonton crispy-wonton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @helloaidank ,

This looks great thank you!
I have mainly suggested some cosmetic changes and updates to documentation and a couple of places to simplify the inputs of functions.

There is one point on outputs/hn_zones/comparison_of_hn_zones.py that suggests to use overlay instead of sjoin with an intersection. I think it will be easier to apply. Let me know if you have questions about this.

Secondly, sorry my point in the previous review was not clear. I had meant for you to save the outputs of the script to the outputs directory in the repo, but I think the scripts themselves should be in the analysis folder where you originally had them.

Having said this, to make the final review easier and quicker, I would suggest keeping the files in the same place for now as you make further edits to the code. This means when I do my final review, I can review only the changes since this review, rather than the entire file showing up as diffs (due to it being moved to a new folder). If there's a way to do this on Github anyway that I don't know about , please let me know! After the final review, we can then move the files back to the analysis folder.

Lastly, I think you could create a new folder in S3, called evaluation/liverpool_heat_networks or something, and save your intermediate (and also final!) outputs to there if you wanted. I think this is better than saving locally because then the whole project team has direct access to the results :)

Great work! I ran your code and it produced the expected outputs and was clearly documented.

Let me know if you have any questions :)


Returns:
Tuple[pl.DataFrame, float, float]:
- Updated DataFrame with 'DESNZ_pilot_fraction' column added (fraction of area covered by HN zones).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Updated DataFrame with 'DESNZ_pilot_fraction' column added (fraction of area covered by HN zones).
- Updated DataFrame with 'DESNZ_pilot_fraction' column added (fraction of local authority area covered by HN zones).

Is this correct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that's correct!

def main():
# Define the output directory and set up logging
output_dir = os.path.join(
PROJECT_DIR, "asf_heat_pump_suitability/analysis/hn_zones/output_data/"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine that the outputs go into a folder that's ignored by git. I don't think we want to push the results to github.

I can see you've change it in the new script. To clarify, I just meant to save the output files to the outputs dir. I think the scripts themselves should stay in the analysis folder. :) But perhaps I misunderstood your reasoning. What do you think?

Lastly, you could also save your outputs directly to S3 if you wanted. At least the .json files, if not the plots. We could make a new evaluation dir in the asf-heat-pump-suitability bucket.

Comment on lines +33 to +34
INPUT_DIR = os.path.join(
PROJECT_DIR, "asf_heat_pump_suitability/analysis/hn_zones/output_data/"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I meant moving the output_data and output_plots dirs only, not the scripts :)



def merge_data(
liverpool_hp_suitability_scores_pd: pd.DataFrame,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, go with what you find easier to use. As you've said we can shorten them later if necessary.

Comment on lines +300 to +303
coefficients = np.polyfit(x, y, 1)
model = np.poly1d(coefficients)
predicted_y = model(x)
r_squared = r2_score(y, predicted_y)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't used numpy for linear models before, cool!

Comment on lines +330 to +334
# Break mark on the y-axis
d = 0.015
kwargs = dict(transform=ax.transAxes, color="k", clip_on=False)
ax.plot((-d, +d), (-d, +d), **kwargs)
ax.plot((-d, +d), (1 - d, 1 + d), **kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

ax.plot((-d, +d), (1 - d, 1 + d), **kwargs)

ax.set_title(
"Average Nesta HN Score vs Fraction of DESNZ HN zone in LSOA", fontsize=14
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Average Nesta HN Score vs Fraction of DESNZ HN zone in LSOA", fontsize=14
"Average Nesta HN Score vs Fraction of LSOA in DESNZ Heat Network zone", fontsize=14

Should it be this way round? Otherwise lmk because I got confused!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the fraction of the DESNZ HN zone in the LSOA. We are increasing the threshold, so as we increase the amount of DESNZ HN zone in the LSOA, our average Nesta HN score. This is what we expect intuitively!

ax.text(
0.05,
0.95,
f"$R^2 = {r_squared:.2f}$",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, what does the dollar sign do here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's LaTeX notation, entering math mode!

and log the result.

Args:
hp_suitability_scores (pl.DataFrame): DataFrame containing the DESNZ pilot (actual) and Nesta HN suitability score (predicted) columns.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really helpful to label them as 'actual' and 'predicted' here! :)

@helloaidank
Copy link
Collaborator Author

Hi there, I've added in some of the changes that you suggested!

The code for evaluation has become quite lengthy, so I think in the next PR I will likely scale to the other LAs but also modularise the code a bit across a few different scripts. Hopefully this will improve things in terms of readability!

Thanks for taking the time to go through the code again and making suggestions, very useful ones, especially the overlay one.

I am happy to merge the PR and then make the changes I mentioned above in another one, once we're obviously happy with the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Comparing the HN zones selected by DESNZ to the HP suitability dataset
2 participants