-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding file for calculating comparison of hn zones as well as plottin… #81
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @helloaidank.
I've reviewed your comparison_of_hn_zones.py
script and left comments. I need to come back and review the plotting script but wanted to publish these comments already so you can start scaling up :)
The code looks great! I do think we need to change the methodology slightly to exclude any small bits of LSOAs that might be skewing the error results because they look so similar for DESNZ pilot and non-pilot zones. And we want to make sure our results are representative of the same/similar locations so that we can assess them. If they still come out like this then maybe the features for HN model need to be improved!
Let me know if you have questions !
def main(): | ||
# Define the output directory and set up logging | ||
output_dir = os.path.join( | ||
PROJECT_DIR, "asf_heat_pump_suitability/analysis/hn_zones/output_data/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest moving the hn_zone dir to outputs/hn_zones
and then just save the results directly into that dir :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just on this, the outputs directory is in the .gitignore
, so I am unsure if this is the best place to put the hn_zones
directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine that the outputs go into a folder that's ignored by git. I don't think we want to push the results to github.
I can see you've change it in the new script. To clarify, I just meant to save the output files to the outputs
dir. I think the scripts themselves should stay in the analysis
folder. :) But perhaps I misunderstood your reasoning. What do you think?
Lastly, you could also save your outputs directly to S3 if you wanted. At least the .json
files, if not the plots. We could make a new evaluation
dir in the asf-heat-pump-suitability
bucket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, I misunderstood, I thought you meant moving the whole directory to the outputs folder. I will move it over after we've finished this review :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After weighing it up in my head, I do think it probably makes sense to have a version saved to s3 (so people can play around with the plotting scripts without needing to run the comparison script) but if you are iterating locally, then make sure you save the files locally because it takes ages otherwise hehe and is a bit annoying.
@@ -0,0 +1,350 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, this is great work. The code is clear and neat and I really like the logging. Great job!
Below are the logs from running the script. The results of average HN score for DESNZ pilot 0 and 1 seem very similar and with really high mean error rates. At least 0 has a lower score on avg than 1, but they are so close together and also both very central values at 0.52-0.61. If I saw those results out of context I don't think I would interpret that 0.52 is unsuitable for a HN but 0.61 really is.
It might be worth looking at the underlying data for the suitability features for these LSOAs and try to identify if they have anything in common that we seem to be misclassifying.
Looking at the relevant features for Heat Networks in the model, they are: not a listed building; not in building conservation zone (bear in mind a lot of properties will have a NAN flag here because of the patchiness of the conservation zone data. We assigned those not in a cons zone as NAN because we couldn't know if they were in one or if there was just cons data missing). ; off-gas; property in building with other properties (i.e. a flat); and in an urban area.
One reason that they have such similar scores could be that a lot of properties will score very similarly in a city even with variation between them. E.g. property density doesn't count for HN so there's no scale to density except urban/rural binary, and presumably all the LSOAs in this sample are urban. I'm guessing most properties in Liverpool will be not listed buildings and not off-gas(?). Those 3 features alone give a score of 0.47 (2.25/4.75). But still we would expect those on the higher scores to be around 0.9 if they are flats as well, which would put the score at 0.89 (4.25/4.75).
I think it might be true what you said in one of the stand ups, that small parts of LSOAs that have a tiny overlap with the DESNZ zones might be affecting the results. Maybe we need to set a threshold for the amount of overlap with DESNZ zone required to be considered in the analysis. We can start by setting it at 50% area (or 50% of OAs in an LSOA).
NFO: Loaded Liverpool with DESNZ HN LSOA data.
INFO: Created 133 records
INFO: Processed Nesta HP suitability scores for Liverpool.
INFO: number of LSOAs in DESNZ list not in HP suitability data: 2
INFO: Average HN_N_avg_score_weighted: 0.5608973509933775
INFO: Avg HN_N_avg_score_weighted for DESNZ_pilot_score = 1: 0.6132439024390244
INFO: Avg HN_N_avg_score_weighted for DESNZ_pilot_score = 0: 0.524927374301676
INFO: Mean Absolute Error (MAE) for DESNZ_pilot_score vs HN_N_avg_score_weighted: 0.46865231788079464
INFO: Mean Absolute Error (MAE) for DESNZ_pilot_score = 1: 0.3867560975609756
INFO: Mean Absolute Error (MAE) for DESNZ_pilot_score = 0: 0.524927374301676
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed the plot comparison code. It looks good and I don't see any issues or bugs. Mainly just a couple of suggestions for documentation.
I got 4 maps output from the script and they look the same as the ones you showed in the sprint review (see file names below). Great job with the heat mapping, it looks really good and the code is really clean and readable and simple! Really nice work!!
INPUT_DIR = os.path.join( | ||
PROJECT_DIR, "asf_heat_pump_suitability/analysis/hn_zones/output_data/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above - can move this to the main outputs dir :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, I meant moving the output_data
and output_plots
dirs only, not the scripts :)
asf_heat_pump_suitability/analysis/hn_zones/plot_comparison_of_hn_zones.py
Show resolved
Hide resolved
|
||
|
||
def merge_data( | ||
liverpool_hp_suitability_scores_pd: pd.DataFrame, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, you could shorten argument names and in-function variable names if you wanted. If you document what they represent well in the docstring, then you can use names like df
and gdf
here. I tend to do that in my code because I think it makes the code easier to read, because the lines then become a lot shorter.
Of course it's a matter of preference because your descriptive names can make it easier to understand what's going on in more complex pieces of code. So do whatever you prefer :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at least initially I like keeping the names descriptive especially on first iteration as I have a tendency to forget what each variable means! I'm also sure we can factor this down, as it will be for every LA rather than just Liverpool. But thinks like heat network (can abbreviate to hn as I'm very familiar with that by now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, go with what you find easier to use. As you've said we can shorten them later if necessary.
asf_heat_pump_suitability/analysis/hn_zones/plot_comparison_of_hn_zones.py
Show resolved
Hide resolved
asf_heat_pump_suitability/analysis/hn_zones/plot_comparison_of_hn_zones.py
Show resolved
Hide resolved
asf_heat_pump_suitability/analysis/hn_zones/plot_comparison_of_hn_zones.py
Show resolved
Hide resolved
asf_heat_pump_suitability/analysis/hn_zones/plot_comparison_of_hn_zones.py
Show resolved
Hide resolved
…ather than binary
…core vs desnz fraction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @crispy-wonton!
Thank you for your comments, I’ve tweaked the methodology so that we are looking at the fraction of area covered in an LSOA by the DESNZ HN zones. We have taken this fraction to be the ‘DESNZ HN pilot score’ and this is what we compare to the Nesta HN score per LSOA.
We also now output a parquet file with the avg Nesta HN s as a function of the DESNZ HN pilot score (fraction of the LSOA that is covered by a DESNZ HN zone).
I’ve also moved this all to the ‘output’ directory as I mentioned, but we probably should put it somewhere else because I keep on getting the git .ignore warning.
In the plotting script, I’ve added a plot of avg Nesta HN vs DESNZ HN pilot zone and also perform and plot a linear regression and show the R^{2} on it.
Thanks for taking another look and for your really useful comments!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @helloaidank ,
This looks great thank you!
I have mainly suggested some cosmetic changes and updates to documentation and a couple of places to simplify the inputs of functions.
There is one point on outputs/hn_zones/comparison_of_hn_zones.py
that suggests to use overlay
instead of sjoin
with an intersection. I think it will be easier to apply. Let me know if you have questions about this.
Secondly, sorry my point in the previous review was not clear. I had meant for you to save the outputs of the script to the outputs
directory in the repo, but I think the scripts themselves should be in the analysis
folder where you originally had them.
Having said this, to make the final review easier and quicker, I would suggest keeping the files in the same place for now as you make further edits to the code. This means when I do my final review, I can review only the changes since this review, rather than the entire file showing up as diffs (due to it being moved to a new folder). If there's a way to do this on Github anyway that I don't know about , please let me know! After the final review, we can then move the files back to the analysis
folder.
Lastly, I think you could create a new folder in S3, called evaluation/liverpool_heat_networks
or something, and save your intermediate (and also final!) outputs to there if you wanted. I think this is better than saving locally because then the whole project team has direct access to the results :)
Great work! I ran your code and it produced the expected outputs and was clearly documented.
Let me know if you have any questions :)
|
||
Returns: | ||
Tuple[pl.DataFrame, float, float]: | ||
- Updated DataFrame with 'DESNZ_pilot_fraction' column added (fraction of area covered by HN zones). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Updated DataFrame with 'DESNZ_pilot_fraction' column added (fraction of area covered by HN zones). | |
- Updated DataFrame with 'DESNZ_pilot_fraction' column added (fraction of local authority area covered by HN zones). |
Is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep that's correct!
def main(): | ||
# Define the output directory and set up logging | ||
output_dir = os.path.join( | ||
PROJECT_DIR, "asf_heat_pump_suitability/analysis/hn_zones/output_data/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine that the outputs go into a folder that's ignored by git. I don't think we want to push the results to github.
I can see you've change it in the new script. To clarify, I just meant to save the output files to the outputs
dir. I think the scripts themselves should stay in the analysis
folder. :) But perhaps I misunderstood your reasoning. What do you think?
Lastly, you could also save your outputs directly to S3 if you wanted. At least the .json
files, if not the plots. We could make a new evaluation
dir in the asf-heat-pump-suitability
bucket.
INPUT_DIR = os.path.join( | ||
PROJECT_DIR, "asf_heat_pump_suitability/analysis/hn_zones/output_data/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, I meant moving the output_data
and output_plots
dirs only, not the scripts :)
|
||
|
||
def merge_data( | ||
liverpool_hp_suitability_scores_pd: pd.DataFrame, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, go with what you find easier to use. As you've said we can shorten them later if necessary.
coefficients = np.polyfit(x, y, 1) | ||
model = np.poly1d(coefficients) | ||
predicted_y = model(x) | ||
r_squared = r2_score(y, predicted_y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't used numpy
for linear models before, cool!
# Break mark on the y-axis | ||
d = 0.015 | ||
kwargs = dict(transform=ax.transAxes, color="k", clip_on=False) | ||
ax.plot((-d, +d), (-d, +d), **kwargs) | ||
ax.plot((-d, +d), (1 - d, 1 + d), **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice!
ax.plot((-d, +d), (1 - d, 1 + d), **kwargs) | ||
|
||
ax.set_title( | ||
"Average Nesta HN Score vs Fraction of DESNZ HN zone in LSOA", fontsize=14 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Average Nesta HN Score vs Fraction of DESNZ HN zone in LSOA", fontsize=14 | |
"Average Nesta HN Score vs Fraction of LSOA in DESNZ Heat Network zone", fontsize=14 |
Should it be this way round? Otherwise lmk because I got confused!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the fraction of the DESNZ HN zone in the LSOA. We are increasing the threshold, so as we increase the amount of DESNZ HN zone in the LSOA, our average Nesta HN score. This is what we expect intuitively!
ax.text( | ||
0.05, | ||
0.95, | ||
f"$R^2 = {r_squared:.2f}$", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my understanding, what does the dollar sign do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's LaTeX notation, entering math mode!
and log the result. | ||
|
||
Args: | ||
hp_suitability_scores (pl.DataFrame): DataFrame containing the DESNZ pilot (actual) and Nesta HN suitability score (predicted) columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's really helpful to label them as 'actual' and 'predicted' here! :)
Hi there, I've added in some of the changes that you suggested! The code for evaluation has become quite lengthy, so I think in the next PR I will likely scale to the other LAs but also modularise the code a bit across a few different scripts. Hopefully this will improve things in terms of readability! Thanks for taking the time to go through the code again and making suggestions, very useful ones, especially the overlay one. I am happy to merge the PR and then make the changes I mentioned above in another one, once we're obviously happy with the code. |
Description
All files are contained within the
asf_heat_pump_suitability/analysis/hn_zones
directory.The script
comparison_of_hn_zones.py
processes and compares DESNZ heat network (HN) to Nesta's heat pump suitability data for Liverpool, and outputs some files for plotting.Steps taken:
Outputs:
The script
comparison_of_hn_zones.py
visualises heat network (HN) and heat pump suitability data for Liverpool.Steps taken:
Fixes #80
Instructions for Reviewer
In order to test the code in this PR you need to run the following commands:
python asf_heat_pump_suitability/analysis/hn_zones/comparison_of_hn_zones.py
This should output a few data files to a directory called
output_data
, please check that they are there. If you could see how long this script will take as we have to scale it to ~20 other local authorities. For me it takes 30s-1 minute to run, which seems OK as we likely won't be running this repeatedly (i.e. it's good enough) but any major efficiency improvements are always welcome.Then to visualise some of this data, run the following:
python asf_heat_pump_suitability/analysis/hn_zones/plot_comparison_of_hn_zones.py
This will output some visuals to
output_plots
for the Liverpool LA.Please pay special attention to whether the outputs/data we are getting seem sensible and that the MAE checks out, this is the primary motivation for this PR.
Two things to be aware of regarding the methodology and visuals:
For the next evolution of this script, we will scale it to the other 20 LAs we have and I think it's probably a good call to put these file paths into a config file. I am also curious as to where you think this should live in the repo!
Thanks so much for taking the time to review! It's much appreciated.
Checklist:
notebooks/
pre-commit
and addressed any issues not automatically fixeddev
README
s