Splink comparison viewer barplot and waterfall chart don't agree on match probability #2529

Open
2 tasks done
francisduval opened this issue Nov 27, 2024 · 2 comments
Labels
bug Something isn't working

Comments


francisduval commented Nov 27, 2024

What happens?

Hello!

The bar plot and the waterfall chart in the Splink comparison viewer don't seem to agree on the match probability, even though they report the same match weight, which doesn't seem right. The problem appears in the tutorial here:

https://moj-analytical-services.github.io/splink/demos/tutorials/06_Visualising_predictions.html

If you filter the bar plot so that only bars with at least 10 records are shown, then click on the bar with gamma_concat: 4, 4, 0, 0, -1, you can see that the estimated match probability is 82.8%, while the match weight is -1.17. If you then look at the waterfall chart, the estimated match probability is 0.308 (with the same match weight of -1.17).

I have the same issue when running the comparison viewer with my own model and dataset.

Edit: it looks like the waterfall chart is right and the bar plot is wrong. In the screenshot I provided (taken from the Splink tutorial: https://moj-analytical-services.github.io/splink/demos/tutorials/06_Visualising_predictions.html), the match weight is -1.17, which means the associated match probability is 2^(-1.17) / (1 + 2^(-1.17)) = 0.308.
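
For anyone who wants to check the conversion, here is a minimal Python sketch of the formula above. It assumes the match weight is the base-2 logarithm of the Bayes factor; the function name is hypothetical and is not part of Splink's API.

```python
def match_weight_to_probability(match_weight: float) -> float:
    """Convert a match weight (log2 of the Bayes factor) to a probability.

    Hypothetical helper for illustration only; not a Splink function.
    """
    bayes_factor = 2.0 ** match_weight
    return bayes_factor / (1.0 + bayes_factor)


# Match weight reported by both charts for gamma_concat 4, 4, 0, 0, -1
print(round(match_weight_to_probability(-1.17), 3))  # 0.308
```

A match weight of 0 maps to a probability of 0.5, so any negative weight should give a probability below 50%.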

Image

Thanks!

To Reproduce

  1. Go to the Splink tutorial here: https://moj-analytical-services.github.io/splink/demos/tutorials/06_Visualising_predictions.html
  2. Scroll down to the "Comparison viewer dashboard" section.
  3. Set "Filter out comparison vector counts below" to 10 to facilitate the bar selection.
  4. Select the pale green bar (the one with gamma_concat: 4, 4, 0, 0, -1).
  5. Look at the match probability, which is 82.8%.
  6. Finally, scroll down to the waterfall chart, where the match probability is 30.8% rather than 82.8%.

OS:

Windows 11

Splink version:

4.0.5

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@francisduval francisduval added the bug Something isn't working label Nov 27, 2024
@samnlindsay
Contributor

There is definitely an issue here with the inconsistency between match probability and match weight. Both values are actually valid, but for reasons that need to be made clearer in the dashboard:

  • The histogram shows the match weights/probabilities based only on the comparison vector (4,4,0,0,-1)
  • The waterfall charts include the additional term frequency adjustments for the sampled record comparison

Image

Without these term frequency adjustments, the final match weight would be +2.26 (p=0.83).
Because "Jacob" and "Campbell" are relatively common names, the adjustments revise the figures downwards. For a very uncommon name, they would revise them upwards.

So the comparison viewer only describes the average or baseline level for each comparison vector, with the caveat that some specific comparisons will be scored higher and some lower.
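
As a rough illustration of the numbers above, the sketch below treats the term frequency adjustments as one additive shift on the match weight. The weights +2.26 and -1.17 come from this thread. The total adjustment of about -3.43 is only inferred from their difference (the per-name adjustments are not given here), and the helper name is hypothetical rather than part of Splink's API.

```python
def weight_to_probability(match_weight: float) -> float:
    """Hypothetical helper: match weight (log2 Bayes factor) to probability."""
    bayes_factor = 2.0 ** match_weight
    return bayes_factor / (1.0 + bayes_factor)


# Figures quoted in this thread
baseline_weight = 2.26   # comparison vector only (bar plot)
adjusted_weight = -1.17  # after term frequency adjustments (waterfall chart)

# Assumption: the gap is the sum of the term frequency adjustments
implied_tf_adjustment = adjusted_weight - baseline_weight

print(round(implied_tf_adjustment, 2))                   # -3.43
print(round(weight_to_probability(baseline_weight), 2))  # 0.83
print(round(weight_to_probability(adjusted_weight), 2))  # 0.31
```

Because match weights add on a log2 scale, a downward shift of roughly 3.4 is enough to move the probability from about 0.83 to about 0.31.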

@francisduval
Author

Hi Sam!

Thank you so much for your very fast answer. I'd forgotten about the term frequency adjustments! It makes sense now.
