Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling order of list arguments when producing comparison levels #2560

Open
2 tasks done
medwar99 opened this issue Dec 13, 2024 · 1 comment
Open
2 tasks done

Handling order of list arguments when producing comparison levels #2560

medwar99 opened this issue Dec 13, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@medwar99
Copy link
Contributor

What happens?

Problem:
When creating a comparison with levels for different thresholds (such as xAtSizes or xAtThresholds) the ordering of the provided list matters when creating the comparison levels.

The expectation is that a list of levels or sizes should be in descending order to behave like a cascade. If the user provides a list with incorrect ordering or duplication then the resulting comparison may not perform as intended.

I've tried this on 4.0.6 and 3.9.15 and both show the same behaviour. Admittedly users should provide the lists in the correct order to obtain the intended cascade of comparison levels, but it can be confusing, as seen in discussion 2202.

Potential fixes:

  • Add sorting and duplicate removal to the comparison level inits for those taking sizes or threshold lists. e.g. self.thresholds = [*thresholds_as_iterable] becomes self.thresholds = sorted(set([*thresholds_as_iterable])). However this would be required in each init, so not ideal.
  • Create a new function similar to misc/dedupe_preserving_order() which also sorts the values, perhaps dedupe_sorting_order(ascending: str = True). The setting of the thresholds could then become self.thresholds = dedupe_sorting_order([*thresholds_as_iterable], ascending=False)
  • Update misc/dedupe_preserving_order() to take in an argument which allows a user to sort, e.g. dedupe_preserving_order(list_of_items, ascending: str = True) but this may then make the function name confusing.

To Reproduce

4.0.6 Examples:

Correct/Expected cascading order (sorted descending):

import splink.comparison_library as cl
cl.JaroWinklerAtThresholds('first_name', score_threshold_or_thresholds=[0.9, 0.7]).get_comparison('duckdb').as_dict()

{'output_column_name': 'first_name',
 'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
   'label_for_charts': 'first_name is NULL',
   'is_null_level': True},
  {'sql_condition': '"first_name_l" = "first_name_r"',
   'label_for_charts': 'Exact match on first_name'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.9'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'JaroWinklerAtThresholds'}

Incorrect order (sorted ascending):

import splink.comparison_library as cl
cl.JaroWinklerAtThresholds('first_name', score_threshold_or_thresholds=[0.7, 0.9]).get_comparison('duckdb').as_dict()

{'output_column_name': 'first_name',
 'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
   'label_for_charts': 'first_name is NULL',
   'is_null_level': True},
  {'sql_condition': '"first_name_l" = "first_name_r"',
   'label_for_charts': 'Exact match on first_name'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.9'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'JaroWinklerAtThresholds'}

Duplicated levels:

import splink.comparison_library as cl
cl.JaroWinklerAtThresholds('first_name', score_threshold_or_thresholds=[0.7, 0.7]).get_comparison('duckdb').as_dict()

{'output_column_name': 'first_name',
 'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
   'label_for_charts': 'first_name is NULL',
   'is_null_level': True},
  {'sql_condition': '"first_name_l" = "first_name_r"',
   'label_for_charts': 'Exact match on first_name'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'JaroWinklerAtThresholds'}

OS:

Windows

Splink version:

4.0.6

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@medwar99 medwar99 added the bug Something isn't working label Dec 13, 2024
@medwar99
Copy link
Contributor Author

Just to add, obviously this doesn't hold for all comparisons which implement levels, but for simple cases with multiple ordered levels of a single comparison type this can be an easy catch. Not sure how it would play nicely when mixing semantically different comparison levels though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant