Handling order of list arguments when producing comparison levels #2560

medwar99 · 2024-12-13T14:23:00Z

What happens?

Problem:
When creating a comparison with levels for different thresholds (such as xAtSizes or xAtThresholds) the ordering of the provided list matters when creating the comparison levels.

The expectation is that a list of levels or sizes should be in descending order to behave like a cascade. If the user provides a list with incorrect ordering or duplication then the resulting comparison may not perform as intended.

I've tried this on 4.0.6 and 3.9.15 and both show the same behaviour. Admittedly users should provide the lists in the correct order to obtain the intended cascade of comparison levels, but it can be confusing, as seen in discussion 2202.

Potential fixes:

Add sorting and duplicate removal to the comparison level inits for those taking sizes or threshold lists. e.g. self.thresholds = [*thresholds_as_iterable] becomes self.thresholds = sorted(set([*thresholds_as_iterable])). However this would be required in each init, so not ideal.
Create a new function similar to misc/dedupe_preserving_order() which also sorts the values, perhaps dedupe_sorting_order(ascending: str = True). The setting of the thresholds could then become self.thresholds = dedupe_sorting_order([*thresholds_as_iterable], ascending=False)
Update misc/dedupe_preserving_order() to take in an argument which allows a user to sort, e.g. dedupe_preserving_order(list_of_items, ascending: str = True) but this may then make the function name confusing.

To Reproduce

4.0.6 Examples:

Correct/Expected cascading order (sorted descending):

import splink.comparison_library as cl
cl.JaroWinklerAtThresholds('first_name', score_threshold_or_thresholds=[0.9, 0.7]).get_comparison('duckdb').as_dict()

{'output_column_name': 'first_name',
 'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
   'label_for_charts': 'first_name is NULL',
   'is_null_level': True},
  {'sql_condition': '"first_name_l" = "first_name_r"',
   'label_for_charts': 'Exact match on first_name'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.9'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'JaroWinklerAtThresholds'}

Incorrect order (sorted ascending):

import splink.comparison_library as cl
cl.JaroWinklerAtThresholds('first_name', score_threshold_or_thresholds=[0.7, 0.9]).get_comparison('duckdb').as_dict()

{'output_column_name': 'first_name',
 'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
   'label_for_charts': 'first_name is NULL',
   'is_null_level': True},
  {'sql_condition': '"first_name_l" = "first_name_r"',
   'label_for_charts': 'Exact match on first_name'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.9'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'JaroWinklerAtThresholds'}

Duplicated levels:

import splink.comparison_library as cl
cl.JaroWinklerAtThresholds('first_name', score_threshold_or_thresholds=[0.7, 0.7]).get_comparison('duckdb').as_dict()

{'output_column_name': 'first_name',
 'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
   'label_for_charts': 'first_name is NULL',
   'is_null_level': True},
  {'sql_condition': '"first_name_l" = "first_name_r"',
   'label_for_charts': 'Exact match on first_name'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
  {'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
 'comparison_description': 'JaroWinklerAtThresholds'}

OS:

Windows

Splink version:

4.0.6

Have you tried this on the latest `master` branch?

I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

I agree

The text was updated successfully, but these errors were encountered:

medwar99 · 2024-12-13T14:31:53Z

Just to add, obviously this doesn't hold for all comparisons which implement levels, but for simple cases with multiple ordered levels of a single comparison type this can be an easy catch. Not sure how it would play nicely when mixing semantically different comparison levels though.

medwar99 added the bug Something isn't working label Dec 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handling order of list arguments when producing comparison levels #2560

Handling order of list arguments when producing comparison levels #2560

medwar99 commented Dec 13, 2024

medwar99 commented Dec 13, 2024

Handling order of list arguments when producing comparison levels #2560

Handling order of list arguments when producing comparison levels #2560

Comments

medwar99 commented Dec 13, 2024

What happens?

To Reproduce

OS:

Splink version:

Have you tried this on the latest master branch?

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

medwar99 commented Dec 13, 2024

Have you tried this on the latest `master` branch?