Problem:
When creating a comparison with levels for different thresholds (such as xAtSizes or xAtThresholds), the ordering of the provided list matters when the comparison levels are created.
The expectation is that a list of thresholds or sizes should be in descending order so that the levels behave like a cascade. If the user provides a list with incorrect ordering or duplicates, the resulting comparison may not perform as intended.
I've tried this on 4.0.6 and 3.9.15 and both show the same behaviour. Admittedly, users should provide the lists in the correct order to obtain the intended cascade of comparison levels, but it can be confusing, as seen in discussion 2202.
Potential fixes:
1. Add sorting and duplicate removal to the comparison level __init__ methods for those taking size or threshold lists, e.g. self.thresholds = [*thresholds_as_iterable] becomes self.thresholds = sorted(set([*thresholds_as_iterable]), reverse=True). However, this would be required in each __init__, so it's not ideal.
2. Create a new function similar to misc/dedupe_preserving_order() which also sorts the values, perhaps dedupe_sorting_order(values, ascending: bool = True). The thresholds could then be set with self.thresholds = dedupe_sorting_order([*thresholds_as_iterable], ascending=False). A minimal sketch of such a helper is shown below this list.
3. Update misc/dedupe_preserving_order() to take an argument which allows the caller to sort, e.g. dedupe_preserving_order(list_of_items, ascending: bool = True), but this may then make the function name confusing.
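As a rough illustration of option 2, a minimal sketch of what such a helper could look like is below. The name dedupe_sorting_order, its signature and the ascending flag are the hypothetical names suggested above, not existing Splink API:

def dedupe_sorting_order(values, ascending: bool = True) -> list:
    # Hypothetical helper (not existing Splink API): drop duplicates and
    # return the values sorted, so threshold/size lists always form a valid cascade.
    return sorted(set(values), reverse=not ascending)

# Example (sketch only):
# dedupe_sorting_order([0.7, 0.9, 0.7], ascending=False)  ->  [0.9, 0.7]
# so inside a comparison's __init__ the assignment could become:
# self.thresholds = dedupe_sorting_order(thresholds_as_iterable, ascending=False)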
To Reproduce
4.0.6 Examples:
Correct/Expected cascading order (sorted descending):
import splink.comparison_library as cl
cl.JaroWinklerAtThresholds('first_name', score_threshold_or_thresholds=[0.9, 0.7]).get_comparison('duckdb').as_dict()
{'output_column_name': 'first_name',
'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
'label_for_charts': 'first_name is NULL',
'is_null_level': True},
{'sql_condition': '"first_name_l" = "first_name_r"',
'label_for_charts': 'Exact match on first_name'},
{'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9',
'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.9'},
{'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'JaroWinklerAtThresholds'}
Incorrect order (sorted ascending):
import splink.comparison_library as cl
cl.JaroWinklerAtThresholds('first_name', score_threshold_or_thresholds=[0.7, 0.9]).get_comparison('duckdb').as_dict()
{'output_column_name': 'first_name',
'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
'label_for_charts': 'first_name is NULL',
'is_null_level': True},
{'sql_condition': '"first_name_l" = "first_name_r"',
'label_for_charts': 'Exact match on first_name'},
{'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
{'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9',
'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.9'},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'JaroWinklerAtThresholds'}
Duplicated levels:
import splink.comparison_library as cl
cl.JaroWinklerAtThresholds('first_name', score_threshold_or_thresholds=[0.7, 0.7]).get_comparison('duckdb').as_dict()
{'output_column_name': 'first_name',
'comparison_levels': [{'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
'label_for_charts': 'first_name is NULL',
'is_null_level': True},
{'sql_condition': '"first_name_l" = "first_name_r"',
'label_for_charts': 'Exact match on first_name'},
{'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
{'sql_condition': 'jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7',
'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},
{'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],
'comparison_description': 'JaroWinklerAtThresholds'}
OS:
Windows
Splink version:
4.0.6
Have you tried this on the latest master branch?
I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
I agree
Just to add: obviously this doesn't hold for all comparisons that implement levels, but for simple cases with multiple ordered levels of a single comparison type it would be an easy catch. I'm not sure how it would play with comparisons that mix semantically different levels, though.