TF adjustments not displayed and most likely not applied #2584

pierpaolocreanza · 2025-01-14T20:31:57Z

So, I defined a name comparison composed of various levels, several of which have term frequency adjustments. However, when I inspect the comparison viewer dashboard I only find evidence of tf adjustments being applied to the first exact level match on full name. How can I be sure that these adjustments are indeed being used to compute the final match score? Or is their absence from the dashboard enough evidence that something isn't working?

My comparison:

name_comparison = cl.CustomComparison(
    comparison_levels=[

        # Null 
        NullLevel("inv_name_clean_final"),

        # Exact match for inv_name_clean_final
        ExactMatchLevel("inv_name_clean_final", term_frequency_adjustments=True).configure(tf_minimum_u_value=tf_minimum_u_value),

        # Nulls
        NullLevel("last_name"),
        NullLevel("first_name"),
        NullLevel("middle_initial"),

        # First, last name & middle inital must be equal
        And(
            ExactMatchLevel("first_name", term_frequency_adjustments=True).configure(tf_minimum_u_value=tf_minimum_u_value),
            ExactMatchLevel("last_name", term_frequency_adjustments=True).configure(tf_minimum_u_value=tf_minimum_u_value),
            ExactMatchLevel("middle_initial")
        ),

        # First & last name must be equal
        And(
            ExactMatchLevel("first_name", term_frequency_adjustments=True).configure(tf_minimum_u_value=tf_minimum_u_value),
            ExactMatchLevel("last_name", term_frequency_adjustments=True).configure(tf_minimum_u_value=tf_minimum_u_value)
        ),

        # First name equal to last name of the other record, and last name equal to first name (swapped order)
        ColumnsReversedLevel("first_name", "last_name", symmetrical=True),

        # Custom level for Jaccard or Jaro similarity >= 0.9 on inv_name_clean_final
        CustomLevel(
            sql_condition="""
                JACCARD(inv_name_clean_final_l, inv_name_clean_final_r) >= 0.9
                OR JARO_SIMILARITY(inv_name_clean_final_l, inv_name_clean_final_r) >= 0.9
            """,
            label_for_charts="Jaccard or Jaro >= 0.9 on inv_name_clean_final"
        ),

        # Same last name and similar first name
        And(ExactMatchLevel("last_name", term_frequency_adjustments=True).configure(tf_minimum_u_value=tf_minimum_u_value),
            JaroWinklerLevel("first_name",0.85)
        ),

        # Custom level for Jaccard or Jaro similarity >= 0.9 on last_name and JW>=0.8 on first
        CustomLevel(
            sql_condition="""
                first_name_l IS NOT NULL 
                AND first_name_r IS NOT NULL 
                AND last_name_l IS NOT NULL 
                AND last_name_r IS NOT NULL 
                AND length(trim(first_name_l)) > 0 
                AND length(trim(first_name_r)) > 0 
                AND length(trim(last_name_l)) > 0 
                AND length(trim(last_name_r)) > 0 
                AND (
                    JACCARD(last_name_l, last_name_r) >= 0.9
                    OR JARO_SIMILARITY(last_name_l, last_name_r) >= 0.9
                )
                AND JARO_WINKLER_SIMILARITY(first_name_l, first_name_r) >= 0.85
                """,
            label_for_charts="Jaccard or Jaro >= 0.9 on last_name, JW >=0.85 on first_name"
        ),
        
        # Fallback for all other cases
        ElseLevel()
    ],
    output_column_name="name_comparison",
    comparison_description="Inventor name comparison with various levels for clean and stripped names, first/last name handling, and token reordering"
)

and here are two examples of two levels, both of which should have tf adjustments, but where only the one for the full name is showing up.

Originally posted by @pierpaolocreanza in #2523

I cross posted here as an issue rather than a discussion question

The text was updated successfully, but these errors were encountered:

RobinL · 2025-01-14T20:45:04Z

Apologies for not replying sooner. This is probably the most complex custom comparison I've ever seen!

I think it's unlikely the term frequency configuration 'plays' nicely with some of the other operators (e.g. And). Sorry - this is not documented anywhere - it's just an unusual use case

I think to get this going you'd need to use the low-level dictionary representation:
https://moj-analytical-services.github.io/splink/topic_guides/comparisons/customising_comparisons.html#method-3-providing-the-spec-as-a-dictionary

I'm not on my dev machine at the moment, but tomorrow I will try and add a couple of additional pointers and ways to debug this

RobinL · 2025-01-22T17:10:18Z

Sorry for the delay.

When you use the comparison/comparison level libraries, these are really just using helper functions that generate the underlying settings dictionary for you. i.e. Splink always uses the spec-as-a-dict under the hood, it's just it's fiddly to write, so we help people out.

So the way to know what Splink is going to do is to look at the dict representation:
name_comparison.create_comparison_dict(sql_dialect_str="duckdb")

Which gives you:

{'comparison_description': 'CustomComparison',
 'output_column_name': 'name_comparison',
 'comparison_levels': [{'sql_condition': '"inv_name_clean_final_l" IS NULL OR "inv_name_clean_final_r" IS NULL',
   'label_for_charts': 'inv_name_clean_final is NULL',
   'fix_m_probability': False,
   'fix_u_probability': False,
   'is_null_level': True},
  {'sql_condition': '"inv_name_clean_final_l" = "inv_name_clean_final_r"',
   'label_for_charts': 'Exact match on inv_name_clean_final',
   'fix_m_probability': False,
   'fix_u_probability': False,
   'tf_adjustment_column': 'inv_name_clean_final',
   'tf_minimum_u_value': 0.01,
   'tf_adjustment_weight': 1.0},
  {'sql_condition': '"last_name_l" IS NULL OR "last_name_r" IS NULL',
   'label_for_charts': 'last_name is NULL',
   'fix_m_probability': False,
   'fix_u_probability': False,
   'is_null_level': True},
  {'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
   'label_for_charts': 'first_name is NULL',
   'fix_m_probability': False,
   'fix_u_probability': False,
   'is_null_level': True},
  {'sql_condition': '"middle_initial_l" IS NULL OR "middle_initial_r" IS NULL',
   'label_for_charts': 'middle_initial is NULL',
   'fix_m_probability': False,
   'fix_u_probability': False,
   'is_null_level': True},
  {'sql_condition': '("first_name_l" = "first_name_r") AND ("last_name_l" = "last_name_r") AND ("middle_initial_l" = "middle_initial_r")',
   'label_for_charts': '(Exact match on first_name) AND (Exact match on last_name) AND (Exact match on middle_initial)',
   'fix_m_probability': False,
   'fix_u_probability': False},
  {'sql_condition': '("first_name_l" = "first_name_r") AND ("last_name_l" = "last_name_r")',
   'label_for_charts': '(Exact match on first_name) AND (Exact match on last_name)',
   'fix_m_probability': False,
   'fix_u_probability': False},
  {'sql_condition': '"first_name_l" = "last_name_r" AND "first_name_r" = "last_name_l"',
   'label_for_charts': 'Match on reversed cols: first_name and last_name (both directions)',
   'fix_m_probability': False,
   'fix_u_probability': False},
  {'sql_condition': '\n                JACCARD(inv_name_clean_final_l, inv_name_clean_final_r) >= 0.9\n                OR JARO_SIMILARITY(inv_name_clean_final_l, inv_name_clean_final_r) >= 0.9\n            ',
   'label_for_charts': 'Jaccard or Jaro >= 0.9 on inv_name_clean_final',
   'fix_m_probability': False,
   'fix_u_probability': False},
  {'sql_condition': '("last_name_l" = "last_name_r") AND (jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.85)',
   'label_for_charts': '(Exact match on last_name) AND (Jaro-Winkler distance of first_name >= 0.85)',
   'fix_m_probability': False,
   'fix_u_probability': False},
  {'sql_condition': '\n                first_name_l IS NOT NULL\n                AND first_name_r IS NOT NULL\n                AND last_name_l IS NOT NULL\n                AND last_name_r IS NOT NULL\n                AND length(trim(first_name_l)) > 0\n                AND length(trim(first_name_r)) > 0\n                AND length(trim(last_name_l)) > 0\n                AND length(trim(last_name_r)) > 0\n                AND (\n                    JACCARD(last_name_l, last_name_r) >= 0.9\n                    OR JARO_SIMILARITY(last_name_l, last_name_r) >= 0.9\n                )\n                AND JARO_WINKLER_SIMILARITY(first_name_l, first_name_r) >= 0.85\n                ',
   'label_for_charts': 'Jaccard or Jaro >= 0.9 on last_name, JW >=0.85 on first_name',
   'fix_m_probability': False,
   'fix_u_probability': False},
  {'sql_condition': 'ELSE',
   'label_for_charts': 'All other comparisons',
   'fix_m_probability': False,
   'fix_u_probability': False}]}

This confirms my suspicion that that the following keys:

'tf_adjustment_column'
'tf_minimum_u_value'
'tf_adjustment_weight'
are not populated correctly for some of your complex comparisons

To be clear, this is a bug - so thank you for bringing it to our attention. But probably quite a fiddly one to fix

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TF adjustments not displayed and most likely not applied #2584

TF adjustments not displayed and most likely not applied #2584

pierpaolocreanza commented Jan 14, 2025

RobinL commented Jan 14, 2025

RobinL commented Jan 22, 2025

TF adjustments not displayed and most likely not applied #2584

TF adjustments not displayed and most likely not applied #2584

Comments

pierpaolocreanza commented Jan 14, 2025

RobinL commented Jan 14, 2025

RobinL commented Jan 22, 2025