Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TF adjustments not displayed and most likely not applied #2584

Open
pierpaolocreanza opened this issue Jan 14, 2025 · 2 comments
Open

TF adjustments not displayed and most likely not applied #2584

pierpaolocreanza opened this issue Jan 14, 2025 · 2 comments

Comments

@pierpaolocreanza
Copy link

So, I defined a name comparison composed of various levels, several of which have term frequency adjustments. However, when I inspect the comparison viewer dashboard I only find evidence of tf adjustments being applied to the first exact level match on full name. How can I be sure that these adjustments are indeed being used to compute the final match score? Or is their absence from the dashboard enough evidence that something isn't working?

My comparison:

name_comparison = cl.CustomComparison(
    comparison_levels=[

        # Null 
        NullLevel("inv_name_clean_final"),

        # Exact match for inv_name_clean_final
        ExactMatchLevel("inv_name_clean_final", term_frequency_adjustments=True).configure(tf_minimum_u_value=tf_minimum_u_value),

        # Nulls
        NullLevel("last_name"),
        NullLevel("first_name"),
        NullLevel("middle_initial"),

        # First, last name & middle inital must be equal
        And(
            ExactMatchLevel("first_name", term_frequency_adjustments=True).configure(tf_minimum_u_value=tf_minimum_u_value),
            ExactMatchLevel("last_name", term_frequency_adjustments=True).configure(tf_minimum_u_value=tf_minimum_u_value),
            ExactMatchLevel("middle_initial")
        ),

        # First & last name must be equal
        And(
            ExactMatchLevel("first_name", term_frequency_adjustments=True).configure(tf_minimum_u_value=tf_minimum_u_value),
            ExactMatchLevel("last_name", term_frequency_adjustments=True).configure(tf_minimum_u_value=tf_minimum_u_value)
        ),

        # First name equal to last name of the other record, and last name equal to first name (swapped order)
        ColumnsReversedLevel("first_name", "last_name", symmetrical=True),

        # Custom level for Jaccard or Jaro similarity >= 0.9 on inv_name_clean_final
        CustomLevel(
            sql_condition="""
                JACCARD(inv_name_clean_final_l, inv_name_clean_final_r) >= 0.9
                OR JARO_SIMILARITY(inv_name_clean_final_l, inv_name_clean_final_r) >= 0.9
            """,
            label_for_charts="Jaccard or Jaro >= 0.9 on inv_name_clean_final"
        ),

        # Same last name and similar first name
        And(ExactMatchLevel("last_name", term_frequency_adjustments=True).configure(tf_minimum_u_value=tf_minimum_u_value),
            JaroWinklerLevel("first_name",0.85)
        ),

        # Custom level for Jaccard or Jaro similarity >= 0.9 on last_name and JW>=0.8 on first
        CustomLevel(
            sql_condition="""
                first_name_l IS NOT NULL 
                AND first_name_r IS NOT NULL 
                AND last_name_l IS NOT NULL 
                AND last_name_r IS NOT NULL 
                AND length(trim(first_name_l)) > 0 
                AND length(trim(first_name_r)) > 0 
                AND length(trim(last_name_l)) > 0 
                AND length(trim(last_name_r)) > 0 
                AND (
                    JACCARD(last_name_l, last_name_r) >= 0.9
                    OR JARO_SIMILARITY(last_name_l, last_name_r) >= 0.9
                )
                AND JARO_WINKLER_SIMILARITY(first_name_l, first_name_r) >= 0.85
                """,
            label_for_charts="Jaccard or Jaro >= 0.9 on last_name, JW >=0.85 on first_name"
        ),
        
        # Fallback for all other cases
        ElseLevel()
    ],
    output_column_name="name_comparison",
    comparison_description="Inventor name comparison with various levels for clean and stripped names, first/last name handling, and token reordering"
)

and here are two examples of two levels, both of which should have tf adjustments, but where only the one for the full name is showing up.

Screenshot 2024-11-25 at 9 24 45 AM Screenshot 2024-11-25 at 9 25 14 AM

Originally posted by @pierpaolocreanza in #2523

I cross posted here as an issue rather than a discussion question

@RobinL
Copy link
Member

RobinL commented Jan 14, 2025

Apologies for not replying sooner. This is probably the most complex custom comparison I've ever seen!

I think it's unlikely the term frequency configuration 'plays' nicely with some of the other operators (e.g. And). Sorry - this is not documented anywhere - it's just an unusual use case

I think to get this going you'd need to use the low-level dictionary representation:
https://moj-analytical-services.github.io/splink/topic_guides/comparisons/customising_comparisons.html#method-3-providing-the-spec-as-a-dictionary

I'm not on my dev machine at the moment, but tomorrow I will try and add a couple of additional pointers and ways to debug this

@RobinL
Copy link
Member

RobinL commented Jan 22, 2025

Sorry for the delay.

When you use the comparison/comparison level libraries, these are really just using helper functions that generate the underlying settings dictionary for you. i.e. Splink always uses the spec-as-a-dict under the hood, it's just it's fiddly to write, so we help people out.

So the way to know what Splink is going to do is to look at the dict representation:
name_comparison.create_comparison_dict(sql_dialect_str="duckdb")

Which gives you:

{'comparison_description': 'CustomComparison',
 'output_column_name': 'name_comparison',
 'comparison_levels': [{'sql_condition': '"inv_name_clean_final_l" IS NULL OR "inv_name_clean_final_r" IS NULL',
   'label_for_charts': 'inv_name_clean_final is NULL',
   'fix_m_probability': False,
   'fix_u_probability': False,
   'is_null_level': True},
  {'sql_condition': '"inv_name_clean_final_l" = "inv_name_clean_final_r"',
   'label_for_charts': 'Exact match on inv_name_clean_final',
   'fix_m_probability': False,
   'fix_u_probability': False,
   'tf_adjustment_column': 'inv_name_clean_final',
   'tf_minimum_u_value': 0.01,
   'tf_adjustment_weight': 1.0},
  {'sql_condition': '"last_name_l" IS NULL OR "last_name_r" IS NULL',
   'label_for_charts': 'last_name is NULL',
   'fix_m_probability': False,
   'fix_u_probability': False,
   'is_null_level': True},
  {'sql_condition': '"first_name_l" IS NULL OR "first_name_r" IS NULL',
   'label_for_charts': 'first_name is NULL',
   'fix_m_probability': False,
   'fix_u_probability': False,
   'is_null_level': True},
  {'sql_condition': '"middle_initial_l" IS NULL OR "middle_initial_r" IS NULL',
   'label_for_charts': 'middle_initial is NULL',
   'fix_m_probability': False,
   'fix_u_probability': False,
   'is_null_level': True},
  {'sql_condition': '("first_name_l" = "first_name_r") AND ("last_name_l" = "last_name_r") AND ("middle_initial_l" = "middle_initial_r")',
   'label_for_charts': '(Exact match on first_name) AND (Exact match on last_name) AND (Exact match on middle_initial)',
   'fix_m_probability': False,
   'fix_u_probability': False},
  {'sql_condition': '("first_name_l" = "first_name_r") AND ("last_name_l" = "last_name_r")',
   'label_for_charts': '(Exact match on first_name) AND (Exact match on last_name)',
   'fix_m_probability': False,
   'fix_u_probability': False},
  {'sql_condition': '"first_name_l" = "last_name_r" AND "first_name_r" = "last_name_l"',
   'label_for_charts': 'Match on reversed cols: first_name and last_name (both directions)',
   'fix_m_probability': False,
   'fix_u_probability': False},
  {'sql_condition': '\n                JACCARD(inv_name_clean_final_l, inv_name_clean_final_r) >= 0.9\n                OR JARO_SIMILARITY(inv_name_clean_final_l, inv_name_clean_final_r) >= 0.9\n            ',
   'label_for_charts': 'Jaccard or Jaro >= 0.9 on inv_name_clean_final',
   'fix_m_probability': False,
   'fix_u_probability': False},
  {'sql_condition': '("last_name_l" = "last_name_r") AND (jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.85)',
   'label_for_charts': '(Exact match on last_name) AND (Jaro-Winkler distance of first_name >= 0.85)',
   'fix_m_probability': False,
   'fix_u_probability': False},
  {'sql_condition': '\n                first_name_l IS NOT NULL\n                AND first_name_r IS NOT NULL\n                AND last_name_l IS NOT NULL\n                AND last_name_r IS NOT NULL\n                AND length(trim(first_name_l)) > 0\n                AND length(trim(first_name_r)) > 0\n                AND length(trim(last_name_l)) > 0\n                AND length(trim(last_name_r)) > 0\n                AND (\n                    JACCARD(last_name_l, last_name_r) >= 0.9\n                    OR JARO_SIMILARITY(last_name_l, last_name_r) >= 0.9\n                )\n                AND JARO_WINKLER_SIMILARITY(first_name_l, first_name_r) >= 0.85\n                ',
   'label_for_charts': 'Jaccard or Jaro >= 0.9 on last_name, JW >=0.85 on first_name',
   'fix_m_probability': False,
   'fix_u_probability': False},
  {'sql_condition': 'ELSE',
   'label_for_charts': 'All other comparisons',
   'fix_m_probability': False,
   'fix_u_probability': False}]}

This confirms my suspicion that that the following keys:

  • 'tf_adjustment_column'
  • 'tf_minimum_u_value'
  • 'tf_adjustment_weight'
    are not populated correctly for some of your complex comparisons

To be clear, this is a bug - so thank you for bringing it to our attention. But probably quite a fiddly one to fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants