Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should BlockingRules store identifier column information? #2588

Open
ADBond opened this issue Jan 16, 2025 · 0 comments
Open

Should BlockingRules store identifier column information? #2588

ADBond opened this issue Jan 16, 2025 · 0 comments

Comments

@ADBond
Copy link
Contributor

ADBond commented Jan 16, 2025

There's a situation that bothers me slightly. In BlockingRule.exclude_pairs_generated_by_all_preceding_rules_sql() we pass in source_dataset_input_column and unique_id_input_column. These are only needed for ExplodingBlockingRule - and it was only in that PR that we had to introduce these being passed into this method (actually Linker in this case, which was later refactored to just the id columns).
This means that even in situations where you know you have ExactMatchRules you have to do something like:

blocking_rule.exclude_pairs_generated_by_all_preceding_rules_sql(
    "dummy",
    "dummy",
)

which is a bit irritating/unsatisfying.
(of course you could also pass the 'correct' objects, but then you need to construct an InputColumn that you may not need anywhere else).
The same is true for exclude_pairs_generated_by_this_rule_sql.

The reason we need this is that frequently in code we don't know what kind of BlockingRule we have, so we need to pass whatever we might need.

One possible way around this is that we store these on the BlockingRule object, and pass when we instantiate with BlockingRuleCreator.get_blocking_rule(). This might appear to be just pushing the problem around to a different place, but we in fact need this information for all BlockingRules independently within .create_blocked_pairs_sql() (so that we would not need to pass these when calling that).

I have done a (non-thorough) check of the codebase and it looks like we should always have this information available when we are instantiating BlockingRules (and these are 'internal' objects, so no worry about the user).
And I can't think of a scenario (though shout if I'm missing one!) where we would want to create a concrete BlockingRule and then afterwards change the id/source column name.

Semantically it feels a little strange, but maybe not massively, as we do know we need to associate identifier information with records while we are blocking, so the blocking rule having this configuration data doesn't feel totally wild.

Any thoughts/opposing views welcomed!

Similar things occur in a few places in the codebase (e.g. SplinkDialect.random_sample_sql(), so if there is a more general pattern we could use compared to what I am proposing here then please let me know, as a more general solution could potentially be used in these other places as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant