You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
There's a situation that bothers me slightly. In BlockingRule.exclude_pairs_generated_by_all_preceding_rules_sql() we pass in source_dataset_input_column and unique_id_input_column. These are only needed for ExplodingBlockingRule - and it was only in that PR that we had to introduce these being passed into this method (actually Linker in this case, which was later refactored to just the id columns).
This means that even in situations where you know you have ExactMatchRules you have to do something like:
which is a bit irritating/unsatisfying.
(of course you could also pass the 'correct' objects, but then you need to construct an InputColumn that you may not need anywhere else).
The same is true for exclude_pairs_generated_by_this_rule_sql.
The reason we need this is that frequently in code we don't know what kind of BlockingRule we have, so we need to pass whatever we might need.
One possible way around this is that we store these on the BlockingRule object, and pass when we instantiate with BlockingRuleCreator.get_blocking_rule(). This might appear to be just pushing the problem around to a different place, but we in fact need this information for allBlockingRules independently within .create_blocked_pairs_sql() (so that we would not need to pass these when calling that).
I have done a (non-thorough) check of the codebase and it looks like we should always have this information available when we are instantiating BlockingRules (and these are 'internal' objects, so no worry about the user).
And I can't think of a scenario (though shout if I'm missing one!) where we would want to create a concrete BlockingRule and then afterwards change the id/source column name.
Semantically it feels a little strange, but maybe not massively, as we do know we need to associate identifier information with records while we are blocking, so the blocking rule having this configuration data doesn't feel totally wild.
Any thoughts/opposing views welcomed!
Similar things occur in a few places in the codebase (e.g. SplinkDialect.random_sample_sql(), so if there is a more general pattern we could use compared to what I am proposing here then please let me know, as a more general solution could potentially be used in these other places as well.
The text was updated successfully, but these errors were encountered:
There's a situation that bothers me slightly. In
BlockingRule.exclude_pairs_generated_by_all_preceding_rules_sql()
we pass insource_dataset_input_column
andunique_id_input_column
. These are only needed forExplodingBlockingRule
- and it was only in that PR that we had to introduce these being passed into this method (actuallyLinker
in this case, which was later refactored to just the id columns).This means that even in situations where you know you have
ExactMatchRule
s you have to do something like:which is a bit irritating/unsatisfying.
(of course you could also pass the 'correct' objects, but then you need to construct an
InputColumn
that you may not need anywhere else).The same is true for
exclude_pairs_generated_by_this_rule_sql
.The reason we need this is that frequently in code we don't know what kind of
BlockingRule
we have, so we need to pass whatever we might need.One possible way around this is that we store these on the
BlockingRule
object, and pass when we instantiate withBlockingRuleCreator.get_blocking_rule()
. This might appear to be just pushing the problem around to a different place, but we in fact need this information for allBlockingRule
s independently within.create_blocked_pairs_sql()
(so that we would not need to pass these when calling that).I have done a (non-thorough) check of the codebase and it looks like we should always have this information available when we are instantiating
BlockingRule
s (and these are 'internal' objects, so no worry about the user).And I can't think of a scenario (though shout if I'm missing one!) where we would want to create a concrete
BlockingRule
and then afterwards change the id/source column name.Semantically it feels a little strange, but maybe not massively, as we do know we need to associate identifier information with records while we are blocking, so the blocking rule having this configuration data doesn't feel totally wild.
Any thoughts/opposing views welcomed!
Similar things occur in a few places in the codebase (e.g.
SplinkDialect.random_sample_sql()
, so if there is a more general pattern we could use compared to what I am proposing here then please let me know, as a more general solution could potentially be used in these other places as well.The text was updated successfully, but these errors were encountered: