-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] One-to-one clustering #2562
Comments
See also #251 Yes - definitely interested in adding support for this. Another aspect is to do with the 'venn diagram' of the two datasets. Are they known to contain exactly the same records, is one a subset of the other, or can either side contain records not in the other. Perhaps the later is the most generally useful, because it can be used on all the other scenarios, just with some possible loss of accuracy. If you've already got some snippets, it'd be great to see a proposed/draft solution just a a script so we can get our heads around how it works. |
This is actually getting closer to a full implementation now, hopefully it makes sense. I read this review of bipartite graph matching algorithms and what I've got here is (I think) equivalent to the Exact Clustering method they describe (extended to multiple datasets). One variation on this implementation would be to filter out the edges that would introduce a duplicate, and THEN rank them and select the mutual best links. This would accept more links, but I like the simplicity of only accepting the best links and not moving on to the next best ones if the best option is already "taken". I think this corresponds to the Unique Mapping Clustering in the paper I mentioned. The comments from @zmbc about unique and exhaustive make a lot of sense. I think the 'unique' part of this could be captured well by an implementation like this one, by setting the list of Aside: one technical point I was stuck on was quoting the literal value of the source dataset name to be used in a query. I used single quotes below, but I think that might only work with DuckDB! See this line:
|
@aymonwuolanne I agree that this is "exact clustering" extended to multiple datasets, and I think this is worth including! Out of the methods described in that paper, this one seems like the most amenable to efficient implementation in SQL. |
Actually, now that I look at the code a bit closer, I'm a bit confused about |
@zmbc it does rank them regardless of source dataset. If you have A1 <--> B1 with a higher weight than A1 <--> C1, then after the first iteration A1 and B1 will have the same representative, and the ranking is over links where the two records have different representatives, so A1 <--> B1 is excluded from the ranking in the second iteration, leaving A1 <--> C1 free to link. If you partition by source dataset too, you can get situations like A1 <--> B1, B1 <--> C1, and A2 <--> C1 all being accepted, creating an implicit duplicate between A1 and A2. |
Ah! This is subtle and tricky to think through. Certainly if the datasets were totally scrambled between left and right, you could get A1 <--> B1, B1 <--> A2 (where the left and right there reflect the left and right tables). I think the key assumption here is that there is a valid ordering of the datasets D1, D2, D3, etc such that if DN is compared with DM, DN is on the left if and only if N < M. A record in DN could only be linked to another record in DN if there were a loop of links back to DN. If our record if interest is linked to a record in DM, where M > N, that record in DM can only link again in the same iteration if it is on the left, which means it must link with a dataset DX where X > M. Iterating this can never lead back to DN. |
It is tricky! I had to draw quite a few diagrams to think through different examples. I can see the logic of what you're saying, but even with an ordering of the datasets it doesn't work to partition by representative and source_dataset. Examples like the one I gave before (with the ordering A < B < C) involve a kind of "leapfrogging", i.e. the links A1 -> B1, B1 -> C1, and A2 -> C1 all respect the ordering you described, but you'd end up with all four records in the same cluster eventually. The approach I used in the code was to ensure that there are no chains of links at all within the one iteration. For two records A1 and B1 that link, you can be sure that A1 links to nothing else and B1 links to nothing else. Then it's just a matter of checking that merging the cluster that A1 belongs to and the cluster that B1 belongs to doesn't implicitly form a duplicate in one of the source datasets. |
Sorry, let me clarify. I'm saying that it is necessary both to partition by representative only (not dataset) and to have an ordering that determines left and right records. If you don't have that second assumption, your |
To state a bit more explicitly what I think your comment is missing -- you
_can_ form chains in one iteration, when the number of datasets > 2. You
could have A1 <--> B1, and B1 <--> C1, and C1 <--> D1, all in one
iteration, because each record that is linking twice is appearing once on
the left and once on the right.
|
Oh I see, sorry! The code I've got adds in the reverse edges too, so in the case of A1 <--> B1 and B1 <--> A2 the edges B1 <--> A1 and A2 <--> B1 are also included when the ranking happens, so you only get one of those links accepted. In the example of A1 -> B1 and B1 -> C1 you'd also only be able to accept one of the two, since one of them will have rank 2 with respect to B1. Just to lay it out very explicitly, this is what the
I think it's better to not rely on the ordering of the datasets, though I believe it is guaranteed with the way the blocking code works. |
Ah, I see, I missed that. I guess what I said above is a potential
optimization, then, if we can assume an ordering of the datasets :)
|
Is there any interest in a one-to-one clustering implementation to be included as an alternative to connected components clustering?
When you have two datasets, you can do something like this:
but it can get much more complicated when you start thinking about it, for example:
A proposed method
I have a method in mind for doing this, it's not pull request ready but I have some code snippets I could share too. The basic idea is that you run through some iterations where you guarantee no duplicates can form at each iteration.
The representatives table at iteration
i
must track several things:node_id
: the id corresponding (uniquely, across all source datasets) to that record.source_dataset
the source dataset corresponding to thenode_id
. Unlike connected components, this can't be dropped since it's needed to determine whether links should be accepted at each iteration.representative
: a representative id for the cluster this record belongs to.Within an iteration we calculate some extra columns for the current representatives table:
contains_{sd}
a boolean value for each source_dataset namesd
, indicating whether the cluster contains a record fromsd
. For example, with datasetsA
,B
, andC
, we'd have columnscontains_A
,contains_B
,contains_C
.Then we determine what the neighbours at this iteration will be. For each
node_id
in the neighbours table, we rank all of the incoming and outgoing edges by their match probability (where the edges aren't between nodes in the same cluster). We take the edges withrank_l
andrank_r
equal to 1, and then we filter out any edges that would introduce a duplicate, by checking thecontains_{sd}
flags on each side of the edge.Ties could be dealt with at this step, using something arbitrary like
row_number()
instead ofrank()
or by detecting then dropping tied links.Finally, the representatives table is updated in the same way as it is in the connected components iterations (join the previous representatives table with the neighbours.
Iterations finish when the clusters stabilise, i.e. when the count of nodes where
rep_match = true
reaches 0. I think the way this is set it should converge in at most k-1 iterations where k is the number of datasets.The text was updated successfully, but these errors were encountered: