Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question on the scope of groupby #865

Open
meggart opened this issue Nov 26, 2024 · 11 comments
Open

Question on the scope of groupby #865

meggart opened this issue Nov 26, 2024 · 11 comments
Labels
ecosystem enhancement New feature or request question Further information is requested

Comments

@meggart
Copy link

meggart commented Nov 26, 2024

I am trying to implement a bit more groupby logic into DiskArrayEngine and was wondering if you have any thoughts on allowing grouping on other DimArrays in the future. Currently we can only group by single dimensions, which is already useful. However, imagine I have a 3d DimArray a (lon x lat x Ti) and a 2d mask DimArray m (lon x lat) e.g. containing country codes, ideally I would like to be able to write mean.(groupby(a, m)) to get country means where both. Is something like this on your radar or does it already exist in DD and just have a different name? Just interested to hear your thoughts before continuing to work on this in DAE to make sure the interface is somewhat compatible.

@rafaqz
Copy link
Owner

rafaqz commented Nov 26, 2024

That sounds good, no reason it can't work.

You can group by multiple dims now but as separate axes, not as a mask. I guess we should allow grouping by any arbitrary Bool DimArray DimArray?

Do you also imagine the dims of the mask are different to the main array? Like we have to use selectors like Near?

@rafaqz
Copy link
Owner

rafaqz commented Nov 26, 2024

I just realised we already have that behaviour of passing a DimArray a the same as dims(A). But maybe we can change that in the next breaking change to work how you suggest.

@meggart
Copy link
Author

meggart commented Nov 27, 2024

Do you also imagine the dims of the mask are different to the main array? Like we have to use selectors like Near?

No, I did not think so far yet, currently having everything share the same dims should be good enough, at least for my use cases and otherwise there is the possibility to explicitly interpolate the mask.

@meggart
Copy link
Author

meggart commented Nov 27, 2024

The only thing I am pondering about is if it would be possible to create a groupby object with unknown classes and number of classes, because the mask array itself might be huge so you only know the resulting groups after doing the grouped aggregation.

I have a few benchmarks of DAE against this library: https://flox.readthedocs.io/en/latest/intro.html . In their main example they also group by an array larger than memory. In flox, they have a keyword argument expected_groups that you can pass manually, which would be something we could do as well but maybe we could even make something like this work without knowing the groups? It is definitely possible from the DAE-side, just asking if we can stretch the DimGroupByArray so far that it can have unknown length.

@rafaqz
Copy link
Owner

rafaqz commented Nov 27, 2024

That is a cool idea.

How do we store the group indices for masks that big? That array could also get large.

Or is it also doing online stats? (Ok seems like that's what flox dies)

@meggart
Copy link
Author

meggart commented Nov 27, 2024

The idea is to do online stats, but it would still assume that the aggregated results still fit in memory. Ofc users would have to make sure the mask array does not contain millions of classes...

@rafaqz
Copy link
Owner

rafaqz commented Nov 27, 2024

Ok guess we will need a groupby/combine function for that. xarray_reduce is a pretty weird name (with no mention of the grouping part).

Any idea what to call it?

@meggart
Copy link
Author

meggart commented Nov 27, 2024

I am really bad at naming things. What about groupby_dimarray_combine(combinefunc, array, groups... ) or similar? The main differences to the normal groupby would be that:

  1. Supplying a reduction function is mandatory
  2. Grouping happens by value of the group arguments
  3. Ideally, there would be automatic broadcasting of the array argument, so that we don't need to artificially create huge repeated arrays of ones like in the flox example but can just write groupby_dimarray_combine(sum, 1, mask) to get the equivalent of something like countmap(mask)

Of course there will be some overlap with the mean.(groupby(a,...)) workflow so I was initially thinking we could somehow get this behavior by special-casing broadcast, but I would be much happier with a separate function that we can specialize on for DiskArrays, or for now one that I would simply overload on YAXArrays.

@rafaqz
Copy link
Owner

rafaqz commented Nov 27, 2024

I'm going to add combine anyway for combine(mean, groupedarray), maybe we can use combine(mean, GroupBy(A, args...; kw...)) as a lazy version?

(Where GroupBy has identical syntax to groupby)

@meggart
Copy link
Author

meggart commented Nov 27, 2024

Nice, I think this might work. I will try to implement a DiakArrayEninge-based combine that builds on this interface

@rafaqz
Copy link
Owner

rafaqz commented Dec 13, 2024

I just realised allowing this will let us have the DataFrames.jl groupby syntax for AbstractDimStack like groupby(st, :layername) and it will group the whole stack by that layer.

@rafaqz rafaqz added enhancement New feature or request question Further information is requested ecosystem labels Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ecosystem enhancement New feature or request question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants