Provide an equivalent to dplyr's summarise function #84
Comments
Hi @davidanthoff, are you looking for some help with that? The cost for you is, of course, that you would lose some time managing and explaining what I do. This assumes, of course, that it does not require a fundamental change to your setup.

I'm mainly still struggling with the design for this... But I kind of have an idea now, and it would be great to hear your feedback on it. And yes, I could generally use help with the whole package/ecosystem, so that would be most welcome, and I would not mind at all explaining things. Are you coming to JuliaCon this year? That might be an efficient way.
OK, here is my idea. For the case where you want to summarize a grouped result, you can today use the following syntax:
    @from i in df begin
        @group i by i.state into g
        @select {age=mean(map(j->j.age,g)), oldest=maximum(map(j->j.age,g))}
        @collect DataFrame
    end
Some of these reduction functions in base allow you to pass a function that transforms things before the reduction happens, e.g. mean does, so the @select line can become:
    @select {age=mean(j->j.age,g), oldest=maximum(map(j->j.age,g))}
That is a bit better, but many of the reduction functions in base don't support this, and I still find it clunky. I think there are two ways out of this:

I thought for a while that the story for summarizing a whole query is more tricky. I could add a
    df |> @query(i, begin
        @select i
    end) |>
    @summarize(age=mean(age), oldest=maximum(age))
This whole piping syntax works already on
One general question is what

For the grouped summary story, see #121.

Hi @davidanthoff, so I finally got round to looking at this. On the upside: I'm able to run the tests. On the downside: I don't even know where to start with the code. :-( It's very advanced with metaprogramming, maybe a bit too much for me. I'd like to learn, but I'm not sure it's worth your time, as I said. (Not at JuliaCon, unfortunately.)
So I find both the piping and your solution number 2 above appealing. Number 2 seems the right thing for summaries within a query. So just to get the main setup right:

I wonder whether it would be possible to turn a vector of named tuples into a named tuple of vectors.

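For readers following along, the conversion asked about here can be sketched in plain Julia. This is only an illustration: `unzip_rows` is a hypothetical helper name, not something from Query.jl or LazyQuery, and it assumes every row has the same field names.

```julia
# Hypothetical helper (not part of Query.jl or LazyQuery): turn a vector of
# NamedTuples into a NamedTuple of vectors.
# Assumes `rows` is non-empty and all rows share the same field names.
function unzip_rows(rows::AbstractVector{<:NamedTuple})
    colnames = keys(first(rows))  # tuple of Symbols, e.g. (:a, :b)
    # One freshly allocated vector per field
    columns = map(name -> [getfield(row, name) for row in rows], colnames)
    return NamedTuple{colnames}(columns)
end

rows = [(a = 1, b = 3.0), (a = 2, b = 4.0)]
cols = unzip_rows(rows)
```

Here `cols.a` and `cols.b` are ordinary vectors. Note that each column is a new allocation, which is the cost raised in the reply that follows.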
I think I've got a solution here

Figuring out some story about ungroup would be useful too.

Hm, that would imply yet another allocation, right? Unless it would be a named tuple of vector views...
That should be easily done via a nested

I guess what you really might want is generators (row.name for row in i).

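As a minimal sketch of the generator idea, assuming a group is simply an iterable of named-tuple rows (the `group` data below is made up for illustration), a reduction can consume a generator directly without first materializing a column:

```julia
using Statistics  # standard library; provides mean

# Made-up stand-in for one group of rows
group = [(age = 30, name = "x"), (age = 50, name = "y")]

# A generator is lazy: no intermediate vector of ages is allocated
ages = (row.age for row in group)

avg_age = mean(ages)
oldest = maximum(row.age for row in group)
```

Reductions that accept any iterator, such as `mean`, `sum`, and `maximum`, work on the generator as they would on a vector.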
I tried out the generators in LazyQuery; they seem to be working fine on master. I was hoping you could help me out with the ungroup. Say, for example, I use LazyQuery to do something like this:
    @chain @evaluate begin
        DataFrame(
            a = [1, 1, 2, 2],
            b = [1, 2, 3, 4],
            c = [4, 3, 2, 1]
        )
        query(it)
        @group it a
        @make_from it a d = collect(b) / sum(b) e = collect(c) / sum(c)
        collect(it, DataFrame)
    end
I end up with nested vectors in d and e. How would I ungroup them? If you want, you can send back query syntax and I can macroexpand my way through it.

Something like this:
    @from i in df begin
        @group i by i.a into g
        @select {g.key, some_avg = mean(j->j.b, g), group = g} into i
        @from j in i.group
        @select {i.key, i.some_avg, j.b, j.c}
        @collect DataFrame
    end
Not a perfect match, but it shows the general idea. One problematic aspect here is that this won't work if you have more than one vector in the group that you want to unroll. I.e. in my example, only

Right, so then the solution would be to take non-grouping columns, zip them back up into a vector of named tuples, unnest, then unzip them out again?

Yeah... Not ideal...

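The zip-and-unnest step discussed above can be sketched in plain Julia, independent of Query.jl or LazyQuery. The `grouped` rows and their values are made up for illustration, and the sketch assumes the nested vectors within each group have equal length:

```julia
# Made-up grouped result: one row per group, with two nested vector columns
grouped = [
    (a = 1, d = [0.25, 0.75], e = [0.5, 0.5]),
    (a = 2, d = [0.4, 0.6], e = [0.7, 0.3]),
]

# Zip the nested columns back into rows and unnest: the double `for`
# flattens the result into a single vector of named tuples
flat = [(a = g.a, d = d, e = e) for g in grouped for (d, e) in zip(g.d, g.e)]
```

Each element of `flat` is one ungrouped row carrying the group key `a`. Getting back to columns would need a further unzip pass, which is the extra work the exchange above calls "not ideal".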
OK, well, I've decided to add an additional DataFrames backend for LazyQuery to fully support grouped operations. It seems to me that the named-tuple row approach isn't really compatible with grouping/ungrouping.