-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R-Forge #1705] Change CJ internally not to expand, for efficiency #652
Comments
Yes please. Maybe some non standard evaluation hacking in see for example this >50x speedup:
EDIT: Ah sry, before posting I did some further "optimization" and didn't notice it produced wrong results with it. My point now holds hopefully:) |
@jan-glx I'm confused - what exactly are you trying to demonstrate? I didn't run your code, but you show that the result of your two functions is not identical, then you talk about a speedup... Fwiw I would want to see a different function if this idea gets implemented, and not a drastic change to what |
Here is the example with the "optimization" on again. And another direct implementation that might make clearer when this is useful.
|
Hm, are you just asking that autoindexing be extended? Seems like the clearest demonstration is just
(All of your |
Kinda, yes. I would prefer to to describe what I want instead of heaving to explicitly say want R to do. However current optimization only works for some (through the most important) cases. And I am afraid it is supporting the use of data.table in a not data.table way leading to problems as soon as optimization does not kick in. E.g. see:
But I am not sure what I want or what the best solution would be...an how much of it concerns this issue... PS: In your %in% query: Why is the key preserved (only) if both good arrays are sorted? |
Regarding multiple indexes you've linked, it quite heavy implementation as it stores not just integer order sequence for each index but all the column values included in the index for each index, see idxv.R#L10. I'm not sure if this what you discuss here fits to the first post which is clearly focused on single change for CJ. |
where does this stand now that R has |
If the following version includes
|
I wouldn't go that way unless there will be officially documented (at least on r-devel mailing) ALTREP api to represent cross product of data.frames, otherwise such API is likely to change without any notice. |
I think it could be our own ALTREP class, though I still don't find good documentation on how to use ALTREP, exactly :) |
Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link
CJ() could return an irregular column length data.table (its inputs unchanged), marked somehow so that binarysearch knows to %% index through the CJ columns, rather than allocating long vectors and populating them as CJ currently does. Purely for speed, with no changes for users unless perhaps CJ() has been used outside [.data.table.
Or, more generally, data.table columns could be marked so that indexing to them knows they should be recycled, without actually allocating and recycling the columns. Discuss on datatable-help needed as under-the-hood code e.g. DT$foo[903] might not work if it was internally a short vector. In reality it's probably only useful for CJ() as regular data.table's only have recycled data perhaps to start with creating them with say an NA column but then are quickly populated. If a large table has a column with the same value (say NA or 0) for every row, perhaps the column is redundant. Can't imagine that recycling >1 items is often useful (other than in CJ()).
The text was updated successfully, but these errors were encountered: