.I in DT[i, .I, by] should return row numbers of DT not of DT[i] #1494
Comments
@mattdowle I agree that this is inconsistent. However, changing .I will most likely break existing code without adding significant functionality. What the OP asked for can be achieved by having an explicit id column in the DT. So, instead of using .I, one can return the id column.
require(data.table)
DT <- data.table(X=c(5,15,20,25,30))
DT[, id := .I]
# rows 4 & 5 are greater than 20
DT[X > 20, .I]
#[1] 1 2
DT[X > 20, id]
#[1] 4 5
If you decide to change .I functionality, please add a warning in v1.9.8 when .I is called without grouping and i is present, and implement the change in v2.0.0 |
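For readers following along, here is a minimal sketch of two ways to get row numbers of the full table without adding an id column. It reuses the toy DT from the comment above; the variable names rows_which and rows_j are illustrative only. It relies on the which = TRUE argument and on .I being used in j with no i, so it does not depend on how .I behaves when i is present.

```r
library(data.table)

DT <- data.table(X = c(5, 15, 20, 25, 30))

# which = TRUE returns the row numbers of DT that i matches
rows_which <- DT[X > 20, which = TRUE]

# With no i, .I spans every row of DT, so filtering inside j
# also yields row numbers of the full table
rows_j <- DT[, .I[X > 20]]

print(rows_which)
print(rows_j)
```

Neither form stores an extra column, which addresses the memory-cost concern, though which = TRUE cannot be combined with further work in j.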
I think I prefer it be left as-is. To keep this kind of consistency, you should also have Maybe
|
@ChristK understood about backwards compatibility but in this case One intention of .I is to be a way to get row numbers from the full DT so it can be saved to a variable and then used in another query (e.g. set() or DT[i+1]). Kind of an advanced which() since you can do things with .I in j by by= before being returned by the query. (Adding the .I as a column has a memory cost.)
@franknarf1 Good thought. But Is the existing possible usage something like:
DT[colA>42, .(sum(colB), .I), by=colC] ?
I'm not sure why the existing behaviour of .I is useful there, i.e. recycling sum(colB) alongside the .I row numbers from the colA>42 subset. But I can see the following usage useful (if .I were changed):
On the other hand, maybe that is clearer (and potentially most efficient) as:
DT[colA>42, if (sum(colB) > 200) .SD[,colD:=NA], by=colC]
But := inside .SD is currently unimplemented iirc. |
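As a hedged illustration of the use case discussed here (blank out colD for the colA > 42 rows of any group whose colB total over those rows exceeds 200), the sketch below collects full-table row numbers with .I in a grouped query that has no i, then updates by reference with set(). The data values and the names keep and rows are assumptions made up for this example; the column names follow the comment above.

```r
library(data.table)

# Hypothetical data; column names follow the discussion above
DT <- data.table(
  colA = c(50, 60, 10, 70, 80, 45),
  colB = c(150, 100, 500, 20, 30, 10),
  colC = c("a", "a", "a", "b", "b", "b"),
  colD = as.numeric(1:6)
)

# No i is used, so within each group .I holds row locations in DT.
# The subset condition is applied inside j instead.
rows <- DT[, {
  keep <- colA > 42
  if (sum(colB[keep]) > 200) .I[keep]  # NULL drops the group
}, by = colC]$V1

# Guard against an empty result before updating by reference
if (length(rows) > 0L) {
  set(DT, i = rows, j = "colD", value = NA_real_)
}

print(rows)
print(DT)
```

Moving the filter from i into j costs a full scan per group, so this trades some speed for row numbers that unambiguously refer to DT.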
What I meant about
So, the group counter assigned to each group depends on the subset I only use For your last example, I'd rather see syntax like #788
though, now that I think about it, I guess that is potentially much less efficient (assigning separately for each group)... |
Ok will mull it over. Agreed having= looks nicer. |
@mattdowle I agree with this approach. My comment regarding backward compatibility was that these types of changes can break code that you don't even remember exists, and this specific update could break it silently. I use data.table for microsimulation, and in this context the existing functionality of .I can be used, for example, to sample from the rows of the subsetted table and then apply some function:
## extending my previous code
DT[X > 20,
   {xx <- sample(.I, 1)
    .SD[xx]
   }][ # some function here
]
|
I'm pretty sure there's another issue about this lying around, let me see if I can find it... found it; #539 gets at the same issue. |
@ChristK Good try but shouldn't that @MichaelChirico Nice find of #539. Thanks. |
@mattdowle Yes, I agree with the proposed approach regarding warning messages. My example was just an example :) . I agree that there are faster and safer ways to achieve the same result. |
I think my answer to this simple question on SO is related here as well. It seems reasonable to expect that |
Related to #539 |
I would prefer to see .I unchanged, but introduce a .ROW function (or similar) to return row numbers from DT irrespective of what is in i. Below is an example which seems to be hard and convoluted to do using the current features. I need to get row numbers of DT1 in order to use set() to make some complex updates to a subset of rows.
This gives a vector of rows. But if I want to filter DT1 with other criteria, such as In this case the following works, but is clumsy: I can then use these row numbers in a set() loop. In my original problem I have >100,000 rows in DT1. The intermediate vector rows has length around 10,000 and the final list for updating has about 100 entries. I want to be able to access the row number of DT1 at any point in the 'nested' query so that I can use this to refer back to the original table for set(DT,'row number',j,value) |
Different result: Thanks for the intersect suggestion. This appears to be around 8x faster than my version.
With respect to the use of .ROW, I was thinking
rows <- DT1[DT2[price > 8],.ROW[qty < .75], on = "product"]
#followed by
If I use .I where this has .ROW, then I get
[1] 3 4
Note that in my particular problem I haven’t found a way to vectorise myFunction, as it does some quite complex calculations of inventory levels and capacity to determine the amount of the adjustment. As the effect is cumulative, it appears to be most efficient to do this within a for loop and set().
Regards, Simon
From: franknarf1
@sch56 I ran your code and I see a different result (3 9 12 not 9 12). You didn't show the syntax you imagine would be possible with .ROW, so it's not clear how you expect it to clean up your code, but this is an option, fyi:
intersect(rows, DT1[qty < .75, which=TRUE]) |
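The intersect() suggestion followed by a set() loop might look roughly like the sketch below. The original tables and myFunction were not shown in the thread, so everything here is assumed for illustration: the contents of DT1 and DT2, the %in% filter standing in for the join, and the simple "+ 1" adjustment standing in for myFunction.

```r
library(data.table)

# Hypothetical tables; only the column names come from the thread
DT1 <- data.table(
  product = c("a", "b", "c", "a", "b", "c"),
  qty     = c(0.5, 0.9, 0.6, 1.2, 0.7, 0.3)
)
DT2 <- data.table(product = c("a", "b", "c"), price = c(10, 5, 9))

# Row numbers of DT1 whose product has price > 8 in DT2
# (an %in% filter is used here in place of the join)
rows_price <- DT1[product %in% DT2[price > 8, product], which = TRUE]

# Combine with a second criterion on DT1, as suggested above
rows <- intersect(rows_price, DT1[qty < .75, which = TRUE])

# Cumulative, row-by-row update by reference; the "+ 1" is a
# placeholder for a more complex, non-vectorisable adjustment
for (r in rows) {
  set(DT1, i = r, j = "qty", value = DT1$qty[r] + 1)
}

print(rows)
print(DT1)
```

Because each which = TRUE call returns row numbers of DT1 itself, the two vectors can be intersected directly and passed to set() without an extra id column.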
Sorry, I impulsively deleted my comment after realizing the Strangely, it looks like we are seeing different data:
I've never seen that with R before.
Yeah, that makes sense. I would prefer |
I want to add my vote of creating a row number variable. Another naming candidate is I just asked a related question in stackoverflow. I wanted to exclude some rows based on row number, and select rows with condition, update columns. Row number indexing and condition expression in The proposed answer used It's not the first time that I often found I have this kind of usage:
This raises a question about the row number behavior: what happens when a subset data table is created? Should we keep the original row number in a column named Otherwise we will have row number columns in different tables with different meanings, and joining them on the same-named row number column will create problems. |
@user3351605 might have a point here. Think I'm in favour of .I being changed here for consistency so it always refers to rows of DT, even when i is present. Views?
http://stackoverflow.com/q/22408306/403310