Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.I in DT[i, .I, by] should return row numbers of DT not of DT[i] #1494

Open
mattdowle opened this issue Jan 12, 2016 · 15 comments
Open

.I in DT[i, .I, by] should return row numbers of DT not of DT[i] #1494

mattdowle opened this issue Jan 12, 2016 · 15 comments
Labels
Milestone

Comments

@mattdowle
Copy link
Member

@user3351605 might have a point here. Think I'm in favour of .I being changed here for consistency so it always refers to rows of DT, even when i is present. Views?

http://stackoverflow.com/q/22408306/403310

@ChristK
Copy link

ChristK commented Jan 13, 2016

@mattdowle I agree that this is inconsistent. However changing .I will most likely break existing code without adding significant functionality. What OP asked for can be achieved by having an explicit id column in the DT. So, instead of using .I can return the id column.

require(data.table)
DT <- data.table(X=c(5,15,20,25,30))
DT[, id := .I]

# rows 4 & 5 are greater than 20
DT[X > 20, .I]
#[1] 1 2 
DT[X > 20, id]
#[1] 4 5 

If you decide to change .I functionality please add a warning in v1.9.8 when .I is called without grouping and i is present, and implement in v2.0.0

@franknarf1
Copy link
Contributor

I think I prefer it be left as-is. To keep this kind of consistency, you should also have .GRP refer to the group number within the full DT even when subsetting DT[i, .GRP, by=.(x,y,z)], which I guess could be inefficient (?).

Maybe .WHICH could be defined as the constant for row numbers of DT (matching the usage in DT[DT2, which=TRUE]? Extending ChristK's example...

DT2 = data.table(X=c(5,20), Y=c(1,2))

# proposed code (doesn't work currently):
DT[DT2, .(w = .WHICH, y2 = Y^2), on="X"]
#    w y2
# 1: 1  1
# 2: 3  4

@mattdowle
Copy link
Member Author

@ChristK understood about backwards compatibility but in this case .I isn't very useful so I doubt anyone is using it in this case. Interested to hear from anyone who is. We can certainly test the waters by adding a warning as you suggest e.g. ".I is about to change when used with i. Since you are using .I in this way, please contact us." and see if anybody contacts us before then changing it. Then if ok, introduce the change optionally with a slow migration path, just like the by=.EACHI breaking change was introduced.

One intention of .I is to be a way to get row numbers from the full DT so it can be saved to a variable and then used in another query (e.g. set() or DT[i+1]). Kind of an advanced which() since you can do things with .I in j by by= before being returned by the query. (Adding the .I as a column has a memory cost.)

@franknarf1 Good thought. But .GRP is intended to be a simple group counter as the groups are processed. It isn't supposed to refer to anything outside the groups being looped-through. Not sure I see any link between .I and .GRP really.

Is the existing possible usage something like : DT[colA>42, .(sum(colB), .I), by=colC] ? I'm not sure why the existing behaviour of .I is useful there. i.e. recycling sum(colB) alongside the .I row numbers from the colA>42 subset. But I can see the following usage useful (if .I were changed):

w = DT[colA>42,  if (sum(colB) > 200) .I, by=colC]$I
DT[w, colD:=NA]

on the other hand maybe that is clearer (and potential most efficient) as :

DT[colA>42, if (sum(colB) > 200) .SD[,colD:=NA], by=colC]

But := inside .SD is currently unimplemented iirc.

@franknarf1
Copy link
Contributor

What I meant about .GRP was:

DT = data.table(g=c("A","A","B","C"), x=1:4)
DT[, .GRP, by=g] # versus...
DT[x>2, .GRP, by=g]

So, the group counter assigned to each group depends on the subset i, just as .I currently does. It seemed to me like the consistency argument put forward for changing .I extends to .GRP, but I can see your point if .I is meant to be independent of i while .GRP is not.

I only use .I for row numbers of the full DT, so I guess I'm fine with the change. I don't think backward incompatibility would bite my code much if at all.

For your last example, I'd rather see syntax like #788

DT[colA>42, colD := NA, by=colC, having={sum(colB) > 200}]

though, now that I think about it, I guess that is potentially much less efficient (assigning separately for each group)...

@mattdowle
Copy link
Member Author

Ok will mull it over. Agreed having= looks nicer.

@ChristK
Copy link

ChristK commented Jan 14, 2016

@mattdowle I agree with this approach. My comment regarding backward compatibility was that these type of changes can break code that you don't even remember it exists and this specific update could break it silently. I use data.table for microsimulation and in this context existing functionality of .I can be used for example to sample from the rows of the subsetted table and then apply some function:

## extending my previous code
DT[X > 20,
   {xx <- sample(.I, 1)
     .SD[xx]
  }][#some function here
]

@MichaelChirico
Copy link
Member

I'm pretty sure there's another issue about this lying around, let me see if I can find it...

found it;

#539 gets at the same issue.

@mattdowle
Copy link
Member Author

@ChristK Good try but shouldn't that sample(.I, 1) be sample(.N, 1)? Since sample(1:.N, 1) is the same as sample(.N,1) but the latter is (or should be) more efficient by avoiding creating the long 1:.N vector. Actually that could be almost infinitely faster so very significant speed penalty of using .I in that way that we could detect and warn about anyway. But fully agree that we should be concerned about .I changing silently and so yes the warning would be in place throughout the switchover so that users would be warned in some way. What we've done before (iirc) is that users have to set options(datatable.warnIchangedWhenWithi=FALSE) to explicitly say they know about it, otherwise, if they're using .I in j along with i they'll always get a warning. We'd only turn off that warning after a few years. This method conveys the message that we're serious about backwards compatibility and silent errors/warnings, which we are. For example to new users in one years time, they'll get that warning and will need to switch it off (if they use .I with i too). That minor annoyance is worth it since they'll then understand how we manage to make such changes in a safe and robust way.

@MichaelChirico Nice find of #539. Thanks.

@ChristK
Copy link

ChristK commented Jan 15, 2016

@mattdowle Yes, I agree with the proposed approach regarding warning messages. My example was just an example :) . I agree that there are faster and safer ways to achieve the same result.

@MichaelChirico
Copy link
Member

I think my answer to this simple question on SO is related here as well.

It seems reasonable to expect that yDT[!xDT, .I, on = paste0("V", 1:3)] would have returned the result directly instead of having to define .I as a fixed column beforehand.

@arunsrinivasan
Copy link
Member

Related to #539

@sch56
Copy link

sch56 commented May 9, 2016

I would prefer to see .I unchanged, but introduce a .ROW function (or similar) to return row numbers from DT irrespective of what is in i.

Below is an example which seems to be hard and convoluted to do using the current features.

I need to get row numbers of DT1 in order to use set() to make some complex updates to a subset of rows.
set.seed(1)
DT1 <- data.table(loc = c("L1", "L2"), product = c("P1", "P2", "P3"), qty = runif(12))
DT2 <- data.table(product = c("P2", "P3"), family=c("A", "B"), price = c(7, 10))

DT1[DT2[price > 8], on = "product", which = TRUE]
#[1] 3 6 9 12

This gives a vector of rows. But if I want to filter DT1 with other criteria, such as qty<.75, or I want this sorted by some other element, then I have a problem.

In this case the following works, but is clumsy:
rows <- DT1[DT2[price > 8], on = "product", which = TRUE]
DT1[rows, .(rows, qty)][qty < .75, rows]
#[1] 9 12

I can then use these row numbers in a set() loop.

In my original problem I have >100,000 rows in DT1. The intermediate vector rows has length around 10,000 and the final list for updating has about 100 entries.

I want to be able to access the row number of DT1 at any point in the 'nested' query so that I can use this to refer back to the original table for set(DT,'row number',j,value)

@sch56
Copy link

sch56 commented May 13, 2016

Different result:
I still get the result I posted (9 12). Perhaps we are getting different answers for runif with set.seed(1).
I get the following values for DT1
loc product qty
1: L1 P1 0.1297134
2: L2 P2 0.9822407
3: L1 P3 0.8267184
4: L2 P1 0.2423550
5: L1 P2 0.8568853
6: L2 P3 0.8408788
7: L1 P1 0.3421633
8: L2 P2 0.7062672
9: L1 P3 0.6212432
10: L2 P1 0.6537663
11: L1 P2 0.9224086
12: L2 P3 0.5363538

Thanks for the intersect suggestion. This appears to be around 8x faster than my version.

With respect to the use of .ROW, I was thinking

rows <- DT1[DT2[price > 8],.ROW[qty < .75], on = "product"]
rows
[1] 9 12

#followed by
for (r in rows) set(DT1, r, 3, myFunction(qty))

If I use .I where this has .ROW, then I get

[1] 3 4
which I can’t use in the set loop.

Note that in my particular problem I haven’t found a way to vectorise myFunction as it does some quite complex calculations of inventory levels and capacity to determine the amount of the adjustment. As the effect is cumulative it appears to be most efficient to do this within a for loop and set().

Regards, Simon

From: franknarf1 [mailto:[email protected]]
Sent: Friday, 13 May 2016 2:34 a.m.
To: Rdatatable/data.table [email protected]
Cc: Simon Harrison [email protected]; Mention [email protected]
Subject: Re: [Rdatatable/data.table] .I in DT[i, .I, by] should return row numbers of DT not of DTi

@sch56https://github.com/sch56 I ran your code and I see a different result (3 9 12 not 9 12).

You didn't show the syntax you imagine would be possible with .ROW, so it's not clear how you expect it to clean up your code, but this is an option, fyi:

intersect(rows, DT1[qty < .75, which=TRUE])


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHubhttps://github.com//issues/1494#issuecomment-218776301

@franknarf1
Copy link
Contributor

franknarf1 commented May 13, 2016

Sorry, I impulsively deleted my comment after realizing the intersect idea is not so great (since it's rather inefficient to make the full comparison qty <.75 on all rows instead of the just the subset for price > 8). Glad to hear it's relatively fast.

Strangely, it looks like we are seeing different data:

> DT1; DT2
    loc product        qty
 1:  L1      P1 0.26550866
 2:  L2      P2 0.37212390
 3:  L1      P3 0.57285336
 4:  L2      P1 0.90820779
 5:  L1      P2 0.20168193
 6:  L2      P3 0.89838968
 7:  L1      P1 0.94467527
 8:  L2      P2 0.66079779
 9:  L1      P3 0.62911404
10:  L2      P1 0.06178627
11:  L1      P2 0.20597457
12:  L2      P3 0.17655675
   product family price
1:      P2      A     7
2:      P3      B    10

I've never seen that with R before.

With respect to the use of .ROW, I was thinking

rows <- DT1[DT2[price > 8],.ROW[qty < .75], on = "product"]

Yeah, that makes sense. I would prefer .XROW and .IROW in case both were useful (from X[i] merges).

@dracodoc
Copy link
Contributor

I want to add my vote of creating a row number variable. Another naming candidate is No.

I just asked a related question in stackoverflow. I wanted to exclude some rows based on row number, and select rows with condition, update columns. Row number indexing and condition expression in i cannot be used together. Although I can chain two selections, that make the column update happen in intermediate data table instead of original data table.

The proposed answer used .I in j to create logical vector then use it in i. It's a clever workaround but I feel it's a little bit against data table idiom.

It's not the first time that .I had been used in j position to solve problems, I think this is a good indication that we need a row number instead of current work around.

I often found I have this kind of usage:

  • x is the original data.table.
  • x.sub1 and x.sub2 are different subsets of x. I added columns in x.sub1 and x.sub2, then I need to merge the new columns back to x.
  • I have to create a row number/row id column in x explicitly to work as key in joining subset table back, because there is no unique key column in x.
  • I could add columns in x directly with the subset condition in i. However that will make the statement very long and repetitive. I think it also will select the rows again and again.

This suggested a question about the row number behavior: what happened when a subset data table is created? Should we keep the original row number in a column named No.x?

Otherwise we will have a row number column in different table with different meanings, and join them by same name row number column will create problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants