
joinbyv - batch join multiple tables #694

Closed
wants to merge 0 commits into from

Conversation

jangorecki
Member

Hello,
Utility function to batch join multiple tables. With a single function call you can perform many deeply customized joins. I found it very useful in the process of denormalizing hierarchical data from a data warehouse/DB organized in star schema or snowflake schema design patterns. Generally it can be useful whenever a series of joins is required.
Any comments?
Jan
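
For readers unfamiliar with the pattern, the batch-join idea can be sketched in base R with a fold over a list of dimension tables. This is only an illustration of the concept, not the joinbyv API; the table and column names below are hypothetical:

```r
# Hypothetical star-schema fragment: one fact table, two dimension tables.
fact     <- data.frame(prod_id = c(1, 2, 1), cust_id = c(10, 10, 20), value = c(5, 7, 9))
product  <- data.frame(prod_id = c(1, 2), prod_name = c("apple", "pear"))
customer <- data.frame(cust_id = c(10, 20), cust_name = c("ann", "bob"))

dims <- list(product, customer)
# Fold the fact table over each dimension, joining on the shared key column(s).
denorm <- Reduce(function(x, y) merge(x, y, by = intersect(names(x), names(y))),
                 dims, fact)
# denorm now carries prod_name and cust_name alongside each fact row
```

joinbyv presumably adds the customization (join columns, column selection, etc.) that a bare `Reduce(merge, ...)` lacks.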

@mattdowle mattdowle added this to the v1.9.4 milestone Jul 9, 2014
@mattdowle
Member

Sorry for the delay - this looks important, so I want to get it right. I've added it to the 1.9.4 milestone. There's a related issue here to tackle as well before release to CRAN (which signals we're happy with function names, syntax, etc.):

http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080p4692273.html

These might tie in with secondary keys, too.

@jangorecki
Member Author

We may skip this pull request; I'm going to push a separate branch that includes this function plus some additional ones.
I also have a question related to this:
is it possible to query in i using binary search, passing an artificial value such as TRUE or character() as a J() argument, to skip filtering on that field?
to reproduce:

devtools::install_github("data.table", "Rdatatable", build_vignettes=FALSE, ref = NULL, pull = 694)
library(data.table)
# run `populate example data` section from examples:
?joinbyv
# then run code below:
time[,date_code := as.character(date_code)]
n <- 1e5
set.seed(2345)
sales <- 
  data.table(prod_code = sample(product[,prod_code], n, TRUE),
             cust_code = sample(customer[,cust_code], n, TRUE),
             state_code = sample(geography[,state_code], n, TRUE),
             date_code = sample(time[,date_code], n, TRUE),
             quantity = rnorm(n, 500, 200),
             value = rnorm(n, 10000, 2000))
setkeyv(sales,c("prod_code","cust_code","state_code","date_code"))
# binary search filtering works fine on all four keys
sales[CJ(c(12,13,15), c(15,16,21,24),c("TX","AK","CA"),c("2013-01-18","2013-01-31","2014-08-06")), nomatch=0]
# but if I want to binary search on 3 keys, excluding `state_code` I would like to use some kind of:
sales[CJ(c(12,13,15), c(15,16,21,24),TRUE,c("2013-01-18","2013-01-31","2014-08-06")), nomatch=0]
sales[CJ(c(12,13,15), c(15,16,21,24),NA,c("2013-01-18","2013-01-31","2014-08-06")), nomatch=0]
sales[CJ(c(12,13,15), c(15,16,21,24),character(),c("2013-01-18","2013-01-31","2014-08-06")), nomatch=0]
# none of these works

# binary search workaround, not so nice:
sales[CJ(c(12,13,15), c(15,16,21,24), unique(sales[,state_code]), c("2013-01-18","2013-01-31","2014-08-06")), nomatch=0]
# non binary search workaround:
sales[CJ(c(12,13,15), c(15,16,21,24)), nomatch=0][date_code %in% c("2013-01-18","2013-01-31","2014-08-06")]
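
The first workaround could be wrapped in a small helper that builds the CJ() argument list, substituting all observed values for any key marked NULL so binary search still applies across the full key. This helper (`cj_args`) is hypothetical, not part of data.table:

```r
# Hypothetical helper: for each key, use the supplied values, or fall back to
# every observed value of that column when the entry is NULL ("no filter").
cj_args <- function(dt, keys, vals) {
  lapply(seq_along(keys), function(i)
    if (is.null(vals[[i]])) unique(dt[[keys[i]]]) else vals[[i]])
}

# toy demonstration on a plain data.frame
df  <- data.frame(a = c(1, 1, 2), b = c("x", "y", "x"))
out <- cj_args(df, keys = c("a", "b"), vals = list(NULL, "x"))
# out[[1]] holds all observed values of `a`; out[[2]] is just "x"

# with the keyed data.table above, one might then write (untested sketch):
# sales[do.call(CJ, cj_args(sales, key(sales),
#   list(c(12,13,15), c(15,16,21,24), NULL,
#        c("2013-01-18","2013-01-31","2014-08-06")))), nomatch=0]
```

Note the cost is the same as the explicit `unique(sales[,state_code])` workaround: the cross join still enumerates every value of the skipped key.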

As 1.9.4 is already long awaited, we need not hurry this pull-request feature into 1.9.4 (as it is tagged now).
Additionally, because of the dependency on CRAN's data.table, we cannot use 1.9.3 features in other CRAN packages until 1.9.4 is released.

@jangorecki
Member Author

The pull request has been removed from master due to a branch switch; the version mentioned here is temporarily available at jangorecki@b50bb9d.
I will submit a new PR once I manage to organize the code into a separate, up-to-date branch.
The question from the post above has been filed as #797.
