Duckdb comparison #170
Is it possible to add duckdb to this test?
|
Never heard about it before, but it looks very interesting. Thanks for filing the request. Yes, it is possible. |
Thanks. Let me know if you need any help. If you are lazy, you could just run the dplyr code through dbplyr, but that would do duckdb a huge injustice. Also make sure you set up the pragmas to use multithreading: https://duckdb.org/docs/sql/pragmas |
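As a rough sketch of the dbplyr route mentioned above (not code from this thread; the table and column names are taken from the Walmart data used later in the thread, and `walmart` is a placeholder data.frame assumed to already be in R memory):
library(DBI)
library(dplyr)
library(dbplyr)
con <- dbConnect(duckdb::duckdb(), ":memory:")
dbWriteTable(con, "data", walmart)   # `walmart` is a placeholder data.frame
# dplyr verbs are translated to SQL by dbplyr and executed inside duckdb;
# collect() pulls the (small) aggregated result back into R
tbl(con, "data") %>%
  group_by(Store, Dept) %>%
  summarise(mean_sales = mean(Weekly_Sales, na.rm = TRUE)) %>%
  collect()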
Help will be appreciated, otherwise it may not happen very soon. There are other tools waiting to be added, as well as some improvements. |
I never heard of duckdb either so I wanted to give it a test drive. I have some code below to show off some basic benchmarks to get some insight to @jangorecki about the potential performance. I have no clue at this point about optimizing duckdb inside R but I ran with the basic example from the docs. Also, @jangorecki I ran into a data.table error when trying to process 1.7 billion rows. Not sure if that is expected or not. The error code I was running into was this: "Error in gforce(thisEnv, jsub, o__, f__, len__, irows) : Internal error: Failed to allocate counts or TMP when assigning g in gforce" Nonetheless, here the code I was running. I tested using 32 threads and 64 threads using AVG / mean and MIN / min across two numeric grouping variables (I used the Kaggle Walmart data set and appended it repeatedly to grow its size for subsequent runs). Threads only affected data.table as I'm unaware if there is a way to alter thread usage in duckdb. However, I didn't see the CPU usage go above 3-4% when running the duckdb operations.
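The exact benchmark code from this comment was not preserved in the thread. A minimal sketch of the kind of aggregation comparison being described, assuming the Walmart table and column names used later in the thread (data, Store, Dept, Weekly_Sales) and a local CSV copy of the data:
library(DBI)
library(data.table)
library(microbenchmark)
# In-memory duckdb connection plus a data.table copy of the same table
con <- dbConnect(duckdb::duckdb(), ":memory:")
dt  <- fread("walmart_train.csv")            # Kaggle Walmart data (local copy assumed)
dbWriteTable(con, "data", dt, overwrite = TRUE)
# AVG / MIN of Weekly_Sales grouped by the two grouping columns
microbenchmark(
  times = 10,
  duckdb = dbGetQuery(con, '
    SELECT "Store", "Dept",
           AVG(Weekly_Sales) AS Mean_Sales,
           MIN(Weekly_Sales) AS Min_Sales
    FROM data
    GROUP BY "Store", "Dept"'),
  data.table = dt[, .(Mean_Sales = mean(Weekly_Sales),
                      Min_Sales  = min(Weekly_Sales)),
                  by = .(Store, Dept)]
)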
|
@AdrianAntico Thank you for the benchmarks. As for the data.table error, it is a regression, reported here: Rdatatable/data.table#4818 |
@AdrianAntico you need to set duckdb to use multithreading here https://duckdb.org/docs/sql/pragmas |
@ArmanAttaran Inside R I ran the below, but given the warning I'm not sure if that is how to run PRAGMA statements. Can you let me know if that's correct and, if not, how to run those? *Edit - looks like this is the way?
not this way
|
Yes, DBI queries which are not retrieving results should be sent (dbSendQuery) rather than get (dbGetQuery). I am not sure if querying duckdb from R like that is the proper way, as there might be significant overhead related to pulling query results into R memory. That requires exploration, or comments from someone who has already explored it. |
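A small sketch of the "send, don't get" point (not from the thread): DBI::dbExecute() is intended for statements that return no rows, while dbSendQuery() leaves an open result set that should be cleared explicitly.
library(DBI)
con <- dbConnect(duckdb::duckdb(), ":memory:")
# Preferred for statements that return no rows
dbExecute(con, "PRAGMA threads=32")
# dbSendQuery() also works, but the pending result should be cleared
res <- dbSendQuery(con, "PRAGMA threads=32")
dbClearResult(res)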
@jangorecki - thanks for the info on the tmp issue. It would be nice to start incorporating other operations as well. For example, the speed to build lags or rolling stats is probably the most important to me for forecasting applications. While duckdb appears to be solid for aggregation operations, the performance gap on lag operations is too large for me to be able to make use of it. The code below generates lags 1 through 10 (I typically generate far more than that). @ArmanAttaran - let me know if the query I set up below is not correct or can be optimized, because the run times aren't promising as is.
data.table mean run time: 95 milliseconds (357x faster)
duckdb mean run time: 33,969 milliseconds
# DuckDB initialization
con <- DBI::dbConnect(duckdb::duckdb(), ":memory:")
# Set duckdb to use 32 threads
DBI::dbSendQuery(conn = con, 'PRAGMA threads=32')
# data.table initialization
data.table::setDTthreads(threads = parallel::detectCores() / 2)
data.table::getDTthreads(verbose = TRUE)
# omp_get_num_procs()            64
# R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
# R_DATATABLE_NUM_THREADS        unset
# R_DATATABLE_THROTTLE           unset (default 1024)
# omp_get_thread_limit()         2147483647
# omp_get_max_threads()          64
# OMP_THREAD_LIMIT               unset
# OMP_NUM_THREADS                unset
# RestoreAfterFork               true
# data.table is using 32 threads with throttle==1024. See ?setDTthreads.
# Load the Walmart data
data <- data.table::fread("https://www.dropbox.com/s/2str3ek4f4cheqi/walmart_train.csv?dl=1")
# Convert IDate to Date
data[, Date := as.Date(Date)]
# Copy the table into duckdb
DBI::dbWriteTable(con, "data", data, overwrite = TRUE)
# Window query: lags 1 through 10 of Weekly_Sales per Store / Dept, ordered by Date
Query <- paste0('
SELECT
  *,
  lag(Weekly_Sales, 1) OVER(PARTITION BY "Store", "Dept" ORDER BY "Date" ASC) AS Lag_1,
  lag(Weekly_Sales, 2) OVER(PARTITION BY "Store", "Dept" ORDER BY "Date" ASC) AS Lag_2,
  lag(Weekly_Sales, 3) OVER(PARTITION BY "Store", "Dept" ORDER BY "Date" ASC) AS Lag_3,
  lag(Weekly_Sales, 4) OVER(PARTITION BY "Store", "Dept" ORDER BY "Date" ASC) AS Lag_4,
  lag(Weekly_Sales, 5) OVER(PARTITION BY "Store", "Dept" ORDER BY "Date" ASC) AS Lag_5,
  lag(Weekly_Sales, 6) OVER(PARTITION BY "Store", "Dept" ORDER BY "Date" ASC) AS Lag_6,
  lag(Weekly_Sales, 7) OVER(PARTITION BY "Store", "Dept" ORDER BY "Date" ASC) AS Lag_7,
  lag(Weekly_Sales, 8) OVER(PARTITION BY "Store", "Dept" ORDER BY "Date" ASC) AS Lag_8,
  lag(Weekly_Sales, 9) OVER(PARTITION BY "Store", "Dept" ORDER BY "Date" ASC) AS Lag_9,
  lag(Weekly_Sales, 10) OVER(PARTITION BY "Store", "Dept" ORDER BY "Date" ASC) AS Lag_10
FROM
  data')
# Benchmark: duckdb window query vs data.table::shift()
microbenchmark::microbenchmark(
  times = 30,
  DBI::dbGetQuery(con, Query),
  data[, paste0("Lag_", 1L:10L) := data.table::shift(x = Weekly_Sales, n = 1L:10L, type = "lag"), by = list(Store, Dept)])
# Unit: milliseconds
# expr                                  min         lq          mean         median       uq          max         neval
# DBI::dbGetQuery(con, Query)
#                                       33882.7288  33924.5433  33969.42098  33963.94245  34003.2308  34098.5795  30
# data[, `:=`(paste0("Lag_", 1L:10L), data.table::shift(x = Weekly_Sales, n = 1L:10L, type = "lag")), by = list(Store, Dept)]
#                                       75.5192     77.7901     95.12174     86.01295     88.5566     252.3392    30
|
Ok, give me some time. I'll look at it on the weekend.
|
This is an example where the overhead of pulling query results into R will be more clearly visible. Lag and rolling stats return the same number of rows as they received on input, so the data to copy from the duckdb engine into R is bigger than in the case of aggregation, where the output is usually much smaller than the input. Although I haven't made any tests, so I cannot confirm that this is the case here. As for "tasks" like lag and rolling statistics, I would like to encourage you to create an issue requesting them, ideally explaining your use case and providing example code. I would like this project to be more community driven, so the more upvotes an issue gets, the higher its priority will be. |
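One way to separate compute time from transfer time is to materialise the window-function result inside duckdb and compare that against fetching the full result into R. A sketch (not something measured in this thread), reusing the con and Query objects from the snippet above:
library(microbenchmark)
# Compute only: the result stays inside duckdb
microbenchmark(times = 5, compute_only = {
  DBI::dbExecute(con, "DROP TABLE IF EXISTS lags")
  DBI::dbExecute(con, paste("CREATE TABLE lags AS", Query))
})
# Compute + transfer: the full result is copied into an R data.frame
microbenchmark(times = 5, compute_and_fetch = DBI::dbGetQuery(con, Query))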
@jangorecki Thanks for the response. I'll put something together for the request. |
I'm not familiar with |
These results look pretty good for both systems. Regarding the data.table call data[, paste0("Lag_", 1L:10L) := data.table::shift(x = Weekly_Sales, n = 1L:10L, type = "lag"), by = list(Store, Dept)]: does this include an implicit sorting step, or is the data kept in insertion order, which happens to be correct? The window expression used for DuckDB does include such a sorting step, which is of course not free. |
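A tiny sketch of the insertion-order question (not from the thread): shift() with by= works on rows in their current order within each group, so date-ordered lags rely on either correct insertion order or an explicit sort.
library(data.table)
dt <- data.table(grp  = c(1, 1, 1),
                 date = as.Date(c("2020-01-03", "2020-01-01", "2020-01-02")),
                 x    = c(30, 10, 20))
# No implicit sort: lags follow the current row order within each group
dt[, lag_as_stored := shift(x), by = grp]
# Sorting first gives date-ordered lags
setorder(dt, grp, date)
dt[, lag_by_date := shift(x), by = grp]
dt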
@hannesmuehleisen data.table doesn't do a sorting step, but it also does sorting really fast. I just updated the code snippet for the lag and moving average operation below. Note that I re-shuffled after the first run and that I only ran the operation once for both tests, since the time to sort already-sorted data might bias the result. Nonetheless, I think @jangorecki was correct in that the benchmark times look bad for duckdb due to the load-data-into-R step that is taking place.
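The updated snippet referred to here was not preserved above. A minimal sketch of the reshuffle-then-run step being described, assuming the data, con, and Query objects from the earlier snippet:
# Shuffle the row order so neither engine benefits from rows already being
# sorted by Store / Dept / Date from a previous run
set.seed(42)
data <- data[sample(.N)]
DBI::dbWriteTable(con, "data", data, overwrite = TRUE)
# duckdb: the window query sorts within each partition itself
system.time(res_duckdb <- DBI::dbGetQuery(con, Query))
# data.table: sort explicitly, then build the lags; run once only, since a
# second run would operate on already-sorted data
system.time({
  data.table::setorder(data, Store, Dept, Date)
  data[, paste0("Lag_", 1L:10L) := data.table::shift(Weekly_Sales, n = 1L:10L, type = "lag"),
       by = .(Store, Dept)]
})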
|
We just merged some improvements to window functions (duckdb/duckdb#1500), and when running with those, the timings look as follows on our machine (with 8 threads):
Looks much better I'd say! |
@hannah-tillman I can see the difference to some extent on my side, but the run time is still dominated by pulling the data into R memory after the calculations are complete. Previous run times for DuckDB were roughly 33k milliseconds, and now I'm seeing 29.5k milliseconds.
Getting data into R fast is a very important task, which is why I tend to store flat files for machine learning scoring purposes (that's what I do professionally). Pulling data out of most DB systems is pretty slow, while data.table::fread() is super fast. For on-demand scoring tasks there is usually some upper limit on the time to deliver predictions, and the less time spent going from a warehouse to R (or Python / Julia), the more time can be spent on recreating features (feature engineering) before model scoring, and feature engineering is where I tend to get the most uplift from an ML-performance perspective.
I'd love to be able to take advantage of the DB benefits of storing data (versus flat files), and I'd be thrilled if DuckDB could find a way to be a leader in this area. Is there any way the dev team could prioritize making the data transfer from DuckDB to R as fast as data.table::fread()? I would think the solution would be relatively similar for Python and Julia as well, which would make a lot of users really happy. |
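A minimal sketch of how that transfer gap could be measured (assuming the con and data objects from the earlier snippets, and writing a local CSV copy for the fread comparison):
# Local flat-file copy of the same table
data.table::fwrite(data, "walmart_local.csv")
microbenchmark::microbenchmark(
  times = 10,
  duckdb_fetch = DBI::dbGetQuery(con, "SELECT * FROM data"),   # DB -> R transfer
  fread        = data.table::fread("walmart_local.csv")        # flat file -> R
)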
@AdrianAntico I fully agree pulling data into R is important. I think DuckDB is already the fastest SQL database wrt transfer into R (by far). Whether this can be as fast as data.table::fread() will depend on the data involved: what table sizes and column types are we talking about? |
@hannesmuehleisen For the tables that are of concern, we're looking at between 10M and 300M records. The tables have a mix of numeric, int, categorical (lower cardinality), string (free-hand comment type), date, and datetime types. The company I'm at is a Microsoft shop, so I can do some run time comparisons. I've tried working with Microsoft's bcp CLI and even their SSIS packages to download data as flat files and then loading via data.table::fread(), but they are a headache to get working properly in a short time. |
I've implemented a prototype. Overall, is there anything else blocking the addition of DuckDB to the db-benchmark website? CC @jangorecki |
@hannesmuehleisen I tried running the below and ran into an error. Do you have any ideas about how to proceed?
"> devtools::install_github(repo = "hannesmuehleisen/duckdb", subdir = "tools/Rpkg", dependencies = FALSE)
Installing package into ‘C:/Users/Bizon/Documents/R/win-library/4.0’
|
Yeah the |
@hannesmuehleisen only the lack of time for this project at the moment. |
Can I do anything to help? |
I'll get started on a PR, or are there any political reasons not to add DuckDB? |
See #200 |