Integrate DuckDB #200
Conversation
Thank you for the PR, it looks great. It will help a lot to have duckdb in db-benchmark. I will make some adjustments; the biggest one I see now is to remove the helper function so everything can be executed line by line. If I have any questions I will ask them here. We need to wait a couple of days because the benchmark machine is already occupied by another project.
I made a new branch so it is easier to push to than a remote "master".
Printing the dimensions of an answer has to be included in the timing. This forces the lazy computation that other solutions might be using to actually materialize.
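For concreteness, a minimal sketch of such a timed query (table and column names are illustrative, not the benchmark's actual script), where printing the dimensions is part of the measured block:

```r
library(DBI)
library(duckdb)
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "x", data.frame(id1 = c("a", "b", "a"), v1 = 1:3))
t <- system.time({
  ans <- dbGetQuery(con, "SELECT id1, sum(v1) AS v1 FROM x GROUP BY id1")
  print(dim(ans))   # printing dims is counted inside the timing
})[["elapsed"]]
dbDisconnect(con, shutdown = TRUE)
```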
@hannesmuehleisen How are operations like SUM handled in case of integer overflow? If the type used for int is int32 (which is reasonable and should be preferred, to avoid an extra penalty compared to other tools), then overflow will happen when calculating the checksum on bigger datasets (1e9); we normally use integer64 in those cases.
DuckDB decides what type the SUM result has based on column statistics. BIGINT/int64 is currently returned as double.
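A small illustration of the overflow question, assuming the duckdb R package (the exact result type is version-dependent): summing an INT32 column past its maximum does not wrap around, and the wider result reaches R as a double.

```r
library(duckdb)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "CREATE TABLE t(v INTEGER)")
dbExecute(con, "INSERT INTO t VALUES (2147483647), (1)")  # int32 max + 1
res <- dbGetQuery(con, "SELECT sum(v) AS s FROM t")
res$s         # 2147483648, no int32 wraparound
class(res$s)  # "numeric" (double) rather than integer
dbDisconnect(con, shutdown = TRUE)
```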
@hannesmuehleisen Good to hear. If you have a release channel for your development version that is considered stable-ish (well tested, so broken builds should not happen), we can plug that into db-benchmark instead of using CRAN releases. This was the initial idea for this project, but most solutions do not provide stable-ish devel releases. We currently support this for data.table and Python datatable.
@hannesmuehleisen Benchmark finished. groupby 1e9 could not be completed due to an out-of-memory error. The CSV size is slightly under 50GB, so a machine with 128GB should generally be able to compute that in-memory. For now I will change duckdb to run groupby 1e9 using on-disk storage; after the next benchmark run this should be reflected on the report. If you reduce memory usage in the future, please let me know and I will revert this change 58ab74f. join 1e9 had already been attempted using on-disk storage, but all 3 data cases of that size were killed by the OS OOM killer.
In all 3 cases it happened during the first join query, so the data is being "loaded" properly. If there is any debug setting I can switch on to dump more logs somewhere, I can do it; just tell me how.
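For reference, the in-memory vs. on-disk switch in the duckdb R package looks roughly like this (the file name here is illustrative):

```r
library(duckdb)
con_mem  <- dbConnect(duckdb::duckdb())                               # in-memory
con_disk <- dbConnect(duckdb::duckdb(), dbdir = "groupby_1e9.duckdb") # on-disk
```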
@hannesmuehleisen I updated the exceptions and duckdb is now on the report. If you have any preferences about colors, we can adjust them. It is currently set as follows: https://github.com/h2oai/db-benchmark/blob/duckdb/_benchplot/benchplot-dict.R#L45
Great, it's now on the list! Wrt color, perhaps something that is a bit more readable? I don't really have a preference.
Does duckdb store column statistics like mean? I see q4 is much faster than any other solution.
It does not. We have some statistics like min/max or whether there are NULLs, but those are only upper/lower bounds so we can plan better.
Any idea why q4 is so much faster?
Probably the perfect hashing we use when we know, based on stats, that there can only be a few groups.
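A toy R illustration of the idea (not DuckDB's actual code, which is C++): when min/max statistics bound the grouping key to a small range, each group maps to a fixed slot in a dense array, so no hash table probing is needed.

```r
perfect_hash_sum <- function(id, v, lo, hi) {
  acc <- numeric(hi - lo + 1L)   # one accumulator slot per possible key
  idx <- id - lo + 1L            # key maps directly to a slot: no hashing
  for (i in seq_along(idx)) acc[idx[i]] <- acc[idx[i]] + v[i]
  data.frame(id = lo:hi, sum = acc)
}
perfect_hash_sum(id = c(3L, 1L, 3L, 2L), v = c(10, 20, 30, 40), lo = 1L, hi = 3L)
# id 1 -> 20, id 2 -> 40, id 3 -> 40
```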
Looks great. One thing I don't get is why the 5 GB groupby q3 is so much slower than q5; it should be similar, no?
Grouping is done by a different column, so one should expect a difference.
duckdb timings are in some cases already very impressive, congratulations @hannesmuehleisen. After the 0.2.6 release (or if we decide to switch to devel) I will re-run duckdb; groupby 1e9 will then use on-disk storage, so duckdb 1e9 timings should be displayed on the report as well.
Be sure to check https://h2oai.github.io/db-benchmark/#explore-more-data-cases at the bottom of the report. There are more benchmark plots linked there, with different cardinalities of the id columns, missing values, and pre-sorted data. You can also obtain all timings data by adding …
We also have https://h2oai.github.io/db-benchmark/history.html (duckdb will appear there soon) for tracking performance regressions, but it is an internal-only report, not really meant to be published. For public consumption only the "latest" timings are presented on the main report, but for developers it can be very useful to look at the history plots when they suspect a performance regression. Thanks for joining this project!
@jangorecki 0.2.6 made it to CRAN today, curious about the benchmark perf changes!
@hannesmuehleisen It is now running. So far I noticed that the NA data case (G1_1e7_1e2_5_0) crashed.
Sorry, I made a mistake when looking at the logs. 1e7 q10 did finish. Then on the 1e8 NA data case I observed that the second run of q10 did not finish; the crash came after the first run of q10. Run 1 was not logged, so the error was again during checksum computation. Log outputs for completeness:
1e7 …
1e8 …
The 1e9 groupby, which goes on-disk, was killed by OOM during q2.
@hannesmuehleisen If you believe that the 0.2.6 release could improve handling of the 1e9 data size, which previously ran OOM, I can switch back to in-memory processing.
All joins fail early with the same internal error, after the second question, with the same …
@hannesmuehleisen I filed 3 issues in the duckdb repo; two of them are regressions in 0.2.6. Sorry I haven't narrowed those down to minimal reproducible reports, but those three reports, with the tracking down and comparison, already took a while.
Thanks for filing them, will address asap.
This PR integrates DuckDB into the benchmark by way of its R package. Packages are installed from CRAN; queries are run using the same SQL queries as the Spark solution.
CC @jangorecki
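A minimal sketch of that integration path (column names follow db-benchmark's groupby data; the exact script in this PR may differ): a data.frame is exposed to DuckDB and queried with Spark-style SQL.

```r
library(duckdb)
con <- dbConnect(duckdb::duckdb())
x <- data.frame(id1 = c("id001", "id002", "id001"), v1 = c(1L, 2L, 3L))
duckdb_register(con, "x", x)  # register the data.frame as a virtual table
dbGetQuery(con, "SELECT id1, sum(v1) AS v1 FROM x GROUP BY id1")
dbDisconnect(con, shutdown = TRUE)
```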