Integrate DuckDB #200
Conversation
Thank you for the PR, it looks great. It will help a lot to have duckdb in db-benchmark. I will make some adjustments; the biggest one I see now is to remove the helper function so everything can be executed line by line. If I have any questions I will ask them here. We need to wait a couple of days because the benchmark machine is already occupied by another project.
I made a new branch so it is easier to push to than a remote "master".
Printing the dimensions of an answer has to be included in the timing. This forces the lazy computation that other solutions might be using to actually materialize.
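For concreteness, a minimal sketch of such a timed query (table and column names are illustrative, not the benchmark's actual script), where printing the dimensions is part of the measured block:

```r
library(DBI)
library(duckdb)
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "x", data.frame(id1 = c("a", "b", "a"), v1 = 1:3))
t <- system.time({
  ans <- dbGetQuery(con, "SELECT id1, sum(v1) AS v1 FROM x GROUP BY id1")
  print(dim(ans))   # printing dims is counted inside the timing
})[["elapsed"]]
dbDisconnect(con, shutdown = TRUE)
```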
@hannesmuehleisen How are operations like SUM handled in case of integer overflow? If the type used for int is int32 (which is reasonable and should be preferred, to avoid an extra penalty compared to other tools), then overflow will happen when calculating the checksum on bigger datasets (1e9); we normally use integer64 in those cases.
DuckDB decides what type the SUM result has based on column statistics. BIGINT/int64 is currently returned as double.
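A small illustration of the overflow question, assuming the duckdb R package (the exact result type is version-dependent): summing an INT32 column past its maximum does not wrap around, and the wider result reaches R as a double.

```r
library(duckdb)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "CREATE TABLE t(v INTEGER)")
dbExecute(con, "INSERT INTO t VALUES (2147483647), (1)")  # int32 max + 1
res <- dbGetQuery(con, "SELECT sum(v) AS s FROM t")
res$s         # 2147483648, no int32 wraparound
class(res$s)  # "numeric" (double) rather than integer
dbDisconnect(con, shutdown = TRUE)
```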
@hannesmuehleisen Good to hear. If you have a release channel for your development version that is considered stable-ish (well tested, so broken builds should not happen), we can plug that into db-benchmark instead of using CRAN releases. This was the initial idea for this project, but most solutions do not provide stable-ish devel releases. We currently support this for data.table and Python datatable.
@hannesmuehleisen Benchmark finished. groupby 1e9 could not be completed due to an out-of-memory error. The CSV size is slightly under 50GB, so a machine with 128GB should generally be able to compute that in-memory. For now I will change duckdb to run groupby 1e9 using on-disk storage; after the next benchmark run this should be reflected on the report. If you reduce memory usage in the future, please let me know and I will revert this change 58ab74f. join 1e9 had already been attempted using on-disk storage, but all 3 data cases of that size were killed by the OS OOM killer.
In all 3 cases it happened during the first join query, so the data is being "loaded" properly. If there is any debug setting I can switch on to dump more logs somewhere, I can do it; just tell me how.
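For reference, the in-memory vs. on-disk switch in the duckdb R package looks roughly like this (the file name here is illustrative):

```r
library(duckdb)
con_mem  <- dbConnect(duckdb::duckdb())                               # in-memory
con_disk <- dbConnect(duckdb::duckdb(), dbdir = "groupby_1e9.duckdb") # on-disk
```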
@hannesmuehleisen I updated the exceptions and duckdb is now on the report. If you have any preferences about colors, we can adjust them. It is currently set as follows: https://github.com/h2oai/db-benchmark/blob/duckdb/_benchplot/benchplot-dict.R#L45
Great, it's now on the list! Wrt color, perhaps something that is a bit more readable? I don't really have a preference.
Does duckdb store column statistics like mean? I see q4 is much faster than any other solution.
It does not. We have some statistics like min/max or whether there are NULLs, but those are only upper/lower bounds so we can plan better.
Any idea why q4 is so much faster?
Probably the perfect hashing we use when we know, based on stats, that there can only be a few groups.
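A toy R illustration of the idea (not DuckDB's actual code, which is C++): when min/max statistics bound the grouping key to a small range, each group maps to a fixed slot in a dense array, so no hash table probing is needed.

```r
perfect_hash_sum <- function(id, v, lo, hi) {
  acc <- numeric(hi - lo + 1L)   # one accumulator slot per possible key
  idx <- id - lo + 1L            # key maps directly to a slot: no hashing
  for (i in seq_along(idx)) acc[idx[i]] <- acc[idx[i]] + v[i]
  data.frame(id = lo:hi, sum = acc)
}
perfect_hash_sum(id = c(3L, 1L, 3L, 2L), v = c(10, 20, 30, 40), lo = 1L, hi = 3L)
# id 1 -> 20, id 2 -> 40, id 3 -> 40
```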
Looks great. One thing I don't get is why the 5 GB groupby q3 is so much slower than q5; it should be similar, no?
Grouping is done by a different column, so one should expect a difference.
duckdb timings are in some cases already very impressive, congratulations @hannesmuehleisen. After the 0.2.6 release (or if we decide to switch to devel) I will re-run duckdb; groupby 1e9 will then use on-disk storage, so duckdb 1e9 timings should be displayed on the report as well.
Be sure to check https://h2oai.github.io/db-benchmark/#explore-more-data-cases at the bottom of the report. There are more benchmark plots linked there, with different cardinalities of the id columns, missing values, and pre-sorted data. You can also obtain all timings data by adding …
We also have https://h2oai.github.io/db-benchmark/history.html (duckdb will appear there soon) for tracking performance regressions, but it is an internal-only report, not really meant to be published. For public consumption only the "latest" timings are presented on the main report, but for developers it can be very useful to look at the history plots when they suspect a performance regression. Thanks for joining this project!
@jangorecki 0.2.6 made it to CRAN today, curious about the benchmark perf changes!
@hannesmuehleisen It is now running. So far I noticed that the NA data case (G1_1e7_1e2_5_0) crashed.
Sorry, I made a mistake when looking at the logs. 1e7 q10 did finish. Then on the 1e8 NA data case I observed that the second run of q10 did not finish; the crash came after the first run of q10. Run 1 was not logged, so the error was again during checksum computation. Log outputs for completeness:
1e7 …
1e8 …
The 1e9 groupby, which goes on-disk, was killed by OOM during q2.
@hannesmuehleisen If you believe that the 0.2.6 release could improve handling of the 1e9 data size, which previously ran OOM, I can switch back to in-memory processing.
All joins fail early with the same internal error, after the second question, with the same …
@hannesmuehleisen I filed 3 issues in the duckdb repo; two of them are regressions in 0.2.6. Sorry I haven't narrowed those down to minimal reproducible reports, but those three reports, with the tracking down and comparison, already took a while.
Thanks for filing them, will address asap.
This PR integrates DuckDB into the benchmark by way of its R package. Packages are installed from CRAN; queries are run using the same SQL queries as the Spark solution.
CC @jangorecki
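A minimal sketch of that integration path (column names follow db-benchmark's groupby data; the exact script in this PR may differ): a data.frame is exposed to DuckDB and queried with Spark-style SQL.

```r
library(duckdb)
con <- dbConnect(duckdb::duckdb())
x <- data.frame(id1 = c("id001", "id002", "id001"), v1 = c(1L, 2L, 3L))
duckdb_register(con, "x", x)  # register the data.frame as a virtual table
dbGetQuery(con, "SELECT id1, sum(v1) AS v1 FROM x GROUP BY id1")
dbDisconnect(con, shutdown = TRUE)
```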