forked from tidyverse/tidyverse.org
Edits #1
Merged
Changes from 16 commits
26 commits
8b6d0bd Space at EOL (krlmlr)
e25fb07 Sentence (krlmlr)
a0b9b39 FIXME (krlmlr)
b90e8af Shorten (krlmlr)
5ac6f6e Verbose link (krlmlr)
14ec2f6 Not dying on this particular hill here (krlmlr)
b9a277a Tweak query, let's see (krlmlr)
5762c0a Prune (krlmlr)
6aaf953 This works (krlmlr)
f847736 Tweak narrative (krlmlr)
c78073f Choose pivoting as an important op not yet supported (krlmlr)
1f898c0 Link style (krlmlr)
5a1f22c aeolus (krlmlr)
2b4b421 Help (krlmlr)
d97b031 Exclude maintainers (krlmlr)
4be5ea9 Thanks (krlmlr)
f344b9f Link (krlmlr)
a13315a Restore narrative (krlmlr)
ad9825f Add vignette link (krlmlr)
a734638 FIXME (krlmlr)
fc8122d Date (krlmlr)
3211710 Why bother (krlmlr)
eea955a Level (krlmlr)
4a20ca3 Move (krlmlr)
f5e4a38 Detail (krlmlr)
20dff03 TBC (krlmlr)
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,16 +6,16 @@ title: duckplyr fully joins the tidyverse! | |
date: 2025-02-11 | ||
author: Kirill Müller and Maëlle Salmon | ||
description: > | ||
duckplyr 1.0.0 is on CRAN and part of the tidyverse! duckplyr is a drop-in | ||
replacement for dplyr, powered by DuckDB for speed. It is the most dplyr-like | ||
of dplyr backends. | ||
duckplyr 1.0.0 is on CRAN and part of the tidyverse! | ||
A drop-in replacement for dplyr, powered by DuckDB for speed. | ||
It is the most dplyr-like of dplyr backends. | ||
|
||
photo: | ||
url: https://www.pexels.com/photo/a-mallard-duck-on-water-6918877/ | ||
author: Kiril Gruev | ||
|
||
# one of: "deep-dive", "learn", "package", "programming", "roundup", or "other" | ||
categories: [package] | ||
categories: [package] | ||
tags: | ||
- duckplyr | ||
- dplyr | ||
|
@@ -35,105 +35,112 @@ TODO: | |
* [x] `usethis::use_tidy_thanks()` | ||
--> | ||
|
||
We're very chuffed to announce the release of [duckplyr](https://duckplyr.tidyverse.org) 1.0.0. | ||
We're very chuffed to announce the release of [duckplyr](https://duckplyr.tidyverse.org) 1.0.0. | ||
duckplyr is a drop-in, fully compatible replacement for dplyr, powered by [DuckDB](https://duckdb.org/) for speed. | ||
It joins the ranks of dplyr backends together with [dtplyr](https://dtplyr.tidyverse.org) and [dbplyr](https://dbplyr.tidyverse.org). | ||
You can use it instead of dplyr for data small or large. | ||
|
||
<!-- FIXME: | ||
|
||
We have many more dplyr backends, the two above are just from the tidyverse. | ||
GitHub search: https://github.com/search?q=org%3Acran+%2FS3method%5B%28%5D%28mutate%7Csummarise%29+*%2C%2F&type=code | ||
Do we need an "awesome dplyr" like https://github.com/krlmlr/awesome-vctrs/? | ||
|
||
--> | ||
|
||
You can install it from CRAN with: | ||
|
||
```{r, eval = FALSE} | ||
install.packages("duckplyr") | ||
``` | ||
|
||
In this article, we'll introduce you to the basic concepts behind duckplyr, show how it can help you handle normal sized but also large data, and explain how you can help improve the package. | ||
In this article, we'll introduce you to the basic concepts behind duckplyr, show how it can help you handle data of different sizes, and explain how you can help improve the package. | ||
|
||
## A drop-in replacement for dplyr | ||
|
||
The duckplyr package is a _drop-in replacement for dplyr_ that uses _DuckDB for speed_. | ||
You can simply _drop_ duckplyr into your pipeline by loading it, then computations will be efficiently carried out by DuckDB. | ||
DuckDB is a [fast database system](https://www.youtube.com/watch?v=GELhdezYmP0&feature=youtu.be). | ||
DuckDB is a fast in-memory analytical database system. | ||
If you haven't heard about it, watch [Hannes Mühleisen's keynote at posit::conf(2024)](https://www.youtube.com/watch?v=GELhdezYmP0&feature=youtu.be). | ||
|
||
```{r} | ||
library(conflicted) | ||
library(duckplyr) | ||
conflict_prefer("filter", "dplyr", quiet = TRUE) | ||
library("babynames") | ||
|
||
library(babynames) | ||
|
||
out <- babynames |> | ||
filter(n > 1000) |> | ||
mutate(prevalence = if_else(prop >= 0.01, "frequent", "rare")) |> | ||
summarize( | ||
.by = c(sex, year), | ||
.by = c(sex, year, prevalence), | ||
babies_n = sum(n) | ||
) |> | ||
filter(sex == "F") | ||
class(out) | ||
|
||
out | ||
``` | ||
|
||
Like with other dplyr backends like dtplyr and dbplyr, duckplyr allows you to get faster results. | ||
Unlike other dplyr backends, duckplyr does not require you to learn a different syntax. | ||
|
||
The duckplyr package is fully compatible with dplyr: if an operation cannot be carried out with DuckDB, it is automatically outsourced to dplyr. | ||
In that case, the operation is not slower than dplyr but not faster either. | ||
The duckplyr package is actively developed so that over time, we expect fewer and fewer fallbacks to dplyr to be needed. | ||
Over time, we expect fewer and fewer fallbacks to dplyr to be needed. | ||
|
||
## How to use duckplyr | ||
|
||
To _replace_ dplyr with duckplyr, you can either | ||
To _replace_ dplyr with duckplyr, you can: | ||
|
||
- load duckplyr and then keep your pipeline as is. Calling `library(duckplyr)` overwrites dplyr methods, enabling duckplyr for the entire session no matter how data.frames are created. | ||
- Load duckplyr and then keep your pipeline as is. Calling `library(duckplyr)` overwrites dplyr methods, enabling duckplyr for the entire session no matter how data.frames are created. | ||
This is shown in the example above. | ||
|
||
```{r} | ||
library(conflicted) | ||
library(duckplyr) | ||
conflict_prefer("filter", "dplyr", quiet = TRUE) | ||
``` | ||
- Create individual "duck frames" using _conversion functions_ like `duckdb_tibble()` or `as_duckdb_tibble()`, or _ingestion functions_ like `read_csv_duckdb()`. | ||
Then, the data manipulation pipeline uses the exact same syntax as a dplyr pipeline. | ||
The duckplyr package performs the computation using DuckDB. | ||
|
||
```{r} | ||
# Undo the effect of library(duckplyr) | ||
methods_restore() | ||
|
||
- Create individual "duck frames" which allows you to control their automatic materialization parameters to [protect memory](https://duckplyr.tidyverse.org/articles/prudence.html). To do so, you can use _conversion functions_ like `duckdb_tibble()` or `as_duckdb_tibble()`, or _ingestion functions_ like `read_csv_duckdb()`. | ||
out <- babynames |> | ||
as_duckdb_tibble() |> | ||
mutate(prevalence = if_else(prop >= 0.01, "frequent", "rare")) |> | ||
summarize( | ||
.by = c(sex, year, prevalence), | ||
babies_n = sum(n) | ||
) |> | ||
filter(sex == "F") | ||
class(out) | ||
``` | ||
|
||
Then, the data manipulation pipeline uses the exact same syntax as a dplyr pipeline. | ||
The duckplyr package performs the computation using DuckDB. | ||
In both cases, printing the result only shows the first few rows, as with dbplyr. | ||
|
||
```{r} | ||
library("babynames") | ||
out <- babynames |> | ||
filter(n > 1000) |> | ||
summarize( | ||
.by = c(sex, year), | ||
babies_n = sum(n) | ||
) |> | ||
filter(sex == "F") | ||
out | ||
``` | ||
|
||
The result can finally be materialized to memory, or computed temporarily, or computed to a file. | ||
|
||
```{r} | ||
# to memory | ||
out | ||
collect(out) | ||
|
||
# to a file | ||
csv_file <- withr::local_tempfile() | ||
file.size(csv_file) | ||
compute_csv(out, csv_file) | ||
file.size(csv_file) | ||
fs::file_size(csv_file) | ||
``` | ||
|
||
When duckplyr itself does not support specific functionality, it falls back to dplyr. | ||
For instance, row names are not supported yet: | ||
For instance, pivoting is not supported yet, but it still works thanks to the fallback mechanism. | ||
|
||
```{r} | ||
mtcars |> | ||
summarize( | ||
.by = cyl, | ||
disp = mean(disp, na.rm = TRUE), | ||
sd = sd(disp, na.rm = TRUE) | ||
) | ||
out |> | ||
tidyr::pivot_wider(names_from = prevalence, values_from = babies_n, values_fill = 0L) |> | ||
mutate(share_frequent = frequent / (frequent + rare)) | ||
``` | ||
|
||
Current limitations are documented in a [vignette](https://duckplyr.tidyverse.org/articles/limits.html). | ||
You can change the verbosity of fallbacks, refer to [`duckplyr::fallback_sitrep()`](https://duckplyr.tidyverse.org/reference/fallback.html). | ||
For performance reasons, the output order of the result is not guaranteed to be stable. | ||
If you need a stable order, you can use `arrange()`. | ||
Other limitations are documented in [`vignette("limits")`](https://duckplyr.tidyverse.org/articles/limits.html). | ||
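
As a minimal sketch of the ordering advice above, the pipeline below sorts explicitly with `arrange()` before collecting. The toy `scores` data and its column names are made up for illustration, and the sketch assumes that both dplyr and duckplyr are installed.

```r
library(dplyr)

# Hypothetical toy data, created directly as a duck frame
scores <- duckplyr::duckdb_tibble(
  id = c(3L, 1L, 2L),
  score = c(0.5, 0.9, 0.1)
)

# Without arrange(), the row order of the result is not guaranteed.
# Sorting explicitly makes the order part of the query.
sorted <- scores |>
  filter(score > 0.2) |>
  arrange(id) |>
  collect()

sorted
```

Placing `arrange()` last, just before materialization, keeps the sort from being undone by a later step.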
|
||
### For large data | ||
|
||
|
@@ -144,11 +151,12 @@ With large datasets, you want: | |
- input data in an efficient format, like Parquet files, which duckplyr allows thanks to its ingestion functions like `read_parquet_duckdb()`. | ||
- efficient computation, which duckplyr provides via DuckDB's holistic optimization, without your having to use another syntax than dplyr. | ||
- the output to not clutter all the memory, which duckplyr supports through two features: | ||
- the [control of automatic materialization](https://duckplyr.tidyverse.org/articles/prudence.html) (collection of results into memory) thanks to the `prudence` parameter. You can disable automatic materialization completely or, as a compromise, disable it up to a certain output size. | ||
- [computation to files](https://duckplyr.tidyverse.org/reference/compute_file.html) using `compute_parquet()` or `compute_csv()`. | ||
- computation to files using [`compute_parquet()`](https://duckplyr.tidyverse.org/reference/compute_file.html) or [`compute_csv()`](https://duckplyr.tidyverse.org/reference/compute_file.html). | ||
- the control of automatic materialization (collection of results into memory). You can disable automatic materialization completely or, as a compromise, disable it up to a certain output size. See [`vignette("prudence")`](https://duckplyr.tidyverse.org/articles/prudence.html) for details. | ||
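
The control of automatic materialization could look roughly like the sketch below. It assumes the `.prudence` argument of `duckdb_tibble()` and the `"stingy"` setting described in the linked prudence vignette; the `delays` data is a made-up stand-in for a large input, and in practice an ingestion function such as `read_parquet_duckdb()` would take its place.

```r
library(dplyr)

# Hypothetical toy data; "stingy" is assumed to disable automatic
# materialization entirely, as described in vignette("prudence")
delays <- duckplyr::duckdb_tibble(
  carrier = c("AA", "AA", "UA"),
  delay = c(10, 20, 5),
  .prudence = "stingy"
)

# Building the pipeline does not pull the data into R memory
by_carrier <- delays |>
  summarize(.by = carrier, total_delay = sum(delay))

# Materialization has to be requested explicitly
result <- collect(by_carrier)
result
```

With this setting, code that needs the full data, such as `nrow(by_carrier)`, is expected to fail rather than silently load everything into memory, which is the protection the list item above refers to.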
|
||
A drawback of analyzing large data with duckplyr is that the limitations of duckplyr won't be compensated by fallbacks, since fallbacks to dplyr necessitate putting data into memory. | ||
Therefore, if your pipeline encounters fallbacks, you might want to work around them by converting the duck frame into a table through `compute()` then running SQL code through the experimental `read_sql_duckdb()` function. Again, over time, we expect more native support for dplyr functionality. | ||
Therefore, if your pipeline encounters fallbacks, you might want to work around them by converting the duck frame into a table through `compute()` then running SQL code through the experimental `read_sql_duckdb()` function. | ||
Again, over time, we expect more native support for dplyr functionality. | ||
|
||
```{r} | ||
data <- | ||
|
@@ -169,18 +177,20 @@ sql_data | |
|
||
Our goals for future development of duckplyr include: | ||
|
||
- Increasing the native support for dplyr functionality; | ||
- Enabling users to provide [custom translations](https://github.com/tidyverse/duckplyr/issues/158) of dplyr functionality; | ||
- Making it easier to contribute code to duckplyr. | ||
- Making it easier to contribute code to duckplyr; | ||
- Supporting more dplyr and tidyr functionality natively in DuckDB. | ||
|
||
You can help! | ||
You can help! | ||
|
||
- Please report any issue especially regarding unknown incompatibilities. See [`vignette("limits")`](https://duckplyr.tidyverse.org/articles/limits.html). | ||
- Please report any issues, especially regarding unknown incompatibilities. See [`vignette("limits")`](https://duckplyr.tidyverse.org/articles/limits.html). | ||
- Contribute to the codebase after reading duckplyr's [contributing guide](https://duckplyr.tidyverse.org/CONTRIBUTING.html). | ||
- Turn on telemetry to help us hear about the most frequent fallbacks so we can prioritize working on the corresponding missing dplyr translation. See [`vignette("telemetry")`](https://duckplyr.tidyverse.org/articles/telemetry.html) and the [`duckplyr::fallback_sitrep()`](https://duckplyr.tidyverse.org/reference/fallback.html) function. | ||
|
||
## Acknowledgements | ||
|
||
A big thanks to all 54 folks who filed issues, created PRs and generally helped to improve duckplyr! | ||
A big thanks to all folks who filed issues, created PRs and generally helped to improve duckplyr! | ||
|
||
[@adamschwing](https://github.com/adamschwing), [@andreranza](https://github.com/andreranza), [@apalacio9502](https://github.com/apalacio9502), [@apsteinmetz](https://github.com/apsteinmetz), [@barracuda156](https://github.com/barracuda156), [@beniaminogreen](https://github.com/beniaminogreen), [@bob-rietveld](https://github.com/bob-rietveld), [@brichards920](https://github.com/brichards920), [@cboettig](https://github.com/cboettig), [@davidjayjackson](https://github.com/davidjayjackson), [@DavisVaughan](https://github.com/DavisVaughan), [@Ed2uiz](https://github.com/Ed2uiz), [@eitsupi](https://github.com/eitsupi), [@era127](https://github.com/era127), [@etiennebacher](https://github.com/etiennebacher), [@eutwt](https://github.com/eutwt), [@fmichonneau](https://github.com/fmichonneau), [@hadley](https://github.com/hadley), [@hannes](https://github.com/hannes), [@hawkfish](https://github.com/hawkfish), [@IndrajeetPatil](https://github.com/IndrajeetPatil), [@JanSulavik](https://github.com/JanSulavik), [@JavOrraca](https://github.com/JavOrraca), [@jeroen](https://github.com/jeroen), [@jhk0530](https://github.com/jhk0530), [@joakimlinde](https://github.com/joakimlinde), [@JosiahParry](https://github.com/JosiahParry), [@larry77](https://github.com/larry77), [@lnkuiper](https://github.com/lnkuiper), [@lorenzwalthert](https://github.com/lorenzwalthert), [@luisDVA](https://github.com/luisDVA), [@maelle](https://github.com/maelle), [@math-mcshane](https://github.com/math-mcshane), [@meersel](https://github.com/meersel), [@multimeric](https://github.com/multimeric), [@mytarmail](https://github.com/mytarmail), [@nicki-dese](https://github.com/nicki-dese), [@PMassicotte](https://github.com/PMassicotte), [@prasundutta87](https://github.com/prasundutta87), [@rafapereirabr](https://github.com/rafapereirabr), [@Robinlovelace](https://github.com/Robinlovelace), [@romainfrancois](https://github.com/romainfrancois), [@sparrow925](https://github.com/sparrow925), 
[@stefanlinner](https://github.com/stefanlinner), [@thomasp85](https://github.com/thomasp85), [@TimTaylor](https://github.com/TimTaylor), [@Tmonster](https://github.com/Tmonster), [@toppyy](https://github.com/toppyy), [@wibeasley](https://github.com/wibeasley), [@yjunechoe](https://github.com/yjunechoe), [@ywhcuhk](https://github.com/ywhcuhk), and [@zhjx19](https://github.com/zhjx19). | ||
|
||
[@adamschwing](https://github.com/adamschwing), [@andreranza](https://github.com/andreranza), [@apalacio9502](https://github.com/apalacio9502), [@apsteinmetz](https://github.com/apsteinmetz), [@barracuda156](https://github.com/barracuda156), [@beniaminogreen](https://github.com/beniaminogreen), [@bob-rietveld](https://github.com/bob-rietveld), [@brichards920](https://github.com/brichards920), [@cboettig](https://github.com/cboettig), [@davidjayjackson](https://github.com/davidjayjackson), [@DavisVaughan](https://github.com/DavisVaughan), [@Ed2uiz](https://github.com/Ed2uiz), [@eitsupi](https://github.com/eitsupi), [@era127](https://github.com/era127), [@etiennebacher](https://github.com/etiennebacher), [@eutwt](https://github.com/eutwt), [@fmichonneau](https://github.com/fmichonneau), [@github-actions[bot]](https://github.com/github-actions[bot]), [@hadley](https://github.com/hadley), [@hannes](https://github.com/hannes), [@hawkfish](https://github.com/hawkfish), [@IndrajeetPatil](https://github.com/IndrajeetPatil), [@JanSulavik](https://github.com/JanSulavik), [@JavOrraca](https://github.com/JavOrraca), [@jeroen](https://github.com/jeroen), [@jhk0530](https://github.com/jhk0530), [@joakimlinde](https://github.com/joakimlinde), [@JosiahParry](https://github.com/JosiahParry), [@krlmlr](https://github.com/krlmlr), [@larry77](https://github.com/larry77), [@lnkuiper](https://github.com/lnkuiper), [@lorenzwalthert](https://github.com/lorenzwalthert), [@luisDVA](https://github.com/luisDVA), [@maelle](https://github.com/maelle), [@math-mcshane](https://github.com/math-mcshane), [@meersel](https://github.com/meersel), [@multimeric](https://github.com/multimeric), [@mytarmail](https://github.com/mytarmail), [@nicki-dese](https://github.com/nicki-dese), [@PMassicotte](https://github.com/PMassicotte), [@prasundutta87](https://github.com/prasundutta87), [@rafapereirabr](https://github.com/rafapereirabr), [@Robinlovelace](https://github.com/Robinlovelace), 
[@romainfrancois](https://github.com/romainfrancois), [@sparrow925](https://github.com/sparrow925), [@stefanlinner](https://github.com/stefanlinner), [@thomasp85](https://github.com/thomasp85), [@TimTaylor](https://github.com/TimTaylor), [@Tmonster](https://github.com/Tmonster), [@toppyy](https://github.com/toppyy), [@wibeasley](https://github.com/wibeasley), [@yjunechoe](https://github.com/yjunechoe), [@ywhcuhk](https://github.com/ywhcuhk), and [@zhjx19](https://github.com/zhjx19). | ||
Special thanks to Joe Thorley ([@joethorley](https://github.com/joethorley)) for help with choosing the right words. | ||
the right names, not words? |
I find it potentially confusing since pivoting is tidyr not dplyr.
And `nrow()` is base, not dplyr 🙃
Row names too.
How to avoid the confusion then? Could we use this to highlight how seamless all of this is?
row names are a characteristic of the data. You cannot use duckplyr with data that have row names or factors.
Now tidyr would be something like the fallbacks needed for select() etc.