Memory usage of read_csv_chunked() in conjunction with a gzip compressed file #1200

Closed
nbenn opened this issue Apr 27, 2021 · 4 comments

nbenn commented Apr 27, 2021

Recently I have been running into Error: vector memory exhausted (limit reached?) errors when reading large gzip compressed .csv files using the chunked API. IIRC, earlier versions of readr would explicitly create a temporary file, containing the full uncompressed data, which then was fed into read_csv_chunked().

Looking at reported memory usage, this no longer seems to be the case. If this change was intentional, I apologize for having missed that, but I could not find any announcement hinting at this (neither in NEWS nor in the docs). Also, I feel this takes away some of the convenience of the chunked API. Of course, this can easily be resolved outside of readr by decompressing files manually beforehand (for example, using R.utils::gunzip()).
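
The manual workaround mentioned above might look roughly like the following. This is only a sketch under assumptions: the small demo file, its columns (`id`, `value`), the chunk size, and the per-chunk summary are all made up for illustration, and it assumes the readr and R.utils packages are installed.

```r
library(readr)

# Create a small gzip-compressed CSV so the sketch is runnable end to end
# (hypothetical data, not the file from the report).
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
write.csv(data.frame(id = 1:10, value = (1:10) * 2), con, row.names = FALSE)
close(con)

# Decompress to a temporary plain-text file first, keeping the original .gz.
csv_path <- tempfile(fileext = ".csv")
R.utils::gunzip(gz_path, destname = csv_path, remove = FALSE)

# Read the uncompressed file in chunks; the callback sees one chunk at a time
# and DataFrameCallback row-binds the per-chunk results.
chunk_sums <- read_csv_chunked(
  csv_path,
  callback = DataFrameCallback$new(function(chunk, pos) {
    data.frame(start_row = pos, total = sum(chunk$value))
  }),
  chunk_size = 4,
  col_types = cols(id = col_integer(), value = col_double())
)

# Clean up the temporary files.
unlink(c(gz_path, csv_path))
```

The trade-off is extra disk space for the uncompressed copy in exchange for not holding the whole decompressed file in memory.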

As it's not straightforward to create an example for this, I'll just add my session info (but I'm happy to provide further information if requested):

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.4 (2021-02-15)
 os       macOS Big Sur 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Zurich
 date     2021-04-27

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat  * 0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
 callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.2)
 cli           2.5.0   2021-04-26 [1] CRAN (R 4.0.4)
 clisymbols    1.2.0   2017-05-21 [1] CRAN (R 4.0.0)
 colorout    * 1.2-2   2020-05-04 [1] Github (jalvesaq/colorout@726d681)
 crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.3)
 desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
 devtools    * 2.3.2   2020-09-18 [1] CRAN (R 4.0.2)
 digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
 ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
 fansi         0.4.2   2021-01-15 [1] CRAN (R 4.0.2)
 fs            1.4.1   2020-04-04 [1] CRAN (R 4.0.0)
 glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
 hms           1.0.0   2021-01-13 [1] CRAN (R 4.0.2)
 lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.3)
 magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.0)
 memuse        4.1-0   2020-02-17 [1] CRAN (R 4.0.0)
 pillar        1.6.0   2021-04-13 [1] CRAN (R 4.0.2)
 pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.2)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
 pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.2)
 prettycode  * 1.1.0   2019-12-16 [1] CRAN (R 4.0.2)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
 processx      3.4.5   2020-11-30 [1] CRAN (R 4.0.2)
 prompt        1.0.0   2020-05-04 [1] Github (gaborcsardi/prompt@b332c42)
 ps            1.4.0   2020-10-07 [1] CRAN (R 4.0.2)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
 R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
 readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.2)
 remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
 rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.2)
 rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.2)
 rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.2)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
 testthat      3.0.2   2021-02-14 [1] CRAN (R 4.0.2)
 tibble        3.1.1   2021-04-18 [1] CRAN (R 4.0.4)
 usethis     * 2.0.1   2021-02-10 [1] CRAN (R 4.0.2)
 utf8          1.2.1   2021-03-12 [1] CRAN (R 4.0.2)
 vctrs         0.3.7   2021-03-29 [1] CRAN (R 4.0.2)
 withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.2)

[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

jimhester (Collaborator) commented Apr 27, 2021

This is likely a duplicate of #1161, it is fixed in the devel version of readr but not yet on CRAN.


hsbadr commented Apr 29, 2021

> This is likely a duplicate of #1161, it is fixed in the devel version of readr but not yet on CRAN.

@jimhester On the development version, the vroom backend (2nd edition) reads gzip compressed files incorrectly, which also increases the object size and memory footprint. This doesn't happen when using the 1st edition (the data is read correctly with no problems). The problem is easy to catch by setting the correct column types. Here's an example:

library(readr)

# Explicit column types make incorrectly read values show up as parsing problems.
COVID19.BR_Municipality <- read_delim(
  "https://github.com/wcota/covid19br/raw/master/cases-brazil-cities-time.csv.gz",
  delim = ",",
  col_types = cols(
    epi_week = col_integer(),
    date = col_date(format = "%Y-%m-%d"),
    country = col_character(),
    state = col_character(),
    city = col_character(),
    ibgeID = col_character(),
    cod_RegiaoDeSaude = col_character(),
    name_RegiaoDeSaude = col_character(),
    newDeaths = col_integer(),
    deaths = col_integer(),
    newCases = col_integer(),
    totalCases = col_integer(),
    deaths_per_100k_inhabitants = col_double(),
    totalCases_per_100k_inhabitants = col_double(),
    deaths_by_totalCases = col_double(),
    `_source` = col_character(),
    last_info_date = col_date(format = "%Y-%m-%d")
  ),
  lazy = FALSE
)
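
One way to check whether the two editions disagree on a given gzip file is to read it under both and compare the results. The sketch below is an assumption-laden illustration, not the reporter's procedure: it uses a tiny made-up file with two of the columns above, and it assumes a readr version (2.0 or later) that provides `with_edition()` and the `lazy` argument.

```r
library(readr)

# Small gzip-compressed CSV with hypothetical contents, standing in for the
# real data set.
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
write.csv(data.frame(epi_week = 1:5, city = letters[1:5]), con, row.names = FALSE)
close(con)

types <- cols(epi_week = col_integer(), city = col_character())

# Read the same file with the 1st edition parser and the vroom-based 2nd edition.
first  <- with_edition(1, read_csv(gz_path, col_types = types))
second <- with_edition(2, read_csv(gz_path, col_types = types, lazy = FALSE))

# Compare parsing problems, column contents, and object sizes.
n_problems    <- c(first = nrow(problems(first)), second = nrow(problems(second)))
same_epi_week <- identical(first$epi_week, second$epi_week)
same_city     <- identical(first$city, second$city)
sizes         <- c(first = object.size(first), second = object.size(second))

unlink(gz_path)
```

If the columns differ or the 2nd edition reports parsing problems that the 1st does not, that points at the backend rather than the file.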

jimhester (Collaborator) commented

@hsbadr, that issue is tracked by tidyverse/vroom#331 and should be fixed.


hsbadr commented Apr 29, 2021

> that issue is tracked by r-lib/vroom#331 and should be fixed.

Thanks @jimhester! I confirm that tidyverse/vroom@5fc54e6 fixed the problem. I'll let you know if I run into a related issue.
