Memory usage of `read_csv_chunked()` in conjunction with a gzip compressed file #1200

nbenn · 2021-04-27T17:07:00Z

Recently I have been running into `Error: vector memory exhausted (limit reached?)` when reading large gzip-compressed .csv files using the chunked API. IIRC, earlier versions of readr would explicitly create a temporary file containing the full uncompressed data, which was then fed into read_csv_chunked().

Looking at reported memory usage, this no longer seems to be the case. If this change was intentional, I apologize for having missed it, but I could not find any announcement hinting at it (in either NEWS or the docs). I also feel this takes away some of the convenience of the chunked API. Of course, this can easily be worked around outside of readr by decompressing files manually beforehand (for example with R.utils::gunzip()).
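For reference, the manual workaround could look roughly like the sketch below. It is only an illustration under assumptions: the small gzip file, the column names and the `chunk_summary()` helper are made up for the example, and the R.utils package has to be installed.

```r
library(readr)

# Build a small gzip-compressed CSV so the sketch is self-contained
# (stand-in for the real, much larger file).
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
write.csv(data.frame(id = 1:1000, value = rep(c(1, 2), 500)), con, row.names = FALSE)
close(con)

# Decompress to a temporary plain-text file, keeping the original archive.
csv_path <- tempfile(fileext = ".csv")
R.utils::gunzip(gz_path, destname = csv_path, remove = FALSE)

# Hypothetical per-chunk callback: only a small summary of each chunk is kept,
# so the full data never has to sit in memory at once.
chunk_summary <- function(chunk, pos) {
  data.frame(start = pos, rows = nrow(chunk), total = sum(chunk$value))
}

result <- read_csv_chunked(
  csv_path,
  callback = DataFrameCallback$new(chunk_summary),
  chunk_size = 250,
  col_types = cols(id = col_integer(), value = col_double())
)

# Remove the uncompressed copy once processing is done.
unlink(csv_path)
```

The trade-off is disk space: the uncompressed copy exists for the duration of the read, which is what earlier readr versions appeared to do implicitly.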

As it's not straightforward to create an example for this, I'll just add my session info (but I'm happy to provide further information if requested):

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.4 (2021-02-15)
 os       macOS Big Sur 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Zurich
 date     2021-04-27

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat  * 0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
 callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.2)
 cli           2.5.0   2021-04-26 [1] CRAN (R 4.0.4)
 clisymbols    1.2.0   2017-05-21 [1] CRAN (R 4.0.0)
 colorout    * 1.2-2   2020-05-04 [1] Github (jalvesaq/colorout@726d681)
 crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.3)
 desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
 devtools    * 2.3.2   2020-09-18 [1] CRAN (R 4.0.2)
 digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
 ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
 fansi         0.4.2   2021-01-15 [1] CRAN (R 4.0.2)
 fs            1.4.1   2020-04-04 [1] CRAN (R 4.0.0)
 glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
 hms           1.0.0   2021-01-13 [1] CRAN (R 4.0.2)
 lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.3)
 magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.0)
 memuse        4.1-0   2020-02-17 [1] CRAN (R 4.0.0)
 pillar        1.6.0   2021-04-13 [1] CRAN (R 4.0.2)
 pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.2)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
 pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.2)
 prettycode  * 1.1.0   2019-12-16 [1] CRAN (R 4.0.2)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
 processx      3.4.5   2020-11-30 [1] CRAN (R 4.0.2)
 prompt        1.0.0   2020-05-04 [1] Github (gaborcsardi/prompt@b332c42)
 ps            1.4.0   2020-10-07 [1] CRAN (R 4.0.2)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
 R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
 readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.2)
 remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
 rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.2)
 rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.2)
 rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.2)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
 testthat      3.0.2   2021-02-14 [1] CRAN (R 4.0.2)
 tibble        3.1.1   2021-04-18 [1] CRAN (R 4.0.4)
 usethis     * 2.0.1   2021-02-10 [1] CRAN (R 4.0.2)
 utf8          1.2.1   2021-03-12 [1] CRAN (R 4.0.2)
 vctrs         0.3.7   2021-03-29 [1] CRAN (R 4.0.2)
 withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.2)

[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

jimhester · 2021-04-27T19:08:47Z

This is likely a duplicate of #1161, it is fixed in the devel version of readr but not yet on CRAN.

hsbadr · 2021-04-29T12:32:56Z

> This is likely a duplicate of #1161, it is fixed in the devel version of readr but not yet on CRAN.

@jimhester On the development version, the vroom backend (2nd edition) reads gzip-compressed files incorrectly, which also increases the object size and memory footprint. This does not happen with the 1st edition, where the data is read correctly with no problems, and the issue is easy to expose by setting the correct column types. Here's an example:

library(readr)

COVID19.BR_Municipality <- read_delim(
  "https://github.com/wcota/covid19br/raw/master/cases-brazil-cities-time.csv.gz",
  delim = ",",
  col_types = cols(
    epi_week = col_integer(),
    date = col_date(format = "%Y-%m-%d"),
    country = col_character(),
    state = col_character(),
    city = col_character(),
    ibgeID = col_character(),
    cod_RegiaoDeSaude = col_character(),
    name_RegiaoDeSaude = col_character(),
    newDeaths = col_integer(),
    deaths = col_integer(),
    newCases = col_integer(),
    totalCases = col_integer(),
    deaths_per_100k_inhabitants = col_double(),
    totalCases_per_100k_inhabitants = col_double(),
    deaths_by_totalCases = col_double(),
    `_source` = col_character(),
    last_info_date = col_date(format = "%Y-%m-%d")
  ),
  lazy = FALSE
)
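To compare the two editions side by side, something along these lines might help. This is a rough sketch rather than a reproduction of the bug: it uses a tiny made-up gzip file instead of the dataset above, and it assumes readr 2.0.0 or later, where `with_edition()` is available.

```r
library(readr)

# Small made-up gzip-compressed CSV (not the dataset from the example above).
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
write.csv(
  data.frame(date = as.character(as.Date("2021-01-01") + 0:99), cases = 1:100),
  con,
  row.names = FALSE
)
close(con)

types <- cols(date = col_date(format = "%Y-%m-%d"), cases = col_integer())

# 1st edition: the original readr parser.
first <- with_edition(1, read_delim(gz_path, delim = ",", col_types = types))

# 2nd edition: the vroom backend, read eagerly as in the example above.
second <- with_edition(
  2,
  read_delim(gz_path, delim = ",", col_types = types, lazy = FALSE)
)

# Compare parsed values, reported parsing problems and in-memory size.
same_values <- all(first$date == second$date) &&
  all(first$cases == second$cases)
n_problems <- c(first = nrow(problems(first)), second = nrow(problems(second)))
sizes <- c(
  first = as.numeric(object.size(first)),
  second = as.numeric(object.size(second))
)
```

With explicit column types, a misread file should show up as rows in `problems()` or as differing values between the two results.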

jimhester · 2021-04-29T18:25:21Z

@hsbadr, that issue is tracked by tidyverse/vroom#331 and should be fixed.

hsbadr · 2021-04-29T22:47:14Z

> that issue is tracked by r-lib/vroom#331 and should be fixed.

Thanks @jimhester! I confirm that tidyverse/vroom@5fc54e6 fixed the problem. I'll let you know if I run into a related issue.

jimhester closed this as completed Apr 29, 2021
