Single-digit hour fails in `col_datetime()` but not `parse_datetime()` #1276

k5cents · 2021-08-17T21:06:25Z

In this case, col_datetime() fails when using the format %m/%d/%Y %H:%M:%S %p.

This seems to stem from the single-digit day in the %d position. Using %e does not fix the problem.

Possibly related to tidyverse/vroom#170 and tidyverse/vroom#123, which were fixed and closed.

packageVersion("readr")
#> [1] '2.0.0'
library(readr)

tmp <- tempfile(fileext = ".csv")
dat <- tibble::tibble(
  x = c("A", "B"),
  y = c("4/9/2021 2:18:25 PM", "4/13/2021 10:15:57 PM")
)
write_csv(dat, file = tmp)
fmt <- "%m/%d/%Y %H:%M:%S %p"

read_csv(
  file = tmp,
  col_types = cols(
    x = col_character(),
    y = col_datetime(fmt)
  )
)
#> Warning: One or more parsing issues, see `problems()` for details
#> # A tibble: 2 × 2
#>   x     y                  
#>   <chr> <dttm>             
#> 1 A     NA                 
#> 2 B     2021-04-13 22:15:57

parse_datetime(dat$y, format = fmt)
#> [1] "2021-04-09 14:18:25 UTC" "2021-04-13 22:15:57 UTC"

type_convert(
  df = dat,
  col_types = cols(
    x = col_character(),
    y = col_datetime(fmt)
  )
)
#> # A tibble: 2 × 2
#>   x     y                  
#>   <chr> <dttm>             
#> 1 A     2021-04-09 14:18:25
#> 2 B     2021-04-13 22:15:57

^{Created on 2021-08-17 by the reprex package (v2.0.0)}


nickyrong · 2021-08-18T19:35:49Z

@kiernann I think the culprit is not the missing leading zero in the day or month but the missing leading zero in the hour.

Unfortunately, when saving a CSV in Excel, the default format is m/d/yyyy h:mm, with a single digit in month, day, and hour (at least by default in Canadian Excel). As the example below shows, col_datetime() can actually deal with a single-digit day and month, just not a single-digit hour.

library(tidyr)
library(dplyr)  # needed for mutate() in the workaround further down
library(readr)
packageVersion("readr")
[1] ‘2.0.1’
################# Scenario 1 ############################
tibble(Datetime = "3/2/2018 15:09") %>% write_csv("test_1.csv")
read_csv("test_1.csv",
         col_types = cols(
           Datetime = col_datetime(format = "%m/%d/%Y %H:%M")
         )
)
# A tibble: 1 x 1
  Datetime           
  <dttm>             
1 2018-03-02 15:09:00

################# Scenario 2 ############################
tibble(Datetime = "3/2/2018 5:09") %>% write_csv("test_2.csv")
read_csv("test_2.csv",
         col_types = cols(
           Datetime = col_datetime(format = "%m/%d/%Y %H:%M")
         )
)
# A tibble: 1 x 1
  Datetime
  <dttm>  
1 NA      
Warning message:
One or more parsing issues, see `problems()` for details 
> problems()
# A tibble: 1 x 5
    row   col expected                 actual        file
  <int> <int> <chr>                    <chr>         <chr>
1     2     1 date like %m/%d/%Y %H:%M 3/2/2018 5:09 ...test_2.csv

################# Scenario 3 ############################
tibble(Datetime = "3/2/2018 05:09") %>% write_csv("test_3.csv")
read_csv("test_3.csv",
         col_types = cols(
           Datetime = col_datetime(format = "%m/%d/%Y %H:%M")
         )
)

# A tibble: 1 x 1
  Datetime           
  <dttm>             
1 2018-03-02 05:09:00

A possible workaround is to read the datetime column as character first and then convert it with another function. But that kind of defeats the point of being able to parse datetimes directly in read_csv().

read_csv("test_2.csv",
         col_types = cols(
           Datetime = col_character()
         )
) %>% mutate(Datetime = as.POSIXct(Datetime, format = "%m/%d/%Y %H:%M"))

# A tibble: 1 x 1
  Datetime           
  <dttm>             
1 2018-03-02 05:09:00
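
Since parse_datetime() handled the same kind of string correctly in the original report, the conversion step can also stay within readr. This is a minimal sketch, assuming readr is installed; the temporary file is only a stand-in for test_2.csv.

```r
library(readr)

# Temporary stand-in for test_2.csv
path <- tempfile(fileext = ".csv")
write_csv(data.frame(Datetime = "3/2/2018 5:09"), path)

# Read the column as text, then convert it with parse_datetime()
raw <- read_csv(path, col_types = cols(Datetime = col_character()))
raw$Datetime <- parse_datetime(raw$Datetime, format = "%m/%d/%Y %H:%M")
raw
```

This keeps the same format string as col_datetime(), so the column spec can be switched back once the reading path handles single-digit hours.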

Update

After scanning the open issues, I think this issue is related to #1269. Jim suggested a possible solution there: adding \n to the end of the string so that it is recognized. But I am not sure how to inject a string into a CSV before it is read in by read_csv().

################# UPDATED Scenario 2 ############################
tibble(Datetime = "3/2/2018 5:09\n") %>% write_csv("test_4.csv")
read_csv("test_4.csv",
         col_types = cols(
           Datetime = col_datetime(format = "%m/%d/%Y %H:%M")
         )
)
# A tibble: 1 x 1
  Datetime           
  <dttm>             
1 2018-03-02 05:09:00
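
One way to change the text before read_csv() sees it is to read the file with readLines(), edit the lines, and pass the result back as literal data wrapped in I(). The sketch below assumes readr 2.0.0 or later and, rather than appending \n, pads single-digit hours with a leading zero, which is the form that parsed correctly in Scenario 3. pad_hour() is a hypothetical helper, and its regex assumes the hour always follows a space, as in these examples.

```r
library(readr)

# Hypothetical helper: turn " 5:09" into " 05:09"; two-digit hours are untouched
pad_hour <- function(x) gsub(" ([0-9]):", " 0\\1:", x)

path <- tempfile(fileext = ".csv")
writeLines(c("Datetime", "3/2/2018 5:09", "3/2/2018 15:09"), path)

# Edit the raw lines, then hand them to read_csv() as literal data
lines <- pad_hour(readLines(path))
fixed <- read_csv(
  I(paste0(paste(lines, collapse = "\n"), "\n")),
  col_types = cols(Datetime = col_datetime(format = "%m/%d/%Y %H:%M"))
)
fixed
```

Editing raw lines this way is only safe when no other column contains a space followed by a single digit and a colon; for wider files, reading the column as character and converting afterwards is less fragile.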

nickyrong · 2021-08-18T20:05:46Z

This bug seems to have been introduced by the recent major update to version 2.x. In version 1.4.0, a missing leading zero does not cause an error.

packageVersion("readr")
[1] ‘1.4.0’

tibble(Datetime = "3/2/2018 5:09") %>% write_csv("test_2.csv")
read_csv("test_2.csv",
         col_types = cols(
           Datetime = col_datetime(format = "%m/%d/%Y %H:%M")
         )
)
# A tibble: 1 x 1
  Datetime           
  <dttm>             
1 2018-03-02 05:09:00
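
readr 2.0.0 also provides with_edition() and local_edition(), which run code with the first-edition parser instead of the vroom-based one. A sketch, assuming that parser treats single-digit hours the way readr 1.4.0 did above:

```r
library(readr)

path <- tempfile(fileext = ".csv")
write_csv(data.frame(Datetime = "3/2/2018 5:09"), path)

# Run this one call with the first-edition parser
legacy <- with_edition(
  1,
  read_csv(path, col_types = cols(Datetime = col_datetime(format = "%m/%d/%Y %H:%M")))
)
legacy
```

with_edition() only affects the wrapped call, so the rest of a script keeps the second-edition behaviour.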


k5cents · 2021-08-18T20:19:40Z

Good find, definitely related to the hour not the day.

Per help(parse_datetime), I tried using %I for the hour and the problem persists.

Hour: "%H" or "%I" or "%h", use I (and not H) with AM/PM, use h (and not H) if your times represent durations longer than one day.

packageVersion("readr")
#> [1] '2.0.0'
library(readr)

tmp <- tempfile(fileext = ".csv")
dat <- tibble::tibble(
  x = c("A", "B"),
  y = c("4/19/2021 2:18:25 PM", "4/13/2021 10:15:57 PM")
)
write_csv(dat, file = tmp)
fmt <- "%m/%d/%Y %I:%M:%S %p"

read_csv(
  file = tmp,
  col_types = cols(
    x = col_character(),
    y = col_datetime(fmt)
  )
)
#> Warning: One or more parsing issues, see `problems()` for details
#> # A tibble: 2 × 2
#>   x     y                  
#>   <chr> <dttm>             
#> 1 A     NA                 
#> 2 B     2021-04-13 22:15:57

^{Created on 2021-08-18 by the reprex package (v2.0.0)}
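
For comparison, the same strings and the %I format can be run through parse_datetime(), the function that parsed the %H version correctly in the first report. This sketch only uses values from the reprex above and assumes readr's default English locale for AM/PM.

```r
library(readr)

y <- c("4/19/2021 2:18:25 PM", "4/13/2021 10:15:57 PM")
fmt <- "%m/%d/%Y %I:%M:%S %p"

# %I reads the 12-hour clock value and %p applies the AM/PM marker
parsed <- parse_datetime(y, format = fmt)
format(parsed, "%Y-%m-%d %H:%M:%S")
```

If both values come back non-missing here, the format string itself is fine and the failure is specific to the file-reading path.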

peterdesmet · 2021-08-26T21:00:13Z

Good to see this one is already reported, I encountered the same issue: frictionlessdata/frictionless-r#29 (with col_time() vs parse_time()).

# vroom 1.5.7

* Jenny Bryan is now the official maintainer.
* Fix uninitialized bool detected by CRAN's UBSAN check (tidyverse/vroom#386)
* Fix buffer overflow when trying to parse an integer field that is over 64 characters long (tidyverse/readr#1326)
* Fix subset indexing when indexes span a file boundary multiple times (#383)

# vroom 1.5.6

* `vroom(col_select=)` now works if `col_names = FALSE` as intended (#381)
* `vroom(n_max=)` now correctly handles cases when reading from a connection and the file does _not_ end with a newline (tidyverse/readr#1321)
* `vroom()` no longer issues a spurious warning when the parsing needs to be restarted due to the presence of embedded newlines (tidyverse/readr#1313)
* Fix performance issue when materializing subsetted vectors (#378)
* `vroom_format()` now uses the same internal multi-threaded code as `vroom_write()`, improving its performance in most cases (#377)
* `vroom_fwf()` no longer omits the last line if it does _not_ end with a newline (tidyverse/readr#1293)
* Empty files or files with only a header line and no data no longer cause a crash if read with multiple files (tidyverse/readr#1297)
* Files with a header but no contents, or a empty file if `col_names = FALSE` no longer cause a hang when `progress = TRUE` (tidyverse/readr#1297)
* Commented lines with comments at the end of lines no longer hang R (tidyverse/readr#1309)
* Comment lines containing unpaired quotes are no longer treated as unterminated quotations (tidyverse/readr#1307)
* Values with only a `Inf` or `NaN` prefix but additional data afterwards, like `Inform` or no longer inappropriately guessed as doubles (tidyverse/readr#1319)
* Time types now support `%h` format to denote hour durations greater than 24, like readr (tidyverse/readr#1312)

# vroom 1.5.5

* `vroom()` now supports files with only carriage return newlines (`\r`). (#360, tidyverse/readr#1236)
* `vroom()` now parses single digit datetimes more consistently as readr has done (tidyverse/readr#1276)
* `vroom()` now parses `Inf` values as doubles (tidyverse/readr#1283)
* `vroom()` now parses `NaN` values as doubles (tidyverse/readr#1277)
* `VROOM_CONNECTION_SIZE` is now parsed as a double, which supports scientific notation (#364)
* `vroom()` now works around specifying a `\n` as the delimiter (#365, tidyverse/dplyr#5977)
* `vroom()` no longer crashes if given a `col_name` and `col_type` both less than the number of columns (tidyverse/readr#1271)
* `vroom()` no longer hangs if given an empty value for `locale(grouping_mark=)` (tidyverse/readr#1241)
* Fix performance regression when guessing with large numbers of rows (tidyverse/readr#1267)

k5cents changed the title ~~col_datetime() fails for single digit day when parse_datetime() does not~~ Single-digit hour fails in col_datetime() but not parse_datetime() Aug 18, 2021

jimhester closed this as completed in tidyverse/vroom@a93cc68 Aug 27, 2021

hongyuanjia mentioned this issue Sep 6, 2021

Failed to parse datetime with a space in format #1295

Closed

damianooldoni mentioned this issue Sep 6, 2021

Remove quoted_na arg and solve other bugs due to major release of readr frictionlessdata/frictionless-r#27

Merged
