Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CsvReader cannot parse datetime 2016/1/1 0:00 in csv file. #16092

Closed
2 tasks done
miaocb opened this issue May 7, 2024 · 8 comments · Fixed by #20144
Closed
2 tasks done

CsvReader cannot parse datetime 2016/1/1 0:00 in csv file. #16092

miaocb opened this issue May 7, 2024 · 8 comments · Fixed by #20144
Labels
bug Something isn't working P-low Priority: low rust Related to Rust Polars

Comments

@miaocb
Copy link

miaocb commented May 7, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

contents of example.csv:

datetime
2016/1/1 0:00

use polars::prelude::*;

fn main() {
    let csv_file = "example.csv";
    let df = CsvReader::from_path(csv_file).unwrap()
    .has_header(true)
    .with_try_parse_dates(true)
    .finish().unwrap();
    println!("{}", df);
}

Log output

file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.
thread 'main' panicked at src\main.rs:8:15:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("could not parse `2016/1/1 0:00` as dtype `datetime[μs]` at column 'datetime' (column number 1)\n\nThe current offset in the file is 10 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `dtypes` argument\n- setting `ignore_errors` to `True`,\n- adding `2016/1/1 0:00` to the `null_values` list.\n\nOriginal error:  not parse '2016/1/1 0:00' with pattern 'DatetimeYMD'"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: process didn't exit successfully: `target\debug\test_datetime.exe` (exit code: 101)

Issue description

2016/1/1 00:00 works, 2016/1/1 0:00 fails.

Expected behavior

CsvReader can parse datetime 2016/1/1 0:00

Installed versions

[dependencies]
polars = "0.39.2"

@miaocb miaocb added bug Something isn't working needs triage Awaiting prioritization by a maintainer rust Related to Rust Polars labels May 7, 2024
@deanm0000
Copy link
Collaborator

'2016/1/1 0:00' isn't a valid format. chrono expects month and day to be 2 digits so 2016/01/01. The same is true for hour. It might (but I'm not sure) also require seconds for a datetime but not sure. @MarcoGorelli do "we" want to make the parser robust to those types of discrepancies?

@miaocb
Copy link
Author

miaocb commented May 7, 2024

2016/1/1 0:00 is the default format for saving csv in Microsoft Excel in windows Chinsese version.

Chrono can parse "2016/1/1 0:00" by providing format string "%Y/%m/%d %H:%M".

In polars, the alternative is to read it as a string and then parse it into datetime using lazyframe with format "%Y/%m/%d %H:%M".
Since to_datetime in lazyframe and chrono are both able to parse "2016/1/1 0:00", but CsvReader refuse to parse.

use polars::{lazy::dsl::StrptimeOptions, prelude::*};

fn main() {
    let csv_file = "example.csv";
    let df = CsvReader::from_path(csv_file).unwrap()
    .has_header(true)
    .with_try_parse_dates(false)
    .finish().unwrap();

    let options = StrptimeOptions{
        format: Some("%Y/%m/%d %H:%M".to_string()),
        strict: true,
        .. StrptimeOptions::default()
    };
    let df = df.lazy().with_columns([
        col("datetime")
        .str()
        .to_datetime(None, None, options, lit(""))
    ]).collect().unwrap();

    println!("{}", df);
}

@Julian-J-S
Copy link
Contributor

Julian-J-S commented May 7, 2024

I think it is impossible to infer all date/datetime formats out there.

BUT imo it should be possible to pass in a date/datetime format to read_csv

This feature is accepted but needs implementation 😉 #9550

@miaocb
Copy link
Author

miaocb commented May 7, 2024

Actually, polars-time has a very complete date/datetime pattern list,
The pattern "%Y/%m/%d %H:%M" which can parse "2016/1/1 0:00" is already there, and it succeeds to find this pattern at be beginning, but I can not figure out why it failed to parse later.

see polars-time-0.39.2\src\chunkedarray\string\infer.rs
see polars-time-0.39.2\src\chunkedarray\string\patterns.rs

pub(super) static DATETIME_Y_M_D: &[&str] = &[
    // ---
    // ISO8601-like, generated via the `iso8601_format_datetime` test fixture
    // ---
    "%Y/%m/%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%Y/%m/%dT%H%M%S",
    "%Y-%m-%dT%H%M%S",
    "%Y/%m/%dT%H:%M",
    "%Y-%m-%dT%H:%M",
    "%Y/%m/%dT%H%M",
    "%Y-%m-%dT%H%M",
    "%Y/%m/%dT%H:%M:%S.%9f",
    "%Y-%m-%dT%H:%M:%S.%9f",
    "%Y/%m/%dT%H:%M:%S.%6f",
    "%Y-%m-%dT%H:%M:%S.%6f",
    "%Y/%m/%dT%H:%M:%S.%3f",
    "%Y-%m-%dT%H:%M:%S.%3f",
    "%Y/%m/%dT%H%M%S.%9f",
    "%Y-%m-%dT%H%M%S.%9f",
    "%Y/%m/%dT%H%M%S.%6f",
    "%Y-%m-%dT%H%M%S.%6f",
    "%Y/%m/%dT%H%M%S.%3f",
    "%Y-%m-%dT%H%M%S.%3f",
    "%Y/%m/%d",
    "%Y-%m-%d",
    "%Y/%m/%d %H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d %H%M%S",
    "%Y-%m-%d %H%M%S",
    "%Y/%m/%d %H:%M",
    "%Y-%m-%d %H:%M",
    "%Y/%m/%d %H%M",
    "%Y-%m-%d %H%M",
    "%Y/%m/%d %H:%M:%S.%9f",
    "%Y-%m-%d %H:%M:%S.%9f",
    "%Y/%m/%d %H:%M:%S.%6f",
    "%Y-%m-%d %H:%M:%S.%6f",
    "%Y/%m/%d %H:%M:%S.%3f",
    "%Y-%m-%d %H:%M:%S.%3f",
    "%Y/%m/%d %H%M%S.%9f",
    "%Y-%m-%d %H%M%S.%9f",
    "%Y/%m/%d %H%M%S.%6f",
    "%Y-%m-%d %H%M%S.%6f",
    "%Y/%m/%d %H%M%S.%3f",
    "%Y-%m-%d %H%M%S.%3f",
    // ---
    // other
    // ---
    // we cannot know this one, because polars needs to know
    // the length of the parsed fmt
    // ---
    "%FT%H:%M:%S%.f",
];

@MarcoGorelli
Copy link
Collaborator

I think the issue is here

const DATETIME_YMD_PATTERN: &str = r#"(?x)
^
['"]? # optional quotes
(?:\d{4,}) # year
[-/] # separator
(?P<month>[01]?\d{1}) # month
[-/] # separator
(?:\d{1,2}) # day
(?:
[T\ ] # separator
(?:\d{2}) # hour
:? # separator
(?:\d{2}) # minute
(?:
:? # separator
(?:\d{2}) # seconds
(?:
\.(?:\d{1,9}) # subsecond
)?
)?
)?
['"]? # optional quotes
$
"#;
static DATETIME_YMD_RE: Lazy<Regex> = Lazy::new(|| Regex::new(DATETIME_YMD_PATTERN).unwrap());

The pattern could be adjusted so that hour, minute, and second are allowed to only be 1-digit long if there's a separator :

@MarcoGorelli MarcoGorelli added P-low Priority: low and removed needs triage Awaiting prioritization by a maintainer labels May 7, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog May 7, 2024
@miaocb
Copy link
Author

miaocb commented May 7, 2024

I think the issue is here

const DATETIME_YMD_PATTERN: &str = r#"(?x)
^
['"]? # optional quotes
(?:\d{4,}) # year
[-/] # separator
(?P<month>[01]?\d{1}) # month
[-/] # separator
(?:\d{1,2}) # day
(?:
[T\ ] # separator
(?:\d{2}) # hour
:? # separator
(?:\d{2}) # minute
(?:
:? # separator
(?:\d{2}) # seconds
(?:
\.(?:\d{1,9}) # subsecond
)?
)?
)?
['"]? # optional quotes
$
"#;
static DATETIME_YMD_RE: Lazy<Regex> = Lazy::new(|| Regex::new(DATETIME_YMD_PATTERN).unwrap());

The pattern could be adjusted so that hour, minute, and second are allowed to only be 1-digit long if there's a separator :

You are right. There are three regex expressions, DATETIME_DMY_PATTERN, DATETIME_YMD_PATTERN, DATETIME_YMDZ_PATTERN. Change hour (line 24, 50, 74), minute (line 26, 52, 76), second (line 29, 55, 79) to allow 1-2 digits fixed the problem. Month and day are already allowed to have 1-2 digits.

            (?:\d{1,2})                # hour
            (?:\d{1,2})                # minute
                (?:\d{1,2})            # second

@wsyxbcl
Copy link
Contributor

wsyxbcl commented Dec 4, 2024

I think the issue is here

const DATETIME_YMD_PATTERN: &str = r#"(?x)
^
['"]? # optional quotes
(?:\d{4,}) # year
[-/] # separator
(?P<month>[01]?\d{1}) # month
[-/] # separator
(?:\d{1,2}) # day
(?:
[T\ ] # separator
(?:\d{2}) # hour
:? # separator
(?:\d{2}) # minute
(?:
:? # separator
(?:\d{2}) # seconds
(?:
\.(?:\d{1,9}) # subsecond
)?
)?
)?
['"]? # optional quotes
$
"#;
static DATETIME_YMD_RE: Lazy<Regex> = Lazy::new(|| Regex::new(DATETIME_YMD_PATTERN).unwrap());

The pattern could be adjusted so that hour, minute, and second are allowed to only be 1-digit long if there's a separator :

You are right. There are three regex expressions, DATETIME_DMY_PATTERN, DATETIME_YMD_PATTERN, DATETIME_YMDZ_PATTERN. Change hour (line 24, 50, 74), minute (line 26, 52, 76), second (line 29, 55, 79) to allow 1-2 digits fixed the problem. Month and day are already allowed to have 1-2 digits.

            (?:\d{1,2})                # hour
            (?:\d{1,2})                # minute
                (?:\d{1,2})            # second

So is there any remaining issue here?

@MarcoGorelli
Copy link
Collaborator

I think someone would just need to make a PR

I'm busy with some other things right now, and given that the workaround here to read as string then then call to_datetime / to_date is extremely simple, I consider this low-priority

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P-low Priority: low rust Related to Rust Polars
Projects
Archived in project
5 participants