Add way to specify date format in read_csv #9550

Open
MarcoGorelli opened this issue Jun 25, 2023 · 6 comments
Labels
A-temporal (Area: date/time functionality), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments

@MarcoGorelli
Collaborator

Problem description

Taking this forward from #8168.

The original report was about performance, but it turned into a discussion of whether read_csv should let you specify the date format.

There do seem to be potential perf gains:

# t.py
from datetime import datetime
import polars as pl

df = pl.DataFrame({
    'ts': pl.date_range(datetime(1000, 1, 1), datetime(9999, 1, 1), eager=True),
})
df.write_csv('tmp.csv')

In [1]: %timeit pl.read_csv('tmp.csv', try_parse_dates=True)
28.3 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [2]: %timeit pl.read_csv('tmp.csv').with_columns(pl.col('ts').str.to_datetime(format='%Y-%m-%dT%H:%M:%S.%6f'))
668 ms ± 41.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
MarcoGorelli added the enhancement label on Jun 25, 2023
MarcoGorelli added the A-temporal label on Jul 14, 2023
@fernandocast
Contributor

Hello, I'm new to polars and would like to contribute to this project.
Could you please assign this issue to me?

@MarcoGorelli
Collaborator Author

sure, go ahead, and please check https://github.com/pola-rs/polars/blob/main/CONTRIBUTING.md

MarcoGorelli added the accepted label on Jan 12, 2024
github-project-automation bot moved this to Ready in Backlog on Jan 12, 2024
@MarcoGorelli
Collaborator Author

from discussion, this should either be a single string, or a mapping from column names to formats

@Julian-J-S
Contributor

would love to have this as well!

Is there any progress on this? 🤔 😄

@MarcoGorelli
Collaborator Author

hey - I'm not working on this at the moment, there are other higher-priority things (rolling functions, interpolate). I'd be happy to review a PR though

@deanm0000
Collaborator

Should this be an optimization to scan_csv? As in having pl.scan_csv('tmp.csv').with_columns(pl.col('ts').str.to_datetime(format='%Y-%m-%dT%H:%M:%S.%6f')).collect() do the same thing as pl.read_csv('tmp.csv', try_parse_dates=True)
