-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: more flexible iso8601 parsing #12060
Conversation
A question: is is worth the added complexity to the ISO 8601 C-parser to be able to handle the dates without separator (which is actually not really ISO 8601 anymore), if when passing the |
self.strings = [x.strftime('%Y-%m-%d %H:%M:%S') for x in self.rng] | ||
self.strings = self.rng.strftime('%Y-%m-%d %H:%M:%S') | ||
self.strings_nosep = self.rng.strftime('%Y%m%d %H:%M:%S') | ||
self.strings_tz_space = pd.Series(self.rng.strftime('%Y-%m-%d %H:%M:%S')) + ' -0800' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The strftime
method on a DatetimeIndex is rather new. I don't really know what our policy on this should be, but I think this will make it difficult to run the asv benchmarks for a longer time back?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree here. why don't you leave the generation of the test cases using the list comprehension (how self.strings
was)
Hard to say if it's worth it (re dates with separators), not sure it's a super common format. But there is a substantial speedup even over providing a format to the strptime parser (20x in that asv bench). Not that it really matters, but from I've read dates without separators would still be considered ISO 8601, e.g. http://www.cl.cam.ac.uk/~mgk25/iso-time.html. |
how hard would it to allow a selection of separators eg w/o sacrificing ambiguity? |
also, let's add another issue to make |
@chris-b1 Ah, yes, I missed that you also passed this to the ISO parser if that format is given. I misinterpreted the asv output so thinking that the generic format parser was as fast as the ISO parser, but indeed quite a bit slower. |
@jreback - I don't think it would be too bad to parse multiple separators, I'll try it out |
@chris-b1 just some minor comments on the benchmarks themselves here. if the addtl separators is a bit complicated can do in another PR. pls rebase on master as lots of PEP changes recently and do a ping if you want to merge now (else, pls open a new issue for the other sep chars if not) |
@jreback - updated for that asv note. I was most of the way there already, so also updated to handle
Passes the current test suite (adjusted for a couple timezone changes), not sure if there are other edge cases to try? a couple perf regressions to figure out too
|
@@ -664,7 +722,7 @@ parse_iso_8601_datetime(char *str, int len, | |||
if (sublen >= 2 && isdigit(substr[0]) && isdigit(substr[1])) { | |||
out->min = 10 * (substr[0] - '0') + (substr[1] - '0'); | |||
|
|||
if (out->hour < 0 || out->min >= 60) { | |||
if (out->min < 0 || out->min >= 60) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is out->min < 0
ever True
?
@jreback - if you want to have a look, I've got everything working now. |
can you add tests for differnt seps (and invalid ones)? |
} | ||
} | ||
if (i == valid_sep_len) { | ||
goto parse_error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this validate all seps are the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - This first check (sep between year and month) validates its in the list - the second check (between month/day) is required to match sep
did you figure out the perf regressions? |
Well, I seemed to fix it, but didn't figure it out.
|
hmm, I think this interferes with the |
@jreback - updated. Perf issue turned out to be unrelated - just some kind of caching that
|
@@ -695,7 +753,7 @@ parse_iso_8601_datetime(char *str, int len, | |||
if (sublen >= 2 && isdigit(substr[0]) && isdigit(substr[1])) { | |||
out->sec = 10 * (substr[0] - '0') + (substr[1] - '0'); | |||
|
|||
if (out->sec < 0 || out->sec >= 60) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these negative checks still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kawochen pointed this out - unless I'm misunderstanding something they will always be false.
@@ -461,6 +458,19 @@ def calc_with_mask(carg, mask): | |||
|
|||
return None | |||
|
|||
def _format_is_iso(f): | |||
""" | |||
Does format match the iso8601 set that can be handled by the C parser? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add here that we require same seps for date fields
some minor comments. ping when pushed / green. |
@jreback - updated for your comments |
@chris-b1 thanks! your PR's are always great! |
closes #9714
closes #11899
closes #11871
makes ISO parser in C handle the following. I think 2 & 3 aren't actually iso8601 anymore, but close and unambiguous.
(this ONLY works if the date has separators, eg.
"2015-1-1"
parses, but"201511"
doesn't because it's ambiguous)asv results - adds a small amount of overhead to the standard case (
'2014-01-01'
)