-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add temporal functions #272
Conversation
Hi, may be add some add some abbreviations? like DOW, DOY and ISODOW |
I'm not sure substrait supports abbreviations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, timezone strings need to be documented. If they're well-defined in some spec I'm not aware of yet, a link would be sufficient. We might also want to consider requiring them to be constant, but I'm not sure about that one.
If I were to implement the primitives for timezone handling I would do something like this:
timezone_offset(string, timestamp_tz) -> interval_day
to compute the timezone offset for a given timezone string at the given point in time (so DST/regulatory changes can be handled).timezone_offset(string, timestamp) -> interval_day
to do the above but for local time, if possible (if the timezone string refers to a region rather than being explicit about its offset to UTC, the result of this conversion may be ambiguous or may not exist due to DST and regulatory changes; I guess this would return null if unknown).timezone_offset(string) -> interval_day
like above, but fails if the timezone is ambiguous about DST etc.add(timestamp_tz, interval_day) -> timestamp
to represent a fixed point in time with a given timestamp offset.subtract(timestamp, interval_day) -> timestamp_tz
to combine a local timestamp with the local timezone to get a fixed point in time.
where interval_day
is (ab)used as a timezone offset w.r.t. UTC for lack of a better type. Some integer with some documented unit would also work. Using this method, all timezone parsing nonsense can be confined to a single function.
@jacques-n curious to hear your opinion about timezone handling.
Semi-related: it looks like add(timestamp_tz, interval_year, [string]) -> timestamp
and add(timestamp_tz, interval_day, [string]) -> timestamp
are defined incorrectly; they should probably have been returning timestamp_tz
.
extensions/functions_datetime.yaml
Outdated
@@ -40,6 +64,7 @@ scalar_functions: | |||
- args: | |||
- value: timestamp_tz | |||
- value: interval_year | |||
- value: string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are all these strings coming from (in the other functions as well)? This smells like Arrow timezone shenanigans. Substrait timestamp_tz
values don't conceptually have timezones, nor do they need them for computations. They represent fixed points in time, typically implemented as offsets to the Unix epoch, which is defined in UTC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If interval_year
is a physical unit and timestamp_tz
is in UTC I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reread and since interval_year
is a count of days and hence not a physical unit we need the timezone. Imagine the case where you add time to a timestamp_tz
and cross a DST switch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I didn't consider that interval_year
isn't physical. My bad.
extensions/functions_datetime.yaml
Outdated
name: The part of the value to extract. | ||
required: true | ||
- value: timestamp_tz | ||
- value: string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see why you might want a timezone string here, but at the very least it must be documented that that is what it is, and which formats it supports. However, I would much rather see the reverse of assume_timezone
, e.g. with_timezone(string, timestamp_tz) -> timestamp
, in which case this implementation can just always use UTC, and a user could write extract(..., with_timezone(timezone, timestamp))
if they want a different one. That way, the existing function signature doesn't need to be changed, and the functions represent more primitive operations (only assume_timezone
and with_timezone
would need to understand what timezones are, which is great because timezones are terrible).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure doing extract(assume_timezone(timestamp, timezone), component)
would work. As far as I understand extracted components should be local. So timezone would need to be passed to extract
as well.
e.g. extract(assume_timezone(timestamp, timezone), component, timezone)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand extracted components should be local
Okay, on second thought, that makes sense. I was thinking in UTC, but timezone_tz
is supposed to be agnostic of that.
extensions/functions_datetime.yaml
Outdated
- options: [ YEAR, ISO_YEAR, QUARTER, MONTH, DAY, DAY_OF_WEEK, ISO_DAY_OF_WEEK, DAY_OF_YEAR, WEEK, | ||
ISO_WEEK, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, NANOSECOND, SUBSECOND, EPOCH, | ||
TIMEZONE_OFFSET ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It's not clear to me what
MILLISECOND
,MICROSECOND
andNANOSECOND
are relative to. Does for exampleextract(MICROSECOND, 2022-08-15T11:08:24.123456789)
yield 24123456, 123456, or 456? - What does
SUBSECOND
mean? - What does
EPOCH
mean? - I'm not 100% convinced
TIMEZONE_OFFSET
belongs here, even if we keep the timezone string argument. Is that supposed to depend on the timestamp due to regulatory changes, DST, and whatnot? If so, can we please restrict that nonsense to dedicated functions converting betweentimestamp
andtimestamp_tz
instead? If not, the result would not depend on thetimestamp_tz
value. - If we keep it, with what unit are
TIMEZONE_OFFSET
s represented (minutes? seconds?) - What's the difference between
YEAR
andISO_YEAR
? - Do
MONTH
,DAY
, ...,ISO_WEEK
start counting at 1 or 0? - Which day is the first day of the week?
Please elaborate in the function description; it should probably have an exhaustive description of all the options, including the ones that were already defined prior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a description to extract
function that should address all questions above.
TIMEZONE_OFFSET
should return seconds because that's the precision provided by the tzdb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well perhaps microseconds could work for TIMEZONE_OFFSET
too since everything is microseconds. I don't have a preference, leaving seconds for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with seconds, that's good enough for timezone offsets.
@jvanstraten thanks for the review! I'll try to address it today or tomorrow. |
Apologies @jvanstraten I didn't get to this and will now be traveling for a week. Will reply after. |
| curious to hear your opinion about timezone handling It's mostly just a red-hot hatred. Why we can't all use julian time is beyond me. Why do I care if 8pm is at night one place and during the day another? I suppose that isn't what you were asking. My main experience is just to copy a bunch of SQL databases. In general, I consider timezone to be a display problem, not a processing problem. Some problems with timezones in general is that there are illegal times (bleh) and timezone rules are not immutable. Beyond that, I have no strong opinions. The main thing is for people to convert display time between timezones and interact with timeline values. |
d8b584e
to
ad56a10
Compare
I would propose we use IANA timezone database, (wikipedia) which tracks time zone offsets here. To my understanding that is what most DBs do. |
Also adding this table for completness:
[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/datetime_functions |
db50626
to
599accd
Compare
I've also added temporal rounding, see examples here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_temporal_test.cc#L2733. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! It's also a lot to look at in one go, so I don't know if I caught everything... but either way, I don't think any of the remaining things are fundamental problems.
How would you express adding N days (calendar days, not 86400s) to |
0b18f48
to
de566ae
Compare
@jvanstraten thanks for the review! I think I addressed all the points and would kindly ask for another review round. |
extensions/functions_datetime.yaml
Outdated
- | ||
name: "cast_timezone" | ||
description: >- | ||
Convert UTC-relative timestamp_tz to local timestamp using given local time's timezone. | ||
|
||
Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). | ||
Examples: "Pacific/Marquesas", "Etc/GMT+1". | ||
If timezone is not present then UTC is assumed. If timezone is invalid an error is thrown. | ||
impls: | ||
- args: | ||
- name: x | ||
value: timestamp_tz | ||
- name: timezone | ||
description: Timezone string from IANA tzdb. | ||
value: string | ||
return: timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to actually not offer this option until it's actually requested. In my experience local times without the zone information are not very useful and if you just have to know the wall time to show to the user there's strftime
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty fundamental in order to add/subtract calendar units to a timestamp_tz
though, or to convert from one timezone to another, so I don't really see this as a formatting thing at all. I don't know. Keep in mind though that a consumer doesn't have to implement everything in Substrait core. If it's not implemented and doesn't end up being used much or at all there's no harm done IMO.
extensions/functions_datetime.yaml
Outdated
name: "round_calendar" | ||
description: >- | ||
Round a given timestamp/date/time to a multiple of a time unit. Time unit can be physical or calendar based. | ||
Input value is floored to multiple of temporal units until the next origin unit. Conversely value would be | ||
ceiled to the multiple of temporal units since the last origin unit. Rounding would return the closest of the | ||
ceiled and floored results. | ||
|
||
Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones). | ||
Examples: "Pacific/Marquesas", "Etc/GMT+1". | ||
If timezone is not present then UTC is assumed. If timezone is invalid an error is thrown. | ||
impls: | ||
- args: | ||
- name: x | ||
value: timestamp | ||
- name: rounding | ||
options: [ FLOOR, CEIL, ROUND ] | ||
required: true | ||
- name: unit | ||
options: [ YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND ] | ||
required: true | ||
- name: origin | ||
options: [ YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND ] | ||
required: true | ||
- name: multiple | ||
value: i64 | ||
return: timestamp | ||
- args: | ||
- name: x | ||
value: timestamp_tz | ||
- name: rounding | ||
options: [ FLOOR, CEIL, ROUND ] | ||
required: true | ||
- name: unit | ||
options: [ YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND ] | ||
required: true | ||
- name: origin | ||
options: [ YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND ] | ||
required: true | ||
- name: multiple | ||
value: i64 | ||
- name: timezone | ||
description: Timezone string from IANA tzdb. | ||
value: string | ||
return: timestamp_tz | ||
- args: | ||
- name: x | ||
value: date | ||
- name: rounding | ||
options: [ FLOOR, CEIL, ROUND ] | ||
required: true | ||
- name: unit | ||
options: [ YEAR, MONTH, WEEK, DAY ] | ||
required: true | ||
- name: origin | ||
options: [ YEAR, MONTH, WEEK, DAY ] | ||
required: true | ||
- name: multiple | ||
value: i64 | ||
- name: origin | ||
value: date | ||
return: date | ||
- args: | ||
- name: x | ||
value: time | ||
- name: rounding | ||
options: [ FLOOR, CEIL, ROUND ] | ||
- name: unit | ||
options: [ DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND ] | ||
required: true | ||
- name: origin | ||
options: [ DAY, HOUR, MINUTE, SECOND, MILLISECOND ] | ||
required: true | ||
- name: multiple | ||
value: i64 | ||
- name: origin | ||
value: time | ||
return: time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rounding to unit-based-origin is lubridate behaviour and as per experiences in the Acero rounding kernel it is best separated from the instant-based-origin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior comes across as really confusing to me, but I suppose it does seem to cover the edge cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's useful if you want to round to say 3 days, but not from 1970-01-01 but rather from the beginning of each month.
Rounding timestamps (in local time) is a deep rabbit hole.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what it's useful for, it's mostly just the combination of floor/ceil and since/until in that description that's confusing me. So I ended up working out manually what your description actually implies in terms of behavior when taken to the letter.
Input value is floored to multiple of temporal units until the next origin unit.
-> Round input value down such that there is an integer number of multiple
s between it and the next origin
.
though it could also be interpreted as
-> Round input value up such that there is an integer number of multiple
s between it and the next origin
.
if "floored" refers to the difference between the input and the next origin unit, rather than the input itself. I'll call those "floored value" and "floored difference". Based on the rounding options argument I guess the correct one is the one that actually floors the value, but it makes a huge mess for the rounded version.
Conversely value would be ceiled to the multiple of temporal units since the last origin unit.
-> Round input value up such that there is an integer number of multiple
s between it and the previous origin
.
This is not ambiguous because there's no negation here due to the "time until" and always ceils. However, the way it's written implies to me that it both flooring and ceiling happen at the same and that an assertion is made that this is the same thing or something. It'd be very helpful to actually refer to the options here, i.e. "Conversely, if rounding is set to CEILING
, value would be ceiled...".
Rounding would return the closest of the ceiled and floored results.
-> Return whichever of the above two values is closer to the input.
I think this is actually ambiguous though. What if the input is equidistant to both options?
I ended up just writing my various algorithm interpretations as a Python script to try it out, because I got way confused. The script operates on integers rather than calendar units, but the logic is the same. I used multiple = 300 and origin = 1000 (so that could be rounding to multiples of 300 milliseconds with the origin set to whole seconds) to hit all corner cases due to indivisibility and halfway values.
Results
Input | Floored (val) | Floored (dif) | Ceiled | Rounded (val) | Rounded (dif) |
---|---|---|---|---|---|
-100 | -300 | 0 | -100 | -100 | -100 |
-50 | -300 | 0 | 200 | -300 or 200 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
50 | -200 | 100 | 300 | -200 or 300 | 100 |
100 | 100 | 100 | 300 | 100 | 100 |
150 | 100 | 400 | 300 | 100 | 300 |
200 | 100 | 400 | 300 | 100 or 300 | 300 |
250 | 100 | 400 | 300 | 300 | 300 |
300 | 100 | 400 | 300 | 300 | 300 |
350 | 100 | 400 | 600 | 100 or 600 | 400 |
400 | 400 | 400 | 600 | 400 | 400 |
450 | 400 | 700 | 600 | 400 | 600 |
500 | 400 | 700 | 600 | 400 or 600 | 600 |
550 | 400 | 700 | 600 | 600 | 600 |
600 | 400 | 700 | 600 | 600 | 600 |
650 | 400 | 700 | 900 | 400 or 900 | 700 |
700 | 700 | 700 | 900 | 700 | 700 |
750 | 700 | 1000 | 900 | 700 | 900 |
800 | 700 | 1000 | 900 | 700 or 900 | 900 |
850 | 700 | 1000 | 900 | 900 | 900 |
900 | 700 | 1000 | 900 | 900 | 900 |
950 | 700 | 1000 | 1200 | 700 or 1200 | 1000 |
1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
1050 | 800 | 1100 | 1300 | 800 or 1300 | 1100 |
1100 | 1100 | 1100 | 1300 | 1100 | 1100 |
import math
def ceil_multiple(input, multiple):
"""Round integer up to multiple from 0."""
return (input + multiple - 1) // multiple * multiple
def floor_multiple(input, multiple):
"""Round integer down to multiple from 0."""
return input // multiple * multiple
def closest(input, a, b):
"""Pick a or b depending on which is closer to input. Return result as a
string. If a and b are different and equidistant from input, return the string
"{a} or {b}"."""
if a == b:
return f"{a}"
da = abs(input - a)
db = abs(input - b)
if da < db:
return f"{a}"
elif db < da:
return f"{b}"
else:
return f"{a} or {b}"
def round_calendar(input, multiple, origin):
"""Implement the proposed rounding logic. Rather than returning the result,
print a table line with the intermediate results."""
# "Input value is floored to multiple of temporal units until the next origin unit."
next_origin = ceil_multiple(input, origin)
# Both interpretations:
floored_value = floor_multiple(input - next_origin, multiple) + next_origin
floored_difference = next_origin - floor_multiple(next_origin - input, multiple)
# "Conversely value would be ceiled to the multiple of temporal units since the last origin unit."
previous_origin = floor_multiple(input, origin)
ceiled = ceil_multiple(input - previous_origin, multiple) + previous_origin
# "Rounding would return the closest of the ceiled and floored results."
rounded_value = closest(input, floored_value, ceiled)
rounded_difference = closest(input, floored_difference, ceiled)
print(f"| {input:>5} | {floored_value:>13} | {floored_difference:>13} | {ceiled:>6} | {rounded_value:>13} | {rounded_difference:>13} |")
multiple = 300
origin = 1000
print("| Input | Floored (val) | Floored (dif) | Ceiled | Rounded (val) | Rounded (dif) |")
print("|-------|---------------|---------------|--------|---------------|---------------|")
for input in range(-100, 1101, 50):
round_calendar(input, multiple, origin)
If we're using the interpretation of the floored version that actually floored, the rounding version makes a completely indecipherable mess, so I'm pretty sure there's something fishy about the way the operation is currently described...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've thought about this some more and this is how I would write/implement this instead:
Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple of
origin
in the giventimezone
, the origin point is chosen as one of the two nearest multiples. Which of these is chosen is governed byorigin_rounding
:FLOOR
means to use the earlier one,CEIL
means to use the later one,ROUND_TIE_DOWN
means to choose the nearest and tie to the earlier one if equidistant,ROUND_TIE_UP
means to choose the nearest and tie to the later one if equidistant. If the given timestamp is not already an exact multiple ofmultiple
unit
s away from the chosen origin point, the function chooses between one of the two nearest multiples. Which of these is chosen is governed byrounding
, using the same logic asorigin_rounding
.
Results
For equal rounding
and origin_rounding
:
Input | Floor | Ceil | Round-tie-up | Round-tie-down |
---|---|---|---|---|
-100 | -100 | 0 | 0 | 0 |
-50 | -100 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 |
50 | 0 | 100 | 0 | 0 |
100 | 0 | 100 | 0 | 0 |
150 | 0 | 400 | 300 | 0 |
200 | 0 | 400 | 300 | 300 |
250 | 0 | 400 | 300 | 300 |
300 | 300 | 400 | 300 | 300 |
350 | 300 | 400 | 300 | 300 |
400 | 300 | 400 | 300 | 300 |
450 | 300 | 700 | 600 | 300 |
500 | 300 | 700 | 400 | 600 |
550 | 300 | 700 | 700 | 400 |
600 | 600 | 700 | 700 | 700 |
650 | 600 | 700 | 700 | 700 |
700 | 600 | 700 | 700 | 700 |
750 | 600 | 1000 | 700 | 700 |
800 | 600 | 1000 | 700 | 700 |
850 | 600 | 1000 | 1000 | 700 |
900 | 900 | 1000 | 1000 | 1000 |
950 | 900 | 1000 | 1000 | 1000 |
1000 | 1000 | 1000 | 1000 | 1000 |
1050 | 1000 | 1100 | 1000 | 1000 |
1100 | 1000 | 1100 | 1000 | 1000 |
import math
def floor_multiple(input, multiple):
"""Round integer down to multiple from 0."""
return input // multiple * multiple
def ceil_multiple(input, multiple):
"""Round integer up to multiple from 0."""
return (input + multiple - 1) // multiple * multiple
def round_tie_down_multiple(input, multiple):
"""Round integer to the nearest multiple from 0, tie down if equidistant."""
return (input + (multiple - 1) // 2) // multiple * multiple
def round_tie_up_multiple(input, multiple):
"""Round integer to the nearest multiple from 0, tie down if equidistant."""
return (input + multiple // 2) // multiple * multiple
def round_multiple(input, multiple, method):
return {
"FLOOR": floor_multiple,
"CEIL": ceil_multiple,
"ROUND_TIE_DOWN": round_tie_down_multiple,
"ROUND_TIE_UP": round_tie_up_multiple,
}[method](input, multiple)
def round_calendar(input, multiple, origin, rounding, origin_rounding):
"""Implement the proposed rounding logic."""
# "If the given timestamp is not already an exact multiple of `origin` in
# the given `timezone`, the origin point is chosen as one of the two
# nearest multiples."
origin = round_multiple(input, origin, origin_rounding)
# "If the given timestamp is not already an exact multiple of `multiple`
# `unit`s away from the chosen origin point, the function chooses between
# one of the two nearest multiples."
return round_multiple(input - origin, multiple, rounding) + origin
multiple = 300
origin = 1000
print("| Input | Floor | Ceil | Round-tie-up | Round-tie-down |")
print("|-------|-------|------|--------------|----------------|")
for input in range(-100, 1101, 50):
floored = round_calendar(input, multiple, origin, "FLOOR", "FLOOR")
ceiled = round_calendar(input, multiple, origin, "CEIL", "CEIL")
rounded_tied_up = round_calendar(input, multiple, origin, "ROUND_TIE_UP", "ROUND_TIE_UP")
rounded_tied_down = round_calendar(input, multiple, origin, "ROUND_TIE_DOWN", "ROUND_TIE_DOWN")
print(f"| {input:>5} | {floored:>5} | {ceiled:>4} | {rounded_tied_up:>12} | {rounded_tied_down:>14} |")
There are still some shenanigans when the origin is rounded (i.e. the crossover point is halfway), but I think that's just the nature of the beast when used in conjunction with a multiple that doesn't divide the origin time unit...?
This is indeed a rabbit hole, but if we want to adopt this function in the spec, we'll have to go down it nonetheless. Might be worth splitting it off into another PR if this drags on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if you'd like to use an implementation see:
https://arrow.apache.org/docs/python/generated/pyarrow.compute.floor_temporal.html
import pyarrow as pa
from pyarrow import compute as pc
arr = pa.array([10**14, 10**15, 10**16], type=pa.timestamp("us", "Pacific/Marquesas"))
pc.floor_temporal(arr, 3, "hour", calendar_based_origin=True)
<pyarrow.lib.TimestampArray object at 0x11fb4bd60>
[
1973-03-03 09:30:00.000000,
2001-09-09 00:30:00.000000,
2286-11-20 15:30:00.000000
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my definition there would be two rounding options;
- name: rounding
options: [ FLOOR, CEIL, ROUND_TIE_DOWN, ROUND_TIE_UP ]
- name: origin_rounding
options: [ FLOOR, CEIL, ROUND_TIE_DOWN, ROUND_TIE_UP ]
(there's a double space before ROUND_TIE_UP
in your suggestion, by the way; the YAML linter/format check would probably complain about that)
Using the same option for both would also work, kinda, but they do fundamentally different things; one chooses the crossover point for which origin to use (where shenanigans might occur) and the other chooses how to actually round toward that point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mind origin in "round_calendar"
is always before the given timestamp. And by origin I imagine like 0 in integers. Is there a benefit to choosing the origin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mind origin in "round_calendar" is always before the given timestamp.
What I'm pretty sure you meant to say corresponds to origin_rounding = FLOOR by my definition, but strictly speaking your definition gets pretty weird when the incoming value aligns exactly with the origin point. You probably meant "before or equal to".
Is there a benefit to choosing the origin?
I don't know; it seemed to generalize nicely that way, but I'm not attached to it. If everyone uses FLOOR
behavior for origin_rounding
we can leave the option out and update the description accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably meant "before or equal to".
Yes.
I don't know; it seemed to generalize nicely that way, but I'm not attached to it. If everyone uses FLOOR behavior for origin_rounding we can leave the option out and update the description accordingly.
I suppose it's nicely symmetrical but I'd avoid as I've not seen it elsewhere. I'll update the description.
de566ae
to
3559bff
Compare
@jorisvandenbossche do you have any input here? |
According to the definition of intervals that Substrait currently uses, there is no functional difference between "1 day" and "86400 seconds". In both cases you'd just add that amount of time, because If you want to add "1 day" in the calendar sense, which I guess may be 23, 24, or 25 hours if DST changes, or even something far worse if the timezone changes outright due to legislative changes or leap seconds, you would convert the |
That wouldn't work for month or year. |
That's a good point. I would just not provide an overload for adding ETA: I think that's indeed what happened. I guess I can also live with keeping the overload with the timezone argument as a shorthand notation. Unrelated, I'm now noticing a bunch of broken overloads for |
c6b2e40
to
79f39c1
Compare
@westonpace please note I've also changed the week extraction and calendrical rounding somewhat, to allow for all possible modes. |
373ac1a
to
15c5820
Compare
Removed UTC wording as suggested in 337. |
0003578
to
0b92d38
Compare
bf2edf3
to
b1f65eb
Compare
@westonpace thanks for the review, I think I covered all your points. The CI issue is due to some API rate limit. |
04feab2
to
8bc3758
Compare
Co-authored-by: Weston Pace <[email protected]> Co-authored-by: Jeroen van Straten <[email protected]>
8bc3758
to
637a062
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sticking with this. This was quite the undertaking.
Thanks for reviews and feedback @westonpace! I'm really happy this is done. |
This adds new scalar functions to work with timestamp/date/time types.
See #222 for previous discussion.