feat: add temporal functions #272

rok · 2022-07-29T18:32:31Z

This adds new scalar functions to work with timestamp/date/time types.
See #222 for previous discussion.

rok · 2022-08-03T17:38:46Z

spevenhe · 2022-08-12T01:45:21Z

Hi, may be add some add some abbreviations？ like DOW, DOY and ISODOW

rok · 2022-08-12T18:14:10Z

Hi, may be add some add some abbreviations？ like DOW, DOY and ISODOW

I'm not sure substrait supports abbreviations.
Also, my understanding is substrait is not meant to be user facing and abbreviations don't really make sense.

jvanstraten

In general, timezone strings need to be documented. If they're well-defined in some spec I'm not aware of yet, a link would be sufficient. We might also want to consider requiring them to be constant, but I'm not sure about that one.

If I were to implement the primitives for timezone handling I would do something like this:

timezone_offset(string, timestamp_tz) -> interval_day to compute the timezone offset for a given timezone string at the given point in time (so DST/regulatory changes can be handled).
timezone_offset(string, timestamp) -> interval_day to do the above but for local time, if possible (if the timezone string refers to a region rather than being explicit about its offset to UTC, the result of this conversion may be ambiguous or may not exist due to DST and regulatory changes; I guess this would return null if unknown).
timezone_offset(string) -> interval_day like above, but fails if the timezone is ambiguous about DST etc.
add(timestamp_tz, interval_day) -> timestamp to represent a fixed point in time with a given timestamp offset.
subtract(timestamp, interval_day) -> timestamp_tz to combine a local timestamp with the local timezone to get a fixed point in time.

where interval_day is (ab)used as a timezone offset w.r.t. UTC for lack of a better type. Some integer with some documented unit would also work. Using this method, all timezone parsing nonsense can be confined to a single function.

@jacques-n curious to hear your opinion about timezone handling.

Semi-related: it looks like add(timestamp_tz, interval_year, [string]) -> timestamp and add(timestamp_tz, interval_day, [string]) -> timestamp are defined incorrectly; they should probably have been returning timestamp_tz.

extensions/functions_datetime.yaml

jvanstraten · 2022-08-15T11:01:13Z

extensions/functions_datetime.yaml

@@ -40,6 +64,7 @@ scalar_functions:
      - args:
          - value: timestamp_tz
          - value: interval_year
+          - value: string


Where are all these strings coming from (in the other functions as well)? This smells like Arrow timezone shenanigans. Substrait timestamp_tz values don't conceptually have timezones, nor do they need them for computations. They represent fixed points in time, typically implemented as offsets to the Unix epoch, which is defined in UTC.

If interval_year is a physical unit and timestamp_tz is in UTC I agree.

I've reread and since interval_year is a count of days and hence not a physical unit we need the timezone. Imagine the case where you add time to a timestamp_tz and cross a DST switch.

You're right, I didn't consider that interval_year isn't physical. My bad.

jvanstraten · 2022-08-15T11:06:07Z

extensions/functions_datetime.yaml

            name: The part of the value to extract.
            required: true
          - value: timestamp_tz
+          - value: string


I can see why you might want a timezone string here, but at the very least it must be documented that that is what it is, and which formats it supports. However, I would much rather see the reverse of assume_timezone, e.g. with_timezone(string, timestamp_tz) -> timestamp, in which case this implementation can just always use UTC, and a user could write extract(..., with_timezone(timezone, timestamp)) if they want a different one. That way, the existing function signature doesn't need to be changed, and the functions represent more primitive operations (only assume_timezone and with_timezone would need to understand what timezones are, which is great because timezones are terrible).

I'm not sure doing extract(assume_timezone(timestamp, timezone), component) would work. As far as I understand extracted components should be local. So timezone would need to be passed to extract as well.
e.g. extract(assume_timezone(timestamp, timezone), component, timezone)

As far as I understand extracted components should be local

Okay, on second thought, that makes sense. I was thinking in UTC, but timezone_tz is supposed to be agnostic of that.

jvanstraten · 2022-08-15T11:17:27Z

extensions/functions_datetime.yaml

+          - options: [ YEAR, ISO_YEAR, QUARTER, MONTH, DAY, DAY_OF_WEEK, ISO_DAY_OF_WEEK, DAY_OF_YEAR, WEEK,
+                       ISO_WEEK, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, NANOSECOND, SUBSECOND, EPOCH,
+                       TIMEZONE_OFFSET ]


It's not clear to me what MILLISECOND, MICROSECOND and NANOSECOND are relative to. Does for example extract(MICROSECOND, 2022-08-15T11:08:24.123456789) yield 24123456, 123456, or 456?

What does SUBSECOND mean?

What does EPOCH mean?

I'm not 100% convinced TIMEZONE_OFFSET belongs here, even if we keep the timezone string argument. Is that supposed to depend on the timestamp due to regulatory changes, DST, and whatnot? If so, can we please restrict that nonsense to dedicated functions converting between timestamp and timestamp_tz instead? If not, the result would not depend on the timestamp_tz value.

If we keep it, with what unit are TIMEZONE_OFFSETs represented (minutes? seconds?)

What's the difference between YEAR and ISO_YEAR?

Do MONTH, DAY, ..., ISO_WEEK start counting at 1 or 0?

Which day is the first day of the week?

Please elaborate in the function description; it should probably have an exhaustive description of all the options, including the ones that were already defined prior.

I've added a description to extract function that should address all questions above.
TIMEZONE_OFFSET should return seconds because that's the precision provided by the tzdb.

Well perhaps microseconds could work for TIMEZONE_OFFSET too since everything is microseconds. I don't have a preference, leaving seconds for now.

I'm fine with seconds, that's good enough for timezone offsets.

extensions/functions_datetime.yaml

rok · 2022-08-15T14:52:24Z

@jvanstraten thanks for the review! I'll try to address it today or tomorrow.

rok · 2022-08-21T02:44:03Z

Apologies @jvanstraten I didn't get to this and will now be traveling for a week. Will reply after.

jacques-n · 2022-08-22T23:40:31Z

| curious to hear your opinion about timezone handling

It's mostly just a red-hot hatred. Why we can't all use julian time is beyond me. Why do I care if 8pm is at night one place and during the day another?

I suppose that isn't what you were asking. My main experience is just to copy a bunch of SQL databases. In general, I consider timezone to be a display problem, not a processing problem. Some problems with timezones in general is that there are illegal times (bleh) and timezone rules are not immutable. Beyond that, I have no strong opinions. The main thing is for people to convert display time between timezones and interact with timeline values.

rok · 2022-09-12T22:30:41Z

In general, timezone strings need to be documented. If they're well-defined in some spec I'm not aware of yet, a link would be sufficient. We might also want to consider requiring them to be constant, but I'm not sure about that one.

I would propose we use IANA timezone database, (wikipedia) which tracks time zone offsets here. To my understanding that is what most DBs do.

rok · 2022-09-12T22:31:18Z

Also adding this table for completness:
Temporal extractions in various DBs. I'll give it another pass and then update the PR as well.

BigQuery [1]	Postgres [2]	MySQL [3]	Arrow Acero [4]	ClickHouse [5]	Description
	EPOCH			toUnixTimestamp	The number of seconds since 1970-01-01 00:00:00 UTC
			SUBSECOND		The seconds field, including fractional parts
			NANOSECOND		The seconds field, including fractional parts, multiplied by 1000000000
MICROSECOND	MICROSECONDS	MICROSECOND	MICROSECOND		The seconds field, including fractional parts, multiplied by 1000000
MILLISECOND	MILLISECONDS		MILLISECONDS		The seconds field, including fractional parts, multiplied by 1000
SECOND	SECOND	SECOND	SECOND	toSecond	Return the second (0-59)
MINUTE	MINUTE	MINUTE	MINUTE	toMinute	Return the minute from the argument
HOUR	HOUR	HOUR	HOUR	toHour	The hour (0-23)
DAY	DAY	DAY	DAY	toDayOfMonth	The day of the month (1-31)
DAYOFYEAR	DOY	DAYOFYEAR	DAY_OF_YEAR	toDayOfYear	The day of year that ranges from 1 to 366
MONTH	MONTH	MONTH	MONTH	toMonth	Return the month from the date passed
QUARTER	QUARTER	QUARTER	QUARTER	toQuarter	Return the quarter from a date argument
YEAR	YEAR	YEAR	YEAR	toYear	Return the year
ISOYEAR	ISOYEAR		ISO_YEAR	toISOYear	Returns the ISO 8601 week-numbering year.
			US_YEAR		Returns the US_WEEK week-numbering year.
	DECADE				The decade that is the year divided by 10
	CENTURY				The century
	MILLENNIUM				The millenium
DAYOFWEEK	DOW	DAYOFWEEK			The day of week Sunday (0) to Saturday (6)
	ISODOW	WEEKDAY	DAY_OF_WEEK	toDayOfWeek	Day of week based on ISO 8601 Monday (1) to Sunday (7)
WEEK		WEEK	WEEK		Return the week number (0-53)
		WEEKOFYEAR	US_WEEK		Return the calendar week of the date (1-53)
WEEK()				toWeek	Weeks begin on WEEKDAY. Times prior to the first WEEKDAY of the year are in week 0.
ISOWEEK	WEEK		ISO_WEEK	toISOWeek	Returns the ISO 8601 week number of the date_expression.
		YEARWEEK		toYearWeek	Return the year and week
			ISO_CALENDAR		Struct {iso_year, iso_week, iso_day_of_week}
			YEAR_MONTH_DAY		Struct {year, month, day}
		LAST_DAY			Returns the last day of the month.
DATE		DATE		toDate	Extract DATE type object
TIME		TIME		toTime	Extract TIME type object
			IS_DST		Boolean - is instant in daylight savings regime.
			IS_LEAP_YEAR		Boolean - is instant in leap year.
	TIMEZONE			timeZoneOf	The timezone offset from UTC, measured in seconds
	TIMEZONE_HOUR				The hour component of the time zone offset
	TIMEZONE_MINUTE				The minute component of the time zone offset
	TIMEZONE_MINUTE			timeZoneOffset	Timezone offset in seconds from UTC

[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/datetime_functions
[2] https://www.postgresql.org/docs/devel/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
[3] https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html
[4] https://arrow.apache.org/docs/dev/cpp/compute.html?highlight=is_dst#temporal-component-extraction

rok · 2022-09-12T22:44:47Z

I've also added temporal rounding, see examples here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_temporal_test.cc#L2733.

jvanstraten

This is great! It's also a lot to look at in one go, so I don't know if I caught everything... but either way, I don't think any of the remaining things are fundamental problems.

extensions/functions_datetime.yaml

rok · 2022-09-15T16:48:44Z

How would you express adding N days (calendar days, not 86400s) to timestamp_tz?

rok · 2022-09-15T19:18:15Z

@jvanstraten thanks for the review! I think I addressed all the points and would kindly ask for another review round.

rok · 2022-09-15T19:22:30Z

extensions/functions_datetime.yaml

+  -
+    name: "cast_timezone"
+    description: >-
+      Convert UTC-relative timestamp_tz to local timestamp using given local time's timezone.
+
+      Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones).
+      Examples: "Pacific/Marquesas", "Etc/GMT+1".
+      If timezone is not present then UTC is assumed. If timezone is invalid an error is thrown.
+    impls:
+      - args:
+          - name: x
+            value: timestamp_tz
+          - name: timezone
+            description: Timezone string from IANA tzdb.
+            value: string
+        return: timestamp


I would prefer to actually not offer this option until it's actually requested. In my experience local times without the zone information are not very useful and if you just have to know the wall time to show to the user there's strftime.

It's pretty fundamental in order to add/subtract calendar units to a timestamp_tz though, or to convert from one timezone to another, so I don't really see this as a formatting thing at all. I don't know. Keep in mind though that a consumer doesn't have to implement everything in Substrait core. If it's not implemented and doesn't end up being used much or at all there's no harm done IMO.

rok · 2022-09-15T19:30:25Z

extensions/functions_datetime.yaml

+    name: "round_calendar"
+    description: >-
+      Round a given timestamp/date/time to a multiple of a time unit. Time unit can be physical or calendar based.
+      Input value is floored to multiple of temporal units until the next origin unit. Conversely value would be
+      ceiled to the multiple of temporal units since the last origin unit. Rounding would return the closest of the
+      ceiled and floored results.
+
+      Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones).
+      Examples: "Pacific/Marquesas", "Etc/GMT+1".
+      If timezone is not present then UTC is assumed. If timezone is invalid an error is thrown.
+    impls:
+      - args:
+          - name: x
+            value: timestamp
+          - name: rounding
+            options: [ FLOOR, CEIL, ROUND ]
+            required: true
+          - name: unit
+            options: [ YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND ]
+            required: true
+          - name: origin
+            options: [ YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND ]
+            required: true
+          - name: multiple
+            value: i64
+        return: timestamp
+      - args:
+          - name: x
+            value: timestamp_tz
+          - name: rounding
+            options: [ FLOOR, CEIL, ROUND ]
+            required: true
+          - name: unit
+            options: [ YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND ]
+            required: true
+          - name: origin
+            options: [ YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND ]
+            required: true
+          - name: multiple
+            value: i64
+          - name: timezone
+            description: Timezone string from IANA tzdb.
+            value: string
+        return: timestamp_tz
+      - args:
+          - name: x
+            value: date
+          - name: rounding
+            options: [ FLOOR, CEIL, ROUND ]
+            required: true
+          - name: unit
+            options: [ YEAR, MONTH, WEEK, DAY ]
+            required: true
+          - name: origin
+            options: [ YEAR, MONTH, WEEK, DAY ]
+            required: true
+          - name: multiple
+            value: i64
+          - name: origin
+            value: date
+        return: date
+      - args:
+          - name: x
+            value: time
+          - name: rounding
+            options: [ FLOOR, CEIL, ROUND ]
+          - name: unit
+            options: [ DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND ]
+            required: true
+          - name: origin
+            options: [ DAY, HOUR, MINUTE, SECOND, MILLISECOND ]
+            required: true
+          - name: multiple
+            value: i64
+          - name: origin
+            value: time
+        return: time


Rounding to unit-based-origin is lubridate behaviour and as per experiences in the Acero rounding kernel it is best separated from the instant-based-origin.

This behavior comes across as really confusing to me, but I suppose it does seem to cover the edge cases.

It's useful if you want to round to say 3 days, but not from 1970-01-01 but rather from the beginning of each month.
Rounding timestamps (in local time) is a deep rabbit hole.

I understand what it's useful for, it's mostly just the combination of floor/ceil and since/until in that description that's confusing me. So I ended up working out manually what your description actually implies in terms of behavior when taken to the letter.

Input value is floored to multiple of temporal units until the next origin unit.

-> Round input value down such that there is an integer number of multiples between it and the next origin.

though it could also be interpreted as

-> Round input value up such that there is an integer number of multiples between it and the next origin.

if "floored" refers to the difference between the input and the next origin unit, rather than the input itself. I'll call those "floored value" and "floored difference". Based on the rounding options argument I guess the correct one is the one that actually floors the value, but it makes a huge mess for the rounded version.

Conversely value would be ceiled to the multiple of temporal units since the last origin unit.

-> Round input value up such that there is an integer number of multiples between it and the previous origin.

This is not ambiguous because there's no negation here due to the "time until" and always ceils. However, the way it's written implies to me that it both flooring and ceiling happen at the same and that an assertion is made that this is the same thing or something. It'd be very helpful to actually refer to the options here, i.e. "Conversely, if rounding is set to CEILING, value would be ceiled...".

Rounding would return the closest of the ceiled and floored results.

-> Return whichever of the above two values is closer to the input.

I think this is actually ambiguous though. What if the input is equidistant to both options?

I ended up just writing my various algorithm interpretations as a Python script to try it out, because I got way confused. The script operates on integers rather than calendar units, but the logic is the same. I used multiple = 300 and origin = 1000 (so that could be rounding to multiples of 300 milliseconds with the origin set to whole seconds) to hit all corner cases due to indivisibility and halfway values.

Results

Input Floored (val) Floored (dif) Ceiled Rounded (val) Rounded (dif)

-100 -300 0 -100 -100 -100

-50 -300 0 200 -300 or 200 0

0 0 0 0 0 0

50 -200 100 300 -200 or 300 100

100 100 100 300 100 100

150 100 400 300 100 300

200 100 400 300 100 or 300 300

250 100 400 300 300 300

300 100 400 300 300 300

350 100 400 600 100 or 600 400

400 400 400 600 400 400

450 400 700 600 400 600

500 400 700 600 400 or 600 600

550 400 700 600 600 600

600 400 700 600 600 600

650 400 700 900 400 or 900 700

700 700 700 900 700 700

750 700 1000 900 700 900

800 700 1000 900 700 or 900 900

850 700 1000 900 900 900

900 700 1000 900 900 900

950 700 1000 1200 700 or 1200 1000

1000 1000 1000 1000 1000 1000

1050 800 1100 1300 800 or 1300 1100

1100 1100 1100 1300 1100 1100

import math def ceil_multiple(input, multiple): """Round integer up to multiple from 0.""" return (input + multiple - 1) // multiple * multiple def floor_multiple(input, multiple): """Round integer down to multiple from 0.""" return input // multiple * multiple def closest(input, a, b): """Pick a or b depending on which is closer to input. Return result as a string. If a and b are different and equidistant from input, return the string "{a} or {b}".""" if a == b: return f"{a}" da = abs(input - a) db = abs(input - b) if da < db: return f"{a}" elif db < da: return f"{b}" else: return f"{a} or {b}" def round_calendar(input, multiple, origin): """Implement the proposed rounding logic. Rather than returning the result, print a table line with the intermediate results.""" # "Input value is floored to multiple of temporal units until the next origin unit." next_origin = ceil_multiple(input, origin) # Both interpretations: floored_value = floor_multiple(input - next_origin, multiple) + next_origin floored_difference = next_origin - floor_multiple(next_origin - input, multiple) # "Conversely value would be ceiled to the multiple of temporal units since the last origin unit." previous_origin = floor_multiple(input, origin) ceiled = ceil_multiple(input - previous_origin, multiple) + previous_origin # "Rounding would return the closest of the ceiled and floored results." rounded_value = closest(input, floored_value, ceiled) rounded_difference = closest(input, floored_difference, ceiled) print(f"| {input:>5} | {floored_value:>13} | {floored_difference:>13} | {ceiled:>6} | {rounded_value:>13} | {rounded_difference:>13} |") multiple = 300 origin = 1000 print("| Input | Floored (val) | Floored (dif) | Ceiled | Rounded (val) | Rounded (dif) |") print("|-------|---------------|---------------|--------|---------------|---------------|") for input in range(-100, 1101, 50): round_calendar(input, multiple, origin)

If we're using the interpretation of the floored version that actually floored, the rounding version makes a completely indecipherable mess, so I'm pretty sure there's something fishy about the way the operation is currently described...

I've thought about this some more and this is how I would write/implement this instead:

Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple of origin in the given timezone, the origin point is chosen as one of the two nearest multiples. Which of these is chosen is governed by origin_rounding: FLOOR means to use the earlier one, CEIL means to use the later one, ROUND_TIE_DOWN means to choose the nearest and tie to the earlier one if equidistant, ROUND_TIE_UP means to choose the nearest and tie to the later one if equidistant. If the given timestamp is not already an exact multiple of multiple units away from the chosen origin point, the function chooses between one of the two nearest multiples. Which of these is chosen is governed by rounding, using the same logic as origin_rounding.

Results

For equal rounding and origin_rounding:

Input Floor Ceil Round-tie-up Round-tie-down

-100 -100 0 0 0

-50 -100 0 0 0

0 0 0 0 0

50 0 100 0 0

100 0 100 0 0

150 0 400 300 0

200 0 400 300 300

250 0 400 300 300

300 300 400 300 300

350 300 400 300 300

400 300 400 300 300

450 300 700 600 300

500 300 700 400 600

550 300 700 700 400

600 600 700 700 700

650 600 700 700 700

700 600 700 700 700

750 600 1000 700 700

800 600 1000 700 700

850 600 1000 1000 700

900 900 1000 1000 1000

950 900 1000 1000 1000

1000 1000 1000 1000 1000

1050 1000 1100 1000 1000

1100 1000 1100 1000 1000

import math def floor_multiple(input, multiple): """Round integer down to multiple from 0.""" return input // multiple * multiple def ceil_multiple(input, multiple): """Round integer up to multiple from 0.""" return (input + multiple - 1) // multiple * multiple def round_tie_down_multiple(input, multiple): """Round integer to the nearest multiple from 0, tie down if equidistant.""" return (input + (multiple - 1) // 2) // multiple * multiple def round_tie_up_multiple(input, multiple): """Round integer to the nearest multiple from 0, tie down if equidistant.""" return (input + multiple // 2) // multiple * multiple def round_multiple(input, multiple, method): return { "FLOOR": floor_multiple, "CEIL": ceil_multiple, "ROUND_TIE_DOWN": round_tie_down_multiple, "ROUND_TIE_UP": round_tie_up_multiple, }[method](input, multiple) def round_calendar(input, multiple, origin, rounding, origin_rounding): """Implement the proposed rounding logic.""" # "If the given timestamp is not already an exact multiple of `origin` in # the given `timezone`, the origin point is chosen as one of the two # nearest multiples." origin = round_multiple(input, origin, origin_rounding) # "If the given timestamp is not already an exact multiple of `multiple` # `unit`s away from the chosen origin point, the function chooses between # one of the two nearest multiples." return round_multiple(input - origin, multiple, rounding) + origin multiple = 300 origin = 1000 print("| Input | Floor | Ceil | Round-tie-up | Round-tie-down |") print("|-------|-------|------|--------------|----------------|") for input in range(-100, 1101, 50): floored = round_calendar(input, multiple, origin, "FLOOR", "FLOOR") ceiled = round_calendar(input, multiple, origin, "CEIL", "CEIL") rounded_tied_up = round_calendar(input, multiple, origin, "ROUND_TIE_UP", "ROUND_TIE_UP") rounded_tied_down = round_calendar(input, multiple, origin, "ROUND_TIE_DOWN", "ROUND_TIE_DOWN") print(f"| {input:>5} | {floored:>5} | {ceiled:>4} | {rounded_tied_up:>12} | {rounded_tied_down:>14} |")

There are still some shenanigans when the origin is rounded (i.e. the crossover point is halfway), but I think that's just the nature of the beast when used in conjunction with a multiple that doesn't divide the origin time unit...?

This is indeed a rabbit hole, but if we want to adopt this function in the spec, we'll have to go down it nonetheless. Might be worth splitting it off into another PR if this drags on.

Also if you'd like to use an implementation see:
https://arrow.apache.org/docs/python/generated/pyarrow.compute.floor_temporal.html

import pyarrow as pa from pyarrow import compute as pc arr = pa.array([10**14, 10**15, 10**16], type=pa.timestamp("us", "Pacific/Marquesas")) pc.floor_temporal(arr, 3, "hour", calendar_based_origin=True)

<pyarrow.lib.TimestampArray object at 0x11fb4bd60> [ 1973-03-03 09:30:00.000000, 2001-09-09 00:30:00.000000, 2286-11-20 15:30:00.000000 ]

For my definition there would be two rounding options;

- name: rounding options: [ FLOOR, CEIL, ROUND_TIE_DOWN, ROUND_TIE_UP ] - name: origin_rounding options: [ FLOOR, CEIL, ROUND_TIE_DOWN, ROUND_TIE_UP ]

(there's a double space before ROUND_TIE_UP in your suggestion, by the way; the YAML linter/format check would probably complain about that)

Using the same option for both would also work, kinda, but they do fundamentally different things; one chooses the crossover point for which origin to use (where shenanigans might occur) and the other chooses how to actually round toward that point.

In my mind origin in "round_calendar" is always before the given timestamp. And by origin I imagine like 0 in integers. Is there a benefit to choosing the origin?

In my mind origin in "round_calendar" is always before the given timestamp.

What I'm pretty sure you meant to say corresponds to origin_rounding = FLOOR by my definition, but strictly speaking your definition gets pretty weird when the incoming value aligns exactly with the origin point. You probably meant "before or equal to".

Is there a benefit to choosing the origin?

I don't know; it seemed to generalize nicely that way, but I'm not attached to it. If everyone uses FLOOR behavior for origin_rounding we can leave the option out and update the description accordingly.

You probably meant "before or equal to".

Yes.

I don't know; it seemed to generalize nicely that way, but I'm not attached to it. If everyone uses FLOOR behavior for origin_rounding we can leave the option out and update the description accordingly.

I suppose it's nicely symmetrical but I'd avoid as I've not seen it elsewhere. I'll update the description.

rok · 2022-09-15T19:37:13Z

@jorisvandenbossche do you have any input here?

jvanstraten · 2022-09-15T20:13:49Z

How would you express adding N days (calendar days, not 86400s) to timestamp_tz?

According to the definition of intervals that Substrait currently uses, there is no functional difference between "1 day" and "86400 seconds". In both cases you'd just add that amount of time, because timestamp_tz represents "real time" that is unburdened by calendar shenanigans.

If you want to add "1 day" in the calendar sense, which I guess may be 23, 24, or 25 hours if DST changes, or even something far worse if the timezone changes outright due to legislative changes or leap seconds, you would convert the timestamp_tz to the timestamp in the appropriate timezone to get the "calendar time," then add the interval, and then convert back to timezone_tz. Either conversion might fail (ambiguity vs non-existence of certain timestamps due to DST changes), but the additions themselves won't.

rok · 2022-09-15T20:23:52Z

How would you express adding N days (calendar days, not 86400s) to timestamp_tz?

you would convert the timestamp_tz to the timestamp in the appropriate timezone to get the "calendar time," then add the interval, and then convert back to timezone_tz. Either conversion might fail (ambiguity vs non-existence of certain timestamps due to DST changes), but the additions themselves won't.

That wouldn't work for month or year.

jvanstraten · 2022-09-15T20:27:28Z

That's a good point. I would just not provide an overload for adding interval_year to timestamp_tz directly, and leave that only for timestamp, since interval_year by definition uses calendar-based units rather than real time units. But I might very well have glossed over that in review thus far.

ETA: I think that's indeed what happened. I guess I can also live with keeping the overload with the timezone argument as a shorthand notation.

Unrelated, I'm now noticing a bunch of broken overloads for add in the existing definitions, going from timestamp_tz to timestamp during the conversion and such. They all return timestamp in fact. subtract seems correct, interestingly... Could you fix those while we're making all these changes anyway?

rok · 2023-02-17T14:07:23Z

@westonpace please note I've also changed the week extraction and calendrical rounding somewhat, to allow for all possible modes.

rok · 2023-02-17T18:03:22Z

Removed UTC wording as suggested in 337.

extensions/functions_datetime.yaml

rok · 2023-02-23T20:31:51Z

@westonpace thanks for the review, I think I covered all your points. The CI issue is due to some API rate limit.

Co-authored-by: Weston Pace <[email protected]> Co-authored-by: Jeroen van Straten <[email protected]>

westonpace

Thanks for sticking with this. This was quite the undertaking.

rok · 2023-02-28T10:22:46Z

Thanks for reviews and feedback @westonpace! I'm really happy this is done.

Fix spec violation described in #487 by separating `strptime` into `strptime_timestamp`, `strptime_date`, and `strptime_time` Here's the original PR adding this in case anyone wants to see the discussion surrounding this addition: #272

rok mentioned this pull request Jul 29, 2022

feat: adding temporal functions #222

Closed

jvanstraten requested changes Aug 15, 2022

View reviewed changes

jvanstraten changed the title ~~feat: adding temporal functions~~ feat: add temporal functions Aug 29, 2022

rok force-pushed the add_temporal_function branch 2 times, most recently from d8b584e to ad56a10 Compare September 12, 2022 21:01

rok force-pushed the add_temporal_function branch from db50626 to 599accd Compare September 12, 2022 22:42

rok requested a review from jvanstraten September 12, 2022 22:46

jvanstraten requested changes Sep 13, 2022

View reviewed changes

rok force-pushed the add_temporal_function branch 2 times, most recently from 0b18f48 to de566ae Compare September 15, 2022 19:17

rok commented Sep 15, 2022

View reviewed changes

rok force-pushed the add_temporal_function branch from de566ae to 3559bff Compare September 15, 2022 19:36

rok requested a review from jvanstraten September 15, 2022 19:37

rok force-pushed the add_temporal_function branch from c6b2e40 to 79f39c1 Compare September 15, 2022 20:37

rok force-pushed the add_temporal_function branch 2 times, most recently from 373ac1a to 15c5820 Compare February 17, 2023 18:01

rok requested review from westonpace and removed request for martin-g February 17, 2023 19:09

rok force-pushed the add_temporal_function branch 4 times, most recently from 0003578 to 0b92d38 Compare February 23, 2023 19:47

westonpace requested changes Feb 23, 2023

View reviewed changes

rok commented Feb 23, 2023

View reviewed changes

extensions/functions_datetime.yaml Outdated Show resolved Hide resolved

rok force-pushed the add_temporal_function branch 2 times, most recently from bf2edf3 to b1f65eb Compare February 23, 2023 20:25

rok requested a review from westonpace February 24, 2023 00:32

rok force-pushed the add_temporal_function branch 3 times, most recently from 04feab2 to 8bc3758 Compare February 27, 2023 19:33

feat: add temporal functions

637a062

Co-authored-by: Weston Pace <[email protected]> Co-authored-by: Jeroen van Straten <[email protected]>

rok force-pushed the add_temporal_function branch from 8bc3758 to 637a062 Compare February 27, 2023 19:34

westonpace approved these changes Feb 28, 2023

View reviewed changes

westonpace closed this Feb 28, 2023

westonpace reopened this Feb 28, 2023

westonpace merged commit beb104b into substrait-io:main Feb 28, 2023

Tishj mentioned this pull request Mar 20, 2023

Support temporal functions duckdb - ibis substrait-io/duckdb-substrait-extension#50

Closed

vbarua mentioned this pull request Apr 4, 2023

strptime function definition violates spec #487

Closed

richtia mentioned this pull request Apr 10, 2023

fix: separate strptime to fix spec violation #493

Merged

Input	Floored (val)	Floored (dif)	Ceiled	Rounded (val)	Rounded (dif)
-100	-300	0	-100	-100	-100
-50	-300	0	200	-300 or 200	0
0	0	0	0	0	0
50	-200	100	300	-200 or 300	100
100	100	100	300	100	100
150	100	400	300	100	300
200	100	400	300	100 or 300	300
250	100	400	300	300	300
300	100	400	300	300	300
350	100	400	600	100 or 600	400
400	400	400	600	400	400
450	400	700	600	400	600
500	400	700	600	400 or 600	600
550	400	700	600	600	600
600	400	700	600	600	600
650	400	700	900	400 or 900	700
700	700	700	900	700	700
750	700	1000	900	700	900
800	700	1000	900	700 or 900	900
850	700	1000	900	900	900
900	700	1000	900	900	900
950	700	1000	1200	700 or 1200	1000
1000	1000	1000	1000	1000	1000
1050	800	1100	1300	800 or 1300	1100
1100	1100	1100	1300	1100	1100

feat: add temporal functions #272

feat: add temporal functions #272

Conversation

rok commented Jul 29, 2022

rok commented Aug 3, 2022

spevenhe commented Aug 12, 2022

rok commented Aug 12, 2022

jvanstraten left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rok commented Aug 15, 2022

rok commented Aug 21, 2022

jacques-n commented Aug 22, 2022

rok commented Sep 12, 2022

rok commented Sep 12, 2022

rok commented Sep 12, 2022

jvanstraten left a comment

Choose a reason for hiding this comment

rok commented Sep 15, 2022

rok commented Sep 15, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jvanstraten Sep 19, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rok commented Sep 15, 2022

jvanstraten commented Sep 15, 2022

rok commented Sep 15, 2022

jvanstraten commented Sep 15, 2022 • edited Loading

rok commented Feb 17, 2023 • edited Loading

rok commented Feb 17, 2023

rok commented Feb 23, 2023

westonpace left a comment

Choose a reason for hiding this comment

rok commented Feb 28, 2023

jvanstraten Sep 19, 2022 •

edited

Loading

jvanstraten commented Sep 15, 2022 •

edited

Loading

rok commented Feb 17, 2023 •

edited

Loading