Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add temporal functions #272

Merged
merged 1 commit into from
Feb 28, 2023

Conversation

rok
Copy link
Contributor

@rok rok commented Jul 29, 2022

This adds new scalar functions to work with timestamp/date/time types.
See #222 for previous discussion.

@rok
Copy link
Contributor Author

rok commented Aug 3, 2022

@jacques-n

@spevenhe
Copy link

Hi, may be add some add some abbreviations? like DOW, DOY and ISODOW

@rok
Copy link
Contributor Author

rok commented Aug 12, 2022

Hi, may be add some add some abbreviations? like DOW, DOY and ISODOW

I'm not sure substrait supports abbreviations.
Also, my understanding is substrait is not meant to be user facing and abbreviations don't really make sense.

Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, timezone strings need to be documented. If they're well-defined in some spec I'm not aware of yet, a link would be sufficient. We might also want to consider requiring them to be constant, but I'm not sure about that one.

If I were to implement the primitives for timezone handling I would do something like this:

  • timezone_offset(string, timestamp_tz) -> interval_day to compute the timezone offset for a given timezone string at the given point in time (so DST/regulatory changes can be handled).
  • timezone_offset(string, timestamp) -> interval_day to do the above but for local time, if possible (if the timezone string refers to a region rather than being explicit about its offset to UTC, the result of this conversion may be ambiguous or may not exist due to DST and regulatory changes; I guess this would return null if unknown).
  • timezone_offset(string) -> interval_day like above, but fails if the timezone is ambiguous about DST etc.
  • add(timestamp_tz, interval_day) -> timestamp to represent a fixed point in time with a given timestamp offset.
  • subtract(timestamp, interval_day) -> timestamp_tz to combine a local timestamp with the local timezone to get a fixed point in time.

where interval_day is (ab)used as a timezone offset w.r.t. UTC for lack of a better type. Some integer with some documented unit would also work. Using this method, all timezone parsing nonsense can be confined to a single function.

@jacques-n curious to hear your opinion about timezone handling.

Semi-related: it looks like add(timestamp_tz, interval_year, [string]) -> timestamp and add(timestamp_tz, interval_day, [string]) -> timestamp are defined incorrectly; they should probably have been returning timestamp_tz.

@@ -40,6 +64,7 @@ scalar_functions:
- args:
- value: timestamp_tz
- value: interval_year
- value: string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are all these strings coming from (in the other functions as well)? This smells like Arrow timezone shenanigans. Substrait timestamp_tz values don't conceptually have timezones, nor do they need them for computations. They represent fixed points in time, typically implemented as offsets to the Unix epoch, which is defined in UTC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If interval_year is a physical unit and timestamp_tz is in UTC I agree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reread and since interval_year is a count of days and hence not a physical unit we need the timezone. Imagine the case where you add time to a timestamp_tz and cross a DST switch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I didn't consider that interval_year isn't physical. My bad.

name: The part of the value to extract.
required: true
- value: timestamp_tz
- value: string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see why you might want a timezone string here, but at the very least it must be documented that that is what it is, and which formats it supports. However, I would much rather see the reverse of assume_timezone, e.g. with_timezone(string, timestamp_tz) -> timestamp, in which case this implementation can just always use UTC, and a user could write extract(..., with_timezone(timezone, timestamp)) if they want a different one. That way, the existing function signature doesn't need to be changed, and the functions represent more primitive operations (only assume_timezone and with_timezone would need to understand what timezones are, which is great because timezones are terrible).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure doing extract(assume_timezone(timestamp, timezone), component) would work. As far as I understand extracted components should be local. So timezone would need to be passed to extract as well.
e.g. extract(assume_timezone(timestamp, timezone), component, timezone)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand extracted components should be local

Okay, on second thought, that makes sense. I was thinking in UTC, but timezone_tz is supposed to be agnostic of that.

Comment on lines 16 to 18
- options: [ YEAR, ISO_YEAR, QUARTER, MONTH, DAY, DAY_OF_WEEK, ISO_DAY_OF_WEEK, DAY_OF_YEAR, WEEK,
ISO_WEEK, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, NANOSECOND, SUBSECOND, EPOCH,
TIMEZONE_OFFSET ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It's not clear to me what MILLISECOND, MICROSECOND and NANOSECOND are relative to. Does for example extract(MICROSECOND, 2022-08-15T11:08:24.123456789) yield 24123456, 123456, or 456?
  • What does SUBSECOND mean?
  • What does EPOCH mean?
  • I'm not 100% convinced TIMEZONE_OFFSET belongs here, even if we keep the timezone string argument. Is that supposed to depend on the timestamp due to regulatory changes, DST, and whatnot? If so, can we please restrict that nonsense to dedicated functions converting between timestamp and timestamp_tz instead? If not, the result would not depend on the timestamp_tz value.
  • If we keep it, with what unit are TIMEZONE_OFFSETs represented (minutes? seconds?)
  • What's the difference between YEAR and ISO_YEAR?
  • Do MONTH, DAY, ..., ISO_WEEK start counting at 1 or 0?
  • Which day is the first day of the week?

Please elaborate in the function description; it should probably have an exhaustive description of all the options, including the ones that were already defined prior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a description to extract function that should address all questions above.
TIMEZONE_OFFSET should return seconds because that's the precision provided by the tzdb.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well perhaps microseconds could work for TIMEZONE_OFFSET too since everything is microseconds. I don't have a preference, leaving seconds for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with seconds, that's good enough for timezone offsets.

@rok
Copy link
Contributor Author

rok commented Aug 15, 2022

@jvanstraten thanks for the review! I'll try to address it today or tomorrow.

@rok
Copy link
Contributor Author

rok commented Aug 21, 2022

Apologies @jvanstraten I didn't get to this and will now be traveling for a week. Will reply after.

@jacques-n
Copy link
Contributor

| curious to hear your opinion about timezone handling

It's mostly just a red-hot hatred. Why we can't all use julian time is beyond me. Why do I care if 8pm is at night one place and during the day another?

I suppose that isn't what you were asking. My main experience is just to copy a bunch of SQL databases. In general, I consider timezone to be a display problem, not a processing problem. Some problems with timezones in general is that there are illegal times (bleh) and timezone rules are not immutable. Beyond that, I have no strong opinions. The main thing is for people to convert display time between timezones and interact with timeline values.

@jvanstraten jvanstraten changed the title feat: adding temporal functions feat: add temporal functions Aug 29, 2022
@rok rok force-pushed the add_temporal_function branch 2 times, most recently from d8b584e to ad56a10 Compare September 12, 2022 21:01
@rok
Copy link
Contributor Author

rok commented Sep 12, 2022

In general, timezone strings need to be documented. If they're well-defined in some spec I'm not aware of yet, a link would be sufficient. We might also want to consider requiring them to be constant, but I'm not sure about that one.

I would propose we use IANA timezone database, (wikipedia) which tracks time zone offsets here. To my understanding that is what most DBs do.

@rok
Copy link
Contributor Author

rok commented Sep 12, 2022

Also adding this table for completness:
Temporal extractions in various DBs. I'll give it another pass and then update the PR as well.

BigQuery [1] Postgres [2] MySQL [3] Arrow Acero [4] ClickHouse [5] Description
EPOCH toUnixTimestamp The number of seconds since 1970-01-01 00:00:00 UTC
SUBSECOND The seconds field, including fractional parts
NANOSECOND The seconds field, including fractional parts, multiplied by 1000000000
MICROSECOND MICROSECONDS MICROSECOND MICROSECOND The seconds field, including fractional parts, multiplied by 1000000
MILLISECOND MILLISECONDS MILLISECONDS The seconds field, including fractional parts, multiplied by 1000
SECOND SECOND SECOND SECOND toSecond Return the second (0-59)
MINUTE MINUTE MINUTE MINUTE toMinute Return the minute from the argument
HOUR HOUR HOUR HOUR toHour The hour (0-23)
DAY DAY DAY DAY toDayOfMonth The day of the month (1-31)
DAYOFYEAR DOY DAYOFYEAR DAY_OF_YEAR toDayOfYear The day of year that ranges from 1 to 366
MONTH MONTH MONTH MONTH toMonth Return the month from the date passed
QUARTER QUARTER QUARTER QUARTER toQuarter Return the quarter from a date argument
YEAR YEAR YEAR YEAR toYear Return the year
ISOYEAR ISOYEAR ISO_YEAR toISOYear Returns the ISO 8601 week-numbering year.
US_YEAR Returns the US_WEEK week-numbering year.
DECADE The decade that is the year divided by 10
CENTURY The century
MILLENNIUM The millenium
DAYOFWEEK DOW DAYOFWEEK The day of week Sunday (0) to Saturday (6)
ISODOW WEEKDAY DAY_OF_WEEK toDayOfWeek Day of week based on ISO 8601 Monday (1) to Sunday (7)
WEEK WEEK WEEK Return the week number (0-53)
WEEKOFYEAR US_WEEK Return the calendar week of the date (1-53)
WEEK() toWeek Weeks begin on WEEKDAY. Times prior to the first WEEKDAY of the year are in week 0.
ISOWEEK WEEK ISO_WEEK toISOWeek Returns the ISO 8601 week number of the date_expression.
YEARWEEK toYearWeek Return the year and week
ISO_CALENDAR Struct {iso_year, iso_week, iso_day_of_week}
YEAR_MONTH_DAY Struct {year, month, day}
LAST_DAY Returns the last day of the month.
DATE DATE toDate Extract DATE type object
TIME TIME toTime Extract TIME type object
IS_DST Boolean - is instant in daylight savings regime.
IS_LEAP_YEAR Boolean - is instant in leap year.
TIMEZONE timeZoneOf The timezone offset from UTC, measured in seconds
TIMEZONE_HOUR The hour component of the time zone offset
TIMEZONE_MINUTE The minute component of the time zone offset
TIMEZONE_MINUTE timeZoneOffset Timezone offset in seconds from UTC

[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/datetime_functions
[2] https://www.postgresql.org/docs/devel/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
[3] https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html
[4] https://arrow.apache.org/docs/dev/cpp/compute.html?highlight=is_dst#temporal-component-extraction

@rok rok force-pushed the add_temporal_function branch from db50626 to 599accd Compare September 12, 2022 22:42
@rok
Copy link
Contributor Author

rok commented Sep 12, 2022

@rok rok requested a review from jvanstraten September 12, 2022 22:46
Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! It's also a lot to look at in one go, so I don't know if I caught everything... but either way, I don't think any of the remaining things are fundamental problems.

@rok
Copy link
Contributor Author

rok commented Sep 15, 2022

How would you express adding N days (calendar days, not 86400s) to timestamp_tz?

@rok rok force-pushed the add_temporal_function branch 2 times, most recently from 0b18f48 to de566ae Compare September 15, 2022 19:17
@rok
Copy link
Contributor Author

rok commented Sep 15, 2022

@jvanstraten thanks for the review! I think I addressed all the points and would kindly ask for another review round.

Comment on lines 380 to 403
-
name: "cast_timezone"
description: >-
Convert UTC-relative timestamp_tz to local timestamp using given local time's timezone.

Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones).
Examples: "Pacific/Marquesas", "Etc/GMT+1".
If timezone is not present then UTC is assumed. If timezone is invalid an error is thrown.
impls:
- args:
- name: x
value: timestamp_tz
- name: timezone
description: Timezone string from IANA tzdb.
value: string
return: timestamp
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to actually not offer this option until it's actually requested. In my experience local times without the zone information are not very useful and if you just have to know the wall time to show to the user there's strftime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty fundamental in order to add/subtract calendar units to a timestamp_tz though, or to convert from one timezone to another, so I don't really see this as a formatting thing at all. I don't know. Keep in mind though that a consumer doesn't have to implement everything in Substrait core. If it's not implemented and doesn't end up being used much or at all there's no harm done IMO.

Comment on lines 541 to 616
name: "round_calendar"
description: >-
Round a given timestamp/date/time to a multiple of a time unit. Time unit can be physical or calendar based.
Input value is floored to multiple of temporal units until the next origin unit. Conversely value would be
ceiled to the multiple of temporal units since the last origin unit. Rounding would return the closest of the
ceiled and floored results.

Timezone strings must be as defined by IANA timezone database (https://www.iana.org/time-zones).
Examples: "Pacific/Marquesas", "Etc/GMT+1".
If timezone is not present then UTC is assumed. If timezone is invalid an error is thrown.
impls:
- args:
- name: x
value: timestamp
- name: rounding
options: [ FLOOR, CEIL, ROUND ]
required: true
- name: unit
options: [ YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND ]
required: true
- name: origin
options: [ YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND ]
required: true
- name: multiple
value: i64
return: timestamp
- args:
- name: x
value: timestamp_tz
- name: rounding
options: [ FLOOR, CEIL, ROUND ]
required: true
- name: unit
options: [ YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND ]
required: true
- name: origin
options: [ YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND ]
required: true
- name: multiple
value: i64
- name: timezone
description: Timezone string from IANA tzdb.
value: string
return: timestamp_tz
- args:
- name: x
value: date
- name: rounding
options: [ FLOOR, CEIL, ROUND ]
required: true
- name: unit
options: [ YEAR, MONTH, WEEK, DAY ]
required: true
- name: origin
options: [ YEAR, MONTH, WEEK, DAY ]
required: true
- name: multiple
value: i64
- name: origin
value: date
return: date
- args:
- name: x
value: time
- name: rounding
options: [ FLOOR, CEIL, ROUND ]
- name: unit
options: [ DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND ]
required: true
- name: origin
options: [ DAY, HOUR, MINUTE, SECOND, MILLISECOND ]
required: true
- name: multiple
value: i64
- name: origin
value: time
return: time
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rounding to unit-based-origin is lubridate behaviour and as per experiences in the Acero rounding kernel it is best separated from the instant-based-origin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior comes across as really confusing to me, but I suppose it does seem to cover the edge cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's useful if you want to round to say 3 days, but not from 1970-01-01 but rather from the beginning of each month.
Rounding timestamps (in local time) is a deep rabbit hole.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what it's useful for, it's mostly just the combination of floor/ceil and since/until in that description that's confusing me. So I ended up working out manually what your description actually implies in terms of behavior when taken to the letter.

Input value is floored to multiple of temporal units until the next origin unit.

-> Round input value down such that there is an integer number of multiples between it and the next origin.

though it could also be interpreted as

-> Round input value up such that there is an integer number of multiples between it and the next origin.

if "floored" refers to the difference between the input and the next origin unit, rather than the input itself. I'll call those "floored value" and "floored difference". Based on the rounding options argument I guess the correct one is the one that actually floors the value, but it makes a huge mess for the rounded version.

Conversely value would be ceiled to the multiple of temporal units since the last origin unit.

-> Round input value up such that there is an integer number of multiples between it and the previous origin.

This is not ambiguous because there's no negation here due to the "time until" and always ceils. However, the way it's written implies to me that it both flooring and ceiling happen at the same and that an assertion is made that this is the same thing or something. It'd be very helpful to actually refer to the options here, i.e. "Conversely, if rounding is set to CEILING, value would be ceiled...".

Rounding would return the closest of the ceiled and floored results.

-> Return whichever of the above two values is closer to the input.

I think this is actually ambiguous though. What if the input is equidistant to both options?

I ended up just writing my various algorithm interpretations as a Python script to try it out, because I got way confused. The script operates on integers rather than calendar units, but the logic is the same. I used multiple = 300 and origin = 1000 (so that could be rounding to multiples of 300 milliseconds with the origin set to whole seconds) to hit all corner cases due to indivisibility and halfway values.

Results
Input Floored (val) Floored (dif) Ceiled Rounded (val) Rounded (dif)
-100 -300 0 -100 -100 -100
-50 -300 0 200 -300 or 200 0
0 0 0 0 0 0
50 -200 100 300 -200 or 300 100
100 100 100 300 100 100
150 100 400 300 100 300
200 100 400 300 100 or 300 300
250 100 400 300 300 300
300 100 400 300 300 300
350 100 400 600 100 or 600 400
400 400 400 600 400 400
450 400 700 600 400 600
500 400 700 600 400 or 600 600
550 400 700 600 600 600
600 400 700 600 600 600
650 400 700 900 400 or 900 700
700 700 700 900 700 700
750 700 1000 900 700 900
800 700 1000 900 700 or 900 900
850 700 1000 900 900 900
900 700 1000 900 900 900
950 700 1000 1200 700 or 1200 1000
1000 1000 1000 1000 1000 1000
1050 800 1100 1300 800 or 1300 1100
1100 1100 1100 1300 1100 1100
import math

def ceil_multiple(input, multiple):
    """Round integer up to multiple from 0."""
    return (input + multiple - 1) // multiple * multiple

def floor_multiple(input, multiple):
    """Round integer down to multiple from 0."""
    return input // multiple * multiple

def closest(input, a, b):
    """Pick a or b depending on which is closer to input. Return result as a
    string. If a and b are different and equidistant from input, return the string
    "{a} or {b}"."""
    if a == b:
        return f"{a}"
    da = abs(input - a)
    db = abs(input - b)
    if da < db:
        return f"{a}"
    elif db < da:
        return f"{b}"
    else:
        return f"{a} or {b}"

def round_calendar(input, multiple, origin):
    """Implement the proposed rounding logic. Rather than returning the result,
    print a table line with the intermediate results."""
    # "Input value is floored to multiple of temporal units until the next origin unit."
    next_origin = ceil_multiple(input, origin)

    # Both interpretations:
    floored_value = floor_multiple(input - next_origin, multiple) + next_origin
    floored_difference = next_origin - floor_multiple(next_origin - input, multiple)

    # "Conversely value would be ceiled to the multiple of temporal units since the last origin unit."
    previous_origin = floor_multiple(input, origin)
    ceiled = ceil_multiple(input - previous_origin, multiple) + previous_origin

    # "Rounding would return the closest of the ceiled and floored results."
    rounded_value = closest(input, floored_value, ceiled)
    rounded_difference = closest(input, floored_difference, ceiled)

    print(f"| {input:>5} | {floored_value:>13} | {floored_difference:>13} | {ceiled:>6} | {rounded_value:>13} | {rounded_difference:>13} |")

multiple = 300
origin = 1000

print("| Input | Floored (val) | Floored (dif) | Ceiled | Rounded (val) | Rounded (dif) |")
print("|-------|---------------|---------------|--------|---------------|---------------|")
for input in range(-100, 1101, 50):
    round_calendar(input, multiple, origin)

If we're using the interpretation of the floored version that actually floored, the rounding version makes a completely indecipherable mess, so I'm pretty sure there's something fishy about the way the operation is currently described...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've thought about this some more and this is how I would write/implement this instead:

Round a given timestamp/date/time to a multiple of a time unit. If the given timestamp is not already an exact multiple of origin in the given timezone, the origin point is chosen as one of the two nearest multiples. Which of these is chosen is governed by origin_rounding: FLOOR means to use the earlier one, CEIL means to use the later one, ROUND_TIE_DOWN means to choose the nearest and tie to the earlier one if equidistant, ROUND_TIE_UP means to choose the nearest and tie to the later one if equidistant. If the given timestamp is not already an exact multiple of multiple units away from the chosen origin point, the function chooses between one of the two nearest multiples. Which of these is chosen is governed by rounding, using the same logic as origin_rounding.

Results

For equal rounding and origin_rounding:

Input Floor Ceil Round-tie-up Round-tie-down
-100 -100 0 0 0
-50 -100 0 0 0
0 0 0 0 0
50 0 100 0 0
100 0 100 0 0
150 0 400 300 0
200 0 400 300 300
250 0 400 300 300
300 300 400 300 300
350 300 400 300 300
400 300 400 300 300
450 300 700 600 300
500 300 700 400 600
550 300 700 700 400
600 600 700 700 700
650 600 700 700 700
700 600 700 700 700
750 600 1000 700 700
800 600 1000 700 700
850 600 1000 1000 700
900 900 1000 1000 1000
950 900 1000 1000 1000
1000 1000 1000 1000 1000
1050 1000 1100 1000 1000
1100 1000 1100 1000 1000
import math

def floor_multiple(input, multiple):
    """Round integer down to multiple from 0."""
    return input // multiple * multiple

def ceil_multiple(input, multiple):
    """Round integer up to multiple from 0."""
    return (input + multiple - 1) // multiple * multiple

def round_tie_down_multiple(input, multiple):
    """Round integer to the nearest multiple from 0, tie down if equidistant."""
    return (input + (multiple - 1) // 2) // multiple * multiple

def round_tie_up_multiple(input, multiple):
    """Round integer to the nearest multiple from 0, tie down if equidistant."""
    return (input + multiple // 2) // multiple * multiple

def round_multiple(input, multiple, method):
    return {
        "FLOOR": floor_multiple,
        "CEIL": ceil_multiple,
        "ROUND_TIE_DOWN": round_tie_down_multiple,
        "ROUND_TIE_UP": round_tie_up_multiple,
    }[method](input, multiple)

def round_calendar(input, multiple, origin, rounding, origin_rounding):
    """Implement the proposed rounding logic."""
    # "If the given timestamp is not already an exact multiple of `origin` in
    # the given `timezone`, the origin point is chosen as one of the two
    # nearest multiples."
    origin = round_multiple(input, origin, origin_rounding)

    # "If the given timestamp is not already an exact multiple of `multiple`
    # `unit`s away from the chosen origin point, the function chooses between
    # one of the two nearest multiples."
    return round_multiple(input - origin, multiple, rounding) + origin

multiple = 300
origin = 1000

print("| Input | Floor | Ceil | Round-tie-up | Round-tie-down |")
print("|-------|-------|------|--------------|----------------|")
for input in range(-100, 1101, 50):
    floored = round_calendar(input, multiple, origin, "FLOOR", "FLOOR")
    ceiled = round_calendar(input, multiple, origin, "CEIL", "CEIL")
    rounded_tied_up = round_calendar(input, multiple, origin, "ROUND_TIE_UP", "ROUND_TIE_UP")
    rounded_tied_down = round_calendar(input, multiple, origin, "ROUND_TIE_DOWN", "ROUND_TIE_DOWN")
    print(f"| {input:>5} | {floored:>5} | {ceiled:>4} | {rounded_tied_up:>12} | {rounded_tied_down:>14} |")

There are still some shenanigans when the origin is rounded (i.e. the crossover point is halfway), but I think that's just the nature of the beast when used in conjunction with a multiple that doesn't divide the origin time unit...?

This is indeed a rabbit hole, but if we want to adopt this function in the spec, we'll have to go down it nonetheless. Might be worth splitting it off into another PR if this drags on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if you'd like to use an implementation see:
https://arrow.apache.org/docs/python/generated/pyarrow.compute.floor_temporal.html

import pyarrow as pa
from pyarrow import compute as pc

arr = pa.array([10**14, 10**15, 10**16], type=pa.timestamp("us", "Pacific/Marquesas"))
pc.floor_temporal(arr, 3, "hour", calendar_based_origin=True)
<pyarrow.lib.TimestampArray object at 0x11fb4bd60>
[
  1973-03-03 09:30:00.000000,
  2001-09-09 00:30:00.000000,
  2286-11-20 15:30:00.000000
]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my definition there would be two rounding options;

          - name: rounding
            options: [ FLOOR, CEIL, ROUND_TIE_DOWN, ROUND_TIE_UP ]
          - name: origin_rounding
            options: [ FLOOR, CEIL, ROUND_TIE_DOWN, ROUND_TIE_UP ]

(there's a double space before ROUND_TIE_UP in your suggestion, by the way; the YAML linter/format check would probably complain about that)

Using the same option for both would also work, kinda, but they do fundamentally different things; one chooses the crossover point for which origin to use (where shenanigans might occur) and the other chooses how to actually round toward that point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mind origin in "round_calendar" is always before the given timestamp. And by origin I imagine like 0 in integers. Is there a benefit to choosing the origin?

Copy link
Contributor

@jvanstraten jvanstraten Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mind origin in "round_calendar" is always before the given timestamp.

What I'm pretty sure you meant to say corresponds to origin_rounding = FLOOR by my definition, but strictly speaking your definition gets pretty weird when the incoming value aligns exactly with the origin point. You probably meant "before or equal to".

Is there a benefit to choosing the origin?

I don't know; it seemed to generalize nicely that way, but I'm not attached to it. If everyone uses FLOOR behavior for origin_rounding we can leave the option out and update the description accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably meant "before or equal to".

Yes.

I don't know; it seemed to generalize nicely that way, but I'm not attached to it. If everyone uses FLOOR behavior for origin_rounding we can leave the option out and update the description accordingly.

I suppose it's nicely symmetrical but I'd avoid as I've not seen it elsewhere. I'll update the description.

@rok rok force-pushed the add_temporal_function branch from de566ae to 3559bff Compare September 15, 2022 19:36
@rok
Copy link
Contributor Author

rok commented Sep 15, 2022

@jorisvandenbossche do you have any input here?

@rok rok requested a review from jvanstraten September 15, 2022 19:37
@jvanstraten
Copy link
Contributor

How would you express adding N days (calendar days, not 86400s) to timestamp_tz?

According to the definition of intervals that Substrait currently uses, there is no functional difference between "1 day" and "86400 seconds". In both cases you'd just add that amount of time, because timestamp_tz represents "real time" that is unburdened by calendar shenanigans.

If you want to add "1 day" in the calendar sense, which I guess may be 23, 24, or 25 hours if DST changes, or even something far worse if the timezone changes outright due to legislative changes or leap seconds, you would convert the timestamp_tz to the timestamp in the appropriate timezone to get the "calendar time," then add the interval, and then convert back to timezone_tz. Either conversion might fail (ambiguity vs non-existence of certain timestamps due to DST changes), but the additions themselves won't.

@rok
Copy link
Contributor Author

rok commented Sep 15, 2022

How would you express adding N days (calendar days, not 86400s) to timestamp_tz?

you would convert the timestamp_tz to the timestamp in the appropriate timezone to get the "calendar time," then add the interval, and then convert back to timezone_tz. Either conversion might fail (ambiguity vs non-existence of certain timestamps due to DST changes), but the additions themselves won't.

That wouldn't work for month or year.

@jvanstraten
Copy link
Contributor

jvanstraten commented Sep 15, 2022

That's a good point. I would just not provide an overload for adding interval_year to timestamp_tz directly, and leave that only for timestamp, since interval_year by definition uses calendar-based units rather than real time units. But I might very well have glossed over that in review thus far.

ETA: I think that's indeed what happened. I guess I can also live with keeping the overload with the timezone argument as a shorthand notation.

Unrelated, I'm now noticing a bunch of broken overloads for add in the existing definitions, going from timestamp_tz to timestamp during the conversion and such. They all return timestamp in fact. subtract seems correct, interestingly... Could you fix those while we're making all these changes anyway?

@rok rok force-pushed the add_temporal_function branch from c6b2e40 to 79f39c1 Compare September 15, 2022 20:37
@rok
Copy link
Contributor Author

rok commented Feb 17, 2023

@westonpace please note I've also changed the week extraction and calendrical rounding somewhat, to allow for all possible modes.

@rok rok force-pushed the add_temporal_function branch 2 times, most recently from 373ac1a to 15c5820 Compare February 17, 2023 18:01
@rok
Copy link
Contributor Author

rok commented Feb 17, 2023

Removed UTC wording as suggested in 337.

@rok rok requested review from westonpace and removed request for martin-g February 17, 2023 19:09
@rok rok force-pushed the add_temporal_function branch 4 times, most recently from 0003578 to 0b92d38 Compare February 23, 2023 19:47
@rok rok force-pushed the add_temporal_function branch 2 times, most recently from bf2edf3 to b1f65eb Compare February 23, 2023 20:25
@rok
Copy link
Contributor Author

rok commented Feb 23, 2023

@westonpace thanks for the review, I think I covered all your points. The CI issue is due to some API rate limit.

@rok rok requested a review from westonpace February 24, 2023 00:32
@rok rok force-pushed the add_temporal_function branch 3 times, most recently from 04feab2 to 8bc3758 Compare February 27, 2023 19:33
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Jeroen van Straten <[email protected]>
@rok rok force-pushed the add_temporal_function branch from 8bc3758 to 637a062 Compare February 27, 2023 19:34
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with this. This was quite the undertaking.

@westonpace westonpace closed this Feb 28, 2023
@westonpace westonpace reopened this Feb 28, 2023
@westonpace westonpace merged commit beb104b into substrait-io:main Feb 28, 2023
@rok
Copy link
Contributor Author

rok commented Feb 28, 2023

Thanks for reviews and feedback @westonpace! I'm really happy this is done.

westonpace pushed a commit that referenced this pull request Apr 11, 2023
Fix spec violation described in
#487 by
separating `strptime` into `strptime_timestamp`, `strptime_date`, and
`strptime_time`

Here's the original PR adding this in case anyone wants to see the
discussion
surrounding this addition:
#272
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants