Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Sedate IXDTF string parsing #4646

Merged
merged 26 commits into from
Mar 27, 2024
Merged

Conversation

nekevss
Copy link
Contributor

@nekevss nekevss commented Mar 1, 2024

Hi! This PR is intended to close #2127.

It would add the IxdtfParser and IsoDurationParser.

I'm opening this as a draft for a bit early feedback and comments.

This is essentially a proposal to pull in the core of the parser from over in temporal_rs along with some updates I've added to allow the parser to function as a standalone as well as some general clean up.

There's still a couple things I have left to do:

  • ALOT more documentation is needed. (More docs added, can be built out further if wanted)
  • Clean up the tests / potentially add more tests.

The one question I have is that the previous experimental version of the parser had a bit more precision on the errors when it came to parsing digits. Currently, this really only throws an ExpectedDigitChar error from Cursor::next_digit. If that is too general, this could be caught/mapped in the individual functions or next_digit could return an Option<u8> to allow more exact error handling.

EDIT: I did end up removing ExpectedDigitChar as an error.

@CLAassistant
Copy link

CLAassistant commented Mar 1, 2024

CLA assistant check
All committers have signed the CLA.

@nekevss nekevss marked this pull request as ready for review March 3, 2024 19:20
@nekevss nekevss requested a review from a team as a code owner March 3, 2024 19:20
@nekevss
Copy link
Contributor Author

nekevss commented Mar 3, 2024

Marking as ready for review 😄

A few general questions:

  • What would be the preferred way of resolving an ambiguity with UTC offsets and Time Zone Annotations? Currently, this defaults to the tz annotation if provided, but maybe this isn't the best approach. It could be better to provide the UTC offset value separate from the tz annotation and allow the caller to handle any ambiguous cases.

  • Would it be preferred for IsoParseRecord to not have as many nested records as it currently does? (primary cases: DateRecord and TimeRecord).

  • IxdtfParser has a lot of parsing options beyond just extended date time (admittedly, a decent portion targeting some Temporal functionality). Would it be preferred to split some of the parsing like time zone and time into a TemporalParser over IxdtfParser?

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! Leaving some architectural feedback first; I noted some of these things when I took a pass over the code in the temporal_rs repository, but I didn't post them at that time because they seem like incremental changes from the code base.

@nekevss
Copy link
Contributor Author

nekevss commented Mar 14, 2024

General updates were made concerning the above review. There are a couple more points that need to still be addressed, but I had a question that I thought I'd pose early.

Is there any preference when handling parsing MonthDay, YearMonth, and annotated Time strings?

I'm tempted to keep those but include them as part of a separate parser from the current IxdtfParser, but I thought I'd float the question in case there was any feedback regarding go forward on those.

@sffc sffc self-requested a review March 17, 2024 15:36
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I did a more thorough review this time. Just some questions mostly about data types. Feel free to convert comments into follow-up issues as you see fit. I would like to land this soon and favor follow-up PRs.

@sffc sffc self-requested a review March 19, 2024 09:32
@nekevss nekevss requested a review from sffc March 20, 2024 04:30
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: This is the best version yet! Thanks for all the work you've been putting into this. All my comments can be done later or filed as a follow-up issue. I think we can land the first version now!

Comment on lines +65 to +77
if kv.key == "u-ca" {
if calendar.is_none() {
calendar = Some(kv.value);
calendar_crit = kv.critical;
continue;
}

if calendar_crit || kv.critical {
return Err(ParserError::CriticalDuplicateCalendar);
}
}

annotations.push(kv);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: This logic seems a little odd to me. I think what it does is:

  1. [u-ca=hebrew][!u-ca=buddhist] = error
  2. [!u-ca=hebrew][u-ca=buddhist] = error
  3. [!u-ca=hebrew] = OK
  4. [u-ca=hebrew][u-ca=buddhist] = OK with main calendar hebrew and then buddhist going into annotations

It also doesn't return whether the main calendar is critical or not, which seems like information that is being lost.

I think I lean toward the following behavior for simplicity:

  1. Change the calendar return type to be an Annotation, keeping the critical flag
  2. Use some proper algorithm for selecting which calendar is the main calendar: the first, the last, the first critical, or the last critical. i.e., which calendar wins among:
    • [u-ca=aaa][!u-ca=bbb][!u-ca=ccc][u-ca=ddd]
  3. Put all other calendars in order into the annotations list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably true. I almost switched it (like with TimeZoneAnnotation), but given that u-ca is a registered key, I was viewing it as the parser being the primary consumer, so it was fine to leave it without the critical flag included. After looking back at the spec draft, I'm conflicted but leaning a bit more towards include the annotation.

For clarity on the point in the specification I'm thinking about mostly:

   An application that encounters duplicate use of a suffix key in
   elective suffixes and does not want to perform additional processing
   on this inconsistency MUST choose the first suffix that has that key,
   i.e.,

   2022-07-08T00:14:07Z[u-ca=chinese][u-ca=japanese]
   2022-07-08T00:14:07Z[u-ca=chinese]

All that considered and given the current approach to this parser, I think it makes sense to set calendar to the Annotation for the first calendar and all the rest are moved to annotations to be handled by the user, i.e. 1 & 3, but default to the first calendar over selecting calendars

Secondary Idea:
Would it make sense to add an opt-in strict parsing option to IxdtfParser for calendar / critical flag handling? When strict the parser would through an error on any strict calendar ambiguity or unknown critical annotations, and when not strict the calendar ambiguity and unknown critical annotations are provided to the user to handle?


/// Utility that returns whether a year is a leap year.
#[inline]
fn in_leap_year(year: i32) -> i8 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: return a bool instead of an int?

/// Parsing options that can be provided to `IxdtfParser` to change parsing behavior.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum IxdtfOptions {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: We should bikeshed the name of this enum.

Comment on lines +87 to +92
match self.options {
IxdtfOptions::None => datetime::parse_annotated_date_time(&mut self.cursor),
IxdtfOptions::YearMonth => datetime::parse_annotated_year_month(&mut self.cursor),
IxdtfOptions::MonthDay => datetime::parse_annotated_month_day(&mut self.cursor),
IxdtfOptions::Time => time::parse_annotated_time_record(&mut self.cursor),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: With big match statements like this, it is better for DCE if we instead had four functions like parse, parse_year_month, parse_month_day, and parse_time. Then we wouldn't need the enum.


/// The parsed sign value, representing whether its struct is positive or negative.
#[repr(i8)]
#[allow(clippy::exhaustive_enums)] // Sign can only be positive or negative.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: there are often 3 types of signs, known as Signum, which are "explicit minus", "explicit plus", and "no sign", as in "-1", "+1" and "1". Does IXDTF make a distinction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IXDTF does not to my knowledge and from what I can find. RFC3339 does layout that the offset must be "+" or "-". So Duration is the only one that can be "no sign" as it is considered optional.

Comment on lines +150 to +157
pub enum DurationFraction {
/// The fraction value applied to an hour.
Hours(u64),
/// The fraction value applied to the minutes field.
Minutes(u64),
/// The fraction value applied to the seconds field.
Seconds(u32),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: another way to model this, which may or may not be better, would be

pub enum TimeDuration {
    None,
    Hours {
        hours: u32,
        fraction: u64,
    },
    Minutes {
        hours: u32,
        minutes: u32,
        fraction: u64,
    },
    Seconds {
        hours: u32,
        minutes: u32,
        seconds: u32,
        fraction: u32,
    }
}

This more properly encodes the invariant that if hours have fractions, then minutes and seconds don't exist, for example.

@sffc sffc merged commit 7e151d4 into unicode-org:main Mar 27, 2024
30 checks passed
@nekevss nekevss deleted the ixdtf-parser branch March 27, 2024 01:39
sffc pushed a commit that referenced this pull request Apr 25, 2024
This PR adjusts TimeDurationRecord to align with some feedback from
#4646 regarding the representation of TimeDurationRecord
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Sedate IXDTF (extended ISO) string parsing function
3 participants