Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement data struct PluralRangesV1 #4077

Merged
merged 9 commits into from
Oct 5, 2023

Conversation

jedel1043
Copy link
Contributor

@jedel1043 jedel1043 commented Sep 22, 2023

Related to #3012.

This is still missing some details and documentation, but I wanted to put it out for you to evaluate the data model. Should be feature-complete now.

One of the biggest space optimizations was obviating the need to store ranges where the resulting category of the range is the same as the end category, since TR35 specifies:

If there is no value for a <start,end> pair, the default result is end.

@jedel1043 jedel1043 marked this pull request as ready for review September 22, 2023 18:28
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: Nice job!

pub end: String,
}

impl<'de> Deserialize<'de> for PluralRange {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: Really nice use of a custom Deserialize impl

@@ -155,7 +161,7 @@ pub enum PluralCategory {
///
/// - 0 in Arabic (ar), Latvian (lv)
/// - 10~20, 30, 40, 50, ... in Latvian (lv)
Zero,
Zero = 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: For dataful enums I wonder if in general we should reserve 0 for the value "null"; or, in the case of the PluralCategory enum, null is "other" so should "other" be 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought about it, but I didn't know if reordering the variants would be a breaking change or not (because of the PartialOrd derive), so I opted for following the specified order.

Copy link
Contributor Author

@jedel1043 jedel1043 Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though, we could also implement PartialOrd by hand, but it'll be really weird to have Zero < Other and Zero as u8 > Other as u8

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see, this enum has derive(Ord) which somewhat limits our options

My fully detailed thoughts: I think what you did is fine, but I don't see the harm in making a new enum whose purpose is serialization. We then leave this enum as the public-facing Rust enum. Another thing to consider is, the keys in the ZeroMap do not need to be validated; if you have an out-of-range plural category, you just won't ever find that entry in the map, so we can save a little bit of deserialization code and speed by using a plain u8 as the key. However, that key type does not serialize nicely in JSON, so what we've done in other places is make yet another enum which is a newtype wrapper over the primitive but has nice serialization (usually called UnvalidatedX: see UnvalidatedTinyAsciiStr, UnvalidatedStr, etc). However, the nice part about your proposed solution is that it is the least amount of additional boilerplate code. It just might be tying too many functional use cases together.

Copy link
Contributor Author

@jedel1043 jedel1043 Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then it would be:

  • A new RawPluralCategory enum inside provider with RawPluralCategory::Other = 0 and a conversion from PluralCategory to RawPluralCategory. This will be the new value type of the map.
  • A new UnvalidatedPluralCategory(u8) inside provider that de/serializes from/to a string if the de/serializer is human readable, which will be the key type of the map. Also a conversion from RawPluralCategory to UnvalidatedPluralCategory.

Is that correct?

Also, should both types be #[doc(hidden]? Or should I just document them as unstable?

Copy link
Member

@sffc sffc Sep 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, naming aside, that's basically the other model for this.

They should be public in the provider module with the yellow box warning people that the APIs may be unstable.

I would make the second one pub struct UnvalidatedPluralCategory(pub u8) that is a u8 for all purposes except that is_human_readable impl Serialize makes a nice string, returning an error if out of range, and is_human_readable impl Deserialize knows how to handle the string.

However I'd like to see what @robertbastian or @Manishearth think about these two approaches (re-use the existing enum and make it ZeroVec compatible, or introduce two new types which are data provider specific)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, naming aside, that's basically the other model for this.

They should be public in the provider module with the yellow box warning people that the APIs may be unstable.

I would make the second one pub struct UnvalidatedPluralCategory(pub u8) that is a u8 for all purposes except that is_human_readable impl Serialize makes a nice string, returning an error if out of range, and is_human_readable impl Deserialize knows how to handle the string.

However I'd like to see what @robertbastian or @Manishearth think about these two approaches (re-use the existing enum and make it ZeroVec compatible, or introduce two new types which are data provider specific)

Sounds good! I've already finished implementing the API in another branch, so I'll wait for your consensus in the meantime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that a new type is the best solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

@jedel1043
Copy link
Contributor Author

Applied the review. Let me know if those names are okay.

robertbastian
robertbastian previously approved these changes Sep 26, 2023
/// - `key1` corresponds to the end category of the range.
#[cfg_attr(feature = "serde", serde(borrow))]
pub ranges:
ZeroMap2d<'data, UnvalidatedPluralCategory, UnvalidatedPluralCategory, RawPluralCategory>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: ZeroMap2d might be overkill. I think we don't need functionality to iterate over second-order keys, and we can fit both keys in a single byte.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I'll use a simple ZeroMap then.

Copy link
Contributor Author

@jedel1043 jedel1043 Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced it with a ZeroMap. It makes the human-readable de/serialization slightly worse but non-human-readable de/serialization is a lot smaller and faster.

robertbastian
robertbastian previously approved these changes Sep 26, 2023
Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting for @sffc to take a look before merging.

sffc
sffc previously approved these changes Oct 2, 2023
.ok_or_else(|| S::Error::custom("invalid tag in UnvalidatedPluralRange"))?;
let end = RawPluralCategory::new_from_u8(self.0 & 0x0F)
.ok_or_else(|| S::Error::custom("invalid tag in UnvalidatedPluralRange"))?;
(start, end).serialize(serializer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: if we wanted to make this human-readable we could serialize to a separated string like "One/Other", but I think what you have here is good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be a pretty quick fix. Maybe something like One-Other to use a range-like syntax?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I like that!

(This is a totally optional suggestion; this is mergeable as-is)

databake(path = icu_plurals::provider),
)]
#[zerovec::make_ule(UnvalidatedPluralRangeULE)]
pub struct UnvalidatedPluralRange(pub u8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: so compact!

@jedel1043 jedel1043 dismissed stale reviews from sffc and robertbastian via b1e132a October 2, 2023 04:59
@jedel1043 jedel1043 requested a review from sffc October 2, 2023 05:45
@robertbastian robertbastian merged commit d2669e1 into unicode-org:main Oct 5, 2023
@jedel1043 jedel1043 deleted the plural-ranges-ds branch October 5, 2023 14:46
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants