Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split date skeleton key in two, with datagen #4778

Closed
wants to merge 2 commits into from

Conversation

sffc
Copy link
Member

@sffc sffc commented Apr 5, 2024

Need to see if this helps with data size

@sffc
Copy link
Member Author

sffc commented Apr 5, 2024

Sizes (all locales, maximal deduplication, for each key):

164K    /tmp/icu4x_both.postcard
96K     /tmp/icu4x_day.postcard
52K     /tmp/icu4x_ext.postcard

So it looks like splitting does make things smaller.

@sffc sffc marked this pull request as ready for review April 5, 2024 22:43
type DateDaySkeletonPatternsV1Marker: KeyedDataMarker<Yokeable = PackedSkeletonDataV1<'static>>;

#[cfg(any(feature = "datagen", feature = "experimental"))]
/// The data marker for loading date-ext pattern skeletons for this calendar.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have more comments explaining what ext is

@sffc
Copy link
Member Author

sffc commented Apr 6, 2024

I did a little experiment to find the most efficient split.

For splitting into 2 keys, we're near the optimum with the split proposed in this PR.

For splitting into 3 keys, we should add a split between YearMonthDay and EraYearMonthDay.

For splitting into 4 keys, we should add a split between Year and EraYear.

It's also possible that reordering the skeleta within the list would yield smaller blocks. I haven't tested this.

@sffc
Copy link
Member Author

sffc commented Apr 6, 2024

Splitting the skeleta each into a separate key makes things the smallest, even with the lookup tables. The lookup tables get smaller when the keys are smaller because there is more deduplication. Even in hybrid mode, the wins on data size are great enough that it looks like it would more than offset the increase in lookup table sizes.

@sffc
Copy link
Member Author

sffc commented Apr 6, 2024

Here's a hypothesis worth testing. I expect that making the keys shared across calendars but unique on components would yield better results than the other way around. Patterns don't differ as much across calendars as display names, but they differ a lot from skeleton to skeleton.

@sffc sffc marked this pull request as draft April 6, 2024 08:16
@sffc
Copy link
Member Author

sffc commented Apr 6, 2024

Moving this to draft because I want to compare it with #4779

@Manishearth
Copy link
Member

Manishearth commented Apr 6, 2024

I expect that making the keys shared across calendars but unique on components would yield better results than the other way around.

Wait, really? How would we handle cases where calendars differ, an "extra values" table?

Also won't that deduplicate anyway?

sffc added a commit that referenced this pull request Apr 23, 2024
@sffc sffc closed this Apr 23, 2024
@sffc sffc deleted the neo-split branch April 23, 2024 16:50
@sffc
Copy link
Member Author

sffc commented Apr 23, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants