Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cross-calendar skeleta data #4779

Closed
wants to merge 5 commits into from
Closed

Cross-calendar skeleta data #4779

wants to merge 5 commits into from

Conversation

sffc
Copy link
Member

@sffc sffc commented Apr 6, 2024

One skeleton, all calendars and locales = 42057 B (note: this is YearMonthDay which is highly likely to be among the larger skeleta due to more differences across locales)

For comparison, all date skeleta, all locales, only Gregorian = 166515 B

@sffc
Copy link
Member Author

sffc commented Apr 17, 2024

ICU4X WG discussion:

Three dimensions:

  • Skeleton
  • Calendar
  • Locale

Things we care about:

  1. Data size for single-calendar cases
  2. Data size for anycalendar cases
  3. Data size for people who care about just one pattern
  4. Data size for people who care about multiple patterns
  5. Number of marker types / complexity

General directions:

  1. One key per calendar, all skeleta in that key (currently on main)
  2. One key per skeleton, all calendars in that key (in PR Cross-calendar skeleta data #4779)
  3. Mixed keys between calendars and skeletons, possible different granularities (in PR Split date skeleton key in two, with datagen #4778)
  4. One key per calendar*skeleton, resulting in 400+ keys
  5. Use auxiliary keys for skeleta
  6. Use auxiliary keys for calendars

@sffc to make a write-up. Agreed that this is a priority / worthwhile discussion.

@sffc
Copy link
Member Author

sffc commented Apr 17, 2024

The output of the test I just pushed to this branch:

Total count: 3540, total size: 176573, uniqueness: 272, unique size: 17493

My eyes pop a bit at those numbers... 17493 unique size is too small. It's great if true, but I must be missing something.

@sffc
Copy link
Member Author

sffc commented Apr 17, 2024

^ that was running with test locales only. Here it is with all locales:

Total count: 134355, total size: 6537157, uniqueness: 1501, unique size: 108612

108 KB for everything is not bad at all. Doesn't include the lookup table though.

@sffc
Copy link
Member Author

sffc commented Apr 17, 2024

Based on my findings above, my conclusions are:

  1. Fine-grained skeleton data, putting each skeleton/calendar/locale in its own struct, gives the smallest data
  2. The amount of size improvement is substantial
  3. Since we don't want 400 keys, we should consider auxiliary keys
  4. Since we already have infrastructure set up for calendar-specific keys, it makes sense for the auxiliary keys to be on the skeleton dimension
  5. Although this includes by default more data than necessary for a specific skeleton/calendar combination, the difference is fairly small, no worse than what is currently on main, and it can be sliced manually by a flag in datagen

Therefore, I'd like to more forward with skeleta as auxiliary keys.

I was thinking of using one-letter-per-field skeleta as the auxiliary key. For example:

  • Era = G
  • EraYear = Gy
  • YearMonth = yM
  • WeekdayHourMinute = Ehm
  • ...

So then the data would be indexed as something like:

  • "datetime/patterns/gregory/skeleton@1/en-x-Ehm"

@sffc
Copy link
Member Author

sffc commented Apr 17, 2024

As a reminder, the starting point was:

$ cargo run -p icu_datagen --all-features -- --locales full --keys datetime/patterns/gregory/date_skeleton@1 --format blob -o /tmp/icu4x_proposed_skeleta.postcard --deduplication-strategy maximal
...
$ ll /tmp/icu4x_proposed_skeleta.postcard
-rw-r--r-- 1 sffc primarygroup 167467 Apr 17 13:17 /tmp/icu4x_proposed_skeleta.postcard

So this proposal brings us from 167 KB for all skeleta in a single calendar to 108 KB for all skeleta in all calendars.*

* plus some yet-to-be-calculated overhead for the lookup table

sffc added a commit that referenced this pull request Apr 23, 2024
@sffc sffc closed this Apr 23, 2024
@sffc sffc deleted the neo-xcal branch April 23, 2024 16:49
@sffc
Copy link
Member Author

sffc commented Apr 23, 2024

@sffc
Copy link
Member Author

sffc commented Apr 23, 2024

Discussion from ICU4X-WG:

  • @sffc - Any thoughts/concerns about the proposed direction?
  • @Manishearth - I'm in alignment
  • @robertbastian - Can we change this around later based on use cases?
  • @sffc - Datagen can produce smaller data if needed. We can always discuss a v2 version of the key if there's a really strong need.
  • @sffc - It sounds like the sense of the room is, do whatever you figure out that reduces data size the most?
  • @nekevss, @leftmostcast, @echeran - +1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant