Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datagen options around missing locales #3746

Open
sffc opened this issue Jul 27, 2023 · 5 comments
Open

Datagen options around missing locales #3746

sffc opened this issue Jul 27, 2023 · 5 comments
Labels
C-data-infra Component: provider, datagen, fallback, adapters

Comments

@sffc
Copy link
Member

sffc commented Jul 27, 2023

tl;dr, what should we do when a user tries to export a locale from datagen that isn't in CLDR?

At first thought, it seems that we should inform the client by failing datagen. However, this is more nuanced. Some caveats:

  1. Not all keys support the same set of locales.
  2. Not all keys have data at the root locale (example: collator/reord@1).
  3. Some keys require either an extension (example: datetime/skeletons@1) or soon an auxiliary key.

Because of caveats 2 and 3, we cannot always simply run fallback and fill in the data based on fallback.

@Manishearth suggested tagging data keys that aren't expected to fall back to root with some extra metadata. This fixes caveats 2 and 3 but not 1.

Related questions:

  1. Should the behavior depend on whether there was an explicit locale or whether the set of locales came from a CLDR set?
  2. Should the behavior depend on the fallback mode (i.e., should Precomputed be stricter than Hybrid)?

CC @robertbastian

@sffc sffc added discuss Discuss at a future ICU4X-SC meeting discuss-priority Discuss at the next ICU4X meeting labels Jul 27, 2023
@sffc
Copy link
Member Author

sffc commented Jul 27, 2023

  • @zbraniecki - My gut feeling is that we should fail loudly only when we don't have a fallback. If there's a fallback, we could print a warning. We could build a tool that generates a list of what is there and what needs fallback and what is missing.
  • @sffc - Maybe we should make a flag that changes this behavior. For the time being, retain 1.2 behavior of failing silently.

Conclusion: retain 1.2 behavior for the time being; print a log statement for the error cases; revisit in 2.0

@sffc sffc added 2.0-breaking Changes that are breaking API changes and removed discuss Discuss at a future ICU4X-SC meeting discuss-priority Discuss at the next ICU4X meeting labels Jul 27, 2023
@sffc sffc added this to the ICU4X 2.0 milestone Jul 27, 2023
@sffc sffc added the C-data-infra Component: provider, datagen, fallback, adapters label Jul 27, 2023
@Manishearth Manishearth moved this to Investigate in icu4x 2.0 Feb 23, 2024
@sffc sffc moved this from Investigate to Needs discussion to unblock in icu4x 2.0 Mar 14, 2024
@Manishearth
Copy link
Member

Proposal:

  1. Continue printing a warning in datagen when a request language falls back to und@ro in an unexpected way
  2. Do NOT retain the base language if the language is not in CLDR

LGTM: @robertbastian @sffc

Discussion:

  • @zbraniecki - I think the datagen API should return me a list of locales it failed to generate. OK for the CLI to print warnings.

Conclusion: @sffc/@robertbastian to design an API for this.

@sffc
Copy link
Member Author

sffc commented May 30, 2024

First thing we need is a clear definition of what it means when we say "failed to generate". The main thing is that all data can fall back to root, and this is the expected behavior in many cases.

I might propose the following definition: "the requested langid has no ancestors that are in the list in availableLocales.json". Unfortunately this definition only works with DatagenProvider as a data source.

A cleaner definition might be to just require that source providers in DatagenDriver return non-und data for all languages they support (i.e. RetainBaseLanguages-like behavior), and we can make sure DatagenProvider does this.

Once we decide on the definition, for the API, I think DatagenDriver::export should return a struct such as

#[non_exhaustive]
pub struct ExportResult {
    pub missing_locales: Vec<LanguageIdentifier>
}

On the CLI, we just send the list through log::info!.

Feedback? @robertbastian @zbraniecki

@sffc sffc added the needs-approval One or more stakeholders need to approve proposal label Jun 2, 2024
@robertbastian
Copy link
Member

unrelated to missing locales, but I have another use case for ExportResult, returning the crates that a baked exporter needs. This is currently only logged.

@sffc
Copy link
Member Author

sffc commented Jun 25, 2024

Proposal:

  • In 2.0, add non_exhaustive ExportResult and return it from datagen driver
  • Punt the rest of this issue until we have clearer requirements on the use case

LGTM: @sffc @robertbastian

@sffc sffc moved this from Needs discussion to unblock to Unclaimed for sprint in icu4x 2.0 Jun 25, 2024
@sffc sffc removed the needs-approval One or more stakeholders need to approve proposal label Jul 30, 2024
@robertbastian robertbastian removed their assignment Oct 1, 2024
@robertbastian robertbastian removed this from icu4x 2.0 Oct 1, 2024
@robertbastian robertbastian removed the 2.0-breaking Changes that are breaking API changes label Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters
Projects
None yet
Development

No branches or pull requests

3 participants