-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New datagen API #3386
New datagen API #3386
Conversation
This is going to take some time to land so it's probably fine but I'd like to request we don't land this too soon (not this week, at least). @sffc has previously expressed that he considers this repo to still be in "patch 1.2" state so we shouldn't land major changes that we're not okay with having sneak out through a patch release. (Plus I haven't finished the Google3 datagen import yet. Hopefully this week.) |
Sure, I'd appreciate feedback already though. |
Oh absolutely, I plan to go through this sooner than that 😄 |
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
This comment was marked as spam.
This comment was marked as spam.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partial review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reviewed the whole thing. No blocking comments I'd say
pub use icu_provider::KeyedDataMarker; | ||
pub use icu_provider::{datagen::DataExporter, DataKey, KeyedDataMarker}; | ||
|
||
// SEMVER GRAVEYARD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💀 💀 💀 💀
I am now comfortable landing this. capi may still have a patch release to fix the ssize_t issue if Makoto needs, but I think they use ICU4X from Github. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like the design overall. Much cleaner. Mostly small nits with a couple discussion questions
}) | ||
.map(DataLocale::from) | ||
.collect()) | ||
Ok(self.source.options.locales.filter_by_langid_equality( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussion/Issue: I'm not really a fan of pulling this into each and every transformer. It seems very much prone to error.
I think I still prefer the design where the list of locales (or locale config) is provided directly to the export
function instead of being in the datagen constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If export
decided which locales to export, any key-specific logic (like for segmentation) would need to be in export
. I think that's bad design because it would literally be an if key == Marker::KEY { // choose locales }
for any number of keys, whereas with this design the logic for each key is in its provider implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether the locale config is provided to the constructor or export
is independent of whether the actual logic is in export
or the providers. Putting it in the constructor is just easier because passing it to supported_locales
would require changing the public IterableDataProvider
trait (or using internal mutability on self
before calling it). I'm somewhat neutral as to where to provide it (I think export
makes sense as well), but I think evaluating it in the provider impls is much cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems not quite right to me but it's fine to land for now, and we can discuss further. #3409
// TODO: Do we actually want to filter these by the user-selected locales? The keys | ||
// are more like script selectors... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is the big case where we likely want to return data for locales that aren't necesarily in the user specified set.
We probably should have an additional LSTM-specific flag to allow you to choose which models to build (for which languages, grapheme/codepoint, training fidelity, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'll keep this in for now to preserve behaviour?
Do we an issue for segmenter datagen locale selection? I couldn't find an open one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #3408
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r=me, want shane's approval too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve of the change, but I encourage you next time to put more priority on making it easy for the code reviewer.
use simple_logger::SimpleLogger; | ||
|
||
mod args; | ||
pub mod config; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: A bit confusing to have a pub mod
in a bin target
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it's pub mod
to the parent which doesn't reexport it, which we also do in lib crates for private modules.
} | ||
} | ||
|
||
fn get_grapheme_segmenter_value_from_name(name: &str) -> GraphemeClusterBreak { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: This is a complicated PR, and it is more difficult to review when you refactor code in a way that is unrelated to the change. It appears that you changed this function from being a module function to being an inline function. It would be a much smaller change if you would add cfg
s to the module functions if that's all you need to make the code compile without warnings.
I didn't check if you have this in a standalone commit; please use more detailed commit messages if you prefer a commit-by-commit review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the most recent commit called seg cleanup
. I would have preferred to make this a standalone PR, but with the cross-timezone review cycle having chains is a massive pain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK thanks for the pointer; everything looks mergeable now except for the seg cleanup
commit which requires more of my time so I will look at in more detail this afternoon.
If your goal is to get things reviewed faster, please don't scope-creep them in the middle of a review cycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robertbastian highly recommend just making a separate branch and creating a draft PR with "based on , start reviewing at "
@sffc I think given that this is a large PR already we should tend towards merging sooner rather than later and deal with the leftover stuff as post-review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robertbastian highly recommend just making a separate branch and creating a draft PR with "based on , start reviewing at "
This has not worked well for me in the past, I would get approval it but it has merge conflicts, I lose approval and it takes another couple of days to get approval again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find rebases to be better than merges for dealing with that. But yeah that can be a pain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preferred solution is to make a new PR as Draft, and once the parent PR is merged, fix up the child Draft PR and then send it out for review. Avoid PR chains longer than 2 or 3 by working on some other component that is independent; the project is big enough that there are lots of options.
Co-authored-by: Shane F. Carr <[email protected]>
Supersedes #3142
This gets rid of a lot of the current API surface (it's kept around
doc(hidden)
for semver). Instead of calling a top-leveldatagen
function withOut
parameters, clients now construct aDatagenProvider
withSourceData
andOptions
, and then callexport
on that.export
accepts animpl DataExporter
. This will make it easier to exclude exporter crates (icu_provider_fs
,icu_provider_blob
, futureicu_provider_bake
), as the new API can be built without them (#3365).The new API doesn't support multiple exporters at once. The reason for that is that it's a very niche use case, it can be worked around by defining a forking data exporter, and it's still available through the old API.
make-testdata
still uses the old API, whereasicu4x-datagen
uses the new one, this way we keep full coverage. When we remove the old API, we can move the forking exporter from datagen tomake-testdata
.#3365
#3564