-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add FFI bindings for Collator #2498
Conversation
Co-authored-by: Shane F. Carr <[email protected]>
* Use char instead of U24 in normalizer data char now has the same 3-byte ULE representation as U24, so the postcard and the baked form do not change. (The JSON form changes, though.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FFI looks pretty good
ffi/diplomat/src/collator.rs
Outdated
#[diplomat::enum_convert(core::cmp::Ordering)] | ||
#[diplomat::rust_link(core::cmp::Ordering, Enum)] | ||
pub enum ICU4XOrdering { | ||
Less = 0, | ||
Equal = 1, | ||
Greater = 2, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Should this go in a more general place? Maybe common.rs
?
I could even see an argument for it to go into diplomat runtime, but we shouldn't hold on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elango asked me about this yesterday, I said that the file structure of ffi/diplomat does not matter except for the icu_capi docs, which have limited utility, so probably keeping this in collator.rs for now is fine and we can move it around later if we need)
ffi/diplomat/src/collator.rs
Outdated
Greater = 2, | ||
} | ||
|
||
#[non_exhaustive] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Is there a purpose of #[non_exhaustive]
in FFI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really, no
(though i can imagine Diplomat itself gaining support for handling enums slightly differently in these cases wrt errors shown)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I remove the attribute, then? It existed on the corresponding Rust enums in the collator component, which is why I added them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
/// Compare potentially ill-formed UTF-8 strings. | ||
#[diplomat::rust_link(icu::collator::Collator::compare_utf8, FnInStruct)] | ||
pub fn compare_utf8(&self, left: &[u8], right: &[u8]) -> ICU4XOrdering { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per previous discussion, could you, please make this left: &str
and right: &str
and immediately call as_bytes()
on them, so that both compare
and compare_utf8
take std::string_view
in C++?
Also, would it be bad for code generation for non-C++ languages if compare
was named compare_unsafe
and compare_utf8
was named compare
(to guide C++ callers who aren't thinking carefully about UTF-8 well-formedness to the UBless option)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now the only non-C++ language we have is JS, where we force-convert to UTF8, so calling it unsafe is a bit weird. I think the current model is fine, actually? That previous discussion about &str
-that-is-bytes was more for the case where we only take in &str
, here we take in both anyway.
For 1.0 all of the FFI APIs will be suboptimal, I'm hoping to be able to massively improve Diplomat over the next year so that we can have a breaking 2.0 of FFI that is much much better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it still makes sense to take &str
here and immediately call .as_bytes()
to get the API that makes sense for C++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To elaborate on why I think it makes sense to use &str
with immediate .as_bytes()
for compare_utf8
:
In C++, in the kind of code that uses std::string_view
(and types that autoconvert to it), the use of std::string_view
doesn't really signify an UB-if-wrong-level commitment to well-formed UTF-8. That is, std::string_view
in C++ in contrast to std::span<uint8_t>
doesn't carry a type-system-level commitment of UTF-8 well-formedness.
Furthermore, it's never semantically wrong to call compare_utf8
instead of compare
: Doing so with well-formed inputs is a perf pessimization (I should measure how large!), but that's it. Therefore, we should have documentation and ergonomics that guide C++ caller who haven't thought long and hard about ensuring well-formedness to use compare_utf8
instead of compare
without much thinking. This doesn't work if compare_utf8
has an unidiomatic argument type while compare
has the idiomatic type (but also has UB).
(I don't like relying on a 2.0 break for something like this when there's such an easy fix available right now (&str
and immediate .as_bytes()
).)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but that argument is one-sided from the C++ POV: From the JS POV it's a differently shaped mess. I don't consider that an "easy fix" because there's a tricky tradeoff.
I think in that case we should just not have compare_utf8
and have compare
do the validating string thing (and have JS eat the performance impact)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think exposing a function that routes to the function that in Rust is called compare_utf8
without exposing a function that routes to what in Rust is called compare
is a reasonable option for now, but then that should still use the &str
with immediate .as_bytes()
hack to get the idiomatic signature in C++ and, AFAICT, to get the string conversion behavior in JS.
Going forward, I think it would make sense for Diplomat to have per-target-language directives that would allow exposing only Rust compare_utf16
to JS, Java, and other languages whose strings are potentially-ill-formed UTF-16, allow exposing only Rust compare
to Swift (I'm assuming here that Swift guarantees UTF-8 well-formedness), allow exposing only Rust compare_utf8
to Go, etc. I think it would be appropriate to expose all three to C++ with docs and ergonomics that push the one that routes to Rust compare_utf8
as what callers should use when in doubt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that works for now. I'm also open to exposing compare()
as well but we can start with the simpler thing and add stuff over time.
Going forward, I think it would make sense for Diplomat to have per-target-language directives that would allow exposing only Rust
compare_utf16
to JS, Java, and other languages whose strings are potentially-ill-formed UTF-16, allow exposing only Rustcompare
to Swift (I'm assuming here that Swift guarantees UTF-8 well-formedness), allow exposing only Rustcompare_utf8
to Go, etc. I think it would be appropriate to expose all three to C++ with docs and ergonomics that push the one that routes to Rustcompare_utf8
as what callers should use when in doubt.
Yes, that's already the plan.
To spell it out, the long term plan is:
- Diplomat has per-backend rename/disabling (Support disabling APIs per-backend rust-diplomat/diplomat#233, Support overloading rust-diplomat/diplomat#234) so we can shape the API in a per-backend way
- Diplomat gets an UnvalidatedStr input type, which in JS is identical to
&str
but in C++ is astring_view
that has no validation guarantees - Diplomat's
&str
type is by-default validated in C++, though we have a global config option to disable that. You can "opt in" to that behavior again by using UnvalidatedStr.
These can be achieved in a non-breaking way if done right, though we need to make sure our choices with the API now are compatible with this. If we don't add a compare()
now it gives us some breathing room.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so I think we should land this as is and then discuss the details of what to do for 1.0 when I fix #2520 (which I plan to do after this lands) just so that Elango's work isn't blocked.
I think we already have consensus over the plan but I don't want to drag this PR further
.into() | ||
} | ||
|
||
/// Compare guaranteed well-formed UTF-8 strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a remark that passing ill-formed UTF-8 is Undefined Behavior. (AFAICT, it actually is memory-unsafe to do so.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
* See the {@link https://unicode-org.github.io/icu4x-docs/doc/icu/collator/struct.Collator.html#method.compare Rust documentation for `compare`} for more information. | ||
*/ | ||
compare(left: string, right: string): ICU4XOrdering; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an issue on file to improve the mapping for languages that use UTF-16 strings to route compare
to compare_utf16
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is related to the feature request for Diplomat to support renaming / overloading for target languages that support overloaded functions: rust-diplomat/diplomat#234
Speculatively based off of currently in-progress PR #2475 as it waits for approval.