Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FFI bindings for Collator #2498

Merged
merged 31 commits into from
Sep 7, 2022
Merged

Conversation

echeran
Copy link
Contributor

@echeran echeran commented Aug 31, 2022

Speculatively based off of currently in-progress PR #2475 as it waits for approval.

@echeran echeran marked this pull request as ready for review September 1, 2022 19:50
@echeran echeran requested a review from a team as a code owner September 1, 2022 19:50
@echeran echeran requested review from Manishearth and sffc and removed request for a team September 1, 2022 19:50
@echeran
Copy link
Contributor Author

echeran commented Sep 1, 2022

cc @hsivonen @robertbastian @pdogr

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FFI looks pretty good

@echeran echeran requested a review from Manishearth September 1, 2022 23:24
sffc
sffc previously approved these changes Sep 2, 2022
Comment on lines 19 to 25
#[diplomat::enum_convert(core::cmp::Ordering)]
#[diplomat::rust_link(core::cmp::Ordering, Enum)]
pub enum ICU4XOrdering {
Less = 0,
Equal = 1,
Greater = 2,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Should this go in a more general place? Maybe common.rs?

I could even see an argument for it to go into diplomat runtime, but we shouldn't hold on that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elango asked me about this yesterday, I said that the file structure of ffi/diplomat does not matter except for the icu_capi docs, which have limited utility, so probably keeping this in collator.rs for now is fine and we can move it around later if we need)

Greater = 2,
}

#[non_exhaustive]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Is there a purpose of #[non_exhaustive] in FFI?

Copy link
Member

@Manishearth Manishearth Sep 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really, no

(though i can imagine Diplomat itself gaining support for handling enums slightly differently in these cases wrt errors shown)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I remove the attribute, then? It existed on the corresponding Rust enums in the collator component, which is why I added them here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Manishearth
Manishearth previously approved these changes Sep 2, 2022
sffc
sffc previously approved these changes Sep 2, 2022

/// Compare potentially ill-formed UTF-8 strings.
#[diplomat::rust_link(icu::collator::Collator::compare_utf8, FnInStruct)]
pub fn compare_utf8(&self, left: &[u8], right: &[u8]) -> ICU4XOrdering {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per previous discussion, could you, please make this left: &str and right: &str and immediately call as_bytes() on them, so that both compare and compare_utf8 take std::string_view in C++?

Also, would it be bad for code generation for non-C++ languages if compare was named compare_unsafe and compare_utf8 was named compare (to guide C++ callers who aren't thinking carefully about UTF-8 well-formedness to the UBless option)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the only non-C++ language we have is JS, where we force-convert to UTF8, so calling it unsafe is a bit weird. I think the current model is fine, actually? That previous discussion about &str-that-is-bytes was more for the case where we only take in &str, here we take in both anyway.

For 1.0 all of the FFI APIs will be suboptimal, I'm hoping to be able to massively improve Diplomat over the next year so that we can have a breaking 2.0 of FFI that is much much better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it still makes sense to take &str here and immediately call .as_bytes() to get the API that makes sense for C++.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To elaborate on why I think it makes sense to use &str with immediate .as_bytes() for compare_utf8:

In C++, in the kind of code that uses std::string_view (and types that autoconvert to it), the use of std::string_view doesn't really signify an UB-if-wrong-level commitment to well-formed UTF-8. That is, std::string_view in C++ in contrast to std::span<uint8_t> doesn't carry a type-system-level commitment of UTF-8 well-formedness.

Furthermore, it's never semantically wrong to call compare_utf8 instead of compare: Doing so with well-formed inputs is a perf pessimization (I should measure how large!), but that's it. Therefore, we should have documentation and ergonomics that guide C++ caller who haven't thought long and hard about ensuring well-formedness to use compare_utf8 instead of compare without much thinking. This doesn't work if compare_utf8 has an unidiomatic argument type while compare has the idiomatic type (but also has UB).

(I don't like relying on a 2.0 break for something like this when there's such an easy fix available right now (&str and immediate .as_bytes()).)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but that argument is one-sided from the C++ POV: From the JS POV it's a differently shaped mess. I don't consider that an "easy fix" because there's a tricky tradeoff.

I think in that case we should just not have compare_utf8 and have compare do the validating string thing (and have JS eat the performance impact)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think exposing a function that routes to the function that in Rust is called compare_utf8 without exposing a function that routes to what in Rust is called compare is a reasonable option for now, but then that should still use the &str with immediate .as_bytes() hack to get the idiomatic signature in C++ and, AFAICT, to get the string conversion behavior in JS.

Going forward, I think it would make sense for Diplomat to have per-target-language directives that would allow exposing only Rust compare_utf16 to JS, Java, and other languages whose strings are potentially-ill-formed UTF-16, allow exposing only Rust compare to Swift (I'm assuming here that Swift guarantees UTF-8 well-formedness), allow exposing only Rust compare_utf8 to Go, etc. I think it would be appropriate to expose all three to C++ with docs and ergonomics that push the one that routes to Rust compare_utf8 as what callers should use when in doubt.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that works for now. I'm also open to exposing compare() as well but we can start with the simpler thing and add stuff over time.

Going forward, I think it would make sense for Diplomat to have per-target-language directives that would allow exposing only Rust compare_utf16 to JS, Java, and other languages whose strings are potentially-ill-formed UTF-16, allow exposing only Rust compare to Swift (I'm assuming here that Swift guarantees UTF-8 well-formedness), allow exposing only Rust compare_utf8 to Go, etc. I think it would be appropriate to expose all three to C++ with docs and ergonomics that push the one that routes to Rust compare_utf8 as what callers should use when in doubt.

Yes, that's already the plan.

To spell it out, the long term plan is:

These can be achieved in a non-breaking way if done right, though we need to make sure our choices with the API now are compatible with this. If we don't add a compare() now it gives us some breathing room.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so I think we should land this as is and then discuss the details of what to do for 1.0 when I fix #2520 (which I plan to do after this lands) just so that Elango's work isn't blocked.

I think we already have consensus over the plan but I don't want to drag this PR further

.into()
}

/// Compare guaranteed well-formed UTF-8 strings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a remark that passing ill-formed UTF-8 is Undefined Behavior. (AFAICT, it actually is memory-unsafe to do so.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


* See the {@link https://unicode-org.github.io/icu4x-docs/doc/icu/collator/struct.Collator.html#method.compare Rust documentation for `compare`} for more information.
*/
compare(left: string, right: string): ICU4XOrdering;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an issue on file to improve the mapping for languages that use UTF-16 strings to route compare to compare_utf16?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is related to the feature request for Diplomat to support renaming / overloading for target languages that support overloaded functions: rust-diplomat/diplomat#234

Manishearth
Manishearth previously approved these changes Sep 7, 2022
@Manishearth
Copy link
Member

@hsivonen I'm giving Elango the green light to merge this; I think all of your comments have been addressed aside from the one on encoding, and I plan to fix all of that wholesale with the plan laid out in #2520. If there are any other followups please let us know and we can fix them.

@echeran echeran merged commit feaa5c0 into unicode-org:main Sep 7, 2022
kelebra pushed a commit to kelebra/icu4x that referenced this pull request Sep 8, 2022
@echeran echeran deleted the ffi-collator branch November 23, 2023 04:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants