Add FFI bindings for Collator #2498

echeran · 2022-08-31T23:23:34Z

Speculatively based off of currently in-progress PR #2475 as it waits for approval.


echeran · 2022-09-01T19:50:58Z

cc @hsivonen @robertbastian @pdogr

Manishearth

FFI looks pretty good


sffc · 2022-09-02T17:11:58Z

ffi/diplomat/src/collator.rs

+    #[diplomat::enum_convert(core::cmp::Ordering)]
+    #[diplomat::rust_link(core::cmp::Ordering, Enum)]
+    pub enum ICU4XOrdering {
+        Less = 0,
+        Equal = 1,
+        Greater = 2,
+    }


Suggestion: Should this go in a more general place? Maybe common.rs?

I could even see an argument for it to go into diplomat runtime, but we shouldn't hold on that.

Elango asked me about this yesterday. I said that the file structure of ffi/diplomat does not matter except for the icu_capi docs, which have limited utility, so keeping this in collator.rs for now is probably fine, and we can move it around later if we need to.
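
For readers unfamiliar with the annotation, the conversion that `#[diplomat::enum_convert(core::cmp::Ordering)]` is asked to provide can be approximated by hand in plain Rust. The sketch below does not use Diplomat. The name `FfiOrdering` is hypothetical, and the discriminants simply mirror the diff above as it stood at this point in the review.

```rust
use core::cmp::Ordering;

/// Hypothetical stand-in for the FFI enum; hand-written, not generated by Diplomat.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(C)]
pub enum FfiOrdering {
    Less = 0,
    Equal = 1,
    Greater = 2,
}

/// Rust-side result to FFI-side enum.
impl From<Ordering> for FfiOrdering {
    fn from(other: Ordering) -> Self {
        match other {
            Ordering::Less => FfiOrdering::Less,
            Ordering::Equal => FfiOrdering::Equal,
            Ordering::Greater => FfiOrdering::Greater,
        }
    }
}

/// FFI-side enum back to the Rust-side type.
impl From<FfiOrdering> for Ordering {
    fn from(other: FfiOrdering) -> Self {
        match other {
            FfiOrdering::Less => Ordering::Less,
            FfiOrdering::Equal => Ordering::Equal,
            FfiOrdering::Greater => Ordering::Greater,
        }
    }
}
```

Because the conversion is independent of the collator, an enum of this shape would work equally well in a shared module, which is the point of the suggestion above.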

sffc · 2022-09-02T17:12:38Z

ffi/diplomat/src/collator.rs

+        Greater = 2,
+    }
+
+    #[non_exhaustive]


Question: Is there a purpose of #[non_exhaustive] in FFI?

not really, no

(though I can imagine Diplomat itself gaining support for handling enums slightly differently in these cases with respect to the errors shown)

Should I remove the attribute, then? It existed on the corresponding Rust enums in the collator component, which is why I added them here.


hsivonen · 2022-09-05T08:22:19Z

ffi/diplomat/src/collator.rs

+
+        /// Compare potentially ill-formed UTF-8 strings.
+        #[diplomat::rust_link(icu::collator::Collator::compare_utf8, FnInStruct)]
+        pub fn compare_utf8(&self, left: &[u8], right: &[u8]) -> ICU4XOrdering {


Per previous discussion, could you please make this left: &str and right: &str and immediately call as_bytes() on them, so that both compare and compare_utf8 take std::string_view in C++?

Also, would it be bad for code generation for non-C++ languages if compare was named compare_unsafe and compare_utf8 was named compare (to guide C++ callers who aren't thinking carefully about UTF-8 well-formedness to the UBless option)?

Right now the only non-C++ language we have is JS, where we force-convert to UTF8, so calling it unsafe is a bit weird. I think the current model is fine, actually? That previous discussion about &str-that-is-bytes was more for the case where we only take in &str, here we take in both anyway.

For 1.0 all of the FFI APIs will be suboptimal, I'm hoping to be able to massively improve Diplomat over the next year so that we can have a breaking 2.0 of FFI that is much much better.

I think it still makes sense to take &str here and immediately call .as_bytes() to get the API that makes sense for C++.

To elaborate on why I think it makes sense to use &str with immediate .as_bytes() for compare_utf8:

In C++, in the kind of code that uses std::string_view (and types that autoconvert to it), the use of std::string_view doesn't really signify an UB-if-wrong-level commitment to well-formed UTF-8. That is, std::string_view in C++ in contrast to std::span<uint8_t> doesn't carry a type-system-level commitment of UTF-8 well-formedness.

Furthermore, it's never semantically wrong to call compare_utf8 instead of compare: doing so with well-formed inputs is a perf pessimization (I should measure how large!), but that's it. Therefore, we should have documentation and ergonomics that guide C++ callers who haven't thought long and hard about ensuring well-formedness to use compare_utf8 instead of compare without much thinking. This doesn't work if compare_utf8 has an unidiomatic argument type while compare has the idiomatic type (but also has UB).

(I don't like relying on a 2.0 break for something like this when there's such an easy fix available right now (&str and immediate .as_bytes()).)

Yes, but that argument is one-sided from the C++ POV: From the JS POV it's a differently shaped mess. I don't consider that an "easy fix" because there's a tricky tradeoff.

I think in that case we should just not have compare_utf8 and have compare do the validating string thing (and have JS eat the performance impact)

I think exposing a function that routes to the function that in Rust is called compare_utf8 without exposing a function that routes to what in Rust is called compare is a reasonable option for now, but then that should still use the &str with immediate .as_bytes() hack to get the idiomatic signature in C++ and, AFAICT, to get the string conversion behavior in JS.

Going forward, I think it would make sense for Diplomat to have per-target-language directives that would allow exposing only Rust compare_utf16 to JS, Java, and other languages whose strings are potentially-ill-formed UTF-16, allow exposing only Rust compare to Swift (I'm assuming here that Swift guarantees UTF-8 well-formedness), allow exposing only Rust compare_utf8 to Go, etc. I think it would be appropriate to expose all three to C++ with docs and ergonomics that push the one that routes to Rust compare_utf8 as what callers should use when in doubt.

Okay, that works for now. I'm also open to exposing compare() as well but we can start with the simpler thing and add stuff over time.
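
The shape of the "&str with immediate .as_bytes()" approach agreed on here might look roughly like the sketch below. It is self-contained and does not call the real collator: `compare_utf8_stand_in` is a hypothetical placeholder that does a plain byte-wise comparison after lossy decoding, on the assumption that ill-formed sequences should behave like U+FFFD. Note that within Rust itself a `&str` is always well-formed; as the thread describes, the signature matters only for what the generated C++ and JS bindings accept.

```rust
use core::cmp::Ordering;

/// Hypothetical placeholder for the Rust-side `compare_utf8`.
/// It tolerates ill-formed UTF-8 by replacing errors with U+FFFD, then
/// compares byte-wise. This is NOT locale-aware collation.
fn compare_utf8_stand_in(left: &[u8], right: &[u8]) -> Ordering {
    let l = String::from_utf8_lossy(left);
    let r = String::from_utf8_lossy(right);
    l.cmp(&r)
}

/// FFI-shaped entry point: takes `&str` so the bindings get the idiomatic
/// string type, then drops to bytes at once and routes to the tolerant path.
pub fn compare(left: &str, right: &str) -> Ordering {
    compare_utf8_stand_in(left.as_bytes(), right.as_bytes())
}
```

With this shape there is a single exposed `compare`, and no entry point whose contract depends on the caller having verified well-formedness.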

> Going forward, I think it would make sense for Diplomat to have per-target-language directives that would allow exposing only Rust compare_utf16 to JS, Java, and other languages whose strings are potentially-ill-formed UTF-16, allow exposing only Rust compare to Swift (I'm assuming here that Swift guarantees UTF-8 well-formedness), allow exposing only Rust compare_utf8 to Go, etc. I think it would be appropriate to expose all three to C++ with docs and ergonomics that push the one that routes to Rust compare_utf8 as what callers should use when in doubt.

Yes, that's already the plan.

To spell it out, the long-term plan is:

- Diplomat has per-backend rename/disabling ("Support disabling APIs per-backend", rust-diplomat/diplomat#233; "Support overloading", rust-diplomat/diplomat#234), so we can shape the API in a per-backend way.
- Diplomat gets an UnvalidatedStr input type, which in JS is identical to &str but in C++ is a string_view that has no validation guarantees.
- Diplomat's &str type is validated by default in C++, though we have a global config option to disable that. You can "opt in" to the unvalidated behavior again by using UnvalidatedStr.

These can be achieved in a non-breaking way if done right, though we need to make sure our choices with the API now are compatible with this. If we don't add a compare() now it gives us some breathing room.
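
To make the distinction between a validated `&str` and the planned UnvalidatedStr concrete, here is a purely illustrative sketch. The type below is a hypothetical newtype written for this example; it is not Diplomat's actual design, which the thread only outlines. It shows the one step that separates the two input kinds: a UTF-8 validity check at the boundary.

```rust
use core::str::Utf8Error;

/// Hypothetical illustration only: bytes that are expected to be UTF-8
/// but carry no guarantee of well-formedness.
pub struct UnvalidatedStr<'a>(pub &'a [u8]);

impl<'a> UnvalidatedStr<'a> {
    /// The check a "validated by default" string input would perform
    /// before handing a `&str` to Rust code.
    pub fn try_as_str(&self) -> Result<&'a str, Utf8Error> {
        core::str::from_utf8(self.0)
    }

    /// The unvalidated path: hand the raw bytes to an API that tolerates
    /// ill-formed input (for example, one shaped like `compare_utf8`).
    pub fn as_bytes(&self) -> &'a [u8] {
        self.0
    }
}
```

An API that takes the unvalidated form never needs `unsafe`, while one that takes the validated form pays for the check once at the boundary.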

Okay, so I think we should land this as is and then discuss the details of what to do for 1.0 when I fix #2520 (which I plan to do after this lands) just so that Elango's work isn't blocked.

I think we already have consensus on the plan, but I don't want to drag this PR out further.

hsivonen · 2022-09-05T08:23:13Z

ffi/diplomat/src/collator.rs

+                .into()
+        }
+
+        /// Compare guaranteed well-formed UTF-8 strings.


Please add a remark that passing ill-formed UTF-8 is Undefined Behavior. (AFAICT, it actually is memory-unsafe to do so.)


hsivonen · 2022-09-05T08:34:31Z

ffi/diplomat/wasm/icu4x/lib/ICU4XCollator.d.ts

+
+   * See the {@link https://unicode-org.github.io/icu4x-docs/doc/icu/collator/struct.Collator.html#method.compare Rust documentation for `compare`} for more information.
+   */
+  compare(left: string, right: string): ICU4XOrdering;


Is there an issue on file to improve the mapping for languages that use UTF-16 strings to route compare to compare_utf16?

I think this is related to the feature request for Diplomat to support renaming / overloading for target languages that support overloaded functions: rust-diplomat/diplomat#234
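
As a rough picture of what routing to a UTF-16 entry point would avoid, the sketch below compares potentially ill-formed UTF-16 directly, with no conversion to UTF-8 first. `compare_utf16_stand_in` is a hypothetical placeholder, not the collator's `compare_utf16`: it assumes unpaired surrogates should behave like U+FFFD and compares by code point rather than by collation rules.

```rust
use core::cmp::Ordering;

/// Decode potentially ill-formed UTF-16, mapping each unpaired surrogate
/// to U+FFFD (an assumption of this sketch).
fn decode_lossy(units: &[u16]) -> Vec<char> {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

/// Hypothetical placeholder for a UTF-16 comparison entry point.
/// Code point order only; NOT locale-aware collation.
pub fn compare_utf16_stand_in(left: &[u16], right: &[u16]) -> Ordering {
    decode_lossy(left).cmp(&decode_lossy(right))
}
```

A binding for a language with UTF-16 strings could pass its native code units straight to a function of this shape.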

ffi/diplomat/Cargo.toml

stale

Manishearth · 2022-09-07T18:56:29Z

@hsivonen I'm giving Elango the green light to merge this; I think all of your comments have been addressed aside from the one on encoding, and I plan to fix all of that wholesale with the plan laid out in #2520. If there are any other followups please let us know and we can fix them.

echeran and others added 17 commits August 29, 2022 18:54

Create options bag for CollatorOptions, keep previous bit field, crea…

17257c3

…te new enums

Merge branch 'main' into collator-options-bag

13279b4

Make docs module doc-only visible

6bec2a5

Make case-level examples use consistent pairs for comparison readability

5c61c26

Apply suggestions from code review

ad96fc3

Co-authored-by: Shane F. Carr <[email protected]>

Apply review feedback

e5a6d69

Apply cargo make diplomat-coverage

ffed5c6

Use well-/ill-formed language instead of in-/valid for UTF-8/-16

ba3c491

Use char instead of U24 in normalizer data (unicode-org#2481)

0aa6444

* Use char instead of U24 in normalizer data char now has the same 3-byte ULE representation as U24, so the postcard and the baked form do not change. (The JSON form changes, though.)

Fix CI (unicode-org#2494)

62b27f4

Add FFI for Collator

077070d

Add files from running diplomat-gen

382f4c0

Merge branch 'main' into ffi-collator

7b29209

Make Collator FFI options enums non_exhaustive to match non_exhaustiv…

c6bba94

…e in Rust crate

Add Collator methods to FFI

b40e87a

Enable serde feature for Collator for Diplomat sources

c2d5ce1

Apply diplomat-coverage changes

446bd3b

echeran marked this pull request as ready for review September 1, 2022 19:50

echeran requested a review from a team as a code owner September 1, 2022 19:50

echeran requested review from Manishearth and sffc and removed request for a team September 1, 2022 19:50

Update Contributing.md test setup instructions for ci-job-ffi

e699d7b

Manishearth reviewed Sep 1, 2022


echeran added 2 commits September 1, 2022 16:14

Add rust_link annotation for Collator constructor in FFI

70d41de

Add short function doc strings for methods of Collator in FFI

2a8fb9e

echeran requested a review from Manishearth September 1, 2022 23:24

Apply diplomat-gen

6442d95

sffc previously approved these changes Sep 2, 2022


Use .into() instead of From:: for straightforward field copying

60d3e3e

echeran dismissed sffc’s stale review via 60d3e3e September 2, 2022 17:55

Manishearth previously approved these changes Sep 2, 2022


Remove unnecessary non_exhaustive attribute for FFI enums

dcf51e6

sffc previously approved these changes Sep 2, 2022


hsivonen previously requested changes Sep 5, 2022


Manishearth mentioned this pull request Sep 6, 2022

Consistently deal with string encodings over FFI #2520

Closed

echeran added 3 commits September 6, 2022 14:11

Add tests for Collator FFI

63e75e9

Change the enum values for ICU4XOrdering to match C++ compare return int

67710fa

Improve API doc for FFI compare accept &strs

d8e2225

echeran dismissed stale reviews from sffc and Manishearth via d8e2225 September 6, 2022 22:14

echeran mentioned this pull request Sep 6, 2022

Support enum discriminants with negative values rust-diplomat/diplomat#254

Closed

Update Diplomat + apply regen

420c083

Manishearth reviewed Sep 7, 2022


echeran added 3 commits September 7, 2022 07:52

Update diplomat-coverage

ec744bd

Apply cargo fmt

64fc9ab

Merge branch 'main' into ffi-collator

0998147

Manishearth previously approved these changes Sep 7, 2022


Apply diplomat-gen after merging latest from main

6f70fec

echeran dismissed Manishearth’s stale review via 6f70fec September 7, 2022 18:04

echeran requested review from hsivonen and Manishearth September 7, 2022 18:53

Manishearth approved these changes Sep 7, 2022


echeran merged commit feaa5c0 into unicode-org:main Sep 7, 2022

echeran mentioned this pull request Sep 7, 2022

Provide a C++ interface to the collator #2218

Closed

kelebra pushed a commit to kelebra/icu4x that referenced this pull request Sep 8, 2022

Add FFI bindings for Collator (unicode-org#2498)

3356eb4

echeran deleted the ffi-collator branch November 23, 2023 04:05
