Document encode_utf16() endianness; maybe add endianness and BOM options. #83102

Closed
BartMassey opened this issue Mar 14, 2021 · 2 comments · Fixed by #136283
Labels
A-docs Area: Documentation for any part of the project, including the compiler, standard library, and tools C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@BartMassey
Contributor

BartMassey commented Mar 14, 2021

The documentation does not specify the endianness of str::encode_utf16() and char::encode_utf16(): it looks from the source like they are big-endian (UTF-16BE), but I may be reading it wrong and they are little-endian (UTF-16LE) or native-endian.

This may be a deliberate design decision: if so I think it should be reconsidered, as the encoding is useless for some purposes if you don't know its endianness.

It would also be nice to indicate whether str::encode_utf16() inserts a byte-order mark (BOM): pretty sure it does not from the source, which is fine.
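As a rough illustration of the BOM point: the iterator returned by str::encode_utf16() yields only the code units of the string itself, so a caller who wants a BOM has to add U+FEFF manually. The helper name `encode_utf16_with_bom` below is hypothetical and not part of std; this is only a sketch of one way to do it.

```rust
/// Hypothetical helper (not in std): encode `s` as UTF-16 code units,
/// prepending U+FEFF as a byte-order mark.
fn encode_utf16_with_bom(s: &str) -> Vec<u16> {
    // `encode_utf16` itself emits no BOM, so chain one in front.
    std::iter::once(0xFEFF_u16)
        .chain(s.encode_utf16())
        .collect()
}

/// Plain encoding for comparison: just the code units, no BOM.
fn encode_utf16_plain(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}
```

For the input "hi", the plain version starts directly with the code unit for 'h', while the BOM version has 0xFEFF in front of it.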

It is probably too late to rename these functions or to add equivalents of opposite endianness at this point, which is too bad. It's an odd API given that the corresponding decode functions have little-endian and big-endian variants.

@ChrisDenton
Member

encode_utf16 is using the platform's native endian. This is made clear when a u32 is cast directly to a u16 without converting the endian. I do agree that it may be good to explicitly document this.

pub fn encode_utf16_raw(mut code: u32, dst: &mut [u16]) -> &mut [u16] {
    // SAFETY: each arm checks whether there are enough bits to write into
    unsafe {
        if (code & 0xFFFF) == code && !dst.is_empty() {
            // The BMP falls through
            *dst.get_unchecked_mut(0) = code as u16;

The decode functions also assume native endian UTF-16.
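Because the decode side works on u16 values rather than bytes, input that arrives as bytes in a known byte order has to be turned into u16 values first. A minimal sketch, assuming big-endian input; the helper name `decode_utf16_be` is made up for illustration and is not a std API:

```rust
/// Hypothetical helper (not in std): decode UTF-16BE bytes into a String.
/// Returns None on an odd byte count or on invalid UTF-16 (e.g. a lone surrogate).
fn decode_utf16_be(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    // Reassemble each byte pair into a u16 in the platform's native
    // representation, interpreting the bytes as big-endian.
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    // String::from_utf16 expects plain u16 values, with no byte order attached.
    String::from_utf16(&units).ok()
}
```

Swapping `u16::from_be_bytes` for `u16::from_le_bytes` gives the little-endian equivalent.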

This makes sense as a default. If necessary, endian conversion can be done before decoding or after encoding by mapping the &[u16] slice to the required endian.
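For example, the conversion after encoding could look roughly like this. The two helper names are hypothetical; the only std calls used are str::encode_utf16, u16::to_le_bytes, and u16::to_be_bytes.

```rust
/// Hypothetical helper: serialize `s` as UTF-16LE bytes (no BOM).
fn encode_utf16_le_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16()
        // Each u16 code unit becomes two bytes, least significant first.
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}

/// Hypothetical helper: serialize `s` as UTF-16BE bytes (no BOM).
fn encode_utf16_be_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16()
        // Each u16 code unit becomes two bytes, most significant first.
        .flat_map(|unit| unit.to_be_bytes())
        .collect()
}
```

The u16 values coming out of encode_utf16() are the same on every platform; byte order only becomes a question at the point where they are written out as bytes, which is why the explicit to_le_bytes / to_be_bytes step is enough.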

@BartMassey
Contributor Author

> encode_utf16 is using the platform's native endian. This is made clear when a u32 is cast directly to a u16 without converting the endian.

Thanks. My read was too quick.

> I do agree that it may be good to explicitly document this.

I can submit a PR if folks like.

> The decode functions also assume native endian UTF-16.

I am now thoroughly confused, as usual. I swear I saw something with endianness somewhere in std, but I can't find it now.

Anyhow, I can add the documentation about endianness and the lack of a BOM in the appropriate spots. LMK what you think of me getting a PR together.

Thanks!

@Dylan-DPC Dylan-DPC added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. A-docs Area: Documentation for any part of the project, including the compiler, standard library, and tools labels Feb 21, 2023
workingjubilee added a commit to workingjubilee/rustc that referenced this issue Jan 31, 2025
Update encode_utf16 to mention it is native endian

Fixes rust-lang#83102
jhpratt added a commit to jhpratt/rust that referenced this issue Feb 1, 2025
Update encode_utf16 to mention it is native endian

Fixes rust-lang#83102
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Feb 1, 2025
Update encode_utf16 to mention it is native endian

Fixes rust-lang#83102
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Feb 2, 2025
Update encode_utf16 to mention it is native endian

Fixes rust-lang#83102
@bors bors closed this as completed in 198384c Feb 2, 2025
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Feb 2, 2025
Rollup merge of rust-lang#136283 - hkBst:patch-31, r=workingjubilee

Update encode_utf16 to mention it is native endian

Fixes rust-lang#83102

3 participants