proposal: unicode/utf8: provide generic versions of all functions #56948
Comments
Given the number of existing functions that operate on package bytes
func Of[Bytes ~[]byte | ~string](p Bytes) []byte
It seems to me that that would handle all of the cases where a function accepts a slice-or-string and returns only data parsed from it, including all of the
|
What's the difference between bytes.Of and a cast to []byte? |
@aarzilli, (Compare my |
What would happen if someone did try to mutate the return value of bytes.Of? The cast already doesn't make copies sometimes IIRC, wouldn't it be better to improve that? |
@bcmills Being able to mutate a string feels pretty unsafe to me. The motivation for this proposal comes from the fact that we can't efficiently call a non-mutating The only reason I'm trying to write a generic function that operates on both |
You can almost write a function yourself to convert to a
// toString returns a string with the same backing data as v. It is
// not safe to hold the returned string if the original was a []byte
// and might be modified.
func toString[T ~string | ~[]byte](v T) string {
	// This assumes that the first word of both the string and slice
	// header structs is a pointer to the data. Unfortunately, there's
	// no way to just get the data from either in an opaque way because
	// indexes into strings aren't addressable, for good reason.
	return unsafe.String(*(**byte)(unsafe.Pointer(&v)), len(v))
} |
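For illustration, here is one way the helper above could back a generic decode function. This is only a sketch: decodeRune is a hypothetical name, toString is repeated from the comment so the program compiles on its own, and unsafe.String requires Go 1.20 or later. The same caveat applies as above: the string must not outlive a []byte that might be modified.

```go
package main

import (
	"fmt"
	"unicode/utf8"
	"unsafe"
)

// toString is the helper from the comment above, repeated here so that
// this example is self-contained. It relies on the data pointer being
// the first word of both the string and the slice header.
func toString[T ~string | ~[]byte](v T) string {
	return unsafe.String(*(**byte)(unsafe.Pointer(&v)), len(v))
}

// decodeRune is a hypothetical generic wrapper (not part of unicode/utf8).
// The string view is used only for the duration of the call, so it is
// never held while the original []byte could change.
func decodeRune[T ~string | ~[]byte](v T) (rune, int) {
	return utf8.DecodeRuneInString(toString(v))
}

func main() {
	r, n := decodeRune("héllo")
	fmt.Println(string(r), n) // h 1

	r, n = decodeRune([]byte("世界"))
	fmt.Println(string(r), n) // 世 3
}
```

Because the view is created and consumed inside decodeRune, no copy of the input is made for either type.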
I just filed #57072 as a compiler optimization to somewhat obviate the need for this. |
Change https://go.dev/cl/469556 mentions this issue: |
This is part of the effort to reduce direct reliance on bytes.Buffer so that we can use a buffer with better pooling characteristics.

Unify these two methods as a single version that uses generics to reduce duplicated logic. Unfortunately, we lack a generic version of utf8.DecodeRune (see #56948), so we cast []byte to string. The []byte variant is slightly slower for multi-byte unicode since casting results in a stack-allocated copy operation. Fortunately, this code path is used only for TextMarshalers. We can also delete TestStringBytes, which exists to ensure that the two duplicate implementations remain in sync.

Performance:

name              old time/op    new time/op    delta
CodeEncoder       399µs ± 2%     409µs ± 2%     +2.59%  (p=0.000 n=9+9)
CodeEncoderError  450µs ± 1%     451µs ± 2%     ~       (p=0.684 n=10+10)
CodeMarshal       553µs ± 2%     562µs ± 3%     ~       (p=0.075 n=10+10)
CodeMarshalError  733µs ± 3%     737µs ± 2%     ~       (p=0.400 n=9+10)
EncodeMarshaler   24.9ns ±12%    24.1ns ±13%    ~       (p=0.190 n=10+10)
EncoderEncode     12.3ns ± 3%    14.7ns ±20%    ~       (p=0.315 n=8+10)

name              old speed      new speed      delta
CodeEncoder       4.87GB/s ± 2%  4.74GB/s ± 2%  -2.53%  (p=0.000 n=9+9)
CodeEncoderError  4.31GB/s ± 1%  4.30GB/s ± 2%  ~       (p=0.684 n=10+10)
CodeMarshal       3.51GB/s ± 2%  3.46GB/s ± 3%  ~       (p=0.075 n=10+10)
CodeMarshalError  2.65GB/s ± 3%  2.63GB/s ± 2%  ~       (p=0.400 n=9+10)

name              old alloc/op   new alloc/op   delta
CodeEncoder       327B ±347%     447B ±232%     +36.93% (p=0.034 n=9+10)
CodeEncoderError  142B ± 1%      143B ± 0%      ~       (p=1.000 n=8+7)
CodeMarshal       1.96MB ± 2%    1.96MB ± 2%    ~       (p=0.468 n=10+10)
CodeMarshalError  2.04MB ± 3%    2.03MB ± 1%    ~       (p=0.971 n=10+10)
EncodeMarshaler   4.00B ± 0%     4.00B ± 0%     ~       (all equal)
EncoderEncode     0.00B          0.00B          ~       (all equal)

name              old allocs/op  new allocs/op  delta
CodeEncoder       0.00           0.00           ~       (all equal)
CodeEncoderError  4.00 ± 0%      4.00 ± 0%      ~       (all equal)
CodeMarshal       1.00 ± 0%      1.00 ± 0%      ~       (all equal)
CodeMarshalError  6.00 ± 0%      6.00 ± 0%      ~       (all equal)
EncodeMarshaler   1.00 ± 0%      1.00 ± 0%      ~       (all equal)
EncoderEncode     0.00           0.00           ~       (all equal)

There is a very slight performance degradation for CodeEncoder due to an increase in allocation sizes. However, the number of allocations did not change. This is likely due to remote effects of the growth rate differences between bytes.Buffer and the builtin append function. We shouldn't overly rely on the growth rate of bytes.Buffer anyways since that is subject to possibly change in #51462. As the benchtime increases, the alloc/op goes down indicating that the amortized memory cost is fixed.

Updates #27735
Change-Id: Ie35e480e292fe082d7986e0a4d81212c1d4202b3
Reviewed-on: https://go-review.googlesource.com/c/go/+/469556
Run-TryBot: Joseph Tsai <[email protected]>
Reviewed-by: Bryan Mills <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Daniel Martí <[email protected]>
Auto-Submit: Joseph Tsai <[email protected]>
Sounds like this is on hold for better compiler optimizations before we can even consider whether this is a good API. |
#20881 is also another compiler optimization that would address the need for this. utf8.DecodeRune([]byte(in)) regardless of whether |
This proposal has been added to the active column of the proposals project |
Re utf8.DecodeRune([]byte(in)), sure but it would be even nicer to write utf8.DecodeRune(in) and not worry about whether the conversion allocates. |
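As a sketch of the conversion-based form discussed here: a type parameter constrained by ~[]byte | ~string can be converted to []byte, so the existing utf8.DecodeRune can be called from generic code today. firstRune and Path are hypothetical names used only for this example. Whether the conversion copies or allocates for string inputs depends on the compiler optimizations referenced in this thread, so no claim is made about that here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune is a hypothetical helper, not part of unicode/utf8.
// The conversion []byte(in) is legal because every type in the
// constraint's type set converts to []byte. For string inputs the
// conversion may copy unless the compiler can elide it.
func firstRune[Bytes ~[]byte | ~string](in Bytes) (rune, int) {
	return utf8.DecodeRune([]byte(in))
}

// Path is a hypothetical named string type, included to show that the
// ~ constraint also accepts named types.
type Path string

func main() {
	r, n := firstRune("héllo")
	fmt.Printf("%q %d\n", r, n)

	r, n = firstRune([]byte("é"))
	fmt.Printf("%q %d\n", r, n)

	r, n = firstRune(Path("/tmp"))
	fmt.Printf("%q %d\n", r, n)
}
```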
Placed on hold. |
Given that "math/rand/v2" set the precedent for v2 packages, one could now imagine a "unicode/utf8/v2" package that has a similar API, but with generic versions:
package utf8 // unicode/utf8/v2
const RuneError, RuneSelf, MaxRune, UTFMax = ...
func RuneLen(rune) int
func RuneStart(byte) bool
func ValidRune(rune) bool
func AppendRune([]byte, rune) []byte
func EncodeRune([]byte, rune) int
func DecodeRune[Bytes ~[]byte | ~string](b Bytes) (rune, int)
func DecodeLastRune[Bytes ~[]byte | ~string](b Bytes) (rune, int)
func FullRune[Bytes ~[]byte | ~string](b Bytes) bool
func RuneCount[Bytes ~[]byte | ~string](b Bytes) int
The first 4 constants are identical to today. Overall, the functionality of the package has been stable and there have been relatively few proposals or changes to utf8. The latest being:
Most of the open issues about utf8 are related to performance, which I believe is an orthogonal issue. |
I'm trying to write a generic function that operates on either string | []byte. However, I'm unable to do so since the implementation of that function needs to depend on either utf8.DecodeRune or utf8.DecodeRuneInString, but I'm unable to express that as a simple expression without using a type switch.
I propose we add generic versions of:
It is unclear what the name should be since the simpler names are already taken by the non-generic variants.
Perhaps we should have a v2 variant of utf8 that operates on either type.