Support case-changes to Annotated{String,Char}s #54013

Merged
tecosaur merged 1 commit into JuliaLang:master from annotated-case-changes
Apr 18, 2024

@tecosaur tecosaur commented Apr 9, 2024

This PR covers an arguably overlooked area: it adds specialised methods for some of the functions in unicode.jl, namely the case-changing functions and textwidth. The case-changing functions now all preserve annotations, and the textwidth specialisation is roughly 12x faster in some basic local benchmarks.

See the commit message for (many) more details.

Screenshot

NB: ſ/S and ⱥ/Ⱥ have a different number of codeunits.

@tecosaur tecosaur added the strings and backport 1.11 labels Apr 9, 2024
tecosaur commented Apr 9, 2024

Oh, I'll add some test cases for this tomorrow.

@tecosaur tecosaur force-pushed the annotated-case-changes branch 2 times, most recently from 740a230 to 0f4fc12 Compare April 10, 2024 05:29
tecosaur commented

There we go, that should be a pretty decent test.

@tecosaur tecosaur force-pushed the annotated-case-changes branch from 0f4fc12 to a3eded6 Compare April 10, 2024 05:30
@tecosaur tecosaur added the status: waiting for PR reviewer label Apr 10, 2024
@fingolfin fingolfin left a comment

Looks good to me, thank you

base/strings/annotated.jl (outdated, resolved)
base/strings/unicode.jl (outdated, resolved)
@tecosaur tecosaur force-pushed the annotated-case-changes branch from a3eded6 to cd05c11 Compare April 10, 2024 17:32
fingolfin commented

Genuine CI error?

Error in testset strings/annotated:
Error During Test at /cache/build/tester-amdci4-12/julialang/julia-master/julia-cd05c112d1/share/julia/test/strings/annotated.jl:111
  Got exception outside of a @test
  MethodError: no method matching (::Base.Unicode.var"#4#5")(::Char, ::@NamedTuple{startword::Bool, state::Base.RefValue{Int32}, c0::Base.AnnotatedChar{Char}, wordsep::ComposedFunction{typeof(!), typeof(isletter)}, strict::Bool})
  The function `#4` exists, but no method is defined for this combination of argument types.

Previously, any case change to an Annotated{String,Char} fell back to
the non-specialised methods, which return the non-annotated type and so
drop the annotations. It would be nice to keep the annotations, and that
can be done so long as we keep track of any change to the number of
bytes taken by each character on a case change. Such changes are
unusual, but they do happen with some letters (e.g. the upper case of
'ſ' is 'S').
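
The byte-count change described above can be checked with plain characters. This is a minimal sketch using only Base functions; `bytedelta` is a hypothetical helper name introduced here for illustration and is not part of this PR.

```julia
# Hypothetical helper (not part of the PR): how many code units (bytes)
# a character gains or loses when the case function `f` is applied to it.
bytedelta(f, c::Char) = ncodeunits(f(c)) - ncodeunits(c)

for (c, f) in (('ſ', uppercase), ('Ⱥ', lowercase), ('a', uppercase))
    d = f(c)
    println(repr(c), " (", ncodeunits(c), " bytes) -> ",
            repr(d), " (", ncodeunits(d), " bytes), delta = ", bytedelta(f, c))
end
```

A non-zero delta is exactly the case where annotation ranges that come after the character have to be shifted.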

To handle this, a helper function annotated_chartransform is introduced.
This allows for efficient uppercase/lowercase methods (about 50%
overhead in managing the annotation ranges, compared to just
transforming a String). The {upper,lower}casefirst and titlecase
transformations are considerably less efficient with this style of
implementation, but not prohibitively so. If somebody has a bright idea,
or these emerge as an area deserving of more attention, the performance
characteristics can be improved.
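
To make the idea of tracking byte offsets concrete, here is a rough sketch that works on a plain String plus a list of byte spans. It is not the `Base.annotated_chartransform` implementation from this PR: the name `transform_with_spans` is hypothetical, and it assumes each span is a byte range that starts on the first byte of a character and ends on the last byte of a character.

```julia
# Hypothetical illustration only; NOT Base.annotated_chartransform.
# Assumption: every span in `spans` covers whole characters of `s`,
# measured in bytes (code units).
function transform_with_spans(f, s::String, spans::Vector{UnitRange{Int}})
    n = ncodeunits(s)
    # newstart[i] is the new first byte of the character that starts at old
    # byte i; entries for bytes inside a character stay 0 and are never read
    # under the assumption above.
    newstart = zeros(Int, n + 1)
    out = IOBuffer()
    for (i, c) in pairs(s)          # i is the old byte index where c starts
        newstart[i] = position(out) + 1
        print(out, f(c))            # the transformed char may differ in width
    end
    newstart[n + 1] = position(out) + 1   # one past the end of the new string
    # A span a:b becomes (new start of a):(new start of the char after b) - 1.
    newspans = [newstart[first(r)]:(newstart[last(r) + 1] - 1) for r in spans]
    return String(take!(out)), newspans
end

println(transform_with_spans(uppercase, "ſa", [1:2, 3:3]))
println(transform_with_spans(lowercase, "Ⱥb", [1:2, 3:3]))
```

In the first call the two-byte 'ſ' shrinks to the one-byte 'S', so the second span moves left by one byte. In the second call 'Ⱥ' grows from two bytes to three, so the first span widens and the second moves right.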

As a bonus, a specialised textwidth method is implemented to avoid the
generic fallback, providing a ~12x performance improvement.

To check that annotated_chartransform is accurate, as are the
specialised case-transformations, a few million random collections of
strings were pre- and post-annotated and checked to be the same in a
fuzzing check performed with Supposition.jl.

    using Supposition            # provides Data and @check
    using Base: AnnotatedString

    # Generators: short random strings, small vectors of them, and a case function.
    const short_str = Data.Text(Data.Characters(), max_len=20)
    const short_strs = Data.Vectors(short_str, max_size=10)
    const case_transform_fn = Data.SampledFrom((uppercase, lowercase))

    # Property: transforming the joined annotated strings must give the same
    # result as annotating the individually transformed strings and joining.
    function annot_caseinvariant(f::Function, strs::Vector{String})
        annot_strs =
            map(((i, s),) -> AnnotatedString(s, [(1:ncodeunits(s), :i => i)]),
                enumerate(strs))
        f_annot_strs =
            map(((i, s),) -> AnnotatedString(s, [(1:ncodeunits(s), :i => i)]),
                enumerate(map(f, strs)))
        pre_join = Base.annotated_chartransform(join(annot_strs), f)
        post_join = join(f_annot_strs)
        pre_join == post_join
    end

    @check max_examples=1_000_000 annot_caseinvariant(case_transform_fn, short_strs)

This helped me determine that in annotated_chartransform the "- 1" was
needed in the offset position calculation, and that in the "findlast"
calls less than *or equal* was the correct comparison.
@tecosaur tecosaur force-pushed the annotated-case-changes branch from cd05c11 to 6fb0d5a Compare April 11, 2024 03:14
tecosaur commented

Ah, looks like the parentheses I put around the do arguments are actually unwanted.

@tecosaur tecosaur removed the status: waiting for PR reviewer label Apr 11, 2024
@KristofferC KristofferC mentioned this pull request Apr 17, 2024
59 tasks
@tecosaur tecosaur merged commit 38a9725 into JuliaLang:master Apr 18, 2024
8 checks passed
KristofferC pushed a commit that referenced this pull request Apr 25, 2024
(cherry picked from commit 38a9725)
@tecosaur tecosaur deleted the annotated-case-changes branch May 2, 2024 09:45
KristofferC added a commit that referenced this pull request May 28, 2024
Backported PRs:
- [x] #53665 <!-- use afoldl instead of tail recursion for tuples -->
- [x] #53976 <!-- LinearAlgebra: LazyString in interpolated error
messages -->
- [x] #54005 <!-- make `view(::Memory, ::Colon)` produce a Vector -->
- [x] #54010 <!-- Overload `Base.literal_pow` for `AbstractQ` -->
- [x] #54069 <!-- Allow PrecompileTools to see MI's inferred by foreign
abstract interpreters -->
- [x] #53750 <!-- inference correctness: fields and globals can revert
to undef -->
- [x] #53984 <!-- Profile: fix heap snapshot is valid char check -->
- [x] #54102 <!-- Explicitly compute stride in unaliascopy for SubArray
-->
- [x] #54070 <!-- Fix integer overflow in `skip(s::IOBuffer,
typemax(Int64))` -->
- [x] #54013 <!-- Support case-changes to Annotated{String,Char}s -->
- [x] #53941 <!-- Fix writing of AnnotatedChars to AnnotatedIOBuffer -->
- [x] #54137 <!-- Fix typo in docs for `partialsortperm` -->
- [x] #54129 <!-- use correct size when creating output data from an
IOBuffer -->
- [x] #54153 <!-- Fixup IdSet docstring -->
- [x] #54143 <!-- Fix `make install` from tarballs -->
- [x] #54151 <!-- LinearAlgebra: Correct zero element in
`_generic_matvecmul!` for block adj/trans -->
- [x] #54213 <!-- Add `public` statement to `Base.GC` -->
- [x] #54222 <!-- Utilize correct tbaa when emitting stores of unions.
-->
- [x] #54233 <!-- set MAX_OS_WRITE on unix -->
- [x] #54255 <!-- fix `_checked_mul_dims` in the presence of 0s and
overflow. -->
- [x] #54259 <!-- Fix typo in `readuntil` -->
- [x] #54251 <!-- fix typo in gc_mark_memory8 when chunking a large
array -->
- [x] #54276 <!-- Fix solve for complex `Hermitian` with non-vanishing
imaginary part on diagonal -->
- [x] #54248 <!-- ensure package callbacks are invoked when no valid
precompile file exists for an "auto loaded" stdlib -->
- [x] #54308 <!-- Implement eval-able AnnotatedString 2-arg show -->
- [x] #54302 <!-- Specialised substring equality for annotated strs -->
- [x] #54243 <!-- prevent `package_callbacks` to run multiple time for a
single package -->
- [x] #54350 <!-- add a precompile signature to Artifacts code that is
used by JLLs -->
- [x] #54331 <!-- correctly track freed bytes in
jl_genericmemory_to_string -->
- [x] #53509 <!-- revert moving "creating packages" from Pkg.jl -->
- [x] #54335 <!-- When accessing the data pointer for an array, first
decay it to a Derived Pointer -->
- [x] #54239 <!-- Make sure `fieldcount` constant-folds for `Tuple{...}`
-->
- [x] #54288
- [x] #54067
- [x] #53715 <!-- Add read/write specialisation for IOContext{AnnIO} -->
- [x] #54289 <!-- Rework annotation ordering/optimisations -->
- [x] #53815 <!-- create phantom task for GC threads -->
- [x] #54130 <!-- inference: handle `LimitedAccuracy` in
`handle_global_assignment!` -->
- [x] #54428 <!-- Move ConsoleLogging.jl into Base -->
- [x] #54332 <!-- Revert "add unsetindex support to more copyto methods
(#51760)" -->
- [x] #53826 <!-- Make all command-line options documented in all
related files -->
- [x] #54465 <!-- typeintersect: conservative typevar subtitution during
`finish_unionall` -->
- [x] #54514 <!-- typeintersect: followup cleanup for the nothrow path
of type instantiation -->
- [x] #54499 <!-- make `@doc x` work without REPL loaded -->
- [x] #54210 <!-- attach finalizer in `mmap` to the correct object -->
- [x] #54359 <!-- Pkg REPL: cache `pkg_mode` lookup -->

Non-merged PRs with backport label:
- [ ] #54471 <!-- Actually setup jit targets when compiling
packageimages instead of targeting only one -->
- [ ] #54457 <!-- Make `String(::Memory)` copy -->
- [ ] #54323 <!-- inference: fix too conservative effects for recursive
cycles -->
- [ ] #54322 <!-- effects: add new `@consistent_overlay` macro -->
- [ ] #54191 <!-- make `AbstractPipe` public -->
- [ ] #53957 <!-- tweak how filtering is done for what packages should
be precompiled -->
- [ ] #53882 <!-- Warn about cycles in extension precompilation -->
- [ ] #53707 <!-- Make ScopedValue public -->
- [ ] #53452 <!-- RFC: allow Tuple{Union{}}, returning Union{} -->
- [ ] #53402 <!-- Add `jl_getaffinity` and `jl_setaffinity` -->
- [ ] #53286 <!-- Raise an error when using `include_dependency` with
non-existent file or directory -->
- [ ] #52694 <!-- Reinstate similar for AbstractQ for backward
compatibility -->
- [ ] #51479 <!-- prevent code loading from lookin in the versioned
environment when building Julia -->
@KristofferC KristofferC removed the backport 1.11 label May 28, 2024