[StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty behavior regarding UTF-16 surrogates #1327

IBue · 2024-12-03T09:42:48Z

~~depends/stacked on #1326 (tests only)~~

Both signatures of StringUtils::indexOfAnyBut currently behave
inconsistently in matching UTF-16 supplementary characters and single
UTF-16 surrogate characters (i.e. paired and unpaired surrogates), since
they differ unnecessarily in their algorithmic implementations, use
their own incomplete and faulty interpretation of UTF-16 and don't take
full advantage of the standard library.

The example cases below show that they may yield contradictory results
or correct results for the wrong reasons.

This proposal gives a unified algorithmic implementation of both
signatures that

a) is much easier to grasp due to a clear mathematical set approach and
safe iteration and doesn't become entangled in index arithmetic;
stresses the set semantics of the 2nd argument
b) fully relies on the standard library for defined UTF-16
handling/interpretation;
paired surrogates are merged into one codepoint, unpaired surrogates
are left as they are
c) scales much better with input sizes and result index position
d) can benefit from current and future improvements in the standard
library and JVM
(streams implementation, parallelization, JIT optimization, JEP 218, ???…)

The algorithm boils down to:
find index i of first char in cs such that
~~(cs.codePointAt(i) ∈ {x ∈ codepoints(cs) ∣ x ∉ codepoints(searchChars) })~~
(cs.codePointAt(i) ∉ { x ∈ codepoints(searchChars) })

Examples:

<H>: high-surrogate character
<L>: low-surrogate character
(<H><L>): valid supplementary character
signature 1: StringUtils::indexOfAnyBut(final CharSequence seq, final CharSequence searchChars)
signature 2: StringUtils::indexOfAnyBut(final CharSequence cs, final char... searchChars)

Case 1: matching of unpaired high-surrogate

	seq/cs	searchChars	expected/new	sig.1	sig.2
1.1	`<H>aaaa`	`<H>abcd`	!found	!found	!found
				sig.1: 'a' is somewhere in `searchChars`	sig.2: 'a' happens to follow `<H>` in `searchChars`
1.2	`<H>baaa`	`<H>abcd`	!found	!found	0
				sig.1: 'b' is somewhere in searchChars
1.3	`<H>aaaa`	`(<H><L>)abcd`	0	!found	0
				sig.1: 'a' is somewhere in searchChars
1.4	`aaaa<H>`	`(<H><L>)abcd`	4	!found	!found

Case 2: matching of unpaired low-surrogate

	seq/cs	searchChars	expected/new	sig.1	sig.2
2.1	`<L>aaaa`	`(<H><L>)abcd`	0	!found	!found
					sig.1+2 don't interpret suppl. character
2.2	`aaaa<L>`	`(<H><L>)abcd`	4	!found	!found
					sig.1+2 don't interpret suppl. character

Case 3: matching of supplementary character

	seq/cs	searchChars	expected/new	sig.1	sig.2
3.1	`(<H><L>)aaaa`	`<L>ab<H>cd`	0	!found	0
				sig.1: `<L>` is somewhere in `searchChars`
3.2	`(<H><L>)aaaa`	`abcd`	0	1	0
				sig.1 always points to low-surrogate of (fully) unmatched suppl. character
3.3	`(<H><L>)aaaa`	`abcd<H>`	0	0	1
3.4	`(<H><L>)aaaa`	`abcd<L>`	0	!found	0
				sig.1: `<H>` skipped by algorithm

garydgregory

Hello @IBue

Please rebate on git master and see my comments. It's hard to read the changes since some changes make the PR noisier than it has to be, like unnecessary parameter name changes. Also the PR contains duplicate work from your other PR.

TY!

garydgregory · 2024-12-04T12:36:36Z

src/main/java/org/apache/commons/lang3/StringUtils.java

@@ -2829,41 +2835,23 @@ public static int indexOfAny(final CharSequence cs, final String searchChars) {

     * </pre>
     *
-     * @param cs  the CharSequence to check, may be null
+     * @param srcChars  the CharSequence to check, may be null


Don't rename parameters in methods, it makes the PR noisy and makes it harder to review.

garydgregory · 2024-12-04T12:38:45Z

src/test/java/org/apache/commons/lang3/Supplementary.java

     */
-    static final String CharU20000 = "\uD840\uDC00";
+    static final String CharU20000SupplCharLow = "\uDC00";


These changes duplicate your changes in your other PR. Rebase on git master and don't make these changes here.

garydgregory · 2024-12-04T12:40:12Z

src/main/java/org/apache/commons/lang3/StringUtils.java

-    public static int indexOfAnyBut(final CharSequence seq, final CharSequence searchChars) {
-        if (isEmpty(seq) || isEmpty(searchChars)) {
+    public static int indexOfAnyBut(final CharSequence srcChars, final CharSequence searchChars) {
+        if (srcChars == null || searchChars == null || srcChars.length() == 0 || searchChars.length() == 0) {


Not sure why you inlined these calls. Keep the PR to only the required changes to fix a bug.

garydgregory · 2024-12-04T12:42:09Z

src/main/java/org/apache/commons/lang3/StringUtils.java

-                }
-            } else if (!chFound) {
-                return i;
+        final Set<Integer> srcSetCodePoints = srcChars.codePoints().boxed().collect(Collectors.toSet()); // >=10: Collectors::toUnmodifiableSet


What does comments like >=10, 11, ... mean here? Can you be more explicit? Are these magic numbers?

IBue · 2024-12-05T07:13:44Z

I have rebased onto master and have followed your comments.

garydgregory · 2024-12-05T20:56:59Z

This one needs a second pair of eyes... @ppkarwasz @chtompki @struberg ?

One big difference is that the current implementation does not allocate any extra memory. I wonder what happens to performance of existing apps that call these methods on large strings.

garydgregory · 2024-12-05T21:34:50Z

I have rebased onto master and have followed your comments.

Hi @IBue

It looks like you forgot to run mvn (by itself) before pushing, see the checkstyle error in StringUtils.

src/main/java/org/apache/commons/lang3/StringUtils.java

IBue · 2024-12-06T19:49:46Z

note on performance:
boxing is the main responsible for increase in memory footprint here, that is +3GB for a 100M input sequence,
but the current implementation takes already a 10² order of magnitude more runtime than the proposed one for a 10M input sequence (minutes!) and a result index at half of the input size.

ppkarwasz · 2024-12-06T20:06:14Z

@IBue,

note on performance: boxing is the main responsible for increase in memory footprint here, that is +3GB for a 100M input sequence, but the current implementation takes already a 10² order of magnitude more runtime than the proposed one for a 10M input sequence (minutes!) and a result index at half of the input size.

Can you tell us more on the usage you have for this method? That will tell us a lot on how it should be optimized.

A common usage for this method could be extracting contiguous series of digits from a long string. With series of 10-20 digits it doesn't really make sense to go parallel, but if you want to validate a 100M input to make sure it only contains digits, you can sacrifice more memory.

Note: The indexOfAnyBut method can have multiple implementations and switched between them based on the input size.

garydgregory · 2024-12-14T12:55:46Z

@IBue
Please reply to @ppkarwasz's question:

Can you tell us more on the usage you have for this method? That will tell us a lot on how it should be optimized.

IBue · 2024-12-18T01:26:32Z

I have transformed the ListIterator loop into the proposed simplified index-based one.
Does anyone know if the index has to be protected against overflow?

Can you tell us more on the usage you have for this method? That will tell us a lot on how it should be optimized.

Actually, I don't have productive usage for this (yet). I stumbled upon the problem in the context of UTF-16 and learning some Java.

ppkarwasz

@IBue,

This looks good to me.

src/test/java/org/apache/commons/lang3/StringUtilsEqualsIndexOfTest.java

garydgregory · 2024-12-19T13:26:45Z

@IBue @ppkarwasz
Another issue here is that for every single step in the for loop, the PR does a sequential search in a collection. Is it worth doing a binary search instead (at the cost for sorting the array)? Like this:

    public static int indexOfAnyBut(final CharSequence seq, final CharSequence searchChars) {
        if (isEmpty(seq) || isEmpty(searchChars)) {
            return INDEX_NOT_FOUND;
        }
        final int[] codePoints = searchChars.codePoints().sorted().toArray();
        // advance character index from one interpreted codepoint to the next
        for (int curSeqCharIdx = 0; curSeqCharIdx < seq.length();) {
            final int curSeqCodePoint = Character.codePointAt(seq, curSeqCharIdx);
            if (Arrays.binarySearch(codePoints, curSeqCodePoint) < 0) {
                return curSeqCharIdx;
            }
            curSeqCharIdx += Character.charCount(curSeqCodePoint); // skip indices to paired low-surrogates
        }  
        return INDEX_NOT_FOUND;
    }

ppkarwasz · 2024-12-19T13:41:42Z

@garydgregory,

Another issue here is that for every single step in the for loop, the PR does a sequential search in a collection. Is it worth doing a binary search instead (at the cost for sorting the array)? Like this:

Currently it is a lookup in a Set (not a generic collection):

commons-lang/src/main/java/org/apache/commons/lang3/StringUtils.java

Lines 2877 to 2887 in e0e3f08

    
           final Set<Integer> searchSetCodePoints = searchChars.codePoints() 
        
                   .boxed().collect(Collectors.toSet()); // JDK >=10: Collectors::toUnmodifiableSet 
        
           // advance character index from one interpreted codepoint to the next 
        
           for (int curSeqCharIdx = 0; curSeqCharIdx < seq.length();) { 
        
               final int curSeqCodePoint = Character.codePointAt(seq, curSeqCharIdx); 
        
               if (!searchSetCodePoints.contains(curSeqCodePoint)) { 
        
                   return curSeqCharIdx; 
        
               } 
        
               curSeqCharIdx += Character.charCount(curSeqCodePoint); // skip indices to paired low-surrogates 
        
           } 
        
           return INDEX_NOT_FOUND;

garydgregory · 2025-01-05T14:09:31Z

@IBue Would you please rebase on git master?

…behaviour regarding UTF-16 surrogates Both signatures of StringUtils::indexOfAnyBut currently behave inconsistently in matching UTF-16 supplementary characters and single UTF-16 surrogate characters (i.e. paired and unpaired surrogates), since they differ unnecessarily in their algorithmic implementations, use their own incomplete and faulty interpretation of UTF-16 and don't take full advantage of the standard library. The example cases below show that they may yield contradictory results or correct results for the wrong reasons. This proposal gives a unified algorithmic implementation of both signatures that a) is much easier to grasp due to a clear mathematical set approach and safe iteration and doesn't become entangled in index arithmetic; stresses the set semantics of the 2nd argument b) fully relies on the standard library for defined UTF-16 handling/interpretation; paired surrogates are merged into one codepoint, unpaired surrogates are left as they are c) scales much better with input sizes and result index position d) can benefit from current and future improvements in the standard library and JVM (streams implementation, parallelization, JIT optimization, JEP 218, ???…) The algorithm boils down to: find index i of first char in cs such that (cs.codePointAt(i) ∈ {x ∈ codepoints(cs) ∣ x ∉ codepoints(searchChars) }) Examples: --------- <H>: high-surrogate character <L>: low-surrogate character (<H><L>): valid supplementary character signature 1: StringUtils::indexOfAnyBut(final CharSequence seq, final CharSequence searchChars) signature 2: StringUtils::indexOfAnyBut(final CharSequence cs, final char... searchChars) Case 1: matching of unpaired high-surrogate ---------seq/cs-------searchChars------exp./new-----sig.1-------sig.2--- 1.1 <H>aaaa <H>abcd !found !found !found sig.2: 'a' happens to follow <H> in searchChars; sig.1: 'a' is somewhere in searchChars 1.2 <H>baaa <H>abcd !found !found 0 sig.1: 'b' is somewhere in searchChars 1.3 <H>aaaa (<H><L>)abcd 0 !found 0 sig.1: 'a' is somewhere in searchChars 1.4 aaaa<H> (<H><L>)abcd 4 !found !found sig.1+2 don't interpret suppl. character Case 2: matching of unpaired low-surrogate ---------seq/cs-------searchChars------exp./new-----sig.1-------sig.2--- 2.1 <L>aaaa (<H><L>)abcd 0 !found !found sig.1+2 don't interpret suppl. character 2.2 aaaa<L> (<H><L>)abcd 4 !found !found sig.1+2 don't interpret suppl. character Case 3: matching of supplementary character ---------seq/cs-------------searchChars-----exp./new----sig.1-----sig.2- 3.1 (<H><L>)aaaa <L>ab<H>cd 0 !found 0 sig.1: <L> is somewhere in searchChars 3.2 (<H><L>)aaaa abcd 0 1 0 sig.1 always points to low-surrogate of (fully) unmatched suppl. character 3.3 (<H><L>)aaaa abcd<H> 0 0 1 3.4 (<H><L>)aaaa abcd<L> 0 !found 0 sig.1: <H> skipped by algorithm

by simplifying set consideration: find index i of first char in seq such that (seq.codePointAt(i) ∉ { x ∈ codepoints(searchChars) })

by transforming ListIterator loop into index-based loop, advancing by Character.charCount(codepoint); enabling short-circuit processing, avoiding full in-advance processing of input-sequence

providing a single source-of-truth (arguments stream) for the two function variants

Set::contains of immutable Set has unclear desastrous performance issues when searching for large values (here: >0xffff) in a set of smaller values (including JDK 23)

IBue · 2025-01-06T21:48:34Z

I have parameterized the test functions and rebased onto master.

(You made premature, conflicting changes without proper comment in 1a89cca )

ppkarwasz

LGTM

garydgregory · 2025-01-06T22:23:38Z

@IBue @ppkarwasz
Merged! TY for the work. Do you see where else Lang might have issues with UTF-16 surrogates?

…vior regarding UTF-16 surrogates #1327

IBue · 2025-01-13T01:25:02Z

Thank you for reviewing and valuable feedback!
I am preparing the transfer to StringUtils:indexOfAny; and I think there are still quite a few functions that need reconsideration regarding UTF-16 correctness and scalability.

garydgregory · 2025-01-14T20:33:07Z

@IBue
Let us know what you find 👀

garydgregory requested changes Dec 4, 2024

View reviewed changes

IBue force-pushed the pr-indexOfAnyBut-redesign branch from f74224a to 70ae28d Compare December 5, 2024 07:05

ppkarwasz requested changes Dec 6, 2024

View reviewed changes

src/main/java/org/apache/commons/lang3/StringUtils.java Outdated Show resolved Hide resolved

src/main/java/org/apache/commons/lang3/StringUtils.java Outdated Show resolved Hide resolved

src/main/java/org/apache/commons/lang3/StringUtils.java Outdated Show resolved Hide resolved

IBue force-pushed the pr-indexOfAnyBut-redesign branch from 70ae28d to 0e372a6 Compare December 6, 2024 18:57

IBue requested review from ppkarwasz and garydgregory December 6, 2024 19:00

ppkarwasz reviewed Dec 6, 2024

View reviewed changes

src/main/java/org/apache/commons/lang3/StringUtils.java Outdated Show resolved Hide resolved

ppkarwasz approved these changes Dec 19, 2024

View reviewed changes

src/test/java/org/apache/commons/lang3/StringUtilsEqualsIndexOfTest.java Outdated Show resolved Hide resolved

src/test/java/org/apache/commons/lang3/StringUtilsEqualsIndexOfTest.java Outdated Show resolved Hide resolved

IBue added 5 commits January 6, 2025 22:12

[StringUtils::indexOfAnyBut] further reduction of algorithm

0c9920a

by simplifying set consideration: find index i of first char in seq such that (seq.codePointAt(i) ∉ { x ∈ codepoints(searchChars) })

[StringUtils::indexOfAnyBut] simplify input-sequence iteration

13a2bde

by transforming ListIterator loop into index-based loop, advancing by Character.charCount(codepoint); enabling short-circuit processing, avoiding full in-advance processing of input-sequence

[StringUtils:indexOfAnyBut] parameterization of test functions

b585112

providing a single source-of-truth (arguments stream) for the two function variants

[StringUtils:indexOfAnyBut] remove comment

915ee33

Set::contains of immutable Set has unclear desastrous performance issues when searching for large values (here: >0xffff) in a set of smaller values (including JDK 23)

IBue force-pushed the pr-indexOfAnyBut-redesign branch from e0e3f08 to 915ee33 Compare January 6, 2025 21:45

ppkarwasz approved these changes Jan 6, 2025

View reviewed changes

garydgregory merged commit 665f047 into apache:master Jan 6, 2025
18 of 19 checks passed

garydgregory added a commit that referenced this pull request Jan 6, 2025

[StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty beha…

f5da9b7

…vior regarding UTF-16 surrogates #1327

garydgregory changed the title ~~[StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty behaviour regarding UTF-16 surrogates~~ [StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty behavior regarding UTF-16 surrogates Jan 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty behavior regarding UTF-16 surrogates #1327

[StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty behavior regarding UTF-16 surrogates #1327

IBue commented Dec 3, 2024 •

edited

Loading

garydgregory left a comment •

edited

Loading

garydgregory Dec 4, 2024

garydgregory Dec 4, 2024

garydgregory Dec 4, 2024

garydgregory Dec 4, 2024

IBue commented Dec 5, 2024

garydgregory commented Dec 5, 2024

garydgregory commented Dec 5, 2024

IBue commented Dec 6, 2024

ppkarwasz commented Dec 6, 2024

garydgregory commented Dec 14, 2024

IBue commented Dec 18, 2024

ppkarwasz left a comment

garydgregory commented Dec 19, 2024 •

edited

Loading

ppkarwasz commented Dec 19, 2024

garydgregory commented Jan 5, 2025

IBue commented Jan 6, 2025 •

edited

Loading

ppkarwasz left a comment

garydgregory commented Jan 6, 2025

IBue commented Jan 13, 2025

garydgregory commented Jan 14, 2025

[StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty behavior regarding UTF-16 surrogates #1327

[StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty behavior regarding UTF-16 surrogates #1327

Conversation

IBue commented Dec 3, 2024 • edited Loading

Examples:

garydgregory left a comment • edited Loading

Choose a reason for hiding this comment

garydgregory Dec 4, 2024

Choose a reason for hiding this comment

garydgregory Dec 4, 2024

Choose a reason for hiding this comment

garydgregory Dec 4, 2024

Choose a reason for hiding this comment

garydgregory Dec 4, 2024

Choose a reason for hiding this comment

IBue commented Dec 5, 2024

garydgregory commented Dec 5, 2024

garydgregory commented Dec 5, 2024

IBue commented Dec 6, 2024

ppkarwasz commented Dec 6, 2024

garydgregory commented Dec 14, 2024

IBue commented Dec 18, 2024

ppkarwasz left a comment

Choose a reason for hiding this comment

garydgregory commented Dec 19, 2024 • edited Loading

ppkarwasz commented Dec 19, 2024

garydgregory commented Jan 5, 2025

IBue commented Jan 6, 2025 • edited Loading

ppkarwasz left a comment

Choose a reason for hiding this comment

garydgregory commented Jan 6, 2025

IBue commented Jan 13, 2025

garydgregory commented Jan 14, 2025

IBue commented Dec 3, 2024 •

edited

Loading

garydgregory left a comment •

edited

Loading

garydgregory commented Dec 19, 2024 •

edited

Loading

IBue commented Jan 6, 2025 •

edited

Loading