-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty behavior regarding UTF-16 surrogates #1327
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @IBue
Please rebate on git master and see my comments. It's hard to read the changes since some changes make the PR noisier than it has to be, like unnecessary parameter name changes. Also the PR contains duplicate work from your other PR.
TY!
@@ -2829,41 +2835,23 @@ public static int indexOfAny(final CharSequence cs, final String searchChars) { | |||
|
|||
* </pre> | |||
* | |||
* @param cs the CharSequence to check, may be null | |||
* @param srcChars the CharSequence to check, may be null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't rename parameters in methods, it makes the PR noisy and makes it harder to review.
*/ | ||
static final String CharU20000 = "\uD840\uDC00"; | ||
static final String CharU20000SupplCharLow = "\uDC00"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes duplicate your changes in your other PR. Rebase on git master and don't make these changes here.
public static int indexOfAnyBut(final CharSequence seq, final CharSequence searchChars) { | ||
if (isEmpty(seq) || isEmpty(searchChars)) { | ||
public static int indexOfAnyBut(final CharSequence srcChars, final CharSequence searchChars) { | ||
if (srcChars == null || searchChars == null || srcChars.length() == 0 || searchChars.length() == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why you inlined these calls. Keep the PR to only the required changes to fix a bug.
} | ||
} else if (!chFound) { | ||
return i; | ||
final Set<Integer> srcSetCodePoints = srcChars.codePoints().boxed().collect(Collectors.toSet()); // >=10: Collectors::toUnmodifiableSet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does comments like >=10, 11, ... mean here? Can you be more explicit? Are these magic numbers?
f74224a
to
70ae28d
Compare
I have rebased onto |
This one needs a second pair of eyes... @ppkarwasz @chtompki @struberg ? One big difference is that the current implementation does not allocate any extra memory. I wonder what happens to performance of existing apps that call these methods on large strings. |
Hi @IBue It looks like you forgot to run |
70ae28d
to
0e372a6
Compare
note on performance: |
Can you tell us more on the usage you have for this method? That will tell us a lot on how it should be optimized. A common usage for this method could be extracting contiguous series of digits from a long string. With series of 10-20 digits it doesn't really make sense to go parallel, but if you want to validate a 100M input to make sure it only contains digits, you can sacrifice more memory. Note: The |
@IBue
|
I have transformed the
Actually, I don't have productive usage for this (yet). I stumbled upon the problem in the context of UTF-16 and learning some Java. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
src/test/java/org/apache/commons/lang3/StringUtilsEqualsIndexOfTest.java
Outdated
Show resolved
Hide resolved
src/test/java/org/apache/commons/lang3/StringUtilsEqualsIndexOfTest.java
Outdated
Show resolved
Hide resolved
@IBue @ppkarwasz public static int indexOfAnyBut(final CharSequence seq, final CharSequence searchChars) {
if (isEmpty(seq) || isEmpty(searchChars)) {
return INDEX_NOT_FOUND;
}
final int[] codePoints = searchChars.codePoints().sorted().toArray();
// advance character index from one interpreted codepoint to the next
for (int curSeqCharIdx = 0; curSeqCharIdx < seq.length();) {
final int curSeqCodePoint = Character.codePointAt(seq, curSeqCharIdx);
if (Arrays.binarySearch(codePoints, curSeqCodePoint) < 0) {
return curSeqCharIdx;
}
curSeqCharIdx += Character.charCount(curSeqCodePoint); // skip indices to paired low-surrogates
}
return INDEX_NOT_FOUND;
} |
Currently it is a lookup in a commons-lang/src/main/java/org/apache/commons/lang3/StringUtils.java Lines 2877 to 2887 in e0e3f08
|
@IBue Would you please rebase on git master? |
…behaviour regarding UTF-16 surrogates Both signatures of StringUtils::indexOfAnyBut currently behave inconsistently in matching UTF-16 supplementary characters and single UTF-16 surrogate characters (i.e. paired and unpaired surrogates), since they differ unnecessarily in their algorithmic implementations, use their own incomplete and faulty interpretation of UTF-16 and don't take full advantage of the standard library. The example cases below show that they may yield contradictory results or correct results for the wrong reasons. This proposal gives a unified algorithmic implementation of both signatures that a) is much easier to grasp due to a clear mathematical set approach and safe iteration and doesn't become entangled in index arithmetic; stresses the set semantics of the 2nd argument b) fully relies on the standard library for defined UTF-16 handling/interpretation; paired surrogates are merged into one codepoint, unpaired surrogates are left as they are c) scales much better with input sizes and result index position d) can benefit from current and future improvements in the standard library and JVM (streams implementation, parallelization, JIT optimization, JEP 218, ???…) The algorithm boils down to: find index i of first char in cs such that (cs.codePointAt(i) ∈ {x ∈ codepoints(cs) ∣ x ∉ codepoints(searchChars) }) Examples: --------- <H>: high-surrogate character <L>: low-surrogate character (<H><L>): valid supplementary character signature 1: StringUtils::indexOfAnyBut(final CharSequence seq, final CharSequence searchChars) signature 2: StringUtils::indexOfAnyBut(final CharSequence cs, final char... searchChars) Case 1: matching of unpaired high-surrogate ---------seq/cs-------searchChars------exp./new-----sig.1-------sig.2--- 1.1 <H>aaaa <H>abcd !found !found !found sig.2: 'a' happens to follow <H> in searchChars; sig.1: 'a' is somewhere in searchChars 1.2 <H>baaa <H>abcd !found !found 0 sig.1: 'b' is somewhere in searchChars 1.3 <H>aaaa (<H><L>)abcd 0 !found 0 sig.1: 'a' is somewhere in searchChars 1.4 aaaa<H> (<H><L>)abcd 4 !found !found sig.1+2 don't interpret suppl. character Case 2: matching of unpaired low-surrogate ---------seq/cs-------searchChars------exp./new-----sig.1-------sig.2--- 2.1 <L>aaaa (<H><L>)abcd 0 !found !found sig.1+2 don't interpret suppl. character 2.2 aaaa<L> (<H><L>)abcd 4 !found !found sig.1+2 don't interpret suppl. character Case 3: matching of supplementary character ---------seq/cs-------------searchChars-----exp./new----sig.1-----sig.2- 3.1 (<H><L>)aaaa <L>ab<H>cd 0 !found 0 sig.1: <L> is somewhere in searchChars 3.2 (<H><L>)aaaa abcd 0 1 0 sig.1 always points to low-surrogate of (fully) unmatched suppl. character 3.3 (<H><L>)aaaa abcd<H> 0 0 1 3.4 (<H><L>)aaaa abcd<L> 0 !found 0 sig.1: <H> skipped by algorithm
by simplifying set consideration: find index i of first char in seq such that (seq.codePointAt(i) ∉ { x ∈ codepoints(searchChars) })
by transforming ListIterator loop into index-based loop, advancing by Character.charCount(codepoint); enabling short-circuit processing, avoiding full in-advance processing of input-sequence
providing a single source-of-truth (arguments stream) for the two function variants
Set::contains of immutable Set has unclear desastrous performance issues when searching for large values (here: >0xffff) in a set of smaller values (including JDK 23)
e0e3f08
to
915ee33
Compare
I have parameterized the test functions and rebased onto (You made premature, conflicting changes without proper comment in 1a89cca ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@IBue @ppkarwasz |
…vior regarding UTF-16 surrogates #1327
Thank you for reviewing and valuable feedback! |
@IBue |
depends/stacked on #1326 (tests only)Both signatures of
StringUtils::indexOfAnyBut
currently behaveinconsistently in matching UTF-16 supplementary characters and single
UTF-16 surrogate characters (i.e. paired and unpaired surrogates), since
they differ unnecessarily in their algorithmic implementations, use
their own incomplete and faulty interpretation of UTF-16 and don't take
full advantage of the standard library.
The example cases below show that they may yield contradictory results
or correct results for the wrong reasons.
This proposal gives a unified algorithmic implementation of both
signatures that
safe iteration and doesn't become entangled in index arithmetic;
stresses the set semantics of the 2nd argument
handling/interpretation;
paired surrogates are merged into one codepoint, unpaired surrogates
are left as they are
library and JVM
(streams implementation, parallelization, JIT optimization, JEP 218, ???…)
The algorithm boils down to:
find index i of first char in cs such that
(cs.codePointAt(i) ∈ {x ∈ codepoints(cs) ∣ x ∉ codepoints(searchChars) })
(cs.codePointAt(i) ∉ { x ∈ codepoints(searchChars) })
Examples:
<H>
: high-surrogate character<L>
: low-surrogate character(<H><L>)
: valid supplementary charactersignature 1:
StringUtils::indexOfAnyBut(final CharSequence seq, final CharSequence searchChars)
signature 2:
StringUtils::indexOfAnyBut(final CharSequence cs, final char... searchChars)
Case 1: matching of unpaired high-surrogate
<H>aaaa
<H>abcd
in
searchChars
<H>
insearchChars
<H>baaa
<H>abcd
in searchChars
<H>aaaa
(<H><L>)abcd
in searchChars
aaaa<H>
(<H><L>)abcd
Case 2: matching of unpaired low-surrogate
<L>aaaa
(<H><L>)abcd
aaaa<L>
(<H><L>)abcd
Case 3: matching of supplementary character
(<H><L>)aaaa
<L>ab<H>cd
<L>
is somewherein
searchChars
(<H><L>)aaaa
abcd
of (fully) unmatched suppl. character
(<H><L>)aaaa
abcd<H>
(<H><L>)aaaa
abcd<L>
<H>
skipped by algorithm