Skip to content

Commit

Permalink
[StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty…
Browse files Browse the repository at this point in the history
…behaviour regarding UTF-16 surrogates

Both signatures of StringUtils::indexOfAnyBut currently behave
inconsistently in matching UTF-16 supplementary characters and single
UTF-16 surrogate characters (i.e. paired and unpaired surrogates), since
they differ unnecessarily in their algorithmic implementations, use
their own incomplete and faulty interpretation of UTF-16 and don't take
full advantage of the standard library.

The example cases below show that they may yield contradictory results
or correct results for the wrong reasons.

This proposal gives a unified algorithmic implementation of both
signatures that
a) is much easier to grasp due to a clear mathematical set approach and
   safe iteration and doesn't become entangled in index arithmetic;
   stresses the set semantics of the 2nd argument
b) fully relies on the standard library for defined UTF-16
   handling/interpretation;
   paired surrogates are merged into one codepoint, unpaired surrogates
   are left as they are
c) scales much better with input sizes and result index position
d) can benefit from current and future improvements in the standard
   library and JVM
   (streams implementation, parallelization, JIT optimization, JEP 218,
   ???…)

The algorithm boils down to:
find index i of first char in srcChars such that
(srcChars.codePointAt(i) ∈ {x ∈ codepoints(srcChars) ∣ x ∉
codepoints(searchChars) })

Examples:
---------

<H>: high-surrogate character
<L>: low-surrogate character
(<H><L>): valid supplementary character
signature 1: StringUtils::indexOfAnyBut(final CharSequence srcChars,
final CharSequence searchChars)
signature 2: StringUtils::indexOfAnyBut(final CharSequence srcChars,
final char... searchChars)

Case 1: matching of unpaired high-surrogate
---------srcChars-----searchChars------exp./new-----sig.1-------sig.2---

 1.1     <H>aaaa      <H>abcd          !found       !found      !found
  sig.2: 'a' happens to follow <H> in searchChars;
  sig.1: 'a' is somewhere in searchChars

 1.2     <H>baaa      <H>abcd          !found       !found      0
  sig.1: 'b' is somewhere in searchChars

 1.3     <H>aaaa      (<H><L>)abcd     0            !found      0
  sig.1: 'a' is somewhere in searchChars

 1.4     aaaa<H>      (<H><L>)abcd     4            !found      !found
  sig.1+2 don't interpret suppl. character

Case 2: matching of unpaired low-surrogate
---------srcChars-----searchChars------exp./new-----sig.1-------sig.2---

 2.1     <L>aaaa      (<H><L>)abcd     0            !found      !found
  sig.1+2 don't interpret suppl. character

 2.2     aaaa<L>      (<H><L>)abcd     4            !found      !found
  sig.1+2 don't interpret suppl. character

Case 3: matching of supplementary character
---------srcChars-----------searchChars-----exp./new----sig.1-----sig.2-

 3.1     (<H><L>)aaaa       <L>ab<H>cd      0           !found    0
  sig.1: <L> is somewhere in searchChars

 3.2     (<H><L>)aaaa       abcd            0           1         0
  sig.1 always points to low-surrogate of (fully) unmatched
  suppl. character

 3.3     (<H><L>)aaaa       abcd<H>         0           0         1
 3.4     (<H><L>)aaaa       abcd<L>         0           !found    0
  sig.1: <H> skipped by algorithm
  • Loading branch information
IBue committed Dec 3, 2024
1 parent 7f56607 commit f74224a
Show file tree
Hide file tree
Showing 2 changed files with 61 additions and 42 deletions.
71 changes: 31 additions & 40 deletions src/main/java/org/apache/commons/lang3/StringUtils.java
Original file line number Diff line number Diff line change
Expand Up @@ -17,16 +17,21 @@
package org.apache.commons.lang3;

import java.io.UnsupportedEncodingException;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.ListIterator;
import java.util.Locale;
import java.util.Objects;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

import org.apache.commons.lang3.function.Suppliers;
import org.apache.commons.lang3.stream.LangCollectors;
Expand Down Expand Up @@ -2813,8 +2818,9 @@ public static int indexOfAny(final CharSequence cs, final String searchChars) {

/**
* Searches a CharSequence to find the first index of any
* character not in the given set of characters.
*
* character not in the given set of characters, i.e.,
* find index i of first char in srcChars such that (srcChars.codePointAt(i) ∈ {x ∈ codepoints(srcChars) ∣ x ∉ codepoints(searchChars) })
*
* <p>A {@code null} CharSequence will return {@code -1}.
* A {@code null} or zero length search array will return {@code -1}.</p>
*
Expand All @@ -2829,41 +2835,23 @@ public static int indexOfAny(final CharSequence cs, final String searchChars) {
* </pre>
*
* @param cs the CharSequence to check, may be null
* @param srcChars the CharSequence to check, may be null
* @param searchChars the chars to search for, may be null
* @return the index of any of the chars, -1 if no match or null input
* @since 2.0
* @since 3.0 Changed signature from indexOfAnyBut(String, char[]) to indexOfAnyBut(CharSequence, char...)
*/
public static int indexOfAnyBut(final CharSequence cs, final char... searchChars) {
if (isEmpty(cs) || ArrayUtils.isEmpty(searchChars)) {
public static int indexOfAnyBut(final CharSequence srcChars, final char... searchChars) {
if (searchChars == null || searchChars.length == 0) {
return INDEX_NOT_FOUND;
}
final int csLen = cs.length();
final int csLast = csLen - 1;
final int searchLen = searchChars.length;
final int searchLast = searchLen - 1;
outer:
for (int i = 0; i < csLen; i++) {
final char ch = cs.charAt(i);
for (int j = 0; j < searchLen; j++) {
if (searchChars[j] == ch) {
if (i >= csLast || j >= searchLast || !Character.isHighSurrogate(ch)) {
continue outer;
}
if (searchChars[j + 1] == cs.charAt(i + 1)) {
continue outer;
}
}
}
return i;
}
return INDEX_NOT_FOUND;
return indexOfAnyBut(srcChars, CharBuffer.wrap(searchChars));
}

/**
* Search a CharSequence to find the first index of any
* character not in the given set of characters.
* character not in the given set of characters, i.e.,
* find index i of first char in srcChars such that (srcChars.codePointAt(i) ∈ {x ∈ codepoints(srcChars) ∣ x ∉ codepoints(searchChars) })
*
* <p>A {@code null} CharSequence will return {@code -1}.
* A {@code null} or empty search string will return {@code -1}.</p>
Expand All @@ -2878,27 +2866,30 @@ public static int indexOfAnyBut(final CharSequence cs, final char... searchChars
* StringUtils.indexOfAnyBut("aba", "ab") = -1
* </pre>
*
* @param seq the CharSequence to check, may be null
* @param srcChars the CharSequence to check, may be null
* @param searchChars the chars to search for, may be null
* @return the index of any of the chars, -1 if no match or null input
* @since 2.0
* @since 3.0 Changed signature from indexOfAnyBut(String, String) to indexOfAnyBut(CharSequence, CharSequence)
*/
public static int indexOfAnyBut(final CharSequence seq, final CharSequence searchChars) {
if (isEmpty(seq) || isEmpty(searchChars)) {
public static int indexOfAnyBut(final CharSequence srcChars, final CharSequence searchChars) {
if (srcChars == null || searchChars == null || srcChars.length() == 0 || searchChars.length() == 0) {
return INDEX_NOT_FOUND;
}
final int strLen = seq.length();
for (int i = 0; i < strLen; i++) {
final char ch = seq.charAt(i);
final boolean chFound = CharSequenceUtils.indexOf(searchChars, ch, 0) >= 0;
if (i + 1 < strLen && Character.isHighSurrogate(ch)) {
final char ch2 = seq.charAt(i + 1);
if (chFound && CharSequenceUtils.indexOf(searchChars, ch2, 0) < 0) {
return i;
}
} else if (!chFound) {
return i;
final Set<Integer> srcSetCodePoints = srcChars.codePoints().boxed().collect(Collectors.toSet()); // >=10: Collectors::toUnmodifiableSet
final Set<Integer> searchSetCodePoints = searchChars.codePoints().boxed()
.collect(Collectors.toSet()); // >=10: Collectors::toUnmodifiableSet
final Set<Integer> complSetCodePoints = srcSetCodePoints.stream().filter(((Predicate<Integer>) searchSetCodePoints::contains).negate()) // >=11: Predicate.not(searchSetCodePoints::contains)
.collect(Collectors.toSet()); // >=10: Collectors::toUnmodifiableSet
for (final ListIterator<Integer> srcCharsListIt = srcChars.chars().boxed().collect(Collectors.toList()) // >=16: Stream::toList, >=10: Collectors::toUnmodifiableList
.listIterator(); srcCharsListIt.hasNext(); srcCharsListIt.next()) {
final int curSrcCharIdx = srcCharsListIt.nextIndex();
final int curSrcCodePoint = Character.codePointAt(srcChars, curSrcCharIdx);
if (complSetCodePoints.contains(curSrcCodePoint)) {
return curSrcCharIdx;
}
if (Character.isSupplementaryCodePoint(curSrcCodePoint)) {
srcCharsListIt.next(); // skip subsequent low-surrogate in next loop, since it merged into curSrcCodePoint
}
}
return INDEX_NOT_FOUND;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@

import static org.apache.commons.lang3.Supplementary.CharU20000;
import static org.apache.commons.lang3.Supplementary.CharU20001;
import static org.apache.commons.lang3.Supplementary.CharUSupplCharHigh;
import static org.apache.commons.lang3.Supplementary.CharU20000SupplCharLow;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;
Expand Down Expand Up @@ -444,11 +446,24 @@ public void testIndexOfAnyBut_StringCharArray() {
}

@Test
public void testIndexOfAnyBut_StringCharArrayWithSupplementaryChars() {
public void testIndexOfAnyBut_StringCharArrayWithSurrogateChars() {
assertEquals(2, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, CharU20000.toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, CharU20001.toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, ("abcd" + CharU20000SupplCharLow).toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, ("abcd" + CharUSupplCharHigh).toCharArray()));
assertEquals(-1, StringUtils.indexOfAnyBut(CharU20000, CharU20000.toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000, CharU20001.toCharArray()));
assertEquals(-1, StringUtils.indexOfAnyBut(CharUSupplCharHigh + "aaaa", (CharUSupplCharHigh + "abcd").toCharArray()));
assertEquals(-1, StringUtils.indexOfAnyBut(CharUSupplCharHigh + "baaa", (CharUSupplCharHigh + "abcd").toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharUSupplCharHigh + "aaaa", (CharU20000 + "abcd").toCharArray()));
assertEquals(4, StringUtils.indexOfAnyBut("aaaa" + CharUSupplCharHigh, (CharU20000 + "abcd").toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000SupplCharLow + "aaaa", (CharU20000 + "abcd").toCharArray()));
assertEquals(4, StringUtils.indexOfAnyBut("aaaa" + CharU20000SupplCharLow, (CharU20000 + "abcd").toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", (CharU20000SupplCharLow + "ab" + CharUSupplCharHigh + "cd").toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", "abcd".toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", ("abcd" + CharUSupplCharHigh).toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", ("abcd" + CharU20000SupplCharLow).toCharArray()));
assertEquals(-1, StringUtils.indexOfAnyBut("aaaa" + CharU20000, (CharU20000 + "abcd").toCharArray()));
}

@Test
Expand All @@ -469,11 +484,24 @@ public void testIndexOfAnyBut_StringString() {
}

@Test
public void testIndexOfAnyBut_StringStringWithSupplementaryChars() {
public void testIndexOfAnyBut_StringStringWithSurrogateChars() {
assertEquals(2, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, CharU20000));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, CharU20001));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, "abcd" + CharU20000SupplCharLow));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, "abcd" + CharUSupplCharHigh));
assertEquals(-1, StringUtils.indexOfAnyBut(CharU20000, CharU20000));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000, CharU20001));
assertEquals(-1, StringUtils.indexOfAnyBut(CharUSupplCharHigh + "aaaa", CharUSupplCharHigh + "abcd"));
assertEquals(-1, StringUtils.indexOfAnyBut(CharUSupplCharHigh + "baaa", CharUSupplCharHigh + "abcd"));
assertEquals(0, StringUtils.indexOfAnyBut(CharUSupplCharHigh + "aaaa", CharU20000 + "abcd"));
assertEquals(4, StringUtils.indexOfAnyBut("aaaa" + CharUSupplCharHigh, CharU20000 + "abcd"));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000SupplCharLow + "aaaa", CharU20000 + "abcd"));
assertEquals(4, StringUtils.indexOfAnyBut("aaaa" + CharU20000SupplCharLow, CharU20000 + "abcd"));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", CharU20000SupplCharLow + "ab" + CharUSupplCharHigh + "cd"));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", "abcd"));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", "abcd" + CharUSupplCharHigh));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", "abcd" + CharU20000SupplCharLow));
assertEquals(-1, StringUtils.indexOfAnyBut("aaaa" + CharU20000, CharU20000 + "abcd"));
}

@Test
Expand Down

0 comments on commit f74224a

Please sign in to comment.