Commit
[StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty behaviour regarding UTF-16 surrogates

Both signatures of StringUtils::indexOfAnyBut currently behave inconsistently in matching UTF-16 supplementary characters and single UTF-16 surrogate characters (i.e. paired and unpaired surrogates), since they differ unnecessarily in their algorithmic implementations, use their own incomplete and faulty interpretation of UTF-16 and don't take full advantage of the standard library.

The example cases below show that they may yield contradictory results or correct results for the wrong reasons.

This proposal gives a unified algorithmic implementation of both signatures that
a) is much easier to grasp due to a clear mathematical set approach and safe iteration and doesn't become entangled in index arithmetic; stresses the set semantics of the 2nd argument
b) fully relies on the standard library for defined UTF-16 handling/interpretation; paired surrogates are merged into one codepoint, unpaired surrogates are left as they are
c) scales much better with input sizes and result index position
d) can benefit from current and future improvements in the standard library and JVM (streams implementation, parallelization, JIT optimization, JEP 218, ???…)

The algorithm boils down to:
find index i of first char in srcChars such that
(srcChars.codePointAt(i) ∈ {x ∈ codepoints(srcChars) ∣ x ∉ codepoints(searchChars) })

Examples:
---------
<H>: high-surrogate character
<L>: low-surrogate character
(<H><L>): valid supplementary character

signature 1: StringUtils::indexOfAnyBut(final CharSequence srcChars, final CharSequence searchChars)
signature 2: StringUtils::indexOfAnyBut(final CharSequence srcChars, final char... searchChars)

Case 1: matching of unpaired high-surrogate
      srcChars        searchChars     exp./new   sig.1    sig.2
1.1   <H>aaaa         <H>abcd         !found     !found   !found
        sig.2: 'a' happens to follow <H> in searchChars; sig.1: 'a' is somewhere in searchChars
1.2   <H>baaa         <H>abcd         !found     !found   0
        sig.1: 'b' is somewhere in searchChars
1.3   <H>aaaa         (<H><L>)abcd    0          !found   0
        sig.1: 'a' is somewhere in searchChars
1.4   aaaa<H>         (<H><L>)abcd    4          !found   !found
        sig.1+2 don't interpret suppl. character

Case 2: matching of unpaired low-surrogate
      srcChars        searchChars     exp./new   sig.1    sig.2
2.1   <L>aaaa         (<H><L>)abcd    0          !found   !found
        sig.1+2 don't interpret suppl. character
2.2   aaaa<L>         (<H><L>)abcd    4          !found   !found
        sig.1+2 don't interpret suppl. character

Case 3: matching of supplementary character
      srcChars        searchChars     exp./new   sig.1    sig.2
3.1   (<H><L>)aaaa    <L>ab<H>cd      0          !found   0
        sig.1: <L> is somewhere in searchChars
3.2   (<H><L>)aaaa    abcd            0          1        0
        sig.1 always points to low-surrogate of (fully) unmatched suppl. character
3.3   (<H><L>)aaaa    abcd<H>         0          0        1
3.4   (<H><L>)aaaa    abcd<L>         0          !found   0
        sig.1: <H> skipped by algorithm
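
The set-based rule stated in the message (find the first index in srcChars whose code point is not in the set of code points of searchChars) can be sketched as follows. This is a minimal illustration, not the code of the commit itself: the class name IndexOfAnyButSketch and the INDEX_NOT_FOUND constant are hypothetical, the -1 result for null or empty arguments is an assumption, and a plain loop over code points is used here instead of whatever stream-based form the actual change may take. Only standard library calls (CharSequence.codePoints, Character.codePointAt, Character.charCount) do the UTF-16 interpretation.

```java
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch class; not part of Apache Commons Lang.
final class IndexOfAnyButSketch {

    // Assumed sentinel for "!found" in the example tables.
    static final int INDEX_NOT_FOUND = -1;

    // Signature 1: both arguments are character sequences.
    public static int indexOfAnyBut(final CharSequence srcChars, final CharSequence searchChars) {
        if (srcChars == null || srcChars.length() == 0
                || searchChars == null || searchChars.length() == 0) {
            return INDEX_NOT_FOUND; // assumption: nothing to search, nothing found
        }
        // Set of code points to skip. codePoints() merges valid surrogate
        // pairs into one code point and passes unpaired surrogates through.
        final Set<Integer> excluded =
                searchChars.codePoints().boxed().collect(Collectors.toSet());

        // Walk srcChars one code point at a time, tracking the char index.
        for (int i = 0; i < srcChars.length(); ) {
            final int codePoint = Character.codePointAt(srcChars, i);
            if (!excluded.contains(codePoint)) {
                return i; // first code point that is not in the set
            }
            i += Character.charCount(codePoint); // 1 for BMP/unpaired, 2 for a pair
        }
        return INDEX_NOT_FOUND;
    }

    // Signature 2: delegates to signature 1 so both share one algorithm.
    public static int indexOfAnyBut(final CharSequence srcChars, final char... searchChars) {
        if (searchChars == null || searchChars.length == 0) {
            return INDEX_NOT_FOUND;
        }
        return indexOfAnyBut(srcChars, new String(searchChars));
    }
}
```

Under these assumptions both overloads give the "exp./new" column of the example tables, for instance index 0 for case 3.2 and index 4 for case 1.4, because an unpaired surrogate and a supplementary character are distinct members of the set.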