findnext throws StringIndexError for malformed UTF-8 #25997

bkamins · 2018-02-11T21:05:46Z

Here is an example of the problem

julia> s = "∀"
"∀"

julia> ss = "\x80"
"\x80"

julia> r = r"\x80"
r"\x80"

julia> contains(s, ss)
ERROR: StringIndexError("∀", 3)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at .\strings\string.jl:7
 [2] getindex_continued(::String, ::Int64, ::UInt32) at .\strings\string.jl:200
 [3] getindex at .\strings\string.jl:193 [inlined]
 [4] findnext(::Base.EqualTo{Char}, ::String, ::Int64) at .\strings\search.jl:16
 [5] _searchindex(::String, ::String, ::Int64) at .\strings\search.jl:165
 [6] contains(::String, ::String) at .\strings\search.jl:477
 [7] top-level scope

julia> contains(s, r)
false

julia> findfirst(ss, s)
ERROR: StringIndexError("∀", 3)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at .\strings\string.jl:7
 [2] getindex_continued(::String, ::Int64, ::UInt32) at .\strings\string.jl:200
 [3] getindex at .\strings\string.jl:193 [inlined]
 [4] findnext(::Base.EqualTo{Char}, ::String, ::Int64) at .\strings\search.jl:16
 [5] _searchindex(::String, ::String, ::Int64) at .\strings\search.jl:165
 [6] _search at .\strings\search.jl:233 [inlined]
 [7] findnext at .\strings\search.jl:267 [inlined]
 [8] findfirst(::String, ::String) at .\strings\search.jl:101
 [9] top-level scope

julia> findfirst(r, s)
0:-1

The reason is that findnext for String executes the following code:

        i = _search(s, first_utf8_byte(c), i)
        i == 0 && return nothing
        s[i] == c && return i

which fails if c is Malfrormed UTF-8 as s[i] throws an error then.

The decision to be made is whether such a search:

should throw an error (but with an appropriate error message)
or behave like for regular expressions case (indicating that search failed)

When we have a decision a fix should be relatively simple.

CC @StefanKarpinski @nalimilan

The text was updated successfully, but these errors were encountered:

bkamins · 2018-02-11T21:14:49Z

Unfortunately the situation is a bit more complex than I thought. We should also cover those cases with a consistent rule (or directly document that string and regex searches may return different results):

julia> ss = "\x80"
"\x80"

julia> r = r"\x80"
r"\x80"

julia> s1 = "∀\x80"
"∀\x80"

julia> s2 = "\x80∀"
"\x80∀"

julia> contains(s1, ss)
ERROR: StringIndexError("∀\x80", 3)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at .\strings\string.jl:7
 [2] getindex_continued(::String, ::Int64, ::UInt32) at .\strings\string.jl:200
 [3] getindex at .\strings\string.jl:193 [inlined]
 [4] findnext(::Base.EqualTo{Char}, ::String, ::Int64) at .\strings\search.jl:16
 [5] _searchindex(::String, ::String, ::Int64) at .\strings\search.jl:165
 [6] contains(::String, ::String) at .\strings\search.jl:477
 [7] top-level scope

julia> contains(s2, ss)
true

julia> contains(s1, r)
false

julia> contains(s2, r)
false

nalimilan · 2018-02-11T22:19:12Z

Good catch!

I guess the best behavior would to be always consistent with what the same functions return when looking for the corresponding character in collect(haystack). That is, invalid sequences which appear inside a valid code point should not be matched because you wouldn't get them when iterating over the string. No error should be thrown. This principle can be generalized when looking for multi-character strings.

bkamins · 2018-02-11T22:51:56Z

This is reasonable. The only problem is that it would not be consistent with regex search (see s2 = "\x80∀" example for which your rule would return true and regex search fails). I guess this is acceptable, but should be documented.

nalimilan · 2018-02-12T08:45:12Z

Yeah, it's a bit annoying, but using invalid UTF-8 in regexes shouldn't be that common hopefully. We could make it an error if PCRE doesn't support it as we want. There's some documentation about this. IIUC PCRE will throw errors with too large code points, so there will be a problem there too. I don't really understand why r"\x80" is accepted but doesn't match anything...

StefanKarpinski · 2018-02-12T15:05:57Z

We may not want to support using invalid UTF-8 in regexes at all. If the behavior of regexes is meant to be explicable in terms of operating on characters, a regex containing invalid UTF-8 will sometimes have behavior that simply cannot be explained that way since really it is operating on bytes. Fundamentally, it seems ambiguous what r"\x80" even means. Does it mean "search for the byte 0x80 somewhere in the sequence of bytes encoding this string" or does it mean "search for an invalid character encoded by the single byte 0x80 in this string? Those are two quite different operations.

bkamins · 2018-02-12T15:17:05Z

This is exactly the point I wanted to make with the examples.

To sum up the decisions to be made:

regex case
- do we allow r"[something]" to have invalid UTF-8 characters; I understand that the answer is no (and trying to create such an object should fail);
- do we allow the string in which we search for a regex to contain invalid UTF-8 characters (given regex itself is valid UTF-8); here I am not sure as I do not know precisely how PCRE would behave in such a situation. E.g. we have a .* somewhere inside regex - would it always handle such a case correctly. Any opinion?
string case
- I understand that here the consensus is that we allow both strings to contain invalid UTF-8 characters but we make sure that we match characters (as defined by String specification) not code units in the search. Right?

StefanKarpinski · 2019-05-31T15:50:30Z

Discussion about invalid UTF-8 and PCRE https://news.ycombinator.com/item?id=20051020:

As of PCRE 10.33, you can able the PCRE2_JIT_INVALID_UTF check for JIT matching instead.

vtjnash · 2020-03-16T14:28:57Z

OK, so let's set the PCRE2_MATCH_INVALID_UTF and close this!

vtjnash · 2021-02-04T05:47:03Z

Fundamentally, it seems ambiguous what r"\x80" even means. Does it mean "search for the byte 0x80 somewhere in the sequence of bytes encoding this string" or does it mean "search for an invalid character encoded by the single byte 0x80 in this string?

Neither. We set the PCRE.ALT_BSUX by default, so it means the code point:

julia> match(r"a\x80", "a\u80")
RegexMatch("a\u80")

(3) \x matches a lower case "x" character unless it is followed by two hexadecimal digits, in which case the hexadecimal number defines the code point to match

Per ECMAscript, \x## is the same as \u00## in a regex

This differs from Regex("\x80"), which is an illegal pattern in pcre2.

Refs #25997

StefanKarpinski · 2021-03-23T21:43:34Z

Per ECMAscript, \x## is the same as \u00## in a regex

That may be what ECMAScript says, but that's not how we treat these escapes: in Julia, \x80 is the byte 0x80 whereas \u80 is the code point U+80. I agree that if the regex specification is invalid we should just error.

vtjnash · 2021-03-24T05:31:52Z

It is not an escape sequence, since this is not in Julia. Since this is a regex, it is a regex match sequence.

Refs JuliaLang#25997

vtjnash · 2022-03-08T20:04:25Z

Seems to be done now

aviks · 2022-03-08T21:02:32Z

Linking the PR here: regex: enable safe handling of invalid UTF-8 by default #39524

bkamins changed the title ~~findnext throws StringIndexErrof for malformed UTF-8~~ findnext throws StringIndexError for malformed UTF-8 Feb 11, 2018

nalimilan added unicode Related to unicode characters and encodings strings "Strings!" search & find The find* family of functions labels Feb 11, 2018

vtjnash added the Hacktoberfest Good for Hacktoberfest participants label Mar 16, 2020

vtjnash added triage This should be discussed on a triage call and removed Hacktoberfest Good for Hacktoberfest participants labels Feb 4, 2021

vtjnash added a commit that referenced this issue Feb 4, 2021

regex: enable safe handling of invalid UTF-8 by default

d336d48

Refs #25997

ElOceanografo pushed a commit to ElOceanografo/julia that referenced this issue May 4, 2021

regex: enable safe handling of invalid UTF-8 by default

59a035e

Refs JuliaLang#25997

antoine-levitt pushed a commit to antoine-levitt/julia that referenced this issue May 9, 2021

regex: enable safe handling of invalid UTF-8 by default

7ec69b3

Refs JuliaLang#25997

maleadt mentioned this issue Jan 3, 2022

Regression in Regex construction (and thereby DateFormat) performance on 1.7 #43610

Closed

fonsp mentioned this issue Mar 8, 2022

Regex not matching malformed UTF-8 #44520

Closed

vtjnash closed this as completed Mar 8, 2022

DilumAluthge removed the triage This should be discussed on a triage call label Mar 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

findnext throws StringIndexError for malformed UTF-8 #25997

findnext throws StringIndexError for malformed UTF-8 #25997

bkamins commented Feb 11, 2018 •

edited

Loading

bkamins commented Feb 11, 2018

nalimilan commented Feb 11, 2018

bkamins commented Feb 11, 2018

nalimilan commented Feb 12, 2018

StefanKarpinski commented Feb 12, 2018 •

edited

Loading

bkamins commented Feb 12, 2018

StefanKarpinski commented May 31, 2019

vtjnash commented Mar 16, 2020

vtjnash commented Feb 4, 2021

StefanKarpinski commented Mar 23, 2021

vtjnash commented Mar 24, 2021

vtjnash commented Mar 8, 2022

aviks commented Mar 8, 2022

findnext throws StringIndexError for malformed UTF-8 #25997

findnext throws StringIndexError for malformed UTF-8 #25997

Comments

bkamins commented Feb 11, 2018 • edited Loading

bkamins commented Feb 11, 2018

nalimilan commented Feb 11, 2018

bkamins commented Feb 11, 2018

nalimilan commented Feb 12, 2018

StefanKarpinski commented Feb 12, 2018 • edited Loading

bkamins commented Feb 12, 2018

StefanKarpinski commented May 31, 2019

vtjnash commented Mar 16, 2020

vtjnash commented Feb 4, 2021

StefanKarpinski commented Mar 23, 2021

vtjnash commented Mar 24, 2021

vtjnash commented Mar 8, 2022

aviks commented Mar 8, 2022

bkamins commented Feb 11, 2018 •

edited

Loading

StefanKarpinski commented Feb 12, 2018 •

edited

Loading