Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

findnext throws StringIndexError for malformed UTF-8 #25997

Closed
bkamins opened this issue Feb 11, 2018 · 13 comments
Closed

findnext throws StringIndexError for malformed UTF-8 #25997

bkamins opened this issue Feb 11, 2018 · 13 comments
Labels
search & find The find* family of functions strings "Strings!" unicode Related to unicode characters and encodings

Comments

@bkamins
Copy link
Member

bkamins commented Feb 11, 2018

Here is an example of the problem

julia> s = "∀"
"∀"

julia> ss = "\x80"
"\x80"

julia> r = r"\x80"
r"\x80"

julia> contains(s, ss)
ERROR: StringIndexError("∀", 3)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at .\strings\string.jl:7
 [2] getindex_continued(::String, ::Int64, ::UInt32) at .\strings\string.jl:200
 [3] getindex at .\strings\string.jl:193 [inlined]
 [4] findnext(::Base.EqualTo{Char}, ::String, ::Int64) at .\strings\search.jl:16
 [5] _searchindex(::String, ::String, ::Int64) at .\strings\search.jl:165
 [6] contains(::String, ::String) at .\strings\search.jl:477
 [7] top-level scope

julia> contains(s, r)
false

julia> findfirst(ss, s)
ERROR: StringIndexError("∀", 3)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at .\strings\string.jl:7
 [2] getindex_continued(::String, ::Int64, ::UInt32) at .\strings\string.jl:200
 [3] getindex at .\strings\string.jl:193 [inlined]
 [4] findnext(::Base.EqualTo{Char}, ::String, ::Int64) at .\strings\search.jl:16
 [5] _searchindex(::String, ::String, ::Int64) at .\strings\search.jl:165
 [6] _search at .\strings\search.jl:233 [inlined]
 [7] findnext at .\strings\search.jl:267 [inlined]
 [8] findfirst(::String, ::String) at .\strings\search.jl:101
 [9] top-level scope

julia> findfirst(r, s)
0:-1

The reason is that findnext for String executes the following code:

        i = _search(s, first_utf8_byte(c), i)
        i == 0 && return nothing
        s[i] == c && return i

which fails if c is Malfrormed UTF-8 as s[i] throws an error then.

The decision to be made is whether such a search:

  1. should throw an error (but with an appropriate error message)
  2. or behave like for regular expressions case (indicating that search failed)

When we have a decision a fix should be relatively simple.

CC @StefanKarpinski @nalimilan

@bkamins bkamins changed the title findnext throws StringIndexErrof for malformed UTF-8 findnext throws StringIndexError for malformed UTF-8 Feb 11, 2018
@bkamins
Copy link
Member Author

bkamins commented Feb 11, 2018

Unfortunately the situation is a bit more complex than I thought. We should also cover those cases with a consistent rule (or directly document that string and regex searches may return different results):

julia> ss = "\x80"
"\x80"

julia> r = r"\x80"
r"\x80"

julia> s1 = "∀\x80"
"∀\x80"

julia> s2 = "\x80∀"
"\x80∀"

julia> contains(s1, ss)
ERROR: StringIndexError("∀\x80", 3)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at .\strings\string.jl:7
 [2] getindex_continued(::String, ::Int64, ::UInt32) at .\strings\string.jl:200
 [3] getindex at .\strings\string.jl:193 [inlined]
 [4] findnext(::Base.EqualTo{Char}, ::String, ::Int64) at .\strings\search.jl:16
 [5] _searchindex(::String, ::String, ::Int64) at .\strings\search.jl:165
 [6] contains(::String, ::String) at .\strings\search.jl:477
 [7] top-level scope

julia> contains(s2, ss)
true

julia> contains(s1, r)
false

julia> contains(s2, r)
false

@nalimilan nalimilan added unicode Related to unicode characters and encodings strings "Strings!" search & find The find* family of functions labels Feb 11, 2018
@nalimilan
Copy link
Member

Good catch!

I guess the best behavior would to be always consistent with what the same functions return when looking for the corresponding character in collect(haystack). That is, invalid sequences which appear inside a valid code point should not be matched because you wouldn't get them when iterating over the string. No error should be thrown. This principle can be generalized when looking for multi-character strings.

@bkamins
Copy link
Member Author

bkamins commented Feb 11, 2018

This is reasonable. The only problem is that it would not be consistent with regex search (see s2 = "\x80∀" example for which your rule would return true and regex search fails). I guess this is acceptable, but should be documented.

@nalimilan
Copy link
Member

Yeah, it's a bit annoying, but using invalid UTF-8 in regexes shouldn't be that common hopefully. We could make it an error if PCRE doesn't support it as we want. There's some documentation about this. IIUC PCRE will throw errors with too large code points, so there will be a problem there too. I don't really understand why r"\x80" is accepted but doesn't match anything...

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Feb 12, 2018

We may not want to support using invalid UTF-8 in regexes at all. If the behavior of regexes is meant to be explicable in terms of operating on characters, a regex containing invalid UTF-8 will sometimes have behavior that simply cannot be explained that way since really it is operating on bytes. Fundamentally, it seems ambiguous what r"\x80" even means. Does it mean "search for the byte 0x80 somewhere in the sequence of bytes encoding this string" or does it mean "search for an invalid character encoded by the single byte 0x80 in this string? Those are two quite different operations.

@bkamins
Copy link
Member Author

bkamins commented Feb 12, 2018

This is exactly the point I wanted to make with the examples.

To sum up the decisions to be made:

  1. regex case
    • do we allow r"[something]" to have invalid UTF-8 characters; I understand that the answer is no (and trying to create such an object should fail);
    • do we allow the string in which we search for a regex to contain invalid UTF-8 characters (given regex itself is valid UTF-8); here I am not sure as I do not know precisely how PCRE would behave in such a situation. E.g. we have a .* somewhere inside regex - would it always handle such a case correctly. Any opinion?
  2. string case
    • I understand that here the consensus is that we allow both strings to contain invalid UTF-8 characters but we make sure that we match characters (as defined by String specification) not code units in the search. Right?

@StefanKarpinski
Copy link
Member

Discussion about invalid UTF-8 and PCRE https://news.ycombinator.com/item?id=20051020:

As of PCRE 10.33, you can able the PCRE2_JIT_INVALID_UTF check for JIT matching instead.

@vtjnash
Copy link
Member

vtjnash commented Mar 16, 2020

OK, so let's set the PCRE2_MATCH_INVALID_UTF and close this!

@vtjnash vtjnash added the Hacktoberfest Good for Hacktoberfest participants label Mar 16, 2020
@vtjnash
Copy link
Member

vtjnash commented Feb 4, 2021

Fundamentally, it seems ambiguous what r"\x80" even means. Does it mean "search for the byte 0x80 somewhere in the sequence of bytes encoding this string" or does it mean "search for an invalid character encoded by the single byte 0x80 in this string?

Neither. We set the PCRE.ALT_BSUX by default, so it means the code point:

julia> match(r"a\x80", "a\u80")
RegexMatch("a\u80")

(3) \x matches a lower case "x" character unless it is followed by two hexadecimal digits, in which case the hexadecimal number defines the code point to match

Per ECMAscript, \x## is the same as \u00## in a regex

This differs from Regex("\x80"), which is an illegal pattern in pcre2.

@vtjnash vtjnash added triage This should be discussed on a triage call and removed Hacktoberfest Good for Hacktoberfest participants labels Feb 4, 2021
@StefanKarpinski
Copy link
Member

Per ECMAscript, \x## is the same as \u00## in a regex

That may be what ECMAScript says, but that's not how we treat these escapes: in Julia, \x80 is the byte 0x80 whereas \u80 is the code point U+80. I agree that if the regex specification is invalid we should just error.

@vtjnash
Copy link
Member

vtjnash commented Mar 24, 2021

It is not an escape sequence, since this is not in Julia. Since this is a regex, it is a regex match sequence.

@vtjnash
Copy link
Member

vtjnash commented Mar 8, 2022

Seems to be done now

@vtjnash vtjnash closed this as completed Mar 8, 2022
@aviks
Copy link
Member

aviks commented Mar 8, 2022

@DilumAluthge DilumAluthge removed the triage This should be discussed on a triage call label Mar 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
search & find The find* family of functions strings "Strings!" unicode Related to unicode characters and encodings
Projects
None yet
Development

No branches or pull requests

6 participants