findnext throws StringIndexError for malformed UTF-8 #25997
Comments
Unfortunately the situation is a bit more complex than I thought. We should also cover those cases with a consistent rule (or directly document that string and regex searches may return different results):
|
Good catch! I guess the best behavior would be to always be consistent with what the same functions return when looking for the corresponding character in |
This is reasonable. The only problem is that it would not be consistent with regex search (see |
Yeah, it's a bit annoying, but using invalid UTF-8 in regexes shouldn't be that common hopefully. We could make it an error if PCRE doesn't support it as we want. There's some documentation about this. IIUC PCRE will throw errors with too large code points, so there will be a problem there too. I don't really understand why |
We may not want to support using invalid UTF-8 in regexes at all. If the behavior of regexes is meant to be explicable in terms of operating on characters, a regex containing invalid UTF-8 will sometimes have behavior that simply cannot be explained that way since really it is operating on bytes. Fundamentally, it seems ambiguous what |
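One way to picture the "don't support it at all" option is a guard that rejects a pattern before it ever reaches PCRE. This is only a minimal sketch: `checked_regex` is a hypothetical name, not part of Base, and throwing an `ArgumentError` is an assumption about what the error could look like, not a decision taken in this thread.

```julia
# Hypothetical wrapper, not part of Base: refuse to build a regex from a
# pattern that is not valid UTF-8, so matching can always be explained in
# terms of characters rather than raw bytes.
function checked_regex(pattern::String)
    isvalid(pattern) || throw(ArgumentError("regex pattern is not valid UTF-8"))
    return Regex(pattern)
end

occursin(checked_regex("a+"), "caab")   # true
# checked_regex("\xf8") throws ArgumentError, since 0xf8 is never valid UTF-8
```

The guard covers only the pattern; what to do when the subject string is malformed is a separate question.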
This is exactly the point I wanted to make with the examples. To sum up the decisions to be made:
|
Discussion about invalid UTF-8 and PCRE https://news.ycombinator.com/item?id=20051020:
|
OK, so let's set PCRE2_MATCH_INVALID_UTF and close this! |
Neither. We set the PCRE.ALT_BSUX by default, so it means the code point:
Per ECMAScript, this differs from |
That may be what ECMAScript says, but that's not how we treat these escapes: in Julia, |
It is not an escape sequence, since this is not in Julia. Since this is a regex, it is a regex match sequence. |
Seems to be done now |
Linking the PR here: regex: enable safe handling of invalid UTF-8 by default #39524 |
Here is an example of the problem
The reason is that `findnext` for `String` executes the following code:
which fails if `c` is malformed UTF-8, as `s[i]` throws an error then.
The decision to be made is whether such a search:
Once we have a decision, a fix should be relatively simple.
CC @StefanKarpinski @nalimilan
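To make the failure mode concrete: the error comes from indexing the string by character, so a search that only compares code units cannot hit it. The sketch below is an illustration under that assumption, not what Base does. `bytesearch` is a hypothetical helper, and returning a byte range (or `nothing`) is just one of the possible behaviors under discussion.

```julia
# Hypothetical helper, not part of Base: compares code units (bytes) only,
# so malformed UTF-8 in either argument cannot raise a StringIndexError.
function bytesearch(needle::String, haystack::String, start::Integer = 1)
    n = codeunits(needle)
    h = codeunits(haystack)
    # Empty needle: return an empty range positioned at `start`.
    isempty(n) && return start:(start - 1)
    for i in max(start, 1):(length(h) - length(n) + 1)
        # Raw byte comparison; no character decoding or string indexing.
        if all(h[i + j - 1] == n[j] for j in 1:length(n))
            return i:(i + length(n) - 1)   # byte range, not character indices
        end
    end
    return nothing
end

bad = "\xf8"                  # 0xf8 can never start a valid UTF-8 sequence
isvalid(bad)                  # false
bytesearch(bad, "a")          # nothing
bytesearch(bad, "a\xf8z")     # 2:2
```

Unlike `findnext`, the returned range ends at the last matched byte rather than at the start of the last matched character, which is one reason this is only a sketch and not a drop-in replacement.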