Regex not matching malformed UTF-8 #44520

Closed
fonsp opened this issue Mar 8, 2022 · 2 comments
Labels
strings "Strings!"

Comments

@fonsp
Member

fonsp commented Mar 8, 2022

I noticed this behaviour on 1.7.0 (it throws an error on 1.6.5, see #25997 for prior discussion):

julia> match(r".+", "hello\x80world")
RegexMatch("hello")

I expected . to match any character, including invalid characters.

In JS (V8 engine), which also allows arbitrary data in strings, the result is:

> let c = new TextDecoder().decode(new Uint8Array([0x80]))

> /.+/.exec(`hello${c}world`)
[ "hello�world" ]

This came up in JuliaWeb/HTTP.jl#796

fonsp changed the title from "Regex matching malformed UTF-8" to "Regex not matching malformed UTF-8" on Mar 8, 2022
vtjnash added the strings "Strings!" label on Mar 8, 2022
@vtjnash
Member

vtjnash commented Mar 8, 2022

V8, which does not allow arbitrary data in strings, has lied to you and changed your string to only contain valid characters:

> new TextEncoder().encode(c)
Uint8Array(3) [239, 191, 189]

> c.charAt(0) == '\ufffd'
true

PCRE2 decided that its definition of "any character" does not include bytes that are not valid characters (see #39524).
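
For anyone who wants to reproduce the V8 side of this, the following is a minimal sketch, assuming a runtime where `TextDecoder` and `TextEncoder` are available as globals (recent Node.js or a browser). The variable names are only illustrative. It shows that the invalid byte is replaced during decoding, before the regex ever runs, so the match covers the whole string only because the string no longer contains the original byte.

```javascript
// 0x80 is a lone continuation byte, which is not valid UTF-8 on its own.
const decoded = new TextDecoder().decode(new Uint8Array([0x80]));

// By default the decoder substitutes U+FFFD (REPLACEMENT CHARACTER)
// for the invalid byte instead of keeping it.
const isReplacement = decoded === "\ufffd";

// Re-encoding gives the three-byte UTF-8 form of U+FFFD,
// not the single byte we started with.
const roundTripped = Array.from(new TextEncoder().encode(decoded));

// The regex sees only valid characters, so `.` matches all of them.
const matched = /.+/.exec(`hello${decoded}world`)[0];

console.log(isReplacement, roundTripped, matched.length);
```

The original byte is lost in the decode step, so this is not equivalent to the Julia example, where the string still holds the raw `\x80` byte when PCRE2 sees it.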

@fonsp
Member Author

fonsp commented Mar 8, 2022

Just learned something new, thanks Jameson!

Let's close then, I suppose that PCRE2 has a good reason for that decision.

fonsp closed this as completed on Mar 8, 2022