Regex not matching malformed UTF-8 #44520

fonsp · 2022-03-08T17:51:26Z

I noticed this behaviour on Julia 1.7.0 (it throws an error on 1.6.5; see #25997 for prior discussion):

julia> match(r".+", "hello\x80world")
RegexMatch("hello")

I expected `.` to match any character, including invalid ones.

In JS (V8 engine), which also allows arbitrary data in strings, the result is:

> let c = new TextDecoder().decode(new Uint8Array([0x80]))

> /.+/.exec(`hello${c}world`)
[ "hello�world" ]


vtjnash · 2022-03-08T20:16:03Z

V8 does not allow arbitrary data in strings. TextDecoder has lied to you and replaced the invalid byte, so your string contains only valid characters:

> new TextEncoder().encode(c)
Uint8Array(3) [239, 191, 189]

> c.charAt(0) == '\ufffd'
true

PCRE2 decided that its definition of "any character" does not include bytes that are not valid characters (see #39524).
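
For anyone who wants to check the V8 side of this, here is a minimal sketch, assuming a runtime with the standard TextDecoder and TextEncoder globals (a recent Node.js or a browser). The variable names are only illustrative. It shows the lone byte 0x80 being replaced by U+FFFD in the default mode, the same byte being rejected in fatal mode, and the regex matching straight through the replacement character.

```javascript
// Sketch: how TextDecoder treats the lone byte 0x80, which is invalid as UTF-8.
const bytes = new Uint8Array([0x80]);

// Default (non-fatal) mode substitutes U+FFFD for malformed input.
const replaced = new TextDecoder().decode(bytes);

// Re-encoding yields the three UTF-8 bytes of U+FFFD, not the original byte.
const reencoded = Array.from(new TextEncoder().encode(replaced));

// Fatal mode refuses malformed input and throws a TypeError instead.
let fatalError = null;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bytes);
} catch (err) {
  fatalError = err;
}

// The string now holds only valid characters, so `.` matches all of it.
const matched = /.+/.exec(`hello${replaced}world`)[0];

console.log(replaced === "\ufffd");           // true
console.log(reencoded);                       // [ 239, 191, 189 ]
console.log(fatalError instanceof TypeError); // true
console.log(matched.length);                  // 11
```

So the JS result is not comparable to the Julia one: by the time the regex runs, the invalid byte is already gone.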

fonsp · 2022-03-08T22:27:38Z

Just learned something new, thanks Jameson!

Let's close then; I suppose PCRE2 has a good reason for that decision.

fonsp changed the title ~~Regex matching malformed UTF-8~~ Regex not matching malformed UTF-8 Mar 8, 2022

vtjnash added the strings "Strings!" label Mar 8, 2022

fonsp closed this as completed Mar 8, 2022
