-
Notifications
You must be signed in to change notification settings - Fork 181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent behaviour among Julia 1.6.5 (works ok) versus Julia 1.7.0 (error) #796
Comments
It looks like the header contains data in a different encoding, perhaps UTF-16, but Julia parsed is as UTF-8: julia> codeunits("Martín") # utf8 representation:
7-element Base.CodeUnits{UInt8, String}:
0x4d
0x61
0x72
0x74
0xc3
0xad
0x6e
julia> transcode(UInt16,"Martín") # UTF-16 representation:
6-element Vector{UInt16}:
0x004d
0x0061
0x0072
0x0074
0x00ed # <- here is the \xed character
0x006e I also found a difference in regex matching, when given this faulty string: On Julia 1.7.0, I get: julia> match(r"[^\r\n]+", "Martín")
RegexMatch("Martín")
julia> match(r"[^\r\n]+", "Mart\xedn")
RegexMatch("Mart") On Julia 1.6.5, it's: julia> match(r"[^\r\n]+", "Martín")
RegexMatch("Martín")
julia> match(r"[^\r\n]+", "Mart\xedn")
ERROR: BoundsError: attempt to access 6-codeunit String at index [8]
Stacktrace:
[1] prevind(s::String, i::Int64, n::Int64)
@ Base ./strings/basic.jl:499
[2] prevind
@ ./strings/basic.jl:494 [inlined]
[3] prevind
@ ./strings/basic.jl:493 [inlined]
[4] match(re::Regex, str::String, idx::Int64, add_opts::UInt32)
@ Base ./regex.jl:306
[5] match
@ ./regex.jl:297 [inlined]
[6] match(r::Regex, s::String)
@ Base ./regex.jl:316
[7] top-level scope
@ REPL[2]:1
Not sure what the conclusion is... |
The behavior in 1.6.5 is a bug, that could potentially lead to a security issue. The real question here, IMO, is why is the string being created with invalid UTF-8. If a server is legitimately returning results in a different encoding, that should then be converted to to UTF-8 before sending to the regex engine. Maybe you can read the |
I couldn't find any information online about encoding in the header values. Also no evidence that the It works with Downloads.jl (based on CURL), it throws a warning that that julia> r = Downloads.request("http://doi.org/10.1016/j.cma.2021.114093"; headers=["Accept" => "application/x-bibtex"]);
┌ Warning: malformed HTTP header
│ url = "https://api.crossref.org/v1/works/10.1016%2Fj.cma.2021.114093/transform"
│ status = 200
│ header = "link: <http://dx.doi.org/10.1016/j.cma.2021.114093>; rel=\"canonical\", <https://api.elsevier.com/content/article/PII:S0045782521004242?httpAccept=text/xml>; version=\"vor\"; type=\"text/xml\"; rel=\"item\", <https://api.elsevier.com/content/article/PII:S0045782" ⋯ 65 bytes ⋯ "\"; rel=\"item\", <https://www.elsevier.com/tdm/userlicense/1.0/>; version=\"tdm\"; rel=\"license\", <http://orcid.org/0000-0003-2391-4086>; title=\"Santiago Badia\"; rel=\"author\", <http://orcid.org/0000-0001-5751-4561>; title=\"Alberto F. Mart\xedn\"; rel=\"author\"\r\n"
└ @ Downloads.Curl /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.7/Downloads/src/Curl/Easy.jl:261
julia> r.headers
14-element Vector{Pair{String, String}}:
"date" => "Tue, 08 Mar 2022 22:34:16 GMT"
"content-type" => "application/x-bibtex"
"vary" => "Accept"
"access-control-expose-headers" => "Link"
"access-control-allow-headers" => "X-Requested-With, Accept, Accept-Encoding, Accept-Charset, Accept-Language, Accept-Ranges, Cache-Control"
"access-control-allow-origin" => "*"
"server" => "Jetty(9.4.40.v20210413)"
"x-ratelimit-limit" => "50"
"x-ratelimit-interval" => "1s"
"x-api-pool" => "public"
"x-rate-limit-limit" => "50"
"x-rate-limit-interval" => "1s"
"permissions-policy" => "interest-cohort=()"
"connection" => "close"
@amartinhuertas For your use case, consider using HTTPDownloads.jl, then it only logs a warning: julia> a = String(HTTP.get("http://doi.org/$doi", ["Accept" => "application/x-bibtex"]).body)
┌ Warning: malformed HTTP header
│ ... same warning as above ...
└ @ Downloads.Curl
"@article{Badia_2021,\n\tdoi = {10.1016/j.cma.2021.114093},\n\turl = {https://doi.org/10.1016%2Fj.cma.2021.114093},\n\tyear = 2021,\n\tmonth = {dec},\n\tpublisher = {Elsevier {BV}},\n\tvolume = {386},\n\tpages = {114093},\n\tauthor = {Santiago Badia and Manuel A. Caicedo and Alberto F. Mart{\\'{\\i}}n and Javier Principe},\n\ttitle = {A robust and scalable unfitted adaptive finite element framework for nonlinear solid mechanics},\n\tjournal = {Computer Methods in Applied Mechanics and Engineering}\n}" |
Perhaps HTTP.jl, and Downloads.jl for that matter, should just do that? cc @StefanKarpinski |
HTTP headers are supposed to only be ASCII but there are apparently some legacy values that end up being encoded in Latin-1, so maybe the best approach here is to interpret them as Latin-1 and transcode them to UTF-8 before presentation to the user. This case appears to be encoded in Latin-1 for what it's worth, so this would "just work" with that approach. Here's some code that checks for ASCII first and if the string isn't pure ASCII transcodes it from Latin-1 to UTF-8: julia> val = "Mart\xedn" # Latin-1 encoded string
"Mart\xedn"
julia> n = count(!isascii, val)
julia> if n > 0
val′ = sprint(sizehint = ncodeunits(val) + n) do io
for byte in codeunits(val)
write(io, Char(byte))
end
end
end
"Martín" |
Here's a an optimized version: function latin1_to_utf8(str::String)
bytes = codeunits(str)
n = count(≥(0x80), bytes)
n == 0 && return str
buf = Base.StringVector(length(bytes) + n)
i = 0
for byte in bytes
if byte ≥ 0x80
buf[i += 1] = 0xc0 | (byte >> 6)
buf[i += 1] = 0x80 | (byte & 0x3f)
else
buf[i += 1] = byte
end
end
return String(buf)
end |
As brought up in #796, there may be scenarios where headers may contain non-UTF8 characters (even though they're supposed to be ASCII). Appreciation to @StefanKarpinski for the Latin-1 -> UTF-8 conversion code and the suggestion to try reencoding before throwing an error. As proposed in this PR, the normal header parsing path should be unaffected and only when we're unable to parse a normal header will we attempt this reencoding. Note that curl warns on the malformed header and filters it out.
Thanks for the suggested solutions @StefanKarpinski; I've packaged these up in a PR here: #830. I think it works pretty well because it's just a check right before we throw the error to see if we can re-encode. |
* Attempt to reencode malformed headers from Latin-1 to UTF8 As brought up in #796, there may be scenarios where headers may contain non-UTF8 characters (even though they're supposed to be ASCII). Appreciation to @StefanKarpinski for the Latin-1 -> UTF-8 conversion code and the suggestion to try reencoding before throwing an error. As proposed in this PR, the normal header parsing path should be unaffected and only when we're unable to parse a normal header will we attempt this reencoding. Note that curl warns on the malformed header and filters it out. * cleanup implementation and ensure tests pass * fix * one more cleanup
I'm not sure this is useful but many dois of mine get rejected and pop up during the link verification. Adding the x-bibtex header (which I spotted in JuliaWeb/HTTP.jl#796) seems to fix that for me, hence this tiny PR. Do with it what you will!
Dear HTTP.jl developers,
when I execute the following MWE on a Julia REPL:
I get the expected output with Julia 1.6.5, but an exception with Julia 1.7.0 (see details below).
Please note that the doi at hand (i.e., 10.1016/j.cma.2021.114093) is tricky in the sense that one of the authors of the paper has an special character in it, namely í in Martín. If I use a doi associated to an article in which the author surnames do not have such character, both Julia 1.6.5 and 1.7.0 have the correct behaviour. I suspect this might be related to differences in the regex implementation of Julia 1.6.5 versus 1.7.0, but I am not certain.
Do you have any clue with regards to whether this is a bug or not, and a possible fix?
Thanks in advance!
cc @santiagobadia
Versions:
The text was updated successfully, but these errors were encountered: