Inconsistent behaviour among Julia 1.6.5 (works ok) versus Julia 1.7.0 (error) #796

amartinhuertas · 2022-01-17T01:45:45Z

Dear HTTP.jl developers,

when I execute the following MWE on a Julia REPL:

using HTTP
doi="10.1016/j.cma.2021.114093"
a = String(HTTP.get("http://doi.org/$doi", ["Accept" => "application/x-bibtex"]).body)

I get the expected output with Julia 1.6.5, but an exception with Julia 1.7.0 (see details below).

Please note that the doi at hand (i.e., 10.1016/j.cma.2021.114093) is tricky in the sense that one of the authors of the paper has an special character in it, namely í in Martín. If I use a doi associated to an article in which the author surnames do not have such character, both Julia 1.6.5 and 1.7.0 have the correct behaviour. I suspect this might be related to differences in the regex implementation of Julia 1.6.5 versus 1.7.0, but I am not certain.

Do you have any clue with regards to whether this is a bug or not, and a possible fix?
Thanks in advance!

cc @santiagobadia

Versions:

Julia v1.6.5 vs v1.7.0
HTTP.jl v0.9.17
MbedTLS.jl v1.0.3

a = String(HTTP.get("http://doi.org/$doi", ["Accept" => "application/x-bibtex"]).body)
ERROR: LoadError: HTTP.Parsers.ParseError(:INVALID_HEADER_FIELD, "link: <http://dx.doi.org/10.1016/j.cma.2021.114093>; rel=\"canonical\", <https://api.elsevier.com/content/article/PII:S0045782521004242?httpAccept=text/xml>; version=\"vor\"; type=\"text/xml\"; rel=\"item\", <https://api.elsevier.com/content/article/PII:S0045782521004242?httpAccept=text/plain>; version=\"vor\"; type=\"text/plain\"; rel=\"item\", <https://www.elsevier.com/tdm/userlicense/1.0/>; version=\"tdm\"; rel=\"license\", <http://orcid.org/0000-0003-2391-4086>; title=\"Santiago Badia\"; rel=\"author\", <http://orcid.org/0000-0001-5751-4561>; title=\"Alberto F. Mart\xedn\"; rel=\"author\"\r\naccess-control-expose-headers: Link\r\naccess-control-allow-headers: X-Requested-With, Accept, Accept-Encoding, Accept-Charset, Accept-Language, Accept-Ranges, Cache-Control\r\naccess-control-allow-origin: *\r\nserver: Jetty(9.4.40.v20210413)\r\nx-ratelimit-limit: 50\r\nx-ratelimit-interval: 1s\r\nx-api-pool: public\r\nx-rate-limit-limit: 50\r\nx-rate-limit-interval: 1s\r\npermissions-policy: interest-cohort=()\r\nconnection: close\r\n\r\n")
Stacktrace:
  [1] parse_header_field(bytes::SubString{String})
    @ HTTP.Parsers ~/.julia/packages/HTTP/aTjcj/src/Parsers.jl:238
  [2] parse_header_fields!(bytes::SubString{String}, m::HTTP.Messages.Response)
    @ HTTP.Messages ~/.julia/packages/HTTP/aTjcj/src/Messages.jl:552
  [3] readheaders(io::HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}, message::HTTP.Messages.Response)
    @ HTTP.Messages ~/.julia/packages/HTTP/aTjcj/src/Messages.jl:539
  [4] startread(http::HTTP.Streams.Stream{HTTP.Messages.Response, HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}})
    @ HTTP.Streams ~/.julia/packages/HTTP/aTjcj/src/Streams.jl:170
  [5] macro expansion
    @ ~/.julia/packages/HTTP/aTjcj/src/StreamRequest.jl:67 [inlined]
  [6] macro expansion
    @ ./task.jl:399 [inlined]
  [7] request(::Type{StreamLayer{Union{}}}, io::HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}, req::HTTP.Messages.Request, body::Vector{UInt8}; reached_redirect_limit::Bool, response_stream::Nothing, iofunction::Nothing, verbose::Int64, kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ HTTP.StreamRequest ~/.julia/packages/HTTP/aTjcj/src/StreamRequest.jl:57
  [8] request(::Type{ConnectionPoolLayer{StreamLayer{Union{}}}}, url::URIs.URI, req::HTTP.Messages.Request, body::Vector{UInt8}; proxy::Nothing, socket_type::Type, reuse_limit::Int64, kw::Base.Pairs{Symbol, Union{Nothing, Bool}, Tuple{Symbol, Symbol}, NamedTuple{(:iofunction, :reached_redirect_limit), Tuple{Nothing, Bool}}})
    @ HTTP.ConnectionRequest ~/.julia/packages/HTTP/aTjcj/src/ConnectionRequest.jl:108
  [9] request(::Type{ExceptionLayer{ConnectionPoolLayer{StreamLayer{Union{}}}}}, ::URIs.URI, ::Vararg{Any}; kw::Base.Pairs{Symbol, Union{Nothing, Bool}, Tuple{Symbol, Symbol}, NamedTuple{(:iofunction, :reached_redirect_limit), Tuple{Nothing, Bool}}})
    @ HTTP.ExceptionRequest ~/.julia/packages/HTTP/aTjcj/src/ExceptionRequest.jl:19
 [10] (::Base.var"#76#78"{Base.var"#76#77#79"{ExponentialBackOff, HTTP.RetryRequest.var"#2#3"{Bool, HTTP.Messages.Request}, typeof(HTTP.request)}})(::Type, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{Nothing, Bool}, Tuple{Symbol, Symbol}, NamedTuple{(:iofunction, :reached_redirect_limit), Tuple{Nothing, Bool}}})
    @ Base ./error.jl:294
 [11] #request#1
    @ ~/.julia/packages/HTTP/aTjcj/src/RetryRequest.jl:44 [inlined]
 [12] request(::Type{MessageLayer{RetryLayer{ExceptionLayer{ConnectionPoolLayer{StreamLayer{Union{}}}}}}}, method::String, url::URIs.URI, headers::Vector{Pair{SubString{String}, SubString{String}}}, body::Vector{UInt8}; http_version::VersionNumber, target::String, parent::HTTP.Messages.Response, iofunction::Nothing, kw::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:reached_redirect_limit,), Tuple{Bool}}})
    @ HTTP.MessageRequest ~/.julia/packages/HTTP/aTjcj/src/MessageRequest.jl:66
 [13] request(::Type{BasicAuthLayer{MessageLayer{RetryLayer{ExceptionLayer{ConnectionPoolLayer{StreamLayer{Union{}}}}}}}}, method::String, url::URIs.URI, headers::Vector{Pair{SubString{String}, SubString{String}}}, body::Vector{UInt8}; kw::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:reached_redirect_limit, :parent), Tuple{Bool, HTTP.Messages.Response}}})
    @ HTTP.BasicAuthRequest ~/.julia/packages/HTTP/aTjcj/src/BasicAuthRequest.jl:28
 [14] request(::Type{RedirectLayer{BasicAuthLayer{MessageLayer{RetryLayer{ExceptionLayer{ConnectionPoolLayer{StreamLayer{Union{}}}}}}}}}, method::String, url::URIs.URI, headers::Vector{Pair{SubString{String}, SubString{String}}}, body::Vector{UInt8}; redirect_limit::Int64, forwardheaders::Bool, kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ HTTP.RedirectRequest ~/.julia/packages/HTTP/aTjcj/src/RedirectRequest.jl:28
 [15] request
    @ ~/.julia/packages/HTTP/aTjcj/src/RedirectRequest.jl:21 [inlined]
 [16] #request#1
    @ ~/.julia/packages/HTTP/aTjcj/src/TopRequest.jl:15 [inlined]
 [17] request(::Type{TopLayer{RedirectLayer{BasicAuthLayer{MessageLayer{RetryLayer{ExceptionLayer{ConnectionPoolLayer{StreamLayer{Union{}}}}}}}}}}, ::String, ::URIs.URI, ::Vector{Pair{SubString{String}, SubString{String}}}, ::Vector{UInt8})
    @ HTTP.TopRequest ~/.julia/packages/HTTP/aTjcj/src/TopRequest.jl:15
 [18] request(method::String, url::String, h::Vector{Pair{String, String}}, b::Vector{UInt8}; headers::Vector{Pair{String, String}}, body::Vector{UInt8}, query::Nothing, kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ HTTP ~/.julia/packages/HTTP/aTjcj/src/HTTP.jl:330
 [19] request (repeats 2 times)
    @ ~/.julia/packages/HTTP/aTjcj/src/HTTP.jl:330 [inlined]
 [20] get(::String, ::Vararg{Any}; kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ HTTP ~/.julia/packages/HTTP/aTjcj/src/HTTP.jl:407
 [21] get(::String, ::Vararg{Any})
    @ HTTP ~/.julia/packages/HTTP/aTjcj/src/HTTP.jl:407
 [22] top-level scope
    @ ~/github/santiagobadia/MyCodingProjects.jl/src/doi-to-refs.jl:125
in expression starting at /home/santiago/github/santiagobadia/MyCodingProjects.jl/src/doi-to-refs.jl:12

The text was updated successfully, but these errors were encountered:

fonsp · 2022-03-08T14:05:29Z

It looks like the header contains data in a different encoding, perhaps UTF-16, but Julia parsed is as UTF-8:

julia> codeunits("Martín") # utf8 representation:
7-element Base.CodeUnits{UInt8, String}:
 0x4d
 0x61
 0x72
 0x74
 0xc3
 0xad
 0x6e

julia> transcode(UInt16,"Martín") # UTF-16 representation:
6-element Vector{UInt16}:
 0x004d
 0x0061
 0x0072
 0x0074
 0x00ed # <- here is the \xed character
 0x006e

I also found a difference in regex matching, when given this faulty string:

On Julia 1.7.0, I get:

julia> match(r"[^\r\n]+", "Martín")
RegexMatch("Martín")

julia> match(r"[^\r\n]+", "Mart\xedn")
RegexMatch("Mart")

On Julia 1.6.5, it's:

julia> match(r"[^\r\n]+", "Martín")
RegexMatch("Martín")

julia> match(r"[^\r\n]+", "Mart\xedn")
ERROR: BoundsError: attempt to access 6-codeunit String at index [8]
Stacktrace:
 [1] prevind(s::String, i::Int64, n::Int64)
   @ Base ./strings/basic.jl:499
 [2] prevind
   @ ./strings/basic.jl:494 [inlined]
 [3] prevind
   @ ./strings/basic.jl:493 [inlined]
 [4] match(re::Regex, str::String, idx::Int64, add_opts::UInt32)
   @ Base ./regex.jl:306
 [5] match
   @ ./regex.jl:297 [inlined]
 [6] match(r::Regex, s::String)
   @ Base ./regex.jl:316
 [7] top-level scope
   @ REPL[2]:1

Not sure what the conclusion is...

aviks · 2022-03-08T22:28:34Z

The behavior in 1.6.5 is a bug, that could potentially lead to a security issue. The real question here, IMO, is why is the string being created with invalid UTF-8. If a server is legitimately returning results in a different encoding, that should then be converted to to UTF-8 before sending to the regex engine. Maybe you can read the charset from the content-type

fonsp · 2022-03-08T22:42:52Z

I couldn't find any information online about encoding in the header values. Also no evidence that the content-type charset applies to other header values...

It works with Downloads.jl (based on CURL), it throws a warning that that link header value is malformed, but still returns a response with body and the remaining headers:

julia> r = Downloads.request("http://doi.org/10.1016/j.cma.2021.114093"; headers=["Accept" => "application/x-bibtex"]);
┌ Warning: malformed HTTP header
│   url = "https://api.crossref.org/v1/works/10.1016%2Fj.cma.2021.114093/transform"
│   status = 200
│   header = "link: <http://dx.doi.org/10.1016/j.cma.2021.114093>; rel=\"canonical\", <https://api.elsevier.com/content/article/PII:S0045782521004242?httpAccept=text/xml>; version=\"vor\"; type=\"text/xml\"; rel=\"item\", <https://api.elsevier.com/content/article/PII:S0045782" ⋯ 65 bytes ⋯ "\"; rel=\"item\", <https://www.elsevier.com/tdm/userlicense/1.0/>; version=\"tdm\"; rel=\"license\", <http://orcid.org/0000-0003-2391-4086>; title=\"Santiago Badia\"; rel=\"author\", <http://orcid.org/0000-0001-5751-4561>; title=\"Alberto F. Mart\xedn\"; rel=\"author\"\r\n"
└ @ Downloads.Curl /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.7/Downloads/src/Curl/Easy.jl:261

julia> r.headers
14-element Vector{Pair{String, String}}:
                          "date" => "Tue, 08 Mar 2022 22:34:16 GMT"
                  "content-type" => "application/x-bibtex"
                          "vary" => "Accept"
 "access-control-expose-headers" => "Link"
  "access-control-allow-headers" => "X-Requested-With, Accept, Accept-Encoding, Accept-Charset, Accept-Language, Accept-Ranges, Cache-Control"
   "access-control-allow-origin" => "*"
                        "server" => "Jetty(9.4.40.v20210413)"
             "x-ratelimit-limit" => "50"
          "x-ratelimit-interval" => "1s"
                    "x-api-pool" => "public"
            "x-rate-limit-limit" => "50"
         "x-rate-limit-interval" => "1s"
            "permissions-policy" => "interest-cohort=()"
                    "connection" => "close"

@amartinhuertas For your use case, consider using HTTPDownloads.jl, then it only logs a warning:

julia> a = String(HTTP.get("http://doi.org/$doi", ["Accept" => "application/x-bibtex"]).body)
┌ Warning: malformed HTTP header
│   ... same warning as above ...
└ @ Downloads.Curl
"@article{Badia_2021,\n\tdoi = {10.1016/j.cma.2021.114093},\n\turl = {https://doi.org/10.1016%2Fj.cma.2021.114093},\n\tyear = 2021,\n\tmonth = {dec},\n\tpublisher = {Elsevier {BV}},\n\tvolume = {386},\n\tpages = {114093},\n\tauthor = {Santiago Badia and Manuel A. Caicedo and Alberto F. Mart{\\'{\\i}}n and Javier Principe},\n\ttitle = {A robust and scalable unfitted adaptive finite element framework for nonlinear solid mechanics},\n\tjournal = {Computer Methods in Applied Mechanics and Engineering}\n}"

fredrikekre · 2022-03-09T08:43:51Z

curl seems to just filter it out:

$ curl -H "Accept: application/x-bibtex" -fsSIL http://doi.org/10.1016/j.cma.2021.114093 | rg link
link: <https://dul.usage.elsevier.com/doi/>; rel=dul
link: <https://dul.usage.elsevier.com/doi/>; rel=dul
link: <http://dx.doi.org/10.1016/j.cma.2021.114093>; rel="canonical", <https://api.elsevier.com/content/article/PII:S0045782521004242?httpAccept=text/xml>; version="vor"; type="text/xml"; rel="item", <https://api.elsevier.com/content/article/PII:S0045782521004242?httpAccept=text/plain>; version="vor"; type="text/plain"; rel="item", <https://www.elsevier.com/tdm/userlicense/1.0/>; version="tdm"; rel="license", <http://orcid.org/0000-0003-2391-4086>; title="Santiago Badia"; rel="author", <http://orcid.org/0000-0001-5751-4561>; title="Alberto F. Martn"; rel="author"

Perhaps HTTP.jl, and Downloads.jl for that matter, should just do that? cc @StefanKarpinski

StefanKarpinski · 2022-03-14T17:58:18Z

HTTP headers are supposed to only be ASCII but there are apparently some legacy values that end up being encoded in Latin-1, so maybe the best approach here is to interpret them as Latin-1 and transcode them to UTF-8 before presentation to the user. This case appears to be encoded in Latin-1 for what it's worth, so this would "just work" with that approach. Here's some code that checks for ASCII first and if the string isn't pure ASCII transcodes it from Latin-1 to UTF-8:

julia> val = "Mart\xedn" # Latin-1 encoded string
"Mart\xedn"

julia> n = count(!isascii, val)

julia> if n > 0
           val′ = sprint(sizehint = ncodeunits(val) + n) do io
               for byte in codeunits(val)
                   write(io, Char(byte))
               end
           end
       end
"Martín"

StefanKarpinski · 2022-03-14T19:54:56Z

Here's a an optimized version:

function latin1_to_utf8(str::String)
    bytes = codeunits(str)
    n = count(≥(0x80), bytes)
    n == 0 && return str
    buf = Base.StringVector(length(bytes) + n)
    i = 0
    for byte in bytes
        if byte ≥ 0x80
            buf[i += 1] = 0xc0 | (byte >> 6)
            buf[i += 1] = 0x80 | (byte & 0x3f)
        else
            buf[i += 1] = byte
        end
    end
    return String(buf)
end

@StefanKarpinski

As brought up in #796, there may be scenarios where headers may contain non-UTF8 characters (even though they're supposed to be ASCII). Appreciation to @StefanKarpinski for the Latin-1 -> UTF-8 conversion code and the suggestion to try reencoding before throwing an error. As proposed in this PR, the normal header parsing path should be unaffected and only when we're unable to parse a normal header will we attempt this reencoding. Note that curl warns on the malformed header and filters it out.

quinnj · 2022-05-25T13:47:23Z

Thanks for the suggested solutions @StefanKarpinski; I've packaged these up in a PR here: #830. I think it works pretty well because it's just a check right before we throw the error to see if we can re-encode.

@StefanKarpinski

* Attempt to reencode malformed headers from Latin-1 to UTF8 As brought up in #796, there may be scenarios where headers may contain non-UTF8 characters (even though they're supposed to be ASCII). Appreciation to @StefanKarpinski for the Latin-1 -> UTF-8 conversion code and the suggestion to try reencoding before throwing an error. As proposed in this PR, the normal header parsing path should be unaffected and only when we're unable to parse a normal header will we attempt this reencoding. Note that curl warns on the malformed header and filters it out. * cleanup implementation and ensure tests pass * fix * one more cleanup

I'm not sure this is useful but many dois of mine get rejected and pop up during the link verification. Adding the x-bibtex header (which I spotted in JuliaWeb/HTTP.jl#796) seems to fix that for me, hence this tiny PR. Do with it what you will!

fonsp mentioned this issue Mar 8, 2022

Regex not matching malformed UTF-8 JuliaLang/julia#44520

Closed

fonsp added client About our HTTP client bug parser HTTP (headers) parser labels Mar 16, 2022

quinnj mentioned this issue May 25, 2022

Attempt to reencode malformed headers from Latin-1 to UTF8 #830

Merged

quinnj closed this as completed May 25, 2022

briochemc mentioned this issue Sep 2, 2024

fix verify_links to accept doi links tlienart/Franklin.jl#1091

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent behaviour among Julia 1.6.5 (works ok) versus Julia 1.7.0 (error) #796

Inconsistent behaviour among Julia 1.6.5 (works ok) versus Julia 1.7.0 (error) #796

amartinhuertas commented Jan 17, 2022

fonsp commented Mar 8, 2022

aviks commented Mar 8, 2022

fonsp commented Mar 8, 2022 •

edited

Loading

fredrikekre commented Mar 9, 2022

StefanKarpinski commented Mar 14, 2022 •

edited

Loading

StefanKarpinski commented Mar 14, 2022 •

edited

Loading

quinnj commented May 25, 2022

Inconsistent behaviour among Julia 1.6.5 (works ok) versus Julia 1.7.0 (error) #796

Inconsistent behaviour among Julia 1.6.5 (works ok) versus Julia 1.7.0 (error) #796

Comments

amartinhuertas commented Jan 17, 2022

fonsp commented Mar 8, 2022

aviks commented Mar 8, 2022

fonsp commented Mar 8, 2022 • edited Loading

fredrikekre commented Mar 9, 2022

StefanKarpinski commented Mar 14, 2022 • edited Loading

StefanKarpinski commented Mar 14, 2022 • edited Loading

quinnj commented May 25, 2022

fonsp commented Mar 8, 2022 •

edited

Loading

StefanKarpinski commented Mar 14, 2022 •

edited

Loading

StefanKarpinski commented Mar 14, 2022 •

edited

Loading