Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent behaviour among Julia 1.6.5 (works ok) versus Julia 1.7.0 (error) #796

Closed
amartinhuertas opened this issue Jan 17, 2022 · 7 comments
Labels
bug client About our HTTP client parser HTTP (headers) parser

Comments

@amartinhuertas
Copy link

Dear HTTP.jl developers,

when I execute the following MWE on a Julia REPL:

using HTTP
doi="10.1016/j.cma.2021.114093"
a = String(HTTP.get("http://doi.org/$doi", ["Accept" => "application/x-bibtex"]).body)

I get the expected output with Julia 1.6.5, but an exception with Julia 1.7.0 (see details below).

Please note that the doi at hand (i.e., 10.1016/j.cma.2021.114093) is tricky in the sense that one of the authors of the paper has an special character in it, namely í in Martín. If I use a doi associated to an article in which the author surnames do not have such character, both Julia 1.6.5 and 1.7.0 have the correct behaviour. I suspect this might be related to differences in the regex implementation of Julia 1.6.5 versus 1.7.0, but I am not certain.

Do you have any clue with regards to whether this is a bug or not, and a possible fix?
Thanks in advance!

cc @santiagobadia

Versions:

  • Julia v1.6.5 vs v1.7.0
  • HTTP.jl v0.9.17
  • MbedTLS.jl v1.0.3
a = String(HTTP.get("http://doi.org/$doi", ["Accept" => "application/x-bibtex"]).body)
ERROR: LoadError: HTTP.Parsers.ParseError(:INVALID_HEADER_FIELD, "link: <http://dx.doi.org/10.1016/j.cma.2021.114093>; rel=\"canonical\", <https://api.elsevier.com/content/article/PII:S0045782521004242?httpAccept=text/xml>; version=\"vor\"; type=\"text/xml\"; rel=\"item\", <https://api.elsevier.com/content/article/PII:S0045782521004242?httpAccept=text/plain>; version=\"vor\"; type=\"text/plain\"; rel=\"item\", <https://www.elsevier.com/tdm/userlicense/1.0/>; version=\"tdm\"; rel=\"license\", <http://orcid.org/0000-0003-2391-4086>; title=\"Santiago Badia\"; rel=\"author\", <http://orcid.org/0000-0001-5751-4561>; title=\"Alberto F. Mart\xedn\"; rel=\"author\"\r\naccess-control-expose-headers: Link\r\naccess-control-allow-headers: X-Requested-With, Accept, Accept-Encoding, Accept-Charset, Accept-Language, Accept-Ranges, Cache-Control\r\naccess-control-allow-origin: *\r\nserver: Jetty(9.4.40.v20210413)\r\nx-ratelimit-limit: 50\r\nx-ratelimit-interval: 1s\r\nx-api-pool: public\r\nx-rate-limit-limit: 50\r\nx-rate-limit-interval: 1s\r\npermissions-policy: interest-cohort=()\r\nconnection: close\r\n\r\n")
Stacktrace:
  [1] parse_header_field(bytes::SubString{String})
    @ HTTP.Parsers ~/.julia/packages/HTTP/aTjcj/src/Parsers.jl:238
  [2] parse_header_fields!(bytes::SubString{String}, m::HTTP.Messages.Response)
    @ HTTP.Messages ~/.julia/packages/HTTP/aTjcj/src/Messages.jl:552
  [3] readheaders(io::HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}, message::HTTP.Messages.Response)
    @ HTTP.Messages ~/.julia/packages/HTTP/aTjcj/src/Messages.jl:539
  [4] startread(http::HTTP.Streams.Stream{HTTP.Messages.Response, HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}})
    @ HTTP.Streams ~/.julia/packages/HTTP/aTjcj/src/Streams.jl:170
  [5] macro expansion
    @ ~/.julia/packages/HTTP/aTjcj/src/StreamRequest.jl:67 [inlined]
  [6] macro expansion
    @ ./task.jl:399 [inlined]
  [7] request(::Type{StreamLayer{Union{}}}, io::HTTP.ConnectionPool.Transaction{MbedTLS.SSLContext}, req::HTTP.Messages.Request, body::Vector{UInt8}; reached_redirect_limit::Bool, response_stream::Nothing, iofunction::Nothing, verbose::Int64, kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ HTTP.StreamRequest ~/.julia/packages/HTTP/aTjcj/src/StreamRequest.jl:57
  [8] request(::Type{ConnectionPoolLayer{StreamLayer{Union{}}}}, url::URIs.URI, req::HTTP.Messages.Request, body::Vector{UInt8}; proxy::Nothing, socket_type::Type, reuse_limit::Int64, kw::Base.Pairs{Symbol, Union{Nothing, Bool}, Tuple{Symbol, Symbol}, NamedTuple{(:iofunction, :reached_redirect_limit), Tuple{Nothing, Bool}}})
    @ HTTP.ConnectionRequest ~/.julia/packages/HTTP/aTjcj/src/ConnectionRequest.jl:108
  [9] request(::Type{ExceptionLayer{ConnectionPoolLayer{StreamLayer{Union{}}}}}, ::URIs.URI, ::Vararg{Any}; kw::Base.Pairs{Symbol, Union{Nothing, Bool}, Tuple{Symbol, Symbol}, NamedTuple{(:iofunction, :reached_redirect_limit), Tuple{Nothing, Bool}}})
    @ HTTP.ExceptionRequest ~/.julia/packages/HTTP/aTjcj/src/ExceptionRequest.jl:19
 [10] (::Base.var"#76#78"{Base.var"#76#77#79"{ExponentialBackOff, HTTP.RetryRequest.var"#2#3"{Bool, HTTP.Messages.Request}, typeof(HTTP.request)}})(::Type, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{Nothing, Bool}, Tuple{Symbol, Symbol}, NamedTuple{(:iofunction, :reached_redirect_limit), Tuple{Nothing, Bool}}})
    @ Base ./error.jl:294
 [11] #request#1
    @ ~/.julia/packages/HTTP/aTjcj/src/RetryRequest.jl:44 [inlined]
 [12] request(::Type{MessageLayer{RetryLayer{ExceptionLayer{ConnectionPoolLayer{StreamLayer{Union{}}}}}}}, method::String, url::URIs.URI, headers::Vector{Pair{SubString{String}, SubString{String}}}, body::Vector{UInt8}; http_version::VersionNumber, target::String, parent::HTTP.Messages.Response, iofunction::Nothing, kw::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:reached_redirect_limit,), Tuple{Bool}}})
    @ HTTP.MessageRequest ~/.julia/packages/HTTP/aTjcj/src/MessageRequest.jl:66
 [13] request(::Type{BasicAuthLayer{MessageLayer{RetryLayer{ExceptionLayer{ConnectionPoolLayer{StreamLayer{Union{}}}}}}}}, method::String, url::URIs.URI, headers::Vector{Pair{SubString{String}, SubString{String}}}, body::Vector{UInt8}; kw::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:reached_redirect_limit, :parent), Tuple{Bool, HTTP.Messages.Response}}})
    @ HTTP.BasicAuthRequest ~/.julia/packages/HTTP/aTjcj/src/BasicAuthRequest.jl:28
 [14] request(::Type{RedirectLayer{BasicAuthLayer{MessageLayer{RetryLayer{ExceptionLayer{ConnectionPoolLayer{StreamLayer{Union{}}}}}}}}}, method::String, url::URIs.URI, headers::Vector{Pair{SubString{String}, SubString{String}}}, body::Vector{UInt8}; redirect_limit::Int64, forwardheaders::Bool, kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ HTTP.RedirectRequest ~/.julia/packages/HTTP/aTjcj/src/RedirectRequest.jl:28
 [15] request
    @ ~/.julia/packages/HTTP/aTjcj/src/RedirectRequest.jl:21 [inlined]
 [16] #request#1
    @ ~/.julia/packages/HTTP/aTjcj/src/TopRequest.jl:15 [inlined]
 [17] request(::Type{TopLayer{RedirectLayer{BasicAuthLayer{MessageLayer{RetryLayer{ExceptionLayer{ConnectionPoolLayer{StreamLayer{Union{}}}}}}}}}}, ::String, ::URIs.URI, ::Vector{Pair{SubString{String}, SubString{String}}}, ::Vector{UInt8})
    @ HTTP.TopRequest ~/.julia/packages/HTTP/aTjcj/src/TopRequest.jl:15
 [18] request(method::String, url::String, h::Vector{Pair{String, String}}, b::Vector{UInt8}; headers::Vector{Pair{String, String}}, body::Vector{UInt8}, query::Nothing, kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ HTTP ~/.julia/packages/HTTP/aTjcj/src/HTTP.jl:330
 [19] request (repeats 2 times)
    @ ~/.julia/packages/HTTP/aTjcj/src/HTTP.jl:330 [inlined]
 [20] get(::String, ::Vararg{Any}; kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ HTTP ~/.julia/packages/HTTP/aTjcj/src/HTTP.jl:407
 [21] get(::String, ::Vararg{Any})
    @ HTTP ~/.julia/packages/HTTP/aTjcj/src/HTTP.jl:407
 [22] top-level scope
    @ ~/github/santiagobadia/MyCodingProjects.jl/src/doi-to-refs.jl:125
in expression starting at /home/santiago/github/santiagobadia/MyCodingProjects.jl/src/doi-to-refs.jl:12
@fonsp
Copy link
Member

fonsp commented Mar 8, 2022

It looks like the header contains data in a different encoding, perhaps UTF-16, but Julia parsed is as UTF-8:

julia> codeunits("Martín") # utf8 representation:
7-element Base.CodeUnits{UInt8, String}:
 0x4d
 0x61
 0x72
 0x74
 0xc3
 0xad
 0x6e

julia> transcode(UInt16,"Martín") # UTF-16 representation:
6-element Vector{UInt16}:
 0x004d
 0x0061
 0x0072
 0x0074
 0x00ed # <- here is the \xed character
 0x006e

I also found a difference in regex matching, when given this faulty string:

On Julia 1.7.0, I get:

julia> match(r"[^\r\n]+", "Martín")
RegexMatch("Martín")

julia> match(r"[^\r\n]+", "Mart\xedn")
RegexMatch("Mart")

On Julia 1.6.5, it's:

julia> match(r"[^\r\n]+", "Martín")
RegexMatch("Martín")

julia> match(r"[^\r\n]+", "Mart\xedn")
ERROR: BoundsError: attempt to access 6-codeunit String at index [8]
Stacktrace:
 [1] prevind(s::String, i::Int64, n::Int64)
   @ Base ./strings/basic.jl:499
 [2] prevind
   @ ./strings/basic.jl:494 [inlined]
 [3] prevind
   @ ./strings/basic.jl:493 [inlined]
 [4] match(re::Regex, str::String, idx::Int64, add_opts::UInt32)
   @ Base ./regex.jl:306
 [5] match
   @ ./regex.jl:297 [inlined]
 [6] match(r::Regex, s::String)
   @ Base ./regex.jl:316
 [7] top-level scope
   @ REPL[2]:1

Not sure what the conclusion is...

@aviks
Copy link
Member

aviks commented Mar 8, 2022

The behavior in 1.6.5 is a bug, that could potentially lead to a security issue. The real question here, IMO, is why is the string being created with invalid UTF-8. If a server is legitimately returning results in a different encoding, that should then be converted to to UTF-8 before sending to the regex engine. Maybe you can read the charset from the content-type

@fonsp
Copy link
Member

fonsp commented Mar 8, 2022

I couldn't find any information online about encoding in the header values. Also no evidence that the content-type charset applies to other header values...

It works with Downloads.jl (based on CURL), it throws a warning that that link header value is malformed, but still returns a response with body and the remaining headers:

julia> r = Downloads.request("http://doi.org/10.1016/j.cma.2021.114093"; headers=["Accept" => "application/x-bibtex"]);
┌ Warning: malformed HTTP header
│   url = "https://api.crossref.org/v1/works/10.1016%2Fj.cma.2021.114093/transform"
│   status = 200
│   header = "link: <http://dx.doi.org/10.1016/j.cma.2021.114093>; rel=\"canonical\", <https://api.elsevier.com/content/article/PII:S0045782521004242?httpAccept=text/xml>; version=\"vor\"; type=\"text/xml\"; rel=\"item\", <https://api.elsevier.com/content/article/PII:S0045782"  65 bytes  "\"; rel=\"item\", <https://www.elsevier.com/tdm/userlicense/1.0/>; version=\"tdm\"; rel=\"license\", <http://orcid.org/0000-0003-2391-4086>; title=\"Santiago Badia\"; rel=\"author\", <http://orcid.org/0000-0001-5751-4561>; title=\"Alberto F. Mart\xedn\"; rel=\"author\"\r\n"
└ @ Downloads.Curl /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.7/Downloads/src/Curl/Easy.jl:261

julia> r.headers
14-element Vector{Pair{String, String}}:
                          "date" => "Tue, 08 Mar 2022 22:34:16 GMT"
                  "content-type" => "application/x-bibtex"
                          "vary" => "Accept"
 "access-control-expose-headers" => "Link"
  "access-control-allow-headers" => "X-Requested-With, Accept, Accept-Encoding, Accept-Charset, Accept-Language, Accept-Ranges, Cache-Control"
   "access-control-allow-origin" => "*"
                        "server" => "Jetty(9.4.40.v20210413)"
             "x-ratelimit-limit" => "50"
          "x-ratelimit-interval" => "1s"
                    "x-api-pool" => "public"
            "x-rate-limit-limit" => "50"
         "x-rate-limit-interval" => "1s"
            "permissions-policy" => "interest-cohort=()"
                    "connection" => "close"

@amartinhuertas For your use case, consider using HTTPDownloads.jl, then it only logs a warning:

julia> a = String(HTTP.get("http://doi.org/$doi", ["Accept" => "application/x-bibtex"]).body)
┌ Warning: malformed HTTP header
│   ... same warning as above ...
└ @ Downloads.Curl
"@article{Badia_2021,\n\tdoi = {10.1016/j.cma.2021.114093},\n\turl = {https://doi.org/10.1016%2Fj.cma.2021.114093},\n\tyear = 2021,\n\tmonth = {dec},\n\tpublisher = {Elsevier {BV}},\n\tvolume = {386},\n\tpages = {114093},\n\tauthor = {Santiago Badia and Manuel A. Caicedo and Alberto F. Mart{\\'{\\i}}n and Javier Principe},\n\ttitle = {A robust and scalable unfitted adaptive finite element framework for nonlinear solid mechanics},\n\tjournal = {Computer Methods in Applied Mechanics and Engineering}\n}"

@fredrikekre
Copy link
Member

curl seems to just filter it out:

$ curl -H "Accept: application/x-bibtex" -fsSIL http://doi.org/10.1016/j.cma.2021.114093 | rg link
link: <https://dul.usage.elsevier.com/doi/>; rel=dul
link: <https://dul.usage.elsevier.com/doi/>; rel=dul
link: <http://dx.doi.org/10.1016/j.cma.2021.114093>; rel="canonical", <https://api.elsevier.com/content/article/PII:S0045782521004242?httpAccept=text/xml>; version="vor"; type="text/xml"; rel="item", <https://api.elsevier.com/content/article/PII:S0045782521004242?httpAccept=text/plain>; version="vor"; type="text/plain"; rel="item", <https://www.elsevier.com/tdm/userlicense/1.0/>; version="tdm"; rel="license", <http://orcid.org/0000-0003-2391-4086>; title="Santiago Badia"; rel="author", <http://orcid.org/0000-0001-5751-4561>; title="Alberto F. Martn"; rel="author"

Perhaps HTTP.jl, and Downloads.jl for that matter, should just do that? cc @StefanKarpinski

@StefanKarpinski
Copy link
Contributor

StefanKarpinski commented Mar 14, 2022

HTTP headers are supposed to only be ASCII but there are apparently some legacy values that end up being encoded in Latin-1, so maybe the best approach here is to interpret them as Latin-1 and transcode them to UTF-8 before presentation to the user. This case appears to be encoded in Latin-1 for what it's worth, so this would "just work" with that approach. Here's some code that checks for ASCII first and if the string isn't pure ASCII transcodes it from Latin-1 to UTF-8:

julia> val = "Mart\xedn" # Latin-1 encoded string
"Mart\xedn"

julia> n = count(!isascii, val)

julia> if n > 0
           val′ = sprint(sizehint = ncodeunits(val) + n) do io
               for byte in codeunits(val)
                   write(io, Char(byte))
               end
           end
       end
"Martín"

@StefanKarpinski
Copy link
Contributor

StefanKarpinski commented Mar 14, 2022

Here's a an optimized version:

function latin1_to_utf8(str::String)
    bytes = codeunits(str)
    n = count((0x80), bytes)
    n == 0 && return str
    buf = Base.StringVector(length(bytes) + n)
    i = 0
    for byte in bytes
        if byte  0x80
            buf[i += 1] = 0xc0 | (byte >> 6)
            buf[i += 1] = 0x80 | (byte & 0x3f)
        else
            buf[i += 1] = byte
        end
    end
    return String(buf)
end

@fonsp fonsp added client About our HTTP client bug parser HTTP (headers) parser labels Mar 16, 2022
quinnj added a commit that referenced this issue May 25, 2022
As brought up in #796, there may be scenarios where headers may contain
non-UTF8 characters (even though they're supposed to be ASCII). Appreciation
to @StefanKarpinski for the Latin-1 -> UTF-8 conversion code and the suggestion
to try reencoding before throwing an error. As proposed in this PR, the normal
header parsing path should be unaffected and only when we're unable to parse
a normal header will we attempt this reencoding.

Note that curl warns on the malformed header and filters it out.
@quinnj
Copy link
Member

quinnj commented May 25, 2022

Thanks for the suggested solutions @StefanKarpinski; I've packaged these up in a PR here: #830. I think it works pretty well because it's just a check right before we throw the error to see if we can re-encode.

quinnj added a commit that referenced this issue May 25, 2022
* Attempt to reencode malformed headers from Latin-1 to UTF8

As brought up in #796, there may be scenarios where headers may contain
non-UTF8 characters (even though they're supposed to be ASCII). Appreciation
to @StefanKarpinski for the Latin-1 -> UTF-8 conversion code and the suggestion
to try reencoding before throwing an error. As proposed in this PR, the normal
header parsing path should be unaffected and only when we're unable to parse
a normal header will we attempt this reencoding.

Note that curl warns on the malformed header and filters it out.

* cleanup implementation and ensure tests pass

* fix

* one more cleanup
@quinnj quinnj closed this as completed May 25, 2022
briochemc added a commit to briochemc/Franklin.jl that referenced this issue Sep 2, 2024
I'm not sure this is useful but many dois of mine get rejected and pop up during the link verification. Adding the x-bibtex header (which I spotted in JuliaWeb/HTTP.jl#796) seems to fix that for me, hence this tiny PR. Do with it what you will!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug client About our HTTP client parser HTTP (headers) parser
Projects
None yet
Development

No branches or pull requests

6 participants