[bug] libxml 2.9.13 breaks HTML4 parser recovery from ill-formed `<` character #2461

flavorjones · 2022-02-21T20:24:27Z

Summary

Nokogiri v1.13.2 shipped libxml 2.9.13. That version of libxml2 introduced a behavior change in how the HTML4 parser recovers when it sees a bare (ill-formed) `<` character (one that is not part of a start tag).

I've opened an issue upstream at https://gitlab.gnome.org/GNOME/libxml2/-/issues/339

Immediate next steps

- add test coverage for recovery from `<` in an HTML4 document to the Nokogiri test suite
- apply a patch to libxml2 reverting https://gitlab.gnome.org/GNOME/libxml2/-/commit/798bdf13f6964a650b9a0b7b4b3a769f6f1d509a
  - fix: revert libxml2 regression with HTML4 recovery #2462
  - fix: revert libxml2 regression with HTML4 recovery (v1.13.x branch) #2463
- ship v1.13.3 🤷

Less-urgent next steps

- test: add coverage for entities flavorjones/loofah#227
- prioritize "CI should test against major downstream consumers" #2293, to test loofah and rails-html-sanitizer (at least) against upstream libxml2/libxslt

flavorjones · 2022-02-21T20:55:32Z

OK, I have a repro which seems common across the rails-html-sanitizer failures as well as my day job CI failures:

      it "handles < character" do
        input = %{<div> this < that </div>}
        expected = %{<div> this &lt; that </div>}
        actual = Loofah.scrub_fragment(input, :escape)
        assert_equal(expected, actual.to_html)
      end

With nokogiri v1.13.1, this passes. With nokogiri v1.13.2, it fails:

Expected: "<div> this &lt; that </div>"
  Actual: "<div> this </div>"

flavorjones · 2022-02-21T21:02:49Z

Without Loofah, here's the core problem:

# nokogiri 1.13.1
$ ruby -rnokogiri -e 'pp Nokogiri::HTML4::Document.parse("<div> this < that </div>")'
#(Document:0x3c {
  name = "document",
  children = [
    #(DTD:0x50 { name = "html" }),
    #(Element:0x64 {
      name = "html",
      children = [
        #(Element:0x78 {
          name = "body",
          children = [
            #(Element:0x8c {
              name = "div",
              children = [ #(Text " this < that ")]
              })]
          })]
      })]
  })

# nokogiri 1.13.2
$ ruby -rnokogiri -e 'pp Nokogiri::HTML4::Document.parse("<div> this < that </div>")'
#(Document:0x3c {
  name = "document",
  children = [
    #(DTD:0x50 { name = "html" }),
    #(Element:0x64 {
      name = "html",
      children = [
        #(Element:0x78 {
          name = "body",
          children = [
            #(Element:0x8c { name = "div", children = [ #(Text " this ")] })]
          })]
      })]
  })
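
To check which behavior a given installation has, the two transcripts above can be folded into a small probe. This is only a sketch: `bare_lt_recovery_status` and `PROBE_INPUT` are hypothetical names, not part of Nokogiri, and the check assumes that the presence of the word "that" in the `div` text is enough to tell the two outcomes apart.

```ruby
# Probe: does the installed HTML4 parser keep the text after a bare "<"?
begin
  require "nokogiri"
rescue LoadError
  # Nokogiri is not installed; the probe below reports :unknown.
end

# Same input as the transcripts above.
PROBE_INPUT = "<div> this < that </div>"

# Hypothetical helper, not part of Nokogiri's API.
# Returns :ok when the text after "<" survives parsing, :regressed when it
# is dropped, and :unknown when Nokogiri::HTML4 is unavailable or the
# parsed document has no div element.
def bare_lt_recovery_status
  return :unknown unless defined?(Nokogiri::HTML4)

  div = Nokogiri::HTML4::Document.parse(PROBE_INPUT).at_css("div")
  return :unknown if div.nil?

  div.text.include?("that") ? :ok : :regressed
end

puts "bare '<' recovery: #{bare_lt_recovery_status}"
```

Going by the output above, this should report `:ok` on nokogiri 1.13.1 and `:regressed` on 1.13.2. Probing behavior rather than comparing version strings is deliberate, since a patched build can still report libxml2 2.9.13.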

flavorjones · 2022-02-21T22:47:51Z

I've opened an issue upstream: https://gitlab.gnome.org/GNOME/libxml2/-/issues/339

I'm going to explore reverting the related commits in a patch to see if I can get a fast-follow release of Nokogiri for y'all.

flavorjones · 2022-02-21T22:59:17Z

I've updated this issue's description with a punch list of next steps.

flavorjones · 2022-02-22T04:53:49Z

v1.13.3 has been released to address this: https://github.com/sparklemotion/nokogiri/releases/tag/v1.13.3

flavorjones added the state/needs-triage label Feb 21, 2022

flavorjones changed the title ~~[bug] Nokogiri v1.13.2 is buggy with respect to sanitization and entities~~ [bug] Nokogiri v1.13.2 / libxml 2.9.13 breaks some sanitization and entity behavior Feb 21, 2022

flavorjones added upstream/libxml2 and removed state/needs-triage labels Feb 21, 2022

flavorjones changed the title ~~[bug] Nokogiri v1.13.2 / libxml 2.9.13 breaks some sanitization and entity behavior~~ [bug] Nokogiri v1.13.2 / libxml 2.9.13 breaks HTML4 parser recovery from ill-formed < character Feb 21, 2022

flavorjones changed the title ~~[bug] Nokogiri v1.13.2 / libxml 2.9.13 breaks HTML4 parser recovery from ill-formed < character~~ [bug] libxml 2.9.13 breaks HTML4 parser recovery from ill-formed < character Feb 21, 2022

flavorjones added a commit that referenced this issue Feb 21, 2022

fix: revert libxml2 regression with HTML4 recovery

9530da8

Fixes #2461

flavorjones mentioned this issue Feb 21, 2022

fix: revert libxml2 regression with HTML4 recovery #2462

Merged

flavorjones added a commit to flavorjones/loofah that referenced this issue Feb 21, 2022

test: add coverage for entities

f8c6249

see sparklemotion/nokogiri#2461 for background

flavorjones mentioned this issue Feb 21, 2022

test: add coverage for entities flavorjones/loofah#227

Merged

flavorjones added a commit that referenced this issue Feb 21, 2022

fix: revert libxml2 regression with HTML4 recovery

16b4fa2

Fixes #2461

flavorjones closed this as completed in #2462 Feb 22, 2022

flavorjones added a commit that referenced this issue Feb 22, 2022

fix: revert libxml2 regression with HTML4 recovery

ff65816

Fixes #2461

flavorjones mentioned this issue Feb 22, 2022

fix: revert libxml2 regression with HTML4 recovery (v1.13.x branch) #2463

Merged

flavorjones added a commit that referenced this issue Feb 22, 2022

fix: revert libxml2 regression with HTML4 recovery

5970fd9

Fixes #2461

flavorjones reopened this Feb 22, 2022

flavorjones closed this as completed Feb 22, 2022

This was referenced Feb 26, 2022

Bump nokogiri from 1.13.1 to 1.13.3 in /apps/myjobs OSC/ondemand#1857

Closed

Bump nokogiri from 1.13.1 to 1.13.3 in /apps/dashboard OSC/ondemand#1856

Closed

voxik mentioned this issue Mar 18, 2022

Revisit libxml 2.9.13 compatibility #2479

Closed

flavorjones mentioned this issue Mar 27, 2022

tests fail with latest versions of dependencies flavorjones/loofah#230

Closed
