Bug: Inside comments, `<<` is parsed as `<!` #325

anko · 2020-09-29T12:07:49Z

Repro steps (Linux):

$ cd $(mktemp --directory)
$ npm i parse5-sax-parser
[... npm output installing [email protected] ...]
$ node << 'EOF'
const Parser = require('parse5-sax-parser')
const p = new Parser()
p.on('comment', (c) => console.log(c))
p.end('<!--test <<-->')
EOF
{ text: 'test <!', sourceCodeLocation: undefined }

Expected: test <<, not test <!.

Rationale: I find the above behaviour confusing because the HTML spec on comments does not limit how < can be used inside comments. A comment containing 2 consecutive less-than signs should be legal, but is currently unrepresentable.

Analysis:

I've just walked into the source and don't know the details, but it appears wrong to me that the tokeniser switches state from COMMENT_STATE to COMMENT_LESS_THAN_SIGN_STATE when encountering <. COMMENT_LESS_THAN_SIGN_STATE then treats < as ! and causes the weird output seen above.

Surely COMMENT_STATE represents the state where we're inside the text part of the comment? The only non-error references out of there should be to itself (parsing more content) or to COMMENT_END_DASH_STATE, right?

The text was updated successfully, but these errors were encountered:

anko · 2020-09-29T12:40:46Z

I'm interested in making a PR to make comment tokenisation spec-compliant in general, but I can't get the tests to run.

Notes I took while trying to run tests

Here's the story of me trying to do that:

After npm i, npm t fails at mocha --ui exports --reporter progress --timeout 20000 packages/**/test/*.test.js with

Error: ENOENT: no such file or directory, scandir '/home/anko/src/parse5/test/data/html5lib-tests/tree-construction'

The only reference to such a thing in package.json is the generate-feedback-tests script. That failed to resolve requires, turns out there's a typo:

diff --git a/scripts/generate-parser-feedback-test/index.js b/scripts/generate-parser-feedback-test/index.js
index 0c128d9..b793c13 100644
--- a/scripts/generate-parser-feedback-test/index.js
+++ b/scripts/generate-parser-feedback-test/index.js
@@ -6,7 +6,7 @@ const Tokenizer = require('../../packages/parse5/lib/tokenizer');
 const defaultTreeAdapter = require('../../packages/parse5/lib/tree-adapters/default');
 const { convertTokenToHtml5Lib } = require('../../test/utils/generate-tokenization-tests');
 const parseDatFile = require('../../test/utils/parse-dat-file');
-const { addSlashes } = require('../../test/test/utils/common');
+const { addSlashes } = require('../../test/utils/common');
 
 readFile = promisify(readFile);
 writeFile = promisify(writeFile);

That fixed, the script exits 0 but

(node:18502) UnhandledPromiseRejectionWarning: Error: ENOENT: no such file or directory, open 'test/data/html5lib-tests/tree-construction/*.dat'

So I'm guessing this script doesn't generate those files, but consumes them.

find -name '*.dat' sees—

./test/data/tree-construction-regression/gh40_form_in_template.dat
./test/data/tree-construction-scripting/document_write.dat

They're at such a different path that it can't be a typo.

I don't know where else to look for missing mystery test data files.

At this point I gave up.

Advice appreciated.

inikulin · 2020-09-29T18:03:00Z

@anko it's indeed a bug, you're right we append the wrong character here: https://github.com/inikulin/parse5/blob/master/packages/parse5/lib/tokenizer/index.js#L1471 It should be <.

Regarding tests - you need to fetch html5lib git submodule by running:

git submodule update --init --recursive

anko · 2020-09-30T12:34:11Z

I see; tests are in a separate repo. Looks like they cover most of the comment spec stuff too. I'd feel dumb PRing for a 1-char diff, and it doesn't feel worthwhile even writing a regression test when the wrongness is unsubtle.

Poke it right when you have time I guess.

inikulin · 2020-09-30T17:03:46Z

I'd feel dumb PRing for a 1-char diff, and it doesn't feel worthwhile even writing a regression test when the wrongness is unsubtle

Don't see anything dumb about it. And it's always worth writing a regression test.

Inside comments two consecutive less-than characters (`<<`) parsed wrongly as `<!`, due to what was probably a typo. This fixes that. Added regression test. Fixes inikulin#325.

Inside comments two consecutive less-than characters (`<<`) parsed wrongly as `<!`, due to what was probably a typo. This fixes that. Added regression test. Fixes #325.

See inikulin/parse5#325

Inside comments two consecutive less-than characters (`<<`) parsed wrongly as `<!`, due to what was probably a typo. This fixes that. Added regression test. Fixes inikulin#325.

anko mentioned this issue Sep 29, 2020

<< inside comments turns into <!, making heredocs unusable inside !program commands anko/txm#4

Closed

anko mentioned this issue Oct 1, 2020

Make << in comment text parse correctly #326

Merged

inikulin closed this as completed in #326 Oct 1, 2020

inikulin pushed a commit that referenced this issue Oct 1, 2020

Make << in comment text parse correctly (#326)

9c683e1

Inside comments two consecutive less-than characters (`<<`) parsed wrongly as `<!`, due to what was probably a typo. This fixes that. Added regression test. Fixes #325.

fb55 added a commit to fb55/html5lib-tests that referenced this issue Feb 28, 2022

Add test for << in comment

e271ed2

See inikulin/parse5#325

fb55 mentioned this issue Mar 2, 2022

Add tests for form in template, << in comment html5lib/html5lib-tests#145

Merged

Ms2ger pushed a commit to html5lib/html5lib-tests that referenced this issue Mar 11, 2022

Add test for << in comment

538a6cd

See inikulin/parse5#325

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: Inside comments, `<<` is parsed as `<!` #325

Bug: Inside comments, `<<` is parsed as `<!` #325

anko commented Sep 29, 2020

anko commented Sep 29, 2020 •

edited

Loading

inikulin commented Sep 29, 2020

anko commented Sep 30, 2020

inikulin commented Sep 30, 2020

Bug: Inside comments, << is parsed as <! #325

Bug: Inside comments, << is parsed as <! #325

Comments

anko commented Sep 29, 2020

anko commented Sep 29, 2020 • edited Loading

inikulin commented Sep 29, 2020

anko commented Sep 30, 2020

inikulin commented Sep 30, 2020

Bug: Inside comments, `<<` is parsed as `<!` #325

Bug: Inside comments, `<<` is parsed as `<!` #325

anko commented Sep 29, 2020 •

edited

Loading