Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Inside comments, << is parsed as <! #325

Closed
anko opened this issue Sep 29, 2020 · 4 comments · Fixed by #326
Closed

Bug: Inside comments, << is parsed as <! #325

anko opened this issue Sep 29, 2020 · 4 comments · Fixed by #326

Comments

@anko
Copy link
Contributor

anko commented Sep 29, 2020

Module: [email protected]

Repro steps (Linux):

$ cd $(mktemp --directory)
$ npm i parse5-sax-parser
[... npm output installing [email protected] ...]
$ node << 'EOF'
const Parser = require('parse5-sax-parser')
const p = new Parser()
p.on('comment', (c) => console.log(c))
p.end('<!--test <<-->')
EOF
{ text: 'test <!', sourceCodeLocation: undefined }

Expected: test <<, not test <!.

Rationale: I find the above behaviour confusing because the HTML spec on comments does not limit how < can be used inside comments. A comment containing 2 consecutive less-than signs should be legal, but is currently unrepresentable.

Analysis:

I've just walked into the source and don't know the details, but it appears wrong to me that the tokeniser switches state from COMMENT_STATE to COMMENT_LESS_THAN_SIGN_STATE when encountering <. COMMENT_LESS_THAN_SIGN_STATE then treats < as ! and causes the weird output seen above.

Surely COMMENT_STATE represents the state where we're inside the text part of the comment? The only non-error references out of there should be to itself (parsing more content) or to COMMENT_END_DASH_STATE, right?

@anko
Copy link
Contributor Author

anko commented Sep 29, 2020

I'm interested in making a PR to make comment tokenisation spec-compliant in general, but I can't get the tests to run.

Notes I took while trying to run tests

Here's the story of me trying to do that:

After npm i, npm t fails at mocha --ui exports --reporter progress --timeout 20000 packages/**/test/*.test.js with

Error: ENOENT: no such file or directory, scandir '/home/anko/src/parse5/test/data/html5lib-tests/tree-construction'

The only reference to such a thing in package.json is the generate-feedback-tests script. That failed to resolve requires, turns out there's a typo:

diff --git a/scripts/generate-parser-feedback-test/index.js b/scripts/generate-parser-feedback-test/index.js
index 0c128d9..b793c13 100644
--- a/scripts/generate-parser-feedback-test/index.js
+++ b/scripts/generate-parser-feedback-test/index.js
@@ -6,7 +6,7 @@ const Tokenizer = require('../../packages/parse5/lib/tokenizer');
 const defaultTreeAdapter = require('../../packages/parse5/lib/tree-adapters/default');
 const { convertTokenToHtml5Lib } = require('../../test/utils/generate-tokenization-tests');
 const parseDatFile = require('../../test/utils/parse-dat-file');
-const { addSlashes } = require('../../test/test/utils/common');
+const { addSlashes } = require('../../test/utils/common');
 
 readFile = promisify(readFile);
 writeFile = promisify(writeFile);

That fixed, the script exits 0 but

(node:18502) UnhandledPromiseRejectionWarning: Error: ENOENT: no such file or directory, open 'test/data/html5lib-tests/tree-construction/*.dat'

So I'm guessing this script doesn't generate those files, but consumes them.

find -name '*.dat' sees—

./test/data/tree-construction-regression/gh40_form_in_template.dat
./test/data/tree-construction-scripting/document_write.dat

They're at such a different path that it can't be a typo.

I don't know where else to look for missing mystery test data files.

At this point I gave up.

Advice appreciated.

@inikulin
Copy link
Owner

@anko it's indeed a bug, you're right we append the wrong character here: https://github.com/inikulin/parse5/blob/master/packages/parse5/lib/tokenizer/index.js#L1471 It should be <.

Regarding tests - you need to fetch html5lib git submodule by running:

git submodule update --init --recursive  

@anko
Copy link
Contributor Author

anko commented Sep 30, 2020

I see; tests are in a separate repo. Looks like they cover most of the comment spec stuff too. I'd feel dumb PRing for a 1-char diff, and it doesn't feel worthwhile even writing a regression test when the wrongness is unsubtle.

Poke it right when you have time I guess.

@inikulin
Copy link
Owner

I'd feel dumb PRing for a 1-char diff, and it doesn't feel worthwhile even writing a regression test when the wrongness is unsubtle

Don't see anything dumb about it. And it's always worth writing a regression test.

anko added a commit to anko/parse5 that referenced this issue Oct 1, 2020
Inside comments two consecutive less-than characters (`<<`) parsed
wrongly as `<!`, due to what was probably a typo.  This fixes that.

Added regression test.

Fixes inikulin#325.
inikulin pushed a commit that referenced this issue Oct 1, 2020
Inside comments two consecutive less-than characters (`<<`) parsed
wrongly as `<!`, due to what was probably a typo.  This fixes that.

Added regression test.

Fixes #325.
fb55 added a commit to fb55/html5lib-tests that referenced this issue Feb 28, 2022
Ms2ger pushed a commit to html5lib/html5lib-tests that referenced this issue Mar 11, 2022
jmbpwtw pushed a commit to jmbpwtw/parse5 that referenced this issue Feb 16, 2023
Inside comments two consecutive less-than characters (`<<`) parsed
wrongly as `<!`, due to what was probably a typo.  This fixes that.

Added regression test.

Fixes inikulin#325.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants