Wrong text mode for title element? #40
Comments
I'm starting to read the spec to get some background. For reference, the spec section on title is at http://www.w3.org/TR/html51/document-metadata.html#the-title-element. The section of the spec on parsing the head says title should be parsed as RCDATA. Following along, RCDATA should consume character references properly, so TEXT_RCDATA should handle them and the current value of 5 should work. @KitaitiMakoto, did you track whether this is happening in the parser or in the writer/serializer?
Did a little digging and the issue is in the parser.
Does it make a difference if the HTML is well-formed? With malformed HTML, it might have been kicked into quirks mode.
With a composer file that includes html5-php and querypath, try the following code:
<?php
require_once 'vendor/autoload.php';
$html = <<<EOH
<!doctype html>
<html>
<head>
<title>'</title>
</head>
<body>
<p>'</p>
</body>
</html>
EOH;
$dom = \HTML5::loadHTML($html);
echo \HTML5::saveHTML($dom);
$qp = htmlqp($dom);
print_r($qp->find('title')->get(0));
You'll get the output:
Validation says the HTML is valid HTML5. I haven't had a chance to dig further than this yet.
Here's the reason: TEXT_RCDATA means that the text inside the tag should be parsed as "raw character data" with no entity decoding. I haven't tested it, but if we change that from 5 to 1, title should decode entities. Wanna give that a try, @KitaitiMakoto? Please let us know if it works.
@mattfarina just showed me that I'm wrong. There's a bug in the parser: it does not handle entities in RCDATA fields correctly. See http://www.w3.org/TR/html51/syntax.html#rcdata-state. So I need to fix that in the tokenizer.
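To make the difference between the two text modes concrete, here is a minimal sketch, not the library's tokenizer. The function names `readRawText` and `readRcdata` are hypothetical, and PHP's built-in `html_entity_decode()` stands in for the spec's character reference handling, which it only approximates.

```php
<?php
// Hypothetical helpers for illustration only; they are not part of html5-php.

/** RAWTEXT-style handling: content is kept verbatim, character references are not decoded. */
function readRawText(string $content): string
{
    return $content;
}

/** RCDATA-style handling: markup is not recognized, but character references are decoded. */
function readRcdata(string $content): string
{
    // html_entity_decode() is a stand-in for the tokenizer's character reference step.
    return html_entity_decode($content, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$titleContent = '&apos;Tom &amp; Jerry&apos; <b>';

echo readRawText($titleContent), PHP_EOL; // &apos;Tom &amp; Jerry&apos; <b>
echo readRcdata($titleContent), PHP_EOL;  // 'Tom & Jerry' <b>
```

In both modes `<b>` stays as literal text; only the RCDATA variant turns the references into characters, which is the behavior expected for title.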
Sorry for my late reply. I'm still busy at the moment, but I will try it and reply on the weekend.
Thank you for digging into the code and pointing to the specification.
Hello,
Thanks for the nice library. I installed and tried this HTML5 lib.
I found that entity references in the title element are handled incorrectly, like this:
In the example above, the entity reference should be decoded to ' (an apostrophe), but it isn't.
If I set the text mode for the title element to 81, the entity reference is decoded properly:
I intended to send a pull request, but I couldn't because I didn't know why \HTML5\Elements::$html5['title'] was set to 5.
Could you look into this?
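Text mode values such as 5 and 81 read like bit flags combined into one integer. The sketch below shows how such a value can be broken down. The flag names and values are assumptions for illustration and should be checked against the library's `Elements` class; `describeMode` is a hypothetical helper.

```php
<?php
/**
 * Lists the flags set in a text-mode bitmask.
 * The flag table is an assumption for illustration, not confirmed by this thread.
 */
function describeMode(int $mode): array
{
    $flags = [
        'KNOWN_ELEMENT'  => 1,
        'TEXT_RAW'       => 2,
        'TEXT_RCDATA'    => 4,
        'VOID_TAG'       => 8,
        'AUTOCLOSE_P'    => 16,
        'TEXT_PLAINTEXT' => 32,
        'BLOCK_TAG'      => 64,
    ];

    $set = [];
    foreach ($flags as $name => $bit) {
        // A flag is set when its bit survives the bitwise AND with the mode.
        if (($mode & $bit) === $bit) {
            $set[] = $name;
        }
    }
    return $set;
}

echo implode(' | ', describeMode(5)), PHP_EOL;  // KNOWN_ELEMENT | TEXT_RCDATA
echo implode(' | ', describeMode(81)), PHP_EOL; // KNOWN_ELEMENT | AUTOCLOSE_P | BLOCK_TAG
```

If these assumed values hold, 5 marks title as RCDATA, while 81 carries no text-mode flag at all. That would make the tokenizer treat the content as ordinary text, which could explain why the entity happened to be decoded with 81.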