Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML API: Expose raw tag markup to support existing filters #5143

Draft
wants to merge 4 commits into
base: trunk
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 35 additions & 0 deletions src/wp-includes/html-api/class-wp-html-tag-processor.php
Original file line number Diff line number Diff line change
Expand Up @@ -711,6 +711,41 @@ public function release_bookmark( $name ) {
}


/**
* Allow internal WordPress code to extract raw HTML snippets associated with a matched tag.
*
* This is meant for internal use to support existing filters that pass segments of input HTML
* for further analysis. Filters which rely on these values should be updated to interact with
* the structural methods this class provides.
*
* @access private
*
* @since 6.4.0
*
* @param string $internal_use_only Required to accept liability for exposing system internals through this function.
* @param string $content What portion of the raw HTML markup to return: 'entire-tag' or 'only-attributes'.
* @return string|null Raw HTML markup from input document for matched tag, if matched, else NULL.
*/
public function _wp_internal_extract_raw_token_markup( $internal_use_only, $contents = 'entire-tag' ) {
$disclaimer = <<<INTERNAL_ONLY
I understand that this is only for internal WordPress usage and something
will likely break if used elsewhere. This function comes with no warranty.
INTERNAL_ONLY;

if ( $internal_use_only !== $disclaimer || null === $this->tag_name_starts_at ) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it only necessary to check for tag_name_starts_at? If that property is set, does it follow that tag_ends_at is set too? If not, we could add the check here, but if it does, adding a clarifying (assertion) comment would improve code readability.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's correct, and as internal values this isn't exposed. it's an idiom used throughout the Tag Processor. I've thought about different means to check what we're doing, but I don't think adding the same comment to twenty different places within the class would be that useful.

I do agree this is a bit indirect; more or less though I'm waiting until a much clearer and better solution appears.

Comment on lines +730 to +735
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting approach. I haven't seen how we do things like this in WP - is this the first function of such nature?

Another potential idea is to prefix such functions with unsafe or something, like React does it :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gutenberg uses this in some places. we considered unsafe_ prefixes but I don't like those because of how normative they can become and how they can unintentionally lesson the impact of seeing unsafe_ in the code.

this is a method I think I would prefer we never merge, and instead of supporting these old filters we move to new ones entirely. if we do put these in I want to discourage people as much as possible to call them separately, but we need to make the methods public for Core's internal consumption.

return null;
}
Comment on lines +730 to +737
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following the meaning of your emoji.

The point I'm trying to make here is that I really want the API to discourage use of this function. It's dangerous because it carries the legacy of interacting with HTML as a string that led to all the issues that brought us here.

In fact the only reason I'm considering this is a purely pragmatic backwards-compatibility issue. Core (and Jetpack and many other popular plugins) expose full or near-full HTML syntax through filters so that extending code can do further processing. For example, a full link HTML is passed into wp_targeted_link_rel and some plugins examine the URL to decide what additional rel values to add.

We could reconstruct normative HTML for these filters, but that would involve iterating over the attributes to create strings when we have one already. Maybe it's a worthwhile tradeoff, but I think most of those filters are unused and so if we can pass along a substring of an existing string and not touch it, that will have far less impact.

I'd prefer we find ways to update existing code to work structurally instead stringily. We don't need this function today; it's merely an aid to refactors that need the HTML API. If we can deprecate all those filters and leave time to update, then we can avoid this entire thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the added context and explanation, much appreciated!


The emoji meant something to the effect, "Wowzers, this one blows the one we use for the constructor out of the water. I guess that's the only precedent we have, but it's probably warranted."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I did realize that I can't make a class const private, so the one in the constructor could be easily foiled by calling WP_HTML_Processor::CONSTRUCTOR_UNLOCK_CODE.

In that case I'm less worried anyway because it's there to alert someone why something broke. In this case it's to try and put pressure against dangerous programming habits that will work if attempted, at least to a point.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation, @dmsnell. Could you provide an example snippet used in Jetpack (or other popular plugin), and how that code should look like after this patch? Just to get a better sense of the changes introduced here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bor0 there is no code change required by this PR for other plugins and code authors. the point is to preserve backwards compatibility. however, here is one of many examples of how existing filters rely on the raw markup.

// formatting.php
$rel = apply_filters( 'wp_targeted_link_rel', 'noopener', $link_html );

here, the $link_html is supposed to represent the full LINK element's HTML, such as <link rel="stylesheet" href="some/url">

lots of existing code then parses this with a regex or string-find, and breaks down in the same cases that normal HTML parsing fails: boolean attribute values, different quoting than expected, duplicate or differently-ordered attributes, escaping issues, etc…

ideally we would move Core towards a future where instead of passing around these strings we could pass around a kind of read-only or otherwise limited HTML Tag Processor, so that filters could reliably read attribute values, but changing the filter would mean breaking those existing plugins


switch ( $contents ) {
case 'entire-tag':
return substr( $this->html, $this->tag_name_starts_at - 1, $this->tag_ends_at - $this->tag_name_starts_at + 2 );

case 'only-attributes':
return substr( $this->html, $this->tag_name_starts_at + $this->tag_name_length + 1, $this->tag_ends_at - $this->tag_name_starts_at - $this->tag_name_length - 1 );
}
}


/**
* Skips contents of title and textarea tags.
*
Expand Down
104 changes: 104 additions & 0 deletions tests/phpunit/tests/html-api/wpHtmlTagProcessor-internals.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
<?php
/**
* Unit tests covering WP_HTML_Tag_Processor internal helpers.
*
* @package WordPress
* @subpackage HTML-API
*/

/**
* @group html-api
*
* @coversDefaultClass WP_HTML_Tag_Processor
*/
class Tests_HtmlApi_WpHtmlTagProcessor_Internals extends WP_UnitTestCase {
/**
* @ticket 59291
*
* @dataProvider data_html_with_entire_tag_raw_markup
*
* @param string $html_with_target_element HTML with a tag containing "target" attribute.
* @param string $expected_raw_markup Expected full raw markup for targeted tag.
*/
public function test_returns_raw_html_markup_for_entire_tag( $html_with_target_element, $expected_raw_markup ) {
$processor = new WP_HTML_Tag_Processor( $html_with_target_element );
while ( $processor->next_tag() && null === $processor->get_attribute( 'target' ) ) {
continue;
}

$this->assertSame(
$expected_raw_markup,
$processor->_wp_internal_extract_raw_token_markup(
<<<INTERNAL_ONLY
I understand that this is only for internal WordPress usage and something
will likely break if used elsewhere. This function comes with no warranty.
INTERNAL_ONLY
,
'entire-tag'
),
'Failed to exactly extract raw HTML for matched tag.'
);
}

/**
* Data provider.
*
* @return array[].
*/
public function data_html_with_entire_tag_raw_markup() {
return array(
'Tag at start of string' => array( '<img target src="atat.png">', '<img target src="atat.png">' ),
'Tag in middle of string' => array( '<a href="#"><img target src="atat.png"></a>', '<img target src="atat.png">' ),
'Tag at end of string' => array( '<a href="#"><img src="atat.png"><div id="5" target class="dingy">', '<div id="5" target class="dingy">' ),
'Tab after tag name' => array( "<img\ttarget>", "<img\ttarget>" ),
'Space after attributes' => array( '<img target >', '<img target >' ),
'Unquoted attribute' => array( '<span id = 5 class=sunshine target>', '<span id = 5 class=sunshine target>' ),
'Value with character references' => array( '<a href="#" title="This is &gt; That" target>', '<a href="#" title="This is &gt; That" target>' ),
);
}

/**
* @ticket 59291
*
* @dataProvider data_html_with_only_attributes_raw_markup
*
* @param string $html_with_target_element HTML with a tag containing "target" attribute.
* @param string $expected_raw_markup Expected raw markup for targeted tag containing only the attributes.
*/
public function test_returns_raw_html_markup_for_only_attributes( $html_with_target_element, $expected_raw_markup ) {
$processor = new WP_HTML_Tag_Processor( $html_with_target_element );
while ( $processor->next_tag() && null === $processor->get_attribute( 'target' ) ) {
continue;
}

$this->assertSame(
$expected_raw_markup,
$processor->_wp_internal_extract_raw_token_markup(
<<<INTERNAL_ONLY
I understand that this is only for internal WordPress usage and something
will likely break if used elsewhere. This function comes with no warranty.
INTERNAL_ONLY
,
'only-attributes'
),
'Failed to exactly extract raw HTML for matched tag.'
);
}

/**
* Data provider.
*
* @return array[].
*/
public function data_html_with_only_attributes_raw_markup() {
return array(
'Tag at start of string' => array( '<img target src="atat.png">', 'target src="atat.png"' ),
'Tag in middle of string' => array( '<a href="#"><img target src="atat.png"></a>', 'target src="atat.png"' ),
'Tag at end of string' => array( '<a href="#"><img src="atat.png"><div id="5" target class="dingy">', 'id="5" target class="dingy"' ),
'Tab after tag name' => array( "<img\ttarget>", 'target' ),
'Space after attributes' => array( '<img target >', 'target ' ),
'Unquoted attribute' => array( '<span id = 5 class=sunshine target>', 'id = 5 class=sunshine target' ),
'Value with character references' => array( '<a href="#" title="This is &gt; That" target>', 'href="#" title="This is &gt; That" target' ),
);
}
}