Add method to get inner/outer HTML in WP_HTML_Tag_Processor #60046

bfintal · 2024-03-20T15:49:09Z

What problem does this address?

With the WP_HTML_Tag_Processor, you can get an attribute or the tag name, but there is no way to get the innerHTML or outerHTML. The class is great for traversing HTML, and it would be great if it could be used as an alternative to regex for grabbing HTML content.

Scenario: right now I'm using the render_block to grab some contents of some <style>...</style> tags via regex.

What is your proposed solution?

Add get_inner_html and get_outer_html methods that would return the inner and outer HTML of the element where the current "pointer" is.

If added, I should now be able to do:

function render_block( $html ) {
    $processor = new WP_HTML_Tag_Processor( $html );
    if ( $processor->next_tag( 'style' ) ) {
        $style = $processor->get_inner_html();
        // Do something with $style, not necessarily updating it
    }
    return $html;
}


nextgenthemes · 2024-03-20T21:57:01Z

Weirdly enough, the Interactivity API now has a class built on top of the Tag Processor, with a method whose description suggests they use it to extract the HTML. I have not looked deeply into this, but when I saw it, I thought: why not build a general-purpose method right into the Tag Processor? There might be reasons, and I think there is a plan for bringing more functionality into the HTML API.

get_content_between_balanced_template_tags

gziolo · 2024-03-26T06:44:11Z

@dmsnell, can you provide the technical feedback?

dmsnell · 2024-03-26T11:24:41Z

Thanks for the inquiry @bfintal.

If you follow the broad roadmap for the HTML API, you will note that functions like inner_html are in the plans, but we're not entirely ready for those as we don't know what interface they need, exactly.

The Interactivity API is a kind of test-bed for this work, even though hopefully in the 6.6 release cycle the custom parser will be replaced with the HTML Processor.

"Balanced" is a common idea for matching tag content. The idea is that if we assume that an HTML document always has an opening and closing tag for each element, then we can parse with a simple stack. This works reasonably well in practice, but still fails in a number of common edge cases. For example, among the web's highest-ranked pages, many closing </p> tags also implicitly close opened formatting tags like <b> and <em>. The balanced method doesn't work here.
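To make the failure concrete, here is a minimal sketch of a stack-based "balanced" check in plain PHP. The function name naive_is_balanced and the sample inputs are hypothetical, invented for illustration; this is not code from WordPress or the Interactivity API, only a deliberately naive version of the idea described above.

```php
<?php
/**
 * Hypothetical helper: push on every opening tag, pop on every closing tag,
 * and require the names to match. It ignores void elements, comments,
 * SCRIPT/STYLE contents, and every other HTML5 parsing rule.
 */
function naive_is_balanced( string $html ): bool {
	$stack = array();
	preg_match_all( '~<(/?)([a-z][a-z0-9]*)[^>]*>~i', $html, $matches, PREG_SET_ORDER );

	foreach ( $matches as $m ) {
		$name = strtolower( $m[2] );
		if ( '' === $m[1] ) {
			// Opening tag: remember it.
			$stack[] = $name;
		} elseif ( array_pop( $stack ) !== $name ) {
			// Closing tag that does not match the most recent opener.
			return false;
		}
	}

	return 0 === count( $stack );
}

$well_formed = '<p><b>bold</b></p>';
$implicit    = '<p><b>bold</p>'; // </p> arrives while <b> is still open.

var_dump( naive_is_balanced( $well_formed ) );
var_dump( naive_is_balanced( $implicit ) );
```

The second input is rejected by the stack even though a browser still ends the paragraph at </p>, so a balanced-tag parser has to either give up on this markup or guess.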

The HTML Processor incorporates the rules in the HTML5 specification so that nobody will need to worry about when an element is opened and closed. The funny thing is that its logic ends up being much simpler than all the over-simplified attempts:

while ( $processor->next_token() && $processor->still_open( $opening_tag ) ) {
	continue;
}

This aside, there still remain open questions about how to represent inner and outer HTML relating to escaping, decoding, and composition. I encourage people to explore the existing interfaces and to share feedback in #core-html-api, but please be warned against building structural parsers for production: it's almost impossible to know what is and isn't inner HTML without implementing the semantic rules of HTML5.

Scenario: right now I'm using the render_block to grab some contents of some <style>...</style> tags via regex.

Good news! In WordPress 6.5 this is even easier, because the introduction of the $processor->next_token() function makes it easier and safer to read the contents of a SCRIPT element. Both SCRIPT and STYLE (and TITLE and TEXTAREA) are special elements in that they only contain plaintext; they cannot contain markup. That means if you find <img> inside of them, a browser would treat that as the text <img> and display it as text, not as a tag. In order to guard against accidentally treating those contents as HTML, the Tag Processor exposes $processor->get_modifiable_text() and properly decodes the contents (because some are supposed to decode HTML character references like &colon; while others aren't supposed to).

while ( $processor->next_tag( 'STYLE' ) ) {
	$contents = $processor->get_modifiable_text();
	analyze_style( $contents );
}

Unfortunately there's no support yet for modifying the modifiable text. If you want to do that, come join us in Slack and we can discuss how to do it, or link to a PR in your project and I'd be happy to review.

I'm going to close this issue because: we already plan on adding inner/outer HTML support, but not yet; and HTML API development is tracking in the linked discussion and on Core Trac. Feel free to continue responding.

nextgenthemes · 2025-02-09T11:29:59Z

@dmsnell I am coming back to this and almost a year has passed. Has anything changed as to getting tasks done as simple as getting the content of an HTML tag?

Why (TF?) is WP_Interactivity_API_Directives_Processor final?

I read that there are edge cases, but what if I do not care about them? It still seems weird to me that this is not a general-purpose class. Can the edge cases not just be detected and errors put out? Or prevent functions from running with edge-case tags or something?

I just figured out how I can trick WP_Interactivity_API_Directives_Processor into doing what I want just by renaming the tag to template. This seems like the perfect case for extending the class with my own get_content_between_balanced_template_tags that allows this for all tags. But final locks me out from doing this. Why are you so hard preventing this? I can of course just copy and paste the entire class and remove these 3 lines to get this done ...

		if ( 'TEMPLATE' !== $this->get_tag() ) {
			return null;
		}

This is how I tricked WP to do this without any new code.

	public function set_description_link(): void {

		$html = 'hello <a href="#">world</a>!';

		$html = strtr(
			(string) $html,
			array(
				'<a'   => '<template',
				'</a>' => '</template>',
			)
		);

		$p = new WP_Interactivity_API_Directives_Processor( $html );

		if ( $p->next_tag( 'template' ) ) {
			$this->descriptionlink     = $p->get_attribute( 'href' );
			$this->descriptionlinktext = $p->get_content_between_balanced_template_tags();

			d( $html, $this->descriptionlink, $this->descriptionlinktext );
		}
	}

Am I supposed to do something as simple as this still with regex or what? Seems arbitrary.

dmsnell · 2025-02-09T14:00:45Z

thanks for sharing your feelings on this @nextgenthemes

Has anything changed as to getting tasks done as simple as getting the content of an HTML tag?

I wish getting the content of an HTML tag were simple 🙂

there is a method available in WordPress today that you can use without changing any code to work with inner and outer HTML in a way that is at least as safe as PHP 8.4’s Dom\HTMLDocument and much safer than the always-broken DOMDocument.

A few months ago I tossed out some code to do this in an HTML_Serialization_Builder class which extends the WP_HTML_Processor. Be warned, there are bugs in that — I scribbled it over the course of an hour while chatting with folks, but the bugs should be obvious and fixable. This approach has some limitations, but gets the job done.

Why (TF?) is WP_Interactivity_API_Directives_Processor final?

This would be a good question for those who build the Interactivity API Directives Processor. I can only guess at the reason.

Why are you so hard preventing this?

The values of the HTML API have been shared in places like this Progress Report

While this…feels like a slow start in building out the API, it also means that you can trust it from day one to do what it claims to do, and support will only improve with time. You don’t have to worry that it will break with certain kinds of input

Poor parsing of HTML is a solved problem and we’ve been dealing with the consequences of this since before HTML was born (because people were trying to parse SGML naively with regular expressions as well). Many of the trust issues with WordPress stem from hasty parsing. The HTML API is specifically designed to be reliable and safe, which means that if it can’t provide those things it will refuse to continue giving the impression that it will.

I can of course just copy and paste the entire class and remove these 3 lines to get this done ...

This is correct, and one of the reasons the HTML API development is able to move at a pace which supports its goals (because existing solutions, buggy as they may be, exist and have been common practice for decades). Repeating the methods which have been around forever is not making things worse — just continuing with the status quo. There’s also a lot more in the HTML API’s public API than it might seem at first glance. For instance, the HTML Builder class I wrote provides reasonably safe updates to inner and outer content without modifying any Core classes.

The important thing here is that reducing safety is a choice. The goal for the HTML API is to provide a system that is reliable and safe by default, and any increase in risk comes as a deliberate choice by a developer. The primary sources of risk are fully handled by the Tag Processor, so it’s already safer beyond the traditional methods of parsing HTML. If you want to apply changes directly and bypass some of the additional safeties you need not feel bad about that.

The single-most important aspect for pushing out a public API is ensuring that we don’t invite and encourage corruption and security risks. This is why the HTML API development is taking the pace it is.

This is how I tricked WP to do this without any new code.

I can’t remember if the Interactivity API still aborts on TEMPLATE elements but the HTML Processor supports those. If it’s still the case then it’s also probably possible to continue through those and you could propose an update to the Directive Processor.

Note, however, that your example workaround also changes <area> to <templaterea>, overlooks <A HREF="…">, corrupts JavaScript which contains things like <script>if(5<anchors.length)</script> into things like <script>if(5<templatenchors.length)</script>, and trips over a host of other quite common “edge cases.”
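These cases can be reproduced with nothing but strtr() and the replacement pairs from the workaround. The three input strings below are hypothetical samples chosen to mirror the cases listed above; no WordPress code is involved in this sketch.

```php
<?php
// Same replacement pairs as the workaround earlier in this thread.
$pairs = array(
	'<a'   => '<template',
	'</a>' => '</template>',
);

// Hypothetical inputs, one per failure case.
$area   = '<area shape="rect" href="#">';
$upper  = '<A HREF="#">world</A>';
$script = '<script>if(5<anchors.length)</script>';

// strtr() is a plain, case-sensitive substring replacement: it has no notion
// of tag names, so "<a" also matches the start of "<area" and "<anchors".
$area_out   = strtr( $area, $pairs );
$upper_out  = strtr( $upper, $pairs );
$script_out = strtr( $script, $pairs );

echo $area_out, "\n";   // <templaterea shape="rect" href="#">
echo $upper_out, "\n";  // <A HREF="#">world</A> (unchanged, so the link is missed)
echo $script_out, "\n"; // <script>if(5<templatenchors.length)</script>
```

Only the uppercase link passes through untouched, and that is the one element the workaround was meant to find.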

With the HTML Processor it’s possible to scan until a given element is no longer open. This is a much stronger approach than the earlier idea of “balanced tags” (which is terribly fraught because it doesn’t match how HTML works).

while ( $processor->next_token() ) {
	// Skip TEMPLATE elements.
	if ( 'TEMPLATE' === $processor->get_tag() ) {
		$depth_at_opening = $processor->get_current_depth();
		while ( $processor->next_token() && $processor->get_current_depth() > $depth_at_opening ) {
			continue;
		}
	}

	…
}

bfintal added the [Type] Enhancement A suggestion for improvement. label Mar 20, 2024

jordesign added the [Feature] Block API API that allows to express the block paradigm. label Mar 20, 2024

gziolo added the [Feature] HTML API An API for updating HTML attributes in markup label Mar 26, 2024

gziolo added the Needs Decision Needs a decision to be actionable or relevant label Mar 26, 2024

dmsnell closed this as completed Mar 26, 2024

dmsnell mentioned this issue Apr 2, 2024

HTML API: Roadmap #60397

Open

10 tasks
