Add method to get inner/outer HTML in WP_HTML_Tag_Processor #60046
Comments
Weirdly enough, the Interactivity API now has a class with methods that is built on top of the Tag Processor that has the description, seems they use it to extract the HTML. I have not looked deeply into this, but when I saw it, I thought: why not build a general-purpose method right into the Tag Processor? There might be reasons, and I think there is a plan for bringing more functionality into the HTML API.
@dmsnell, can you provide the technical feedback?
Thanks for the inquiry @bfintal. If you follow the broad roadmap for the HTML API, you will note that functions like The Interactivity API is a kind of test-bed for this work, even though hopefully in the 6.6 release cycle the custom parser will be replaced with the HTML Processor.

"Balanced" is a common idea for matching tag content. The idea is that if we assume that an HTML document always has an opening and closing tag for each element, then we can parse with a simple stack. This works reasonably well in practice, but still fails in a number of common edge cases. For example, among the web's highest-ranked pages, many closing The HTML Processor incorporates the rules in the HTML5 specification so that nobody will need to worry about when an element is opened and closed. The funny thing is that its logic ends up being much simpler than all the over-simplified attempts:

    while ( $processor->next_token() && $processor->still_open( $opening_tag ) ) {
        continue;
    }

This aside, there still remain open questions about how to represent inner and outer HTML relating to escaping, decoding, and composition. I encourage people to explore the existing interfaces and to share feedback in #core-html-api, but please be warned against building structural parsers for production: it's almost impossible to know what is and isn't inner HTML without implementing the semantic rules of HTML5.
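To make the "balanced" assumption above concrete, here is a minimal sketch in plain PHP. It is not WordPress code and not part of the HTML API; the function name `naive_balanced_inner_html` is hypothetical and exists only for illustration. It counts opening and closing tags of one name, which is the stack-style approach described above, and it shows one way that approach breaks: in HTML5 a `<p>` start tag implicitly closes a `P` element that is still open, so real documents are often not "balanced".

```php
<?php
/**
 * Hypothetical helper (illustration only, not a WordPress API):
 * returns the text between the first opening tag named $tag and
 * its matching closing tag, assuming every element is explicitly
 * closed. Returns null when the tags do not balance.
 */
function naive_balanced_inner_html( string $html, string $tag ): ?string {
	// Match opening and closing tags with the given name.
	$pattern = '~<(/?)' . preg_quote( $tag, '~' ) . '\b[^>]*>~i';
	if ( ! preg_match_all( $pattern, $html, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER ) ) {
		return null;
	}

	$depth = 0;
	$start = null;
	foreach ( $matches as $match ) {
		list( $text, $offset ) = $match[0];
		$is_closer             = '/' === $match[1][0];

		if ( ! $is_closer ) {
			// Remember where the content of the outermost element begins.
			if ( 0 === $depth ) {
				$start = $offset + strlen( $text );
			}
			++$depth;
		} elseif ( $depth > 0 && 0 === --$depth ) {
			// The outermost element has been closed.
			return substr( $html, $start, $offset - $start );
		}
	}

	// Ran out of tags before the count returned to zero.
	return null;
}
```

With explicitly closed markup such as `<div>a<div>b</div>c</div>` the sketch returns `a<div>b</div>c`. With `<p>one<p>two</p>`, where HTML5 parsing rules give the first paragraph the content `one`, the counter never returns to zero and the function returns null. The sketch also ignores comments, attribute values containing `>`, and raw-text elements such as SCRIPT and STYLE, which is the kind of gap the warning above is about.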
Good news! In WordPress 6.5 this is even easier, because the introduction of the

    while ( $processor->next_tag( 'STYLE' ) ) {
        $contents = $processor->get_modifiable_text();
        analyze_style( $contents );
    }

Unfortunately there's no support yet for modifying the modifiable text. If you want to do that, come join us in Slack and we can discuss how to do it, or link to a PR in your project and I'd be happy to review.

I'm going to close this issue because: we already plan on adding inner/outer HTML support, but not yet; and HTML API development is tracked in the linked discussion and on Core Trac. Feel free to continue responding.
@dmsnell I am coming back to this and almost a year has passed. Has anything changed as to getting tasks done as simple as getting the content of an HTML tag? Why (TF?) is I read that there are edge cases, but what if I do not care about them? It still seems weird to me that this is not a general-purpose class. Can the edge cases not just be detected and errors put out? Or prevent functions from running with edge case tags or something? I just figured out how I can trick
This is how I tricked WP to do this without any new code.

    public function set_description_link(): void {
        $html = 'hello <a href="#">world</a>!';
        $html = strtr(
            (string) $html,
            array(
                '<a'   => '<template',
                '</a>' => '</template>',
            )
        );
        $p = new WP_Interactivity_API_Directives_Processor( $html );
        if ( $p->next_tag( 'template' ) ) {
            $this->descriptionlink     = $p->get_attribute( 'href' );
            $this->descriptionlinktext = $p->get_content_between_balanced_template_tags();
            d( $html, $this->descriptionlink, $this->descriptionlinktext );
        }
    }

Am I supposed to do something as simple as this still with regex or what? Seems arbitrary.
thanks for sharing your feelings on this @nextgenthemes
I wish getting the content of an HTML tag were simple 🙂 there is a method available in WordPress today that you can use without changing any code to work with inner and outer HTML in a way that is at least as safe as PHP 8.4’s A few months ago I tossed out some code to do this in an HTML_Serialization_Builder class which extends the
This would be a good question for those who build the Interactivity API Directives Processor. I can only guess at the reason.
The values of the HTML API have been shared in places like this Progress Report
Poor parsing of HTML is a solved problem and we’ve been dealing with the consequences of this since before HTML was born (because people were trying to parse SGML naively with regular expressions as well). Many of the trust issues with WordPress stem from hasty parsing. The HTML API is specifically designed to be reliable and safe, which means that if it can’t provide those things it will refuse to continue giving the impression that it will.
This is correct, and one of the reasons the HTML API development is able to move at a pace which supports its goals (because existing solutions, buggy as they may be, exist and have been common practice for decades). Repeating the methods which have been around forever is not making things worse — just continuing with the status quo. There’s also a lot more in the HTML API’s public API than it might seem at first glance. For instance, the HTML Builder class I wrote provides reasonably safe updates to inner and outer content without modifying any Core classes. The important thing here is that reducing safety is a choice. The goal for the HTML API is to provide a system that is reliable and safe by default, and any increase in risk comes as a deliberate choice by a developer. The primary sources of risk are fully handled by the Tag Processor, so it’s already safer beyond the traditional methods of parsing HTML. If you want to apply changes directly and bypass some of the additional safeties you need not feel bad about that. The single-most important aspect for pushing out a public API is ensuring that we don’t invite and encourage corruption and security risks. This is why the HTML API development is taking the pace it is.
I can’t remember if the Interactivity API still aborts on Note, however, that your example workaround also changes With the HTML Processor it’s possible to scan until a given element is no longer open. This is a much stronger approach than the earlier idea of “balanced tags” (which is terribly fraught because it doesn’t match how HTML works).

    while ( $processor->next_token() ) {
        // Skip TEMPLATE elements.
        if ( 'TEMPLATE' === $processor->get_tag() ) {
            $depth_at_opening = $processor->get_current_depth();
            while ( $processor->next_token() && $processor->get_current_depth() > $depth_at_opening ) {
                continue;
            }
        }
        …
    }
What problem does this address?
With the WP_HTML_Tag_Processor, you can get an attribute, the tag name, but there is no way to get the innerHTML and outerHTML. The class is great for traversing HTML and it would be great if it can be used as an alternative to regex for grabbing html content.
Scenario: right now I'm using the render_block to grab some contents of some <style>...</style> tags via regex.
What is your proposed solution?
Add a method get_inner_html and get_outer_html that would return the inner and outer html where the current "pointer" is at.
If added, I should now be able to do: