
Performance: Skip iteration in block supports elements when not necessary. #5411

Conversation

dmsnell
Member

@dmsnell dmsnell commented Oct 4, 2023

Trac ticket: #59544

Previously, there were three ways in which the block supports elements code was performing computation it didn't need to:

  • It checks for every possible style attribute on a block even if there are no possible style attributes available on the block.

  • Once it determines that it needs to add a special class to a block it continues scanning through the rest of the attributes.

  • It constructs dot-separated literal key paths for the attributes and immediately and repeatedly calls explode() on them to create an array at runtime.

The combination of these three factors leads to an outsized impact of this function on the runtime and memory pressure for what it's doing. This patch removes all three of these inefficiencies:

  • If no elements style attributes exist on the block it immediately returns and skips all further processing.

  • It stops scanning block attributes once a single one is found that imposes the requirement for the extra class name.

  • It constructs array literals for the key path instead of strings. This removes the need to artificially join and explode the keys at runtime.

There should be no functional or visual changes in this patch, but it should reduce the overall runtime of page renders and require less memory while rendering. The impact will be proportional to the number of blocks rendered on a page.
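
In rough outline, the reworked function takes the following shape (a simplified sketch of the approach rather than the exact committed code; the wp_should_skip_block_supports_serialization() checks are elided):

function wp_render_elements_support( $block_content, $block ) {
	// Bail immediately when the block carries no elements style attributes at all.
	if ( ! $block_content || ! isset( $block['attrs']['style']['elements'] ) ) {
		return $block_content;
	}

	// Key paths are array literals now; no join()/explode() round trip at runtime.
	$key_paths = array(
		array( 'button', 'color', 'text' ),
		array( 'button', 'color', 'background' ),
		// … link and heading paths elided …
	);

	foreach ( $key_paths as $path ) {
		if ( null !== _wp_array_get( $block['attrs']['style']['elements'], $path, null ) ) {
			// The first match is enough: add the class name and stop scanning.
			$tags = new WP_HTML_Tag_Processor( $block_content );
			if ( $tags->next_tag() ) {
				$tags->add_class( wp_get_elements_class_name( $block ) );
			}
			return $tags->get_updated_html();
		}
	}

	return $block_content;
}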

Some probably-misleading benchmarks.

Summary

  • this shaves off a couple of milliseconds (around 5%) on render time on my laptop for the home page of twentytwentyfour
  • it reduces memory consumption by around 1.5 MB
  • on the given render it's also reducing the variability of the measured times but I don't have an explanation for why

benchmark-server-timing

  • M2 MBP, 32 GB, plugged in with a full power adapter
  • npm run research -- benchmark-server-timing -u 'http://localhost' -n 3000 -c 20 -p -o csv
  • php -S 0.0.0.0:80 and mysql both running natively, no Docker or other indirection
╔════════════════════════════════╤════════════════════╤════════════════════╤════════════════════╗
║ URL                            │ trunk              │ branch w/o extra   │ branch with extra  ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-load-alloptions-query (p10) │ 0.49               │ 0.46               │ 0.39               ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-load-alloptions-query (p25) │ 0.54               │ 0.48               │ 0.41               ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-load-alloptions-query (p50) │ 0.61               │ 0.52               │ 0.45               ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-load-alloptions-query (p75) │ 0.85               │ 0.67               │ 0.63               ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-load-alloptions-query (p90) │ 0.99               │ 0.88               │ 0.71               ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-before-template (p10)       │ 21.94              │ 22.11              │ 22.17              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-before-template (p25)       │ 22.67              │ 22.19              │ 22.42              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-before-template (p50)       │ 23.09              │ 22.61              │ 22.92              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-before-template (p75)       │ 24.05              │ 23.06              │ 23.67              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-before-template (p90)       │ 24.65              │ 24.00              │ 24.24              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-template (p10)              │ 41.03              │ 38.71              │ 38.98              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-template (p25)              │ 41.42              │ 38.83              │ 39.07              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-template (p50)              │ 41.74              │ 39.26              │ 39.82              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-template (p75)              │ 42.06              │ 39.63              │ 40.92              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-template (p90)              │ 42.89              │ 39.89              │ 41.33              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-total (p10)                 │ 62.9               │ 60.84              │ 61.56              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-total (p25)                 │ 64.45              │ 61.47              │ 62.28              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-total (p50)                 │ 65.15              │ 61.88 (-3.27 / -5%)│ 62.72              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-total (p75)                 │ 66.08              │ 62.56              │ 64.08              ║
╟────────────────────────────────┼────────────────────┼────────────────────┼────────────────────╢
║ wp-total (p90)                 │ 67.05              │ 65.21              │ 64.96              ║
╚════════════════════════════════╧════════════════════╧════════════════════╧════════════════════╝

xdebug

  • M2 MBP, 32 GB, plugged in with a full power adapter
  • php -d xdebug.mode=profile -d xdebug.use_compression=false -d xdebug.profiler_output_name=cachegrind.out.tt4-trunk -S 0.0.0.0:80
  • php -S 0.0.0.0:80 and mysql both running natively, no Docker or other indirection
[xdebug screenshots comparing runtime and memory use for trunk vs. this PR (#5411)]

Contributor

@costdev costdev left a comment

Thanks @dmsnell and nice find! I've added one nitpick and a few ideas for further reducing memory usage.

@@ -31,7 +31,7 @@ function wp_get_elements_class_name( $block ) {
* @return string Filtered block content.
*/
function wp_render_elements_support( $block_content, $block ) {
if ( ! $block_content || empty( $block['attrs'] ) ) {
if ( ! $block_content || !isset( $block['attrs']['style']['elements'] ) ) {
Contributor

Suggested change
if ( ! $block_content || !isset( $block['attrs']['style']['elements'] ) ) {
if ( ! $block_content || ! isset( $block['attrs']['style']['elements'] ) ) {

Member Author

thanks @costdev - not sure how I missed that - it was literally glowing at me 🙃


'style.elements.button.color.text',
'style.elements.button.color.background',
'style.elements.button.color.gradient',
array( 'button', 'color', 'text' ),
Contributor

@dmsnell To avoid creating the arrays unnecessarily on each call, what do you think about moving the early return on lines 84-90 up to line 39 and changing them to this?

$skip_button_color  = wp_should_skip_block_supports_serialization( $block_type, 'color', 'button' );
$skip_link_color    = wp_should_skip_block_supports_serialization( $block_type, 'color', 'link' );
$skip_heading_color = wp_should_skip_block_supports_serialization( $block_type, 'color', 'heading' );

if ( $skip_button_color && $skip_link_color && $skip_heading_color ) {
    return $block_content;
}

Then this will further reduce memory usage while having no functional or visual impact.

The $skip_* variables could be used for the 'skip' keys in the arrays below to avoid duplicate calls, but see the comment below for an alternative approach.

Contributor

We could further reduce memory usage by using the $skip_* variables to determine whether to add the property and its paths to the array at all, rather than create it and store whether to skip it later.

$element_color_properties = array();

if ( ! $skip_button_color ) {
    $element_color_properties['button'] = array(
        array( 'button', 'color', 'text' ),
        // etc.
    );
}

// same for link and heading.

Member

This is a minor optimization, but I think it makes sense to add, given this PR is purely focused on performance in the first place 👍

Member Author

thanks @costdev - I explored this route but found diminishing returns. there are more optimization opportunities involving bailing out in a hierarchical manner too, such as: if there's no style.elements.heading attribute, then don't check style.elements.heading.color.text. but in my tests these all produced marginal returns while making the code harder to understand or modify.
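
For illustration, the hierarchical bail-out described here could look roughly like this (a hypothetical sketch, not code from the PR):

// Skip every heading color lookup when the block has no heading element styles at all.
if ( isset( $block['attrs']['style']['elements']['heading'] ) ) {
	$heading_styles = $block['attrs']['style']['elements']['heading'];

	foreach ( array( 'text', 'background', 'gradient' ) as $color_key ) {
		if ( isset( $heading_styles['color'][ $color_key ] ) ) {
			// … this block needs the extra class name …
		}
	}
}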

since these are all static we could eliminate the arrays entirely and fall back to static control structures.

again, things that I looked at but found no evidence that they are noteworthy optimizations, and things I felt could be addressed in a more normal development cycle where bigger changes can be examined.

Member Author

not sure where you all stand on adding these extra changes. I will try and create a new commit for them, but given the goals of this change and the timeline and that it's code coming in from a Gutenberg package I'm in favor of leaving this patch as-is, especially if we can't show a meaningful further improvement.

*/
$tags = new WP_HTML_Tag_Processor( $block_content );
if ( $tags->next_tag() ) {
$tags->add_class( wp_get_elements_class_name( $block ) );
Contributor

@costdev costdev Oct 5, 2023

Since $block doesn't change per iteration, we can avoid calling wp_get_elements_class_name() (and therefore md5() and serialize()) for each iteration by storing this above the foreach on line 94.

$elements_class_name = wp_get_elements_class_name( $block );

Member Author

this was something I thought at first too but there's actually no iteration here so it only gets called once. once the code realizes it needs the class name, it finds the first opening HTML tag, adds the class, and then returns from the function.

@felixarntz
Member

@dmsnell This is a great catch, and IMO worth committing for the 6.4 release.

I ran some additional benchmarks to further validate the benefits outlined by your initial benchmarks in the PR description, based on 100 benchmark-server-timing runs. To rule out the changes being caused by variance, I did that multiple times, giving additional confidence in the data. Summary of the median wp-total values:

  • Against TT4 home page:
    • 88.43ms (PR) against 89.73ms (trunk) --> 1.4% faster
    • 88.23ms (PR) against 89.93ms (trunk) --> 1.9% faster
    • 88.59ms (PR) against 89.89ms (trunk) --> 1.4% faster
  • Against TT4 "Sample Page":
    • 67.49ms (PR) against 67.6ms (trunk) --> 0.2% faster
    • 67.39ms (PR) against 68.2ms (trunk) --> 1.2% faster
    • 67.21ms (PR) against 67.91ms (trunk) --> 1.0% faster
  • Against TT3 home page:
    • 62.91ms (PR) against 64ms (trunk) --> 1.7% faster
    • 63.19ms (PR) against 63.68ms (trunk) --> 0.8% faster
    • 63.86ms (PR) against 63.98ms (trunk) --> 0.2% faster

Per the above, it is clear that the benefits of this PR are particularly relevant the more complex the current page / blocks are. Even for the very simple "Sample Page" though, there is a net benefit visible, so it's a safe performance improvement. If we were making the decision based on very simple demo content, I'd say this is not critical to merge, but with lots of real-world content more complex than that, I think it's safe to say this makes sense to add, even this late in the cycle. 👏

Member

@felixarntz felixarntz left a comment

@dmsnell Other than the remarks by @costdev, this looks great to me.

Would you be able to add 1-2 unit tests for this function to cover the changes?

@dmsnell
Member Author

dmsnell commented Oct 5, 2023

Would you be able to add 1-2 unit tests for this function to cover the changes?

If I do this @felixarntz we'll all be waiting longer than anyone wants. I didn't write this function and I have very little knowledge of how to test it or what is important to test. My refactor has centered on code auditing to avoid changing any behaviors except for the runtime performance.

Maybe @jorgefilipecosta or @aaronrobertshaw could chime in on this.

@dmsnell dmsnell force-pushed the performance/skip-needless-iteration-in-block-supports-elements branch from c6d3209 to 1490dab on October 5, 2023 18:40
@dmsnell
Member Author

dmsnell commented Oct 5, 2023

All, I have pushed a new commit to this branch which separates the question of whether a block needs the extra class name from the act of adding it. This new function incorporates all of the further practical optimizations.

  • it avoids creating arrays, except for the static list of heading elements that I thought was reasonable enough.
  • it aborts as soon as it can to avoid computing any needless lookups

As this function diverges a lot from the existing code I'm still leery of the change in this release cycle, and I'm also a bit skeptical about the additional performance impact, so I would appreciate several people testing it out to see if it's worth it. This change, in my opinion, makes it a little more fragile for future additions and changes to the rules because they are now encoded more in code structure than in a data list by itself. I have no foot in the game, though, so this is just me sharing my experience on the matter and not a personal wish.

@dmsnell
Member Author

dmsnell commented Oct 5, 2023

I've updated the description with some initial benchmarking against the extra commit. While I think these are all fairly close, I think the addition of the function call made the change slower. I'll inline it, but I think these results are matching my intuition, that the performance gains were made and the rest is in the blurry boundary.

@dmsnell
Member Author

dmsnell commented Oct 5, 2023

Funny: I inlined the function and the performance results got worse. Optimization is hard on modern CPUs. Maybe some speculation was winning an advantage with the array() structure somehow. I'm not sure. Or maybe the JIT found a better path 🤷‍♂️

As it stands this PR is faster at the first commit and slower at the second. So maybe we just stick with the initial proposal, which was both a smaller and clearer change and also probably faster in practice?

@felixarntz
Member

@dmsnell How did you verify that the performance got worse? If you used the benchmark-server-timing command, I'm not sure it's suitable for this kind of optimization. There is a considerable amount of variance, even with 100+ runs. It's good to rely on when the difference is substantial enough and consistently leaning in the one direction, but I am pretty sure that none of the additionally proposed improvements have any impact that would bring visible results with that command.

That doesn't mean those improvements are incorrect. For example, computing the class name once outside the loop surely is more efficient than computing it for each entry in the loop, even if it doesn't show in benchmarks. All of the additionally proposed points of feedback are micro optimizations, which we probably just need to individually reason about. While in an ideal world we could even spot those with benchmark-server-timing, the variance in that command is too high to help with micro optimizations.

Anyway, all that is to say: I'm not disagreeing with your conclusion, but I think we need to be mindful of not reading too much into benchmark-server-timing results when it comes to tiny differences (around 1% or less). In those cases, I would recommend running the command several times, and if the tendency is always in the same direction, there's a good chance there really is an improvement (or a regression, in the inverse case). But otherwise, the benchmarks simply don't help.

@felixarntz
Member

@dmsnell

Would you be able to add 1-2 unit tests for this function to cover the changes?

If I do this @felixarntz we'll all be waiting longer than anyone wants. I didn't write this function and I have very little knowledge of how to test it or what is important to test. My refactor has centered on code auditing to avoid changing any behaviors except for the runtime performance.

I'm not sure that's a reasonable path forward. The fact that this refactor aims to avoid changing any behaviors is precisely why it should have test coverage, to ensure that that is the case, and remains so in the future. Yes, writing tests takes extra time, and I get that it sometimes feels like it slows things down. But it's crucial in the WordPress core project that code has test coverage.

Having test coverage also makes committing changes less risky if the committer can rely on existing test coverage passing as another indicator that a change is good to go. I think this function can reasonably be tested with PHPUnit, and it probably should have had tests added already whenever it was introduced. But since we're changing it now, it's crucial we at least follow up with that now.

@dmsnell
Member Author

dmsnell commented Oct 5, 2023

How did you verify that the performance got worse? If you used the benchmark-server-timing command, I'm not sure it's suitable for this kind of optimization.

This is indeed how I measured, with the specific commands and environment documented above in the description. I ran with 3000 iterations and have seen fairly consistent results each time I run it.

My point isn't that this does or doesn't have micro optimizations, but that if it's this difficult to tell whether it helps, hurts, or makes no difference, then it's not worth the complexity and readability issues that it requires. I'm inclined to believe there's something real underpinning the apparent slowdown from these additional micro-optimizations, strange as it seems.

Regardless, the changes introduced in the first commit reliably appear in various benchmarks and measurements, so those are a clear win without a clear readability penalty. I generally prefer to get better out the door before we can confirm if best is better.

computing the class name from the function outside the loop surely is more efficient than for each entry in the loop, even if it doesn't show in benchmarks.

As also noted in the thread above, this isn't getting computed multiple times. As soon as that part of the loop is reached the function returns, so this is only ever called once. There was an initial misunderstanding of the code that led to this suggestion, but it doesn't apply here.

I'm not sure that's a reasonable path forward.

Pardon for the misunderstanding! I'm not saying I won't add tests; unfortunately, unless someone comes in to help, it may be unreasonable, but it's the only thing I can offer 😄

Also worth mentioning: if I add tests someone is going to have to review the tests to make sure I'm testing the right things 🙃 Quis custodiet ipsos custodes? I'm flying blind here. I'm not sure I can get these in this week, or even by end of next week. I'll try.

@aaronrobertshaw are you at all available to help speed this up?

@felixarntz
Member

Thanks for the detailed clarification @dmsnell!

@aaronrobertshaw If you can assist here with test coverage, that would be tremendously helpful. I am happy to review, will have to see if alternatively I can take a stab at it as well.

@dmsnell
Member Author

dmsnell commented Oct 5, 2023

@felixarntz still relying on the wpp-research tooling to measure, I set up the tests on an isolated server in a climate-controlled data center, sharing no resources with other systems. the results agree with my local testing.

  • 2.0 GHz EPYC Milan, 16 GB
  • npm run research -- benchmark-server-timing -u 'http://localhost:3880' -n 5000 -c 30 -p -o csv
  • PHP and mysql are both running natively, outside of any container
version                     wp-total (p50)   % against trunk
trunk                       124.9            -
first commit only           119.0            -4.73%
with micro-optimizations    121.2            -2.96%
with inlined micro-opt…     123.8            -0.89%

I haven't listed the other numbers because they correspond with the results shared in the description. I'm happy to see other results show otherwise, but I think something was lost in my attempts to realize the additional suggestions. these results each stem from 5000 requests and are relatively consistent when running multiple times.

@dmsnell dmsnell force-pushed the performance/skip-needless-iteration-in-block-supports-elements branch from a805b24 to f1455d7 on October 5, 2023 23:08
@aaronrobertshaw

Thanks for the pings 👍

I'm happy to help wherever I can. Just catching up on the changes and discussion before looking into tests.

@aaronrobertshaw

aaronrobertshaw commented Oct 6, 2023

Quick update:

I'm working on some additional tests and I'll put together PRs for both core and Gutenberg.

The plan is to leave the existing wp_render_elements_support tests untouched for the moment and simply add coverage to ensure that the correct styles are being generated for elements via the style engine.

It'll take a little bit to get it sorted but it'll happen in the next couple of hours.

@dmsnell
Member Author

dmsnell commented Oct 6, 2023

Quick note here @felixarntz: in further testing on my server I'm getting results all over the place, which I wasn't getting earlier. I'm not sure what's going on, unless the server is just more variable than my laptop. Maybe I can make my own benchmark with curl and build a more statistically sound analysis.

Any suggestions otherwise? I would have hoped that over thousands of runs this would converge, but it would also be nice to see a histogram of the results which maybe tells a clearer picture.

Needless to say though, we're clearly doing less work by aborting early, so I don't doubt any of the existing conclusions. Maybe the additional optimizations are more effective than I think, or maybe they aren't 🤷‍♂️. I'm sure there's no point in continuing to scan attributes when we've already determined we need to add the class, and while explode() was already creating arrays for the paths, in the first commit those arrays still exist as literals, so it's strictly equal or better in those terms.

@dmsnell
Member Author

dmsnell commented Oct 6, 2023

and some plots:

first up is a raw scatter plot of 3000 runs each of Server-Timing based on manually extracting the values, still on that server mentioned above. outliers are cut out of the graph, but they do not significantly change the distribution. it's interesting to note just how patterned the variation is.

for i in (seq 1 3000)
	curl http://localhost:3880 -D - | grep 'Server-Timing' >> branch.txt
end

I used node to quickly extract the timings from each file.
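
(For reference, the extraction step is only a few lines in any language; here is a hedged PHP equivalent of that node script — not the actual script used — assuming the standard "name;dur=value" Server-Timing format and the branch.txt file written by the loop above.)

<?php
// Pull every wp-total duration out of the captured header lines.
$lines  = file( 'branch.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
$totals = array();

foreach ( $lines as $line ) {
	if ( preg_match( '/wp-total;dur=([0-9.]+)/', $line, $matches ) ) {
		$totals[] = (float) $matches[1];
	}
}

echo implode( "\n", $totals ), "\n";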

[chart: branch-compare]

for a different perspective, here are the runs for each branch when sorted.

[chart: branch-sorted]

not much but mystery I guess

[chart: between-stats]

ggbetweenstats thinks there's a significant difference though both from trunk and with the extra optimizations.

branch         standard deviation
trunk          28.1
first commit   10.2
all commits    6.67

strangely this is showing a performance regression in all this work 🤦‍♂️ but I still think there's bias slipping in on the server. I will see if I can re-run this overnight while alternating branches to account for any systematic issues with the server; maybe it's not as isolated as I have thought.

@aaronrobertshaw

There's a PR up in #5418 adding elements block support unit tests. The equivalent Gutenberg PR (WordPress/gutenberg#55113) has been merged there.

@dmsnell
Member Author

dmsnell commented Oct 6, 2023

really splitting hairs now, but here is the same plot based on around 11,000 points for each commit. every 500 samples I switched branches to spread out pattern bias in the server.

this keeps on confirming, I guess, what we've already concluded: the first commit brings a significant but small improvement over trunk, while the additional changes, for whatever surprising reason, regress from the first commit but are still faster than trunk

[chart: between-stats]

with this many samples the standard deviation for each commit is pretty tight and close to each other: 5.66 for trunk, 5.13 for the first commit, and 5.43 for all commits.

@@ -31,90 +31,79 @@ function wp_get_elements_class_name( $block ) {
* @return string Filtered block content.
*/
function wp_render_elements_support( $block_content, $block ) {
if ( ! $block_content || empty( $block['attrs'] ) ) {
static $heading_elements = array( 'heading', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' );
Member

This static variable feels like overkill. Is it that expensive to create an array of values?

Member

Suggested change
static $heading_elements = array( 'heading', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' );

Member Author

worth a shot: trying in b5207cb

Member

@felixarntz felixarntz Oct 6, 2023

What is the downside of using a static here?

Indeed I wouldn't have suggested using a static if @dmsnell wasn't already using it (simply because the cost of creating an array is so little), but despite that, creating an array on every function call is definitely more expensive than only creating it once, even if that difference is irrelevant in the grand scheme of things.

So IMO, unless there's a downside from using a static, I'm not sure I agree with removing it.

Update: Apparently, using a static for this is really not faster, it's actually slower. I don't know why, but I trust the data. Running both approaches 10M times clearly shows that, see https://3v4l.org/Iju9q.

I learned something today 😁
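
For reference, the kind of 10M-iteration comparison described looks roughly like this (a sketch, not the linked 3v4l snippet):

<?php
// Compare returning a static array with building the array literal on every call.
function heading_elements_static() {
	static $heading_elements = array( 'heading', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' );
	return $heading_elements;
}

function heading_elements_literal() {
	return array( 'heading', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' );
}

$start = microtime( true );
for ( $i = 0; $i < 10000000; $i++ ) {
	heading_elements_static();
}
printf( "static:  %.3fs\n", microtime( true ) - $start );

$start = microtime( true );
for ( $i = 0; $i < 10000000; $i++ ) {
	heading_elements_literal();
}
printf( "literal: %.3fs\n", microtime( true ) - $start );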

Member

Static variables use up memory. If this array is not used, like on a page that doesn't have headers, then this cache is pointless.

Member

@felixarntz felixarntz Oct 6, 2023

@spacedmonkey See my updated reply in #5411 (comment): Even if the array was used, static doesn't appear to be faster one bit.

Something to keep in mind for the future: there doesn't seem to be any performance benefit to this, so we should avoid using static variables in a function unless there is a functional benefit.

// Heading element support.
$supports_heading = ! wp_should_skip_block_supports_serialization( $block_type, 'color', 'heading' );
if ( $supports_heading ) {
foreach ( $heading_elements as $element_name ) {
Member

Suggested change
foreach ( $heading_elements as $element_name ) {
$heading_elements = array( 'heading', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' );
foreach ( $heading_elements as $element_name ) {

Variable is only used here.

@dmsnell dmsnell force-pushed the performance/skip-needless-iteration-in-block-supports-elements branch from f1455d7 to b35fda3 on October 6, 2023 14:10
@dmsnell
Member Author

dmsnell commented Oct 6, 2023

Last night was fun: I have a few more charts to share, and I wouldn't have continued on this except I feel like it will be useful to have some better testing infrastructure for myself for other work and other measurements. On the aforementioned server and also on an old Core i9 MBP, 64 GB, I set up the same experiment and collected 50,000 data points for each of the three commits in question: trunk at the merge base, 26f3d58, and a805b24. I made sure to manually control the fan on the MBP to ensure that it didn't thermally throttle (which it did early on before I set the fan, and then I had to restart the experiment because the numbers shot straight up).

What did I find? More of the same, really, but a surprise with the laptop. these data were collected over a period of roughly five to six hours. obviously the laptop collected its results much faster since it's almost twice as fast to run the requests, probably because it ramps up to 5.0 GHz

[charts: raw data and comparison stats for both the server and laptop runs]

what's really interesting to me is the strongly bi-modal distribution of request timings on the laptop. I'm not sure why that is, but I think it might explain somewhat how at times it can seem like our performance results can flip in different test runs. I haven't confirmed if this behavior is present on my Apple silicon laptop yet, so it could be something related to the PC hardware and architecture. I think the final piece I'd like to figure out, and this is something I don't plan on doing for this PR, is to run these experiments on some free shared WordPress hosts if I can so that I can obtain a better grasp of how some of these performance changes impact the "everyman's WordPress hosting" - the kinds of hosts that are much more typical for WordPress than high-end developer machines or dedicated servers. sometimes I wonder if the changes we make help the best environments at the cost of those installs running on spinning disks, shared caches, and strained databases.

overall though this does nothing to add to the earlier experiments demonstrating that the first commit alone is not only a much smaller patch, but faster than the additional changes. my recommendation remains therefore that we ought to consider merging just the first commit and leave any potential micro-optimization for the normal release cycle. that being said I'm rerunning the experiment with @spacedmonkey's latest suggestion to inline the array of heading elements. happy to take further recommendations, but also happy to take the easy win and leave the muddy parts for a better time.

@felixarntz
Member

Thank you for this super detailed performance research @dmsnell, it's indeed interesting (and confusing!) to see those results.

Unfortunately, there is still a lot for us to uncover and improve regarding the performance benchmarking tooling, and at the moment the variance is still relatively high. For Server-Timing, it generally produces quite consistent results for me - but when I say consistent, that is still within a 1-2% margin of error/variance when comparing medians from multiple benchmark-server-timing calls.

That's why for now I use those benchmarks mostly as indicator for broader changes to see whether they make a notable improvement and, if so, how notable it is.

For micro optimizations (basically anything that came after the initial commit), I don't trust it at all, quite frankly. For those optimizations, I think we'll still have to operate on general performance knowledge, discussion, etc. I think if in doubt over a micro optimization, our best bet is to use something like https://3v4l.org/, run an operation using both of the potential approaches, let's say, 100,000 times, and look at the difference to make an educated call. I would advise relying on that instead of benchmark-server-timing when considering micro optimizations.

return $block_content;
}

$block_type = WP_Block_Type_Registry::get_instance()->get_registered( $block['blockName'] );
$style_attributes = $block['attrs']['style']['elements'];

Member

Suggested change
if ( ! $block_type ) {
	return $block_content;
}

Should we do this as well?

@spacedmonkey
Member

I think we need the following missing unit tests.

  • Unit test for empty content passed to $block_content
  • Unit test for a $block passed with empty 'attrs'
  • Unit test for a block that is not registered.
  • Unit tests for buttons, headings and links.
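
As a rough illustration, the empty-'attrs' case could look something like this as a PHPUnit test (illustrative name and fixture, not the committed tests):

public function test_should_return_content_unchanged_when_block_has_no_elements_styles() {
	$block = array(
		'blockName' => 'core/paragraph',
		'attrs'     => array(),
	);

	$content = '<p>Hello</p>';

	$this->assertSame( $content, wp_render_elements_support( $content, $block ) );
}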

@dmsnell
Member Author

dmsnell commented Oct 6, 2023

I think we need the following missing unit tests.

I can try and get these added when I bring over the tests that @aaronrobertshaw added. thanks again @aaronrobertshaw!

static

removing the static isn't showing any benefit over the first commit either.

if in doubt over a micro optimization, our best bet is to use something like https://3v4l.org/

we can do that @felixarntz but I would also doubt results from micro-benchmarks like that. it's too easy to construct synthetic benchmarks that remove the important context, such as the code that runs around this stuff. obviously if an impact is huge it won't matter, but a genuinely significant improvement shouldn't require creating an artificial test scenario that may not exist in production to prove it.

run an operation using both of the potential approach let's say 100,000 times, and look at the difference to make an educated call. I would advise to rely on that instead of benchmark-server-timing when considering micro optimizations.

well I ran these 10,000 times and while the distributions overlap there's also strong statistical evidence that the differences are not attributable to random variation alone. (edit: of note, I wasn't using benchmark-server-timing, as noted in the setup above, but was grabbing the Server-Timing value from each request, storing it, and then running a statistical analysis on the entire set of data. in other words, this is the raw data, not the p10 - p90 bins that benchmark-server-timing creates, and I removed the outliers, which were near 1600ms for the response, so that we were looking at the bulk normative distribution).

for the micro-optimizations I would be happy to see evidence that shows they improve things, but I'm trusting my measurements on this over my intuition. to that end I think it's best if I focus on porting the tests into a Core ticket and let y'all do with this what you want, because there are bigger performance penalties to chase. so far, it seems like all of the evidence we have collected pushes one way, and the dubious hypothetical benefits come at a cost of more drastic code changes (and this is even a slightly different scenario than it would be if all that extra complexity demonstrated in any way that it was even marginally faster, which it isn't). my language might appear strong here, but I don't have any strong feelings on where you take this. I'm only sharing my experience and experiments and trying to be clear about where my recommendations are coming from.

at the moment the variance is still relatively high

curious to hear more of your thoughts on why you doubt the tooling, or why you doubt it after running statistical sampling on it. N = 10000 is generally considered quite a substantial sample, and in review of the raw data it looks fair between the branches, and the pattern bias that was shown to impact the results in the short term was spread out among the branches once I identified it and changed the experiment.

I'd love to have a formalized way to compare the performance of the full site across branches so I welcome your input on how to improve the setup.

@felixarntz
Member

FWIW I am happy with pretty much "whichever" version of this PR to commit, since almost all of the feedback is performance nit-picks not worth blocking this.

I would suggest though that we wait until #5418 has been committed first, so that there is additional test coverage that helps verify that this change is "safe to make".

@spacedmonkey
Member

For micro optimisations like this, profiling is much better than benchmarking to see the effect.

@dmsnell
Member Author

dmsnell commented Oct 6, 2023

previous PR HEAD was b5207cb
resetting to first commit only based on following the data

For micro optimisation like this, profiling is much better than benchmark to see effect.

I guess this is a good conversation for a different place, but if a test of a realistic environment conflicts with profiling why would we trust the profiling? we know that profiling itself impacts the runtime and can bias results, particularly any results where blocking time is a big part of the run. sampling profilers definitely help, but they still skew results.

all I really want to do is not waste time hunting for a micro-optimization that nobody can demonstrate is better or worse. that's such a precarious path to take.

I would suggest though that we wait until #5418 has been committed first, so that there is additional test coverage that helps verify that this change is "safe to make".

sounds great! I didn't realize we already had a Core ticket for those unit tests either. I'll turn this PR over to either of you, or whoever wants to push it through. in the meantime I'll try and follow the tests PR and see if I can add the additional requested tests.

do you need anything else from me on this?

@felixarntz
Member

Just merged latest trunk into here in 70e91de, which has the tests added from https://core.trac.wordpress.org/ticket/59555

Member

@felixarntz felixarntz left a comment

This looks good to commit in its current shape, and tests are passing. 🎉

I ran another benchmark against the TT4 home page, based on 1000 runs: With the PR, it took 85.56ms instead of 87.6ms in trunk (2.3% improvement) 👍

@felixarntz
Member

@felixarntz felixarntz closed this Oct 9, 2023
@dmsnell dmsnell deleted the performance/skip-needless-iteration-in-block-supports-elements branch October 9, 2023 18:21
@dmsnell
Member Author

dmsnell commented Oct 9, 2023

Thanks @felixarntz - I added additional unit tests in #5446

dmsnell added a commit to WordPress/gutenberg that referenced this pull request Oct 11, 2023
Brings over optimization surfaced during the WordPress 6.4 beta release for
block supports elements, which entails skipping needless iteration when it's
known that the iteration need not continue.

See WordPress/wordpress-develop#5411
See [#59544-trac](https://core.trac.wordpress.org/ticket/59544)