Performance: Skip iteration in block supports elements when not necessary. #5411
Conversation
Thanks @dmsnell and nice find! I've added one nitpick and a few ideas for further reducing memory usage.
@@ -31,7 +31,7 @@ function wp_get_elements_class_name( $block ) {
 * @return string Filtered block content.
 */
function wp_render_elements_support( $block_content, $block ) {
-	if ( ! $block_content || empty( $block['attrs'] ) ) {
+	if ( ! $block_content || !isset( $block['attrs']['style']['elements'] ) ) {
Suggested change:
-if ( ! $block_content || !isset( $block['attrs']['style']['elements'] ) ) {
+if ( ! $block_content || ! isset( $block['attrs']['style']['elements'] ) ) {
thanks @costdev - not sure how I missed that - it was literally glowing at me 🙃
-'style.elements.button.color.text',
-'style.elements.button.color.background',
-'style.elements.button.color.gradient',
+array( 'button', 'color', 'text' ),
@dmsnell To avoid creating the arrays unnecessarily on each call, what do you think about moving the early return on lines 84-90 up to line 39 and changing them to this?
$skip_button_color = wp_should_skip_block_supports_serialization( $block_type, 'color', 'button' );
$skip_link_color = wp_should_skip_block_supports_serialization( $block_type, 'color', 'link' );
$skip_heading_color = wp_should_skip_block_supports_serialization( $block_type, 'color', 'heading' );
if ( $skip_button_color && $skip_link_color && $skip_heading_color ) {
return $block_content;
}
Then this will further reduce memory usage while having no functional or visual impact.
The `$skip_*` variables could be used for the `'skip'` keys in the arrays below to avoid duplicate calls, but see the comment below for an alternative approach.
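For reference, a rough sketch of what using the `$skip_*` flags for the `'skip'` keys could look like, mirroring the shape of the existing `$element_color_properties` array (the exact structure here is an assumption for illustration, not the PR's code):

$element_color_properties = array(
	'button' => array(
		'skip'  => $skip_button_color,
		'paths' => array(
			array( 'button', 'color', 'text' ),
			array( 'button', 'color', 'background' ),
			array( 'button', 'color', 'gradient' ),
		),
	),
	// Same shape for 'link' and 'heading'.
);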
We could further reduce memory usage by using the `$skip_*` variables to determine whether to add the property and its paths to the array at all, rather than create it and store whether to skip it later.
$element_color_properties = array();
if ( ! $skip_button_color ) {
$element_color_properties['button'] = array(
array( 'button', 'color', 'text' ),
// etc.
);
}
// same for link and heading.
This is a minor optimization, but I think it makes sense to add, given this PR is purely focused on performance in the first place 👍
thanks @costdev - I explored this route but found diminishing returns. there are more optimization opportunities involving bailing out in a hierarchical manner too, such as saying: if there's no `style.elements.heading` attribute then don't check `style.elements.heading.color.text`. but in my tests these all created marginal returns while making the code harder to understand or modify.
since these are all static we could eliminate the arrays entirely and fall back to static control structures.
again, these are things that I looked at but found no evidence that they are noteworthy optimizations, and things I felt could be addressed in a more normal development cycle where bigger changes can be examined.
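As a concrete illustration of the hierarchical bail-out idea being discussed (a hypothetical sketch, not code from this PR):

$elements = $block['attrs']['style']['elements'] ?? array();

// Only probe the deeper paths when the parent element key is present at all.
if ( isset( $elements['heading'] ) ) {
	$has_heading_color = isset( $elements['heading']['color']['text'] )
		|| isset( $elements['heading']['color']['background'] )
		|| isset( $elements['heading']['color']['gradient'] );
}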
not sure where you all stand on adding these extra changes. I will try to create a new commit for them, but given the goals of this change, the timeline, and that it's code coming in from a Gutenberg package, I'm in favor of leaving this patch as-is, especially if we can't show a meaningful further improvement.
*/
$tags = new WP_HTML_Tag_Processor( $block_content );
if ( $tags->next_tag() ) {
	$tags->add_class( wp_get_elements_class_name( $block ) );
Since `$block` doesn't change per iteration, we can avoid calling `wp_get_elements_class_name()` (and therefore `md5()` and `serialize()`) for each iteration by storing this above the `foreach` on line 94.

$elements_class_name = wp_get_elements_class_name( $block );
this was something I thought at first too, but there's actually no iteration here so it only gets called once. once the function realizes it needs the class name, it finds the first opening HTML tag, adds the class, and then returns.
@dmsnell This is a great catch, and IMO worth committing for the 6.4 release. I ran some additional benchmarks to further validate the benefits outlined by your initial benchmarks in the PR description, based on 100
Per the above, it is clear that the benefits of this PR are particularly relevant the more complex the current page / blocks are. Even for the very simple "Sample Page" though, there is a net benefit visible, so it's a safe performance improvement. If we were making the decision based on very simple demo content, I'd say this is not critical to merge, but with lots of real-world content more complex than that, I think it's safe to say this makes sense to add, even this late in the cycle. 👏
If I do this @felixarntz we'll all be waiting longer than anyone wants. I didn't write this function and I have very little knowledge of how to test it or what is important to test. My refactor has centered on code auditing to avoid changing any behaviors except for the runtime performance. Maybe @jorgefilipecosta or @aaronrobertshaw could chime in on this.
(force-pushed c6d3209 to 1490dab)
All, I have pushed a new commit to this branch which separates the question of whether a block needs the extra class name from the act of adding it. This new function incorporates all of the further practical optimizations.
As this function diverges a lot from the existing code I'm still leery of the change in this release cycle, and I'm a bit skeptical of the additional performance impact, so I would appreciate several people testing it out to see if it's worth it. This change, in my opinion, makes it a little more fragile for future additions and changes to the rules because they are now encoded more in code structure than in a data list by itself. I have no skin in the game, though, so this is just me sharing my experience on the matter and not a personal wish.
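For readers following along, the new shape is roughly the following (a simplified sketch with an illustrative helper name, not the exact committed code):

// Hypothetical helper separating the question from the action.
function wp_elements_support_requires_class( $block ) {
	// Simplified: the real check also consults wp_should_skip_block_supports_serialization().
	return isset( $block['attrs']['style']['elements'] );
}

function wp_render_elements_support( $block_content, $block ) {
	if ( ! $block_content || ! wp_elements_support_requires_class( $block ) ) {
		return $block_content;
	}

	$tags = new WP_HTML_Tag_Processor( $block_content );
	if ( $tags->next_tag() ) {
		$tags->add_class( wp_get_elements_class_name( $block ) );
	}

	return $tags->get_updated_html();
}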
I've updated the description with some initial benchmarking against the extra commit. While I think these are all fairly close, I think the addition of the function call made the change slower. I'll inline it, but I think these results are matching my intuition, that the performance gains were made and the rest is in the blurry boundary.
Funny: I inlined the function and the performance results got worse. Optimization is hard on modern CPUs. Maybe some speculation was winning an advantage with the separate function call. As it stands, this PR is faster at the first commit and slower at the second. So please, maybe we just stick with the initial proposal, which was both a smaller and clearer change but also probably faster in practice?
@dmsnell How did you verify that the performance got worse? If you used the Server-Timing benchmarks, their variance is likely larger than the effect of micro optimizations at this scale. That doesn't mean those improvements are incorrect. For example, computing the class name from the function outside the loop surely is more efficient than for each entry in the loop, even if it doesn't show in benchmarks. All of the additionally proposed points of feedback are micro optimizations, which we probably just need to individually reason about. While in an ideal world we could even spot those with benchmarks, in practice the noise outweighs them. Anyway, all that is to say: I'm not disagreeing with your conclusion, but I think we need to be mindful of not interpreting too much into benchmark results at this granularity.
I'm not sure that's a reasonable path forward. The fact that this refactor aims to avoid changing any behaviors is precisely why it should have test coverage, to ensure that that is the case, and remains so in the future. Yes, writing tests takes extra time, and I get that it sometimes feels like it slows things down. But it's crucial in the WordPress core project that code has test coverage. Having test coverage also makes committing changes less risky if the committer can rely on existing test coverage passing as another indicator that a change is good to go. I think this function can reasonably be tested with PHPUnit, and it probably should have had tests added already whenever it was introduced. But since we're changing it now, it's crucial we at least follow up with that now.
This is indeed how I measured, with the specific commands and environment documented above in the description. I ran with 3000 iterations and have seen fairly consistent results each time I run it. My point isn't that this does or doesn't have micro optimizations, but that if it's this difficult to tell whether it helps, hurts, or makes no difference, then it's not worth the complexity and readability issues that it requires. I'm given to believe that something underpins the apparent fact that these additional micro-optimizations slow things down, strange as it seems. Regardless, the changes introduced in the first commit reliably appear in various benchmarks and measurements, so those are a clear win without a clear readability penalty. I generally prefer to get better out the door before we can confirm if best is better.
Also noted in the thread above, but this isn't getting computed multiple times. As soon as that part of the code is reached the function returns, so this is only ever being called once. There was an initial misunderstanding of the code that led to this suggestion, but it doesn't apply here.
Pardon for the misunderstanding! I'm not saying I won't add tests; unfortunately, unless someone comes in to help, it may be an unreasonable wait, but it's the only thing I can offer 😄 Also worth mentioning: if I add tests someone is going to have to review the tests to make sure I'm testing the right things 🙃 Quis custodiet ipsos custodes? I'm flying blind here. I'm not sure I can get these in this week, or even by end of next week. I'll try. @aaronrobertshaw are you at all available to help speed this up?
Thanks for the detailed clarification @dmsnell! @aaronrobertshaw If you can assist here with test coverage, that would be tremendously helpful. I am happy to review, and will have to see whether I can alternatively take a stab at it as well.
@felixarntz still relying on the same Server-Timing setup documented in the description. I haven't listed the other numbers because they correspond with the results shared in the description. I'm happy to see other results show otherwise, but I think something was lost in my attempts to realize the additional suggestions. these results each stem from 5000 requests and are relatively consistent when running multiple times.
(force-pushed a805b24 to f1455d7)
Thanks for the pings 👍 I'm happy to help wherever I can. Just catching up on the changes and discussion before looking into tests.
Quick update: I'm working on some additional tests and I'll put together PRs for both core and Gutenberg. The plan is to leave the existing behavior as-is and cover it with tests. It'll take a little bit to get it sorted but it'll happen in the next couple of hours.
Quick note here @felixarntz: in further testing on my server I'm getting results all over the place, which I wasn't earlier. I'm not sure what's going on, unless the server is just more variable than my laptop. Maybe I can make my own benchmark with curl. Any suggestions otherwise? I would have hoped that over thousands of runs this would converge, but it would also be nice to see a histogram of the results which maybe tells a clearer picture. Needless to say though, we're clearly doing less work by aborting early, so I don't doubt any of the existing conclusions. Maybe the additional optimizations are more effective than I think, or maybe they aren't 🤷♂️. I'm sure there's no point in continuing to scan attributes though when we've already determined we need to add the class, and that part at least seems like a clear win.
and some plots: first up is a raw scatter plot of 3000 runs each, collected with

for i in (seq 1 3000)
	curl http://localhost:3880 -D - | grep 'Server Timing' >> branch.txt
end

for a different perspective, here are the runs for each branch when sorted. not much but mystery I guess
strangely this is showing a performance regression in all this work 🤦♂️ but I still think there's bias slipping in on the server. I will see if I can re-run this overnight while alternating branches to account for any systematic issues with the server; maybe it's not as isolated as I have thought.
There's a PR up in #5418 adding elements block support unit tests. The equivalent Gutenberg PR (WordPress/gutenberg#55113) has been merged there.
really splitting hairs now, but here is the same plot based on around 11,000 points for each commit. every 500 samples I switched branches to spread out pattern bias in the server. this keeps on confirming I guess what we've already concluded: the first commit brings a significant but small improvement over trunk. with this many samples the standard deviation for each commit is pretty tight and close to each other (around 5.66).
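(For anyone reproducing this: a quick PHP sketch for summarizing the collected timings, assuming the numeric millisecond values have already been extracted into branch.txt, one per line.)

$samples = array_map( 'floatval', file( 'branch.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES ) );
$n       = count( $samples );
$mean    = array_sum( $samples ) / $n;
$var     = 0.0;
foreach ( $samples as $sample ) {
	$var += ( $sample - $mean ) ** 2;
}
printf( "n=%d mean=%.2f sd=%.2f\n", $n, $mean, sqrt( $var / $n ) );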
@@ -31,90 +31,79 @@ function wp_get_elements_class_name( $block ) {
 * @return string Filtered block content.
 */
function wp_render_elements_support( $block_content, $block ) {
-	if ( ! $block_content || empty( $block['attrs'] ) ) {
+	static $heading_elements = array( 'heading', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' );
This static variable feels like overkill. Is it that expensive to create an array of values?
static $heading_elements = array( 'heading', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' );
worth a shot: trying in b5207cb
What is the downside of using a `static` here?
Indeed I wouldn't have suggested using a static if @dmsnell wasn't already using it (simply because the cost of creating an array is so little), but despite that, creating an array on every function call is definitely more expensive than only creating it once, even if that difference is irrelevant in the grand scheme of things.

So IMO, unless there's a downside from using a `static`, I'm not sure I agree with removing it.
Update: Apparently, using a `static` for this is really not faster; it's actually slower. I don't know why, but I trust the data. Running both approaches 10M times clearly shows that, see https://3v4l.org/Iju9q.

I learned something today 😁
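(A rough reconstruction of that kind of A/B harness; the actual script at https://3v4l.org/Iju9q may differ in its details.)

<?php
function with_static() {
	static $heading_elements = array( 'heading', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' );
	return $heading_elements;
}

function without_static() {
	$heading_elements = array( 'heading', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' );
	return $heading_elements;
}

// Time each approach over 10M calls.
$start = microtime( true );
for ( $i = 0; $i < 10000000; $i++ ) {
	with_static();
}
printf( "static:    %.3fs\n", microtime( true ) - $start );

$start = microtime( true );
for ( $i = 0; $i < 10000000; $i++ ) {
	without_static();
}
printf( "no static: %.3fs\n", microtime( true ) - $start );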
Static variables use up memory. If this array is not used, like on a page that doesn't have headings, then this cache is pointless.
@spacedmonkey See my updated reply in #5411 (comment): Even if the array was used, `static` doesn't appear to be faster one bit.

Something to keep in mind for the future: there doesn't seem to be any performance benefit to this, so we should avoid using static variables in a function unless there is a functional benefit.
// Heading element support.
$supports_heading = ! wp_should_skip_block_supports_serialization( $block_type, 'color', 'heading' );
if ( $supports_heading ) {
	foreach ( $heading_elements as $element_name ) {
Suggested change:
-	foreach ( $heading_elements as $element_name ) {
+	$heading_elements = array( 'heading', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' );
+	foreach ( $heading_elements as $element_name ) {

The variable is only used here.
…sary.

Previously there have been three ways in which the block supports elements code has been performing computation it doesn't need to.

- It checks for every possible style attribute on a block even if there are no possible style attributes available on the block.
- Once it determines that it needs to add a special class to a block it continues scanning through the rest of the attributes.
- It constructs dot-separated literal key paths for the attributes and immediately and repeatedly calls `explode()` on them to create an array at runtime.

The combination of these three factors leads to an outsized impact of this function on the runtime and memory pressure for what it's doing. This patch removes all three of these inefficiencies:

- If no elements style attributes exist on the block it immediately returns and skips all further processing.
- It stops scanning block attributes once a single one is found that imposes the requirement for the extra class name.
- It constructs array literals for the key path instead of strings. This removes the need to artificially join and explode the keys at runtime.

There should be no functional or visual changes in this patch, but it should reduce the overall runtime of page renders and require less memory while rendering. The impact will be proportional to the number of blocks rendered on a page.
(force-pushed f1455d7 to b35fda3)
Last night was fun: I have a few more charts to share, and I wouldn't have continued on this except I feel like it will be useful to have some better testing infrastructure for myself for other work and other measurements. On the aforementioned server and also on an old Core i9 MBP (64 GB) I set up the same experiment and collected 50,000 data points for each of the three commits in question. What did I find? More of the same, really, but a surprise with the laptop. these data were collected over a period of roughly five to six hours. obviously the laptop collected its results much faster since it's almost twice as fast to run the requests, probably because it ramps up to 5.0 GHz
what's really interesting to me is the strongly bi-modal distribution of request timings on the laptop. I'm not sure why that is, but I think it might explain somewhat how at times it can seem like our performance results can flip in different test runs. I haven't confirmed if this behavior is present on my Apple silicon laptop yet, so it could be something related to the PC hardware and architecture.

I think the final piece I'd like to figure out, and this is something I don't plan on doing for this PR, is to run these experiments on some free shared WordPress hosts if I can so that I can obtain a better grasp of how some of these performance changes impact the "everyman's WordPress hosting" - the kinds of hosts that are much more typical for WordPress than high-end developer machines or dedicated servers. sometimes I wonder if the changes we make help the best environments at the cost of those installs running on spinning disks, shared caches, and strained databases.

overall though this does nothing to add to the earlier experiments demonstrating that the first commit alone is not only a much smaller patch, but faster than the additional changes. my recommendation remains therefore that we ought to consider merging just the first commit and leave any potential micro-optimization for the normal release cycle.

that being said I'm rerunning the experiment with @spacedmonkey's latest suggestion to inline the array of heading elements. happy to take further recommendations, but also happy to take the easy win and leave the muddy parts for a better time.
Thank you for this super detailed performance research @dmsnell, it's indeed interesting (and confusing!) to see those results. Unfortunately, there is still a lot for us to uncover and improve regarding the performance benchmarking tooling, and at the moment the variance is still relatively high. For Server-Timing, it generally produces quite consistent results for me - but when I say consistent, that is still within a 1-2% margin of error/variance when comparing medians from multiple runs.

That's why for now I use those benchmarks mostly as an indicator for broader changes to see whether they make a notable improvement and, if so, how notable it is. For micro optimizations (basically anything that came after the initial commit), I don't trust it at all, quite frankly. For those optimizations, I think we'll still have to operate on general performance knowledge, discussion, etc. I think if in doubt over a micro optimization, our best bet is to use something like https://3v4l.org/, run an operation using both of the potential approaches, let's say 100,000 times, and look at the difference to make an educated call. I would advise to rely on that instead of the Server-Timing benchmarks for decisions at that scale.
return $block_content;
}

$block_type       = WP_Block_Type_Registry::get_instance()->get_registered( $block['blockName'] );
$style_attributes = $block['attrs']['style']['elements'];
if ( ! $block_type ) {
	return $block_content;
}
Should we do this as well?
I think we need the following missing unit tests.
I can try and get these added when I bring over the tests that @aaronrobertshaw added. thanks again @aaronrobertshaw!
removing the static doesn't appear to show any benefit either over the first commit.
we can do that @felixarntz but I would also doubt results from micro-benchmarks like that. it's too easy to construct synthetic benchmarks that remove the important context, such as the code that runs around this stuff. obviously if an impact is huge it won't matter, because it won't require creating an artificial test scenario that may not exist in production to prove a significant improvement.
well I ran these 10,000 times and while the distributions overlap there's also strong statistical evidence that the differences are not attributable to random variation alone. (edit: of note, I wasn't using …) for the micro-optimizations I would be happy to see evidence that shows they improve things, but I'm trusting my measurements on this over my intuition. to that end I think it's best if I focus on porting the tests into a Core ticket and let y'all do with this what you want because there are bigger performance penalties to chase.

so far, it seems like all of the evidence we have collected pushes one way, and the dubious hypothetical benefits come at a cost of more drastic code changes (and this is even a slightly different scenario than it would be if all that extra complexity demonstrated in any way that it was even marginally faster, which it isn't). my language might appear strong here, but I don't have any strong feelings on where you take this. I'm only sharing my experience and experiments and trying to be clear about where my recommendations are coming from.
curious to hear more of your thoughts on why you doubt the tooling, or why you doubt it after running statistical sampling on it. N = 10,000 is generally considered quite substantial, and in review of the raw data it looks fair between the branches, and the pattern bias that was shown to impact the results in the short term was spread out among the branches when I identified it and changed the experiment. I'd love to have a formalized way to compare the performance of the full site across branches, so I welcome your input on how to improve the setup.
FWIW I am happy with pretty much "whichever" version of this PR to commit it, since almost all feedback are performance nit-picks not worth blocking this. I would suggest though that we wait until #5418 has been committed first, so that there is additional test coverage that helps verify that this change is "safe to make".
For micro optimisations like this, profiling is much better than benchmarking to see the effect.
I guess this is a good conversation for a different place, but if a test of a realistic environment conflicts with profiling, why would we trust the profiling? we know that profiling itself impacts the runtime and can bias results, particularly any results where blocking time is a big part of the run. sampling profilers definitely help, but they still skew results. all I really want to do is not waste time hunting for a micro-optimization that nobody can demonstrate is better or worse. that's such a precarious path to take.
sounds great! I didn't realize we already had a Core ticket for those unit tests either. I'll turn this PR over to either of you, or whoever wants to push it through. in the meantime I'll try and follow the tests PR and see if I can add the additional requested tests. do you need anything else from me on this?
Just merged the latest trunk into this branch.
This looks good to commit in its current shape, and tests are passing. 🎉
I ran another benchmark against the TT4 home page, based on 1000 runs: With the PR, it took 85.56ms instead of 87.6ms in trunk (2.3% improvement) 👍
Committed in https://core.trac.wordpress.org/changeset/56807
Thanks @felixarntz - I added additional unit tests in #5446 |
Brings over an optimization surfaced during the WordPress 6.4 beta release for block supports elements, which entails skipping needless iteration when it's known that the iteration need not continue. See WordPress/wordpress-develop#5411. See [#59544-trac](https://core.trac.wordpress.org/ticket/59544).
Trac ticket: #59544
Previously there have been three ways in which the block supports elements code has been performing computation it doesn't need to.
- It checks for every possible style attribute on a block even if there are no possible style attributes available on the block.
- Once it determines that it needs to add a special class to a block it continues scanning through the rest of the attributes.
- It constructs dot-separated literal key paths for the attributes and immediately and repeatedly calls `explode()` on them to create an array at runtime.

The combination of these three factors leads to an outsized impact of this function on the runtime and memory pressure for what it's doing. This patch removes all three of these inefficiencies:
- If no elements style attributes exist on the block it immediately returns and skips all further processing.
- It stops scanning block attributes once a single one is found that imposes the requirement for the extra class name.
- It constructs array literals for the key path instead of strings. This removes the need to artificially join and explode the keys at runtime.
There should be no functional or visual changes in this patch, but it should reduce the overall runtime of page renders and require less memory while rendering. The impact will be proportional to the number of blocks rendered on a page.
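To make the third point concrete, a simplified before/after of the key-path change (illustrative only; the real code reads paths relative to the elements styles):

// Before: dot-separated string, exploded on every call.
$path  = 'style.elements.button.color.text';
$value = _wp_array_get( $block['attrs'], explode( '.', $path ), null );

// After: an array literal, no join/explode at runtime.
$path  = array( 'style', 'elements', 'button', 'color', 'text' );
$value = _wp_array_get( $block['attrs'], $path, null );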
Some probably-misleading benchmarks.

Summary
- theme: twentytwentyfour
- benchmark-server-timing: npm run research -- benchmark-server-timing -u 'http://localhost' -n 3000 -c 20 -p -o csv
- php -S 0.0.0.0:80 and mysql both running natively, no Docker or other indirection

xdebug
- php -d xdebug.mode=profile -d xdebug.use_compression=false -d xdebug.profiler_output_name=cachegrind.out.tt4-trunk -S 0.0.0.0:80
- php -S 0.0.0.0:80 and mysql both running natively, no Docker or other indirection

trunk