Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization for htmlspecialchars function #18126

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

ArtUkrainskiy
Copy link

@ArtUkrainskiy ArtUkrainskiy commented Mar 21, 2025

Optimization for htmlspecialchars function.

A dedicated php_htmlspecialchars function instead of the “universal” php_escape_html_entities_ex.
We work with ASCII-compatible encodings, we can employ byte-by-byte scanning and a lookup table to identify special characters. For c < 0x80, the lookup table is used; for potentially multi-byte characters, we continue to rely on get_next_char.
This approach provides a noticeable performance improvement for ASCII strings and some improvement for multi-byte strings due to more optimized logic.

A separate optimized validate_utf8_char function has been added for validating multi-byte UTF-8 characters. The optimization is achieved through more straightforward condition checks and the use of bitwise operations.

Important!
The entity handling logic has been updated to respect the LONGEST_ENTITY_LENGTH constant. Named entities longer than the defined maximum length are no longer processed.
The $double_encode = false flag now only applies to valid entities.
For numeric entities, the maximum length is limited to 10 characters. The largest valid value is &#x10FFFF;.
An entity like &#xFFFFFF; will still be parsed and its numeric value computed, but it won’t be escaped, as it's outside the valid Unicode range.
An entity like &#xFFFFFFF; (11 characters) will not be processed at all.

Benchmark branch - https://github.com/ArtUkrainskiy/php-src/tree/htmlspecialchars-benchmark

----------------------------------------------------------------------------------------
|                                Test |     old avg(ns) |     new avg(ns) |    diff(%) |
----------------------------------------------------------------------------------------
|                        Empty string |              63 |              42 |     50.00% |
----------------------------------------------------------------------------------------
|                              1 char |              64 |              78 |    -17.95% |
----------------------------------------------------------------------------------------
|                              4 char |              76 |              81 |     -6.17% |
----------------------------------------------------------------------------------------
|                              8 char |              93 |              86 |      8.14% |
----------------------------------------------------------------------------------------
|                     1000 spec. char |           14257 |            8449 |     68.74% |
----------------------------------------------------------------------------------------
|                       ASCII letters |            9647 |            3293 |    192.95% |
----------------------------------------------------------------------------------------
|                          Emoji UTF8 |           19212 |           14991 |     28.16% |
----------------------------------------------------------------------------------------
|                       Cyrillic UTF8 |           17028 |           11767 |     44.71% |
----------------------------------------------------------------------------------------
|                        Chinese UTF8 |           18220 |           14904 |     22.25% |
----------------------------------------------------------------------------------------
|                          Japan UTF8 |           18223 |           14880 |     22.47% |
----------------------------------------------------------------------------------------
|                     Cyrillic CP1251 |            9664 |            3858 |    150.49% |
----------------------------------------------------------------------------------------
|                        Chinese Big5 |           27433 |           24126 |     13.71% |
----------------------------------------------------------------------------------------
|                          Japan SJIS |           16125 |           16090 |      0.22% |
----------------------------------------------------------------------------------------
|         200 entities !double_decode |           12979 |            7499 |     73.08% |
----------------------------------------------------------------------------------------
|         800 entities !double_decode |           10363 |            9454 |      9.61% |
----------------------------------------------------------------------------------------

The main performance improvement comes from fast-path handling of ASCII bytes and single-byte encodings using a lookup table for special character detection.

The new validate_utf8_char function efficiently handles multi-byte UTF-8 characters.

Overall, the logic for character processing and validation has been improved.

Performance is lower for single-character strings due to the overhead of initializing the LUT.

In the benchmark, I tried to cover a variety of scenarios with different encodings and flags, but feel free to run it on your own data and share the results.

A dedicated php_htmlspecialchars function instead of the “universal”
php_escape_html_entities_ex.
We work with ASCII-compatible encodings, we can employ byte-by-byte scanning and a lookup
table to identify special characters. For c < 0x80, the lookup table is used; for
potentially multi-byte characters, we continue to rely on get_next_char.
This approach provides a noticeable performance improvement for ASCII strings and some
 improvement for multi-byte strings due to more optimized logic.
The new htmlspecialchars function respects the maximum entity size, defined as LONGEST_ENTITY_LENGTH.
There is no strict limit on the length of a numeric entity in the HTML and XML specifications,
but in practice the maximum possible is &#x10FFFF;, which takes up 10 characters.
Any numeric entities larger than this size are effectively invalid and will not be processed by browsers.
@ArtUkrainskiy ArtUkrainskiy marked this pull request as ready for review March 21, 2025 14:45
@ArtUkrainskiy ArtUkrainskiy requested a review from bukka as a code owner March 21, 2025 14:45
Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't really gone through the logic yet, but there are already things that don't really make sense.

Comment on lines 1563 to 1591
/* {{{ php_html_entities */
static void php_htmlspecialchars(INTERNAL_FUNCTION_PARAMETERS)
{
zend_string *str, *hint_charset = NULL;
zend_long flags = ENT_QUOTES|ENT_SUBSTITUTE;
zend_string *replaced;
bool double_encode = 1;

ZEND_PARSE_PARAMETERS_START(1, 4)
Z_PARAM_STR(str)
Z_PARAM_OPTIONAL
Z_PARAM_LONG(flags)
Z_PARAM_STR_OR_NULL(hint_charset)
Z_PARAM_BOOL(double_encode);
ZEND_PARSE_PARAMETERS_END();

replaced = php_htmlspecialchars_ex(
str, (int) flags,
hint_charset ? ZSTR_VAL(hint_charset) : NULL, double_encode, /* quiet */ 0);
RETVAL_STR(replaced);
}
/* }}} */

/* {{{ Convert special characters to HTML entities */
PHP_FUNCTION(htmlspecialchars)
{
php_html_entities(INTERNAL_FUNCTION_PARAM_PASSTHRU, 0);
php_htmlspecialchars(INTERNAL_FUNCTION_PARAM_PASSTHRU);
}
/* }}} */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you using a pass through function when there is only ever one use case now? Especially as you are adding a C function call overhead, which is weird for an optimization PR.

You should also "inline" the other usage of the php_html_entities pass through function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged in 80e7a41

echo htmlspecialchars('"""""""""""""""""""""""""""""""""""""""""""""&#x000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000005;',
echo htmlspecialchars('"""""""""""""""""""""""""""""""""""""""""""""&#x123456789123456789123456789;',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Author

@ArtUkrainskiy ArtUkrainskiy Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added support for the LONGEST_ENTITY_LENGTH constant to define the maximum length of an entity. While it originally applies to named entities, I think it also makes sense to use it to limit the length of numeric entities.

There’s no strict limit on numeric entity length in the HTML or XML specs, but in practice the longest valid one is &#x10FFFF;, which is 10 characters long — so there’s no point in scanning the string beyond that when looking for a semicolon.

Any numeric entities longer than that are effectively invalid and won’t be processed by browsers anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep this and just add an extra case? That's my main question.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see what you meant. I’ve reverted to the previous value and added an extra check. Now the original value in the test is no longer processed, and the numeric value of the entity is not computed. 914fa23

Comment on lines 77 to 81
/* Lookup table for php_htmlspecialchars */
typedef struct {
char* entity[256];
ushort entity_len[256];
} htmlspecialchars_lut;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this closer to the declaration of the function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 80e7a41

@@ -752,6 +758,60 @@ static zend_result resolve_named_entity_html(const char *start, size_t length, c
}
/* }}} */

/* {{{ is_codepoint_allowed */
static inline zend_bool is_codepoint_allowed(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static inline zend_bool is_codepoint_allowed(
static bool is_codepoint_allowed(

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 80e7a41

return unicode_cp_is_allowed(cp, doctype);
}

return 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return 1;
return true;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 80e7a41

size_t replacement_len = 0;
if (flags & (ENT_HTML_SUBSTITUTE_ERRORS | ENT_HTML_SUBSTITUTE_DISALLOWED_CHARS)) {
if (charset == cs_utf_8) {
replacement = (const unsigned char*)"\xEF\xBF\xBD";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the cast?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idk Fixed in 80e7a41

if (flags & (ENT_HTML_SUBSTITUTE_ERRORS | ENT_HTML_SUBSTITUTE_DISALLOWED_CHARS)) {
if (charset == cs_utf_8) {
replacement = (const unsigned char*)"\xEF\xBF\xBD";
replacement_len = sizeof("\xEF\xBF\xBD") - 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
replacement_len = sizeof("\xEF\xBF\xBD") - 1;
replacement_len = strlen("\xEF\xBF\xBD");

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we need to account for the fact that we’re replacing a single byte.

Comment on lines 1401 to 1402
replacement = (const unsigned char*)"&#xFFFD;";
replacement_len = sizeof("&#xFFFD;") - 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto prior remarks

/* Lookup table for php_htmlspecialchars */
typedef struct {
char* entity[256];
ushort entity_len[256];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ushort is not standard.

Suggested change
ushort entity_len[256];
uint8_t entity_len[256];

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 80e7a41

Comment on lines 1367 to 1368
/* {{{ php_htmlspecialchars */
PHPAPI zend_string* php_htmlspecialchars_ex(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a PHPAPI?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! I’ll go through all the comments and make the necessary changes. I don’t have much experience contributing to public PHP projects yet, so I might not be fully aware of all the conventions and best practices.

I was working off the existing code and reused some parts as-is. I plan to refactor php_html_entities later. The main goal here was to switch from a hashtable to a LUT for special character replacement, and to restructure the logic accordingly.

@ArtUkrainskiy ArtUkrainskiy marked this pull request as draft March 26, 2025 19:05
…ties. The $double_encode = false flag only applies to valid entities with a maximum length of 10 characters. The largest valid numeric entity is &#x10FFFF;.

An entity like &#xFFFFFF; will be parsed and its numeric value computed, but it won't be escaped, since it exceeds the valid Unicode range.

An entity like &#xFFFFFFF; (11 characters) will not be processed at all.
A separate optimized validate_utf8_char function is used for validating multi-byte UTF-8 characters. The optimization comes from using more straightforward conditional logic and bitwise operations for faster execution.
@ArtUkrainskiy ArtUkrainskiy requested a review from Girgias April 1, 2025 15:18
@ArtUkrainskiy ArtUkrainskiy marked this pull request as ready for review April 1, 2025 15:19
@bukka
Copy link
Member

bukka commented Apr 2, 2025

@ArtUkrainskiy The Windows failure seem related so please fix that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants