Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHP 8.3 | Tokenizer/PHP: support "yield from" with comments #647

Merged
merged 8 commits into from
Nov 2, 2024

Conversation

jrfnl
Copy link
Member

@jrfnl jrfnl commented Oct 25, 2024

Description

PHP 8.3 | Tokenizer/PHP: support "yield from" with comments

As discussed in and discovered via issue #529:

  • Prior to PHP 8.3, only whitespace was allowed between the yield and from keywords. A comment between the yield and from keywords in a yield from expression would result in a parse error.
  • As of PHP 8.3, this is no longer a parse error and both whitespace as well as comments are allowed between the yield and from and the complete expression is tokenized in PHP itself as one T_YIELD_FROM token. See: https://3v4l.org/2SI2Q#veol
    In the context of PHPCS this is problematic as comments should always have their own token to allow sniffs to examine them.
    Additionally, such comments may contain PHPCS ignore annotations, which, when not tokenized as a separate token, will not be respected.

This commit adds support for this change in PHP 8.3 to PHP_CodeSniffer. It does contain an, albeit small, BC-break, due to the BC-break created by PHP.

Previously in PHPCS:

  • A single line yield from expression would always tokenize as T_YIELD_FROM, independently of the type and amount of whitespace between the keywords.
  • A multi-line yield [new line]+ from expression would tokenize as multiple T_YIELD_FROM tokens, one for each line.
  • A yield from expression with a comment between the keywords was not supported.
    In PHP < 8.3, this meant that this would tokenize as T_YIELD, [T_WHITESPACE|T_COMMENT]+, T_STRING (from).
    As of PHP 8.3, this was tokenized as one or more T_YIELD_FROM tokens (depending on single/multi-line) with the comment being tokenized as T_YIELD_FROM as well.

This commit changes this as follows:

  • Single line yield from expression with only whitespace between the keywords: no change, this will still tokenize as a single T_YIELD_FROM token.
  • Multi-line yield [new line]+ from expressions and yield from expressions with a comment (both single line as well as multi-line) will now consistently be tokenized as T_YIELD_FROM (yield), [T_WHITESPACE|T_COMMENT]+, T_YIELD_FROM (from).

In practice, this means that:

  • Whitespace and comments between the keywords can now be examined and handled by relevant sniffs, which are likely to give more accurate results (fewer false negatives, like for tab indentation of a from keyword on a separate line).
  • The tokenization used by PHPCS is now consistent again for all supported PHP versions.
  • The PHP 8.3 change is now supported.

It does mean that sniffs which explicitly handle multi-token yield from expressions, will need to be updated.

In my opinion, adding this change in a minor is justified as:

  1. The PHP 8.3 change can not be supported otherwise.
  2. The impact is expected to be minimal anyhow as there are not many sniffs which specifically look for and handle T_YIELD_FROM tokens and those sniffs within PHPCS itself will be updated/adjusted in the same release.

Also, the (negative) impact on end-users of this BC-break is also expected to be minimal as a scan of the top 2000 projects listed on Packagist shows that in those project no multi-line/multi-token yield from expressions are used in the source code, which means that even when sniff code is not updated (yet) for the change in tokenization, the chances of an end-user getting incorrect results because of this are very slim as the code affected is just not written as multi-line/with comment that often.

Includes tests.

Fixes #529

Refs:


Information for standards maintainers

The "yield from" keyword could previously already consist of multiple T_YIELD_FROM tokens if the "keyword" was spread over multiple lines.
Now, the tokens between the actual keywords will be tokenized as T_WHITESPACE and comment tokens.

To find the last token for a T_YIELD_FROM "keyword", change old code like this:

$yieldFromEnd = $stackPtr;
if (preg_match('`yield\s+from`', $tokens[$stackPtr]['content']) !== 1) {
    for ($yieldFromEnd = ($stackPtr + 1); $tokens[$yieldFromEnd]['code'] === T_YIELD_FROM; $yieldFromEnd++);
    --$yieldFromEnd;
}

to

$yieldFromEnd = $stackPtr;
if (strtolower(trim($tokens[$stackPtr]['content'])) === 'yield') {
    for ($i = ($stackPtr + 1); $i < $phpcsFile->numTokens; $i++) {
        if ($tokens[$i]['code'] === T_YIELD_FROM && strtolower(trim($tokens[$i]['content'])) === 'from') {
            $yieldFromEnd = $i;
            break;
        }

        if (isset(Tokens::$emptyTokens[$tokens[$i]['code']]) === false && $tokens[$i]['code'] !== T_YIELD_FROM) {
            // Shouldn't be possible. Just to be on the safe side.
            break;
        }
    }
}

The above presumes that $stackPtr is set to a T_YIELD_FROM token.
Also note that the second code snippet is largely cross-version compatible. It will work with older PHPCS versions with code compatible with PHP < 8.3 and will work on PHPCS 3.11.0+ for code compatible with all supported PHP versions.

Tokenizer: apply tab replacement to "yield from"

The yield from keyword(s) can be separated by spaces, tabs, new lines and since PHP 8.3, even comments.

The PHPCS Tokenizer previously did not execute tab replacement on this token leading to unexpected 'content' and incorrect 'length' values in the File::$tokens array, which in turn could lead to incorrect sniff results and incorrect fixes.

Previously, this affected all T_YIELD_FROM tokens. After the change to support PHP 8.3 "yield ... comment... from" syntax, this now only affects single line yield from expressions where the keywords are separated by a tab or a mix of tabs and spaces.

All the same, consistency is key, so this commit adds the T_YIELD_FROM token to the array of tokens for which to do tab replacement, which should make them more consistent with the rest of PHPCS.

Includes unit tests safeguarding this change.

Generic/LanguageConstructSpacing: update for changed "yield from" tokenization

This commit updates the Generic.WhiteSpace.LanguageConstructSpacing sniff to recognize that a multi-token yield from may now have whitespace and comment tokens between the keywords.

Generic/LanguageConstructSpacing: don't auto-fix 'yield from' with comment

If there is a comment between the yield and the from keywords in a yield from expression, the sniff should report this, but should not auto-fix it, as it cannot be determined where the comment should be moved to.

Thic commit updates the sniff to throw a non-autofixable error with error code IncorrectYieldFromWithComment for that situation.

Includes unit tests.

Generic/LowerCaseKeyword: add extra tests with "yield from"

... to safeguard that a "yield from" keyword split into multiple tokens is still handled correctly.

Generic/ScopeIndent: add extra test with multi-token "yield from"

This confirms that the Tokenizer changes related to yield from tokenization which were made in a previous commit also fix issue squizlabs/PHP_CodeSniffer#3808, as reported upstream.

Closes squizlabs/PHP_CodeSniffer#3808

Includes moving the phpcs:set directives in test case file 3 to inside the PHP tags to make the tests reproducable when doing a plain phpcs run against the file.

Generic/DisallowTabIndent: improve handling of yield from

This commit on the one hand confirms (via a test) that indentation for a multi-line yield from expression is now handled correctly by the sniff.

On the other hand, this commit enables checking for tabs being used for inline "alignment" between the yield and the from keyword.
This will now also be handled (and auto-fixed) by this sniff. Includes test.

Generic/DisallowSpaceIndent: add test with multi-line yield from

As the indentation whitespace in a multi-line yield from expression is now tokenized as T_WHITESPACE, instead of being tokenized as part of the (second) T_YIELD_FROM token, space indentation of such a line will now be handled correctly by the sniff.

This commit just adds a test to confirm and safeguard this.

Suggested changelog entry

  • Tokenizer support for PHP 8.3 yield from expressions with a comment between the keywords.
    • Sniffs explicitly handling T_YIELD_FROM tokens may need updating. See the PR description for guidance.
    • Additionally, the following sniff has been updated to support yield from expressions with comments:
      • Generic.WhiteSpace.LanguageConstructSpacing
  • Fixed: Tokenizer not applying tab replacement in single token yield from keywords.
  • Fixed 3808: Generic.WhiteSpace.ScopeIndent would throw false positive for tab indented multi-token yield from expression.
  • Fixed: Generic.WhiteSpace.DisallowTabIndent did not flag tabs inside yield from

jrfnl added 8 commits November 2, 2024 10:12
As discussed in and discovered via issue 529:
* Prior to PHP 8.3, only whitespace was allowed between the `yield` and `from` keywords. A comment between the `yield` and `from` keywords in a `yield from` expression would result in a parse error.
* As of PHP 8.3, this is no longer a parse error and both whitespace as well as comments are allowed between the `yield` and `from` and the complete expression is tokenized in PHP itself as one `T_YIELD_FROM` token. See: https://3v4l.org/2SI2Q#veol
    In the context of PHPCS this is problematic as comments should always have their own token to allow sniffs to examine them.
    Additionally, such comments may contain PHPCS ignore annotations, which, when not tokenized as a separate token, will not be respected.

This commit adds support for this change in PHP 8.3 to PHP_CodeSniffer. It does contain an, albeit small, BC-break, due to the BC-break created by PHP.

Previously in PHPCS:
* A single line `yield from` expression would always tokenize as `T_YIELD_FROM`, independently of the type and amount of whitespace between the keywords.
* A multi-line `yield` [new line]+ `from` expression would tokenize as multiple `T_YIELD_FROM` tokens, one for each line.
* A `yield from` expression with a comment between the keywords was not supported.
    In PHP < 8.3, this meant that this would tokenize as `T_YIELD`, [`T_WHITESPACE`|T_COMMENT`]+, `T_STRING` (`from`).
    As of PHP 8.3, this was tokenized as one or more `T_YIELD_FROM` tokens (depending on single/multi-line) with the comment being tokenized as `T_YIELD_FROM` as well.

This commit changes this as follows:
* Single line `yield from` expression with only whitespace between the keywords: **no change**, this will still tokenize as a single `T_YIELD_FROM` token.
* Multi-line `yield` [new line]+ `from` expressions and `yield from` expressions with a comment (both single line as well as multi-line) will now consistently be tokenized as `T_YIELD_FROM` (`yield`), [`T_WHITESPACE`|T_COMMENT`]+, `T_YIELD_FROM` (`from`).

In practice, this means that:
* Whitespace and comments between the keywords can now be examined and handled by relevant sniffs, which are likely to give more accurate results (fewer false negatives, like for tab indentation of a `from` keyword on a separate line).
* The tokenization used by PHPCS is now consistent again for all supported PHP versions.
* The PHP 8.3 change is now supported.

It does mean that sniffs which explicitly handle multi-token `yield from` expressions, will need to be updated.

In my opinion, adding this change in a minor is justified as:
1. The PHP 8.3 change can not be supported otherwise.
2. The impact is expected to be minimal anyhow as there are not many sniffs which specifically look for and handle `T_YIELD_FROM` tokens and those sniffs within PHPCS itself will be updated/adjusted in the same release.

Also, the (negative) impact on _end-users_ of this BC-break is also expected to be minimal as a scan of the top 2000 projects listed on Packagist shows that in those project no multi-line/multi-token `yield from` expressions are used in the source code, which means that even when sniff code is not updated (yet) for the change in tokenization, the chances of an end-user getting incorrect results because of this are very slim as the code affected is just not written as multi-line/with comment that often.

Includes tests.

Fixes 529

Refs:
* squizlabs/PHP_CodeSniffer 1524 (original polyfill code)
* php/php-src 10125
* php/php-src 14926
* https://externals.io/message/124462

---

Information for standards maintainers

The "yield from" _keyword_ could previously already consist of multiple T_YIELD_FROM tokens if the "keyword" was spread over multiple lines.
Now, the tokens between the actual keywords will be tokenized as `T_WHITESPACE` and comment tokens.

To find the last token for a `T_YIELD_FROM` "keyword", change old code like this:
```php
$yieldFromEnd = $stackPtr;
if (preg_match('`yield\s+from`', $tokens[$stackPtr]['content']) !== 1) {
    for ($yieldFromEnd = ($stackPtr + 1); $tokens[$yieldFromEnd]['code'] === T_YIELD_FROM; $yieldFromEnd++);
    --$yieldFromEnd;
}
```
to
```php
$yieldFromEnd = $stackPtr;
if (strtolower(trim($tokens[$stackPtr]['content'])) === 'yield') {
    for ($i = ($stackPtr + 1); $i < $phpcsFile->numTokens; $i++) {
        if ($tokens[$i]['code'] === T_YIELD_FROM && strtolower(trim($tokens[$i]['content'])) === 'from') {
            $yieldFromEnd = $i;
            break;
        }

        if (isset(Tokens::$emptyTokens[$tokens[$i]['code']]) === false && $tokens[$i]['code'] !== T_YIELD_FROM) {
            // Shouldn't be possible. Just to be on the safe side.
            break;
        }
    }
}
```

The above presumes that `$stackPtr` is set to a `T_YIELD_FROM` token.
Also note that the second code snippet is largely cross-version compatible. It will work with older PHPCS versions with code compatible with PHP < 8.3 and will work on PHPCS 3.11.0+ for code compatible with all supported PHP versions.
The `yield from` keyword(s) can be separated by spaces, tabs, new lines and since PHP 8.3, even comments.

The PHPCS `Tokenizer` previously did not execute tab replacement on this token leading to unexpected `'content'` and incorrect `'length'` values in the `File::$tokens` array, which in turn could lead to incorrect sniff results and incorrect fixes.

Previously, this affected all `T_YIELD_FROM` tokens. After the change to support PHP 8.3 "yield ... comment... from" syntax, this now only affects single line `yield from` expressions where the keywords are separated by a tab or a mix of tabs and spaces.

All the same, consistency is key, so this commit adds the `T_YIELD_FROM` token to the array of tokens for which to do tab replacement, which should make them more consistent with the rest of PHPCS.

Includes unit tests safeguarding this change.
…enization

This commit updates the `Generic.WhiteSpace.LanguageConstructSpacing` sniff to recognize that a multi-token `yield from` may now have whitespace and comment tokens between the keywords.
…mment

If there is a comment between the `yield` and the `from` keywords in a `yield from` expression, the sniff should report this, but should not auto-fix it, as it cannot be determined where the comment should be moved to.

Thic commit updates the sniff to throw a non-autofixable error with error code `IncorrectYieldFromWithComment` for that situation.

Includes unit tests.
... to safeguard that a "yield from" keyword split into multiple tokens is still handled correctly.
This confirms that the Tokenizer changes related to `yield from` tokenization which were made in a previous commit also fix issue squizlabs/PHP_CodeSniffer 3808, as reported upstream.

Closes squizlabs/PHP_CodeSniffer 3808

Includes moving the `phpcs:set` directives in test case file 3 to inside the PHP tags to make the tests reproducable when doing a plain `phpcs` run against the file.
This commit on the one hand confirms (via a test) that indentation for a multi-line `yield from` expression is now handled correctly by the sniff.

On the other hand, this commit enables checking for tabs being used for inline "alignment" between the `yield` and the `from` keyword.
This will now also be handled (and auto-fixed) by this sniff. Includes test.
As the indentation whitespace in a multi-line `yield from` expression is now tokenized as `T_WHITESPACE`, instead of being tokenized as part of the (second) `T_YIELD_FROM` token, space indentation of such a line will now be handled correctly by the sniff.

This commit just adds a test to confirm and safeguard this.
@jrfnl
Copy link
Member Author

jrfnl commented Nov 2, 2024

Rebasing without changes. Merging once the build passes.

@jrfnl jrfnl force-pushed the feature/529-tokenizer-yield-from-tokenization-review branch from f3985c7 to fc6d98a Compare November 2, 2024 09:14
@jrfnl jrfnl merged commit c2b6235 into master Nov 2, 2024
50 checks passed
@jrfnl jrfnl deleted the feature/529-tokenizer-yield-from-tokenization-review branch November 2, 2024 09:58
@jrfnl jrfnl mentioned this pull request Nov 26, 2024
9 tasks
wmfgerrit pushed a commit to wikimedia/mediawiki-tools-codesniffer that referenced this pull request Jan 27, 2025
Update AssignmentInReturnSniff for the BC break in
PHPCSStandards/PHP_CodeSniffer#647.

Bug: T379674
Change-Id: I01009b858466b7c678c799d8c99207f3395b7a5d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant