-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Normative: Update GetSubstitution to match reality #1732
Conversation
spec.html
Outdated
1. If _capture_ is *undefined*, replace the text through `>` with the empty string. | ||
1. Otherwise, replace the text through `>` with ? ToString(_capture_). | ||
</emu-alg> | ||
If _nn_ is `00` or the MV of _nn_ > _m_, the replacement is the string-concatenation of the replacement for the first two matched code units (i.e., a replacement specified by one of the two preceding rows) and the remaining matched code units (i.e., a single |DecimalDigit|). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If _nn_ is `00` or the MV of _nn_ > _m_, the replacement is the string-concatenation of the replacement for the first two matched code units (i.e., a replacement specified by one of the two preceding rows) and the remaining matched code units (i.e., a single |DecimalDigit|). | |
If _nn_ is `00` or the MV of _nn_ > _m_, the replacement is the string-concatenation of the replacement for the first two matched code units (i.e., a replacement specified by one of the two preceding rows) and the remaining matched code unit (i.e., a single |DecimalDigit|). |
("units" -> "unit")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to leave this, because the intent is to split the matched code units into two subsequences, replacing the first and preserving the second (which happens to be a single code unit, but is not required to be so for the logic to work).
spec.html
Outdated
1. Let _k_ be 0. | ||
1. Repeat, while _k_ < _replacementLength_ | ||
1. Let _replaceablePattern_ be the longest sequence of consecutive code units from _replacement_ beginning with the code unit at index _k_ such that _replaceablePattern_ matches the “Code units” column of a row in <emu-xref href="#table-45"></emu-xref>. | ||
1. If _replaceablePattern_ is not empty, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1. If _replaceablePattern_ is not empty, | |
1. If _replaceablePattern_ is not empty, then |
Also, this condition assumes that _replaceablePattern_
is empty if there's no match, but there's nothing that actually makes it so. If might be better to say If no such match is found, then
(and swap the two arms).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic itself makes it so. "The longest sequence of consecutive code units from replacement beginning with the code unit at index k…" is either zero code units or more than zero code units. If it is more than zero, then replaceablePattern is not empty and we will append the corresponding replacement text to result. Otherwise, we will append the code unit at index k.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The longest sequence of consecutive code units from replacement beginning with the code unit at index k…" is either zero code units or more than zero code units.
Based on that quote, it could be, but the part you elided is such that _replaceablePattern_ matches the "Code units" column of a row in [Table 53]
, and an empty sequence of code units does not match the "Code units" column of any row in the table.
spec.html
Outdated
If the MV of _n_ > _m_, the replacement is the matched code units (equivalent to no replacement). | ||
<br> | ||
Otherwise, if the element in _captures_ at index equal to the MV of _n_ minus 1 is *undefined*, the replacement is the empty String. | ||
<br> | ||
Otherwise, the replacement is the element in _captures_ at index equal to the MV of _n_ minus 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you were to use an <emu-alg>
, that would allow (something like) Let _nmv_ be the MV of _n_
, which you could then use at 3 points.
It would also let you replace the element in _captures_ at index equal to the MV of _n_ minus 1
with _captures_[_nmv_ - 1]
.
For actual names, I think I'd recommend _N_
for the code unit (which matches the variable in column 1), and then _n_
for its MV.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered <emu-alg>, but ultimately preferred the appearance without it: https://deploy-preview-1732--ecma262-snapshots.netlify.com/#table-45 (and I'd get rid of the <emu-alg> for $<…>
if I could find a good way to do so). As always, though, I'm willing to change if there's consensus.
As for N vs. n, we must use the latter because the former is not defined as the expansion of a lexical grammar production and thus has no MV.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for N vs. n, we must use the latter because the former is not defined as the expansion of a lexical grammar production and thus has no MV.
Well, of course, you'd need to change the name in column 2 as well. Sorry that wasn't clear. Anyhow, the point there was to free up the name _n_
for the MV of _N_
, but that isn't needed if you don't convert to <emu-alg>
.
spec.html
Outdated
<br> | ||
Otherwise, if the element in _captures_ at index equal to the MV of _nn_ minus 1 is *undefined*, the replacement is the empty String. | ||
<br> | ||
Otherwise, the replacement is the element in _captures_ at index equal to the MV of _nn_ minus 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto above comment re using <emu-alg>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, @jmdyck. I've updated the PR with some of your suggestions and added explanations for why I didn't accept the others.
spec.html
Outdated
1. Let _k_ be 0. | ||
1. Repeat, while _k_ < _replacementLength_ | ||
1. Let _replaceablePattern_ be the longest sequence of consecutive code units from _replacement_ beginning with the code unit at index _k_ such that _replaceablePattern_ matches the “Code units” column of a row in <emu-xref href="#table-45"></emu-xref>. | ||
1. If _replaceablePattern_ is not empty, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic itself makes it so. "The longest sequence of consecutive code units from replacement beginning with the code unit at index k…" is either zero code units or more than zero code units. If it is more than zero, then replaceablePattern is not empty and we will append the corresponding replacement text to result. Otherwise, we will append the code unit at index k.
spec.html
Outdated
1. If _capture_ is *undefined*, replace the text through `>` with the empty string. | ||
1. Otherwise, replace the text through `>` with ? ToString(_capture_). | ||
</emu-alg> | ||
If _nn_ is `00` or the MV of _nn_ > _m_, the replacement is the string-concatenation of the replacement for the first two matched code units (i.e., a replacement specified by one of the two preceding rows) and the remaining matched code units (i.e., a single |DecimalDigit|). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to leave this, because the intent is to split the matched code units into two subsequences, replacing the first and preserving the second (which happens to be a single code unit, but is not required to be so for the logic to work).
spec.html
Outdated
If the MV of _n_ > _m_, the replacement is the matched code units (equivalent to no replacement). | ||
<br> | ||
Otherwise, if the element in _captures_ at index equal to the MV of _n_ minus 1 is *undefined*, the replacement is the empty String. | ||
<br> | ||
Otherwise, the replacement is the element in _captures_ at index equal to the MV of _n_ minus 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered <emu-alg>, but ultimately preferred the appearance without it: https://deploy-preview-1732--ecma262-snapshots.netlify.com/#table-45 (and I'd get rid of the <emu-alg> for $<…>
if I could find a good way to do so). As always, though, I'm willing to change if there's consensus.
As for N vs. n, we must use the latter because the former is not defined as the expansion of a lexical grammar production and thus has no MV.
$nn patterns fall back to $n when there aren't at least nn captures Fixes tc39gh-1426
87592e6
to
e5941f0
Compare
…replace] https://bugs.webkit.org/show_bug.cgi?id=204490 Reviewed by Mark Lam. JSTests: * stress/replace-indexed-backreferences.js: Added. * test262/expectations.yaml: Mark four test cases as passing. Source/JavaScriptCore: String.prototype.replace and RegExp.prototype[Symbol.replace] are meant to perform the same substitution of $-backreferences (called GetSubstitution in the spec: https://tc39.es/ecma262/#sec-getsubstitution). The implementation of this in StringPrototype.cpp is correct but the one in RegExpPrototype.js is not. In particular, the latter *removes* backreferences with out-of-range indices, instead of leaving them as-is. One thing that is *not* broken in either implementation and thus maintained here is the fact $10 is interpreted as $1 followed by a 0 when we have 1 <= n < 10 captures (and analogously for other invalid $nn backreferences). This behavior is consistent across all engines but currently described incorrectly in the spec; this patch thus aligns with the spec PR currently open to correct this (tc39/ecma262#1732). * builtins/RegExpPrototype.js: (getSubstitution): Ensure that invalid backreferences remain untouched in the output string. (replace): Fix off-by-one error when populating captures list. We shouldn't be reserving a slot for the full match. git-svn-id: http://svn.webkit.org/repository/webkit/trunk@252836 268f45cc-cd09-0410-ab3c-d52691b4dbfc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks editorially correct to me (and I'm in favor of it on that basis), but I don't understand why it's normative. I thought the no-substitution case was covered by clauses like, "If n > m, no replacement is done." Edit: And seems like a critical fix, given how confusing my previous wording was. LGTM!
is this still a thing? |
I believe so - it needs test262 tests, a rebase, and an updated review. |
I'm planning to rebase and update this PR after #2021 lands. |
@gibson042 #2021 has landed, do you want to rebase this? |
Yep, it should be done this week. |
…replace] https://bugs.webkit.org/show_bug.cgi?id=204490 Reviewed by Mark Lam. JSTests: * stress/replace-indexed-backreferences.js: Added. * test262/expectations.yaml: Mark four test cases as passing. Source/JavaScriptCore: String.prototype.replace and RegExp.prototype[Symbol.replace] are meant to perform the same substitution of $-backreferences (called GetSubstitution in the spec: https://tc39.es/ecma262/#sec-getsubstitution). The implementation of this in StringPrototype.cpp is correct but the one in RegExpPrototype.js is not. In particular, the latter *removes* backreferences with out-of-range indices, instead of leaving them as-is. One thing that is *not* broken in either implementation and thus maintained here is the fact $10 is interpreted as $1 followed by a 0 when we have 1 <= n < 10 captures (and analogously for other invalid $nn backreferences). This behavior is consistent across all engines but currently described incorrectly in the spec; this patch thus aligns with the spec PR currently open to correct this (tc39/ecma262#1732). * builtins/RegExpPrototype.js: (getSubstitution): Ensure that invalid backreferences remain untouched in the output string. (replace): Fix off-by-one error when populating captures list. We shouldn't be reserving a slot for the full match. Canonical link: https://commits.webkit.org/217811@main git-svn-id: https://svn.webkit.org/repository/webkit/trunk@252836 268f45cc-cd09-0410-ab3c-d52691b4dbfc
3d0c24c
to
7a79833
Compare
I think this already has tests, actually, or at least one test, as pointed out in the original issue. |
Replaced by #3157. |
THIS PR HAS BEEN REPLACED BY #3157
$nn patterns fall back to $n when there aren't at least nn captures
Fixes gh-1426
Preview: https://deploy-preview-1732--ecma262-snapshots.netlify.com/#sec-getsubstitution