Reorganize and clean up readme flow
domenic committed Jul 2, 2015
1 parent b09f7b5 commit b1aa0e2
Showing 1 changed file with 40 additions and 56 deletions.
96 changes: 40 additions & 56 deletions README.md
@@ -1,31 +1,22 @@
# `RegExp.escape` Proposal

Proposal for adding a `RegExp.escape` method to the ECMAScript standard http://benjamingr.github.io/RegExp.escape/.
Proposal for adding a `RegExp.escape` method to the ECMAScript standard.

## Status

This proposal is a [stage 0 (strawman) proposal](https://docs.google.com/document/d/1QbEE0BsO4lvl7NFTn5WXWeiEIBfaVUF7Dk0hpPpPDzU/edit#) and is awaiting implementation and more input. Please see [the issues](https://github.com/benjamingr/RegExp.escape/issues) on how to get involved.
[Formal specification](http://benjamingr.github.io/RegExp.escape/)

## Motivation
## Status

See [this issue](https://esdiscuss.org/topic/regexp-escape). It is often the case when we want to build a regular expression out of a string without treating special characters from the string as special regular expression tokens. For example if we want to replace all occurrences of the string `Hello.` which we got from the user we might be tempted to do `ourLongText.replace(new RegExp(text, "g"))` but this would match `.` against any character rather than a dot.
This proposal is a [stage 0 (strawman) proposal](https://docs.google.com/document/d/1QbEE0BsO4lvl7NFTn5WXWeiEIBfaVUF7Dk0hpPpPDzU/edit#) and is awaiting implementation and more input. Please see [the issues](https://github.com/benjamingr/RegExp.escape/issues) on how to get involved.

This is a fairly common use in regular expressions and standardizing it would be useful.

In other languages:
## Motivation

- Perl: quotemeta(str) - see [the docs](http://perldoc.perl.org/functions/quotemeta.html)
- PHP: preg_quote(str) - see [the docs](http://php.net/manual/en/function.preg-quote.php)
- Python: re.escape(str) - see [the docs](https://docs.python.org/3/library/re.html#re.escape)
- Ruby: Regexp.escape(str) - see [the docs](http://ruby-doc.org/core-2.2.0/Regexp.html#method-c-escape)
- Java: Pattern.quote(str) - see [the docs](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#quote(java.lang.String))
- C#, VB.NET: Regex.Escape(str) - see [the docs](https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.escape.aspx)
It is often the case that we want to build a regular expression out of a string without treating special characters from the string as special regular expression tokens. For example, if we want to replace all occurrences of the string `Hello.`, which we got from the user as `let text = "Hello."`, we might be tempted to do `ourLongText.replace(new RegExp(text, "g"))`. However, this would match `.` against any character rather than only against a literal dot.

Note that the languages differ in what they do - (perl does something different from C#) but they all have the same goal.
This is commonly desired functionality, as can be seen from [this years-old es-discuss thread](https://esdiscuss.org/topic/regexp-escape). Standardizing it would be very useful to developers, and would spare them from writing subpar implementations that miss edge cases.
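
As a minimal sketch of the problem described above (the sample text and the hand-written escaping regex are illustrative assumptions, not part of the proposal), the following shows how the unescaped `.` over-matches, and how a typical hand-rolled escape works around it:

```js
// Text taken from the user; the trailing "." is meant literally.
const text = "Hello.";
const ourLongText = "Hello. Hello! Hello.";

// Naive approach: "." means "any character", so "Hello!" is replaced too.
const naive = ourLongText.replace(new RegExp(text, "g"), "Bye");
console.log(naive); // "Bye Bye Bye"

// Hand-rolled escaping (hypothetical helper logic, the kind of code
// RegExp.escape is meant to make unnecessary): prefix each regular
// expression syntax character with a backslash.
const escaped = text.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
const literal = ourLongText.replace(new RegExp(escaped, "g"), "Bye");
console.log(literal); // "Bye Hello! Bye"
```

Hand-written character classes like the one above are easy to get subtly wrong, which is the edge-case risk that standardization would remove.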

We've had [a meeting about it](https://github.com/benjamingr/RegExp.escape/blob/master/data/other_languages/discussions.md) including a more detailed wrap up of what other languages do and the pros and cons.

## Proposed Solution
## Proposed solution and usage examples

We propose the addition of a `RegExp.escape` function, such that strings can be escaped in order to be used inside regular expressions:

@@ -35,63 +26,56 @@ str = RegExp.escape(str);
alert(ourLongText.replace(new RegExp(str, "g"))); // handles reg exp special tokens with the replacement.
```
There is initial previous work here: https://gist.github.com/kangax/9698100 which includes valuable work we've used. Unlike that proposal, this one uses the spec's `SyntaxCharacter` list of characters, so updates stay in sync with the specification instead of specifying the escaped characters manually.
```js
RegExp.escape("The Quick Brown Fox"); // "The Quick Brown Fox"
RegExp.escape("Buy it. use it. break it. fix it.") // "Buy it\. use it\. break it\. fix it\."
RegExp.escape("(*.*)"); // "\(\*\.\*\)"
RegExp.escape("。^・ェ・^。") // "。\^・ェ・\^。"
RegExp.escape("😊 *_* +_+ ... 👍"); // "😊 \*_\* \+_\+ \.\.\. 👍"
RegExp.escape("\d \D (?:)"); // "\\d \\D \(\?\:\)"
```
## Cross-Cutting Concerns
## Cross-cutting concerns
The list of escaped identifiers should be kept in sync with what the regular expressions grammar considers to be syntax characters that need escaping - for this reason instead of hard-coding the list of escaped characters we escape characters that are recognized as a `SyntaxCharacter`s by the engine. For example, if regex comments are ever added to the specification (presumably under a flag) - this ensures they are properly escaped.
The list of escaped identifiers should be kept in sync with what the regular expression grammar considers to be syntax characters that need escaping. For this reason, instead of hard-coding the list of escaped characters, we escape characters that are recognized as `SyntaxCharacter`s by the engine. For example, if regexp comments are ever added to the specification (presumably under a flag), this ensures that they are properly escaped.
## FAQ
* **What about `"/"`?**
## In other languages
Empirical data has been collected (see the /data folder) from about a hundred thousand code bases (most popular sites, most popular packages, most depended on packages and Q&A sites) and it was found out that its use case (for `eval`) was not common enough to justify addition.
- Perl: [quotemeta(str)](http://perldoc.perl.org/functions/quotemeta.html)
- PHP: [preg_quote(str)](http://php.net/manual/en/function.preg-quote.php)
- Python: [re.escape(str)](https://docs.python.org/3/library/re.html#re.escape)
- Ruby: [Regexp.escape(str)](http://ruby-doc.org/core-2.2.0/Regexp.html#method-c-escape)
- Java: [Pattern.quote(str)](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#quote(java.lang.String))
- .NET: [Regex.Escape(str)](https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.escape.aspx)
* **Why not escape every character?**
Note that the languages differ in what they do (e.g. Perl does something different from C#), but they all have the same goal.
Other languages that have done this regretted this choice because of the readability impact and string size. More information on why other languages have moved from this in the data folder under other_languages.
We've had [a meeting about this subject](https://github.com/benjamingr/RegExp.escape/blob/master/data/other_languages/discussions.md), whose notes include a more detailed writeup of what other languages do, and the pros and cons thereof.
* **How is unicode handled?**
This proposal deals with code points and not code units so further extensions and dealing with unicode is done.
## FAQ
* **Why don't you do X?**
* **Why not escape every character?**
If you believe there is a concern that was not addressed yet - please [open an issue](https://github.com/benjamingr/RexExp.escape/issues).
Other languages that have done this regretted the choice because of its impact on readability and string size. More information on why other languages have moved away from this approach is in the data folder under /other_languages.
* **What about `unescape`?**
* **What about the `/` character?**
While some other languages provide an unescape method we choose to defer discussion about it to a later point, mainly because no evidence of people asking for it has been found (while `.escape` is commonly asked for).
Empirical data has been collected (see the /data folder) from about a hundred thousand code bases (most popular sites, most popular packages, most depended-on packages and Q&A sites), and it was found that the use case for escaping `/` (for `eval`) was not common enough to justify its addition.
* **What about EscapeRegExpString?**
EscapeRegExpPattern (as the name implies) takes a pattern and escapes it so that it can be represented as a string. What `RegExp.escape` does is take a string and escapes it so it can be literally represented as a pattern. The two do not need to share an escaped set and we can't use one for the other. We're discussing renaming EscapeRegExpString in the spec in the future to avoid confusion for readers.
* **How is Unicode handled?**
This proposal operates on code points rather than code units, so Unicode strings are handled correctly and future extensions remain possible.
## Semantics
* **What about `RegExp.unescape`?**
### RegExp.escape(S)
While some other languages provide an unescape method, we choose to defer discussion of it to a later point, mainly because we have found no evidence of people asking for it (while `RegExp.escape` is commonly asked for).
When the **escape** function is called with an argument *S* the following steps are taken:
* **How does this relate to EscapeRegExpPattern?**
1. Let *str* be [ToString](http://people.mozilla.org/~jorendorff/es6-draft.html#sec-tostring)(*S*).
2. [ReturnIfAbrupt](http://people.mozilla.org/~jorendorff/es6-draft.html#sec-returnifabrupt)(*str*).
3. Let *cpList* be a [List](http://people.mozilla.org/~jorendorff/es6-draft.html#sec-list-and-record-specification-type) containing in order the code points as defined in [6.1.4](http://people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-string-type) of *str*, starting at the first element of *str*.
4. Let *cuList* be a new [List](http://people.mozilla.org/~jorendorff/es6-draft.html#sec-list-and-record-specification-type).
5. For each code point *c* in *cpList* in List order, do:
   1. If **c** is matched by [*SyntaxCharacter*](http://people.mozilla.org/~jorendorff/es6-draft.html#sec-patterns) then do:
      1. Append code unit 0x005C (REVERSE SOLIDUS) to *cuList*.
   2. Append the elements of the UTF16Encoding (10.1.1) of *c* to *cuList*.
6. Let **L** be a String whose elements are, in order, the elements of *cuList*.
7. Return **L**.
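
The steps above can be approximated in plain JavaScript. This is only a rough sketch under stated assumptions, not the proposal's reference implementation: the name `escapeSketch` is hypothetical, and the set of characters is hard-coded from the ES2015 `SyntaxCharacter` production (`^ $ \ . * + ? ( ) [ ] { } |`), whereas the specification text defers to the grammar rather than a fixed list.

```js
// Assumption: the SyntaxCharacter production as of ES2015, hard-coded here
// purely for illustration.
const SYNTAX_CHARACTERS = new Set("^$\\.*+?()[]{}|");

// Hypothetical helper mirroring the spec steps; not the real RegExp.escape.
function escapeSketch(S) {
  // Steps 1-2: ToString(S). A template literal applies ToString and throws
  // for values that cannot be converted, such as Symbols.
  const str = `${S}`;

  // Steps 3-6: string iteration yields code points (surrogate pairs stay
  // together), so each code point is appended whole, prefixed with a
  // backslash when it is a syntax character.
  let result = "";
  for (const c of str) {
    if (SYNTAX_CHARACTERS.has(c)) {
      result += "\\";
    }
    result += c;
  }

  // Step 7: return the assembled string.
  return result;
}

console.log(escapeSketch("(*.*)"));               // \(\*\.\*\)
console.log(escapeSketch("😊 *_* +_+ ... 👍"));   // 😊 \*_\* \+_\+ \.\.\. 👍
console.log("Hello. Hello!".replace(new RegExp(escapeSketch("Hello."), "g"), "Bye")); // Bye Hello!
```

Note that `:` is not in that list, so this sketch leaves it unescaped.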
EscapeRegExpPattern (as the name implies) takes a pattern and escapes it so that it can be represented as a string. `RegExp.escape` instead takes a string and escapes it so that it can be represented literally as a pattern. The two do not need to share an escaped set, and we can't use one for the other. We're discussing renaming EscapeRegExpPattern in the spec in the future to avoid confusion for readers.
## Usage Examples
* **Why don't you do X?**
```js
RegExp.escape("The Quick Brown Fox"); // "The Quick Brown Fox"
RegExp.escape("Buy it. use it. break it. fix it.") // "Buy it\. use it\. break it\. fix it\."
RegExp.escape("(*.*)"); // "\(\*\.\*\)"
RegExp.escape("。^・ェ・^。") // "。\^・ェ・\^。"
RegExp.escape("😊 *_* +_+ ... 👍"); // "😊 \*_\* \+_\+ \.\.\. 👍"
RegExp.escape("\d \D (?:)"); // "\\d \\D \(\?\:\)"
```
If you believe there is a concern that was not addressed yet, please [open an issue](https://github.com/benjamingr/RexExp.escape/issues).
