Tidy up CharStreams API. Add new doc/unicode.md
1 parent
b467dc8
commit 4f21686
Showing 10 changed files with 415 additions and 126 deletions.
@@ -0,0 +1,68 @@
# Lexers and Unicode text

Before ANTLR 4.7, generated lexers supported only part of the Unicode standard
(code points up to `U+FFFF`).

With ANTLR 4.7 and later, lexers and all language runtimes
support the full range of Unicode code points up to `U+10FFFF`, as
long as the input `CharStream` is opened using `CharStreams.fromPath()`
or the equivalent method for your runtime's language.

# Unicode Code Points in Lexer Grammars

To refer to Unicode [code points](https://en.wikipedia.org/wiki/Code_point)
in lexer grammars, use the `\u` string escape. For example, this lexer rule
matches a single Cyrillic character using a range from
`U+0400` to `U+04FF`:

```ANTLR
CYRILLIC : '\u0400'..'\u04FF' ;
```

Code points above `U+FFFF` must use the extended `\u{12345}` syntax.
For example, this lexer rule matches a selection of smiley faces
from the [Emoticons Unicode block](http://www.unicode.org/charts/PDF/U1F600.pdf):

```ANTLR
EMOTICONS : ('\u{1F600}' | '\u{1F602}' | '\u{1F615}') ;
```
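To make the choice between the two escape forms concrete, here is a minimal sketch of a helper that formats a code point as a lexer-grammar escape: four hex digits for code points up to `U+FFFF`, the braced form above that. `AntlrEscapes` and `escape` are hypothetical names introduced for illustration and are not part of ANTLR; the sketch uses only JDK calls.

```java
// Hypothetical helper (not part of ANTLR): formats a code point as a
// lexer-grammar escape, picking the braced form when it is required.
final class AntlrEscapes {
    static String escape(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        if (Character.isBmpCodePoint(codePoint)) {
            // Up to U+FFFF: backslash, 'u', and exactly four hex digits.
            return String.format("\\u%04X", codePoint);
        }
        // Above U+FFFF: the extended form with braces.
        return String.format("\\u{%X}", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(escape(0x0400));   // first code point of the Cyrillic range
        System.out.println(escape(0x1F600));  // first emoticon in the rule above
    }
}
```

Running `main` prints `\u0400` followed by `\u{1F600}`.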
|
||
Finally, lexer char sets can include Unicode properties: | ||
|
||
```ANTLR | ||
EMOJI = [\p{Emoji}]; | ||
JAPANESE = [\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}]; | ||
NOT_CYRILLIC = [\P{Script=Cyrillic}]; | ||
``` | ||
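As a rough way to explore which code points a script property covers, the JDK's own script data can be queried. The sketch below is only an approximation of the `JAPANESE` and `NOT_CYRILLIC` sets: it assumes the JDK's Unicode tables agree with the data ANTLR uses, which may not hold for recently added characters. `ScriptCheck` and its methods are hypothetical names.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper (not part of ANTLR): approximates two of the
// property-based char sets using the JDK's Unicode script data.
final class ScriptCheck {
    // True for code points whose script is Hiragana, Katakana, or Han.
    static boolean isJapaneseScript(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HIRAGANA
            || script == UnicodeScript.KATAKANA
            || script == UnicodeScript.HAN;
    }

    // True for any code point whose script is not Cyrillic.
    static boolean isNotCyrillic(int codePoint) {
        return UnicodeScript.of(codePoint) != UnicodeScript.CYRILLIC;
    }

    public static void main(String[] args) {
        System.out.println(isJapaneseScript(0x3042)); // HIRAGANA LETTER A
        System.out.println(isJapaneseScript(0x0041)); // LATIN CAPITAL LETTER A
        System.out.println(isNotCyrillic(0x0416));    // CYRILLIC CAPITAL LETTER ZHE
    }
}
```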

See [lexer-rules.md](lexer-rules.md#lexer-rule-elements) for more detail on Unicode
escapes in lexer rules.

# CharStreams and UTF-8

If your lexer grammar contains code points above `U+FFFF`, the code that
drives your lexer must open the input using `CharStreams.fromPath()` (or the
equivalent in your runtime's language). Otherwise, input code points above
`U+FFFF` will *not* match.

For backwards compatibility, the existing `ANTLRInputStream` and
`ANTLRFileStream` APIs only support Unicode code points up to `U+FFFF`.

The existing `TestRig` command-line interface supports all Unicode
code points.
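The `U+FFFF` boundary corresponds to the limit of a single UTF-16 code unit. The sketch below does not use ANTLR; it only illustrates, with plain JDK strings, how the same input looks when read as 16-bit units versus whole code points. It assumes that a stream limited to `U+FFFF` hands the lexer values like the first view, which would explain why a rule for `U+1F600` cannot match. `CodeUnitsVsCodePoints` is a hypothetical name.

```java
// Illustration only (no ANTLR involved): the same text seen as UTF-16
// code units and as Unicode code points.
final class CodeUnitsVsCodePoints {
    // One value per 16-bit UTF-16 code unit.
    static int[] codeUnits(String s) {
        return s.chars().toArray();
    }

    // One value per Unicode code point.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    // Formats values as space-separated U+XXXX strings.
    static String hex(int[] values) {
        StringBuilder sb = new StringBuilder();
        for (int v : values) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", v));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the string from the code point to avoid source-encoding issues.
        String input = new String(Character.toChars(0x1F600));
        System.out.println(hex(codeUnits(input)));  // U+D83D U+DE00
        System.out.println(hex(codePoints(input))); // U+1F600
    }
}
```

In the first view the emoticon appears as two surrogate values, neither of which equals `0x1F600`; in the second it is a single value that a rule such as `'\u{1F600}'` can be compared against.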

# Example

If you have generated a lexer named `UnicodeLexer`:

```Java
import java.io.IOException;
import java.nio.file.Paths;
import org.antlr.v4.runtime.*;

// The class name is illustrative; UnicodeLexer is your generated lexer.
public class UnicodeLexerDemo {
    public static void main(String[] args) throws IOException {
        // Open the file named by the first argument as a code point stream.
        CharStream charStream = CharStreams.fromPath(Paths.get(args[0]));
        Lexer lexer = new UnicodeLexer(charStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill(); // read every token up to end of input
        for (Token token : tokens.getTokens()) {
            System.out.println("Got token: " + token);
        }
    }
}
```