Skip to content

Latest commit

 

History

History
283 lines (202 loc) · 13.3 KB

A4-basic-parsers.md

File metadata and controls

283 lines (202 loc) · 13.3 KB

A4. Basic Parsers: Whirlwind Tour

In this section, we'll give an overview of the basic parsers and combinators of Autumn, so that you know what is available to you.

Quick reminder of things established in previous sections:

Parsers are instances of (subclasses of) Parser while combinators can refer either these parsers that have sub-parsers, or the methods used to create them (typically from the classes Grammar and rule).

A rule is a wrapper for a Parser that enables easy parser construction via the builder pattern. In general, however, we'll reserve the word "rule" for those that are assigned to a field (i.e. public rule my_rule = ... ;).

In general, if you want to find out more about Autumn's built-in parsers, there are two places to look at:

  • The norswap.autumn.parsers package, which contains all bundled subclasses of Parser.
  • The Grammar and rule classes, which contain all builder methods to construct instances of those classes.

In general, the behaviour of the parser will be specified in the documentation of its Parser subclass. Builder methods sometimes provide important options, so their documentation is important too.

Let's jump right in. We won't review every built-in parser here, only the basic ones (which will still be the majority of them). Details on advanced parsers will follow in further sections.

Sequences and Choices

We already talked about the Sequence and Choice parsers in the previous section (sub-section "Vertical Backtracking"), so I won't repeat the explanation here.

Construct them with seq and choice, respectively.

Basic examples:

  • seq(str("a"), str("b"), str("c"))
  • choice(str("a"), str("b"), str("c"))

Longest

The Longest parser is a slight variation on the choice parser. Instead of matching the same thing as its first successful child, Longest tries to match every one of its children, then succeeds by matching the same thing as the one matching the most input. In case of a tie, it matches the earliest longest matching child.

Construct with longest.

Basic example: longest(str("ab"), str("abb"), str("aba"))

Repetitions

There are two classes that implement repetitions: Repeat and Around.

Repeat supports specifying a number of repetitions to match, and whether that is a minimum or an exact number. Those variants can be constructed with rule#at_least and rule#repeat, respectively.

Basic examples:

  • str("a").at_least(0)
  • str("b").at_least(1)
  • str("c").repeat(6)

Around matches a series of repetitions of a subparser (the around parser), separated by another parser (the inside parser). The canonical example is parsing comma-separated lists. It also enables specifying whether a trailing repetition of the inside parser should be allowed, as some languages allow trailing commas in comma-separated lists, for instance. Just like Repeat, a minimum or exact number of repetitions (of the around parser) can also be specified.

Construct with rule#sep, rule#sep_trailing and rule#sep_exact.

Basic examples:

  • str("a").sep(0, "/")
    matches "", "a", "a/a", ...

  • str("a").sep_trailing(0, "/")
    matches "", "a", "a/", "a/a", "a/a/", ...

  • str("a").sep(1, "/") matches "a", "a/a", ...

  • str("a").sep_trailing(1, "/") matches "a", "a/", "a/a", "a/a/", ...

  • str("a").sep_exact(3, "/") matches "a/a/a" only

Optional

The Optional parser allows you to specify some optional bit of syntax. Construct with rule#opt.

Basic example: str("a").opt(), which matches both "" and "a".

Lookahead

The two lookahead parsers, Lookahead and Not, are able to match input without actually consuming it (meaning they leave the input position untouched even when they succeed).

Lookahead behaves exactly like its child parser, except for the restoration of the initial input position. Construct with rule#ahead().

Basic example: seq(str("="), digit.at_least(1)).ahead()

Not succeeds only if its child parser fails, which can be useful to check for a delimiter or reserved workds. Construct with rule#not().

Basic example: seq(str("{", seq(str("}").not(), any).at_least(0), str("}")), which matches a pair of matching curly braces and everything in between (e.g. "{abc}" or "{{}", but not "{}}").

Advanced note: in the classical PEG semantics, lookahead can be defined in terms of a double negation (i.e. something.not().not()). This does not work in Autumn (and most other frameworks), because negation discards any parse tree that may have been built when it fails (i.e. when its child succeeds).

Primitive Parsers

These are parsers which are not combinators, i.e. who don't have subparsers and whose success generally depends on a direct check against the input.

First, there are the Empty and Fail parsers (build with empty and fail), which always succeeds without consuming any input, and always fails (respectively).

Instances of CharPredicate match a single character when the input is a string. You can define your own predicate using cpred, or use one of the pre-defined ones:

  • any — matches any character
  • digit — matches a decimal digit (from '0' to '9')
  • octal_digit — matches an octal digit (from '0' to '7')
  • hex_digit — matches an hexadecimal digit (from '0' to '9', 'a' to 'f' and 'A' to 'F')
  • alpha — matches a single ASCII alphabetic character
  • alphanum — matches a single ASCII alpha-numeric character

(I didn't put individual links, these are all fields in Grammar).

It's also possible to match a single character (character) (can also be done with str), as well as ranges (range) and sets (set(char...) and set(String)) of characters.

Basic examples, a couple of parsers matching 'a', 'b', 'c' or 'd':

  • range('a', 'd')
  • set('a', 'b', 'c', 'd')
  • set("abcd")
  • cpred(c -> 'a' <= c && c <= 'd')

In the same way, it's possible to match single objects with ObjectPredicate when the input is a list of objects. Construct with opred.

Finally, it's possible to match whole strings (when the input is a string) with str.

Matching Whitespace

We've discussed matching whitespace in the corresponding section of section A2.

Grammar defines a ws field, which you can freely assign in order to define what consistutes whitespace in your language. It should always succeed, and match as much whitespace as possible.

This field is reused by the word and rule#word methods. The first matches its string parameter followed by ws. The second matches the receiver followed by ws.

Note that these methods capture the value of ws at the moment when they are called. As such, it is best to define the whitespace as one of the first things you do in a grammar definition (as indeed we do in A2 (Your First Grammar)).

Lazy Parsing and Recursion

Because we define grammars as a collection of rule- or Parser-valued fields that refer to one another, we run into a fundamental limitation: in the definition of a field A, we can't refer to a field B that is declared after A. However, we need to do this if our grammar includes any kind of recursion!

A solution to this problem is the lazy parser that is showcased in the corresponding section of A2. Like we explained there:

lazy returns a parser that will be initialized when first used, based on the lambda that it was passed.

This lazily initialized parser is an instance of LazyParser.

Here's a basic example that shows a recursive grammar where rule A matches one or more repetition of the string "ab":

rule A = 
    seq(str("a"), lazy(() -> this.B));
    
rule B = choice(
    seq(str("b"), A),
    str("b"));

lazy can also be used to make self-recursive parsers:

rule A = lazy(() -> 
    choice(
        seq(str("ab"), this.A), 
        str("ab"));

(Of course, a much better way to write this rule is str("ab").at_least(1).)

Beware: Autumn doesn't support naive left-recursion, so make sure that you don't use lazy to cause a parser to (directly or indirectly) call itself at the same input position: this would cause an infinite loop (or, in practice, a stack overflow)! For instance, don't do this:

rule A = lazy(() -> 
    choice(
        seq(this.A, str("ab")), // left-recursion using "recursive" - STACK OVERFLOW!
        str("ab"));

We'll explain how to tackle this issue easily in A6. Left-Recursion and Associativity.

Advanced

Here is a small map of where you can find information on the advanced parsers / combinators we didn't talk about in this section.

(TODO)