Skip to content

Latest commit

 

History

History
141 lines (106 loc) · 8.69 KB

B2-memoization.md

File metadata and controls

141 lines (106 loc) · 8.69 KB

B2. Memoization

NOTE: This section is outdated and needs to be rewritten/improved.

Memoization is a technique that consists of saving the result of a function so that it does not need to be recomputed if it is needed again.

Indeed, parsers can be seen as functions that will always return the same result if invoked at the same input position, and with the same relevant context. This equivalence is explained in section A3. How Autumn Works (sub-section on "Vertical Backtracking"). The role and mechanisms of context are explained in B1. Context-Sensititive (Stateful) Parsing.

Parsers may be invoked multiple times at the same position and with the same context because Autumn backtracks (see section A3 if this is unclear). In some cases, these repeated invocations might add up to significant wasted time during the parse.

Many parser frameworks based on PEG systematically memoize all the intermediate parse results. This is called packrat parsing. Some studies were made that suggest that this is in general inefficient because PEG grammars tend to be "mostly deterministic", i.e. they don't backtrack enough to justify the overhead. Instead, it is suggested that memoizing just a couple of parsers is what yields the best results (*1). See my thesis (SOON) for a full discussion and references.

Another issue with packrat parsing is that even in the cases where it is beneficial, it trades execution time for memory use, which might become consequent: on the order of gigabyte(s) (depending on the implementation) even for inputs less than a couple thousands lines long.

In light of these findings, which match our own experimental observations, Autumn doesn't automatically memoize intermediate parser results. However, since there is something to be gained by memoizing some parsers, we enable selective parser memoization through a few parser combinators.

The next sub-section will explain this mechanism. As to when to memoize — this should generally be decided after making performance measurements on meaningful input. Autumn includes facilities to facilitate such measurement, which will be covered in section B4. Debugging & Tracing a Parse.

Memoization in Autumn

Memoizing a parser in Autumn is achieved by wrapping the parser in a Memo parser.

The Memo parser has two additional parameters: a ParseState for an implementation of the Memoizer interface, and an optional (may be null) function from a Parse to an object that represent the relevant context for the underlying parser (the context object).

(ParseState was explained in B1. Context-Sensititive (Stateful) Parsing.)

The Memoizer interface defines the operations that a memoization strategy must support (namely handling a new parse result, and attempting to retrieve an existing result).

Autumn supplies two implementations of Memoizer, but users can define their own. The first strategy is MemoTable, which memoizes every result it is passed. This strategy ensures the same result is never computed twice but may have large memory requirements. The second strategy is MemoCache, which reserves a limited number of slots for memoizing results. A new result will cause the oldest stored result to be evicted from the cache if it is full. With this strategy, results could potentially be computed multiple times, but the memory requirement is bounded.

Both strategies can be further parameterized by deciding whether results are memoized based on their position and optionally the context object, or whether the particular parser used to produce the result should also be taken into account.

Taking the parser into account allows sharing the same memoizer between multiple Memo parsers.

Regarding the context object (if available), its hashCode() method is used by memoizers to store/retrieve the result and its equals() method is used to determine whether a memoized result is applicable in the current context (by comparing the context object stored in the result and the context object extracted in the current context).

Memoization Combinators

Instances of Memo can be constructed using a family of combinators:

A special note on those combinators that take a ParseState<Memoizer>: recall that (from B1, sub-section on ParseState) you can declare a ParseState inside your grammar and pass it to the combinator without fear that multiple parses will write to the same Memoizer (ParseState maintains separate states for each parse).

Custom Memoizers & Memoizing Parsers

It's possible for users to implement their own Memoizer. This is mostly straightforward, just refer to the Javadoc for information. One thing to note however is that if you need to obtain a hashcode to store a MemoEntry (which stores a parse result), you should use the the Memoizer.hash functions.

It's also possible to use a Memoizer from a parser that isn't a Memo. Here again, it's just a matter of using the Memoizer API. The important thing that you need to enforce is the single parse rule, which (from A3. How Autumn Works) says:

A parser, when called at the same input position (and in context-sensitive parses, with the same context) should should always yield the same (singular) result.


Footnotes

(*1) On the other hand, I've heard reported that some parser frameworks fare miserably in terms of performance if memoization is disabled. I have no trouble believing it, and I strongly suspect that the culprits are the performance pitfalls that can easily occur when defining the syntax of left-associative binary expressions with PEG.

If you use the facilities supplied by Autumn to handle left-recursion/left-association (section A6. Left-Recursion and Associativity), you won't run into performance issues due to these particular issues however!

See my thesis (SOON) for a full discussion of these potential performance issues in other tools.