Initial implementation of the Ion 1.1 text reader #645

zslayton · 2023-09-25T17:23:30Z

Syntax changes

Ion 1.1 only introduces a single text syntactic element: encoding expressions. Encoding expressions (e-expressions for short) represent a macro invocation in the data stream, and look like this:

(:macro_id arg1 arg2 arg3 ... argN)

Apart from the leading smiley ((:) and macro_id, they follow s-expression syntax rules.

E-expressions can appear anywhere that an Ion value can appear, including:

In a sequence:
- Top level: $ion_1_1 (:foo)
- S-expression: (1 2 (:foo) 4)
- List: [1, 2, (:foo), 4]
In a struct:
- In field value position: { a: 1, foo: (:bar), c: 3 }
- In field name position with no associated value: { a: 1, (:bar), c: 3 }
As an argument to another e-expression: (:foo 1 2 (:bar) 4)

To support this change to the grammar, the LazyDecoder trait has been modified to accommodate the possibility of an e-expression in these locations. In addition to the associated type Value, each LazyDecoder implementation D (for example: TextEncoding_1_1, BinaryEncoding_1_0, etc) now has to declare an associated type MacroInvocation that represents that encoding's syntactic representation of an e-expression. Rather than D::Reader::next() returning a D::Value, it now returns a LazyRawValueExpr<D>, an enum that could be either a value or an e-expression:

pub enum LazyRawValueExpr<'data, D: LazyDecoder<'data>> {
    /// A text Ion value literal. For example: `5`, `foo`, or `"hello"`
    ValueLiteral(D::Value),
    /// A text Ion 1.1+ macro invocation. For example: `(:employee 12345 "Sarah" "Gonzalez")`
    MacroInvocation(D::MacroInvocation),
}

Similarly, rather than returning a D::Field, a D::Struct's iterator will return a LazyRawFieldExpr<D>:

/// An item found in field position within a struct.
/// This item may be:
///   * a name/value pair (as it is in Ion 1.0)
///   * a name/e-expression pair
///   * an e-expression
#[derive(Debug)]
pub enum LazyRawFieldExpr<'data, D: LazyDecoder<'data>> {
    NameValuePair(RawSymbolTokenRef<'data>, LazyRawValueExpr<'data, D>),
    MacroInvocation(D::MacroInvocation),
}

Because Ion 1.0 will never use these new enum variants, a special type called Never has been introduced that allows the compiler to recognize code paths related to macros in Ion 1.0 as dead branches that can be optimized out:

/// An uninhabited type that signals to the compiler that related code paths are not reachable.
#[derive(Debug, Copy, Clone)]
pub enum Never {
    // Has no variants, cannot be instantiated.
}

This should result in Ion 1.0 maintaining the same performance even though ion-rust's data model is now more complex.

Expansion behavior

Upon evaluation, an e-expression expands to zero or more Ion values -- they cannot produce version markers, NOPs, or other e-expressions. This PR implements a limited set of macros:

(:values arg1 arg2 ... argN): always expands to its arguments.
(:void): always expands to 0 values. (This is equivalent to (:values /* no args */)).
(:make_string arg1 arg2 ... argN): builds a string by evaluating each of its arguments to produce a stream of text values which are then concatenated.

When an e-expression appears in a sequence, the values in its expansion are considered part of that sequence.

// Top level
$ion_1_1 1 2 (:values 3 4) 5 6    // 1 2 3 4 5 6

// List
[1, 2, (:values 3 4), 5, 6]       // [1, 2, 3, 4, 5, 6]

// S-expression
(1 2 (:values 3 4) 5 6)           // (1 2 3 4 5 6)

When an e-expression appears in a struct, its expansion behavior depends on whether it was in field value or field name position.

In field value position, each value in its expansion is treated as a field value with the same field name as preceded the e-expression.

// Field value position
{
    foo: (:values 1 2 3),
    bar: (:values 4),
    baz: (:void)
}                       
// Expands to:
{
    foo: 1,
    foo: 2,
    foo: 3,
    bar: 4
}

In field name position, the macro must evaluate to a single struct. That struct's fields will then be merged into the host struct.

// Field name position
{
    foo: 1,
    (:values {bar: 2, baz: 3}),
    quux: 4,
}                       
// Expands to:
{
    foo: 1,
    bar: 2,
    baz: 3,
    quux: 4
}

To enable this behavior, this patch introduces a fourth layer of abstraction into the lazy reader model:

Raw: where bits on the wire are turned into syntax tokens (IVMs, values, symbol ID tokens, e-expressions)
--> new <-- Expanded: e-expressions encountered in the raw level are fully expanded. The reader API hands out LazyExpandedValues, each of which may be backed either by an as-of-yet-unread value literal from the input stream or by the result of a macro's evaluation.
System: which filters out and processes encoding directives (symbol tables, module declarations) from the values surfaced by lower levels of abstraction.
Application: only data relevant to the application's data model is visible.

Macro evaluation

This PR introduces a new type, MacroEvaluator, which can evaluate a macro incrementally; it only calculates the next value in the expansion upon request. This enables e-expressions to be evaluated lazily, making skip-scanning even more powerful and preventing the unexpected spikes in memory usage that might come from eagerly evaluating expressions that may or may not be needed.

Incremental argument evaluation

The MacroEvaluator achieves this by maintaining a stack of macros that are in the process of being evaluated. Each time the next value is requested, the macro at the top of the stack is given the opportunity to either:

yield another value
push another macro onto the stack
indicate that its evaluation is complete

Some macros like (:values ...) need to partially evaluate their arguments until they find another value. Consider:

(:values
  1
  (:values 2 3)
  4)

When evaluating this expression, the evaluator will follow these steps:

Push the entire expression onto the stack. Yield 1.
Push (:values 2 3) onto the stack, yield 2.
Yield 3.
Pop (:values 2 3) off of the stack, yield 4.
Pop the entire expression off of the stack. yield None.

Eager argument evaluation

Other macros like (:make_string) need access to the expanded form of all of their arguments to yield their next value. In this case, the macro can construct a transient (short-lived) evaluator of its own.

(:make_string
  foo_
  (:values bar_ baz_)
  quux_)

The evaluation steps for this are:

Push the entire expression onto the primary stack; construct an empty string buffer and a transient evaluator.
a. Push the text content foo_ onto the buffer.
b. We cannot push (:values bar_ baz_) onto the primary evaluator without yielding flow control. Instead, we push (:values bar_ baz_) onto the secondary, transient evaluator's stack.
c. Ask the secondary evaluator for the next value, get bar_, push its text onto the buffer.
d. Ask the secondary evaluator for the next value, get baz_, push its text onto the buffer.
e. Pop (:values bar_ baz_) off the secondary evaluator's stack. Get quux_, push its text onto the buffer.
f. Pop the entire expression off of the stack, yield "foo_bar_baz_quux".

From the caller's perspective, they called next() and received "foo_bar_baz_quux". However, internally the same incremental evaluation logic was being used in a smaller scope. Note that this means each substring was evaluated one at a time; we did not need to collect them or hold them all in memory at the same time (modulo the buffer's contents).

Bump allocation

In the description of (:make_string) above, perceptive readers may have noticed some possible sources of allocations; in particular, constructing a string buffer and constructing a transient evaluator.

This PR leverages bumpalo to be able to trivially allocate and deallocate structures using a preallocated block of heap memory. Values created in this way (including string buffers, and the Vecs backing our transient evaluators) cannot live beyond the current top level value, but happily that's exactly how long we need them! As a result, we get to allocate and deallocate by simply bumping an offset within our preallocated memory.

TODO

Benchmarking (particularly of Ion 1.0 code)
Wiring up the LazyReader to switch between underlying readers when a different IVM is encountered.
Template macros (including variable expansion)
Encoding directives (module definition)
Other built-in macros (make_timestamp, if_void, etc)
Binary encoding

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

zslayton

🗺️ PR tour

zslayton · 2023-09-25T17:26:56Z

src/element/mod.rs

+    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
+        <Element as Display>::fmt(self, f)
+    }
+}


🗺️ The default Debug implementation for Element is so verbose that your eyes glaze over just trying to see that this contains an int. It's used transitively by some other types' Debug implementations too. I've switched it over to forwarding the call to its Display impl, yielding text Ion. It's usually much more helpful.

zslayton · 2023-09-25T17:28:42Z

src/lazy/any_encoding.rs

🗺️ There are a lot of changes in this file, but they're pretty uninteresting. Each of the Any types wraps an enum-dispatched concrete implementation of one of the encodings. We added a new encoding (TextEncoding_1_1), so lots of this boilerplate was updated. There isn't much logic to consider.

zslayton · 2023-09-25T17:29:40Z

src/element/reader.rs

@@ -67,7 +67,7 @@ pub trait ElementReader {
    /// returning the complete sequence as a `Vec<Element>`.
    ///
    /// If an error occurs while reading, returns `Err(IonError)`.
-    fn read_all_elements(&mut self) -> IonResult<Vec<Element>> {
+    fn read_all_elements(&mut self) -> IonResult<Sequence> {


🗺️ A previous PR changed Element::read_all to return a Sequence (which is just an opaque wrapper around Vec<Element>. However, we neglected to make the same change to the read_all_elements method in the ElementReader trait. I've made that change here for interop/consistency.

zslayton · 2023-09-25T17:30:33Z

src/lazy/any_encoding.rs

@@ -78,12 +137,12 @@ pub struct LazyRawAnyReader<'data> {
 }

 pub enum RawReaderKind<'data> {
-    Text_1_0(LazyRawTextReader<'data>),
+    Text_1_0(LazyRawTextReader_1_0<'data>),


🗺️ LazyRawTextReader has been split into LazyRawTextReader_1_0 and LazyRawTextReader_1_1.

zslayton · 2023-09-25T17:34:23Z

src/lazy/binary/raw/reader.rs

@@ -171,8 +171,8 @@ mod tests {
        let value = reader.next()?.expect_value()?;
        let lazy_struct = value.read()?.expect_struct()?;
        let mut fields = lazy_struct.iter();
-        let field1 = fields.next().expect("field 1")?;
-        assert_eq!(field1.name(), 4.as_raw_symbol_token_ref()); // 'name'
+        let (name, _value) = fields.next().expect("field 1")?.expect_name_value()?;


🗺️ When we encounter a struct field at the raw level, we now need to check whether it's a (name, value), a (name, macro), or a (macro). In Ion 1.0, it's always (name, value).

zslayton · 2023-09-25T17:42:05Z

src/lazy/binary/raw/struct.rs

-    fn find(&self, name: &str) -> IonResult<Option<LazyRawBinaryValue<'data>>> {
-        let name: RawSymbolTokenRef = name.as_raw_symbol_token_ref();
-        for field in self {
-            let field = field?;
-            if field.name() == name {
-                let value = field.value;
-                return Ok(Some(value));
-            }
-        }
-        Ok(None)
-    }
-
-    fn get(&self, name: &str) -> IonResult<Option<RawValueRef<'data, BinaryEncoding_1_0>>> {
-        self.find(name)?.map(|f| f.read()).transpose()
-    }


🗺️ It no longer makes sense for the raw level struct to have find and get methods because some number of the fields may be encoded as macro invocations that are opaque to the raw reader. These methods have been moved to the Expanded layer.

zslayton · 2023-09-25T17:48:39Z

src/lazy/encoding.rs

+impl<'data> TextEncoding<'data> for TextEncoding_1_0 {
+    fn value_from_matched(
+        matched: MatchedRawTextValue<'data>,
+    ) -> <Self as LazyDecoder<'data>>::Value {
+        LazyRawTextValue_1_0::from(matched)
+    }
+}
+impl<'data> TextEncoding<'data> for TextEncoding_1_1 {
+    fn value_from_matched(
+        matched: MatchedRawTextValue<'data>,
+    ) -> <Self as LazyDecoder<'data>>::Value {
+        LazyRawTextValue_1_1::from(matched)
+    }
+}


🗺️ Text Ion 1.0 and Ion 1.1 have the same parsing rules for scalar value literals. This trait implementation allows us to reuse scalar matching/reading logic the same way in both version implementations.

MatchedRawTextValue doesn't have any methods, and only contains offset/length information about matched containers. You cannot try to read a MatchedRawTextValue's data until you've converted it to a LazyRawTextValue_1_0 or LazyRawTextValue_1_1, each of which have their own approach to parsing the container bytes. In particular, the _1_1 flavor knows to look for and handle nested macro invocations.

zslayton · 2023-09-25T17:57:15Z

src/lazy/system_reader.rs

+        // XXX: This `unsafe` is a workaround for https://github.com/rust-lang/rust/issues/70255
+        //      There is a rustc fix for this limitation on the horizon. See:
+        //      https://smallcultfollowing.com/babysteps/blog/2023/09/22/polonius-part-1/
+        //      Indeed, using the experimental `-Zpolonius` flag on the nightly compiler allows the
+        //      version of this code without this `unsafe` hack to work. The alternative to the
+        //      hack is wrapping the SymbolTable in something like `RefCell`, which adds a small
+        //      amount of overhead to each access. Given that the `SymbolTable` is on the hot
+        //      path and that a fix is inbound, I think this use of `unsafe` is warranted.
+        // SAFETY: At this point, the only thing that's holding potentially holding references to
+        //         the symbol table is the lazy value that represented an LST directive. We've
+        //         already read through that value in full to populate the `PendingLst`. Updating
+        //         the symbol table will invalidate data in that lazy value, so we just have to take
+        //         care not to read from it after updating the symbol table.
+        let symbol_table = unsafe {
+            let mut_ptr = ptr as *mut SymbolTable;
+            &mut *mut_ptr
+        };


🗺️ I want to draw additional attention to this as it's a short-term hack. The linked blog post happened to be published after I'd been banging my head against the wall on this for a few hours.

zslayton · 2023-09-25T18:02:01Z

src/lazy/system_reader.rs

🗺️ The LazySystemReader now wraps a LazyExpandedReader<D> instead of a LazyRawReader<D>.

zslayton · 2023-09-25T18:04:09Z

src/lazy/text/buffer.rs

🗺️ There are likely a lot of opportunities to DRY this up, sharing logic between some of the 1.0 and 1.1 container parsing methods. I'd like to leave that for a future PR.

popematt

Overall, I think it looks pretty good. There are a few things that look odd to me, and I've commented on them. Any of the comments related to DRY-ing or other refactoring are things that can be punted for later, though I would appreciate it if that one function with 4 levels of control-flow nesting could be broken up as part of this PR.

src/lazy/any_encoding.rs

src/lazy/binary/raw/sequence.rs

src/lazy/encoding.rs

src/lazy/expanded/template.rs

src/lazy/text/raw/v1_1/reader.rs

src/lazy/text/value.rs

src/lazy/value_ref.rs

src/types/struct.rs

codecov · 2023-09-28T19:44:55Z

Codecov Report

Attention: 776 lines in your changes are missing coverage. Please review.

Files	Coverage Δ
src/element/mod.rs	`81.70% <100.00%> (+1.33%)`	⬆️
src/element/reader.rs	`91.34% <100.00%> (ø)`
src/lazy/expanded/e_expression.rs	`100.00% <100.00%> (ø)`
src/lazy/reader.rs	`75.63% <100.00%> (+0.84%)`	⬆️
src/lazy/text/encoded_value.rs	`90.97% <ø> (ø)`
src/lazy/bytes_ref.rs	`50.72% <0.00%> (-0.75%)`	⬇️
src/lazy/expanded/template.rs	`96.96% <96.96%> (ø)`
src/symbol_table.rs	`90.41% <0.00%> (-1.26%)`	⬇️
src/lazy/binary/raw/sequence.rs	`47.22% <60.00%> (+1.87%)`	⬆️
src/lazy/sequence.rs	`73.27% <75.00%> (+0.61%)`	⬆️
... and 25 more

... and 2 files with indirect coverage changes

📢 Thoughts on this report? Let us know!.

* Initial implementation of 1.1 text reader * Fixed private doc links * Recursive expansion of TDL containers * Relocate `From<LazyExpandedValue> for LazyValue` impls * Incorporates feedback from PR #645 * Expanded doc comments --------- Co-authored-by: Zack Slayton <[email protected]>

zslayton added 2 commits September 25, 2023 11:29

Initial implementation of 1.1 text reader

2185cc3

Fixed private doc links

222bcb2

zslayton marked this pull request as ready for review September 25, 2023 17:24

zslayton requested a review from popematt September 25, 2023 17:24

zslayton commented Sep 25, 2023

View reviewed changes

popematt reviewed Sep 27, 2023

View reviewed changes

This was referenced Sep 28, 2023

Recursive macro expansion in TDL containers #647

Merged

Make a "commons" module for pub(crate) utilities #649

Open

Implement ion_type() with a macro #650

Open

Eliminate boilerplate in AnyEncoding #651

Open

zslayton added a commit that referenced this pull request Sep 28, 2023

Incorporates feedback from PR #645

f992f86

zslayton mentioned this pull request Sep 28, 2023

Incorporates feedback from PR #645 #652

Merged

popematt approved these changes Sep 28, 2023

View reviewed changes

Merge branch 'main' into 1_1-text-reader

15f45c7

cargo fmt

b603cfa

zslayton merged commit 1d2d46b into main Sep 28, 2023

zslayton deleted the 1_1-text-reader branch September 28, 2023 19:54

This was referenced Oct 16, 2023

Text parser support for e-expressions #655

Closed

Evaluating e-expressions #656

Closed

Evaluating template macro invocations #657

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initial implementation of the Ion 1.1 text reader #645

Initial implementation of the Ion 1.1 text reader #645

zslayton commented Sep 25, 2023 •

edited

Loading

zslayton left a comment

zslayton Sep 25, 2023

zslayton Sep 25, 2023

zslayton Sep 25, 2023

zslayton Sep 25, 2023

zslayton Sep 25, 2023

zslayton Sep 25, 2023

zslayton Sep 25, 2023

zslayton Sep 25, 2023

zslayton Sep 25, 2023

zslayton Sep 25, 2023

popematt left a comment

codecov bot commented Sep 28, 2023 •

edited

Loading

Initial implementation of the Ion 1.1 text reader #645

Initial implementation of the Ion 1.1 text reader #645

Conversation

zslayton commented Sep 25, 2023 • edited Loading

Syntax changes

Expansion behavior

Macro evaluation

Incremental argument evaluation

Eager argument evaluation

Bump allocation

TODO

zslayton left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

popematt left a comment

Choose a reason for hiding this comment

codecov bot commented Sep 28, 2023 • edited Loading

Codecov Report

zslayton commented Sep 25, 2023 •

edited

Loading

codecov bot commented Sep 28, 2023 •

edited

Loading