-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial implementation of the Ion 1.1 text reader #645
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ PR tour
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result { | ||
<Element as Display>::fmt(self, f) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ The default Debug
implementation for Element
is so verbose that your eyes glaze over just trying to see that this contains an int
. It's used transitively by some other types' Debug
implementations too. I've switched it over to forwarding the call to its Display
impl, yielding text Ion. It's usually much more helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ There are a lot of changes in this file, but they're pretty uninteresting. Each of the Any
types wraps an enum-dispatched concrete implementation of one of the encodings. We added a new encoding (TextEncoding_1_1
), so lots of this boilerplate was updated. There isn't much logic to consider.
@@ -67,7 +67,7 @@ pub trait ElementReader { | |||
/// returning the complete sequence as a `Vec<Element>`. | |||
/// | |||
/// If an error occurs while reading, returns `Err(IonError)`. | |||
fn read_all_elements(&mut self) -> IonResult<Vec<Element>> { | |||
fn read_all_elements(&mut self) -> IonResult<Sequence> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ A previous PR changed Element::read_all
to return a Sequence
(which is just an opaque wrapper around Vec<Element>
. However, we neglected to make the same change to the read_all_elements
method in the ElementReader
trait. I've made that change here for interop/consistency.
@@ -78,12 +137,12 @@ pub struct LazyRawAnyReader<'data> { | |||
} | |||
|
|||
pub enum RawReaderKind<'data> { | |||
Text_1_0(LazyRawTextReader<'data>), | |||
Text_1_0(LazyRawTextReader_1_0<'data>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ LazyRawTextReader
has been split into LazyRawTextReader_1_0
and LazyRawTextReader_1_1
.
@@ -171,8 +171,8 @@ mod tests { | |||
let value = reader.next()?.expect_value()?; | |||
let lazy_struct = value.read()?.expect_struct()?; | |||
let mut fields = lazy_struct.iter(); | |||
let field1 = fields.next().expect("field 1")?; | |||
assert_eq!(field1.name(), 4.as_raw_symbol_token_ref()); // 'name' | |||
let (name, _value) = fields.next().expect("field 1")?.expect_name_value()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ When we encounter a struct field at the raw level, we now need to check whether it's a (name, value)
, a (name, macro)
, or a (macro)
. In Ion 1.0, it's always (name, value)
.
fn find(&self, name: &str) -> IonResult<Option<LazyRawBinaryValue<'data>>> { | ||
let name: RawSymbolTokenRef = name.as_raw_symbol_token_ref(); | ||
for field in self { | ||
let field = field?; | ||
if field.name() == name { | ||
let value = field.value; | ||
return Ok(Some(value)); | ||
} | ||
} | ||
Ok(None) | ||
} | ||
|
||
fn get(&self, name: &str) -> IonResult<Option<RawValueRef<'data, BinaryEncoding_1_0>>> { | ||
self.find(name)?.map(|f| f.read()).transpose() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ It no longer makes sense for the raw level struct to have find
and get
methods because some number of the fields may be encoded as macro invocations that are opaque to the raw reader. These methods have been moved to the Expanded
layer.
impl<'data> TextEncoding<'data> for TextEncoding_1_0 { | ||
fn value_from_matched( | ||
matched: MatchedRawTextValue<'data>, | ||
) -> <Self as LazyDecoder<'data>>::Value { | ||
LazyRawTextValue_1_0::from(matched) | ||
} | ||
} | ||
impl<'data> TextEncoding<'data> for TextEncoding_1_1 { | ||
fn value_from_matched( | ||
matched: MatchedRawTextValue<'data>, | ||
) -> <Self as LazyDecoder<'data>>::Value { | ||
LazyRawTextValue_1_1::from(matched) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Text Ion 1.0 and Ion 1.1 have the same parsing rules for scalar value literals. This trait implementation allows us to reuse scalar matching/reading logic the same way in both version implementations.
MatchedRawTextValue
doesn't have any methods, and only contains offset/length information about matched containers. You cannot try to read a MatchedRawTextValue
's data until you've converted it to a LazyRawTextValue_1_0
or LazyRawTextValue_1_1
, each of which have their own approach to parsing the container bytes. In particular, the _1_1
flavor knows to look for and handle nested macro invocations.
// XXX: This `unsafe` is a workaround for https://github.com/rust-lang/rust/issues/70255 | ||
// There is a rustc fix for this limitation on the horizon. See: | ||
// https://smallcultfollowing.com/babysteps/blog/2023/09/22/polonius-part-1/ | ||
// Indeed, using the experimental `-Zpolonius` flag on the nightly compiler allows the | ||
// version of this code without this `unsafe` hack to work. The alternative to the | ||
// hack is wrapping the SymbolTable in something like `RefCell`, which adds a small | ||
// amount of overhead to each access. Given that the `SymbolTable` is on the hot | ||
// path and that a fix is inbound, I think this use of `unsafe` is warranted. | ||
// SAFETY: At this point, the only thing that's holding potentially holding references to | ||
// the symbol table is the lazy value that represented an LST directive. We've | ||
// already read through that value in full to populate the `PendingLst`. Updating | ||
// the symbol table will invalidate data in that lazy value, so we just have to take | ||
// care not to read from it after updating the symbol table. | ||
let symbol_table = unsafe { | ||
let mut_ptr = ptr as *mut SymbolTable; | ||
&mut *mut_ptr | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ I want to draw additional attention to this as it's a short-term hack. The linked blog post happened to be published after I'd been banging my head against the wall on this for a few hours.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ The LazySystemReader
now wraps a LazyExpandedReader<D>
instead of a LazyRawReader<D>
.
src/lazy/text/buffer.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ There are likely a lot of opportunities to DRY this up, sharing logic between some of the 1.0 and 1.1 container parsing methods. I'd like to leave that for a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think it looks pretty good. There are a few things that look odd to me, and I've commented on them. Any of the comments related to DRY-ing or other refactoring are things that can be punted for later, though I would appreciate it if that one function with 4 levels of control-flow nesting could be broken up as part of this PR.
Codecov ReportAttention:
... and 2 files with indirect coverage changes 📢 Thoughts on this report? Let us know!. |
* Initial implementation of 1.1 text reader * Fixed private doc links * Recursive expansion of TDL containers * Relocate `From<LazyExpandedValue> for LazyValue` impls * Incorporates feedback from PR #645 * Expanded doc comments --------- Co-authored-by: Zack Slayton <[email protected]>
Syntax changes
Ion 1.1 only introduces a single text syntactic element: encoding expressions. Encoding expressions (e-expressions for short) represent a macro invocation in the data stream, and look like this:
Apart from the leading smiley (
(:
) andmacro_id
, they follow s-expression syntax rules.E-expressions can appear anywhere that an Ion value can appear, including:
$ion_1_1 (:foo)
(1 2 (:foo) 4)
[1, 2, (:foo), 4]
{ a: 1, foo: (:bar), c: 3 }
{ a: 1, (:bar), c: 3 }
(:foo 1 2 (:bar) 4)
To support this change to the grammar, the
LazyDecoder
trait has been modified to accommodate the possibility of an e-expression in these locations. In addition to the associated typeValue
, eachLazyDecoder
implementationD
(for example:TextEncoding_1_1
,BinaryEncoding_1_0
, etc) now has to declare an associated typeMacroInvocation
that represents that encoding's syntactic representation of an e-expression. Rather thanD::Reader::next()
returning aD::Value
, it now returns aLazyRawValueExpr<D>
, an enum that could be either a value or an e-expression:Similarly, rather than returning a
D::Field
, aD::Struct
's iterator will return aLazyRawFieldExpr<D>
:Because Ion 1.0 will never use these new enum variants, a special type called
Never
has been introduced that allows the compiler to recognize code paths related to macros in Ion 1.0 as dead branches that can be optimized out:This should result in Ion 1.0 maintaining the same performance even though
ion-rust
's data model is now more complex.Expansion behavior
Upon evaluation, an e-expression expands to zero or more Ion values -- they cannot produce version markers,
NOP
s, or other e-expressions. This PR implements a limited set of macros:(:values arg1 arg2 ... argN)
: always expands to its arguments.(:void)
: always expands to 0 values. (This is equivalent to(:values /* no args */)
).(:make_string arg1 arg2 ... argN)
: builds a string by evaluating each of its arguments to produce a stream of text values which are then concatenated.When an e-expression appears in a sequence, the values in its expansion are considered part of that sequence.
When an e-expression appears in a struct, its expansion behavior depends on whether it was in field value or field name position.
In field value position, each value in its expansion is treated as a field value with the same field name as preceded the e-expression.
In field name position, the macro must evaluate to a single struct. That struct's fields will then be merged into the host struct.
To enable this behavior, this patch introduces a fourth layer of abstraction into the lazy reader model:
Raw
: where bits on the wire are turned into syntax tokens (IVMs, values, symbol ID tokens, e-expressions)Expanded
: e-expressions encountered in the raw level are fully expanded. The reader API hands outLazyExpandedValue
s, each of which may be backed either by an as-of-yet-unread value literal from the input stream or by the result of a macro's evaluation.System
: which filters out and processes encoding directives (symbol tables, module declarations) from the values surfaced by lower levels of abstraction.Application
: only data relevant to the application's data model is visible.Macro evaluation
This PR introduces a new type,
MacroEvaluator
, which can evaluate a macro incrementally; it only calculates the next value in the expansion upon request. This enables e-expressions to be evaluated lazily, making skip-scanning even more powerful and preventing the unexpected spikes in memory usage that might come from eagerly evaluating expressions that may or may not be needed.Incremental argument evaluation
The
MacroEvaluator
achieves this by maintaining a stack of macros that are in the process of being evaluated. Each time the next value is requested, the macro at the top of the stack is given the opportunity to either:Some macros like
(:values ...)
need to partially evaluate their arguments until they find another value. Consider:When evaluating this expression, the evaluator will follow these steps:
1
.(:values 2 3)
onto the stack, yield2
.3
.(:values 2 3)
off of the stack, yield4
.None
.Eager argument evaluation
Other macros like
(:make_string)
need access to the expanded form of all of their arguments to yield their next value. In this case, the macro can construct a transient (short-lived) evaluator of its own.The evaluation steps for this are:
a. Push the text content
foo_
onto the buffer.b. We cannot push
(:values bar_ baz_)
onto the primary evaluator without yielding flow control. Instead, we push(:values bar_ baz_)
onto the secondary, transient evaluator's stack.c. Ask the secondary evaluator for the next value, get
bar_
, push its text onto the buffer.d. Ask the secondary evaluator for the next value, get
baz_
, push its text onto the buffer.e. Pop
(:values bar_ baz_)
off the secondary evaluator's stack. Getquux_
, push its text onto the buffer.f. Pop the entire expression off of the stack, yield
"foo_bar_baz_quux"
.From the caller's perspective, they called
next()
and received"foo_bar_baz_quux"
. However, internally the same incremental evaluation logic was being used in a smaller scope. Note that this means each substring was evaluated one at a time; we did not need to collect them or hold them all in memory at the same time (modulo the buffer's contents).Bump allocation
In the description of
(:make_string)
above, perceptive readers may have noticed some possible sources of allocations; in particular, constructing a string buffer and constructing a transient evaluator.This PR leverages
bumpalo
to be able to trivially allocate and deallocate structures using a preallocated block of heap memory. Values created in this way (including string buffers, and theVec
s backing our transient evaluators) cannot live beyond the current top level value, but happily that's exactly how long we need them! As a result, we get to allocate and deallocate by simply bumping an offset within our preallocated memory.TODO
LazyReader
to switch between underlying readers when a different IVM is encountered.make_timestamp
,if_void
, etc)By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.