Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds ion-tests integration for the lazy reader #639

Merged
merged 39 commits into from
Sep 7, 2023
Merged

Conversation

zslayton
Copy link
Contributor

@zslayton zslayton commented Sep 6, 2023

Adds ion-tests integration for the lazy reader. Also addresses a number bugs surfaced by the conformance tests.

Fixes issue #636.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov
Copy link

codecov bot commented Sep 6, 2023

@zslayton zslayton changed the base branch from main to lazy-clobs September 6, 2023 18:32
Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ PR tour

Copy link
Contributor Author

@zslayton zslayton Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Previously, the binary reader had a monolithic peek_value method that contained logic for reading fields, annotations, NOPs, and the values themselves. I refactored this so that each has its own code path with clearer invariants. This made it easier to track down a NOP-related buffer indexing problem.

self.buffer = buffer;
// If the value we read doesn't start where we began reading, there was a NOP.
let num_nop_bytes = lazy_value.input.offset() - buffer.offset();
self.buffer = buffer.consume(num_nop_bytes);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This was the buffer indexing bug I referred to in an earlier comment.

// Skip the type descriptor
let input = self.input.consume(1);
// Skip the type descriptor and length bytes
let input = ImmutableBuffer::new(self.value_body()?);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This placeholder impl lived longer than it was supposed to, and always slipped past the tests because I didn't dummy up any decimals that were > 13 bytes long.

Copy link
Contributor Author

@zslayton zslayton Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I tried to standardize the reader level names a bit more. There are three levels: Raw, System, and Application. Each level has three supported decoders: Text, Binary, or Any, where Any uses enum dispatch to abstract over Text or Binary.

  • LazyRawBinaryReader, LazyRawTextReader, LazyRawAnyReader
  • LazySystemBinaryReader, LazySystemTextReader, LazySystemAnyReader
  • LazyApplicationBinaryReader, LazyApplicationSystemTextReader, LazyApplicationSystemAnyReader

Because users will almost always interact with the Application level, that is the default and type aliases are provided to allow the level to be omitted.

  • LazyBinaryReader, LazyTextReader

Similarly, most users won't want to have to specify the format, so they can just use the LazyReader to get a LazyApplicationAnyReader.

@@ -107,7 +124,7 @@ impl<'data, D: LazyDecoder<'data>> LazySystemReader<'data, D> {
return Ok(false);
}
if let Some(symbol_ref) = lazy_value.annotations().next() {
return Ok(symbol_ref? == ION_SYMBOL_TABLE);
return Ok(symbol_ref?.matches_sid_or_text(3, "$ion_symbol_table"));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Now that there's a text reader, we cannot simply look at symbol IDs when comparing annotations or field names.

Comment on lines +59 to +65
if let RawStreamItem::VersionMarker(major, minor) = matched {
if (major, minor) != (1, 0) {
return IonResult::decoding_error(format!(
"Ion version {major}.{minor} is not supported"
));
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ There are ion-tests which expect the reader to reject future Ion versions.

@@ -191,8 +199,8 @@ mod tests {
// Second item
2 /*comment before comma*/,
// Third item
3
]
3 ,]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I added a trailing comma while debugging.

.slice_to_end(1)
.match_optional_comments_and_whitespace()
.with_context("skipping a list's trailing comma", input_after_ws)?;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ If a list had a trailing comma, this would break.

.slice_to_end(1)
.match_optional_comments_and_whitespace()
.with_context("skipping a list's trailing comma", input_after_ws)?;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ If the struct had a trailing comma, this would break.

let value: Value = value_ref.try_into()?;
Ok(value.into())
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This enables: let element: Element = reader.next()?.read()?.try_into()?;

@zslayton zslayton marked this pull request as ready for review September 6, 2023 19:27
Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Restoring PR tour comments that GitHub decided to remove/hide as "outdated."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Previously, the binary reader had a monolithic peek_value method that contained logic for reading fields, annotations, NOPs, and the values themselves. I refactored this so that each has its own code path with clearer invariants. This made it easier to track down a NOP-related buffer indexing problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I tried to normalize the reader level names a bit more. There are three levels: Raw, System, and Application. Each level has three supported decoders: Text, Binary, or Any, where Any uses enum dispatch to abstract over Text or Binary.

  • LazyRawBinaryReader, LazyRawTextReader, LazyRawAnyReader
  • LazySystemBinaryReader, LazySystemTextReader, LazySystemAnyReader
  • LazyApplicationBinaryReader, LazyApplicationSystemTextReader, LazyApplicationSystemAnyReader

Because users will almost always interact with the Application level, that is the default and type aliases are provided to allow the level to be omitted.

  • LazyBinaryReader, LazyTextReader

Similarly, most users won't want to have to specify the format, so they can just use the LazyReader to get a LazyApplicationAnyReader.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed this comment when I was reviewing the code. However, I think my question is still valid. Why do we need the encoding distinction at every level of reader?

Base automatically changed from lazy-clobs to main September 7, 2023 12:07
Copy link
Contributor

@popematt popematt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good. I do have one question though.

You've made a "Binary", "Text", and "Any" implementation for the "Raw", "System", and "Application" readers. Is there any reason that the text and binary distinction has to be pushed up that far? Would it be possible to create a RawAnyReader that is an enum of binary and text raw readers so that higher levels of readers can have just one implementation?

@@ -55,11 +56,11 @@ use crate::{IonError, IonResult};
///# Ok(())
///# }
/// ```
pub struct LazyReader<'data, D: LazyDecoder<'data>> {
pub struct LazyApplicationReader<'data, D: LazyDecoder<'data>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc comments need to be updated. It still refers to this as a "binary" reader.

@@ -357,7 +366,7 @@ impl<'data> TextBufferView<'data> {
/// input bytes where the field name is found, and the value.
pub fn match_struct_field_name_and_value(
self,
) -> IonParseResult<'data, ((MatchedSymbol, Range<usize>), LazyRawTextValue<'data>)> {
) -> IonParseResult<'data, ((MatchedFieldName, Range<usize>), LazyRawTextValue<'data>)> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{'''f''' '''o''' '''o''': '''b''' '''a''' '''r'''}

🤯

@@ -701,7 +722,7 @@ impl<'data> TextBufferView<'data> {

/// Matches the three parts of an int--its base, its sign, and its digits--without actually
/// constructing an Int from them.
fn match_int(self) -> IonParseResult<'data, MatchedInt> {
pub fn match_int(self) -> IonParseResult<'data, MatchedInt> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious—why does this need to be pub now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted each of the methods that matched a full Ion type to be reusable from outside the buffer (but not outside the crate without more consideration).

@zslayton
Copy link
Contributor Author

zslayton commented Sep 7, 2023

Overall it looks good. I do have one question though.

You've made a "Binary", "Text", and "Any" implementation for the "Raw", "System", and "Application" readers. Is there any reason that the text and binary distinction has to be pushed up that far? Would it be possible to create a RawAnyReader that is an enum of binary and text raw readers so that higher levels of readers can have just one implementation?

The old reader uses a Box<dyn IonReader> to abstract over the raw reader and format, and unfortunately it adds a really substantial amount of overhead. A quick test I threw together showed it was about 25% slower than using static dispatch with a single format. The Any readers use enum dispatch, and my assumption is that it will be far faster than dyn (branch predictors are great!), but I haven't quantified the performance impact yet. If it's a slowdown, users focused on performance will probably want the option to work with a format-specific reader.

Note that LazyApplicationReader<D> is pub no matter what, so we're really just talking about what type aliases to provide.

@zslayton zslayton merged commit 700e983 into main Sep 7, 2023
@zslayton zslayton deleted the lazy-reader-ion-tests branch September 7, 2023 20:45
zslayton added a commit that referenced this pull request Sep 7, 2023
zslayton added a commit that referenced this pull request Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants