Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds read/write support for system eexps, placeholders for remaining system macros #862

Merged
merged 19 commits into from
Dec 2, 2024

Conversation

zslayton
Copy link
Contributor

@zslayton zslayton commented Nov 29, 2024

Builds on PR #861.

  • MacroTable now implements Send, allowing instances to be shared across threads. This made it possible for the system symbol table to be a singleton for all readers. To achieve this:
    • Rc references to other Macro instances are now Arc. This incurs higher overhead for new macro definitions at compile time, but no overhead at usage time. Reader and Writer instantiation benefit by not having to create a new system macro table instance.
    • Rc<str> instances used to store parameter and macro names have been replaced by CompactString, which doesn't allocate for strings that are <=24 bytes long.
  • Text read/write support for $ion::<addr> and $ion::<name> e-expressions.
  • Binary read/write support for 0xEF system e-expressions.
  • Stubs out macros for all of the yet-to-be-implemented system macros, so all macros now have the expected address.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* Macros now impleement `Send`, allowing them to be shared across
  threads. This makes it possible for all readers and writers to
  share the same system macro table instance instead of allocating
  their own copy.
* Adds a `SystemMacroAddress` type that is range-checked at creation.
* Adds a `SystemAddress` variant to the `MacroIdRef` enum.
* Adds `Opcode::from_byte` support for `SystemEExpression`.
* `Parameter` and `Macro` now use `CompactString` to store their names,
  skipping all heap allocations in the common case.
Copy link

codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 87.46439% with 44 lines in your changes missing coverage. Please review.

Project coverage is 77.82%. Comparing base (137d79e) to head (111ecfc).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/lazy/text/buffer.rs 58.06% 12 Missing and 1 partial ⚠️
src/lazy/encoder/text/v1_1/writer.rs 33.33% 0 Missing and 10 partials ⚠️
src/lazy/text/raw/v1_1/reader.rs 68.42% 6 Missing ⚠️
src/lazy/encoder/binary/v1_1/value_writer.rs 78.26% 0 Missing and 5 partials ⚠️
src/lazy/expanded/macro_table.rs 97.76% 4 Missing ⚠️
src/lazy/expanded/compiler.rs 94.44% 2 Missing ⚠️
src/lazy/binary/raw/v1_1/type_descriptor.rs 0.00% 1 Missing ⚠️
src/lazy/encoder/text/v1_1/value_writer.rs 85.71% 1 Missing ⚠️
src/lazy/expanded/e_expression.rs 0.00% 1 Missing ⚠️
src/lazy/expanded/template.rs 97.29% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #862      +/-   ##
==========================================
- Coverage   77.83%   77.82%   -0.01%     
==========================================
  Files         136      136              
  Lines       34021    34199     +178     
  Branches    34021    34199     +178     
==========================================
+ Hits        26479    26617     +138     
- Misses       5602     5626      +24     
- Partials     1940     1956      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ PR Tour 🧭

@@ -53,6 +53,7 @@ base64 = "0.12"
# chrono < 0.5 brings in a deprecated version of the `time` crate via `oldtime` feature by default
# this makes it explicitly not do this as there is an advisory warning against this:
# See: https://github.com/chronotope/chrono/issues/602
compact_str = "0.8.0"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ There are many crates offering an optimized small string type--here's a comparison of the available options.

This one is actively developed and stack-allocates strings up to 24 bytes long, which is enough to cover most parameter and macro names.

'timestamp': timestamp,
'threadId': thread_id,
'threadName': (make_string "scheduler-thread-" thread_name),
'timestamp': (%timestamp),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This benchmark had broken following recent changes to TDL syntax.

.unwrap()
.read_all_elements()?;

let mut text_1_0_group = c.benchmark_group("text 1.0");
// Visit each top level value in the stream without reading it.
text_1_0_group.bench_function("scan all", |b| {
b.iter(|| {
let mut reader = Reader::new(v1_1::Text, text_1_0_data.as_slice()).unwrap();
let mut reader = Reader::new(v1_0::Text, text_1_0_data.as_slice()).unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wouldn't have mattered much if at all, but we were using a v1_1 text reader to read 1.0 text.

@@ -726,6 +726,10 @@ impl BinaryValueWriter_1_1<'_, '_> {
MacroIdRef::LocalAddress(_address) => {
todo!("macros with addresses greater than {MAX_20_BIT_ADDRESS}");
}
MacroIdRef::SystemAddress(address) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Binary write logic for system macro invocations.

@@ -25,7 +25,10 @@ pub struct TextAnnotatedValueWriter_1_1<'value, W: Write> {
}

impl<'value, W: Write + 'value> AnnotatableWriter for TextValueWriter_1_1<'value, W> {
type AnnotatedValueWriter<'a> = TextAnnotatedValueWriter_1_1<'a, W> where Self: 'a;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ New rustfmt release has new opinions.

Comment on lines -179 to +187
thread_local! {
/// An instance of the Ion 1.1 system macro table is lazily instantiated once per thread,
/// minimizing the number of times macro compilation occurs.
///
/// The thread-local instance holds `Rc` references to its macro names and macro definitions,
/// making its contents inexpensive to `clone()` and reducing the number of duplicate `String`s
/// being allocated over time.
///
/// Each time the user constructs a reader, it clones the thread-local copy of the system macro
/// table.
pub static ION_1_1_SYSTEM_MACROS: MacroTable = MacroTable::construct_system_macro_table();
}
/// A lazily initialized singleton instance of the Ion 1.1 system macro table.
///
/// This instance is shared by all readers, minimizing the number of times that macro compilation
/// and heap allocation occurs.
pub static ION_1_1_SYSTEM_MACROS: LazyLock<MacroTable> =
LazyLock::new(MacroTable::construct_system_macro_table);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Originally, each reader or writer would instantiate its own copy of the system macro table. This involved several heap allocations and some compilation logic that could feel quite heavyweight for a short-lived stream.

To avoid paying this cost for every reader and writer, I made the system macro table a thread-local value. Only the first reader or writer on a thread would pay the initialization cost, and Macro and Parameter instances could continue using lightweight Rcs to share data within the thread.

However, this introduced a problem: unlike static values that live for the duration of the program, thread locals only live for the duration of their thread; they do not have a named lifetime. You can only access a thread local value via a closure, and you cannot retain a reference to it. As a result, it wasn't possible to get a reference (&) to a system macro from the thread local system macro table—only an Rc.

As a compromise, I used a thread local to build the first system macro table—performing the necessary macro compilation steps—and then had subsequent readers and writers clone() that instance of the table. This allowed most readers and writers to avoid the cost of compiling system macros, but they still had to pay for the heap allocations. Not terribly satisfying.

Recently, in Rust v1.80, the LazyLock type was stabilized. It uses synchronization primitives to manage initializing the value, but all accesses thereafter can get a & reference for 'free'.

By switching the internals of the Macro type from Rc<str> to CompactString and Rc<Macro> to Arc<Macro>, I was able to create an immutable, lazily-initialized global singleton system macro table. References to the system macros are now trivial to get and can be reused across all readers and writers.

num_annotations: 0,
}),
},
ExpansionAnalysis::single_application_value(IonType::String),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Note the use of named ExpansionAnalysis constructors for clarity.

@@ -707,8 +938,9 @@ impl MacroTable {
}
}

// TODO: Remove this
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Once we implement the encoding module sequence, we will no longer initialize the local macro table by cloning the system macro table.

@@ -731,6 +963,9 @@ impl MacroTable {
match id {
MacroIdRef::LocalName(name) => self.macro_with_name(name),
MacroIdRef::LocalAddress(address) => self.macro_at_address(address),
MacroIdRef::SystemAddress(system_address) => {
ION_1_1_SYSTEM_MACROS.macro_at_address(system_address.as_usize())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Example of getting a & reference to a system macro.

@@ -1177,8 +1177,49 @@ impl<'top> TextBuffer<'top> {
Ok((exp_body_after_id, id))
}

pub fn match_system_eexp_id(self) -> IonParseResult<'top, MacroIdRef<'top>> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Text parsing logic for $ion::<name> and $ion::<addr>. Later on we can generalize this for any qualified macro reference.

Base automatically changed from deny-old-idioms to main December 2, 2024 19:08
benches/read_many_structs.rs Outdated Show resolved Hide resolved
benches/read_many_structs.rs Outdated Show resolved Hide resolved
benches/read_many_structs.rs Outdated Show resolved Hide resolved
src/lazy/encoder/binary/v1_1/value_writer.rs Outdated Show resolved Hide resolved
src/lazy/encoder/text/v1_1/value_writer.rs Show resolved Hide resolved
@@ -127,6 +127,7 @@ impl<'top, D: Decoder> EExpression<'top, D> {
environment = self.new_evaluation_environment()?;
MacroExpansionKind::Template(TemplateExpansion::new(template_ref))
}
MacroKind::ToDo => todo!("system macro {}", invoked_macro.name().unwrap()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Good idea :D

src/lazy/expanded/macro_table.rs Outdated Show resolved Hide resolved
@zslayton zslayton merged commit cff0c7f into main Dec 2, 2024
25 of 33 checks passed
@zslayton zslayton deleted the sys-macro-opcode branch December 2, 2024 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants