
Baked data is big, and compiles slowly, for finely sliced data markers #5230

Open
sffc opened this issue Jul 11, 2024 · 37 comments
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-large Size: A few weeks (larger feature, major refactoring)

Comments

@sffc
Member

sffc commented Jul 11, 2024

icu_datetime compile times have regressed a lot since I added neo datetime data, and #5221 appears to be choking in CI.

The finely sliced data markers (small data structs designed to work with many data marker attributes) give compelling data sizes and stack sizes in Postcard (#4818, #4779). However, in Baked, they significantly increase file size, and the numbers for data size are also not as compelling because baked data includes a lot more pointers (for example, at least 24 bytes for a ZeroVec) which are duplicated for each and every instance of the data struct.

Example data struct that is used in a finely sliced data marker:

pub struct PackedSkeletonDataV1<'data> {
    pub index_info: SkeletonDataIndex,
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub patterns: VarZeroVec<'data, PatternULE>,
}
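
To make the pointer overhead concrete, the per-instance stack cost is easy to measure (a sketch using zerovec's public types; exact numbers depend on the target and crate version):

use zerovec::{VarZeroSlice, VarZeroVec};

fn main() {
    // Metadata that every baked PackedSkeletonDataV1 embeds per instance,
    // over and above the actual pattern bytes:
    println!("{}", core::mem::size_of::<VarZeroVec<'static, str>>());
    // For comparison, a bare &VarZeroSlice is a single fat pointer:
    println!("{}", core::mem::size_of::<&VarZeroSlice<str>>());
}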

Some ideas:

  1. Instead of storing many static instances of PackedSkeletonDataV1<'static>, we could instead store many static instances of (SkeletonDataIndex, &[u8]), and build an instance of PackedSkeletonDataV1<'static> at runtime. This is "free", and it should significantly reduce file size, but it causes us to use a Yoke code path.
  2. Make the struct derive VarULE and store all of the data in a big VarZeroVec<PackedSkeletonDataV1ULE>, and build an instance of PackedSkeletonDataV1<'static> at runtime. This should result in the smallest file size and data size, in line with postcard sizes, but is a bit more of a runtime cost since we need to do a VZV lookup. However, it's only one lookup and only when the locale was found, so I don't think we should try to avoid this cost for the sake of avoiding this cost.
  3. Construct static instances via pub fn PackedSkeletonDataV1::new_unchecked(SkeletonDataIndex, &[u8]), reducing file size and therefore probably compile times without changing any runtime characteristics. See DataBake: split serialized form from runtime form #2452.

@robertbastian @Manishearth @younies

@sffc sffc added C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting discuss-priority Discuss at the next ICU4X meeting labels Jul 11, 2024
@Manishearth
Member

2 sounds compelling if we can make it work cleanly in our baking infra

@Manishearth
Member

potentially have a flag for data keys that marks them as "VZV packable"

@sffc
Member Author

sffc commented Jul 16, 2024

Discussed this briefly with @robertbastian. Some points:

  • Basically the proposal is to store data as immutable slices instead of references to mutable structs
  • ZeroFrom is costly for nested structs. How costly is it for shallow ones?
  • Fully borrowed constructors like those in icu_properties are still possible but are a little bit more intricate because we end up with 3 states of the data struct: yoked, borrowed-from-yoke, and fully-borrowed, where the fully-borrowed type is separate from the data struct type and is the one used internally in baked.

To illustrate that last point:

// Data struct type
#[derive(Clone)]
pub struct ThingV1<'data> {
    pub a: VarZeroVec<'data, str>,
    pub b: VarZeroVec<'data, str>,
}

// Borrowed type
#[derive(Copy, Clone)]
pub(crate) struct ThingV1Borrowed<'data> {
    pub a: &'data VarZeroSlice<str>,
    pub b: &'data VarZeroSlice<str>,
}

// Example top-level owned type
pub struct ThingFormatter {
    payload: DataPayload<ThingV1Marker>
}

// Example top-level borrowed type
pub struct ThingFormatterBorrowed<'data> {
    payload: ThingV1Borrowed<'data>
}

// To get from one to the other
impl ThingFormatter {
    pub fn as_borrowed(&self) -> ThingFormatterBorrowed<'_> {
        ThingFormatterBorrowed {
            payload: self.payload.get().as_borrowed(),
        }
    }
}
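
For completeness, the as_borrowed glue assumed on the data struct above could look like this (a sketch, relying on VarZeroVec dereferencing to VarZeroSlice):

impl<'data> ThingV1<'data> {
    pub(crate) fn as_borrowed(&self) -> ThingV1Borrowed<'_> {
        ThingV1Borrowed {
            // &VarZeroVec<'data, str> derefs to &VarZeroSlice<str>
            a: &self.a,
            b: &self.b,
        }
    }
}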

In the above example, ThingV1Borrowed can always be constructed from borrowed data, so compiled_data constructors that return the borrowed type directly can still work fine.

@robertbastian
Member

  1. (SkeletonDataIndex, &[u8]) is (SkeletonDataIndex, &'data VarZeroSlice<PatternULE>). The generic way to do this would be to have a fully borrowed version of each data struct, instead of always having ZeroVec and Cow at the leaves, which have a cost.
  2. I don't see how this would be done. We have a DataPayload<ExportMarker>, and we can call bake() on the data struct. Then what?
  3. Isn't this the same as (1)?

@sffc
Member Author

sffc commented Jul 16, 2024

You're right about &[u8] and &VarZeroSlice being the same. I'll switch to using &VarZeroSlice.

Maybe 1, 2, 3 are better illustrated with examples:

// Option 0, Current (not exactly, but equivalent):
payloads: [
    PackedSkeletonDataV1 {
        index_info: SkeletonDataIndex::from_raw(...),
        patterns: VarZeroVec::from_bytes_unchecked(...)
    },
    // ...
]


// Option 1:
payloads: [
    (SkeletonDataIndex::from_raw(...), &'static VarZeroSlice::from_bytes_unchecked(...)),
    // ...
]
// at runtime, use ZeroFrom to get the PackedSkeletonDataV1,
// or put it directly into the borrowed struct


// Option 2:
payloads: VarZeroSlice::from_bytes_unchecked(
    // entries are the VarULE repr of PackedSkeletonDataV1
)
// at runtime, use ZeroFrom to get the PackedSkeletonDataV1,
// or put it directly into the borrowed struct


// Option 3:
impl PackedSkeletonDataV1 {
    pub const unsafe fn from_parts(raw_index_info, raw_patterns) -> Self {
        Self {
            index_info: SkeletonDataIndex::from_raw(raw_index_info),
            patterns: VarZeroVec::from_bytes_unchecked(raw_patterns),
        }
    }
}
payloads: [
    PackedSkeletonDataV1::from_parts(..., ...)
]
// identical runtime characteristics to the current implementation

@sffc
Member Author

sffc commented Jul 16, 2024

Actually, in most constructors we are already using ZeroFrom, so I'm not convinced that option 1 is actually a regression from the status quo. I think it's basically equivalent, we just store &'static VarZeroSlice instead of VarZeroVec<'static> which is 1-2 words smaller.

@robertbastian
Member

The baked provider used to use ZeroFrom, but it hasn't since we added DataPayloadInner::StaticRef, a change you can detect in benchmarks.

// correction: currently we have a slice of structs, not an array of struct refs 
payloads: &'static [
    PackedSkeletonDataV1 {
        index_info: SkeletonDataIndex::from_raw(...),
        patterns: VarZeroVec::from_bytes_unchecked(...)
    },
    // ...
]

Option 3 is equivalent to option 1, because payloads is stored in a const. from_parts will be evaluated at compile time, and the data struct stored in the const. If we instead stored the borrowed version of the data struct in the const, we're back at option 1.

@sffc
Member Author

sffc commented Jul 16, 2024

Yes, I briefly forgot about DataPayloadInner::StaticRef. So options 0 and 1 are different, as claimed in the OP.

Option 3 is equivalent to the current solution, option 0 (not option 1), except for file size being slightly smaller.

@sffc
Member Author

sffc commented Jul 16, 2024

A more radical solution (bigger change but maybe better outcome) would be to add it to DynamicDataMarker

pub trait DynamicDataMarker {
    type Borrowed<'a>;
    // not sure if this is the right syntax but you get the idea:
    type Yokeable: for<'a> Yokeable<'a> + for<'a> ZeroFrom<'a, Self::Borrowed<'a>>;
}

@robertbastian
Member

This^ is the only implementation path I see for option 1, what alternative were you thinking of?

@sffc
Member Author

sffc commented Jul 16, 2024

The other way to implement option 1 would be to have a databake derive attribute that defines how to get from the static representation to the struct, which we could do as part of #2452. baked_exporter uses it when available, and otherwise serializes the struct directly.

@Manishearth
Member

One thing databake could potentially do is have a way for the CrateEnv to collect auxiliary codegen, which would work by:

  • CrateEnv gets a map you can append to; each entry has some form of key, and its value holds a list of things plus a "flush" function that produces additional const data
  • individual databake impls that need to emit auxiliary, not-per-value data append to the appropriate entry in this map, and then use a constant that references the data that will be generated.

Something like:

struct PackedDataSkeletonV1PatternsAux {
    patterns: Vec<_>,
}

fn bake(&self, env: &CrateEnv) -> TokenStream2 {
    // This uses an anymap or something
    let map = env.get_or_insert::<PackedDataSkeletonV1PatternsAux>(
        PackedDataSkeletonV1PatternsAux::default(),
        // The flush function. This is just an example.
        |aux| quote!(const ALL_THE_PATTERNS = #patterns),
    );
    let start = map.patterns.len();
    map.patterns.extend(self.patterns);
    let end = map.patterns.len();
    quote!(PackedSkeletonDataV1 {
        // ...
        patterns: ALL_THE_PATTERNS[#start..#end],
    })
}

This still needs some way to make the types work; the example above doesn't attempt to address that. But it could help for tricks like "store all of the data in a big VarZeroVec<PackedSkeletonDataV1ULE>".

@sffc sffc added this to the ICU4X 2.0 ⟨P1⟩ milestone Jul 23, 2024
@sffc
Member Author

sffc commented Jul 23, 2024

This is not 2.0 blocking unless we implement a breaking solution for #5187

@sffc sffc changed the title Baked data is big and slow for finely sliced data markers Baked data is big and slow at compile time for finely sliced data markers Jul 25, 2024
@sffc sffc changed the title Baked data is big and slow at compile time for finely sliced data markers Baked data is big and slow for finely sliced data markers Jul 25, 2024
@sffc sffc changed the title Baked data is big and slow for finely sliced data markers Baked data is big, and compiles slowly, for finely sliced data markers Jul 25, 2024
@sffc
Member Author

sffc commented Jul 25, 2024

  • @sffc - Broke in Rust nightly
  • @robertbastian - Not convinced current suggestions will improve compile times
  • @sffc - Option 2 (build PackedSkeletonDataV1 at runtime) would produce the smallest code and probably the fastest compile times
  • @robertbastian - Moves the waiting from compile time to runtime. Want to avoid runtime intensive work
  • @sffc - Long compile times are a result of components/datetime/src/provider/mod.rs, which grabs all the baked data at once.
  • @robertbastian - Could be inefficiencies in how we use macros
  • @sffc - Theory that a lot of the stress is on the lexer. If we make the AST thinner, that could reduce compile time. Option 3 (construction via PackedSkeletonDataV1::new_unchecked(SkeletonDataIndex, &[u8])) reduces that scale
  • @robertbastian - Not sure if moving to const evaluation will be faster. Okay with trying, but I want to ask the Rust compiler team how we could best compile this data.
  • @sffc - Another alternative (illustrated in comments below options) is splitting up the AST to make it simpler to parse (ThingV1 example mixed with pub trait DynamicDataMarker {type Borrowed<'a>; type Yokeable: for<'a> Yokeable<'a> + ZeroFrom<Self::Borrowed<'a>>;})
  • @sffc We can do this in a non-breaking way. Option 3 moves work from source code lexing to const evaluation. Options 1 and 2 can use DataPayload::from_owned instead of DataPayload::from_static_ref, which increases runtime cost, but that can be something we weigh and measure.

(additional discussion not recorded)

Current thinking:

  • Produce a standalone crate reproducing this, such as by copying in a lot of ICU4X skeleton data
  • Ask someone who knows more, such as a Rust compiler expert, what we can do to make the crate compile faster
  • Optional: Try implementing the options to measure the impact
  • Should not block 2.0 because:
    1. This would be a lot of time and effort
    2. There may be paths to change this in non-breaking ways
    3. Maybe the compiler will get better on its own

LGTM: @sffc @robertbastian

@sffc sffc added S-large Size: A few weeks (larger feature, major refactoring) and removed discuss Discuss at a future ICU4X-SC meeting discuss-priority Discuss at the next ICU4X meeting labels Jul 25, 2024
@Manishearth
Member

LGTM as well

@sffc
Member Author

sffc commented Jul 29, 2024

To add some urgency to this issue, @kartva says:

Compiling icu_experimental on main causes rustc to reliably crash on my laptop due to out-of-memory errors. Are there servers or codespaces that people working on icu4x can use?

(icu_experimental and icu_datetime are the two crates with the most finely sliced baked data)

@sffc
Member Author

sffc commented Jul 30, 2024

Building the latest icu_experimental data:

	Command being timed: "cargo build"
	User time (seconds): 30.57
	System time (seconds): 8.20
	Percent of CPU this job got: 100%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:38.41
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 16492072
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 383
	Minor (reclaiming a frame) page faults: 4667399
	Voluntary context switches: 1181
	Involuntary context switches: 529
	Swaps: 0
	File system inputs: 83232
	File system outputs: 752960
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

@sffc
Member Author

sffc commented Jul 30, 2024

Here are the figures for impl_units_display_name_v1_marker!(Baked); by itself:

	Command being timed: "cargo build"
	User time (seconds): 18.72
	System time (seconds): 5.43
	Percent of CPU this job got: 100%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.14
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 16230900
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 225
	Minor (reclaiming a frame) page faults: 4573824
	Voluntary context switches: 693
	Involuntary context switches: 726
	Swaps: 0
	File system inputs: 26344
	File system outputs: 322536
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

That's a big share so I'll focus on reproducing just this in isolation.

@sffc
Member Author

sffc commented Jul 31, 2024

Here's with changing the paths to be imported instead of absolute:

	Command being timed: "cargo build"
	User time (seconds): 17.70
	System time (seconds): 5.07
	Percent of CPU this job got: 100%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.60
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 16218176
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 4571559
	Voluntary context switches: 434
	Involuntary context switches: 158
	Swaps: 0
	File system inputs: 0
	File system outputs: 403136
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

And here's with using a const constructor:

	Command being timed: "cargo build"
	User time (seconds): 7.46
	System time (seconds): 1.50
	Percent of CPU this job got: 100%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.88
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3983060
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1208393
	Voluntary context switches: 352
	Involuntary context switches: 59
	Swaps: 0
	File system inputs: 0
	File system outputs: 216696
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

And a helper function:

	Command being timed: "cargo build"
	User time (seconds): 7.71
	System time (seconds): 1.59
	Percent of CPU this job got: 100%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.21
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3947592
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1198126
	Voluntary context switches: 357
	Involuntary context switches: 269
	Swaps: 0
	File system inputs: 0
	File system outputs: 194544
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

include_bytes! for the trie data:

	Command being timed: "cargo build"
	User time (seconds): 7.26
	System time (seconds): 1.54
	Percent of CPU this job got: 100%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.73
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 3961872
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1202201
	Voluntary context switches: 390
	Involuntary context switches: 65
	Swaps: 0
	File system inputs: 0
	File system outputs: 193984
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

@sffc
Member Author

sffc commented Jul 31, 2024

In tabular form (each row builds on the one above it):

| Scenario | File Size (bytes) | User Time (s) | Clock Time (s) | Max RSS (kB) |
|---|---|---|---|---|
| Full Experimental Data | N/A | 30.57 | 38.41 | 16492072 |
| Unit Names Only | 13070690 | 18.72 | 24.14 | 16230900 |
| With Imports | 11301364 | 17.70 | 22.60 | 16218176 |
| With Const Constructor | 5038254 | 7.46 | 8.88 | 3983060 |
| With Helper Wrapping Constructor | 3923658 | 7.63 | 9.00 | 3953796 |
| With Helper Directly Building | 3923875 | 7.71 | 9.21 | 3947592 |
| With include_bytes! trie | 3042840 | 7.26 | 8.73 | 3961872 |

So there seems to be a pretty strong correlation between file size, compile time, and memory usage, except in the last few rows, where the wrapper functions reduce file size but don't seem to move the other metrics. include_bytes! doesn't seem to have a big impact, at least not for the big trie byte string.

@Manishearth
Member

Do we know which compiler pass is actually slow? (-Z time-passes, I believe)

Would be useful to compare that output to that of a normal utils crate and see where the big differences are.

@sffc
Member Author

sffc commented Sep 11, 2024

Revisiting this since we're getting exit code 143 errors again in CI.

Here's where we're currently at on main:

$ cargo clean; /usr/bin/time -v cargo +nightly build -p icu_experimental --all-features
...
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1m 00s
	Command being timed: "cargo +nightly build -p icu_experimental --all-features"
	User time (seconds): 63.19
	System time (seconds): 18.96
	Percent of CPU this job got: 135%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00.51
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 15888616
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 395
	Minor (reclaiming a frame) page faults: 6703629
	Voluntary context switches: 27185
	Involuntary context switches: 2375
	Swaps: 0
	File system inputs: 33184
	File system outputs: 2091712
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

I ran it with -Z time-passes:

   Compiling icu_experimental v0.1.0 (/usr/local/google/home/sffc/projects/icu4x/components/experimental)
time:   0.000; rss:   47MB ->   48MB (   +1MB)	parse_crate
time:   0.000; rss:   48MB ->   48MB (   +0MB)	incr_comp_prepare_session_directory
time:   0.000; rss:   48MB ->   49MB (   +1MB)	setup_global_ctxt
time:   0.000; rss:   51MB ->   51MB (   +0MB)	crate_injection
time:   1.013; rss:   51MB -> 1182MB (+1130MB)	expand_crate
time:   1.013; rss:   51MB -> 1182MB (+1131MB)	macro_expand_crate
time:   0.000; rss: 1182MB -> 1182MB (   +0MB)	maybe_building_test_harness
time:   0.028; rss: 1182MB -> 1182MB (   +0MB)	AST_validation
time:   0.001; rss: 1182MB -> 1182MB (   +0MB)	finalize_imports
time:   0.004; rss: 1182MB -> 1182MB (   +0MB)	finalize_macro_resolutions
time:   0.479; rss: 1182MB -> 1353MB ( +170MB)	late_resolve_crate
time:   0.023; rss: 1353MB -> 1353MB (   +0MB)	resolve_check_unused
time:   0.044; rss: 1353MB -> 1353MB (   +0MB)	resolve_postprocess
time:   0.551; rss: 1182MB -> 1353MB ( +171MB)	resolve_crate
time:   0.025; rss: 1273MB -> 1273MB (   +0MB)	write_dep_info
time:   0.020; rss: 1273MB -> 1273MB (   +0MB)	complete_gated_feature_checking
time:   0.072; rss: 1612MB -> 1456MB ( -156MB)	drop_ast
time:   1.630; rss: 1273MB -> 1307MB (  +34MB)	looking_for_derive_registrar
time:   1.898; rss: 1273MB -> 1308MB (  +35MB)	misc_checking_1
time:   0.274; rss: 1309MB -> 1331MB (  +22MB)	coherence_checking
time:   6.609; rss: 1308MB -> 1437MB ( +128MB)	type_check_crate
time:  36.802; rss: 1437MB -> 2248MB ( +811MB)	MIR_borrow_checking
time:   2.214; rss: 2248MB -> 2270MB (  +22MB)	MIR_effect_checking
time:   0.130; rss: 2270MB -> 2283MB (  +14MB)	module_lints
time:   0.130; rss: 2270MB -> 2283MB (  +14MB)	lint_checking
time:   0.253; rss: 2283MB -> 2283MB (   +0MB)	privacy_checking_modules
time:   0.061; rss: 2283MB -> 2277MB (   -6MB)	check_lint_expectations
time:   0.526; rss: 2270MB -> 2277MB (   +8MB)	misc_checking_3
time:   0.191; rss: 2301MB -> 2312MB (  +10MB)	monomorphization_collector_graph_walk
time:   0.033; rss: 2312MB -> 2312MB (   +1MB)	partition_and_assert_distinct_symbols
time:   0.587; rss: 2277MB -> 2294MB (  +16MB)	generate_crate_metadata
time:   1.064; rss: 2296MB -> 2320MB (  +24MB)	codegen_to_LLVM_IR
time:   3.448; rss: 2310MB -> 2320MB (  +11MB)	LLVM_passes
time:   3.540; rss: 2294MB -> 2319MB (  +26MB)	codegen_crate
time:   0.000; rss: 2319MB -> 2319MB (   +0MB)	check_dirty_clean
time:   0.358; rss: 2319MB -> 2323MB (   +4MB)	encode_query_results
time:   0.391; rss: 2319MB -> 2318MB (   -1MB)	incr_comp_serialize_result_cache
time:   0.391; rss: 2319MB -> 2318MB (   -1MB)	incr_comp_persist_result_cache
time:   0.392; rss: 2319MB -> 2318MB (   -1MB)	serialize_dep_graph
time:   0.101; rss: 2318MB -> 1314MB (-1004MB)	free_global_ctxt
time:   0.007; rss: 1314MB -> 1314MB (   +0MB)	copy_all_cgu_workproducts_to_incr_comp_cache_dir
time:   0.007; rss: 1314MB -> 1314MB (   +0MB)	finish_ongoing_codegen
time:   0.151; rss: 1288MB -> 1334MB (  +46MB)	link_rlib
time:   0.167; rss: 1288MB -> 1334MB (  +46MB)	link_binary
time:   0.170; rss: 1288MB -> 1279MB (   -9MB)	link_crate
time:   0.179; rss: 1314MB -> 1279MB (  -35MB)	link
time:  54.585; rss:   32MB ->  224MB ( +192MB)	total

Relative to what other crates do, the following steps are the biggest:

  • expand_crate
  • macro_expand_crate
  • late_resolve_crate
  • resolve_crate
  • MIR_borrow_checking

Of these, MIR_borrow_checking is by far the slowest relative to other crates; expand_crate and macro_expand_crate add the most to the memory usage.

Reminder from my previous exploration: adding a const constructor instead of doing raw struct construction did reduce compile times and peak memory usage (this was before I was measuring the rustc phase breakdown). This is consistent with @Manishearth's interpretation of the above figures.

@sffc
Copy link
Member Author

sffc commented Sep 12, 2024

I made a reproducible, standalone case here:

https://github.com/sffc/icu4x_compile_sample/tree/standalone

You can also see some of my previous work on making changes in the July 30 commits on the main branch:

https://github.com/sffc/icu4x_compile_sample/commits/main/

@sffc
Member Author

sffc commented Sep 13, 2024

I added const constructors where they were most impactful in #5541.

I don't consider this a long-term solution because:

  1. Custom bake impls are unsafe and should generally be avoided
  2. The compile times are cut approximately in half, but they are still too big

But, for the short term, it should make CI less flaky.

@Manishearth
Member

Filed rust-lang/rust#134404, with reduced testcase https://github.com/Manishearth/icu4x_compile_sample. Further reduced testcase in rm-macro that shows the difference when you avoid the macro step.

@sffc
Member Author

sffc commented Dec 17, 2024

I still think options 1 and 2 are on the table, at least for keys that would benefit from it. ZeroFrom has a runtime cost, but I don't anticipate it being very large relative to the overall application.

  • Finely sliced data markers aren't using the infallible static struct constructors; they are using the binary search or ZeroTrie lookup, which takes time of its own.
  • Postcard already uses this code path.
  • We can introduce Borrowed formatters if we really need to juice the extra few nanoseconds (Should compiled data constructors be on the owned or borrowed type? #5440).

Note: option 1 could store the data in a VarZeroVec instead of a tuple slice to reduce the number of tokens.

In case it was lost earlier, I want to emphasize that options 1 and 2 have the added benefit of reducing compiled baked data size (#5429) by not putting the struct stacks into the binary.

@Manishearth
Member

I agree that they are still on the table. Option 2 doesn't sound too bad. We could have toggles for data loading in the baked system that can produce baked data via alternate means (not just calling Bake per-entry).

The idea of baked was to avoid deserialization as much as possible; that does not mean we should avoid it completely. A couple of VarULEs here and there that get poked during data loading isn't terrible, and the format is good at random access.
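
For intuition, a VarZeroSlice lookup is an index read plus a slice of the byte buffer, not a scan. A sketch using zerovec's public API:

use zerovec::VarZeroVec;

fn main() {
    let strings = vec!["abc", "de", "fghi"];
    let vzv: VarZeroVec<str> = VarZeroVec::from(&strings);
    // Random access is O(1): read two offsets, then slice the data segment.
    assert_eq!(vzv.get(2), Some("fghi"));
}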

@sffc
Member Author

sffc commented Dec 17, 2024

Thought: if we want to avoid the yoke, we could add a third DataPayload variant with a 'static bound. The variants would be something like:

enum DataPayloadInner<M> {
    Yoke(Yoke<M::Yokeable, Option<Rc<[u8]>>>),
    StaticRef(&'static M::Yokeable),
    StaticOwned(M::Yokeable), // NEW
}

StaticOwned can be constructed from static borrowed data. It doesn't have to own any heap memory.

Not sure if this is better than just using the Yoke with an empty cart.

@Manishearth
Member

Another idea I had was that the Baked context could grow a helper method that shortens imports.

Basically instead of generating

// thousands of times
icu::experimental::dimension::provider::units::UnitsDisplayNameV1 { patterns: icu::experimental::relativetime::provider::PluralPatterns { ... } }

we generate

use icu::experimental::dimension::provider::units::UnitsDisplayNameV1;
use icu::experimental::relativetime::provider::PluralPatterns;

// thousands of times
UnitsDisplayNameV1 { patterns: PluralPatterns { ... }}

when generating a path, instead of doing quote!(icu::experimental::dimension::provider::units::UnitsDisplayNameV1), you'd do ctx.generate_path("icu::experimental::dimension::provider::units", "UnitsDisplayNameV1"), and it appends to a list of imports. If that name is already present in that list with a different path, the full path is used, first-come-first-serve. Not ideal, but not too complicated, and not hard to implement.
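
A minimal sketch of what that helper could look like (Context and generate_path are hypothetical names, not existing databake API; it returns a String rather than tokens for brevity):

use std::collections::BTreeMap;

#[derive(Default)]
pub struct Context {
    // short name -> full path, first-come-first-serve
    imports: BTreeMap<String, String>,
}

impl Context {
    /// Returns the path to use for a type: the short name if this context
    /// can import it, or the full path if the short name is already taken.
    pub fn generate_path(&mut self, module: &str, name: &str) -> String {
        let full = format!("{module}::{name}");
        if let Some(existing) = self.imports.get(name) {
            if *existing != full {
                // Collision with a different path: spell out the full path.
                return full;
            }
        } else {
            self.imports.insert(name.to_owned(), full);
        }
        name.to_owned()
    }

    /// The `use` items to emit once at the top of the generated file.
    pub fn use_items(&self) -> impl Iterator<Item = String> + '_ {
        self.imports.values().map(|full| format!("use {full};"))
    }
}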

A wrinkle is that Rust doesn't like duplicate imports of the same name in the same file, and these are macro generated, so we have to either:

  • have the macro shove these impls into a temporary module (sketched after this list)
  • have a separate "imports" macro that is generated once for the entire data crate
  • ??? there are probably other tricks
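
The temporary-module trick could look roughly like this (a self-contained sketch; Foo and Bar are placeholders for real data structs like UnitsDisplayNameV1):

pub struct Foo { pub bar: Bar }
pub struct Bar { pub x: u8 }

// Each macro expansion wraps its `use` items in its own module, so the same
// import can appear in many expansions without colliding at file scope.
mod baked_foo_v1 {
    use super::{Bar, Foo};

    pub(super) static DATA: &[Foo] = &[
        Foo { bar: Bar { x: 1 } },
        Foo { bar: Bar { x: 2 } },
    ];
}

pub fn get(i: usize) -> Option<&'static Foo> {
    baked_foo_v1::DATA.get(i)
}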

What I've noticed so far is that:

@sffc
Member Author

sffc commented Dec 17, 2024

The idea of baked was to avoid deserialization as much as possible, this does not mean we should avoid it completely

It's interesting, and somewhat accurate but with caveats, to consider options 1 and 2 to be deserialization.

I generally think of deserialization as a fallible operation that checks a bunch of invariants and unpacks a structure with lots of hopping around. Zero-copy deserialization does this but without allocating memory.

What I'm talking about here is basically:

struct Message<'data> { a: &'data str, b: &'data str }

// Current:
const BAKED_MESSAGE: &'static Message<'static> = &Message { a: "a", b: "b" };

// Proposed, option 1 with tuple slice:
const MESSAGE_STRS: &'static [(&'static str, &'static str)] = &[("a", "b")];
const fn get_message(i: usize) -> Message<'static> {
    let (a, b) = MESSAGE_STRS[i];
    Message { a, b }
}

// Proposed, option 1 with VZV:
// not sure if this code works; could also be written as a byte string literal
const MESSAGE_STRS: &'static VarZeroSlice<VarTuple2ULE<str, str>> = var_zero_slice![("a", "b")];
fn get_message(i: usize) -> Message<'static> {
    let tpl = MESSAGE_STRS.get(i).unwrap();
    Message { a: tpl.get_0(), b: tpl.get_1() }
}

Do we consider get_message to be "deserialization"? Not sure.

@Manishearth
Member

(To be clear, my note about "avoid deserialization" was not a disagreement)

Zero-copy deserialization does this but without allocating memory

Yep, though even with fully trusted data a thing that zero-copy deserialization has to do (that baked typically does not) is figuring out where different pieces of data start and end, and that's the real benefit of baked over an unsafe deserialization framework.

Do we consider get_message to be "deserialization"? Not sure.

I think get_message is doing the same kind of work that deserialization does, regardless of what we call it. I think it's fine for us to do this judiciously in baked data. Having a mode for the baked provider to bake a data blob as a single vec rather than as multiple pieces of data would be quite neat.

@sffc
Member Author

sffc commented Dec 17, 2024

Another idea I had was that the Baked context could grow a helper method that shortens imports.

...

If that name is already present in that list with a different path, the full path is used, first-come-first-serve. Not ideal, but not too complicated, and not hard to implement.

Or, the context can always just generate unique names, like a JS minifier. A, B, C, ..., AA, AB, ...
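
Such a generator is only a few lines (a sketch; bijective base-26, like spreadsheet column names):

/// Yields A, B, ..., Z, AA, AB, ... for i = 0, 1, 2, ...
fn short_name(mut i: usize) -> String {
    let mut out = Vec::new();
    loop {
        out.push(b'A' + (i % 26) as u8);
        if i < 26 {
            break;
        }
        i = i / 26 - 1;
    }
    out.reverse();
    String::from_utf8(out).unwrap()
}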

What I've noticed so far is that:

I got more like 5-10% on shortening the imports in #5230 (comment) (measured in a slightly different context)

@Manishearth
Member

Or, the context can always just generate unique names, like a JS minifier. A, B, C, ..., AA, AB, ...

Oh, good point. I'd like it to be readable, though, so perhaps doing stuff like UnitsDisplayNameV1, UnitsDisplayNameV1_ etc.

I got more like 5-10% on shortening the imports in #5230 (comment) (measured in a slightly different context)

Sounds good. I didn't look too closely at the numbers; the expand_crate ones are the most interesting to me, since token/AST RAM usage seems to be part of the issue.

@sffc
Member Author

sffc commented Feb 16, 2025

See additional discussion in #3260 (comment)

robertbastian pushed a commit that referenced this issue Feb 17, 2025
Toward #5230

Reduces the diff in #6133

Not intended to be controversial.

Autosubmit = true
@sffc
Member Author

sffc commented Feb 19, 2025

Follow-up tasks from #6133:

  • It would be better for maybe_encode_as_varule to return Box<VarULE> to increase flexibility; for example, if a struct is (usize, Cow<str>), it seems fine that the VarULE could be VarTuple even if the usize is stored on the stack. However, doing this requires wiring up alloc features in datagen mode since we need to return Boxes.
  • In general, MaybeEncodeAsVarULE should be gated by feature = "datagen" where applicable.

Both of these are best done after we migrate clients away from icu_provider_macros.

@Manishearth
Member

Manishearth commented Feb 19, 2025

  • It would be better for maybe_encode_as_varule to return Box<VarULE> to increase flexibility; for example, if a struct is (usize, Cow<str>), it seems fine that the VarULE could be VarTuple even if the usize is stored on the stack. However, doing this requires wiring up alloc features in datagen mode since we need to return Boxes.

Is that not just encode_var_ule_as_box? I don't see anything wrong with (usize, Cow<str>) encoding to VarTuple.

That's a compelling argument for not using EncodeAsVarULE here and instead using a different trait though (in opposition to my previous comment): It's nice to be able to customize what it encodes to. So I retract my objection to MaybeEncodeAsVar, though I'd still like to keep ZeroFrom around.

(I would still prefer the trait to be named so that it is clear that it is about this optimization, but that's not so important to figure out now)
