cranelift-codegen-meta is slowish to build #1168

glandium · 2020-01-07T08:07:31Z

This is especially a problem when building with opt-level=2, since cargo doesn't know to apply the optimization level only to target code, while cranelift-codegen-meta is host code.

Interestingly, the build time for the cranelift-codegen-meta is dominated by LLVM (80 to 90% of the time), mostly spent between multiple LLVM module passes, LTO passes and codegen passes.

This seems to be mostly caused by the size of shared::instructions::define and shared::legalize::define.

Cc: @alexcrichton

The text was updated successfully, but these errors were encountered:

alexcrichton · 2020-01-07T15:15:15Z

I don't know much about this crate and haven't really analyzed it much before, but massive functions are known to cause perfomance issues in LLVM, mostly because we can't parallelize anything about them. Crates have been found in the past to be 90% dominated by one function, and the compile time of the crate drastically increases when the functions become smaller. If the two functions there are abnormally large the best thing to do for compile times is probably to figure out a way to shrink the functions and let LLVM take care of inlining and such as necessary.

abrown · 2020-01-07T18:23:53Z

Those functions (and maybe also ISA-specific ones like isa::x86::encodings::define) could probably be split up into smaller functions; any suggestions on how to organize the instructions?

bnjbvr · 2020-01-08T12:24:09Z

Fwiw, I've looked into this a bit today. Unfortunately I misread the initial comment and thought that x86::encodings::define was the biggest offender. I started splitting shared::instructions::define though, to get an idea of what it would take.

@glandium How did you find out this function was the biggest offender? I found about rustc -Z self-profile (this, but it gives a very high-level information about which passes took time, and it doesn't seem to work on Nightly right away.

I tried to group x86 encodings by instruction category. Names and groups to be bikeshedded of course, but it seems fairly consistent.

Regarding instruction definitions, things needs to be done carefully there: bindings for typevars and operands are consistently rewritten in the file (which was a shortcut taken when porting the meta code from Python). So I've tried to make it so that the split function wouldn't redefine these bindings, by reducing their lifetimes (through the use of smaller block scopes), and define typevars/operands very closely to the instruction definition. If anybody wanted to pick up this work, they should do the same to clean this technical debt.

Diffing the output of the meta crate generated folder before/after the patch only showed differences of orderings in Encodings, and not a single difference in instructions. It's a bit unfortunate we can't just sort the encodings once they're defined, because they store their index into the containing vector (not sure why...).

This is dumb, repetitive work, but wallclock measurements show a 3% speedup in compile time (with Rust stable), so it's definitely worth continuing. If anybody wants to take the rest of this over, please feel free to do so!

abrown · 2020-01-08T17:43:15Z

Yeah, definitely not fun work. I merged bytecodealliance/cranelift#1322 and here are things that I think remain:

finish separating out shared::instructions::define into alu, fpu, memory, move, simd (others?)
separate out shared legalizations shared::legalize::define
separate out x86 legalizations x86::legalize::define

alexcrichton · 2020-01-10T23:59:00Z

I was digging in to this a bit today, and to give a bit of an idea of what's going on, here's some data for this:

First I executed the following to gather data:

$ cargo +nightly rustc --release -p cranelift-codegen-meta -- -Z self-profile -C save-temps -Z time-passes 2>&1 | tee out.log

That basically compiled this crate with a bunch of extra flags to help debugging later. Next I used the measurme repository's crox tool to generate some data in chrome:

$ ../measureme/target/release/crox ./cranelift_codegen_meta-17489 --collapse-threads --minimum-duration 10

which gave this picture:

From this it's clear that there's 1 CGU which is taking forever. Thread 11 performs initial optimizations and Thread 2 later picks it up for ThinLTO passes. Because it's taking so long your machine is basically sitting idle while it's optimizing that CGU except for one core, which generally isn't great.

Unfortunately I don't know of a great way to go from this graph to which CGU that is. To do that I cross-referenced the graphical timing data with the output of -Ztime-passes which led me to conclude that cranelift_codegen_meta.d20hpyi0-cgu.8 is the offending CGU here. I passed -C save-temps so next I ran find target -name '*cranelift_codegen_meta.d20hpyi0-cgu.8*' and then ran llvm-dis over that file.

Next I had a different tool which I wrote for something else a long time ago which yielded the number of instructions per function in this CGU:

        core::ptr::real_drop_in_place::h8a707b6e0f5fa37b: 27
        core::ptr::real_drop_in_place::he03fad37f2658b56: 69
        core::ptr::real_drop_in_place::h61e657fcd74d9eb5: 87
        core::ptr::real_drop_in_place::hc09e686cac34d126: 188
        cranelift_codegen_meta::shared::instructions::define_control_flow::hb9bff9385b83a94b: 13458
        cranelift_codegen_meta::shared::instructions::define::h304f7623765f3c04: 71636

so clearly the define function is huge! There's only 6 functions in this CGU and it still takes forever to optimize :).

In any case that's at least one way to find offenders for what takes so long in a crate.

Also FWIW, some reasons why instructions::define is likely slow to compile:

It's a 3kloc function, so it's already huge!
There's a lot of local variables, all of which require destructors. There's then a huge number of function calls, each of which needs a landing pad for all the previous destructors. That's a lot of destructors!
Using the builder pattern, while convenient, generates a lot of function calls which introduces more places destructor landing pads need generating.
There's quite a few temporary Vec<T> instances, all of which require destructors as well

Overall it's just a really really big function that LLVM wastes tons of time trying to optimize when in fact it's probably only called once-per-program and there's really no reason to optimize it. The best way to fix it is likely to break it up into many more little functions which should be much more digestable for LLVM.

bjorn3 · 2020-01-11T08:45:38Z

There's then a huge number of function calls, each of which needs a landing pad for all the previous destructors.

Maybe set panic=abort?

bnjbvr · 2020-01-23T21:13:26Z

Thanks for the detailed approach @alexcrichton !
Also I just discovered cargo-llvm-lines, which gives the number of LLVM IR lines per function across all instantiations (for generic functions). It yields the same conclusions effortlessly:

  Lines Copies  Function name
  48459      1  cranelift_codegen_meta::shared::legalize::define
  25984     64  alloc::raw_vec::RawVec<T,A>::reserve_internal
  25333      1  cranelift_codegen_meta::isa::x86::legalize::define
  21762    155  core::iter::traits::iterator::Iterator::try_fold
  17385    182  core::option::Option<T>::map
  15959      1  cranelift_codegen_meta::shared::instructions::define
  15896    945  core::ptr::real_drop_in_place
  13215     81  <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T,I>>::spec_extend
  11749     31  hashbrown::raw::RawTable<T>::rehash_in_place
  10948      1  cranelift_codegen_meta::isa::x86::recipes::define

See bytecodealliance#1168

See #1168

bjorn3 · 2021-10-01T21:21:13Z

Removing the old style backends significantly improved compile tines for the meta crate. If I recall correctly it now only takes like 2s to compile.

alexcrichton transferred this issue from bytecodealliance/cranelift Feb 28, 2020

alexcrichton added cranelift:goal:build-time Reduce the Rustc build time of Cranelift itself cranelift Issues related to the Cranelift code generator labels Feb 28, 2020

arkpar pushed a commit to paritytech/wasmtime that referenced this issue Mar 4, 2020

Fix clippy warnings (bytecodealliance#1168)

1176e4f

abrown added a commit to abrown/wasmtime that referenced this issue Mar 5, 2020

Refactor SIMD legalizations to separate define* function

c92a7fc

See bytecodealliance#1168

abrown mentioned this issue Mar 6, 2020

Refactor SIMD implementation #1246

Merged

abrown added a commit to abrown/wasmtime that referenced this issue Mar 6, 2020

Refactor SIMD legalizations to separate define* function

b3cd294

See bytecodealliance#1168

abrown added a commit to abrown/wasmtime that referenced this issue Mar 6, 2020

Refactor SIMD legalizations to separate define* function

2fc60d3

See bytecodealliance#1168

abrown added a commit that referenced this issue Mar 6, 2020

Refactor SIMD legalizations to separate define* function

442edf5

See #1168

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cranelift-codegen-meta is slowish to build #1168

cranelift-codegen-meta is slowish to build #1168

glandium commented Jan 7, 2020

alexcrichton commented Jan 7, 2020

abrown commented Jan 7, 2020

bnjbvr commented Jan 8, 2020

abrown commented Jan 8, 2020

alexcrichton commented Jan 10, 2020

bjorn3 commented Jan 11, 2020

bnjbvr commented Jan 23, 2020

bjorn3 commented Oct 1, 2021

cranelift-codegen-meta is slowish to build #1168

cranelift-codegen-meta is slowish to build #1168

Comments

glandium commented Jan 7, 2020

alexcrichton commented Jan 7, 2020

abrown commented Jan 7, 2020

bnjbvr commented Jan 8, 2020

abrown commented Jan 8, 2020

alexcrichton commented Jan 10, 2020

bjorn3 commented Jan 11, 2020

bnjbvr commented Jan 23, 2020

bjorn3 commented Oct 1, 2021