
High memory usage compiling keccak benchmark #54208

Closed
nnethercote opened this issue Sep 13, 2018 · 12 comments
Labels
A-NLL Area: Non-lexical lifetimes (NLL) I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. I-slow Issue: Problems and improvements with respect to performance of generated code. NLL-performant Working towards the "performance is good" goal P-medium Medium priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-nll Working group: Non-lexical lifetimes

Comments

@nnethercote
Contributor

nnethercote commented Sep 13, 2018

According to perf.rust-lang.org, a "Clean" build of keccak-check has a max-rss of 637 MB. Here's a Massif profile of the heap memory usage.
[Massif profile: keccak-clean]

The spike is due to a single allocation of 500,363,244 bytes here:

users: vec![invalid_users(); num_live_nodes * num_vars],

Each vector element is a Users, which is a three-field struct taking up 12 bytes. num_live_nodes is 16,371, and num_vars is 2,547, and 12 * 16,371 * 2,547 = 500,363,244.

I have one idea to improve this: Users is a triple containing two u32s and a bool, which means that it is 12 bytes even though it only contains 9 bytes of data. If we split it up so we have 3 vectors instead of a vector of triples, we'd end up with 4 * 16,371 * 2,547 + 4 * 16,371 * 2,547 + 1 * 16,371 * 2,547 = 375,272,433 bytes, which is a reduction of 125,090,811 bytes. This would get max-rss down from 637MB to 512MB, a reduction of 20%.

Alternatively, if we packed the bools into a bitset we could get it down to 338,787,613 bytes, which is a reduction of 161,575,631 bytes. This would get max-rss down from 637MB to 476MB, a reduction of 25%. But it might slow things down... it depends on whether the improved locality outweighs the extra instructions needed for bit manipulation.
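To make the padding and the arithmetic above concrete, here is a small self-contained sketch. The field names of `Users` are made up for illustration; only the two-u32s-plus-a-bool layout and the node/var counts are taken from the comments above (the packed-bitset figure uses truncating division, matching the number quoted):

```rust
use std::mem::size_of;

// Hypothetical layout of the `Users` triple; real field names in
// rustc's liveness code may differ.
#[derive(Clone, Copy)]
pub struct Users {
    pub reader: u32,
    pub writer: u32,
    pub used: bool,
}

// Bytes for a single Vec<Users> (array-of-structs).
pub fn aos_bytes(elems: usize) -> usize {
    elems * size_of::<Users>()
}

// Bytes for three parallel vectors (struct-of-arrays):
// Vec<u32> + Vec<u32> + Vec<bool>.
pub fn soa_bytes(elems: usize) -> usize {
    elems * 4 + elems * 4 + elems
}

// As above, but with the bools packed one per bit.
pub fn packed_bytes(elems: usize) -> usize {
    elems * 8 + elems / 8
}

fn main() {
    // 9 bytes of data (4 + 4 + 1) pad out to 12 under default alignment.
    assert_eq!(size_of::<Users>(), 12);

    let elems = 16_371 * 2_547; // num_live_nodes * num_vars
    println!("aos    = {}", aos_bytes(elems));    // 500,363,244
    println!("soa    = {}", soa_bytes(elems));    // 375,272,433
    println!("packed = {}", packed_bytes(elems)); // 338,787,613
}
```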

@nikomatsakis: do you have any ideas for improving this on the algorithmic side? Is this dense num_live_nodes * num_vars representation avoidable?

@nnethercote
Contributor Author

Now for NLL. According to perf.rust-lang.org, an "Nll" build of keccak-check has a max-rss of 1239MB. Here's a Massif profile of the heap usage:
[Massif profile: keccak-nll]
The 500MB Liveness spike is still visible, but it is outweighed by a later plateau that is dominated by three allocation sites that allocate 308,588,896, 308,588,896, and 270,743,088 bytes, respectively.

The three allocation sites are here:

let mut flow_inits = FlowAtLocation::new(do_dataflow(
    tcx,
    mir,
    id,
    &attributes,
    &dead_unwinds,
    MaybeInitializedPlaces::new(tcx, mir, &mdpe),
    |bd, i| DebugFormatted::new(&bd.move_data().move_paths[i]),
));
let flow_uninits = FlowAtLocation::new(do_dataflow(
    tcx,
    mir,
    id,
    &attributes,
    &dead_unwinds,
    MaybeUninitializedPlaces::new(tcx, mir, &mdpe),
    |bd, i| DebugFormatted::new(&bd.move_data().move_paths[i]),
));
let flow_ever_inits = FlowAtLocation::new(do_dataflow(
    tcx,
    mir,
    id,
    &attributes,
    &dead_unwinds,
    EverInitializedPlaces::new(tcx, mir, &mdpe),
    |bd, i| DebugFormatted::new(&bd.move_data().inits[i]),
));

Each do_dataflow() call ends up here:

vec![IdxSet::new_empty(bits_per_block); num_blocks]

In each case num_blocks is 25,994, and bits_per_block is 94,972 in the first two and 83,308 in the third.
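For a sense of scale, the allocation can be estimated with a sketch like the one below, assuming each IdxSet rounds its bit count up to whole 64-bit words (the word-rounded totals shown later in this thread suggest it does). The estimates land within one block's worth of the quoted allocation sizes:

```rust
// Estimate the size of `vec![IdxSet::new_empty(bits_per_block); num_blocks]`,
// assuming each set stores its bits in 64-bit words.
pub fn dataflow_bytes(num_blocks: usize, bits_per_block: usize) -> usize {
    let words_per_block = (bits_per_block + 63) / 64; // round up to whole words
    num_blocks * words_per_block * 8
}

fn main() {
    // flow_inits / flow_uninits: ~308.6 MB each
    println!("{}", dataflow_bytes(25_994, 94_972));
    // flow_ever_inits: ~270.8 MB
    println!("{}", dataflow_bytes(25_994, 83_308));
}
```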

I tried changing on_entry_sets to a HybridIdxSet but it didn't help, because it turns out these bitsets end up with many bits set, so HybridIdxSet just switches to the dense representation anyway.
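For readers unfamiliar with the hybrid approach, here is a minimal sketch of the idea (illustrative names, not rustc's actual HybridIdxSet API): indices are stored sparsely up to a small threshold, after which the set converts itself to a dense bit vector. With the high bit densities measured above, nearly every set crosses the threshold and pays the full dense cost anyway:

```rust
const SPARSE_MAX: usize = 8; // threshold before falling back to dense

enum HybridSet {
    Sparse { domain: usize, elems: Vec<usize> }, // a few element indices
    Dense { words: Vec<u64> },                   // one bit per element
}

impl HybridSet {
    fn new(domain: usize) -> Self {
        HybridSet::Sparse { domain, elems: Vec::new() }
    }

    fn insert(&mut self, i: usize) {
        match self {
            HybridSet::Sparse { domain, elems } => {
                if elems.contains(&i) {
                    return;
                }
                if elems.len() < SPARSE_MAX {
                    elems.push(i);
                } else {
                    // Too many elements: switch to the dense representation.
                    let mut words = vec![0u64; (*domain + 63) / 64];
                    for &e in elems.iter() {
                        words[e / 64] |= 1u64 << (e % 64);
                    }
                    words[i / 64] |= 1u64 << (i % 64);
                    *self = HybridSet::Dense { words };
                }
            }
            HybridSet::Dense { words } => {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
    }

    fn contains(&self, i: usize) -> bool {
        match self {
            HybridSet::Sparse { elems, .. } => elems.contains(&i),
            HybridSet::Dense { words } => words[i / 64] & (1u64 << (i % 64)) != 0,
        }
    }

    fn is_dense(&self) -> bool {
        matches!(self, HybridSet::Dense { .. })
    }
}

fn main() {
    let mut s = HybridSet::new(94_976);
    for i in 0..9 {
        s.insert(i * 100);
    }
    // The ninth insertion exceeded SPARSE_MAX, so the set is already dense.
    assert!(s.is_dense());
    assert!(s.contains(800));
}
```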

One trivial idea: it looks like flow_inits doesn't need to be live at the same time as flow_uninits and flow_ever_inits. If that's right, we should be able to reduce the peak by 308MB, from 1239MB to 931MB, a 25% reduction.
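The lifetime-splitting idea can be sketched with placeholder allocations rather than the real dataflow results: whatever is needed from flow_inits is computed first, the structure is dropped, and only then are the other two results materialized, so all three allocations are never live at once:

```rust
// Illustrative only: the Vecs stand in for the FlowAtLocation results.
pub fn run_phases() -> (u64, u64) {
    // Phase 1: flow_inits is only needed early on.
    let flow_inits = vec![1u64; 1 << 20]; // ~8 MB placeholder
    let early: u64 = flow_inits.iter().sum();
    drop(flow_inits); // freed before the next two allocations, lowering the peak

    // Phase 2: only these two are live from here on.
    let flow_uninits = vec![0u64; 1 << 20];
    let flow_ever_inits = vec![0u64; 1 << 20];
    let later: u64 = flow_uninits.iter().chain(flow_ever_inits.iter()).sum();
    (early, later)
}

fn main() {
    let (early, later) = run_phases();
    println!("early = {}, later = {}", early, later);
}
```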

@nikomatsakis: any other thoughts here from the algorithmic side?

@estebank estebank added I-slow Issue: Problems and improvements with respect to performance of generated code. A-NLL Area: Non-lexical lifetimes (NLL) WG-compiler-middle labels Sep 14, 2018
@nnethercote
Contributor Author

I have one idea to improve this: Users is a triple containing two u32s and a bool, which means that it is 12 bytes even though it only contains 9 bytes of data. If we split it up so we have 3 vectors instead of a vector of triples,

I have implemented this in #54211.

@nnethercote
Contributor Author

One trivial idea: it looks like flow_inits doesn't need to be live at the same time as flow_uninits and flow_ever_inits. If that's right, we should be able to reduce the peak by 308MB, from 1239MB to 931MB, a 25% reduction.

I have implemented this in #54213.

@nnethercote
Contributor Author

#54420 improves the non-NLL case some more.

@matthewjasper matthewjasper added the NLL-performant Working towards the "performance is good" goal label Sep 22, 2018
@nnethercote
Contributor Author

#54420 improves the non-NLL case some more.

Because of this, the NLL:non-NLL ratio for max-rss has worsened, to 269%, i.e. NLL uses 2.69x more memory.

@nikomatsakis
Contributor

@nnethercote two questions:

  1. Are you still investigating here?
  2. Do you know how the memory usage is distributed between the various dataflow computations?

@nikomatsakis
Contributor

I guess this answers my question:

In each case num_blocks is 25,994, and bits_per_block is 94,972 in the first two and 83,308 in the third.

@nnethercote
Contributor Author

@nikomatsakis: I have run out of ideas on this one.

If it helps, here is what flow_uninits looks like, where each line represents a row (i.e. a BB) and shows the number of set bits and the total number of bits (which is rounded up to the nearest multiple of 64, because BitSet uses 64-bit words):

94971 / 94976
94968 / 94976
94968 / 94976
94966 / 94976
94964 / 94976
...
49547 / 94976
49543 / 94976
49542 / 94976
49540 / 94976
49537 / 94976

In other words, it is 25994 x 94976 bits (308.6MB), and the rows start off almost entirely set, and by the end drop down to about half set. About 75% of the bits are set.

And here's what flow_ever_inits looks like:

1 / 83328
79728 / 83328
5 / 83328
8 / 83328
9 / 83328
...
64142 / 83328
64146 / 83328
64148 / 83328
64151 / 83328
64155 / 83328

It is 25994 x 83328 bits (270.8MB). Apart from the second row, the rows start off almost empty and get fuller until they are 77% full by the end. About 38% of the bits are set.

I didn't look at flow_inits because its lifetime is separate and so it's no longer part of the memory peak.

I can't see how to represent this data more compactly, and I don't understand the algorithm in enough detail to know if less data could be stored. I also looked into separating the lifetimes of the two structures but they are used in tandem, as far as I can tell.

@pnkfelix
Member

pnkfelix commented Oct 2, 2018

Discussed with @nikomatsakis during triage of NLL issues.

We decided that the memory usage on this case should not block NLL's inclusion in RC2.

In terms of whether to put this on the Release milestone or not, we decided that it would be a better idea, at least in the short-to-middle term, to focus effort more on Polonius, since that component might end up replacing the dataflow entirely, and thus the pay-off from optimizing rustc_mir::dataflow may not be so great.

So, tagging as NLL-deferred, with the intention of revisiting after we've learned more about what we plan to do with Polonius, if anything.

@pnkfelix pnkfelix added the WG-compiler-performance Working group: Compiler Performance label Feb 20, 2019
@pnkfelix
Member

NLL triage. P-medium. WG-compiler-performance.

@pnkfelix pnkfelix added the P-medium Medium priority label Feb 20, 2019
@pnkfelix pnkfelix added the I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. label Apr 12, 2019
@crlf0710 crlf0710 added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jun 11, 2020
@jackh726
Member

Unsure if this is still relevant, but retagging the wg- label to wg-nll.

@jackh726 jackh726 added WG-nll Working group: Non-lexical lifetimes and removed WG-compiler-middle WG-compiler-performance Working group: Compiler Performance labels Jan 29, 2022
@nnethercote
Contributor Author

#93984 has been merged. It reduced max-rss on CI from 974MB to 399MB, a 2.44x reduction. This wins back enough of the original 2.69x regression that I am happy to declare victory here 😄
