Dest prop: Support removing writes when this unblocks optimizations #105813
Conversation
Some changes occurred to MIR optimizations. cc @rust-lang/wg-mir-opt
Do we have any tests for this? Could you comment on the tradeoffs relative to an implementation along the lines proposed in #105190?
I think so. It seemed simple enough to include for now, so I've kept it in. I can also add
I'm not so sure that these are really alternatives. That PR is approximately equivalent to deciding that we don't want to have a distinction between copy/move in MIR. That's a decision we could possibly make, but it has broader implications than just this optimization/PR. In the meantime, this seems like a reasonable option for unblocking opts.
I was thinking about changing move to copy where there is de facto no difference between the two, which is more or less what that pull request is doing. In call terminators, when an argument is passed indirectly, changing move to copy introduces an additional copy. Why would the net result be an improvement in this case?
I see commentary about Call terminators, so I'm just sharing some of what I've learned through experimentation. Consider this perf run: #105190 (comment), which only converts Move to Copy when the type is a pointer or smaller than a pointer. This is a common-sense optimization I would suggest when writing Rust, so there should be no way that the regressions here are from making the final codegen worse. But as a MIR transform, it produces a compile-time regression because it breaks this optimization in Inline (compiler/rustc_mir_transform/src/inline.rs, lines 683 to 689 at 35a99ee).
Just blindly converting all `Move`s to `Copy`s …

So I don't know what the effect of this additional copy is, but I'm extremely wary of our ability to assess its importance at the moment, because of how Inline and DestinationPropagation are tangled up without this PR.
Two things to consider here:
```rust
match op {
    Operand::Move(place) => {
        if self.should_remove_write_at(place.local, location) {
            *op = Operand::Copy(*place);
```
This is only valid if `place` is sized, as unsized values can't be copied. I don't think `should_remove_write_at` can return true for unsized places (maybe there is a case where an unsized place can be re-initialized after a move, though?), but it may be a good idea to check anyway.
Hm, this is a theoretical problem in future versions of dest prop I think. God, I really wish we could just treat unsized values as completely unsupported. I'll add a check.
Thanks for bringing this up. Indeed, with inlining the difference extends beyond the case where the arguments are passed indirectly. With recent changes that disabled top-down inlining, I think you could avoid this negative interaction by running the pass after inlining (or leaving …
Wouldn't it be unnecessary at that point? If there is no difference in behavior,
I don't think this is likely to be what happens. I expect that this particular difference in behavior will go away, but probably others will remain. In other words, this change is useful if we end up in a world where there is a reason to have a distinction between `Move` and `Copy`.

In any case, this is speculation about the future. I imagine that today the vast majority of the code this is transforming looks like this:

```
_5 = _3;
foo(move _5);
// into
foo(_3);
```

In the case of by-ref parameters, I'd expect this to be neutral in terms of codegen. In the case of by-val parameters, it does lead to better codegen though, modulo LLVM optimizations (I'd hope LLVM figures this out but wouldn't depend on it). To me that means that we should consider the second version the canonical one and thus do the transform.
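For readers less familiar with MIR, the rewrite above corresponds roughly to this surface-Rust analogue (function and variable names are hypothetical, for illustration only): a temporary that exists solely to be passed to a call is elided, and the two versions are observably identical.

```rust
fn foo(x: [u64; 4]) -> u64 {
    x[0]
}

// Before dest prop: roughly `_5 = _3; foo(move _5)`
fn with_temp(a: [u64; 4]) -> u64 {
    let tmp = a;
    foo(tmp)
}

// After dest prop: roughly `foo(_3)`
fn without_temp(a: [u64; 4]) -> u64 {
    foo(a)
}

fn main() {
    let a = [7u64, 1, 2, 3];
    // Same observable behavior; only the generated code differs.
    assert_eq!(with_temp(a), without_temp(a));
    println!("{}", with_temp(a));
}
```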
It is unavoidable for unsized arguments (as copying those is impossible) and a perf improvement for sized arguments if there is no prior copy. Honestly, it would be nice if either most of those prior copies could be avoided during MIR building, or copies were turned into moves after dest prop where possible.
A perf improvement for compile time or the generated code? I'm looking for examples of this effect I can inspect.
Both compile time and generated code. It would avoid an unnecessary copy. @pcwalton is currently working on trying to make LLVM elide memcpy calls where possible, but that comes at a compile-time cost AFAIK, and I believe it won't eliminate all eliminable copies, as LLVM knows less about the semantics of Rust than rustc itself does. And I would expect most memcpy calls only to be elided when compiling in release mode (or when using Cranelift), so the extra copies have an even bigger runtime impact in those cases.
Do you have any examples of the compile-time impact that are not the situation I explained above? Runtime impacts would also be concerning, but I'm not sure we have any benchmarks or examples to even demonstrate the issue. Do we? We have the additional problem with this PR that we are discussing a MIR opt which is not accessible to rustc-perf. So without even the ability to collect the usual benchmark data, I'm not really sure there is anything to base a decision on.
Currently all call arguments are preceded by a copy (except for unsized arguments), so changing a move to a copy operand, when that allows removing this copy, has exactly zero impact. In fact I think it will codegen to effectively the same LLVM IR. Maybe it would even make sense to only support move operands in function calls? A copy operand is implemented as a copy to a temporary followed by handling this temporary as a move argument anyway. Changing arbitrary move operands to copies when there is no prior copy for dest prop to remove, however, does have an effect.
Ahhhhh, that makes sense, thanks. So it would make the most sense to have a separate pass which converts never-used-again copies to moves before codegen.
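A toy sketch of what such a copy-to-move pass could look like (the mini-IR and all names here are hypothetical, not rustc's actual data structures): walk the operand list backwards, tracking which locals are still used later, and turn any `Copy` of a local that is never used again into a `Move`.

```rust
use std::collections::HashSet;

// Hypothetical mini-IR: each operand reads a numbered local.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operand {
    Copy(usize),
    Move(usize),
}

// Convert never-used-again copies into moves by scanning backwards.
fn copies_to_moves(ops: &mut [Operand]) {
    let mut used_later: HashSet<usize> = HashSet::new();
    for op in ops.iter_mut().rev() {
        let local = match *op {
            Operand::Copy(l) | Operand::Move(l) => l,
        };
        // A Copy whose local has no later use can safely become a Move.
        if matches!(op, Operand::Copy(_)) && !used_later.contains(&local) {
            *op = Operand::Move(local);
        }
        used_later.insert(local);
    }
}

fn main() {
    let mut ops = vec![Operand::Copy(1), Operand::Copy(1), Operand::Copy(2)];
    copies_to_moves(&mut ops);
    // Only the last read of each local becomes a Move.
    assert_eq!(ops, vec![Operand::Copy(1), Operand::Move(1), Operand::Move(2)]);
    println!("{:?}", ops);
}
```

This is purely illustrative: a real pass would have to reason about basic blocks, borrows, and re-initialization, not just a straight-line operand list.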
Moving copies directly into the call terminators sounds reasonable as a normal form. I don't quite see why it would be an improvement for code generation, though. Can you give an example?

In that case we should probably limit the number of move operands changed into copies per propagation to at most one. Consider for example:

```rust
pub fn f<T: Copy>(a: T) {
    g(a, a)
}

#[inline(never)]
pub fn g<T: Copy>(a: T, b: T) {}
```

With the changes from this pull request, before the call to … In contrast, the existing implementation ends up with only one copy of …
☔ The latest upstream changes (presumably #106103) made this pull request unmergeable. Please resolve the merge conflicts.
(I've said this before but just to reiterate: I consider this at least as much a problem with our semantics for unsized values as anything else)
I don't think this is the right strategy. I can expand on this more on Zulip if you like, but (and this was not clear to me until recently either) "implicit copy" semantics are the better semantics for function calls in a high level IR like MIR.
Yeah, I can add that, although this is a bit of an oversimplified heuristic. Doing anything smarter seems really difficult though |
Do we have any evidence for worse codegen to back up limiting the pass itself?
If an argument is passed indirectly, then for Copy and Constant operands a temporary copy is made before the call. In contrast, for Move operands the copy is omitted. The implementation is in compiler/rustc_codegen_ssa/src/mir/block.rs, lines 1024 to 1036 at f6cc345.
Without any cost-benefit analysis, the cost of the transformation might exceed the benefit, as in the example given in #105813 (comment):

```rust
type T = [u64; 32];

#[no_mangle]
pub fn f(a: T, b: fn(_: T, _: T)) {
    b(a, a)
}
```

`rustc --crate-type=lib -O a.rs -Zmir-enable-passes=+DestinationPropagation -Zfuel=a=2 -Zunpretty=mir`:

```
fn f(_1: [u64; 32], _2: fn([u64; 32], [u64; 32])) -> () {
    debug a => _1;                   // in scope 0 at a.rs:3:10: 3:11
    debug b => _2;                   // in scope 0 at a.rs:3:16: 3:17
    let mut _0: ();                  // return place in scope 0 at a.rs:3:35: 3:35
    let mut _3: [u64; 32];           // in scope 0 at a.rs:4:7: 4:8

    bb0: {
        StorageLive(_3);             // scope 0 at a.rs:4:7: 4:8
        _3 = _1;                     // scope 0 at a.rs:4:7: 4:8
        _0 = move _2(move _3, move _1) -> bb1; // scope 0 at a.rs:4:5: 4:12
    }

    bb1: {
        StorageDead(_3);             // scope 0 at a.rs:4:11: 4:12
        return;                      // scope 0 at a.rs:5:2: 5:2
    }
}
```

`rustc --crate-type=lib -O a.rs -Zmir-enable-passes=+DestinationPropagation -Zfuel=a=2 --emit llvm-ir`:

```
define void @f(ptr noalias nocapture noundef dereferenceable(256) %a, ptr nocapture noundef nonnull readonly %b) unnamed_addr #0 {
start:
  %_3 = alloca [32 x i64], align 8
  call void @llvm.lifetime.start.p0(i64 256, ptr nonnull %_3)
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(256) %_3, ptr noundef nonnull align 8 dereferenceable(256) %a, i64 256, i1 false)
  call void %b(ptr noalias nocapture noundef nonnull dereferenceable(256) %_3, ptr noalias nocapture noundef nonnull dereferenceable(256) %a)
  call void @llvm.lifetime.end.p0(i64 256, ptr nonnull %_3)
  ret void
}
```

`rustc --crate-type=lib -O a.rs -Zmir-enable-passes=+DestinationPropagation -Zfuel=a=3 -Zunpretty=mir`:

```
fn f(_1: [u64; 32], _2: fn([u64; 32], [u64; 32])) -> () {
    debug a => _1;                   // in scope 0 at a.rs:3:10: 3:11
    debug b => _2;                   // in scope 0 at a.rs:3:16: 3:17
    let mut _0: ();                  // return place in scope 0 at a.rs:3:35: 3:35

    bb0: {
        _0 = move _2(_1, _1) -> bb1; // scope 0 at a.rs:4:5: 4:12
    }

    bb1: {
        return;                      // scope 0 at a.rs:5:2: 5:2
    }
}
```

`rustc --crate-type=lib -O a.rs -Zmir-enable-passes=+DestinationPropagation -Zfuel=a=3 --emit llvm-ir`:

```
define void @f(ptr noalias nocapture noundef readonly dereferenceable(256) %a, ptr nocapture noundef nonnull readonly %b) unnamed_addr #0 {
start:
  %0 = alloca [32 x i64], align 8
  %1 = alloca [32 x i64], align 8
  call void @llvm.lifetime.start.p0(i64 256, ptr nonnull %1)
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(256) %1, ptr noundef nonnull align 8 dereferenceable(256) %a, i64 256, i1 false)
  call void @llvm.lifetime.start.p0(i64 256, ptr nonnull %0)
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(256) %0, ptr noundef nonnull align 8 dereferenceable(256) %a, i64 256, i1 false)
  call void %b(ptr noalias nocapture noundef nonnull dereferenceable(256) %1, ptr noalias nocapture noundef nonnull dereferenceable(256) %0)
  call void @llvm.lifetime.end.p0(i64 256, ptr nonnull %1)
  call void @llvm.lifetime.end.p0(i64 256, ptr nonnull %0)
  ret void
}
```
Ah okay, that makes sense. We regress back to current output, so here DestinationPropagation just fails to improve final codegen 👍
Add a regression test for argument copies with DestinationPropagation

This example, as a codegen test: rust-lang#105813 (comment)

r? `@tmiasko`
```diff
-for (src, candidates) in candidates.c.iter() {
-    if merged_locals.contains(*src) {
+for (src, candidates) in candidates.c.drain() {
+    if merged_locals.contains(src) {
         continue;
     }
     let Some(dest) =
```
```diff
-let Some(dest) =
+let Some((dest, writes_to_remove)) =
```
@JakobDegen Ping from triage: can you post your status on this PR? It has sat idle for a few months.
I'm investing my time in Lir right now, so I'm not going to push this further. I'll keep the branch around; if someone else wants to pick it up, they're welcome to.
The logic this uses is quite simple: When we collect information about writes, we mark some of these writes as "removable." In cases where a removable write would normally invalidate an optimization candidate, we then instead keep the candidate around and mark that particular write as being in need of removing (if we do decide to use the candidate). When we finally merge candidates, we remove all those writes.
Removable writes are `Operand::Move` (removable by turning into `Operand::Copy`) and `Deinit` statements (removable by deleting the statement).

This increases the number of times that the opt fires on a `-Zbuild-std` build of regex by ~10%.

r? @tmiasko
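The bookkeeping described above can be sketched with a toy model (all types and names here are hypothetical, not the actual dest prop implementation): once a candidate is chosen for merging, each write that was marked removable for its local is either rewritten (a `Move` becomes a `Copy`) or deleted outright (a `Deinit`).

```rust
// Hypothetical mini-MIR: writes involving a numbered local.
#[derive(Debug, Clone, PartialEq)]
enum Write {
    Move(usize),   // removable by turning into Copy
    Copy(usize),
    Deinit(usize), // removable by deleting the statement
}

// Apply the removals for locals that actually got merged.
fn apply_removals(body: Vec<Write>, merged: &[usize]) -> Vec<Write> {
    body.into_iter()
        .filter_map(|w| match w {
            // A Move of a merged local becomes a Copy instead of
            // invalidating the merge candidate.
            Write::Move(l) if merged.contains(&l) => Some(Write::Copy(l)),
            // A Deinit of a merged local is simply deleted.
            Write::Deinit(l) if merged.contains(&l) => None,
            other => Some(other),
        })
        .collect()
}

fn main() {
    let body = vec![Write::Move(1), Write::Deinit(1), Write::Move(2)];
    let out = apply_removals(body, &[1]);
    // Local 1 was merged: its Move became a Copy and its Deinit is gone;
    // local 2 was not merged and is untouched.
    assert_eq!(out, vec![Write::Copy(1), Write::Move(2)]);
    println!("{:?}", out);
}
```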