Support arrays of zeros in Vec's __rust_alloc_zeroed optimization #95362
Conversation
(rust-highfive has picked a reviewer for you, use r? to override)
Hm. The current IsZero implementations were all really cheap -- basically a single cmp -- while scanning an array to check that it's all zeros is comparatively expensive. That said, I think we don't currently use it for anything except the calloc optimization, and there it is likely to be a win to know up front that you're allocating a set of zeros. I suppose the possible exception is where the array is long and the total number of elements is low, so the extra pass over the array ends up being more expensive than the time saved by getting zeroed memory (memset(0)/calloc) instead of memcpy. But that case is probably quite rare.
r=me if you agree with that rationale, otherwise we can discuss more.
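For readers outside the thread, the checks being discussed have roughly this shape. This is a standalone sketch, not the actual `alloc` internals: the real `IsZero` trait is an unsafe, perma-unstable implementation detail, and everything here beyond the names `IsZero`/`is_zero` is made up for illustration.

```rust
// Simplified sketch: the existing scalar checks are a single comparison,
// while the array check proposed in this PR has to walk every element.
trait IsZero {
    fn is_zero(&self) -> bool;
}

impl IsZero for u32 {
    #[inline]
    fn is_zero(&self) -> bool {
        *self == 0 // basically a single cmp
    }
}

impl<T: IsZero, const N: usize> IsZero for [T; N] {
    #[inline]
    fn is_zero(&self) -> bool {
        // O(N): a full pass over the array, unless LLVM can fold it away.
        self.iter().all(IsZero::is_zero)
    }
}

fn main() {
    assert!(0u32.is_zero());
    assert!([0u32; 8].is_zero());
    assert!(![0u32, 1, 0, 0].is_zero());
}
```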
Yeah, I think this'll be too expensive for large arrays. It might make sense to do this only when the size is small enough, e.g. when it fits in a cache line.
Part of my worry is that due to various processor optimizations and e.g. the ability to calloc by directly asking the kernel for zeroed pages (in theory), there's a really hard balance to strike for this kind of thing. So for short arrays it definitely seems reasonable -- where the cost of checking is quite low. I guess it's purely an optimization though, so should be easy to make that happen.
The issue is that if the array is a chain of zeros followed by a non-zero byte, then we spend all that time checking it without ever taking the calloc path.
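A hypothetical input exhibiting that worst case might look like this (sizes chosen arbitrarily for illustration; this is not code from the thread):

```rust
fn main() {
    // Worst case for an unconditional runtime check: the element is zero
    // everywhere except its final byte, so an all-zeros scan walks the whole
    // 4096-byte array, fails at the very end, and the allocation still has to
    // take the ordinary (non-calloc) fill path.
    let mut template = [0u8; 4096];
    template[4095] = 1;
    let v = vec![template; 1024];
    assert_eq!(v[0][4095], 1);
}
```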
Yes, the trait isn't even public. Basically, the odds of a particularly long only-non-zero-at-the-end array getting passed to this seem low.

@bors r=Mark-Simulacrum

(It's not obvious where the threshold for stopping would be if we did make one, because I suspect that the things passed to this are […].)

Oh, and in the long term I look forward to having something from safe transmute, so we'll be able to just have a […] impl.
📌 Commit 8034c45 has been approved by `Mark-Simulacrum`
We could have a different optimization for literals/values visible to the optimizer. I am not convinced that a runtime check is a good one.
This doesn't seem true: https://godbolt.org/z/KcP1sjEj3 |
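The linked Compiler Explorer snippet isn't reproduced here, but the kind of test case involved is presumably along these lines (element type, lengths, and function name are made up, not taken from the link):

```rust
// Build with optimizations and inspect the assembly or LLVM IR: if the
// all-zeros check is constant-folded, this lowers to a single zeroed
// allocation with no per-element comparison loop.
pub fn zeroed_matrix() -> Vec<[u64; 8]> {
    vec![[0u64; 8]; 64]
}

fn main() {
    assert_eq!(zeroed_matrix().len(), 64);
}
```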
☀️ Test successful - checks-actions |
Finished benchmarking commit (bf61143): comparison url. Summary: This benchmark run did not return any relevant results. If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. @rustbot label: -perf-regression |
Tweak the vec-calloc runtime check to only apply to shortish-arrays

r? `@Mark-Simulacrum`

`@nbdd0121` pointed out in rust-lang#95362 (comment) that LLVM currently doesn't constant-fold the `IsZero` check for long arrays, so that seems like a reasonable justification for limiting it. It appears that it's based on length, not byte size (https://godbolt.org/z/4s48Y81dP), so that's what I used in the PR. Maybe it's a ["the number of inlining shall be three"](https://youtu.be/s4wnuiCwTGU?t=320) sort of situation.

Certainly there's more that could be done here -- the generated code that checks long arrays byte-by-byte is highly suboptimal, for example -- but this is an easy, low-risk tweak.
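In sketch form, that gate looks something like the following. This is a self-contained illustration, not the library's actual code, and the concrete cutoff of 16 elements is a placeholder rather than whatever value the follow-up PR settled on:

```rust
// Standalone sketch: gate the array check on a small element-count threshold
// so LLVM can still constant-fold it and the worst-case runtime scan stays cheap.
trait IsZero {
    fn is_zero(&self) -> bool;
}

impl IsZero for u32 {
    #[inline]
    fn is_zero(&self) -> bool {
        *self == 0
    }
}

impl<T: IsZero, const N: usize> IsZero for [T; N] {
    #[inline]
    fn is_zero(&self) -> bool {
        // Threshold on element count, not byte size, as described above.
        // The concrete number is illustrative.
        N <= 16 && self.iter().all(IsZero::is_zero)
    }
}

fn main() {
    assert!([0u32; 8].is_zero());
    // Long arrays conservatively report "not zero", skipping the calloc fast
    // path rather than paying for a long scan.
    assert!(![0u32; 32].is_zero());
}
```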
Tests show that the current LLVM threshold is up to 98 elements (inclusive), so it may make sense to limit the check there.
The value I got is 49 elements. |
@ChayimFriedman2 I also got 49 -- but see #96596 where I've already limited this check. |
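The sort of probe behind those numbers is presumably something like this (hypothetical snippet; per the comments above, the exact element type appears to change where folding stops):

```rust
// Compare the optimized output for element lengths on either side of the
// suspected cutoff: past it, the zero check reportedly stops being
// constant-folded and a runtime comparison loop shows up instead.
pub fn short_element() -> Vec<[u8; 49]> {
    vec![[0u8; 49]; 100]
}

pub fn longer_element() -> Vec<[u8; 50]> {
    vec![[0u8; 50]; 100]
}

fn main() {
    assert_eq!(short_element().len(), 100);
    assert_eq!(longer_element().len(), 100);
}
```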
I used […]
I happened to notice in https://users.rust-lang.org/t/any-advantage-of-box-u64-16-16-16-over-vec-u64/73500/3?u=scottmcm that the calloc optimization wasn't applying to vectors-of-arrays, so here's the easy fix for that.
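The comparison from that thread is roughly the following (sizes taken from the linked post's title; this is a sketch, not the post's exact code):

```rust
// The flat case already hit the zeroed-allocation (`__rust_alloc_zeroed` /
// calloc) fast path; before this PR, the equivalent vector-of-arrays did not,
// even though every byte is zero.
pub fn flat() -> Vec<u64> {
    vec![0u64; 16 * 16 * 16]
}

pub fn nested() -> Vec<[[u64; 16]; 16]> {
    vec![[[0u64; 16]; 16]; 16]
}

fn main() {
    assert_eq!(flat().len(), nested().len() * 16 * 16);
}
```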