Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Browse files
Browse the repository at this point in the history
Merge pull request #5110 from rajatd:oxy-no-ldslot This PR enhances the existing Partial Redundancy Elimination (PRE) infrastructure to catch multi-level field loads, i.e., field loads of the form o.x.y. **Challenges:** 1. Loading o.x.y translates to roughly the following IR: T1 = o.x T2 = T1.y In order to put a load of `T1.y` in the loop landing pad, we need `T1` to be live there. But, since it’s a temp its not live in the landing pad. 2. In the prepass, when we create initial values for `o.x` and `T1.y`, we give them new symStores (because we aren't sure what symStore will those syms end up with at the end of the loop). Symbol-to-value map entries look something like this: symbol -> value -> symstore o.x -> v1 -> T3 T1.y ->v2 -> T4 The PRE code that runs after the loop prepass, then uses these symStores to preload the PRE candidate property syms (`o.x` and `T1.y` in the above example). Assuming we could solve the previous problem by inserting `T1 = o.x` in the landing pad, landing pad would look something like this: T3 = o.x T1 = o.x T4 = T1.y But, now when we optimize the landing pad, `T1.y` gets object-pointer copy-propped to `T3.y`. So, now we have `T3.y` live coming in to the loop, but `T1.y` live on the back-edge. Since, these are two different property syms, they don’t produce a value for any `.y` during merging info on the loop header. This defeats the purpose of putting an initial value for `T1.y` in the landing pad. **Solutions:** 1. For making `T1` live in the landing pad, I rely on it being single-def and go with the premise that if its single def, I should be able to insert its defining instruction in the landing pad, as long as the rhs of the instruction is live in the landing pad; if its not, I try to make the rhs live by either: a. Processing it as a PRE candidate (if it was one), or b. recursing on the same logic as above for making it live. 2. Since, `T3` is the symstore for `o.x` on the back-edge, and we insert `T3 = o.x` in the landing pad, `T3` will eventually object-pointer copy-prop into `T2 = T1.y` (making it `T2 = T3.y`) during the main pass of the loop. To then field copy prop `T3.y`, we need `T3.y` to have a value on the loop header and to have that, we need `T3.y` live on the back edge. And that is exactly what I do - when preloading `T1.y` in the landing pad, I recognize that `T3` is going to be the copy-prop sym for `T1` and I make `T3.y` live in the landing pad and all the back edges. **Perf** ``` Kraken Left run time Right run time ? Run time ? Run time % Comment -------------------------------- ----------------- ----------------- ---------- ------------ --------------- Ai-astar 203.80 ms ±0.90% 205.83 ms ±0.92% 2.03 ms 1.00% Audio-beat-detection 105.00 ms ±0.95% 102.50 ms ±0.49% -2.50 ms -2.38% Audio-dft 141.89 ms ±0.99% 140.75 ms ±0.18% -1.14 ms -0.80% Audio-fft 74.50 ms ±0.70% 75.91 ms ±0.74% 1.41 ms 1.89% Audio-oscillator 73.63 ms ±0.43% 69.25 ms ±0.24% -4.38 ms -5.94% Improved Imaging-darkroom 177.83 ms ±0.92% 176.33 ms ±0.68% -1.50 ms -0.84% Imaging-desaturate 75.57 ms ±0.49% 75.43 ms ±0.76% -0.14 ms -0.19% Imaging-gaussian-blur 189.25 ms ±0.13% 165.60 ms ±0.24% -23.65 ms -12.50% Improved Json-parse-financial 58.50 ms ±0.85% 57.60 ms ±0.89% -0.90 ms -1.54% Json-stringify-tinderbox 32.00 ms ±0.81% 32.00 ms ±0.99% 0.00 ms 0.00% Stanford-crypto-aes 145.20 ms ±0.96% 144.80 ms ±0.94% -0.40 ms -0.28% Stanford-crypto-ccm 112.00 ms ±0.89% 111.50 ms ±0.45% -0.50 ms -0.45% Stanford-crypto-pbkdf2 246.33 ms ±0.27% 244.80 ms ±0.91% -1.53 ms -0.62% Stanford-crypto-sha256-iterative 63.75 ms ±0.39% 62.83 ms ±0.96% -0.92 ms -1.44% -------------------------------- ----------------- ----------------- ---------- ------------ --------------- Total 1699.25 ms ±0.67% 1665.14 ms ±0.66% -34.11 ms -2.01% Likely improved Octane Left score Right score ∆ Score ∆ Score % Comment ---------------- --------------- --------------- ------- --------- --------------- Box2d 23756.13 ±0.16% 24083.13 ±0.58% 327.00 1.38% Code-load 8355.50 ±0.25% 8435.50 ±0.15% 80.00 0.96% Improved Crypto 25798.93 ±0.24% 26194.50 ±0.18% 395.57 1.53% Improved Deltablue 19073.33 ±0.66% 20072.00 ±0.23% 998.67 5.24% Improved Earley-boyer 32181.25 ±0.36% 31788.40 ±0.49% -392.85 -1.22% Gbemu 36176.00 ±0.20% 36268.00 ±0.38% 92.00 0.25% Mandreel 20893.25 ±0.32% 20803.50 ±0.27% -89.75 -0.43% Mandreel latency 61067.78 ±0.71% 61170.50 ±0.23% 102.72 0.17% Navier-stokes 29619.50 ±0.30% 31657.71 ±0.30% 2038.21 6.88% Improved Pdfjs 13850.75 ±0.41% 13977.63 ±0.44% 126.88 0.92% Raytrace 31511.67 ±0.38% 31847.71 ±0.30% 336.05 1.07% Likely improved Regexp 3627.00 ±0.51% 3586.13 ±0.11% -40.88 -1.13% Richards 17123.00 ±0.49% 17286.50 ±0.26% 163.50 0.95% Splay 16760.20 ±0.49% 16873.75 ±0.32% 113.55 0.68% Splay latency 34004.50 ±0.58% 34297.50 ±0.44% 293.00 0.86% Typescript 28073.38 ±0.49% 28221.88 ±0.41% 148.50 0.53% Zlib 70900.88 ±0.17% 70701.75 ±0.25% -199.13 -0.28% ---------------- --------------- --------------- ------- --------- --------------- Total 22911.87 ±0.39% 23154.84 ±0.31% 242.96 1.06% Likely improved ``` Follow up work will target catching multi-level field loads that have LdEnv in the load sequence, and loads of and stores to the same field in the same loop
- Loading branch information