Skip to content

Commit

Permalink
[MERGE #5110 @rajatd] PRE for multi-level field loads in a loop
Browse files Browse the repository at this point in the history
Merge pull request #5110 from rajatd:oxy-no-ldslot

This PR enhances the existing Partial Redundancy Elimination (PRE) infrastructure to catch multi-level field loads, i.e., field loads of the form o.x.y.

**Challenges:**
1. Loading o.x.y translates to roughly the following IR:

                T1 = o.x
                T2 = T1.y

	In order to put a load of `T1.y` in the loop landing pad, we need `T1` to be live there. But, since it’s a temp its not live in the landing pad.

2. In the prepass, when we create initial values for `o.x` and `T1.y`, we give them new symStores (because we aren't sure what symStore will those syms end up with at the end of the loop). Symbol-to-value map entries look something like this:

             symbol -> value -> symstore
                  o.x -> v1 -> T3
                  T1.y ->v2 -> T4

	The PRE code that runs after the loop prepass, then uses these symStores to preload the PRE candidate property syms (`o.x` and `T1.y` in the above example). Assuming we could solve the previous problem by inserting `T1 = o.x` in the landing pad, landing pad would look something like this:

                T3 = o.x
                T1 = o.x
                T4 = T1.y

	But, now when we optimize the landing pad, `T1.y` gets object-pointer copy-propped to `T3.y`. So, now we have `T3.y` live coming in to the loop, but `T1.y` live on the back-edge. Since, these are two different property syms, they don’t produce a value for any `.y` during merging info on the loop header. This defeats the purpose of putting an initial value for `T1.y` in the landing pad.

**Solutions:**
1. For making `T1` live in the landing pad, I rely on it being single-def and go with the premise that if its single def, I should be able to insert its defining instruction in the landing pad, as long as the rhs of the instruction is live in the landing pad; if its not, I try to make the rhs live by either:
		a. Processing it as a PRE candidate (if it was one), or
		b. recursing on the same logic as above for making it live.

2. Since, `T3` is the symstore for `o.x` on the back-edge, and we insert `T3 = o.x` in the landing pad, `T3` will eventually object-pointer copy-prop into `T2 = T1.y` (making it `T2 = T3.y`) during the main pass of the loop. To then field copy prop `T3.y`, we need `T3.y` to have a value on the loop header and to have that, we need `T3.y` live on the back edge.
	And that is exactly what I do - when preloading `T1.y` in the landing pad, I recognize that `T3` is going to be the copy-prop sym for `T1` and I make `T3.y` live in the landing pad and all the back edges.

**Perf**
```
Kraken                            Left run time      Right run time     ? Run time  ? Run time %  Comment
--------------------------------  -----------------  -----------------  ----------  ------------  ---------------
Ai-astar                           203.80 ms ±0.90%   205.83 ms ±0.92%     2.03 ms         1.00%
Audio-beat-detection               105.00 ms ±0.95%   102.50 ms ±0.49%    -2.50 ms        -2.38%
Audio-dft                          141.89 ms ±0.99%   140.75 ms ±0.18%    -1.14 ms        -0.80%
Audio-fft                           74.50 ms ±0.70%    75.91 ms ±0.74%     1.41 ms         1.89%
Audio-oscillator                    73.63 ms ±0.43%    69.25 ms ±0.24%    -4.38 ms        -5.94%  Improved
Imaging-darkroom                   177.83 ms ±0.92%   176.33 ms ±0.68%    -1.50 ms        -0.84%
Imaging-desaturate                  75.57 ms ±0.49%    75.43 ms ±0.76%    -0.14 ms        -0.19%
Imaging-gaussian-blur              189.25 ms ±0.13%   165.60 ms ±0.24%   -23.65 ms       -12.50%  Improved
Json-parse-financial                58.50 ms ±0.85%    57.60 ms ±0.89%    -0.90 ms        -1.54%
Json-stringify-tinderbox            32.00 ms ±0.81%    32.00 ms ±0.99%     0.00 ms         0.00%
Stanford-crypto-aes                145.20 ms ±0.96%   144.80 ms ±0.94%    -0.40 ms        -0.28%
Stanford-crypto-ccm                112.00 ms ±0.89%   111.50 ms ±0.45%    -0.50 ms        -0.45%
Stanford-crypto-pbkdf2             246.33 ms ±0.27%   244.80 ms ±0.91%    -1.53 ms        -0.62%
Stanford-crypto-sha256-iterative    63.75 ms ±0.39%    62.83 ms ±0.96%    -0.92 ms        -1.44%
--------------------------------  -----------------  -----------------  ----------  ------------  ---------------
Total                             1699.25 ms ±0.67%  1665.14 ms ±0.66%   -34.11 ms        -2.01%  Likely improved

Octane            Left score       Right score      ∆ Score  ∆ Score %  Comment
----------------  ---------------  ---------------  -------  ---------  ---------------
Box2d             23756.13 ±0.16%  24083.13 ±0.58%   327.00      1.38%
Code-load          8355.50 ±0.25%   8435.50 ±0.15%    80.00      0.96%  Improved
Crypto            25798.93 ±0.24%  26194.50 ±0.18%   395.57      1.53%  Improved
Deltablue         19073.33 ±0.66%  20072.00 ±0.23%   998.67      5.24%  Improved
Earley-boyer      32181.25 ±0.36%  31788.40 ±0.49%  -392.85     -1.22%
Gbemu             36176.00 ±0.20%  36268.00 ±0.38%    92.00      0.25%
Mandreel          20893.25 ±0.32%  20803.50 ±0.27%   -89.75     -0.43%
Mandreel latency  61067.78 ±0.71%  61170.50 ±0.23%   102.72      0.17%
Navier-stokes     29619.50 ±0.30%  31657.71 ±0.30%  2038.21      6.88%  Improved
Pdfjs             13850.75 ±0.41%  13977.63 ±0.44%   126.88      0.92%
Raytrace          31511.67 ±0.38%  31847.71 ±0.30%   336.05      1.07%  Likely improved
Regexp             3627.00 ±0.51%   3586.13 ±0.11%   -40.88     -1.13%
Richards          17123.00 ±0.49%  17286.50 ±0.26%   163.50      0.95%
Splay             16760.20 ±0.49%  16873.75 ±0.32%   113.55      0.68%
Splay latency     34004.50 ±0.58%  34297.50 ±0.44%   293.00      0.86%
Typescript        28073.38 ±0.49%  28221.88 ±0.41%   148.50      0.53%
Zlib              70900.88 ±0.17%  70701.75 ±0.25%  -199.13     -0.28%
----------------  ---------------  ---------------  -------  ---------  ---------------
Total             22911.87 ±0.39%  23154.84 ±0.31%   242.96      1.06%  Likely improved

```

Follow up work will target catching multi-level field loads that have LdEnv in the load sequence, and loads of and stores to the same field in the same loop
  • Loading branch information
rajatd committed May 12, 2018
2 parents c246386 + 0c15d38 commit 5fb9787
Show file tree
Hide file tree
Showing 18 changed files with 793 additions and 111 deletions.
34 changes: 28 additions & 6 deletions lib/Backend/FlowGraph.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3695,6 +3695,24 @@ Loop::IsSymAssignedToInSelfOrParents(StackSym * const sym) const
return false;
}

BasicBlock *
Loop::GetAnyTailBlock() const
{
BasicBlock * tail = nullptr;

BasicBlock * loopHeader = this->GetHeadBlock();
FOREACH_PREDECESSOR_BLOCK(pred, loopHeader)
{
if (this->IsDescendentOrSelf(pred->loop))
{
tail = pred;
}
} NEXT_PREDECESSOR_BLOCK;

Assert(tail);
return tail;
}

#if DBG_DUMP
uint
Loop::GetLoopNumber() const
Expand Down Expand Up @@ -5216,14 +5234,14 @@ GlobOpt::CloneValues(BasicBlock *const toBlock, GlobOptBlockData *toData, GlobOp
ProcessValueKills(toBlock, toData);
}

PRECandidatesList * GlobOpt::FindBackEdgePRECandidates(BasicBlock *block, JitArenaAllocator *alloc)
PRECandidates * GlobOpt::FindBackEdgePRECandidates(BasicBlock *block, JitArenaAllocator *alloc)
{
// Iterate over the value table looking for propertySyms which are candidates to
// pre-load in the landing pad for field PRE

GlobHashTable *valueTable = block->globOptData.symToValueMap;
Loop *loop = block->loop;
PRECandidatesList *candidates = nullptr;
PRECandidates *candidates = JitAnew(this->tempAlloc, PRECandidates);

for (uint i = 0; i < valueTable->tableSize; i++)
{
Expand Down Expand Up @@ -5284,7 +5302,7 @@ PRECandidatesList * GlobOpt::FindBackEdgePRECandidates(BasicBlock *block, JitAre
if (!landingPadValue)
{
// Value should be added as initial value or already be there.
return nullptr;
continue;
}

IR::Instr * ldInstr = this->prePassInstrMap->Lookup(propertySym->m_id, nullptr);
Expand All @@ -5294,12 +5312,16 @@ PRECandidatesList * GlobOpt::FindBackEdgePRECandidates(BasicBlock *block, JitAre
continue;
}

if (!candidates)
if (!candidates->candidatesList)
{
candidates = Anew(alloc, PRECandidatesList, alloc);
candidates->candidatesList = JitAnew(alloc, PRECandidatesList, alloc);
candidates->candidatesToProcess = JitAnew(alloc, BVSparse<JitArenaAllocator>, alloc);
candidates->candidatesBv = JitAnew(alloc, BVSparse<JitArenaAllocator>, alloc);
}

candidates->Prepend(&bucket);
candidates->candidatesList->Prepend(&bucket);
candidates->candidatesToProcess->Set(propertySym->m_id);
candidates->candidatesBv->Set(propertySym->m_id);

} NEXT_SLISTBASE_ENTRY;
}
Expand Down
1 change: 1 addition & 0 deletions lib/Backend/FlowGraph.h
Original file line number Diff line number Diff line change
Expand Up @@ -756,6 +756,7 @@ class Loop
void SetLoopTopInstr(IR::LabelInstr * loopTop);
Func * GetFunc() const { return GetLoopTopInstr()->m_func; }
bool IsSymAssignedToInSelfOrParents(StackSym * const sym) const;
BasicBlock * GetAnyTailBlock() const;
#if DBG_DUMP
bool GetHasCall() const { return hasCall; }
uint GetLoopNumber() const;
Expand Down
Loading

0 comments on commit 5fb9787

Please sign in to comment.