Blog post: leaving the Sea of Nodes #797
Conversation
Super cool, all nits, so up to you if you want to address them or just keep as is. :)
Why do we need a PDF version of this given that the png seems to be the same?
In this example, without control edges, nothing would prevent the `return`s from being executed before the `branch`, which would obviously be wrong.
The crucial thing here is that the control edges only impose an order on the operations that have such incoming or outgoing edges, but not on other operations such as the arithmetic operations. This is the main difference between Sea of Nodes and control flow graphs.
Let’s now add effectful operations (eg, loads and stores from and to memory) in the mix. Similarly to control nodes, effectful operations often have no value dependencies, but still cannot run in a random order. For instance, `a[0] += 42; x = a[0]` and `x = a[0]; a[0] += 42` are not equivalent. So, we need a way to impose an order (= a schedule) on effectful operations. We could reuse the control chain for this purpose, but this would be stricter than required. For instance, consider this small snippet:
Oh no, two whitespaces!
Let’s now add effectful operations (eg, loads and stores from and to memory) in the mix. Similarly to control nodes, effectful operations often have no value dependencies, but still cannot run in a random order. For instance, `a[0] += 42; x = a[0]` and `x = a[0]; a[0] += 42` are not equivalent. So, we need a way to impose an order (= a schedule) on effectful operations. We could reuse the control chain for this purpose, but this would be stricter than required. For instance, consider this small snippet:
Let’s now add effectful operations (eg, loads and stores from and to memory) in the mix. Similarly to control nodes, effectful operations often have no value dependencies, but still cannot run in a random order. For instance, `a[0] += 42; x = a[0]` and `x = a[0]; a[0] += 42` are not equivalent. So, we need a way to impose an order (= a schedule) on effectful operations. We could reuse the control chain for this purpose, but this would be stricter than required. For instance, consider this small snippet:
(Which I guess, won't even be visible in the resulting HTML... :))
By putting `a[2]` (which reads memory) on the control chain, we would force it to happen before the branch on `c`, even though, in practice, this load could easily happen after the branch if its result is only used inside the body of the then-branch. Having lots of nodes in the program on the control chain would defeat the goal of Sea of Nodes, since we would basically end up with a CFG-like IR where only pure operations float around.
this load could easily happen after the branch if its result is only used inside the body of the then-branch.
And I guess, assuming that the operation does not fail, meaning that `a` is not null or undefined or any other value that could trigger an exception? Not sure if that level of correctness is relevant here; it could be, however, as we are talking about effect and control flow edges as well.
Fair point, but I prefer to keep things as simple as possible here :)
In this example, `arr[0] = 42` and `let x = arr[a]` have no value dependency (ie, the former is not an input of the latter, and vice versa) . However, because `a` could be `0`, `arr[0] = 42` should be executed before `x = arr[a]`, in order for the latter to always load the correct value from the array.
In this example, `arr[0] = 42` and `let x = arr[a]` have no value dependency (ie, the former is not an input of the latter, and vice versa) . However, because `a` could be `0`, `arr[0] = 42` should be executed before `x = arr[a]`, in order for the latter to always load the correct value from the array.
In this example, `arr[0] = 42` and `let x = arr[a]` have no value dependency (ie, the former is not an input of the latter, and vice versa) . However, because `a` could be `0`, `arr[0] = 42` should be executed before `x = arr[a]` in order for the latter to always load the correct value from the array.
## Manually/visually inspecting and understanding a Sea of Nodes graph is hard
We’ve already seen that on small programs, CFG is easier to read, as it is closer to the original source code, which is what developers (including Compiler Engineers!) are used to write. For the unconvinced readers, let me offer a slightly larger example, so that you understand the issue better. Consider the following JavaScript function, which concatenates an array of strings:
We’ve already seen that on small programs, CFG is easier to read, as it is closer to the original source code, which is what developers (including Compiler Engineers!) are used to write. For the unconvinced readers, let me offer a slightly larger example, so that you understand the issue better. Consider the following JavaScript function, which concatenates an array of strings:
We’ve already seen that on small programs the CFG is easier to read, as it is closer to the original source code, which is what developers (including Compiler Engineers!) are used to write. For the unconvinced readers, let me offer a slightly larger example, so that you understand the issue better. Consider the following JavaScript function, which concatenates an array of strings:
In general there seem to be so many commas in the whole post but I'm also not sure how many I'd want to remove, so I'll just report the ones that I think are misplaced and make it harder to parse the sentence. :)
You’ll notice that while the source JavaScript program has two identical divisions, the Sea of Nodes graph only has one. In reality, Sea of Nodes would start with two divisions, but since this is a pure operation (assuming double inputs), redundancy elimination would easily deduplicate them into one.
Then when reaching the scheduling phase, we would have to find a place to schedule this division. Clearly, it cannot go after `case 1` or `case 2`, since it’s used in the other one. Instead, it would have to be scheduled before the `switch`. The downside is that, now, `a / b` will be computed even when `c` is `3`, where it doesn’t really need to be computed. This is a real issue that can lead to many deduplicated instructions floating to the common dominator of their users, slowing down many paths that don’t need them.
There is a fix though: Turbofan’s scheduler will try to identify these cases, and duplicate the instructions so that they are only computed on the paths that need them. The downside is that this makes the scheduler more complex, requiring additional logic to figure out which nodes could and should be duplicated, and how to duplicate them.
There is a fix though: Turbofan’s scheduler will try to identify these cases, and duplicate the instructions so that they are only computed on the paths that need them. The downside is that this makes the scheduler more complex, requiring additional logic to figure out which nodes could and should be duplicated, and how to duplicate them.
There is a fix though: Turbofan’s scheduler will try to identify these cases and duplicate the instructions so that they are only computed on the paths that need them. The downside is that this makes the scheduler more complex, requiring additional logic to figure out which nodes could and should be duplicated, and how to duplicate them.
## Finding a good order to visit the graph is difficult
All passes of a compiler need to visit the graph, be it to lower nodes, to apply local optimizations, or to run analysis over the whole graph. In a CFG, the order in which to visit nodes is usually straightforward: start from the first block (assuming a single-entry function), and iterate through each node of the block, and then move on to the successors and so on. In a [peephole optimization](https://en.wikipedia.org/wiki/Peephole_optimization) phase (such as [strength reduction](https://en.wikipedia.org/wiki/Strength_reduction)), a nice property of processing the graph in this order is that inputs are always optimized before a node is processed, and visiting each node exactly once is thus enough to apply most peephole optimizations. Consider for instance the following sequence of reductions
All passes of a compiler need to visit the graph, be it to lower nodes, to apply local optimizations, or to run analysis over the whole graph. In a CFG, the order in which to visit nodes is usually straightforward: start from the first block (assuming a single-entry function), and iterate through each node of the block, and then move on to the successors and so on. In a [peephole optimization](https://en.wikipedia.org/wiki/Peephole_optimization) phase (such as [strength reduction](https://en.wikipedia.org/wiki/Strength_reduction)), a nice property of processing the graph in this order is that inputs are always optimized before a node is processed, and visiting each node exactly once is thus enough to apply most peephole optimizations. Consider for instance the following sequence of reductions
All passes of a compiler need to visit the graph, be it to lower nodes, to apply local optimizations, or to run analysis over the whole graph. In a CFG, the order in which to visit nodes is usually straightforward: start from the first block (assuming a single-entry function), and iterate through each node of the block, and then move on to the successors and so on. In a [peephole optimization](https://en.wikipedia.org/wiki/Peephole_optimization) phase (such as [strength reduction](https://en.wikipedia.org/wiki/Strength_reduction)), a nice property of processing the graph in this order is that inputs are always optimized before a node is processed, and visiting each node exactly once is thus enough to apply most peephole optimizations. Consider for instance the following sequence of reductions:
In total, it took three steps to optimize the whole sequence, and each step did useful work. After which, dead code elimination would remove `v1` and `v2`, resulting in one less instruction than in the initial sequence.
With Sea of Nodes, it’s not possible to process pure instructions from start to end, since they aren’t on any control or effect chain, and thus there is no pointer to pure roots or anything like that. Instead, the usual way to process a Sea of Nodes graph for peephole optimizations is to start from the end (e.g., `return` instructions), and go up the value, effect and control inputs. This has the nice property that we won’t visit any unused instruction, but the upsides stop about there, because for peephole optimization, this is about the worst visitation order you could get. On the example above, here are the steps we would take:
With Sea of Nodes, it’s not possible to process pure instructions from start to end, since they aren’t on any control or effect chain, and thus there is no pointer to pure roots or anything like that. Instead, the usual way to process a Sea of Nodes graph for peephole optimizations is to start from the end (e.g., `return` instructions), and go up the value, effect and control inputs. This has the nice property that we won’t visit any unused instruction, but the upsides stop about there, because for peephole optimization, this is about the worst visitation order you could get. On the example above, here are the steps we would take:
With Sea of Nodes it’s not possible to process pure instructions from start to end since they aren’t on any control or effect chain and thus there is no pointer to pure roots or anything like that. Instead, the usual way to process a Sea of Nodes graph for peephole optimizations is to start from the end (e.g., `return` instructions) and go up the value, effect and control inputs. This has the nice property that we won’t visit any unused instruction, but the upsides stop about there, because for peephole optimization this is about the worst visitation order you could get. On the example above, here are the steps we would take:
## Cache unfriendliness
Almost all phases in Turbofan mutate the graph in-place. Given that nodes are fairly large in memory (mostly because each node has pointers to both its inputs and its uses), we try to reuse nodes as much as possible. However, inevitably, when we lower nodes to sequences of multiple nodes, we have to introduce new nodes, which will necessarily not be allocated close to the original node in memory. As a result, the deeper we go through the Turbofan pipeline and the more phases we run, the less cache friendly the graph is. Here is an illustration of this phenomenon:
Nit:
Almost all phases in Turbofan mutate the graph in-place. Given that nodes are fairly large in memory (mostly because each node has pointers to both its inputs and its uses), we try to reuse nodes as much as possible. However, inevitably, when we lower nodes to sequences of multiple nodes, we have to introduce new nodes, which will necessarily not be allocated close to the original node in memory. As a result, the deeper we go through the Turbofan pipeline and the more phases we run, the less cache friendly the graph is. Here is an illustration of this phenomenon:
Almost all phases in Turbofan mutate the graph in-place. Given that nodes are fairly large in memory (mostly because each node has pointers to both its inputs and its uses), we try to reuse nodes as much as possible. However, inevitably, when we lower nodes to sequences of multiple nodes, we have to introduce new nodes, which will necessarily not be allocated close to the original node in memory. As a result, the deeper we go through the Turbofan pipeline and the more phases we run, the less cache friendly the graph becomes. Here is an illustration of this phenomenon:
**It’s hard to figure out what is inside of a loop.** Because lots of nodes are floating outside of the control chain, it’s hard to figure out what is inside each loop. As a result, basic optimizations such as loop peeling and loop unrolling are hard to implement.
**Compiling is slow.** This is a direct consequence of multiple issues that I’ve already mentioned: it’s hard to find a good visitation order for nodes, which leads to many useless revisitations, state tracking is expensive, memory usage is bad, cache locality is bad… This might not be a big deal for an ahead-of-time compiler, but in a JIT compiler, compiling slowly means that we keep executing slow unoptimized code until the optimized code is ready, while taking away resources from other tasks (eg, other compilation jobs, or the Garbage Collector). One consequence of this is that we are forced to think very carefully about the compile time vs. speedup tradeoff of new optimizations, often erring towards the side of optimizing less to keep optimizing fast.
**Compiling is slow.** This is a direct consequence of multiple issues that I’ve already mentioned: it’s hard to find a good visitation order for nodes, which leads to many useless revisitation, state tracking is expensive, memory usage is bad, cache locality is bad… This might not be a big deal for an ahead of time compiler, but in a JIT compiler, compiling slowly means that we keep executing slow unoptimized code until the optimized code is ready, while taking away resources from other tasks (eg, other compilation jobs, or the Garbage Collector). One consequence of this is that we are forced to think very carefully about the compile time - speedup tradeoff of new optimizations, often erring towards the size of optimizing less to keep optimizing fast.
**Compiling is slow.** This is a direct consequence of multiple issues that I’ve already mentioned: it’s hard to find a good visitation order for nodes, which leads to many useless revisitation, state tracking is expensive, memory usage is bad, cache locality is bad… This might not be a big deal for an ahead of time compiler, but in a JIT compiler, compiling slowly means that we keep executing slow unoptimized code until the optimized code is ready, while taking away resources from other tasks (e.g. other compilation jobs or the Garbage Collector). One consequence of this is that we are forced to think very carefully about the compile time - speedup tradeoff of new optimizations, often erring towards the size of optimizing less to keep optimizing fast.