
sync the whole Meg-LM fused_kernels sub-tree #260

Merged · 3 commits merged into main · Mar 7, 2022
Conversation

stas00 (Contributor) commented Mar 1, 2022

As flagged by @thomasw21 in #259: in #151 we only synced part of the fused_kernels fixes that had been applied to Megatron-LM.

I tried to track all the changes since then, but there are too many and they are often mixed with other unrelated PRs, so how about we just sync the whole folder and the other related files?

This PR does just that.

I have no idea how to track all the individual contributors across the many PRs, but I think it was primarily @hyunwoongko, so it should be easy to add him as a contributor:

git commit --author "hyunwoongko <[email protected]>" -am "author attribution" --allow-empty

and the attribution will take effect once this is squash-merged.

stas00 mentioned this pull request on Mar 1, 2022
stas00 (Contributor, Author) commented Mar 2, 2022

This is not good. The performance is worse, and then it OOMed after 4 iterations:


before fused kernel fixes:

 iteration        2/   95367 | consumed samples:         4096 | consumed tokens:      8388608 | elapsed time per iteration (s): 152.10 | learning rate: 3.787E-06 | global batch size:  2048 | lm loss: 6.353651E+01 | grad norm: 21.493 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 13.465 | TFLOPs: 141.12 |

after fused kernel fixes:

 iteration        2/   95367 | consumed samples:         4096 | consumed tokens:      8388608 | elapsed time per iteration (s): 159.85 | learning rate: 3.787E-06 | global batch size:  2048 | lm loss: 6.353651E+01 | grad norm: 21.493 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 12.812 | TFLOPs: 134.27 |

It's possible I missed some other changes outside of these folders.

We probably need to do this properly: track and replay each change.

     and attn_batches % 4 == 0  # np * b must be divisor of 4
 ):
-    if 0 <= sk <= 2048:
+    if 0 <= sk <= 4096:
Member commented on the diff:

That test becomes useless, no? Unless we need to test sq now.

stas00 (Contributor, Author) replied Mar 2, 2022:

I have just copied the code from Megatron-LM verbatim and only re-added the changes we had made here.

i.e. I haven't added any code of my own. I only changed the outdated comment to match 4096.
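
For context, the bound under discussion sits inside Megatron-LM's check for whether the fused scaled-masked-softmax kernel can be used at all (the is_kernel_available logic in the upstream fused_softmax.py). Below is a minimal sketch of the shape of that check, not a verbatim copy: the standalone signature and the exact set of surrounding conditions are assumptions, simplified for illustration.

# Minimal sketch of the fused-kernel availability check (names follow
# Megatron-LM's fused_softmax.py; the standalone signature and the exact
# set of conditions are assumptions, simplified for illustration).
def fused_kernel_available(fusion_enabled, input_in_float16, b, np, sq, sk):
    attn_batches = b * np             # attention heads * micro-batch size
    if (
        fusion_enabled                # user asked for the fused kernel
        and input_in_float16          # kernel only supports fp16 input
        and 16 < sk <= 4096           # key seq length bound (was 2048 pre-sync)
        and sq % 4 == 0               # sq must be a multiple of 4
        and attn_batches % 4 == 0     # np * b must be a multiple of 4
    ):
        # The test the review comment calls useless: the outer condition
        # already guarantees 16 < sk <= 4096, so this is always true here.
        if 0 <= sk <= 4096:
            return True
    return False

Since the outer condition already caps sk at 4096, the inner test can never fail; keeping it anyway is consistent with copying the upstream code verbatim rather than cleaning it up.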

stas00 (Contributor, Author):

I need to use a smaller model and restart the testing, as it was on the brink of OOM already. So it's very likely that the issues I'm seeing are unrelated.

I will report back when I get new numbers.

stas00 (Contributor, Author) commented Mar 2, 2022

OK, I was testing with a broken merge of the deepspeed branch, which had introduced a memory leak. I have found the issue now and will re-test anew with this and your PRs.

stas00 (Contributor, Author) commented Mar 2, 2022

OK, after fixing the issue elsewhere, this PR works just fine. Except that it makes no difference whatsoever to the outcome. Perhaps we aren't impacted because we never hit the constraints that were handled poorly originally; I haven't investigated.

But the numbers are telling:

before fused kernel fixes:

 iteration        2/   95367 | consumed samples:         4096 | consumed tokens:      8388608 | elapsed time per iteration (s): 135.32 | learning rate: 3.787E-06 | global batch size:  2048 | lm loss: 6.354185E+01 | grad norm: 19.988 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 15.134 | TFLOPs: 139.05 |

mem: 59GB

after fused kernel fixes (this PR):

 iteration        2/   95367 | consumed samples:         4096 | consumed tokens:      8388608 | elapsed time per iteration (s): 134.96 | learning rate: 3.787E-06 | global batch size:  2048 | lm loss: 6.354185E+01 | grad norm: 19.988 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 15.175 | TFLOPs: 139.42 |

mem: 59GB

The small fluctuation is fine; the throughputs are effectively identical.

stas00 merged commit 1cb76a6 into main on Mar 7, 2022
stas00 deleted the sync-meg-lm branch on March 7, 2022 at 02:47
stas00 (Contributor, Author) commented Mar 7, 2022

Tested it some more with and without this change, and they track very closely.
