
Full transformer #11

Merged: 4 commits into transformer-layer from full-transformer, May 4, 2023

Conversation

thecharlieblake (Contributor)

No description provided.

thecharlieblake force-pushed the full-transformer branch 2 times, most recently from 56b2400 to d3c232d (April 27, 2023 16:11)

DouglasOrr (Collaborator) left a comment:

Very nice, all slotting together neatly!

LGTM (pre-approved); a bunch of minor comments / things to think about.

sparse: bool = False,
) -> Tensor:
batch_size = prod(input.shape)
weight = scale_bwd(weight, (weight.shape[0] / batch_size) ** 0.5)

DouglasOrr (Collaborator):

I think this is the right rule, but it does sometimes feel a bit risky! Perhaps in the case where it's risky (e.g. knowledge graph vocab_size=1M), the user really needs to set sparse=True and we should also do something else.

thecharlieblake (Contributor, Author):

Yeah, that's a good point; I'd forgotten about this issue. I'm tempted to say that for now we shouldn't support sparse=True, and add it to our todo list for some point down the line. For a huge vocab or a tiny batch we may have an issue.

Having said that, even for a 2**20 vocab and a 2**8 batch the scaling factor is only 64, which isn't too bad. And in the sparse setting, if you don't have that factor then maybe you just get dominated by the non-sparse decoder grads in the long run, unless you use this slightly crazy scaling for the encoder grads?
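
(Quick sanity check of that factor, taking weight.shape[0] to be the vocab size; illustrative numbers only:)

vocab_size = 2 ** 20                          # e.g. a large knowledge-graph vocab
batch_size = 2 ** 8                           # tokens contributing grads per step
bwd_scale = (vocab_size / batch_size) ** 0.5  # the rule from the snippet above
print(bwd_scale)                              # 64.0 -- big, but arguably tolerable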

DouglasOrr (Collaborator):

Sorry, it took me a long time to reply... 👍 sounds reasonable.

functionality (e.g. causal masking, positional embeddings, usage for inference).

Args:
hidden_size (int): _description_

DouglasOrr (Collaborator):

Are these _description_ placeholders? Can't see any of your magic @s to fill them in.

thecharlieblake (Contributor, Author):

Just me forgetting to write the docs 🏅🐟
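
(For the record, roughly what I have in mind for these - my wording, not final:)

Args:
    hidden_size (int): width of the hidden/embedding dimension.
    vocab_size (int): number of tokens in the vocabulary.
    layers (int): number of transformer layers.
    heads (int): number of attention heads per layer.
    dropout_p (float, optional): dropout probability. Defaults to 0.1.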

vocab_size (int): _description_
layers (int): _description_
heads (int): _description_
dropout_p (float, optional): _description_. Defaults to 0.1.

DouglasOrr (Collaborator):

I had wondered if default-on dropout felt a bit weird. (I remember asking in #9 - did you see that, or did the GitHub auto-collapsing thing get in the way?)

thecharlieblake (Contributor, Author):

Oh no. I missed 9 comments there because of auto-collapsing - lesson learned! (Bad UI? Bad user?) I'll address them here.


def forward(self, input_ids: Tensor, labels: Tensor) -> Tensor:
input = self.embedding(input_ids)
input = U.dropout(input, self.dropout_p, self.training)

DouglasOrr (Collaborator):

I usually put a layer_norm here and no dropout, but OPT and LLaMA seem to just use the embedding directly (LLaMA might have dropout, not visible in inference).

I guess this seems reasonable for now, and I presume we've got some opportunity to tweak the default before loads of people depend on it.

thecharlieblake (Contributor, Author):

I like your version; it has a symmetry to it: embed -> LN -> transformer body -> LN -> un-embed.
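
Something like this, perhaps (a rough sketch of that ordering, not the code in this PR; the module names here are made up for illustration):

def forward(self, input_ids: Tensor, labels: Tensor) -> Tensor:
    hidden = self.embedding(input_ids)   # embed
    hidden = self.initial_norm(hidden)   # LN (hypothetical module)
    hidden = self.body(hidden)           # transformer body (hypothetical)
    hidden = self.final_norm(hidden)     # LN (hypothetical module)
    logits = self.unembed(hidden)        # un-embed (hypothetical)
    return self.loss_fn(logits, labels)  # e.g. cross-entropy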

else:
threshold = 2.5
assert p.grad is not None
assert p.grad.std().detach() == pytest.approx(1, rel=threshold), name

DouglasOrr (Collaborator):

I think this test would check that std() is between [-19, 21] in the case of layer_norm.bias. Perhaps [1/20, 20] would be a better range?
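
e.g. something along these lines (a sketch only, reusing threshold / p / name from the snippet above):

assert p.grad is not None
std = p.grad.std().detach().item()
assert 1 / threshold <= std <= threshold, name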

thecharlieblake (Contributor, Author):

Updates based on review feedback (including a new feature in the docs DSL thing!). Changes here: 85074c8


for arg in unsupported_args:
if arg not in default_kwargs:
print(default_kwargs, argspec)

DouglasOrr (Collaborator):

Is this print deliberately retained?

thecharlieblake (Contributor, Author):

Oops!

if isinstance(scale, Sequence):
output_scale, left_grad_scale, right_grad_scale = scale # type: ignore
else:
output_scale = left_grad_scale = right_grad_scale = scale

DouglasOrr (Collaborator):

Perhaps this block should go to constraints as apply_ternary or something?

thecharlieblake (Contributor, Author):

I agree that what I have is a bit ugly, but I'm also a bit concerned that another level of indirection might be hard for new users to follow. Might leave this as-is for now...
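
(For reference, if we do pull it out later, the helper could look something like this - name and location per your suggestion, so purely hypothetical:)

from collections.abc import Sequence
from typing import Tuple, Union

def apply_ternary(scale: Union[float, Sequence]) -> Tuple[float, float, float]:
    # Expand a single scale, or a 3-element sequence, into
    # (output_scale, left_grad_scale, right_grad_scale).
    if isinstance(scale, Sequence):
        output_scale, left_grad_scale, right_grad_scale = scale
    else:
        output_scale = left_grad_scale = right_grad_scale = scale
    return output_scale, left_grad_scale, right_grad_scale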

DouglasOrr (Collaborator) left a comment:

Thanks, looks good! unsupported_arg is v. good.

thecharlieblake merged commit 6f87e8d into transformer-layer on May 4, 2023