Replies: 2 comments 3 replies
-
It looks like the same comments apply to
-
Thanks for the feedback! One of the reasons why it's probably not there is that I build the whole LLM first before taking it apart and discussing the individual components. And I think the embedding layers will complain before the MHA will complain if the inputs exceed the supported context length. I can see your point, though, when looking at Chapter 3 in isolation. I would maybe say adding the truncation as a commented line would be a good compromise. This way it doesn't deviate from the book contents but provides a helpful tip to readers. What do you think?
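For example, something along these lines (just a rough sketch approximating the chapter's `CausalAttention`, not the exact notebook code), where the commented-out line in `forward` is the optional truncation hint:

```python
import torch
import torch.nn as nn


class CausalAttention(nn.Module):
    """Rough sketch approximating the chapter's CausalAttention (not the exact notebook code)."""

    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # The causal mask is only context_length x context_length
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1),
        )

    def forward(self, x):
        # Optional tip for readers (left commented out so it doesn't change
        # the book's behavior): truncate inputs longer than the mask supports.
        # x = x[:, :self.mask.shape[0], :]

        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf
        )
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        return attn_weights @ values
```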
-
It looks to me like the `CausalAttention` implementation in https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb does not handle `context_length` correctly: it limits the mask size to `context_length` x `context_length` but does not truncate `x` accordingly. As a result, `context_length` values greater than 1 and less than the input text length lead to a `RuntimeError`. https://colab.research.google.com/drive/1aAkYHATiSq5jWR6RxdMmVhF9Rb24R89D?usp=sharing demonstrates this problem and a possible solution. Let me know if I can help, e.g. by contributing a PR with this change.