Add soft capping to reversible embedding layer #1718
Merged
Forgetting the final output soft-cap is an easy mistake to make, and worse, generations without the soft-cap will still look plausible, just with worse actual results.
Adding soft-capping to our reversible embedding layer is much more robust: as long as you use the layer to compute logits over the vocab, you can no longer forget the soft-cap.
Before this fix, we were missing it from our actual CausalLM functional model output, meaning soft-capping was not applied during training!
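For context, the soft-cap transform squashes logits into (-cap, cap) via `cap * tanh(logits / cap)`. Below is a minimal sketch of how a reversible embedding layer can apply it on the reverse (logit-computing) pass, assuming Keras 3 ops; the class and the `logit_soft_cap` argument name here are illustrative, not the exact library API.

```python
import keras
from keras import ops


class ReversibleEmbeddingSketch(keras.layers.Layer):
    """Sketch of a reversible embedding that soft-caps its output logits."""

    def __init__(self, vocabulary_size, hidden_dim, logit_soft_cap=None, **kwargs):
        super().__init__(**kwargs)
        self.vocabulary_size = vocabulary_size
        self.hidden_dim = hidden_dim
        # Hypothetical argument name for this example.
        self.logit_soft_cap = logit_soft_cap

    def build(self, inputs_shape):
        self.embeddings = self.add_weight(
            shape=(self.vocabulary_size, self.hidden_dim),
            initializer="uniform",
            name="embeddings",
        )

    def call(self, inputs, reverse=False):
        if not reverse:
            # Forward pass: token ids -> embedding vectors.
            return ops.take(self.embeddings, inputs, axis=0)
        # Reverse pass: hidden states -> logits over the vocabulary.
        logits = ops.matmul(inputs, ops.transpose(self.embeddings))
        if self.logit_soft_cap is not None:
            # Soft capping lives inside the layer, so any caller computing
            # vocab logits through the reverse pass gets it automatically.
            logits = self.logit_soft_cap * ops.tanh(logits / self.logit_soft_cap)
        return logits
```

Because the cap is applied inside the reverse pass itself, a CausalLM head built on this layer cannot silently skip it.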