Add OOV abbreviation (#180)
SponsoredByPuma authored Dec 3, 2024
1 parent cc44c8d commit 4d2fde4
Showing 2 changed files with 4 additions and 3 deletions.
6 changes: 3 additions & 3 deletions docs/lectures/language_models.md
@@ -862,9 +862,9 @@ Since the perplexity is the inverse probability of the test set, the lower the pe
The perplexity score of GPT-2 is reported to be 18.34 on the [WikiText-2](https://openai.com/research/better-language-models).
There is no official perplexity score published by OpenAI for later versions of GPT, but according to [this](https://www.predictea.com/exploring-the-latest-advancements-in-gpt-4-a-comprehensive-overview/) source, GPT-3.5 achieves a perplexity score of 4.5 while GPT-4 achieves a perplexity score of 2.6.

-## Out of Vocabulary Words
+## Out of Vocabulary (OOV) Words

-In some cases we need to deal with words that we **haven't seen before**. Such words are called **out of vocabulary** (OOV) words, and are usually replaced by a **special token `<unk>`**.
+In some cases we need to deal with words that we **haven't seen before**. Such words are called OOV words, and are usually replaced by a **special token `<unk>`**.

We need to think about how to make predictions for words that we have not seen in the training corpus. What would be the N-gram probability of a word that is not in the corpus?
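
A minimal sketch of how such a closed vocabulary is usually built (illustrative only, not part of this commit; the `close_vocabulary` helper and the `min_count` threshold are hypothetical names):

```python
from collections import Counter

def close_vocabulary(corpus, min_count=2, unk="<unk>"):
    """Replace words seen fewer than min_count times with the <unk> token."""
    counts = Counter(word for sentence in corpus for word in sentence)
    vocab = {word for word, count in counts.items() if count >= min_count}
    return [[word if word in vocab else unk for word in sentence]
            for sentence in corpus]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "platypus", "sat"]]
print(close_vocabulary(corpus))
# [['the', '<unk>', 'sat'], ['the', '<unk>', 'sat'], ['<unk>', '<unk>', 'sat']]
```

At test time, any word outside the vocabulary is mapped to `<unk>` the same way, so the N-gram counts for `<unk>` exist and a previously unseen word gets a nonzero probability.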

@@ -887,7 +887,7 @@ Any other words would be replaced by `<unk>`.

!!! warning

-    Using a lot of out of vocabulary words can influence the **perplexity** score. If there are a lot of OOV words in the test set, the model will predict them with a high probability, which will result in a low perplexity score.
+    Using a lot of OOV words can influence the **perplexity** score. If there are a lot of OOV words in the test set, the model will predict them with a high probability, which will result in a low perplexity score.

This means the model will generate sentences that contain a lot of `<unk>` tokens.
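
A small numerical sketch of that effect, with made-up per-token probabilities: mapping rare test words to a frequent `<unk>` token raises their assigned probability, which lowers the measured perplexity without the model actually improving.

```python
import math

def perplexity(probs):
    """PP(W) = P(w1..wN)^(-1/N), computed in log space for numerical stability."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Hypothetical per-token probabilities for the same 5-token test sentence:
rare_as_is  = [0.001, 0.002, 0.001, 0.003, 0.002]  # rare words kept as-is
rare_as_unk = [0.05, 0.05, 0.001, 0.05, 0.002]     # three rare words mapped to a frequent <unk>

print(perplexity(rare_as_is))   # ~608: high perplexity
print(perplexity(rare_as_unk))  # ~83: much lower, purely due to the <unk> mapping
```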

1 change: 1 addition & 0 deletions includes/abbreviations.md
@@ -2,4 +2,5 @@
*[CBOW]: Continuous Bag-of-Words
*[HTWG]: Hochschule Konstanz Technik, Wirtschaft und Gestaltung
*[NER]: Named Entity Recognition
+*[OOV]: out of vocabulary
*[TF-IDF]: Term Frequency-Inverse Document Frequency
