Add OOV abbreviation (#180)
SponsoredByPuma authored Dec 3, 2024
1 parent cc44c8d commit 4d2fde4
Showing 2 changed files with 4 additions and 3 deletions.
6 changes: 3 additions & 3 deletions docs/lectures/language_models.md
@@ -862,9 +862,9 @@ Since the perplexity is the inverse probability of the test set, the lower the pe
The perplexity score of GPT-2 is reported to be 18.34 on the [WikiText-2](https://openai.com/research/better-language-models).
There is no official perplexity score published by OpenAI for later versions of GPT, but according to [this](https://www.predictea.com/exploring-the-latest-advancements-in-gpt-4-a-comprehensive-overview/) source, GPT-3.5 achieves a perplexity score of 4.5 while GPT-4 achieves a perplexity score of 2.6.

-## Out of Vocabulary Words
+## Out of Vocabulary (OOV) Words

-In some cases we need to deal with words that we **haven't seen before**. Such words are called **out of vocabulary** (OOV) words, and are usually replaced by a **special token `<unk>`**.
+In some cases we need to deal with words that we **haven't seen before**. Such words are called OOV words, and are usually replaced by a **special token `<unk>`**.

We need to think about how to make predictions for words that we have not seen in the training corpus. What would be the N-gram probability of a word that is not in the corpus?
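
A minimal sketch of how such a closed vocabulary is usually built (illustrative only, not part of this commit; the `close_vocabulary` helper and the `min_count` threshold are hypothetical names):

```python
from collections import Counter

def close_vocabulary(corpus, min_count=2, unk="<unk>"):
    """Replace words seen fewer than min_count times with the <unk> token."""
    counts = Counter(word for sentence in corpus for word in sentence)
    vocab = {word for word, count in counts.items() if count >= min_count}
    return [[word if word in vocab else unk for word in sentence]
            for sentence in corpus]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "platypus", "sat"]]
print(close_vocabulary(corpus))
# [['the', '<unk>', 'sat'], ['the', '<unk>', 'sat'], ['<unk>', '<unk>', 'sat']]
```

At test time, any word outside the vocabulary is mapped to `<unk>` the same way, so the N-gram counts for `<unk>` exist and a previously unseen word gets a nonzero probability.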

@@ -887,7 +887,7 @@ Any other words would be replaced by `<unk>`.

!!! warning

-    Using a lot of out of vocabulary words can influence the **perplexity** score. If there are a lot of OOV words in the test set, the model will predict them with a high probability, which will result in a low perplexity score.
+    Using a lot of OOV words can influence the **perplexity** score. If there are a lot of OOV words in the test set, the model will predict them with a high probability, which will result in a low perplexity score.

This means the model will generate sentences that contain a lot of `<unk>` tokens.
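
A small numerical sketch of that effect, with made-up per-token probabilities: mapping rare test words to a frequent `<unk>` token raises their assigned probability, which lowers the measured perplexity without the model actually improving.

```python
import math

def perplexity(probs):
    """PP(W) = P(w1..wN)^(-1/N), computed in log space for numerical stability."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Hypothetical per-token probabilities for the same 5-token test sentence:
rare_as_is  = [0.001, 0.002, 0.001, 0.003, 0.002]  # rare words kept as-is
rare_as_unk = [0.05, 0.05, 0.001, 0.05, 0.002]     # three rare words mapped to a frequent <unk>

print(perplexity(rare_as_is))   # ~608: high perplexity
print(perplexity(rare_as_unk))  # ~83: much lower, purely due to the <unk> mapping
```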

1 change: 1 addition & 0 deletions includes/abbreviations.md
@@ -2,4 +2,5 @@
*[CBOW]: Continuous Bag-of-Words
*[HTWG]: Hochschule Konstanz Technik, Wirtschaft und Gestaltung
*[NER]: Named Entity Recognition
+*[OOV]: out of vocabulary
*[TF-IDF]: Term Frequency-Inverse Document Frequency
