Markov models
When you actually implement a language model, you are implementing a Markov model, also called a Markov chain.
A Markov model is any model that makes use of the Markov assumption. A Markov model is defined by:
- A set of history states $Q = \set{q_{1}, q_{2}, q_{3}, ..., q_{n}}$
- A set of predicted states $R = \set{r_{1}, r_{2}, r_{3}, ..., r_{m}}$ (what you try to predict)
- A transition probability matrix $A$ (size: $n \times m$)
- An initial probability distribution $\pi$
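As a concrete illustration, here is a minimal sketch of these four components for a toy bigram model (my own example, with made-up probabilities):

```python
import numpy as np

# Toy bigram Markov model: Q are the history states (rows),
# R the predicted states (columns). Q contains the BoS marker,
# R contains the EoS marker.
Q = ["<s>", "i", "like", "cheese"]
R = ["i", "like", "cheese", "</s>"]

# Transition probability matrix A of size n x m:
# A[i, j] = P(R[j] | Q[i]); every row is a probability distribution.
A = np.array([
    [0.8, 0.1, 0.1, 0.0],   # P(. | <s>)  -- this row is also pi
    [0.0, 0.7, 0.2, 0.1],   # P(. | i)
    [0.1, 0.0, 0.8, 0.1],   # P(. | like)
    [0.1, 0.1, 0.0, 0.8],   # P(. | cheese)
])

# The initial distribution pi is embedded in A as the BoS row.
pi = A[Q.index("<s>")]

assert np.allclose(A.sum(axis=1), 1.0)  # each row sums to one
```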
In a bigram model, Q and R are the same set. With larger n-gram models they are not: Q is bigger than R because the histories are longer.
Another difference between Q and R is that Q contains the BoS symbol while R contains the EoS symbol.
The transition probability matrix A encodes the probability of going from a history state in Q to a predicted state in R.
Q usually corresponds to the rows of A and R to the columns. This is done so we can get relative frequencies by row-normalizing the counts.
The initial distribution describes where all sequences start. It is given by the transitions from the state in Q that consists only of BoS symbols to the states in R. The initial distribution can be set in advance or estimated.
It is usually just embedded in A by augmenting Q with the BoS state.
So it is simply the row of A whose history consists only of beginning-of-sequence symbols: the initial state.
Looking at the transition matrix as a table makes this more concrete. The idea is that the sum of an entire row is one, while the sum of a column does not have to be one.
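A short sketch of how such a table is estimated by row-normalizing counts (a toy corpus of my own, with simple whitespace tokenization):

```python
from collections import defaultdict

# Toy corpus, padded with BoS/EoS markers for a bigram model.
corpus = [
    "<s> i like cheese </s>",
    "<s> i like tea </s>",
    "<s> i love tea </s>",
]

# Count transitions: counts[history][word].
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = sentence.split()
    for history, word in zip(tokens, tokens[1:]):
        counts[history][word] += 1

# Row-normalize: divide each count by its row total to get relative frequencies.
A = {
    history: {word: c / sum(row.values()) for word, c in row.items()}
    for history, row in counts.items()
}

print(A["i"])    # {'like': 0.666..., 'love': 0.333...} -- the row sums to one
print(A["<s>"])  # {'i': 1.0} -- the BoS row doubles as the initial distribution
```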
How do you decide on the n-gram size? The λ's for interpolation? Which discounting method you will use? Which k you will use if you use Laplace smoothing, etc.?
We can treat this like any other machine learning problem. In this case the loss/cost/error function is perplexity, and we want to minimize it: tune the hyperparameters so that the perplexity on the dev set is as low as possible.
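A hedged sketch of such a tuning loop, picking the k for add-k (Laplace) smoothing of a bigram model that minimizes dev-set perplexity (toy data and helper functions of my own):

```python
import math
from collections import defaultdict

def train_counts(sentences):
    """Collect bigram and history counts from whitespace-split sentences."""
    bigram, history, vocab = defaultdict(int), defaultdict(int), set()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        vocab.update(tokens)
        for h, w in zip(tokens, tokens[1:]):
            bigram[(h, w)] += 1
            history[h] += 1
    return bigram, history, vocab

def perplexity(sentences, bigram, history, vocab, k):
    """Perplexity of the sentences under an add-k smoothed bigram model."""
    log_prob, n_tokens = 0.0, 0
    V = len(vocab)
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for h, w in zip(tokens, tokens[1:]):
            p = (bigram[(h, w)] + k) / (history[h] + k * V)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Hypothetical train/dev split; in practice these come from your corpus.
train = ["i like cheese", "i like tea", "i love tea"]
dev = ["i like tea", "i love cheese"]

bigram, history, vocab = train_counts(train)
best_k = min([0.01, 0.1, 0.5, 1.0],
             key=lambda k: perplexity(dev, bigram, history, vocab, k))
print("best k:", best_k)
```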
Higher-order n-grams are more constraining and precise, but also more sparse. If we have 20k types, then there are:
- 20K$^{2}$ bigrams = $4 \times 10^{8}$
- 20K$^{3}$ trigrams = $8 \times 10^{12}$
- 20K$^{4}$ tetragrams = $1.6 \times 10^{17}$
So you need to have the data to support this, otherwise you will run into n-grams you have not seen before too often.
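A quick sanity check of that arithmetic (just the numbers from the list above):

```python
# Number of possible n-grams for a vocabulary of 20k types.
V = 20_000
for n, name in [(2, "bigrams"), (3, "trigrams"), (4, "tetragrams")]:
    print(f"{name}: {V ** n:.1e}")
# bigrams: 4.0e+08
# trigrams: 8.0e+12
# tetragrams: 1.6e+17
```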
This is why normalization is important: it reduces the number of different histories from which you can predict.
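For example, a simple normalization scheme such as case folding and punctuation stripping (my own illustration, not the course's exact definition) already collapses many histories into one:

```python
import re

tokens = ["The", "the", "THE", "cheese,", "cheese", "Cheese!"]

def normalize(token):
    # Lowercase and strip punctuation -- one possible normalization scheme.
    return re.sub(r"[^\w]", "", token.lower())

print(len(set(tokens)))                     # 6 distinct types before normalization
print(len({normalize(t) for t in tokens}))  # 2 distinct types after normalization
```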
Although I have tried my best to make sure this summary is correct, I will take no responsibility for mistakes that might lead to you having a lower grade.
If you see anything that you think might be wrong, then please create an issue on the GitHub repository or, even better, create a pull request 😄
Do you appreciate my summaries and want to thank me? Then you can support me here:
Every model is wrong, but some models are useful.