Hidden Markov Models
This is like a Markov model, but we also care about hidden states. In PoS tagging we do not observe the states we want to predict, as we do in language modeling; the PoS tags are hidden. We observe a sequence of words and want to find the best sequence of tags for that particular word sequence, out of all possible tag sequences.
A Hidden Markov Model (HMM) consists of the following components:
- Q - A finite set of N states.
- A - A state transition probability matrix.
- $\pi$ - An initial probability distribution.
- O - A sequence of T observations.
- B - An observation likelihood matrix, giving the probability of a specific observation given a state.
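As a minimal sketch, these five components could be written down in Python like this; the tags, words and probability values below are invented toy examples, not taken from a real corpus:

```python
# Q: a finite set of states (PoS tags), including BoS (<s>) and EoS (</s>)
states = ["<s>", "NOUN", "VERB", "</s>"]

# A: state transition probabilities, transitions[prev][next] = P(next tag | prev tag)
transitions = {
    "<s>":  {"NOUN": 0.7, "VERB": 0.3},
    "NOUN": {"NOUN": 0.2, "VERB": 0.6, "</s>": 0.2},
    "VERB": {"NOUN": 0.5, "</s>": 0.5},
}

# pi: initial probability distribution over tags
initial = {"NOUN": 0.7, "VERB": 0.3}

# O: a sequence of T observations (the words we actually see)
observations = ["dogs", "bark"]

# B: observation likelihoods, emissions[tag][word] = P(word | tag)
emissions = {
    "NOUN": {"dogs": 0.01, "bark": 0.002},
    "VERB": {"bark": 0.03, "dogs": 0.001},
}
```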
The last two of these, O and B, are what is new compared to a Markov model, but Q, A and $\pi$ are used a bit differently too. Rather than consisting of words or n-grams, Q consists of PoS tags, so Q contains the states of the HMM. This also includes BoS and EoS.
A is a |Q|-by-|Q| matrix, where each cell $a_{ij}$ holds the probability of transitioning from tag $i$ to tag $j$.
We can do the estimation using maximum likelihood estimates. The formula for that is:

$$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$$

where $C(\cdot)$ counts occurrences in an annotated corpus. Similarly to the Markov chain, this divides a bigram count by a unigram count, only now over tag sequences rather than over words.
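As a rough sketch of this estimation, assuming a tiny invented tagged corpus (two sentences made up for illustration), the counts and transition probabilities could be computed like this:

```python
from collections import Counter

# Toy annotated corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("cats", "NOUN"), ("sleep", "VERB")],
]

tag_bigrams = Counter()
tag_unigrams = Counter()
for sentence in corpus:
    tags = ["<s>"] + [tag for _, tag in sentence] + ["</s>"]
    tag_unigrams.update(tags[:-1])                # C(t_{i-1})
    tag_bigrams.update(zip(tags[:-1], tags[1:]))  # C(t_{i-1}, t_i)

# MLE: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
A = {(prev, nxt): count / tag_unigrams[prev]
     for (prev, nxt), count in tag_bigrams.items()}

print(A[("<s>", "NOUN")])  # 1.0: every toy sentence starts with a NOUN
```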
O is the set of observed events: in PoS tagging, O contains words. This set has to be finite, otherwise we could not finish. Every observation in O has to be taggable with a state from Q.
B is another matrix. It is of size |Q|-by-|O|, where each cell $b_i(o_t)$ holds the probability of observation $o_t$ being generated from state $i$.
To compute B, we need to find out how often each word occurs tagged with a particular PoS tag in a corpus. Again this needs annotated data.
So B encodes the probability that a certain word occurs given that we observed a certain tag. Given that we observed a noun, how likely is it that this noun is exactly dog, for instance, and not aardvark?
This is also calculated with MLE:

$$P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}$$

This divides the number of times a specific word is tagged with a certain tag by the total number of times that tag occurs.
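A similar sketch for B, again over an invented toy corpus, counting how often each word appears with each tag:

```python
from collections import Counter

# Same kind of toy annotated corpus as before (invented for illustration).
corpus = [
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("dogs", "NOUN"), ("sleep", "VERB")],
]

tag_word_counts = Counter()
tag_counts = Counter()
for sentence in corpus:
    for word, tag in sentence:
        tag_word_counts[(tag, word)] += 1  # C(t_i, w_i)
        tag_counts[tag] += 1               # C(t_i)

# MLE: P(w_i | t_i) = C(t_i, w_i) / C(t_i)
B = {(tag, word): count / tag_counts[tag]
     for (tag, word), count in tag_word_counts.items()}

print(B[("NOUN", "dogs")])  # 1.0: every NOUN in this toy corpus is "dogs"
```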
So we want to know the likeliest tag for a word, but we compute the likeliest word after observing a tag. This is like Bayes' rule: we aim for the posterior, but compute the likelihood and the prior, and from those estimate the posterior.
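Written out in the common notation (where $w_{1:n}$ is the word sequence and $t_{1:n}$ a candidate tag sequence, neither of which appears in the summary itself), that Bayes-rule step looks like this; the denominator $P(w_{1:n})$ is the same for every candidate tag sequence, so it can be dropped:

$$\hat{t}_{1:n} = \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n}) = \arg\max_{t_{1:n}} \frac{P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})}{P(w_{1:n})} = \arg\max_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})$$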
Two assumptions simplify the problem enough to make it computable:
- The Markov assumption - the probability of the next tag is only determined by the local history (the previous tag), not the whole sequence.
- The output independence assumption - the probability of a word only depends on the state that produced the corresponding tag, not on any other state or word.
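In the same notation as above, the two assumptions can be written as:

$$P(t_{1:n}) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}), \qquad P(w_{1:n} \mid t_{1:n}) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$$

which together turn the tagging objective into:

$$\hat{t}_{1:n} = \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$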
Although I have tried my best to make sure this summary is correct, I will take no responsibility for mistakes that might lead to you having a lower grade.
If you see anything that you think might be wrong, then please create an issue on the GitHub repository or, even better, create a pull request 😄
Do you appreciate my summaries and want to thank me? Then you can support me here:
Every model is wrong, but some models are useful.