
Hidden Markov Models

This is like a Markov model, but we also care about hidden states. In PoS tagging we do not observe the states we want to predict, unlike in Language Modeling: the PoS tags are hidden. We observe a sequence of words and want to find the best sequence of tags for that particular word sequence, out of all possible tag sequences.

A Hidden Markov Model (HMM) consists of the following components:

  • Q - A finite set of N states.
  • A - A state transition probability matrix.
  • $\pi$ - An initial probability distribution.
  • O - A sequence of T observations.
  • B - An observation likelihood matrix: the probability of a specific observation given a state.

The last two of these are what differs from Markov models. We use Q, A and $\pi$ for the hidden states, so the states in Q are no longer directly visible. With O and B we encode what we do observe. Q still contains BoS and EoS.
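To make the five components concrete, here is a minimal sketch of how they could be stored in code (the field names and dictionary layout are assumptions for illustration, not a prescribed format):

```python
from dataclasses import dataclass

@dataclass
class HMM:
    """Container for the five HMM components described above."""
    states: list[str]                          # Q: PoS tags, including "BoS" and "EoS"
    transitions: dict[tuple[str, str], float]  # A: P(tag_j | tag_i)
    initial: dict[str, float]                  # pi: P(tag | BoS)
    emissions: dict[tuple[str, str], float]    # B: P(word | tag)
    # O, the observation sequence, is the input at tagging time,
    # e.g. ["the", "dog", "barks"]
```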

Components

Q

Rather than consisting of words or n-grams, Q consists of PoS tags. So, Q contains the states of the HMM. This also includes BoS and EoS.

A

A is a |Q|-by-|Q| matrix, where each cell $A_{ij}$ indicates the probability of moving from state $i \in Q$ to state $j \in Q$. This means that we need a corpus with annotated data in which we can observe the PoS tags.

We can do the estimation using maximum likelihood estimates. The formula for that is: $$p(t_{i} \mid t_{i-n:i-1}) = \frac{c(t_{i-n:i-1}, t_{i})}{c(t_{i-n:i-1})}$$ Here $c$ is a count function. So you divide the co-occurrence frequency of the preceding tags followed by the current tag by the frequency of the preceding tags. That gives you the probability of the tag given the preceding tags.
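A minimal sketch of this maximum likelihood estimate for a bigram (first-order) HMM, assuming the annotated corpus is a list of sentences where each sentence is a list of (word, tag) pairs (the corpus format, tag names and BoS/EoS handling are assumptions for illustration):

```python
from collections import Counter

def estimate_transitions(tagged_sentences):
    """MLE of A: p(t_i | t_{i-1}) = c(t_{i-1}, t_i) / c(t_{i-1})."""
    bigram_counts = Counter()   # c(t_{i-1}, t_i)
    unigram_counts = Counter()  # c(t_{i-1})
    for sentence in tagged_sentences:
        tags = ["BoS"] + [tag for _, tag in sentence] + ["EoS"]
        for prev, cur in zip(tags, tags[1:]):
            bigram_counts[(prev, cur)] += 1
            unigram_counts[prev] += 1
    return {(prev, cur): count / unigram_counts[prev]
            for (prev, cur), count in bigram_counts.items()}

# Toy corpus: one annotated sentence.
corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]]
A = estimate_transitions(corpus)
print(A[("DET", "NOUN")])  # 1.0: every DET in the toy corpus is followed by NOUN
```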

$\pi$

Similarly to the Markov Chain, $\pi$ encodes the probability that each state $q \in Q$ follows the BoS symbol. $\pi$ can be fixed in advance if you want to put constraints on what can come at the start, or you can estimate it from a corpus. So instead of being about words, it is about tags.
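Estimated from a corpus, $\pi$ is simply the relative frequency of each tag in sentence-initial position (equivalently, the BoS row of A). A small sketch under the same assumed corpus format as above:

```python
from collections import Counter

def estimate_initial(tagged_sentences):
    """MLE of pi: relative frequency of each tag in sentence-initial position."""
    first_tags = Counter(sentence[0][1] for sentence in tagged_sentences if sentence)
    total = sum(first_tags.values())
    return {tag: count / total for tag, count in first_tags.items()}

corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
          [("dogs", "NOUN"), ("bark", "VERB")]]
print(estimate_initial(corpus))  # {"DET": 0.5, "NOUN": 0.5}
```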

O

The set of observed events: in PoS tagging, O contains words. This set has to be finite, otherwise we cannot finish. Every observation in O has to be taggable with some state in Q.

B

B is another matrix. It is of size |Q|-by-|O|, where each cell $b_{qo}$ indicates the probability that a word $o \in O$ is generated by a state $q \in Q$.

To compute B, we need to find out how often each word occurs tagged with a particular PoS tag in a corpus. Again this needs annotated data.

So B encodes the probability that a certain word occurs given that we observed a certain tag. Given that we observed a noun, how likely is it that this noun is exactly dog, for instance, and not aardvark?

This is also calculated with maximum likelihood: $$p(w_{i} \mid t_{i}) = \frac{c(t_{i},w_{i})}{c(t_{i})}$$

This divides the number of times a specific word is tagged with a certain tag by the total number of times that tag occurs.
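A small sketch of this estimate, again assuming the corpus is a list of sentences of (word, tag) pairs:

```python
from collections import Counter

def estimate_emissions(tagged_sentences):
    """MLE of B: p(word | tag) = c(tag, word) / c(tag)."""
    pair_counts = Counter()  # c(tag, word)
    tag_counts = Counter()   # c(tag)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            pair_counts[(tag, word)] += 1
            tag_counts[tag] += 1
    return {(tag, word): count / tag_counts[tag]
            for (tag, word), count in pair_counts.items()}

corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
          [("the", "DET"), ("aardvark", "NOUN"), ("sleeps", "VERB")]]
B = estimate_emissions(corpus)
print(B[("NOUN", "dog")])  # 0.5: half of the observed nouns are "dog"
```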

So we want to know the likeliest tag for a word, but we compute the likeliest word after observing a tag. This is like Bayes' rule: we aim for the posterior but compute the likelihood and the prior, and then estimate the posterior.
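Spelled out with Bayes' rule (a standard rewrite; the denominator $p(w_{1:n})$ can be dropped because it is the same for every candidate tag sequence):

$$\hat{t}_{1:n} = \underset{t_{1:n}}{\operatorname{arg\,max}}\; p(t_{1:n} \mid w_{1:n}) = \underset{t_{1:n}}{\operatorname{arg\,max}}\; \frac{p(w_{1:n} \mid t_{1:n})\, p(t_{1:n})}{p(w_{1:n})} = \underset{t_{1:n}}{\operatorname{arg\,max}}\; p(w_{1:n} \mid t_{1:n})\, p(t_{1:n})$$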

Assumptions

Both of these assumptions simplify the problem to make it possible to compute.

Markov Assumption

The probability of the next tag is determined only by the local history, not by the whole sequence.
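For a bigram (first-order) HMM this means:

$$p(t_{i} \mid t_{1:i-1}) \approx p(t_{i} \mid t_{i-1})$$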

Output independence

The probability of a word depends only on the state (tag) that produced it, not on any other state or word.
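In formula form: $$p(w_{i} \mid t_{1:n}, w_{1:i-1}) \approx p(w_{i} \mid t_{i})$$

Together with the Markov assumption, this lets a first-order HMM factor the joint probability of a word sequence and a tag sequence as

$$p(w_{1:n}, t_{1:n}) \approx \prod_{i=1}^{n} p(t_{i} \mid t_{i-1})\, p(w_{i} \mid t_{i})$$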
