In the Generative Models lecture, we talked about the 'goodness of fit' of a model to some data. A concrete problem motivates what follows: how do we define a distribution over an unbounded number of lexical items? Parametric models with a fixed, finite number of parameters may end up with a poor match between the complexity of the model and the complexity of the data. Non-parametric approaches instead fit a single model whose complexity adapts to the data, rather than being fixed in advance. Our fundamental goal is to find good models for natural language sentence structure, and, as we have argued, no finite set of sentences is a good model for a natural language like English. The flexibility of this class of models, in which the number of parameters is itself treated as a random variable, makes Bayesian non-parametrics a natural choice for our problem.
One option for defining a prior over infinite-dimensional parameters is the Dirichlet Process (DP), the best-known example of a non-parametric distribution. The term non-parametric refers to statistical models whose size or complexity can grow with the data, rather than being specified in advance. The DP defines a prior on the parameters of a multinomial distribution with an infinite number of (mostly unused) possible outcomes. It is a stochastic process defining a distribution over distributions: each draw from a DP is itself a probability distribution, just like a draw from a Dirichlet distribution, but infinite-dimensional.
Before turning to the math, it helps to have an intuitive picture: a draw from a DP can be thought of as an infinite-sided die. This immediately raises the question of how one defines an infinite-sided die; the clearest answer is the stick-breaking process presented below. We first give the limit derivation, which is more "extensional" than "procedural", and then turn to the more intuitive constructions.
Recall that the standard probability distribution over the $(K-1)$-dimensional simplex is the Dirichlet distribution, defined as follows.
$$p\left(\theta_{1},\ldots,\theta_{K};\alpha_{1},\ldots,\alpha_{K}\right)=\frac{1}{B(\vec{\alpha})}\prod_{i=1}^{K}\theta_{i}^{\alpha_{i}-1}$$
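As a quick sanity check, a draw from the Dirichlet can be sampled directly. The following minimal sketch uses NumPy; the parameter vector and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])   # arbitrary concentration parameters, K = 3
theta = rng.dirichlet(alpha)        # one point on the 2-simplex

print(theta, theta.sum())           # components are positive and sum to 1
print(alpha / alpha.sum())          # compare against E[theta] = alpha / sum(alpha)
```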
In order to understand the definition of the DP as an infinite-dimensional prior, it is important to note that the Dirichlet distribution satisfies the conditions for the expansion rule, which can be used to increase the dimensionality of a Dirichlet distribution. Let $(\theta_{1},\ldots,\theta_{K})\sim \mathrm{Dir}(\alpha_{1},\ldots,\alpha_{K})$, and draw $\beta \sim \mathrm{Beta}(\alpha_{1}\rho,\ \alpha_{1}(1-\rho))$ for some $\rho \in (0,1)$. Splitting the first component into $(\theta_{1}\beta,\ \theta_{1}(1-\beta))$ then yields

$$\left(\theta_{1}\beta,\ \theta_{1}(1-\beta),\ \theta_{2},\ldots,\theta_{K}\right)\sim \mathrm{Dir}\left(\alpha_{1}\rho,\ \alpha_{1}(1-\rho),\ \alpha_{2},\ldots,\alpha_{K}\right),$$

a Dirichlet with $K+1$ components.
Repeatedly splitting components with the expansion rule while keeping the total concentration fixed at $\alpha$ gives the following symmetric prior, where $K$ is the number of components:

$$(\theta_{1},\ldots,\theta_{K})\sim \mathrm{Dir}\left(\frac{\alpha}{K},\ldots,\frac{\alpha}{K}\right).$$

Taking the limit as $K$ goes to infinity gives a prior over an infinite-dimensional space.
We may now define the DP. As previously stated, we take the limit as $K$ goes to infinity. For each point in this distribution, we assign a value (an atom) drawn from a base distribution $H$, so that a draw $G \sim \mathrm{DP}(\alpha, H)$ can be written

$$G=\sum_{k=1}^{\infty}\pi_{k}\delta_{\phi_{k}}, \qquad \phi_{k}\sim H,$$

where $\pi_{k}$ are the component weights and $\delta_{\phi_{k}}$ is a point mass at $\phi_{k}$. As you can see, the DP is parameterized by a concentration parameter $\alpha$ and a base distribution $H$. In this way it is easy to show the existence of the DP by considering finite measurable partitions: $G \sim \mathrm{DP}(\alpha, H)$ if and only if, for every finite measurable partition $(A_{1},\ldots,A_{K})$ of the support of $H$,

$$\left(G(A_{1}),\ldots,G(A_{K})\right)\sim \mathrm{Dir}\left(\alpha H(A_{1}),\ldots,\alpha H(A_{K})\right).$$
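To make the finite-limit definition concrete, the following sketch approximates a draw $G \sim \mathrm{DP}(\alpha, H)$ by a large symmetric Dirichlet $\mathrm{Dir}(\alpha/K,\ldots,\alpha/K)$ with atoms drawn from $H$. The standard normal base distribution, $K = 1000$, and $\alpha = 1$ are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, K = 1.0, 1000

# Dir(alpha/K, ..., alpha/K) weights, with one atom per component drawn from H.
weights = rng.dirichlet(np.full(K, alpha / K))
atoms = rng.normal(size=K)          # H = standard normal (an arbitrary choice)

# Almost all of the probability mass lands on a handful of atoms.
top = np.argsort(weights)[::-1][:5]
for k in top:
    print(f"atom {atoms[k]:+.3f} carries weight {weights[k]:.3f}")
```

Even with $K = 1000$ components available, nearly all of the mass concentrates on a few atoms, anticipating the point about unused components below.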
It is important to note that, in practice, only a finite subset of the components will actually be used. The number of components used in the end should reflect the complexity of the data.
Understanding the DP can be difficult, but a number of constructions may help to highlight some key properties.
The stick-breaking process is an intuitive way to visualize draws from a DP. Imagine drawing an infinite sequence of samples $\beta_{1},\beta_{2},\ldots$ from a Beta distribution with parameters $1$ and $\alpha$; that is, $\beta_{k}\sim \mathrm{Beta}(1,\alpha)$.
Ultimately we would like to define a distribution on an infinite set of discrete outcomes that will represent categories or mixture components, but we start by defining a distribution on the natural numbers. The probability of the natural number $k$ is

$$\pi_{k}=\beta_{k}\prod_{j=1}^{k-1}(1-\beta_{j}).$$

Consider a stick of unit length. Break off a fraction $\beta_{1}$ and assign its length to outcome $1$; from what remains, break off a fraction $\beta_{2}$ and assign that length to outcome $2$; and so on, so that $\pi_{k}$ is the length of the $k$th piece. These $\pi_{k}$ are exactly the weights in the representation $G=\sum_{k}\pi_{k}\delta_{\phi_{k}}$ above.
Notice that the length of the piece that we break off at each step is determined by the concentration parameter $\alpha$: since $\mathbb{E}[\beta_{k}]=1/(1+\alpha)$, a small $\alpha$ breaks off large pieces early and concentrates the mass on a few outcomes, while a large $\alpha$ breaks off many small pieces and spreads the mass over many outcomes.
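The stick-breaking recipe translates directly into code. Below is a minimal sketch of a truncated stick-breaking sampler; the truncation level, the standard normal base distribution, and $\alpha = 1$ are illustrative assumptions, not part of the construction itself.

```python
import numpy as np

def stick_breaking(alpha, truncation, rng):
    """Weights pi_1..pi_T and atoms phi_1..phi_T of a truncated draw G ~ DP(alpha, H)."""
    betas = rng.beta(1.0, alpha, size=truncation)    # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining                      # pi_k = beta_k * prod_{j<k} (1 - beta_j)
    atoms = rng.normal(size=truncation)              # phi_k ~ H (standard normal here)
    return weights, atoms

rng = np.random.default_rng(2)
weights, atoms = stick_breaking(alpha=1.0, truncation=20, rng=rng)
print(weights[:5], weights.sum())   # the total approaches 1 as the truncation grows
```

The deeper the truncation, the less of the stick is left unassigned, so the weights sum ever closer to 1.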
The Chinese Restaurant Process (CRP) is a second, more complex but widely used construction for understanding the Dirichlet Process. It is important to note that these are alternative but equivalent ways of constructing the Dirichlet process. The CRP is usually described as a sequential sampling scheme using the metaphor of a restaurant.
We imagine a restaurant with an infinite number of tables. The first customer enters the restaurant and sits at the first unoccupied table. The $(N+1)$th customer enters the restaurant and sits at either an already occupied table or a new, unoccupied table, according to the following distribution, where $n_{k}$ is the number of customers at table $k$:

$$P(\text{table } k \mid \text{previous } N \text{ customers}) = \begin{cases} \dfrac{n_{k}}{N+\alpha} & \text{if table } k \text{ is occupied} \\[1ex] \dfrac{\alpha}{N+\alpha} & \text{if table } k \text{ is the next unoccupied table.} \end{cases}$$
More intuitively, customers sit at a table which is already occupied with probability proportional to the number of individuals already seated at that table, and sit at a new table with probability proportional to the concentration parameter $\alpha$.
Each table has a dish associated with it. Each dish is a draw $\phi_{k} \sim H$ from the base distribution, shared by all customers seated at that table; tables thus play the role of the DP's atoms, and dishes the role of their values.
One way of understanding the CRP is to think of it as defining a distribution over ways of partitioning $N$ customers into tables. The probability of a particular partition of the $N$ customers into $K$ occupied tables with counts $n_{1},\ldots,n_{K}$ is

$$P(n_{1},\ldots,n_{K})=\frac{\alpha^{K}\prod_{k=1}^{K}(n_{k}-1)!}{\prod_{i=0}^{N-1}(\alpha+i)}.$$
So the CRP is related to the DP in just the way the Polya urn scheme is related to the Dirichlet distribution: the CRP is the posterior predictive distribution that results from integrating out the draw from the DP (see the lecture notes on the Polya urn scheme).
The CRP construction helps to highlight some properties of the DP. The first is that the CRP implements a simplicity bias: it assigns higher probability to partitions which (1) have fewer customers, (2) have fewer tables, and (3), for a fixed number of customers $N$, assign those customers to the smallest number of tables.
Thus the CRP favors simple restaurants and implements a rich-get-richer scheme, or self-reinforcing property. Because customers sit at a table with a probability proportional to the number of customers already at the table, it should be clear that tables with more customers have higher probability of being chosen by later customers. These properties mean that, all else being equal, when we use the CRP, we will favor reuse of previously computed values.
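These properties are easy to see empirically. The sketch below implements the CRP seating rule directly; the parameter values and seed are arbitrary. Running it typically produces a few heavily occupied tables and a long tail of small ones, which is the rich-get-richer effect in action.

```python
import numpy as np

def seat_customers(n_customers, alpha, rng):
    """Sequentially seat customers; returns the number of customers at each table."""
    counts = []                                  # customers at each occupied table
    for n in range(n_customers):
        # Occupied table k with probability counts[k] / (n + alpha),
        # the next unoccupied table with probability alpha / (n + alpha).
        probs = np.array(counts + [alpha]) / (n + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)                     # open a new table
        else:
            counts[table] += 1
    return counts

rng = np.random.default_rng(3)
print(seat_customers(100, alpha=1.0, rng=rng))   # a few big tables, many small ones
```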
A final visualization of the Dirichlet process and the Chinese restaurant process comes from an extension of the Polya urn scheme which allows a continuum of colours.
Imagine that you start with an urn containing a single black ball. At each step, draw a ball from the urn, with the black ball drawn with probability proportional to the concentration parameter $\alpha$ and each coloured ball with probability proportional to 1. Then:

- If the ball is black, sample a new colour from the base distribution $H$, label a new ball with this colour, and put both balls back in the urn.
- If the ball is some non-black colour, label a new ball with the same colour, and put both balls back in the urn.
This construction is the same as the Polya urn scheme except that the Polya urn scheme draws colours from some finite set $X$, while the Blackwell-MacQueen urn scheme is not restricted to finite $X$: the number of distinct colours in the urn can grow without bound.
Convince yourself that the resulting distribution over colours will be the same as the distribution over tables in the Chinese restaurant process.
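One way to convince yourself is to simulate it. The sketch below implements the Blackwell-MacQueen urn with a standard normal standing in for the continuum of colours (an arbitrary choice): the probability of drawing the black ball, $\alpha/(n+\alpha)$, matches the CRP's probability of opening a new table, and copying an existing ball's colour matches joining an occupied table.

```python
import numpy as np

def blackwell_macqueen(n_draws, alpha, rng):
    """Draw colours from the urn; H is a standard normal, standing in for a continuum of colours."""
    colours = []
    for n in range(n_draws):
        if rng.random() < alpha / (n + alpha):   # drew the black ball
            colours.append(rng.normal())         # new colour sampled from H
        else:                                    # drew an existing coloured ball
            colours.append(colours[rng.integers(n)])
    return colours

rng = np.random.default_rng(4)
draws = blackwell_macqueen(50, alpha=1.0, rng=rng)
print(len(set(draws)), "distinct colours among", len(draws), "draws")
```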
Anderson, J. (1991). The adaptive nature of human categorization. Psychological Review, 98:409-429. (Anderson appears to have independently defined a process closely resembling the CRP in this paper.)
Gershman, S. and Blei, D. (2011). A tutorial on Bayesian nonparametric models ...
Hjort, N., Holmes, C., Muller, P., and Walker, S., editors. (2010) Bayesian Nonparametrics. Number 28 in Cambridge series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Xing, E. “Bayesian Nonparametrics: Dirichlet Processes” Probabilistic Graphical Models, 10-708, Spring 2014, lecture.