---
title: Formal Languages
---
The goal of modern generative linguistics is to achieve a precise computational understanding of how language works. How do speakers turn the meanings they wish to communicate into utterances that can be spoken, written, or signed, and how do listeners map these incoming signals to the meanings that they understand? Clearly, this is a big question whose answer will involve many components, from understanding the basic perceptual systems of the human mind to understanding what meaning is in the first place. As with all complex scientific problems, we need to make some simplifying assumptions in order to make progress.
Let's start by making a few remarks about the part of the problem we will deal with in this course. Our first simplifying assumption is that we will focus on just the problem of explaining the structure of sentences. Consider the following English sentence.
John loves Mary.
What can we say about a sentence like this? First, it is obviously built from a set of basic units like words and morphemes. Second, sentences are compositional: the meaning of the whole sentence is the result of the meaning of the individual words used in the sentence as well as the way they are combined. For example,
Mary loves John.
means something quite different from the first sentence. As English speakers, we know something which tells us that the ordering of the words affects their combined meaning in systematic and predictable ways. Furthermore, many combinations of words are not valid in English.
$$^*$$loves Mary John.
Note that here we are using the standard linguistics convention of marking an ungrammatical sequence with the symbol $$^*$$. Grammaticality, moreover, is not the same thing as meaningfulness, as the following famous pair of examples from Chomsky illustrates.
Colorless green ideas sleep furiously.
$$^*$$Furiously green sleep ideas colorless.
In the preceding examples, the first sentence is well-formed in English, while the second is not. What is striking is that the first sentence is well-formed even though it doesn't mean anything that makes sense. Chomsky used this example to illustrate the point that whatever principles tell us what a possible English sentence is, they must be at least partially independent of whether or not the sequence has a meaning. Another famous example comes from Lewis Carroll's poem Jabberwocky, which begins:
Twas bryllyg, and the slythy toves
Did gyre and gymble in the wabe:
All mimsy were the borogoves;
And the mome raths outgrabe.
These examples suggest that we might make a start on our study of sentence structure by asking which sequences of words are possible, grammatical English sentences and which are not. This question of grammaticality is one of the central problems of natural language syntax: the scientific study of the grammar of sentences.
Before moving on, it is worth making a few comments on the study of grammatical and ungrammatical word sequences, since it often leads to confusion. Our focus on the problem of grammatical and ungrammatical sequences is an idealization. The examples above suggest that we may be able to say something useful about sentence structure without considering meaning. This isn't to suggest that meaning is unimportant: everyone agrees that the ultimate problem we wish to solve is to understand how to map from form to meaning and back again. Considerations of meaning are central to the study of syntax, and we will bring them into play very shortly in this course. We start, however, with this simplified problem in the hope that it will teach us some lessons which will continue to be important as we develop more sophisticated models and consider more sophisticated kinds of data.
To begin our idealization, let's assume (incorrectly) that we have a fixed set of words in our language. We will write this set of words $$V$$, for vocabulary.
To begin studying possible and impossible sentences, we will need some notion of a sequence of words. In formal language theory, such a sequence is called a string and is any finite sequence of symbols drawn from $$V$$.
The length of a string $$s$$, written $$|s|$$, is the number of symbols it contains.
There is a special, unique string called the null string, written $$\epsilon$$, that has length $$0$$: it contains no symbols at all.
Let $$V^*$$ denote the set of all strings over the vocabulary $$V$$, including the null string $$\epsilon$$.
When talking about sets of strings, we will take the empty set $$\emptyset$$, which contains no strings at all, to be different from $$\{\epsilon\}$$, the set containing just the null string.
The basic operation that we can perform on strings is concatenation. If we have two string variables $$u$$ and $$v$$, we write their concatenation as $$u \cdot v$$, or simply $$uv$$: the string consisting of the symbols of $$u$$ followed by the symbols of $$v$$.
Concatenation is associative, which means it doesn't matter how you group the operations, i.e., $$(u \cdot v) \cdot w = u \cdot (v \cdot w)$$. Note that concatenation is not commutative: in general, $$uv$$ and $$vu$$ are different strings.
We write $$u^n$$ for the string consisting of $$n$$ copies of $$u$$ concatenated together, with $$u^0 = \epsilon$$.
A prefix of a string $$s$$ is any string $$p$$ such that $$s = p \cdot w$$ for some (possibly null) string $$w$$; in other words, $$p$$ consists of the first $$|p|$$ symbols of $$s$$.
We have now defined a simple idealization of a sentence as a string of words (or symbols). How can we represent this in LISP? In this course, we will use symbols to represent the atomic symbols in $$V$$, and lists of symbols to represent strings. For example, the following defines the string $$abab$$.
(def my-string '(a b a b))
my-string
We have already seen how we can build up lists by using "cons" together with the null string "'()". When using "cons" to build a string of symbols, we pass in two arguments: a symbol and another string.
(cons 'a my-string)
Can we use "cons" to define concatenation? What happens if we "cons" two strings together?
(cons '(a b) '(a b))
We have ended up with a pair consisting of two strings, rather than a single string containing the elements of both. This is not the desired behavior for an implementation of concatenation. Clojure has a primitive operation called "concat" which takes two strings and produces the result of putting them together into a single string; we will use it as our implementation of concatenation.
(concat '(a b) '(a b))
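As a quick sanity check, we can use "concat" to illustrate the associativity of concatenation on a small example (an illustration, not a proof):

(= (concat (concat '(a b) '(a)) '(b))
   (concat '(a b) (concat '(a) '(b))))

Both groupings produce the string "(a b a b)", so this expression evaluates to true.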
We can also define a predicate "prefix?" which tests whether one string is a prefix of another.
(defn prefix? [pr str]
  ;; A longer string can never be a prefix of a shorter one.
  (if (> (count pr) (count str))
    false
    ;; The null string is a prefix of every string.
    (if (empty? pr)
      true
      ;; Otherwise, the first symbols must match, and the rest of pr
      ;; must be a prefix of the rest of str.
      (if (= (first pr) (first str))
        (prefix? (rest pr) (rest str))
        false))))
(prefix? '(a b c) '(a b))
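This call returns false, since the first string is longer than the second and so cannot be a prefix of it. Swapping the arguments gives true:

(prefix? '(a b) '(a b c))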
So far we have defined the formal notion of a string as a model of word sequences in natural language. Our initial goal, however, was to be able to distinguish between grammatical strings like colorless green ideas sleep furiously and ungrammatical strings like $$^*$$furiously green sleep ideas colorless.
One simple way to distinguish between well-formed strings and ill-formed strings is simply to define the well-formed strings as constituting a set, called a formal language: a formal language $$L$$ is any subset of $$V^*$$. Since formal languages are sets, we can combine them using the usual set operations such as union and intersection.
A perhaps more unusual operation is the concatenation of two formal languages: $$L_1 \cdot L_2 = \{u \cdot v \mid u \in L_1, v \in L_2\}$$, the set of all strings formed by concatenating a string from the first language with a string from the second.
We write $$L^n$$ for the language that results from concatenating $$L$$ with itself $$n$$ times (with $$L^0 = \{\epsilon\}$$), and $$L^* = \bigcup_{n \geq 0} L^n$$ for the Kleene star of $$L$$: the set of strings formed by concatenating any number of strings from $$L$$.
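For finite languages, we can sketch language concatenation directly in Clojure. This is a minimal illustration of the definition above, representing languages as Clojure sets of strings; the name "lang-concat" is our own and not a standard library function.

(defn lang-concat [l1 l2]
  ;; Concatenate every string u in l1 with every string v in l2.
  (set (for [u l1, v l2] (concat u v))))

(lang-concat #{'(a) '(a b)} #{'(b)})
;; => #{(a b) (a b b)} (the elements may print in any order)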
Some important formal languages we will encounter in this course include the set of all alternating $$a$$s and $$b$$s, $$\{ab\}^* = \{\epsilon, ab, abab, ababab, \dots\}$$.
We have now defined the notion of a formal language. How can we use it to characterize the idea of well-formed and ill-formed sentences? One easy way is to say that all the well-formed sentences are members of some formal language $$L$$, while all the ill-formed sentences fall outside of it.
This is a good start, but how do we say which sentences are within $$L$$ and which are not?
Let us consider a very restricted subset of the English language. We will start with a single sentence:
Alice talked to the father of Bob.
This is clearly a grammatical sentence in the English language, and we can use it as the basis for a sequence of longer and longer sentences:
Alice talked to the father of the mother of Bob.
Alice talked to the father of the mother of the father of Bob.
Alice talked to the father of the father of the mother of Bob.
Alice talked to the father of the mother of the father of the mother of Bob.
Alice talked to the father of the father of the mother of the mother of Bob.
These sentences were constructed by adding the prepositional phrases "of the mother" and "of the father" to the base sentence. It is clear that each of these new sentences is grammatical, and there is no reason to stop after only four iterations of this process: in principle, we can insert these prepositional phrases as many times as we want, and the resulting sentence will be grammatical.
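To make this iteration concrete, here is a small Clojure sketch that builds the sentence containing $$n$$ extra prepositional phrases. The function name "build-sentence" is our own, and for simplicity it always inserts "of the mother".

(defn build-sentence [n]
  ;; Insert n copies of "of the mother" into the base sentence.
  (concat '(Alice talked to the father)
          (apply concat (repeat n '(of the mother)))
          '(of Bob)))

(build-sentence 2)
;; => (Alice talked to the father of the mother of the mother of Bob)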
In order to characterize this subset of English, we can try to simplify the strings we are talking about. Let's define a function from strings of English words to strings over the vocabulary $$\{a, b\}$$ (i.e., a function from $$E^*$$ to $$\{a, b\}^*$$, where $$E$$ is the set of English words) called a systematic relabelling or homomorphism. A homomorphism $$h$$ is a function that respects concatenation: $$h(u \cdot v) = h(u) \cdot h(v)$$.
It is not hard to show (though we won't) that any function from English strings to $$\{a, b\}^*$$ that has this property can be defined by simply saying which string in $$\{a, b\}^*$$ each word of English is mapped to.
Let's define our relabelling $$h$$ so that $$h(\text{father}) = a$$, $$h(\text{mother}) = b$$, and $$h(w) = \epsilon$$ for every other English word $$w$$.
The first sentence, containing only one instance of father, will then be transformed to the string $$a$$. The longer sentences above are transformed to $$ab$$, $$aba$$, $$aab$$, $$abab$$, and $$aabb$$, and by continuing to insert prepositional phrases we can obtain longer and longer distinct strings. Since each English string maps to a single relabelled string, a finite set of English sentences could only ever produce a finite set of relabelled strings.
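We can implement this relabelling as a short Clojure sketch; the names "relabel" and "h" are ours.

(defn relabel [word]
  ;; Map father to a, mother to b, and every other word
  ;; to the null string.
  (cond
    (= word 'father) '(a)
    (= word 'mother) '(b)
    :else '()))

(defn h [sentence]
  ;; Relabel each word and concatenate the results.
  (apply concat (map relabel sentence)))

(h '(Alice talked to the father of the mother of Bob))
;; => (a b)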
We must conclude that a formal language model of at least some natural languages, notably English, must be infinite. In formal language theory we often work with infinite languages like this, and it is worth noting a few things. First, we are not arguing that every natural language must be infinite; we may find one that is finite. This is an empirical question. Second, our use of infinite languages is an idealization. Of course, no sample of English that we collect in the wild will be infinite; we will always observe just a finite subset of the language. The point is that there is no particular bound on the length of sentences: for any grammatical sentence, there is a longer grammatical sentence.
We saw in the preceding section that we cannot use finite formal languages as models of natural language sentences. At the very least, we will need infinite formal languages to model some natural languages like English. But how do we define an infinite language?
It is here that our notion of a model of computation becomes indispensable. While we cannot directly list the elements of an infinite formal language, we can write a computer program that characterizes the set in some way.
Take the example of the formal language $$\{ab\}^*$$, the set of strings consisting of zero or more repetitions of $$ab$$. There are at least two ways we could characterize this language with a program.
First, we could write a computer program which tells us whether or not a given string is in this language. This approach is called language recognition. How do we write a recognizer for the language $$\{ab\}^*$$?
(defn lang-ab*? [str]
  ;; The null string is in the language (ab repeated zero times).
  (if (empty? str)
    true
    ;; Otherwise, the string must begin with ab, and the remainder
    ;; (after stripping off a and b) must itself be in the language.
    (if (prefix? '(a b) str)
      (lang-ab*? (rest (rest str)))
      false)))
There is a lot going on in this example, so let's break it down. The function is recursive. The most basic case is when the string is empty, that is, when it is equal to the null string $$\epsilon$$, represented in Clojure by "'()". The null string is in $$\{ab\}^*$$ (it consists of zero repetitions of $$ab$$), so the function returns true.
(lang-ab*? '())
Next consider the case where the string isn't empty. Now, we need to check whether the beginning of the string is equal to $$ab$$. If it is, we strip off this prefix and recursively check the rest of the string; if it isn't, the string cannot be in the language, and the function returns false.
(lang-ab*? '(b b a b))
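This string begins with $$b$$ rather than $$ab$$, so the recognizer returns false. A string built from repetitions of $$ab$$ is accepted:

(lang-ab*? '(a b a b))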
Another way to specify a formal language using a computer program is by generation—providing a program which constructs or generates the elements of the set.
Consider again our formal language $$\{ab\}^*$$. We can write a function which, given a natural number $$n$$, generates the string consisting of $$n$$ repetitions of $$ab$$.
(defn generate-abn [n]
  ;; Zero repetitions of ab is the null string.
  (if (= n 0)
    '()
    ;; Otherwise, prepend one copy of ab to a string
    ;; containing n - 1 copies.
    (concat '(a b) (generate-abn (- n 1)))))
(generate-abn 10)
Of course, this procedure doesn't generate the whole set $$\{ab\}^*$$ in a single call. However, in a certain sense it does precisely characterize the entire set. In particular, this function provides a mapping from an infinite set which we understand well, the natural (or counting) numbers, to the set of strings in $$\{ab\}^*$$: for each natural number $$n$$, it returns the string consisting of $$n$$ repetitions of $$ab$$, and every string in the language is returned for exactly one choice of $$n$$.
There are at least two ways in which we can use such a mapping.
First, since we know how to count upwards through the natural numbers $$0, 1, 2, \dots$$, we can use the mapping to enumerate the strings of the language one after another: $$\epsilon$$, $$ab$$, $$abab$$, and so on. Any particular string in the language will eventually be produced by this enumeration.
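For instance, mapping "generate-abn" over the first few natural numbers produces the first few strings of the language:

(map generate-abn (range 5))
;; => (() (a b) (a b a b) (a b a b a b) (a b a b a b a b))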
Second, we are free to imagine that there is another, unknown, process which somehow chooses the natural number $$n$$ nondeterministically and then hands it to our generator. Under this perspective, the language is the set of all strings that could possibly be generated.
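One way to picture this nondeterministic choice in Clojure is to let the machine pick $$n$$ at random; the use of "rand-int" and the bound of 10 here are just for illustration.

(generate-abn (rand-int 10))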
In some sense, the nondeterministic choice perspective is a more fundamental characterization of the set. When we think about the language $$\{ab\}^*$$, we usually don't care about the particular order in which an enumeration produces its strings; what matters is which strings can be generated at all.
Recognizers and generators are both ways of characterizing sets intensionally rather than extensionally, that is, rather than by listing their elements. They differ in one important way: a recognizer gives you a yes or no answer for any string (at least when the language is decidable), while a generator only directly characterizes the strings that are actually in the language.
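For a simple language like $$\{ab\}^*$$, we can even turn the generator into a recognizer by searching through the enumeration. Since "generate-abn" produces a string of length $$2n$$, only finitely many values of $$n$$ need to be checked; the name "generates?" is our own.

(defn generates? [s]
  ;; Check whether some n between 0 and (count s) generates exactly s.
  (boolean (some #(= s (generate-abn %)) (range (inc (count s))))))

(generates? '(a b a b))
;; => true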
Modern linguistics is often called generative linguistics. This name stems from the use of generators as models of linguistic structure. As we will see, the field actually uses both the generator and recognizer perspectives to define models of language. This dual perspective also corresponds to the dual problem of explaining both language production (generation) and comprehension (recognition). The important thing is that we will try to define finite programs that characterize the set of possible English sentences. In the next lecture we will start building such models of natural language sentence structure.