LET'S BUILD A COMPILER!
By
Jack W. Crenshaw, Ph.D.
7 November 1988
Part VII: LEXICAL SCANNING
*****************************************************************
* *
* COPYRIGHT NOTICE *
* *
* Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. *
* *
*****************************************************************
INTRODUCTION
In the last installment, I left you with a compiler that would
ALMOST work, except that we were still limited to single-
character tokens. The purpose of this session is to get rid of
that restriction, once and for all. This means that we must deal
with the concept of the lexical scanner.
Maybe I should mention why we need a lexical scanner at all ...
after all, we've been able to manage all right without one, up
till now, even when we provided for multi-character tokens.
The ONLY reason, really, has to do with keywords. It's a fact of
computer life that the syntax for a keyword has the same form as
that for any other identifier. We can't tell until we get the
complete word whether or not it IS a keyword. For example, the
variable IFILE and the keyword IF look just alike, until you get
to the third character. In the examples to date, we were always
able to make a decision based upon the first character of the
token, but that's no longer possible when keywords are present.
We need to know that a given string is a keyword BEFORE we begin
to process it. And that's why we need a scanner.
In the last session, I also promised that we would be able to
provide for normal tokens without making wholesale changes to
what we have already done. I didn't lie ... we can, as you will
see later. But every time I set out to install these elements of
the software into the parser we have already built, I had bad
feelings about it. The whole thing felt entirely too much like a
band-aid. I finally figured out what was causing the problem: I
was installing lexical scanning software without first explaining
to you what scanning is all about, and what the alternatives are.
Up till now, I have studiously avoided giving you a lot of
theory, and certainly not alternatives. I generally don't
respond well to the textbooks that give you twenty-five different
ways to do something, but no clue as to which way best fits your
needs. I've tried to avoid that pitfall by just showing you ONE
method, that WORKS.
But this is an important area. While the lexical scanner is
hardly the most exciting part of a compiler, it often has the
most profound effect on the general "look & feel" of the
language, since after all it's the part closest to the user. I
have a particular structure in mind for the scanner to be used
with KISS. It fits the look & feel that I want for that
language. But it may not work at all for the language YOU'RE
cooking up, so in this one case I feel that it's important for
you to know your options.
So I'm going to depart, again, from my usual format. In this
session we'll be getting much deeper than usual into the basic
theory of languages and grammars. I'll also be talking about
areas OTHER than compilers in which lexical scanning plays an
important role. Finally, I will show you some alternatives for
the structure of the lexical scanner. Then, and only then, will
we get back to our parser from the last installment. Bear with
me ... I think you'll find it's worth the wait. In fact, since
scanners have many applications outside of compilers, you may
well find this to be the most useful session for you.
LEXICAL SCANNING
Lexical scanning is the process of scanning the stream of input
characters and separating it into strings called tokens. Most
compiler texts start here, and devote several chapters to
discussing various ways to build scanners. This approach has its
place, but as you have already seen, there is a lot you can do
without ever even addressing the issue, and in fact the scanner
we'll end up with here won't look much like what the texts
describe. The reason? Compiler theory and, consequently, the
programs resulting from it, must deal with the most general kind
of parsing rules. We don't. In the real world, it is possible
to specify the language syntax in such a way that a pretty simple
scanner will suffice. And as always, KISS is our motto.
Typically, lexical scanning is done in a separate part of the
compiler, so that the parser per se sees only a stream of input
tokens. Now, theoretically it is not necessary to separate this
function from the rest of the parser. There is only one set of
syntax equations that define the whole language, so in theory we
could write the whole parser in one module.
Why the separation? The answer has both practical and
theoretical bases.
In 1956, Noam Chomsky defined the "Chomsky Hierarchy" of
grammars. They are:
o Type 0: Unrestricted (e.g., English)
o Type 1: Context-Sensitive
o Type 2: Context-Free
o Type 3: Regular
A few features of the typical programming language (particularly
the older ones, such as FORTRAN) are Type 1, but for the most
part all modern languages can be described using only the last
two types, and those are all we'll be dealing with here.
The neat part about these two types is that there are very
specific ways to parse them. It has been shown that any regular
grammar can be parsed using a particular form of abstract machine
called the state machine (finite automaton). We have already
implemented state machines in some of our recognizers.
Similarly, Type 2 (context-free) grammars can always be parsed
using a push-down automaton (a state machine augmented by a
stack). We have also implemented these machines. Instead of
implementing a literal stack, we have relied on the built-in
stack associated with recursive coding to do the job, and that in
fact is the preferred approach for top-down parsing.
Now, it happens that in real, practical grammars, the parts that
qualify as regular expressions tend to be the lower-level parts,
such as the definition of an identifier:
<ident> ::= <letter> [ <letter> | <digit> ]*
Since it takes a different kind of abstract machine to parse the
two types of grammars, it makes sense to separate these lower-
level functions into a separate module, the lexical scanner,
which is built around the idea of a state machine. The idea is to
use the simplest parsing technique needed for the job.
There is another, more practical reason for separating scanner
from parser. We like to think of the input source file as a
stream of characters, which we process left to right without
backtracking. In practice that isn't possible. Almost every
language has certain keywords such as IF, WHILE, and END. As I
mentioned earlier, we can't really know whether a given
character string is a keyword, until we've reached the end of it,
as defined by a space or other delimiter. So in that sense, we
MUST save the string long enough to find out whether we have a
keyword or not. That's a limited form of backtracking.
So the structure of a conventional compiler involves splitting up
the functions of the lower-level and higher-level parsing. The
lexical scanner deals with things at the character level,
collecting characters into strings, etc., and passing them along
to the parser proper as indivisible tokens. It's also considered
normal to let the scanner have the job of identifying keywords.
STATE MACHINES AND ALTERNATIVES
I mentioned that the regular expressions can be parsed using a
state machine. In most compiler texts, and indeed in most
compilers as well, you will find this taken literally. There is
typically a real implementation of the state machine, with
integers used to define the current state, and a table of actions
to take for each combination of current state and input
character. If you write a compiler front end using the popular
Unix tools LEX and YACC, that's what you'll get. The output of
LEX is a state machine implemented in C, plus a table of actions
corresponding to the input grammar given to LEX. The YACC output
is similar ... a canned table-driven parser, plus the table
corresponding to the language syntax.
That is not the only choice, though. In our previous
installments, you have seen over and over that it is possible to
implement parsers without dealing specifically with tables,
stacks, or state variables. In fact, in Installment V I warned
you that if you find yourself needing these things you might be
doing something wrong, and not taking advantage of the power of
Pascal. There are basically two ways to define a state machine's
state: explicitly, with a state number or code, and implicitly,
simply by virtue of the fact that I'm at a certain place in the
code (if it's Tuesday, this must be Belgium). We've relied
heavily on the implicit approaches before, and I think you'll
find that they work well here, too.
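To make the contrast concrete, here is a sketch of what an
EXPLICIT version of a state machine might look like. The routine
ScanIdent below is my own illustration, not part of the original
listings; it recognizes an identifier by keeping the current state
in an integer variable, just as a table-driven scanner would:

{--------------------------------------------------------------}
{ Explicit State Machine for an Identifier (illustration only) }

procedure ScanIdent;
var State: integer;
begin
   State := 0;                { 0 = start, 1 = in name, 2 = done }
   while State <> 2 do
      case State of
        0: if IsAlpha(Look) then begin
              GetChar;        { first character must be a letter }
              State := 1;
           end
           else
              Expected('Name');
        1: if IsAlpha(Look) or IsDigit(Look) then
              GetChar         { letters and digits keep us in state 1 }
           else
              State := 2;     { anything else: accept and stop }
      end;
end;
{--------------------------------------------------------------}

In the implicit version we'll write shortly (GetName), the variable
State disappears entirely: being ahead of the WHILE loop IS state 0,
and being inside the loop IS state 1.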
In practice, it may not even be necessary to HAVE a well-defined
lexical scanner. This isn't our first experience at dealing with
multi-character tokens. In Installment III, we extended our
parser to provide for them, and we didn't even NEED a lexical
scanner. That was because in that narrow context, we could
always tell, just by looking at the single lookahead character,
whether we were dealing with a number, a variable, or an
operator. In effect, we built a distributed lexical scanner,
using procedures GetName and GetNum.
With keywords present, we can't know anymore what we're dealing
with, until the entire token is read. This leads us to a more
localized scanner; although, as you will see, the idea of a
distributed scanner still has its merits.
SOME EXPERIMENTS IN SCANNING
Before getting back to our compiler, it will be useful to
experiment a bit with the general concepts.
Let's begin with the two definitions most often seen in real
programming languages:
<ident> ::= <letter> [ <letter> | <digit> ]*
<number> ::= [<digit>]+
(Remember, the '*' indicates zero or more occurrences of the terms
in brackets, and the '+', one or more.)
We have already dealt with similar items in Installment III.
Let's begin (as usual) with a bare cradle. Not surprisingly, we
are going to need a new recognizer:
{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }
function IsAlNum(c: char): boolean;
begin
IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
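IsAlNum simply combines the cradle routines IsAlpha and IsDigit.
In case your copy of the cradle has drifted, they should read:

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;
begin
   IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;
begin
   IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}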
Using this let's write the following two routines, which are very
similar to those we've used before:
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: string;
var x: string[8];
begin
x := '';
if not IsAlpha(Look) then Expected('Name');
while IsAlNum(Look) do begin
x := x + UpCase(Look);
GetChar;
end;
GetName := x;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: string;
var x: string[16];
begin
x := '';
if not IsDigit(Look) then Expected('Integer');
while IsDigit(Look) do begin
x := x + Look;
GetChar;
end;
GetNum := x;
end;
{--------------------------------------------------------------}
(Notice that this version of GetNum returns a string, not an
integer as before.)
You can easily verify that these routines work by calling them
from the main program, as in
WriteLn(GetName);
This program will print any legal name typed in (maximum eight
characters, since that's what we told GetName). It will reject
anything else.
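If you'd rather have a complete throwaway test program, a sketch
like this one (using Init from the cradle) will do:

{--------------------------------------------------------------}
{ Main Program }

begin
   Init;
   WriteLn(GetName);
end.
{--------------------------------------------------------------}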
Test the other routine similarly.
WHITE SPACE
We also have dealt with embedded white space before, using the
two routines IsWhite and SkipWhite. Make sure that these
routines are in your current version of the cradle, and add the
line
SkipWhite;
at the end of both GetName and GetNum.
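If they're missing, the versions we've been using (with TAB
declared in the cradle as the constant ^I) look like this:

{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;
begin
   IsWhite := c in [' ', TAB];
end;

{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
   while IsWhite(Look) do
      GetChar;
end;
{--------------------------------------------------------------}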
Now, let's define the new procedure:
{--------------------------------------------------------------}
{ Lexical Scanner }
Function Scan: string;
begin
if IsAlpha(Look) then
Scan := GetName
else if IsDigit(Look) then
Scan := GetNum
else begin
Scan := Look;
GetChar;
end;
SkipWhite;
end;
{--------------------------------------------------------------}
We can call this from the new main program:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
repeat
Token := Scan;
writeln(Token);
until Token = CR;
end.
{--------------------------------------------------------------}
(You will have to add the declaration of the string Token at the
beginning of the program. Make it any convenient length, say 16
characters.)
Now, run the program. Note how the input string is, indeed,
separated into distinct tokens.
STATE MACHINES
For the record, a parse routine like GetName does indeed
implement a state machine. The state is implicit in the current
position in the code. A very useful trick for visualizing what's
going on is the syntax diagram, or "railroad-track" diagram.
It's a little difficult to draw one in this medium, so I'll use
them very sparingly, but the figure below should give you the
idea:
              |-----> Other ---------------------------> Error
              |
 Start -------+-----> Letter ---------------> Other -----> Finish
                         ^                    V
                         |                    |
                         |<----- Letter <-----|
                         |                    |
                         |<----- Digit <-------
As you can see, this diagram shows how the logic flows as
characters are read. Things begin, of course, in the start
state, and end when a character other than an alphanumeric is
found. If the first character is not alpha, an error occurs.
Otherwise the machine will continue looping until the terminating
delimiter is found.
Note that at any point in the flow, our position is entirely
dependent on the past history of the input characters. At that
point, the action to be taken depends only on the current state,
plus the current input character. That's what makes this a state
machine.
Because of the difficulty of drawing railroad-track diagrams in
this medium, I'll continue to stick to syntax equations from now
on. But I highly recommend the diagrams to you for anything you
do that involves parsing. After a little practice you can begin
to see how to write a parser directly from the diagrams.
Parallel paths get coded into guarded actions (guarded by IF's or
CASE statements), serial paths into sequential calls. It's
almost like working from a schematic.
We didn't even discuss SkipWhite, which was introduced earlier,
but it also is a simple state machine, as is GetNum. So is their
parent procedure, Scan. Little machines make big machines.
The neat thing that I'd like you to note is how painlessly this
implicit approach creates these state machines. I personally
prefer it a lot over the table-driven approach. It also results
in a small, tight, and fast scanner.
NEWLINES
Moving right along, let's modify our scanner to handle more than
one line. As I mentioned last time, the most straightforward way
to do this is to simply treat the newline characters, carriage
return and line feed, as white space. This is, in fact, the way
the C standard library routine, isspace, works. We didn't
actually try this before. I'd like to do it now, so you can get
a feel for the results.
To do this, simply modify the single executable line of IsWhite
to read:
IsWhite := c in [' ', TAB, CR, LF];
We need to give the main program a new stop condition, since it
will never see a CR. Let's just use:
until Token = '.';
OK, compile this program and run it. Try a couple of lines,
terminated by the period. I used:
now is the time
for all good men.
Hey, what happened? When I tried it, I didn't get the last
token, the period. The program didn't halt. What's more, when I
pressed the 'enter' key a few times, I still didn't get the
period.
If you're still stuck in your program, you'll find that typing a
period on a new line will terminate it.
What's going on here? The answer is that we're hanging up in
SkipWhite. A quick look at that routine will show that as long
as we're typing null lines, we're going to just continue to loop.
After SkipWhite encounters an LF, it tries to execute a GetChar.
But since the input buffer is now empty, GetChar's read statement
insists on having another line. Procedure Scan gets the
terminating period, all right, but it calls SkipWhite to clean
up, and SkipWhite won't return until it gets a non-null line.
This kind of behavior is not quite as bad as it seems. In a real
compiler, we'd be reading from an input file instead of the
console, and as long as we have some procedure for dealing with
end-of-files, everything will come out OK. But for reading data
from the console, the behavior is just too bizarre. The fact of
the matter is that the C/Unix convention is just not compatible
with the structure of our parser, which calls for a lookahead
character. The code that the Bell wizards have implemented
doesn't use that convention, which is why they need 'ungetc'.
OK, let's fix the problem. To do that, we need to go back to the
old definition of IsWhite (delete the CR and LF characters) and
make use of the procedure Fin that I introduced last time. If
it's not in your current version of the cradle, put it there now.
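As a reminder, Fin just skips over a line ending (CR and LF being
the usual control-character constants, ^M and ^J):

{--------------------------------------------------------------}
{ Skip a CRLF }

procedure Fin;
begin
   if Look = CR then GetChar;
   if Look = LF then GetChar;
end;
{--------------------------------------------------------------}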
Also, modify the main program to read:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
repeat
Token := Scan;
writeln(Token);
if Token = CR then Fin;
until Token = '.';
end.
{--------------------------------------------------------------}
Note the "guard" test preceding the call to Fin. That's what
makes the whole thing work, and ensures that we don't try to read
a line ahead.
Try the code now. I think you'll like it better.
If you refer to the code we did in the last installment, you'll
find that I quietly sprinkled calls to Fin throughout the code,
wherever a line break was appropriate. This is one of those
areas that really affects the look & feel that I mentioned. At
this point I would urge you to experiment with different
arrangements and see how you like them. If you want your
language to be truly free-field, then newlines should be
transparent. In this case, the best approach is to put the
following lines at the BEGINNING of Scan:
while Look = CR do
Fin;
If, on the other hand, you want a line-oriented language like
Assembler, BASIC, or FORTRAN (or even Ada... note that it has
comments terminated by newlines), then you'll need Scan to
return CR's as tokens. It must also eat the trailing LF. The
best way to do that is to use this line, again at the beginning
of Scan:
if Look = LF then Fin;
For other conventions, you'll have to use other arrangements.
In my example of the last session, I allowed newlines only at
specific places, so I was somewhere in the middle ground. In the
rest of these sessions, I'll be picking ways to handle newlines
that I happen to like, but I want you to know how to choose other
ways for yourselves.
OPERATORS
We could stop now and have a pretty useful scanner for our
purposes. In the fragments of KISS that we've built so far, the
only tokens that have multiple characters are the identifiers and
numbers. All operators were single characters. The only
exception I can think of is the relops <=, >=, and <>, but they
could be dealt with as special cases.
Still, other languages have multi-character operators, such as
the ':=' of Pascal or the '++' and '>>' of C. So while we may
not need multi-character operators, it's nice to know how to get
them if necessary.
Needless to say, we can handle operators very much the same way
as the other tokens. Let's start with a recognizer:
{--------------------------------------------------------------}
{ Recognize Any Operator }
function IsOp(c: char): boolean;
begin
IsOp := c in ['+', '-', '*', '/', '<', '>', ':', '='];
end;
{--------------------------------------------------------------}
It's important to note that we DON'T have to include every
possible operator in this list. For example, the parentheses
aren't included, nor is the terminating period. The current
version of Scan handles single-character operators just fine as
it is. The list above includes only those characters that can
appear in multi-character operators. (For specific languages, of
course, the list can always be edited.)
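The new version of Scan, below, calls a routine GetOp. If it
isn't in your copy yet, a version patterned directly after GetName
and GetNum will serve:

{--------------------------------------------------------------}
{ Get an Operator }

function GetOp: string;
var x: string[8];
begin
   x := '';
   if not IsOp(Look) then Expected('Operator');
   while IsOp(Look) do begin
      x := x + Look;
      GetChar;
   end;
   GetOp := x;
end;
{--------------------------------------------------------------}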
Now, let's modify Scan to read:
{--------------------------------------------------------------}
{ Lexical Scanner }
Function Scan: string;
begin
while Look = CR do
Fin;
if IsAlpha(Look) then
Scan := GetName
else if IsDigit(Look) then
Scan := GetNum
else if IsOp(Look) then
Scan := GetOp
else begin
Scan := Look;
GetChar;
end;
SkipWhite;
end;
{--------------------------------------------------------------}
Try the program now. You will find that any code fragments you
care to throw at it will be neatly broken up into individual
tokens.
LISTS, COMMAS AND COMMAND LINES
Before getting back to the main thrust of our study, I'd like to
get on my soapbox for a moment.
How many times have you worked with a program or operating system
that had rigid rules about how you must separate items in a list?
(Think back to the last time you used MSDOS!) Some programs require
spaces as delimiters, and some require commas. Worst of all,
some require both, in different places. Most are pretty
unforgiving about violations of their rules.
I think this is inexcusable. It's too easy to write a parser
that will handle both spaces and commas in a flexible way.
Consider the following procedure:
{--------------------------------------------------------------}
{ Skip Over a Comma }
procedure SkipComma;
begin
SkipWhite;
if Look = ',' then begin
GetChar;
SkipWhite;
end;
end;
{--------------------------------------------------------------}
This eight-line procedure will skip over a delimiter consisting
of any number (including zero) of spaces, with zero or one comma
embedded in the string.
TEMPORARILY, change the call to SkipWhite in Scan to a call to
SkipComma, and try inputting some lists. Works nicely, eh?
Don't you wish more software authors knew about SkipComma?
For the record, I found that adding the equivalent of SkipComma
to my Z80 assembler-language programs took all of 6 (six) extra
bytes of code. Even in a 64K machine, that's not a very high
price to pay for user-friendliness!
I think you can see where I'm going here. Even if you never
write a line of a compiler code in your life, there are places in
every program where you can use the concepts of parsing. Any
program that processes a command line needs them. In fact, if
you think about it for a bit, you'll have to conclude that any
time you write a program that processes user inputs, you're
defining a language. People communicate with languages, and the
syntax implicit in your program defines that language. The real
question is: are you going to define it deliberately and
explicitly, or just let it turn out to be whatever the program
ends up parsing?
I claim that you'll have a better, more user-friendly program if
you'll take the time to define the syntax explicitly. Write down
the syntax equations or draw the railroad-track diagrams, and
code the parser using the techniques I've shown you here. You'll
end up with a better program, and it will be easier to write, to
boot.
GETTING FANCY
OK, at this point we have a pretty nice lexical scanner that will
break an input stream up into tokens. We could use it as it
stands and have a serviceable compiler. But there are some other
aspects of lexical scanning that we need to cover.
The main consideration is <shudder> efficiency. Remember that
when we were dealing with single-character tokens, every test was a
comparison of a single character, Look, with a byte constant. We
also used the Case statement heavily.
With the multi-character tokens being returned by Scan, all those
tests now become string comparisons. Much slower. And not only
slower, but more awkward, since there is no string equivalent of
the Case statement in Pascal. It seems especially wasteful to
test for what used to be single characters ... the '=', '+', and
other operators ... using string comparisons.
Using string comparison is not impossible ... Ron Cain used just
that approach in writing Small C. Since we're sticking to the
KISS principle here, we would be truly justified in settling for
this approach. But then I would have failed to tell you about
one of the key approaches used in "real" compilers.
You have to remember: the lexical scanner is going to be called a
_LOT_! Once for every token in the whole source program, in
fact. Experiments have indicated that the average compiler
spends anywhere from 20% to 40% of its time in the scanner
routines. If there were ever a place where efficiency deserves
real consideration, this is it.
For this reason, most compiler writers ask the lexical scanner to
do a little more work, by "tokenizing" the input stream. The
idea is to match every token against a list of acceptable
keywords and operators, and return unique codes for each one
recognized. In the case of ordinary variable names or numbers,
we just return a code that says what kind of token they are, and
save the actual string somewhere else.
One of the first things we're going to need is a way to identify
keywords. We can always do it with successive IF tests, but it
surely would be nice if we had a general-purpose routine that
could compare a given string with a table of keywords. (By the
way, we're also going to need such a routine later, for dealing
with symbol tables.) This usually presents a problem in Pascal,
because standard Pascal doesn't allow for arrays of variable
lengths. It's a real bother to have to declare a different
search routine for every table. Standard Pascal also doesn't
allow for initializing arrays, so you tend to see code like
Table[1] := 'IF';
Table[2] := 'ELSE';
.
.
Table[n] := 'END';
which can get pretty old if there are many keywords.
Fortunately, Turbo Pascal 4.0 has extensions that eliminate both
of these problems. Constant arrays can be declared using TP's
"typed constant" facility, and the variable dimensions can be
handled with its C-like extensions for pointers.
First, modify your declarations like this:
{--------------------------------------------------------------}
{ Type Declarations }
type Symbol = string[8];
SymTab = array[1..1000] of Symbol;
TabPtr = ^SymTab;
{--------------------------------------------------------------}
(The dimension used in SymTab is not real ... no storage is
allocated by the declaration itself, and the number need only be
"big enough.")
Now, just beneath those declarations, add the following:
{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }
const KWlist: array [1..4] of Symbol =
('IF', 'ELSE', 'ENDIF', 'END');
{--------------------------------------------------------------}
Next, insert the following new function:
{--------------------------------------------------------------}
{ Table Lookup }
{ If the input string matches a table entry, return the entry
index. If not, return a zero. }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
found: boolean;
begin
found := false;
i := n;
while (i > 0) and not found do
if s = T^[i] then
found := true
else
dec(i);
Lookup := i;
end;
{--------------------------------------------------------------}
To test it, you can temporarily change the main program as
follows:
{--------------------------------------------------------------}
{ Main Program }
begin
ReadLn(Token);
WriteLn(Lookup(Addr(KWList), Token, 4));
end.
{--------------------------------------------------------------}
Notice how Lookup is called: The Addr function sets up a pointer
to KWList, which gets passed to Lookup.
OK, give this a try. Since we're bypassing Scan here, you'll
have to type the keywords in upper case to get any matches.
Now that we can recognize keywords, the next thing is to arrange
to return codes for them.
So what kind of code should we return? There are really only two
reasonable choices: an enumerated type, or a single character (more
on the second option under RETURNING A CHARACTER, below). The first
seems like an ideal application for the Pascal enumerated type. For
example, you can define something like
SymType = (IfSym, ElseSym, EndifSym, EndSym, Ident, Number,
Operator);
and arrange to return a variable of this type. Let's give it a
try. Insert the line above into your type definitions.
Now, add the two variable declarations:
Token: Symtype; { Current Token }
Value: String[16]; { String Value of Token }
Modify the scanner to read:
{--------------------------------------------------------------}
{ Lexical Scanner }
procedure Scan;
var k: integer;
begin
while Look = CR do
Fin;
if IsAlpha(Look) then begin
Value := GetName;
k := Lookup(Addr(KWlist), Value, 4);
if k = 0 then
Token := Ident
else
Token := SymType(k - 1);
end
else if IsDigit(Look) then begin
Value := GetNum;
Token := Number;
end
else if IsOp(Look) then begin
Value := GetOp;
Token := Operator;
end
else begin
Value := Look;
Token := Operator;
GetChar;
end;
SkipWhite;
end;
{--------------------------------------------------------------}
(Notice that Scan is now a procedure, not a function.)
Finally, modify the main program to read:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
repeat
Scan;
case Token of
Ident: write('Ident ');
Number: Write('Number ');
Operator: Write('Operator ');
IfSym, ElseSym, EndifSym, EndSym: Write('Keyword ');
end;
Writeln(Value);
until Token = EndSym;
end.
{--------------------------------------------------------------}
What we've done here is to replace the string Token used earlier
with an enumerated type. Scan returns the type in variable Token,
and returns the string itself in the new variable Value.
OK, compile this and give it a whirl. If everything goes right,
you should see that we are now recognizing keywords.
What we have now is working right, and it was easy to generate
from what we had earlier. However, it still seems a little
"busy" to me. We can simplify things a bit by letting GetName,
GetNum, GetOp, and Scan be procedures working with the global
variables Token and Value, thereby eliminating the local copies.
It also seems a little cleaner to move the table lookup into
GetName. The new form for the four procedures is, then:
{--------------------------------------------------------------}
{ Get an Identifier }
procedure GetName;
var k: integer;
begin
Value := '';
if not IsAlpha(Look) then Expected('Name');
while IsAlNum(Look) do begin
Value := Value + UpCase(Look);
GetChar;
end;
k := Lookup(Addr(KWlist), Value, 4);
if k = 0 then
Token := Ident
else
Token := SymType(k-1);
end;
{--------------------------------------------------------------}
{ Get a Number }
procedure GetNum;
begin
Value := '';
if not IsDigit(Look) then Expected('Integer');
while IsDigit(Look) do begin
Value := Value + Look;
GetChar;
end;
Token := Number;
end;
{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
Value := '';
if not IsOp(Look) then Expected('Operator');
while IsOp(Look) do begin
Value := Value + Look;
GetChar;
end;
Token := Operator;
end;
{--------------------------------------------------------------}
{ Lexical Scanner }
procedure Scan;
var k: integer;
begin
while Look = CR do
Fin;
if IsAlpha(Look) then
GetName
else if IsDigit(Look) then
GetNum
else if IsOp(Look) then
GetOp
else begin
Value := Look;
Token := Operator;
GetChar;
end;
SkipWhite;
end;
{--------------------------------------------------------------}
RETURNING A CHARACTER
Essentially every scanner I've ever seen that was written in
Pascal used the mechanism of an enumerated type that I've just
described. It is certainly a workable mechanism, but it doesn't
seem the simplest approach to me.
For one thing, the list of possible symbol types can get pretty