scanner simplified
karlstroetmann committed Nov 2, 2024
1 parent fd8cfef commit d634236
Showing 6 changed files with 108 additions and 194 deletions.
93 changes: 28 additions & 65 deletions Lecture-Notes/context-free-languages.tex
@@ -348,32 +348,6 @@ \subsection{Derivations}
\end{enumerate}

\exerciseEng
We define for $w \in \Sigma^*$ and $c \in \Sigma$ the function
\\[0.2cm]
\hspace*{1.3cm}
$\textsl{count}(w,c)$.
\\[0.2cm]
The function counts how many times the letter $c$ occurs in the word $w$. The definition is done by induction
on the string $w$.
\begin{enumerate}
\item[B.C.:] $w = \lambda$.

We set
\\[0.2cm]
\hspace*{1.3cm}
$\textsl{count}(\lambda, c) := 0$.
\item[I.S.:] $w = dv$ with $d \in \Sigma$ and $v \in \Sigma^*$.

Then $\textsl{count}(dv,c)$ is defined by case distinction:
\\[0.2cm]
\hspace*{1.3cm}
$\textsl{count}(dv,c) := \left\{
\begin{array}[c]{llr}
\textsl{count}(v,c) + 1 & \mbox{if $c = d$}; \\
\textsl{count}(v,c) & \mbox{if $c \not= d$}. \\
\end{array}\right.
$ \eox
\end{enumerate}
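
The inductive definition of count translates directly into a recursive function. The following is a minimal Python sketch and not part of the lecture notes: the function name `count` mirrors the definition above, and the empty Python string is assumed to play the role of the empty word.

```python
def count(w: str, c: str) -> int:
    """Return how many times the letter c occurs in the word w."""
    if w == '':            # base case: w is the empty word
        return 0
    d, v = w[0], w[1:]     # induction step: w = dv with a first letter d
    if c == d:
        return count(v, c) + 1
    return count(v, c)


if __name__ == '__main__':
    print(count('ABBA', 'A'))
```

The recursion follows the case distinction literally; for long words an iterative loop or `str.count` would be the more practical choice.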
We define $\Sigma = \{ \squoted{A}, \squoted{B} \}$ and define the language $L$ as the
set of words $w\in\Sigma^*$ in which the letters \squoted{A} and \squoted{B} occur with the
same frequency:
@@ -993,7 +967,7 @@ \section{Top-Down Parser}
(abbreviated as \textsc{Ebnf}-grammar). Theoretically, the expressive power of
\textsc{Ebnf} grammars is the same as the expressive power of context-free grammars.
In practice, however, it turns out that the construction of top-down parsers for
-\textsc{Ebnf} grammars is easier, because in an \textsc{Ebnf} grammar the left recursion can be replaced
+\textsc{Ebnf} grammars is easier, because in an \textsc{Ebnf} grammar the left recursion can often be replaced
by iteration.
\end{enumerate}
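
The remark that left recursion can often be replaced by iteration can be illustrated with a small sketch. It assumes an EBNF-style grammar of the form expr -> product (('+' | '-') product)*, product -> factor (('*' | '/') factor)*, factor -> '(' expr ')' | number. This grammar and the function names `parse_expr`, `parse_product`, and `parse_factor` are assumptions made for illustration; they are not taken from the lecture notes. Each function returns a pair consisting of the computed value and the list of tokens that have not yet been consumed.

```python
def parse_expr(tokens: list[str]) -> tuple[float, list[str]]:
    """expr -> product (('+' | '-') product)*  (the repetition becomes a loop)"""
    value, rest = parse_product(tokens)
    while rest and rest[0] in ('+', '-'):
        op = rest[0]
        arg, rest = parse_product(rest[1:])
        value = value + arg if op == '+' else value - arg
    return value, rest


def parse_product(tokens: list[str]) -> tuple[float, list[str]]:
    """product -> factor (('*' | '/') factor)*"""
    value, rest = parse_factor(tokens)
    while rest and rest[0] in ('*', '/'):
        op = rest[0]
        arg, rest = parse_factor(rest[1:])
        value = value * arg if op == '*' else value / arg
    return value, rest


def parse_factor(tokens: list[str]) -> tuple[float, list[str]]:
    """factor -> '(' expr ')' | number"""
    if not tokens:
        raise SyntaxError('unexpected end of input')
    if tokens[0] == '(':
        value, rest = parse_expr(tokens[1:])
        if not rest or rest[0] != ')':
            raise SyntaxError("expected ')'")
        return value, rest[1:]
    return int(tokens[0]), tokens[1:]


if __name__ == '__main__':
    print(parse_expr(['1', '-', '2', '-', '3']))
```

Because the loop folds each operand into the running value from left to right, subtraction and division come out left-associative without any left-recursive rule.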
In the rest of this chapter we will discuss these two procedures in more detail using the grammar for
@@ -1167,7 +1141,6 @@ \subsection{Rewriting a Grammar to Eliminate Left Recursion \label{left-recursi
\\[0.2cm]
where $\textsl{op} \in \{ \quoted{$\cdot$}, \quoted{/} \}$ holds.
\end{enumerate}
\pagebreak

\exerciseEng \label{exercise:regexp}
\begin{enumerate}[(a)]
@@ -1203,6 +1176,23 @@ \subsection{Rewriting a Grammar to Eliminate Left Recursion \label{left-recursi


\subsection{Implementing a Top Down Parser in \textsl{Python}}
\noindent
Now we are ready to implement a parser for recognizing arithmetic expressions.
We will use the grammar that is shown in Figure \ref{fig:Expr2} on page \pageref{fig:Expr2}.
Before we can implement the parser, we need a scanner. We will use a hand-coded scanner that is shown in
Figure \ref{fig:Top-Down-Parser:scanner.ipynb} on page \pageref{fig:Top-Down-Parser:scanner.ipynb}.
The function \texttt{tokenize} implemented in this scanner receives a string \texttt{s} as argument and returns
a list of tokens. The string \texttt{s} is supposed to represent an arithmetical expression.
In order to understand the implementation, you need to know the following:
\begin{enumerate}[(a)]
\item We need to set the flag \texttt{re.VERBOSE} in our call of the function \texttt{findall}
below because otherwise we are not able to format the regular expression \texttt{lexSpec} the way
we have done it. In particular, we would not be able to use comments inside the regular expression
and we would not be able to format the regular expression using white space.
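\item The regular expression \texttt{lexSpec} contains 3 alternatives: white space, numbers, and operator symbols
(including parentheses). White space is removed, while everything else is collected in the list \texttt{result}.
Furthermore, since \texttt{lexSpec} ends with a trailing \texttt{|}, it also matches the empty string. The empty
string that therefore occurs at the end of the list returned by \texttt{findall} has to be removed in the same way as white space.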
\end{enumerate}
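
The two points above can be checked with a short, self-contained sketch. It reproduces the simplified scanner with its indentation restored and adds `import re`, which is assumed here so that the snippet runs on its own. The variable `raw` is introduced only for illustration; it holds the unfiltered result of `findall` for the same pattern written without `re.VERBOSE`.

```python
import re


def tokenize(s: str) -> list[str]:
    # The trailing '|' leaves an empty last alternative, so the pattern
    # also matches the empty string.
    lexSpec = r'''[ \t]+        |  # blanks and tabs
                  [1-9][0-9]*|0 |  # numbers
                  [-+*/()]      |  # arithmetical operators and parentheses
               '''
    tokenList = re.findall(lexSpec, s, re.VERBOSE)
    result = []
    for token in tokenList:
        # skip empty matches as well as blanks and tabs
        if token == '' or token[0] in [' ', '\t']:
            continue
        result += [token]
    return result


# The same pattern on one line, without comments and layout white space.
raw = re.findall(r'[ \t]+|[1-9][0-9]*|0|[-+*/()]|', '1 + 23')

if __name__ == '__main__':
    print(raw)                  # unfiltered matches, including white space
    print(tokenize('1 + 23'))   # filtered token list
```

A side effect of the empty alternative is that a character matching none of the three alternatives yields an empty match and is therefore dropped silently instead of being reported.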


\begin{figure}[!ht]
@@ -1215,28 +1205,17 @@ \subsection{Implementing a Top Down Parser in \textsl{Python}}
xleftmargin = 0.0cm,
xrightmargin = 0.0cm
]{python3}
-import re
-
-def tokenize(s):
-    lexSpec = r'''([ \t]+)        | # blanks and tabs
-                  ([1-9][0-9]*|0) | # number
-                  ([()])          | # parentheses
-                  ([-+*/])        | # arithmetical operators
-                  (.)               # unrecognized character
+def tokenize(s: str) -> list[str]:
+    lexSpec = r'''[ \t]+        | # blanks and tabs
+                  [1-9][0-9]*|0 | # numbers
+                  [-+*/()]      | # arithmetical operators and parentheses
               '''
     tokenList = re.findall(lexSpec, s, re.VERBOSE)
     result = []
-    for ws, number, parenthesis, operator, error in tokenList:
-        if ws:        # skip blanks and tabs
-            pass
-        if number:
-            result += [ number ]
-        if parenthesis:
-            result += [ parenthesis ]
-        if operator:
-            result += [ operator ]
-        if error:
-            result += [ f'ERROR({error})']
+    for token in tokenList:
+        if token == '' or token[0] in [' ', '\t']: # skip blanks and tabs
+            continue
+        result += [ token ]
     return result
\end{minted}
\vspace*{-0.3cm}
@@ -1308,22 +1287,6 @@ \subsection{Implementing a Top Down Parser in \textsl{Python}}


\noindent
Now we are ready to implement a parser for recognizing arithmetic expressions.
We will use the grammar that is shown in Figure \ref{fig:Expr2} on page \pageref{fig:Expr2}.
Before we can implement the parser, we need a scanner. We will use a hand-coded scanner that is shown in
Figure \ref{fig:Top-Down-Parser:scanner.ipynb} on page \pageref{fig:Top-Down-Parser:scanner.ipynb}.
The function \texttt{tokenize} implemented in this scanner receives a string \texttt{s} as argument and returns
a list of tokens. The string \texttt{s} is supposed to represent an arithmetical expression.
In order to understand the implementation, you need to know the following:
\begin{enumerate}[(a)]
\item We need to set the flag \texttt{re.VERBOSE} in our call of the function \texttt{findall}
below because otherwise we are not able to format the regular expression \texttt{lexSpec} the way
we have done it.
\item The regular expression \texttt{lexSpec} contains 5 parenthesized groups. Therefore,
\texttt{findall} returns a list of 5-tuples where the 5 components correspond to the 5
groups of the regular expression. As the 5 groups are non-overlapping, exactly one of the 5 components
will be a non-empty string.
\end{enumerate}
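
The behaviour of `findall` with parenthesized groups can be seen in a small sketch. It restates the group-based scanner in runnable form; the names `LEX_SPEC` and `tokenize_with_groups` are chosen here for illustration and do not appear in the lecture notes.

```python
import re

# Five parenthesized groups; at most one of them is non-empty per match.
LEX_SPEC = r'''([ \t]+)        |  # blanks and tabs
               ([1-9][0-9]*|0) |  # number
               ([()])          |  # parentheses
               ([-+*/])        |  # arithmetical operators
               (.)                # unrecognized character
            '''


def tokenize_with_groups(s: str) -> list[str]:
    result = []
    # With groups in the pattern, findall yields one 5-tuple per match.
    for ws, number, parenthesis, operator, error in re.findall(LEX_SPEC, s, re.VERBOSE):
        if ws:                      # skip blanks and tabs
            continue
        if number:
            result.append(number)
        elif parenthesis:
            result.append(parenthesis)
        elif operator:
            result.append(operator)
        elif error:
            result.append(f'ERROR({error})')
    return result


if __name__ == '__main__':
    print(re.findall(LEX_SPEC, '1+', re.VERBOSE))
    print(tokenize_with_groups('(1 + a)'))
```

Since every alternative consumes at least one character, this variant produces no empty matches, and the catch-all group `(.)` turns unrecognized characters into explicit `ERROR(...)` tokens.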
Figure \ref{fig:Top-Down-Parser.ipynb} on page
\pageref{fig:Top-Down-Parser.ipynb} shows an implementation of a recursive descent parser in
\textsc{Python}.
@@ -1371,7 +1334,7 @@ \subsection{Implementing a Top Down Parser in \textsl{Python}}
\verb|[9, [')', '*', 2]]|.
\\[0.2cm]
Here, the part \verb|['(', 1, '+', 2, ')', '*', 3]| has been parsed and evaluated as
-the number $9$ and \verb|[')', '*', 2]| is the list of tokens that have not yet been
+the number $9$ and \\ \verb|[')', '*', 2]| is the list of tokens that have not yet been
processed.

In order to parse an arithmetic expression, the function first parses a
@@ -1588,7 +1551,7 @@ \subsection{Implementing a Recursive Descent Parser that Uses an \textsc{EBNF} G

\paragraph{Historical Notes} The language \textsc{Algol} \cite{backus:1959,naur:1960} was the first
programming language with a syntax that was based on an \textsc{Ebnf} grammar.

\pagebreak

\section{Check your Understanding}
\begin{enumerate}[(a)]
50 changes: 25 additions & 25 deletions Lecture-Notes/formal-languages.idx
@@ -79,30 +79,30 @@
\indexentry{grammar rule|hyperpage}{60}
\indexentry{start symbol|hyperpage}{61}
\indexentry{derivation-step|hyperpage}{61}
\indexentry{palindrome|hyperpage}{64}
\indexentry{palindrome|hyperpage}{63}
\indexentry{parse-tree|hyperpage}{64}
\indexentry{ambiguous grammar|hyperpage}{66}
\indexentry{left-recursive|hyperpage}{67}
\indexentry{ambiguous grammar|hyperpage}{65}
\indexentry{left-recursive|hyperpage}{66}
\indexentry{\textsc{Ebnf}-Grammar|hyperpage}{73}
\indexentry{Cocke-Younger-Kasami-Algorithmus|hyperpage}{81}
\indexentry{\textsc{Cyk}-Algorithmus|hyperpage}{81}
\indexentry{Earley-Objekt|hyperpage}{81}
\indexentry{rightmost derivation|hyperpage}{88}
\indexentry{leftmost derivation|hyperpage}{88}
\indexentry{shift-reduce parser|hyperpage}{89}
\indexentry{parser configuration|hyperpage}{89}
\indexentry{marked rule|hyperpage}{96}
\indexentry{closure of a set of marked rules|hyperpage}{97}
\indexentry{\(\textsl{closure}(\mathcal{M})\)|hyperpage}{97}
\indexentry{augmented grammar|hyperpage}{98}
\indexentry{$\lambda$-generating|hyperpage}{99}
\indexentry{augmented grammar|hyperpage}{100}
\indexentry{shift-reduce conflict|hyperpage}{103}
\indexentry{reduce-reduce conflict|hyperpage}{103}
\indexentry{SLR grammar|hyperpage}{103}
\indexentry{extended marked rule|hyperpage}{108}
\indexentry{e.m.R.|hyperpage}{108}
\indexentry{shift-reduce conflict|hyperpage}{110}
\indexentry{reduce-reduce conflict|hyperpage}{110}
\indexentry{\textsl{Integer}-\mytt{C}|hyperpage}{140}
\indexentry{symbol table|hyperpage}{151}
\indexentry{Cocke-Younger-Kasami-Algorithmus|hyperpage}{80}
\indexentry{\textsc{Cyk}-Algorithmus|hyperpage}{80}
\indexentry{Earley-Objekt|hyperpage}{80}
\indexentry{rightmost derivation|hyperpage}{87}
\indexentry{leftmost derivation|hyperpage}{87}
\indexentry{shift-reduce parser|hyperpage}{88}
\indexentry{parser configuration|hyperpage}{88}
\indexentry{marked rule|hyperpage}{95}
\indexentry{closure of a set of marked rules|hyperpage}{96}
\indexentry{\(\textsl{closure}(\mathcal{M})\)|hyperpage}{96}
\indexentry{augmented grammar|hyperpage}{97}
\indexentry{$\lambda$-generating|hyperpage}{98}
\indexentry{augmented grammar|hyperpage}{99}
\indexentry{shift-reduce conflict|hyperpage}{102}
\indexentry{reduce-reduce conflict|hyperpage}{102}
\indexentry{SLR grammar|hyperpage}{102}
\indexentry{extended marked rule|hyperpage}{107}
\indexentry{e.m.R.|hyperpage}{107}
\indexentry{shift-reduce conflict|hyperpage}{109}
\indexentry{reduce-reduce conflict|hyperpage}{109}
\indexentry{\textsl{Integer}-\mytt{C}|hyperpage}{139}
\indexentry{symbol table|hyperpage}{150}
Binary file modified Lecture-Notes/formal-languages.pdf
Binary file not shown.
15 changes: 12 additions & 3 deletions Python/Chapter-04-05/RegExp-Parser.ipynb
@@ -126,8 +126,8 @@
"- the operator symbols `+` and `*`, \n",
"- the parentheses `(`, `)`, \n",
"- single upper or lower case letters, \n",
"- `0`, \n",
"- the empty string `\"\"`.\n",
"- the symbol `0` that matches the empty language, \n",
"- the symbol `𝜀` that matches the empty string.\n",
"\n",
"All whitespace characters (and, indeed, all characters that could not be matched) are discarded."
]
@@ -351,7 +351,16 @@
"metadata": {},
"outputs": [],
"source": [
"parse('a+b')"
"parse('abc')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"parse('a+b+c')"
]
},
{