scanner simplified

karlstroetmann · Nov 2, 2024 · d634236
1 parent fd8cfef
commit d634236
Showing 6 changed files with 108 additions and 194 deletions.
diff --git a/Lecture-Notes/context-free-languages.tex b/Lecture-Notes/context-free-languages.tex
@@ -348,32 +348,6 @@ \subsection{Derivations}
 \end{enumerate}
 
 \exerciseEng
-We define for $w \in \Sigma^*$ and $c \in \Sigma$ the function
-\\[0.2cm]
-\hspace*{1.3cm}
-$\textsl{count}(w,c)$.
-\\[0.2cm]
-The function counts how many times the letter $c$ occurs in the word $w$.  The definition is done by induction
-on the string $w$.
-\begin{enumerate}
-\item[B.C.:] $w = \lambda$.  
-
-            We set
-            \\[0.2cm]
-            \hspace*{1.3cm}
-            $\textsl{count}(\lambda, c) := 0$.
-\item[I.S.:] $w = dv$ with $d \in \Sigma$ and $v \in \Sigma^*$.  
-
-            Then $\textsl{count}(dv,c)$ is defined by case distinction:
-            \\[0.2cm]
-            \hspace*{1.3cm}
-            $\textsl{count}(dv,c) := \left\{
-             \begin{array}[c]{llr}
-               \textsl{count}(v,c) + 1 & \mbox{if $c = d$};    \\
-               \textsl{count}(v,c) & \mbox{falls $c \not= d$}. \\
-             \end{array}\right.
-            $ \eox
-\end{enumerate}
 We define $\Sigma = \{ \squoted{A}, \squoted{B} \}$ and define the language $L$ as the
 set of words $w\in\Sigma^*$ in which the letters \squoted{A} and \squoted{B} occur with the
 same frequency:
@@ -993,7 +967,7 @@ \section{Top-Down Parser}
       (abbreviated as \textsc{Ebnf}-grammar).   Theoretically, the expressive power of
       \textsc{Ebnf} grammars is the same as the expressive power of context-free grammars.
       In practice, however, it turns out that the construction of top-down parsers for
-      \textsc{Ebnf} grammars is easier, because in an \textsc{Ebnf} grammar the left recursion can be replaced
+      \textsc{Ebnf} grammars is easier, because in an \textsc{Ebnf} grammar the left recursion can often be replaced
       by iteration. 
 \end{enumerate}
 In the rest of this chapter we will discuss these two procedures in more detail using the grammar for
@@ -1167,7 +1141,6 @@ \subsection{Rewriting a  Grammar to Eliminate Left Recursion \label{left-recursi
       \\[0.2cm]
       where $\textsl{op} \in \{ \quoted{$\cdot$}, \quoted{/} \}$ holds. 
 \end{enumerate}
-\pagebreak
 
 \exerciseEng \label{exercise:regexp}
 \begin{enumerate}[(a)]
@@ -1203,6 +1176,23 @@ \subsection{Rewriting a  Grammar to Eliminate Left Recursion \label{left-recursi
 
 
 \subsection{Implementing a Top Down Parser in \textsl{Python}}
+\noindent
+Now we are ready to implement a parser for recognizing arithmetic expressions.
+We will use the grammar that is shown in Figure \ref{fig:Expr2} on page \pageref{fig:Expr2}.
+Before we can implement the parser, we need a scanner.  We will use a hand-coded scanner that is shown in
+Figure \ref{fig:Top-Down-Parser:scanner.ipynb} on page \pageref{fig:Top-Down-Parser:scanner.ipynb}.
+The function \texttt{tokenize} implemented in this scanner receives a string \texttt{s} as argument and returns
+a list of tokens.  The string \texttt{s} is supposed to represent an arithmetical expression. 
+In order to understand the implementation, you need to know the following:
+\begin{enumerate}[(a)]
+\item We need to set the flag \texttt{re.VERBOSE} in our call of the function \texttt{findall}
+      below because otherwise we are not able to format the regular expression \texttt{lexSpec} the way 
+      we have done it.  In particular, we would not be able to use comments inside the regular expression
+      and we would not be able to format the regular expression using white space.
+\item The regular expression \texttt{lexSpec} contains 3 alternatives: white space, numbers, and operator symbols.
+      White space is removed, while everything else is collected in the list \texttt{result}.
+      Furthermore, the trailing \texttt{|} in \texttt{lexSpec} yields an empty string at the end, which has to be removed in the same way as white space.
+\end{enumerate}
 
 
 \begin{figure}[!ht]
@@ -1215,28 +1205,17 @@ \subsection{Implementing a Top Down Parser in \textsl{Python}}
                 xleftmargin   = 0.0cm,
                 xrightmargin  = 0.0cm
               ]{python3}
-    import re
-
-    def tokenize(s):
-        lexSpec = r'''([ \t]+)        |  # blanks and tabs
-                      ([1-9][0-9]*|0) |  # number
-                      ([()])          |  # parentheses 
-                      ([-+*/])        |  # arithmetical operators
-                      (.)                # unrecognized character
+    def tokenize(s: str) -> list[str]:
+        lexSpec = r'''[ \t]+        |  # blanks and tabs
+                      [1-9][0-9]*|0 |  # numbers
+                      [-+*/()]      |  # arithmetical operators and parentheses
                    '''
         tokenList = re.findall(lexSpec, s, re.VERBOSE)
         result    = []
-        for ws, number, parenthesis, operator, error in tokenList:
-            if ws:        # skip blanks and tabs
-                pass
-            if number:
-                result += [ number ]
-            if parenthesis:
-                result += [ parenthesis ]
-            if operator:
-                result += [ operator ]
-            if error:
-                result += [ f'ERROR({error})']
+        for token in tokenList:
+            if token == '' or token[0] in [' ', '\t']:        # skip blanks and tabs
+                continue
+            result += [ token ]
         return result
 \end{minted}
 \vspace*{-0.3cm}
@@ -1308,22 +1287,6 @@ \subsection{Implementing a Top Down Parser in \textsl{Python}}
 
 
 \noindent
-Now we are ready to implement a parser for recognizing arithmetic expressions.
-We will use the grammar that is shown in Figure \ref{fig:Expr2} on page \pageref{fig:Expr2}.
-Before we can implement the parser, we need a scanner.  We will use a hand-coded scanner that is shown in
-Figure \ref{fig:Top-Down-Parser:scanner.ipynb} on page \pageref{fig:Top-Down-Parser:scanner.ipynb}.
-The function \texttt{tokenize} implemented in this scanner receives a string \texttt{s} as argument and returns
-a list of tokens.  The string \texttt{s} is supposed to represent an arithmetical expression. 
-In order to understand the implementation, you need to know the following:
-\begin{enumerate}[(a)]
-\item We need to set the flag \texttt{re.VERBOSE} in our call of the function \texttt{findall}
-      below because otherwise we are not able to format the regular expression \texttt{lexSpec} the way 
-      we have done it.
-\item The regular expression \texttt{lexSpec} contains 5 parenthesized groups.  Therefore,
-      \texttt{findall} returns a list of 5-tuples where the 5 components correspond to the 5
-      groups of the regular expression.  As the 5 groups are non-overlapping, exactly one of the 5 components
-      will be a non-empty string.
-\end{enumerate}
 Figure \ref{fig:Top-Down-Parser.ipynb} on page
 \pageref{fig:Top-Down-Parser.ipynb} shows an implementation of a recursive descent parser in
 \textsc{Python}. 
@@ -1371,7 +1334,7 @@ \subsection{Implementing a Top Down Parser in \textsl{Python}}
       \verb|[9, [')', '*', 2]]|.
       \\[0.2cm]
       Here, the part \verb|['(', 1, '+', 2, ')', '*', 3]| has been parsed and evaluated as
-      the number $9$ and \verb|[')', '*', 2]| is the list of tokens that have not yet been
+      the number $9$ and \\ \verb|[')', '*', 2]| is the list of tokens that have not yet been
       processed.
 
       In order to parse an arithmetic expression, the function first parses a
@@ -1588,7 +1551,7 @@ \subsection{Implementing a Recursive Descent Parser that Uses an \textsc{EBNF} G
 
 \paragraph{Historical Notes} The language \textsc{Algol} \cite{backus:1959,naur:1960} was the first
 programming language with a syntax that was based on an \textsc{Ebnf} grammar.  
-
+\pagebreak
 
 \section{Check your Understanding}
 \begin{enumerate}[(a)]

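The simplified scanner from the hunk above can be tried out on its own. The following is a minimal sketch, not the notebook's actual cell: it assumes that `re` is imported explicitly (the diff drops the `import re` line from the figure), and the name `lex_spec` and the sample inputs are chosen here for illustration only.

```python
import re

def tokenize(s: str) -> list[str]:
    # Same three alternatives as in the diff.  The trailing '|' adds an
    # empty alternative, so findall also reports empty strings.
    lex_spec = r'''[ \t]+        |  # blanks and tabs
                   [1-9][0-9]*|0 |  # numbers
                   [-+*/()]      |  # arithmetical operators and parentheses
                '''
    result = []
    for token in re.findall(lex_spec, s, re.VERBOSE):
        # Skip empty matches as well as blanks and tabs.
        if token == '' or token[0] in (' ', '\t'):
            continue
        result.append(token)
    return result

print(tokenize('(1 + 2) * 3'))  # ['(', '1', '+', '2', ')', '*', '3']
print(tokenize('1 $ 2'))        # ['1', '2']
```

Note the design trade-off visible in the second call: since the pattern no longer has a catch-all group, a character such as `$` only produces an empty match and is dropped silently, whereas the removed version of the scanner reported it as an `ERROR(...)` token.
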
diff --git a/Lecture-Notes/formal-languages.idx b/Lecture-Notes/formal-languages.idx
@@ -79,30 +79,30 @@
 \indexentry{grammar rule|hyperpage}{60}
 \indexentry{start symbol|hyperpage}{61}
 \indexentry{derivation-step|hyperpage}{61}
-\indexentry{palindrome|hyperpage}{64}
+\indexentry{palindrome|hyperpage}{63}
 \indexentry{parse-tree|hyperpage}{64}
-\indexentry{ambiguous grammar|hyperpage}{66}
-\indexentry{left-recursive|hyperpage}{67}
+\indexentry{ambiguous grammar|hyperpage}{65}
+\indexentry{left-recursive|hyperpage}{66}
 \indexentry{\textsc{Ebnf}-Grammar|hyperpage}{73}
-\indexentry{Cocke-Younger-Kasami-Algorithmus|hyperpage}{81}
-\indexentry{\textsc{Cyk}-Algorithmus|hyperpage}{81}
-\indexentry{Earley-Objekt|hyperpage}{81}
-\indexentry{rightmost derivation|hyperpage}{88}
-\indexentry{leftmost derivation|hyperpage}{88}
-\indexentry{shift-reduce parser|hyperpage}{89}
-\indexentry{parser configuration|hyperpage}{89}
-\indexentry{marked rule|hyperpage}{96}
-\indexentry{closure of a set of marked rules|hyperpage}{97}
-\indexentry{\(\textsl{closure}(\mathcal{M})\)|hyperpage}{97}
-\indexentry{augmented grammar|hyperpage}{98}
-\indexentry{$\lambda$-generating|hyperpage}{99}
-\indexentry{augmented grammar|hyperpage}{100}
-\indexentry{shift-reduce conflict|hyperpage}{103}
-\indexentry{reduce-reduce conflict|hyperpage}{103}
-\indexentry{SLR grammar|hyperpage}{103}
-\indexentry{extended marked rule|hyperpage}{108}
-\indexentry{e.m.R.|hyperpage}{108}
-\indexentry{shift-reduce conflict|hyperpage}{110}
-\indexentry{reduce-reduce conflict|hyperpage}{110}
-\indexentry{\textsl{Integer}-\mytt{C}|hyperpage}{140}
-\indexentry{symbol table|hyperpage}{151}
+\indexentry{Cocke-Younger-Kasami-Algorithmus|hyperpage}{80}
+\indexentry{\textsc{Cyk}-Algorithmus|hyperpage}{80}
+\indexentry{Earley-Objekt|hyperpage}{80}
+\indexentry{rightmost derivation|hyperpage}{87}
+\indexentry{leftmost derivation|hyperpage}{87}
+\indexentry{shift-reduce parser|hyperpage}{88}
+\indexentry{parser configuration|hyperpage}{88}
+\indexentry{marked rule|hyperpage}{95}
+\indexentry{closure of a set of marked rules|hyperpage}{96}
+\indexentry{\(\textsl{closure}(\mathcal{M})\)|hyperpage}{96}
+\indexentry{augmented grammar|hyperpage}{97}
+\indexentry{$\lambda$-generating|hyperpage}{98}
+\indexentry{augmented grammar|hyperpage}{99}
+\indexentry{shift-reduce conflict|hyperpage}{102}
+\indexentry{reduce-reduce conflict|hyperpage}{102}
+\indexentry{SLR grammar|hyperpage}{102}
+\indexentry{extended marked rule|hyperpage}{107}
+\indexentry{e.m.R.|hyperpage}{107}
+\indexentry{shift-reduce conflict|hyperpage}{109}
+\indexentry{reduce-reduce conflict|hyperpage}{109}
+\indexentry{\textsl{Integer}-\mytt{C}|hyperpage}{139}
+\indexentry{symbol table|hyperpage}{150}
diff --git a/Lecture-Notes/formal-languages.pdf b/Lecture-Notes/formal-languages.pdf
diff --git a/Python/Chapter-04-05/RegExp-Parser.ipynb b/Python/Chapter-04-05/RegExp-Parser.ipynb
@@ -126,8 +126,8 @@
     "- the operator symbols `+` and `*`, \n",
     "- the parentheses `(`, `)`, \n",
     "- single upper or lower case letters, \n",
-    "- `0`, \n",
-    "- the empty string `\"\"`.\n",
+    "- the symbol `0` that matches the empty language, \n",
+    "- the symbol `𝜀` that matches the empty string.\n",
     "\n",
     "All whitespace characters (and, indeed, all characters that could not be matched) are discarded."
    ]
@@ -351,7 +351,16 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "parse('a+b')"
+    "parse('abc')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "parse('a+b+c')"
    ]
   },
   {