forked from bobzhang/ocaml-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathocaml-lex.tex
127 lines (108 loc) · 3.59 KB
/
ocaml-lex.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
\section{Ocamlllex}
\href{http://caml.inria.fr/pub/docs/manual-ocaml/manual026.html}{ocamllex}
\begin{enumerate}
\item \textit{module Lexing}
\begin{ocamlcode}
se_str "from" "Lexing";;
\end{ocamlcode}
\begin{ocamlcode}
val from_string : string -> lexbuf
val from_function : (string -> int -> int) -> lexbuf
val from_input : BatIO.input -> Lexing.lexbuf
val from_channel : BatIO.input -> Lexing.lexbuf
\end{ocamlcode}
\item syntax \\
\begin{bluetext}
{header}
let ident = regexp ...
rule entrypoint [arg1 .. argn ] =
parse regexp {action }
| ..
| regexp {action}
and entrypoint [arg1 .. argn] =
parse ..
and ...
{trailer}
\end{bluetext}
The parse keyword can be replaced by shortest keyword.
Typically, the header section contains the \textit{open} directives
required by the actions
All identifiers starting with \verb|__ocaml_lex| are reserved for use by
\textbf{ocamllex}
\item example
for me, best practice is put some test code in the trailer part, and
use \textit{ocamlbuild fc\_lexer.byte --} to verify, or write a
makefile. you can write several indifferent rule in a file using and.
\begin{bluetext}
(* verbatim translate *)
rule translate = parse
| "current_directory" {print_string (Sys.getcwd ()); translate lexbuf}
| _ as c {print_char c ; translate lexbuf}
| eof {exit 0}
{
let _ =
let chan = open_in "fc_lexer.mll" in begin
translate (Lexing.from_channel chan );
close_in chan
end
}
\end{bluetext}
\begin{alternate}
Legacy.Printexc.print;;
- : ('a -> 'b) -> 'a -> 'b = <fun>
\end{alternate}
\item caveat \\
the longest(shortest) win, then consider the order of each regexp
later.
Actions are evaluated after the \textit{lexbuf} is bound to the
current lexer buffer and the identifier following the keyword
\textit{as} to the matched string.
\item position \\
The lexing engine manages only the \textit{pos\_cnum} field of
\textit{lexbuf.lex\_curr\_p} with the number of chars read from the
start of lexbuf. you are responsible for the other fields to be
accurate.
i.e.
\begin{bluetext}
let incr_linenum lexbuf = Lexing.(
let pos = lexbuf.lex_curr_p in
lexbuf.lex_curr_p <- { pos with
pos_lnum = pos.pos_lnum + 1; (* line number *)
pos_bol = pos.pos_cnum; (* the offset of the beginning of the
line *)
})
\end{bluetext}
\item combine with ocamlyacc \\
normally just add \textit{open Parse} in the header, and use the
token defined in \textit{Parse}
\item tips \\
\begin{enumerate}
\item keyword table
\begin{bluetext}
{let keyword_table = Hashtbl.create 72
let _ = ...
}
rule token = parse
| ['A'-'z' 'a'-'z'] ['A'-'z' 'A'-'z' '0'-'9' '_'] * as id
{try Hashtbl.find keyword_table id with Not_found -> IDENT id}
| ...
\end{bluetext}
\item for sharing \textbf{why ocamllex sucks}\\
some complex regexps are not easy to write, like string, but sharing
is hard. To my knowledge, cpp preprocessor is fit for this task here.
camlp4 is not fit, it will check other syntax, if you use ulex, camlp4
will do this job.
So, my Makefile is part like this
\begin{bluetext}
lexer :
cpp fc_lexer.mll.bak > fc_lexer.mll
ocamlbuild -no-hygiene fc_lexer.byte --
\end{bluetext}
even so, sharing is still very hard, since the built in compiler used another way to write string lexing. painful too sharing. so ulex wins in both aspects.
sharing in ulex is much easier.
\end{enumerate}
\end{enumerate}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "master"
%%% End: