################################################################################
# Pyflex - A Python Lexical Analyzer
################################################################################
# Final Project Report
#
# Authors: Judson James and Peyton Chandarana
# Course: CSCE 578
# Semester: Spring 2019
#
################################################################################
Abstract:
Pyflex is a Python tool that emulates Flex (the "fast lexical analyzer
generator"), a C utility that finds matches in its input based on a specified
ruleset.
Pyflex takes a ruleset of regular expressions and uses it to scan an input
file for matching expressions. When a match is found, an optional action in
the form of a Python function can be called to perform an arbitrary task.
################################################################################
Caveats:
- A regex cannot always exclude everything it is meant to exclude.
  For example, T3_Flex.pyfl uses a regex that attempts to cut out noun-verbs
  (the N in the tag section), but it sometimes fails to catch the N when it
  is at the end of the tag.
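
One way this kind of miss can arise is sketched below. The tags and both
patterns are made up for illustration; they are not the tags or the regex
used in T3_Flex.pyfl. The sketch assumes an exclusion written as a character
class, which only inspects one position, and compares it with a negative
lookahead that scans the rest of the tag.

```python
import re

# Made-up tags for illustration; not the actual tags or regex from T3_Flex.pyfl.
tags = ["VB", "VNB", "VBN"]

# A character-class exclusion only inspects the single position after "V",
# so an "N" later in the tag slips through.
char_class = re.compile(r"V[^N]\w*")

# A negative lookahead rejects an "N" anywhere in the rest of the tag.
lookahead = re.compile(r"V(?!\w*N)\w+")

kept_by_class = [t for t in tags if char_class.fullmatch(t)]
kept_by_lookahead = [t for t in tags if lookahead.fullmatch(t)]

print(kept_by_class)      # ['VB', 'VBN']
print(kept_by_lookahead)  # ['VB']
```

Whether a lookahead would resolve the caveat for the real tag set has not
been verified here.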
################################################################################
File/Program Structure:
generated_code.py - This file contains the code generated by generator.py
                    for use in parsing/scanning for matches to the regular
                    expressions specified in the pyfl file. It also contains
                    any user defined code that was placed in the pyfl file.
generator.py      - This Python code generates the code found in
                    generated_code.py.
parser.py         - This file gets the specifications and code stored in the
                    pyfl file. It interprets and stores these in a symbol
                    table for later use by the generator to create the
                    generated code.
main.py           - This is the driver of the pyflex program.
symbol_table.py   - This file contains the SymbolTable class which acts as
                    a data structure for storing the information that the
                    parser gets from the pyfl file.
################################################################################
Methodology:
This section describes how the overall program flows from task to task.
In main.py we first capture the arguments that the user passes in from the
command line when calling python3 main.py. We do some argument checking and
then call the main function, which does the actual work of the program.
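
The argument checking step could look roughly like the sketch below. The
function name check_args, the exact checks, and the error messages are
assumptions made for illustration; the report does not list the checks that
main.py actually performs.

```python
def check_args(argv):
    """Return the specification path, or raise ValueError if argv is unusable.

    Hypothetical helper: the checks below are assumed, not taken from main.py.
    """
    # Expect exactly one argument after the script name.
    if len(argv) != 2:
        raise ValueError("usage: python3 main.py <specification.pyfl>")
    spec_path = argv[1]
    # Assumed convention: specification files use the .pyfl extension.
    if not spec_path.endswith(".pyfl"):
        raise ValueError("expected a .pyfl file, got: " + spec_path)
    return spec_path


print(check_args(["main.py", "pyflex_file/T3_Flex.pyfl"]))
```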
In this main function we first call upon the Parser. The parser takes in the
first argument that the user passed in as the pyflex specification to use. The
Parser takes the specifications defined in the pyflex file and stores them in
a Parser object. We then take this object and call ParseFile() which returns
a dictionary that contains the different sections of the pyflex file (i.e. the
ruleset, instructions, and the user defined code).
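
A minimal sketch of the kind of dictionary described above follows. The
specification layout is an assumption, since the pyfl format is not given in
this report: each rule line is taken to be a regular expression followed by
an action name, and a line containing only %% is taken to separate the rules
from the user defined code. The function name parse_spec is hypothetical and
stands in for ParseFile().

```python
def parse_spec(text):
    """Split an assumed two-part specification into its sections."""
    # Assumed layout: rules, then a "%%" line, then user defined code.
    rules_part, _, code_part = text.partition("\n%%\n")
    ruleset, instructions = [], []
    for line in rules_part.splitlines():
        # Assumed rule layout: "<regex> <action name>", split on the last gap.
        parts = line.strip().rsplit(None, 1)
        if not parts:
            continue  # skip blank lines
        ruleset.append(parts[0])
        instructions.append(parts[1] if len(parts) == 2 else None)
    return {"RULESET": ruleset, "INSTRUCTIONS": instructions, "CODE": code_part}


spec = (
    "[0-9]+ count_number\n"
    "[a-z]+ count_word\n"
    "%%\n"
    "def count_number(match):\n"
    "    pass\n"
)
sections = parse_spec(spec)
print(sections["RULESET"])
print(sections["INSTRUCTIONS"])
```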
After successfully parsing the pyflex file, we take the dictionary mentioned
above and store its components in a custom data structure we call a
SymbolTable. This SymbolTable stores the RULESET, INSTRUCTIONS, and CODE. We
implemented this SymbolTable to simplify using the data from the pyflex file
compared with doing dictionary lookups.
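
The idea of wrapping the dictionary in a SymbolTable can be sketched as
follows. Only the class name and the three stored components come from this
report; the attribute names, the dictionary keys, and the rules() helper are
assumptions for illustration.

```python
class SymbolTable:
    """Holds the three sections of a specification as plain attributes."""

    def __init__(self, sections):
        # Assumed dictionary keys; the real parser output may differ.
        self.ruleset = list(sections["RULESET"])
        self.instructions = list(sections["INSTRUCTIONS"])
        self.code = sections["CODE"]

    def rules(self):
        """Pair each regular expression with its instruction (hypothetical helper)."""
        return list(zip(self.ruleset, self.instructions))


table = SymbolTable(
    {"RULESET": ["[0-9]+"], "INSTRUCTIONS": ["count_number"], "CODE": ""}
)
print(table.rules())
```

Attribute access such as table.ruleset replaces repeated string-keyed
lookups, which is the simplification the paragraph above describes.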
Finally we enter the code generating portion of the pyflex program. First we
call Generator to create a new instance of a generator object. Then we call
GenerateNewScript on this instance to generate the code based on the
specifications found in the pyflex file: the generator places the user
defined code at the top, then takes the regular expressions and places them
as if-statements inside of a double for loop. The generated code is written
to generated_code.py.
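
The shape of the generated script can be sketched as below. This is not the
code from generator.py: the function name generate_script is hypothetical,
and the sketch assumes the double for loop iterates over input files and then
over the lines of each file, with one if-statement per regular expression.

```python
def generate_script(ruleset, instructions, code):
    """Build the text of a scanner script (assumes at least one rule)."""
    # User defined code goes at the top, after the imports the scanner needs.
    lines = ["import re", "import sys", "", code.rstrip(), ""]
    # Assumed double loop: every input file, then every line in that file.
    lines.append("for path in sys.argv[1:]:")
    lines.append("    with open(path) as handle:")
    lines.append("        for line in handle:")
    # One if-statement per regular expression, calling its action on a match.
    for pattern, action in zip(ruleset, instructions):
        lines.append("            match = re.search({!r}, line)".format(pattern))
        lines.append("            if match:")
        lines.append("                {}(match)".format(action))
    return "\n".join(lines) + "\n"


user_code = "def on_number(match):\n    print(match.group(0))"
script = generate_script(["[0-9]+"], ["on_number"], user_code)
print(script)
```

Writing the returned text to generated_code.py would complete the step
described above.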
The final step is for the user to call python3 on the generated_code.py
script with the input files that they wish to scan for matches. The results
are then printed to standard output.
################################################################################
Testing on COCA Data to count verbs:
To test this program on the COCA data, pass T3_Flex.pyfl into main.py as the
pyfl file after entering the pipenv shell.
    python3 main.py pyflex_file/T3_Flex.pyfl
Now the code is generated for the specifications in the pyfl file and you can
run the generated code.
    python3 generated_code.py <zoutFILE## COCA Files>
This will count the verbs in the COCA file and then print the total count as
well as a list of the verbs that matched. (Not all noun-verbs could be
excluded, due to the regular expression caveat described above.)
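
The counting behavior can be illustrated with the small sketch below. The
sample lines, the word-then-tab-then-tag layout, and the pattern are all
assumptions for illustration; they are not the contents of T3_Flex.pyfl or of
the COCA files.

```python
import re

# Hypothetical pattern: a word, a tab, then a tag that starts with "v".
VERB_LINE = re.compile(r"^(\S+)\t(v\w*)$")


def count_verbs(lines):
    """Return the number of matching lines and the list of matched words."""
    verbs = []
    for line in lines:
        match = VERB_LINE.match(line.rstrip("\n"))
        if match:
            verbs.append(match.group(1))
    return len(verbs), verbs


# Made-up sample data in the assumed layout.
sample = ["the\tat\n", "dog\tnn1\n", "runs\tvvz\n", "was\tvbdz\n"]
total, verbs = count_verbs(sample)
print(total, verbs)  # 2 ['runs', 'was']
```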
################################################################################
Future Improvements:
- Implement a way for a user to use pyflex as a module.
- Speed/efficiency improvements.
- Enable the user to define code before the ruleset for later use (preamble).
- Enable the user to run some code natively after completing the scan.
################################################################################