ErgProcessing
This page is intended as a collection of pointers related to using the ERG (for parsing or generation), with an eye towards first-time users. The page was initiated by StephanOepen, who intends to maintain it over time. Seeing that it will be important for the information provided here to be accurate and up-to-date, please be both pedantic and careful in making (non-trivial) revisions.
To parse running text using the ERG, a number of tools are required. Probably the most straightforward way of installing a full DELPH-IN toolchain is through the so-called LOGON distribution (see the page LogonTop for background). It runs on any reasonably recent Linux distribution on 32- or 64-bit x86 derivatives; natively 64-bit environments additionally need the 32-bit compatibility libraries to be available. Note that the full installation requires several gigabytes of available disk space, and that downloading the full tree can take between ten minutes and a couple of hours. To check out the tree, do the following:
svn co http://svn.emmtee.net/trunk logon
For the full functionality of the LOGON tree, there is a certain amount of first-time configuration, documented as Step (1) on the LogonInstallation page. However, to merely parse a sequence of sentences, it should be possible to skip that step and run the system directly from the command line:
cd logon
./parse --binary --erg+tnt --best 1 --text /tmp/test.txt
Here, the assumption is that the file /tmp/test.txt provides a newline-separated list of strings (using Un*x-style line break conventions), where each line will be fed to the parser individually for syntactic analysis (with a standard configuration, handling unknown words on the basis of a PoS tagging pre-processing step and lightweight RE-based named entity detection; for details, see the PetInput page).
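As a minimal sketch of preparing such an input file from the shell (the two sentences are only illustrative, and input.txt is a hypothetical name for text that was saved with DOS-style line breaks):

```shell
# Path expected by the parse command above.
input=/tmp/test.txt

# printf '%s\n' writes each argument on its own line, terminated by a bare LF.
printf '%s\n' \
  'The ERG is easy to install and use.' \
  'Parsing English with the ERG is a real pleasure.' \
  > "$input"

# For text with DOS-style (CR LF) line breaks, strip the carriage returns
# instead; input.txt is a hypothetical file name:
#   tr -d '\r' < input.txt > "$input"

# Sanity checks: number of lines, and number of lines still containing a CR.
line_count=$(wc -l < "$input" | tr -d ' ')
cr=$(printf '\r')
cr_lines=$(grep -c "$cr" "$input" || true)
echo "$line_count lines, $cr_lines with carriage returns"
```

A non-zero count of lines with carriage returns suggests the file should be passed through the tr step before parsing.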
If all goes well (as it should), the above command will produce tracing outputs somewhat like the following:
International Allegro CL Enterprise Edition
8.2 [64-bit Linux (x86-64)] (Oct 27, 2011 17:11)
Copyright (C) 1985-2010, Franz Inc., Oakland, CA, USA. All Rights Reserved.
This standard runtime copy of Allegro CL was built by:
[TC13152] Universitetet i Oslo (IFI)
; Loading /ltg/oe/src/logon/dot.tsdbrc
[...]
[sh 0.0] (1) `The ERG is easy to install and use .' [100000] --- 1 (0.11|0.10:0.11 s) <58:1058> {2612:4471} (26M).
[sh 0.0] (2) `Parsing English with the ERG is a real pleasure .' [100000] --- 1 (0.12|0.11:0.12 s) <59:959> {2108:5709} (32M).
[sh 0.0] (3) `We are grateful to everyone who contributed to the grammar and software .' [100000] --- 1 (0.18|0.16:0.18 s) <84:1777> {4202:8436} (45M).
[t40002] total elapsed parse time 0.4s; 3 items; avg time per item 0.1333s
flush-cache(): flushing `erg/1111/test/12-01-17/pet' cache ... done.
The numbers following each parser input on the last few lines above record various statistics, for example: (a) producing a single analysis in all cases (meaning the grammar was able to assign a parse to each of these utterances, and reflecting that the --best 1 option in our example command asks for only the most probable analysis to be extracted); (b) taking between 110 and 180 milliseconds per sentence; or (c) requiring between 26 and 45 megabytes of dynamic memory while parsing one sentence. More detailed information about the batch parsing process is available through the LogonProcessing/BatchParsing page.
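For larger batches it can be handy to pull these figures out of the trace automatically. The following is a rough sketch rather than an official tool: summarize_trace is a hypothetical helper name, the pattern is derived only from the sample lines shown above, and it assumes that the first figure inside the parentheses after the number of analyses is the per-item time.

```shell
# Extract "item readings seconds megabytes" from per-item trace lines and
# print a one-line summary. Lines that do not match the pattern (such as
# the "[t40002] total elapsed ..." line) are skipped.
# Assumption: the first figure in "(0.11|0.10:0.11 s)" is the time of interest.
summarize_trace() {
  sed -n 's/^\[[^]]*\] (\([0-9][0-9]*\)) .* --- \([0-9][0-9]*\) (\([0-9.]*\)|[^)]*) .* (\([0-9][0-9]*\)M)\.$/\1 \2 \3 \4/p' |
  awk '{ n++; t += $3; if ($4 > m) m = $4 }
       END { printf "%d items, %.2f s total, peak %d MB\n", n, t, m }'
}

# Sample input: the per-item lines from the trace above, stored in a
# temporary file so that the quotes in the trace need no escaping.
trace=$(mktemp)
cat > "$trace" <<'EOF'
[sh 0.0] (1) `The ERG is easy to install and use .' [100000] --- 1 (0.11|0.10:0.11 s) <58:1058> {2612:4471} (26M).
[sh 0.0] (2) `Parsing English with the ERG is a real pleasure .' [100000] --- 1 (0.12|0.11:0.12 s) <59:959> {2108:5709} (32M).
[sh 0.0] (3) `We are grateful to everyone who contributed to the grammar and software .' [100000] --- 1 (0.18|0.16:0.18 s) <84:1777> {4202:8436} (45M).
[t40002] total elapsed parse time 0.4s; 3 items; avg time per item 0.1333s
EOF

summary=$(summarize_trace < "$trace")
echo "$summary"   # prints: 3 items, 0.41 s total, peak 45 MB
rm -f "$trace"
```

In practice the trace of a real run could be saved to a file and passed to the same function on standard input.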
As a side effect of the above run, parsing results and statistics were written into a simple (file-based, textual) database, a so-called competence and performance profile. For our running example, this profile will be located in a newly created sub-directory called erg/1111/test/12-01-17/pet, within the so-called [incr tsdb()] database home lingo/lkb/src/tsdb/home/ inside the LOGON tree.
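Since the profile name embeds the date of the run, the path for a run made today can be assembled in the shell. This is only a sketch: it assumes that the date component is formatted as two-digit year, month, and day (as 12-01-17 suggests), that the erg/1111/test prefix and the pet suffix are unchanged from the example, and that it is run from the top of the LOGON tree.

```shell
# Assumption: the date component is YY-MM-DD, inferred from "12-01-17".
today=$(date +%y-%m-%d)
profile="erg/1111/test/${today}/pet"

# [incr tsdb()] database home, relative to the top of the LOGON tree.
tsdb_home="lingo/lkb/src/tsdb/home"

echo "profile name: ${profile}"

# Only report whether the directory exists; nothing is created or changed.
if [ -d "${tsdb_home}/${profile}" ]; then
  echo "found ${tsdb_home}/${profile}"
else
  echo "no profile at ${tsdb_home}/${profile} (yet)"
fi
```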
While it is possible to operate directly on the profile directory, it may be more convenient (for non-expert users) to export key elements of each parsing result into a (slightly) more human-readable form. To do so, a command like the following can be used (where the name of the profile holding parsing results needs to reflect the current date, of course):
./redwoods --binary --erg --default --composite --target /tmp \
--export derivation,tree,mrs,eds --active all \
erg/1111/test/12-01-17/pet
This command asks to export four distinct views on each analysis: (a) the so-called derivation tree (i.e. the exact HPSG recipe); (b) a simplified syntactic constituent tree (using a set of conventional category labels); (c) a logical-form meaning representation in Minimal Recursion Semantics; and (d) a reduced variant of the semantics, in the form of elementary dependencies. These outputs will be available in a newly created file in the --target directory, named after the original parsing profile, i.e. in this case /tmp/erg.1111.test.12-01-17.pet.gz (note that export files by default are compressed using GNU gzip(1)); sample output for our running example is available as a separate ErgProcessing/SampleExport page.
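To take a first look at the export without unpacking it permanently, something along the following lines should work. The sketch assumes that the export file name is simply the profile name with slashes replaced by dots plus a .gz suffix, which is inferred from the single example above; the 40-line preview length is arbitrary.

```shell
# Values from the running example; adjust the date component as needed.
profile="erg/1111/test/12-01-17/pet"
target="/tmp"

# Assumption: slashes in the profile name become dots in the export file name.
export_file="${target}/$(printf '%s' "$profile" | tr '/' '.').gz"
echo "$export_file"

# Decompress to standard output and show the first lines, if the file exists.
if [ -f "$export_file" ]; then
  gzip -dc "$export_file" | head -n 40
else
  echo "no export at $export_file (yet)"
fi
```

Piping the decompressed output into a pager such as less, instead of head, allows browsing the complete export.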
If you like what you are seeing, it is probably about time to read more about the ERG and DELPH-IN technology, for example starting from the ErgTop and LogonTop pages on this wiki, maybe perusing our mailing list archives, or preparing a grant application or donation to work with us on improving the grammar and tools. There are numerous ways of running the toolchain and of adapting the grammar and engine to various subject domains, genres, and more generally to a specific use case. Furthermore, the full HPSG analyses delivered by the grammar provide (even) more detailed syntacto-semantic information than what is exposed through the four interface representations shown above.