DeepBank_OneZero

Background

This page documents Version 1.0 of DeepBank, released in October 2013. In this release, there are annotations for Sections 00–21 of the venerable Wall Street Journal (WSJ) text from the Penn Treebank (PTB). The selection of sentences included in DeepBank is aligned with the PTB, but otherwise it is fully independent of the original PTB annotations, i.e. none of the linguistic information in DeepBank is derivative of the PTB.

Tokenization Conventions

Treebank Counts

Sections 00–21 comprise a total of 43,541 sentences, of which 37,085 (or 85.2%) have manually validated HPSG analyses. In a small number of cases (167 sentences), there is more than one gold-standard HPSG analysis; for another 27 sentences (not overlapping with the ambiguously annotated cases), the annotator has indicated a minor deficiency in the HPSG analysis. To reflect this latter distinction, we occasionally talk about gold- vs. silver-standard annotations.

For the almost 15% of sentences for which the HPSG system either did not provide any candidate analyses (within certain bounds on time and memory), or where all available analyses were rejected during annotation, we seek to fill the resulting ‘coverage gap’ in the treebank through automated parsing with the robust, approximative parser of Zhang & Krieger (2011). For 5,927 of these sentences (13.6% of the total), this release includes what Zhang & Krieger (2011) dub HPSG pseudo-derivations, i.e. a derivation tree similar in form and content to the ones produced by the full HPSG parser, but potentially combining constructions and lexical entries for which the unification of the full HPSG constraints would fail. Because of this potential inconsistency, some of these robust analyses are missing from the various derived formats: the more inconsistent a pseudo-derivation is, the more likely it is that conversion to the other formats (see below) is impossible.

The remaining 529 sentences (1.2%) from these 22 sections of the WSJ text have no analysis at all in DeepBank.
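
The counts reported in this section can be cross-checked with a few lines of Python. This is only an illustrative sketch: the names `TOTAL`, `counts`, and `shares` are invented here and are not part of DeepBank, and the category labels are shorthand for the three groups described above.

```python
# Sentence counts as reported on this page for WSJ Sections 00-21.
TOTAL = 43_541
counts = {
    "manually validated": 37_085,        # gold- or silver-standard
    "robust pseudo-derivation": 5_927,   # automatically parsed
    "no analysis": 529,
}


def shares(counts, total):
    """Return each count as a percentage of total, rounded to one decimal."""
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# The three groups partition the full set of sentences.
assert sum(counts.values()) == TOTAL

for label, pct in shares(counts, TOTAL).items():
    print(f"{label}: {pct}%")
```

The two groups without a manually validated analysis add up to 6,456 sentences, which is the ‘almost 15%’ coverage gap mentioned above.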

Sentence Identifiers

The Master Index

To keep track of the different levels of quality available for each sentence, the file ‘Items’ in the top-level directory provides a ‘master index’ of available annotations. For each sentence, the file contains one line, constituting a tab-separated triple with the fields id, confidence, and active. The first field is the unique sentence identifier (see above); the second field is a numerically coded indication of the quality of annotations available, distinguishing the following levels:

3: gold-standard, manually validated
2: silver-standard, manually validated
1: robust, automatically parsed pseudo-derivation
0: no available annotation

Finally, the third field, active, provides the number of HPSG analyses accepted during annotation, i.e. it will be 1 for the vast majority of items and an integer greater than one for the small number of items that were not fully disambiguated. For the robust pseudo-derivations, the active field will always be 1 (as the parser was run in one-best mode); for unannotated items, the number of active analyses is by definition 0.
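
As a rough illustration of how the master index might be consumed, the following Python sketch parses lines in the id, confidence, active layout described above and tallies them by quality level. The names `Item`, `read_items`, and `CONFIDENCE_LABELS` are invented for this example, and the identifiers in the sample data are made up; only the three-field, tab-separated layout and the meaning of the codes are taken from this page.

```python
import io
from collections import Counter
from typing import NamedTuple

# Quality codes as listed above; the short labels are our own shorthand.
CONFIDENCE_LABELS = {3: "gold", 2: "silver", 1: "robust", 0: "none"}


class Item(NamedTuple):
    id: str          # unique sentence identifier
    confidence: int  # quality level, 0 to 3
    active: int      # number of analyses accepted during annotation


def read_items(stream):
    """Yield one Item per non-empty line of a tab-separated master index."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        ident, confidence, active = line.split("\t")
        yield Item(ident, int(confidence), int(active))


# Made-up sample lines (hypothetical identifiers), standing in for 'Items'.
sample = (
    "20001001\t3\t1\n"
    "20001002\t3\t2\n"
    "20001003\t2\t1\n"
    "20001004\t1\t1\n"
    "20001005\t0\t0\n"
)

items = list(read_items(io.StringIO(sample)))
tally = Counter(CONFIDENCE_LABELS[item.confidence] for item in items)
ambiguous = [item.id for item in items if item.active > 1]

print(tally)
print(ambiguous)
```

To run this against the actual release, the `io.StringIO(sample)` stream would be replaced by the opened ‘Items’ file; the character encoding of that file is not stated here and would need to be checked.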

Available File Formats

The native representation of the HPSG analyses in DeepBank is in the form of so-called [incr tsdb()] profiles, essentially flat-file relational databases. These reside inside the ‘tsdb/’ sub-directory and are usually processed using [incr tsdb()] and other components of the DELPH-IN toolchain.

A flat-file, textual representation of various views on the full HPSG analyses is provided in the form of DELPH-IN export files, inside the ‘export/’ directory. These files contain (a) the original ‘raw’ string; (b) the initial sequence of PTB-style tokens input to the parser; (c) the parser-internal lattice of ERG-style tokens; (d) the full HPSG derivation (ItsdbDerivations); (e) a simplified phrase structure tree, labeled with common category abbreviations (ErgTrees); (f) a logical-form meaning representation in Minimal Recursion Semantics (MRS; RmrsRfc); and (g) a reduction of the MRS into a variable-free Elementary Dependency Structure (EDS; RmrsEds). Please see the ItsdbExport page for additional information on the format conventions used in these files.

For immediate compatibility with much mainstream work, there is a conversion of the full HPSG analyses into bi-lexical syntactic and semantic dependencies, using a token-oriented, tab-separated file format inspired by the Shared Task of the 2008 Conference on Computational Natural Language Learning (CoNLL), in the ‘conll/’ sub-directory.
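
A token-oriented, tab-separated format of this kind can typically be read with very little code. The sketch below assumes, as in the CoNLL 2008 format, one token per line and a blank line between sentences; whether the DeepBank files follow exactly this layout, and which columns they contain, is not specified on this page. The function name `read_sentences` and the three-column sample data are hypothetical.

```python
import io


def read_sentences(stream):
    """Yield sentences as lists of token rows.

    Each token row is the list of tab-separated column strings from one
    line; a blank line is assumed to end the current sentence.
    """
    rows = []
    for line in stream:
        line = line.rstrip("\n")
        if line.strip():
            rows.append(line.split("\t"))
        elif rows:
            yield rows
            rows = []
    if rows:  # last sentence, if the input lacks a trailing blank line
        yield rows


# Hypothetical sample with invented columns (index, form, lemma).
sample = "1\tThe\tthe\n2\tdog\tdog\n3\tbarks\tbark\n\n1\tYes\tyes\n"

sentences = list(read_sentences(io.StringIO(sample)))
print(len(sentences))
print(sentences[0][1])
```

The reader deliberately leaves every column as an uninterpreted string, so it makes no claim about the meaning of individual columns in the actual files.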

Combining DeepBank with Other DELPH-IN Tools
