DeepBank_OneZero
This page documents Version 1.0 of DeepBank, released in October 2013. In this release, there are annotations for Sections 00–21 of the venerable Wall Street Journal (WSJ) text from the Penn TreeBank (PTB). The selection of sentences included in DeepBank is aligned with the PTB, but otherwise it is fully independent of the original PTB annotations, i.e. none of the linguistic information in DeepBank is derivative of the PTB.
Sections 00–21 comprise a total of 43,541 sentences, of which 37,085 (or 85.2%) have manually validated HPSG analyses. In a small number of cases (167 sentences), there is more than one gold-standard HPSG analysis; for another 27 sentences (not overlapping with the ambiguously annotated cases), the annotator has indicated a minor deficiency in the HPSG analysis. To reflect this latter distinction, we occasionally talk about gold- vs. silver-standard annotations.
For the almost 15% of sentences for which the HPSG system either did not provide any candidate analyses (within certain bounds on time and memory), or where all available analyses were rejected during annotation, we seek to fill the resulting ‘coverage gap’ in the treebank through automated parsing with the robust, approximative parser of Zhang & Krieger (2011). For 5,927 of these sentences (or 13.6% of the total), this release includes what Zhang & Krieger (2011) dub HPSG pseudo-derivations, i.e. derivation trees similar in form and content to the ones produced by the full HPSG parser, but potentially combining constructions and lexical entries for which the unification of the full HPSG constraints would fail. Because of this potential inconsistency, some percentage of these robust analyses are lost in the various derived formats: the more inconsistent a pseudo-derivation is, the more likely it is that conversion to the other formats is impossible (see below).
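The coverage figures above can be checked with a few lines of arithmetic. The sketch below assumes that the 5,927 robust analyses are disjoint from the 37,085 manually validated ones, as the text implies; the variable names are ours and do not come from the release.

```python
# Sentence counts as stated for DeepBank 1.0 (WSJ Sections 00-21).
total = 43541    # all sentences
gold = 37085     # manually validated HPSG analyses
robust = 5927    # HPSG pseudo-derivations from the robust parser

# Assumption: the robust analyses only cover sentences without a
# manually validated analysis, so the two sets do not overlap.
uncovered = total - gold - robust

def percent(count, whole=total):
    """Share of the whole, as a percentage rounded to one decimal."""
    return round(100.0 * count / whole, 1)

print(f"gold:      {gold:6d} ({percent(gold)}%)")
print(f"robust:    {robust:6d} ({percent(robust)}%)")
print(f"uncovered: {uncovered:6d} ({percent(uncovered)}%)")
```

Under that assumption, a small remainder of sentences has neither a validated analysis nor a pseudo-derivation.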
The native representation of the HPSG analyses in DeepBank is in the form of what are called [incr tsdb()] profiles, essentially flat-file relational databases. These reside inside the ‘tsdb/’ sub-directory and are usually processed using [incr tsdb()] and other components of the DELPH-IN toolchain.
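To give a rough idea of what ‘flat-file relational database’ means here, the following minimal sketch reads records from a plain-text table with one record per line and ‘@’ as the field separator, which is the convention used by [incr tsdb()] profile files. The column names and sample rows are hypothetical and heavily simplified (in a real profile the columns of each table are declared in the profile's ‘relations’ file), and escape sequences inside field values are not handled. For actual work, the DELPH-IN tools are the better choice.

```python
import io

def read_rows(lines, fields, sep="@"):
    """Yield one dict per record from an iterable of '@'-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        values = line.split(sep)
        # Pair each value with its column name; all values stay strings.
        yield dict(zip(fields, values))

# Hypothetical, simplified column list for an 'item'-like table.
ITEM_FIELDS = ["i-id", "i-input"]

# Made-up sample data standing in for a file from a profile directory.
sample = io.StringIO("1@The cat sleeps.\n2@Dogs bark.\n")

rows = list(read_rows(sample, ITEM_FIELDS))
for row in rows:
    print(row["i-id"], row["i-input"])
```

To read from disk instead, an open file handle can be passed in place of the `io.StringIO` object.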