BarcelonaPreprocessing
Besides parsing 'clean' standard corpora for treebanking etc., there is a growing interest in and need for using HPSG parsing with DELPH-IN tools for larger-scale processing of unseen real-world text, such as the content of scientific papers, Wikipedia, and newspaper articles. There are various ways to improve parsing coverage by pre-processing, e.g.:
* shallow divide-and-conquer approaches (separating subclauses, splitting long sentences, recognizing titles, addresses, tables, figures)
* punctuation/sentence boundary detection
* text repair
* unknown word handling
* integration with PoS tagging and named entity recognition
* use of (interchangeable) external ontologies
* chart mapping
* coreference resolution
* ...
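To make two of the items above concrete, here is a minimal illustrative sketch (not taken from LKB, PET, or Heart of Gold) of sentence boundary detection combined with unknown word handling, where out-of-vocabulary tokens are mapped to generic lexical entries based on surface features. The toy lexicon and the generic-entry names (`generic_number`, `generic_proper_name`, `generic_word`) are invented for this example; a real system would consult the grammar's lexicon and its own inventory of generic entries.

```python
import re

# Toy lexicon standing in for the grammar's real lexicon (assumption).
LEXICON = {"the", "cat", "sat", "on", "mat", "barks", "."}

def split_sentences(text):
    """Naive sentence boundary detection: split after ., ! or ?
    when followed by whitespace and an uppercase letter."""
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

def generic_entry(token):
    """Map an out-of-vocabulary token to a generic lexical entry
    based on surface features (entry names are hypothetical)."""
    if re.fullmatch(r'\d+([.,]\d+)?', token):
        return 'generic_number'
    if token[0].isupper():
        return 'generic_proper_name'
    return 'generic_word'

def preprocess(text):
    """For each sentence, pair every token with either itself
    (known word) or a generic-entry label (unknown word)."""
    result = []
    for sent in split_sentences(text):
        tokens = re.findall(r'\w+|[^\w\s]', sent)
        result.append([(t, t if t.lower() in LEXICON else generic_entry(t))
                       for t in tokens])
    return result
```

For example, `preprocess("The cat sat on the mat. Fido barks 42 times.")` yields two sentences, with `Fido` mapped to `generic_proper_name`, `42` to `generic_number`, and `times` to `generic_word`, while the in-lexicon tokens pass through unchanged.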
Some of these methods have been implemented as part of the LKB, PET, or Heart of Gold, or project-specifically, and can be characterized as absolute prerequisites (though still not always optimally solved or well suited to the domain); others have the status of 'good idea but never done' or go in the direction of application- or domain-oriented pre-processing.
The aim of this discussion is to collect and discuss the efforts made by members of the DELPH-IN community recently, and perhaps prospectively even try to unify them. There seem to be more good-practice solutions than have been published or made available as downloadable tools. Participants are encouraged to briefly report on their needs and solutions (maybe even with a slide). The slot partly overlaps with the chart mapping tutorial by Peter Adolphs and the robustness techniques tutorial by Yi Zhang. Moreover, I will present some of the efforts conducted at our site in my presentation on HyLaP.