BarcelonaPreprocessing
Besides parsing 'clean' standard corpora for treebanking etc., there seems to be a growing interest in, and need for, using HPSG parsing with DELPH-IN tools for larger-scale processing of unseen real-world text such as the content of scientific papers, Wikipedia, and newspaper articles. There are various ways to improve parsing coverage by pre-processing, e.g.
- shallow divide-and-conquer approaches (separating subclauses, splitting long sentences, recognizing titles, addresses, tables, figures)
- punctuation/sentence boundary detection
- text repair
- unknown word handling
- integration with PoS tagging and named entity recognition, use of (interchangeable) external ontologies
- chart mapping
- coreference resolution
- ...
Some of these methods have been implemented as part of LKB, PET, or Heart of Gold, or within individual projects, and can be considered absolute prerequisites (though still not always solved optimally or well suited to the domain). Others have the status of 'good idea but never done' or tend towards application- or domain-oriented pre-processing.
The aim of this discussion is to collect and discuss the efforts recently made by members of the DELPH-IN community, and perhaps eventually to unify them. There seem to be more good-practice solutions than have been published or made available as downloadable tools. Participants are encouraged to report briefly on their needs and solutions (maybe even with a slide). The slot partly overlaps with the chart mapping tutorial by Peter Adolphs and the robustness techniques tutorial by Yi Zhang. Moreover, I will present some of the efforts conducted at our site in my presentation on HyLaP.
- processing PDF
  - Ulrich: Using [http://incubator.apache.org/pdfbox/ Apache PDFBox] in [http://hylap.dfki.de/ HyLaP] and [http://take.dfki.de/ TAKE]. PDFBox can also retrieve layout information.
  - Francis: A pipeline developed within the [http://www.kyoto-project.eu/ KYOTO Project] converts PDF into [http://xmlgroup.iit.cnr.it/kyoto/index.php?option=com_content&view=article&id=141&Itemid=130 KAF (KYOTO Annotation Format)]. It is based on [http://poppler.freedesktop.org/ Poppler], which in general returned the best results. Sharing welcome.
  - Mike: Good experience with [http://www.unixuser.org/~euske/python/pdfminer/index.html PDFMiner], which has good Unicode support. Font family turned out to be useful for identifying contiguous text blocks, and font size for identifying further logical text structure (headings, body text, captions, etc.).
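
The font-based heuristic Mike describes could be sketched roughly as follows. This is only an illustration, not his implementation: it assumes the PDF library has already delivered a list of `(text, font_name, font_size)` spans (a hypothetical input format), merges consecutive spans with the same font into blocks, and labels each block relative to the dominant font size.

```python
from collections import Counter
from itertools import groupby


def group_blocks(spans):
    """Merge consecutive spans that share font name and size into one block."""
    blocks = []
    for (font, size), run in groupby(spans, key=lambda s: (s[1], s[2])):
        text = " ".join(s[0] for s in run)
        blocks.append({"text": text, "font": font, "size": size})
    return blocks


def label_blocks(blocks):
    """Label blocks as heading, body, or caption relative to the body size."""
    if not blocks:
        return blocks
    # Weight sizes by character count so short headings cannot outvote body text.
    weights = Counter()
    for b in blocks:
        weights[b["size"]] += len(b["text"])
    body_size = weights.most_common(1)[0][0]
    for b in blocks:
        if b["size"] > body_size:
            b["label"] = "heading"
        elif b["size"] < body_size:
            b["label"] = "caption"  # assumption: smaller than body means caption
        else:
            b["label"] = "body"
    return blocks


# Made-up spans, standing in for what a PDF extraction step might return.
spans = [
    ("1 Introduction", "Times-Bold", 14.0),
    ("Parsing real-world text", "Times-Roman", 10.0),
    ("requires robust pre-processing.", "Times-Roman", 10.0),
    ("Figure 1: Pipeline overview", "Times-Roman", 8.0),
]
for block in label_blocks(group_blocks(spans)):
    print(block["label"], "|", block["text"])
```

Treating everything smaller than body text as a caption is a strong simplification; footnotes and table cells would need their own rules.
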
- sentence splitting
  - Gisle:
    - tried the tokenizer from [http://www.nltk.org/ NLTK], but the results were not good enough
    - now using a tokenizer from CIS München, which is also open to adaptation to other languages
  - Ulrich: using JTok from [http://heartofgold.dfki.de/ HoG]. Can be adapted.
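
To make the adaptation issue concrete, here is a minimal rule-based splitter. It is not NLTK, JTok, or the CIS tokenizer; the abbreviation list and the boundary pattern are assumptions chosen for illustration, and they are exactly the parts that would need tuning per language and domain.

```python
import re

# Hypothetical abbreviation list; a real system would use a domain-specific one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "cf.", "fig.", "dr.", "al."}

# Candidate boundary: sentence-final punctuation, whitespace, then an
# uppercase letter, a digit, or an opening quote or bracket.
BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"(\[])')


def split_sentences(text):
    """Split text at candidate boundaries not preceded by a known abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        # Compare the last token, without leading brackets or quotes.
        last_token = candidate.split()[-1].lower().lstrip("([\"'")
        if last_token in ABBREVIATIONS:
            continue  # not a real boundary, keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = ("We tested several tokenizers, e.g. NLTK and JTok. "
        "Results varied (cf. Table 2). Dr. Smith disagreed!")
for sentence in split_sentences(text):
    print(sentence)
```
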
- divide-and-conquer approaches:
  - Dan: Gertjan van Noord uses a divide-and-conquer strategy in Alpino, with home-grown machinery, with good success. He reportedly uses external tools that he adapted to his needs.
  - Dan: We really need to consider divide-and-conquer strategies. Especially for WeScience, most of the sentences in the 50-60 token range fail because they hit the resource limits. For half of the sentences for which we get no parse, we hit a resource limit.
  - Berthold: Try the topoparser approach developed within the Whiteboard project, in which PET was guided by the output of a shallow topological parser. The code is no longer in PET, but Bernd Kiefer should still have a copy of it.
  - Dan: We should evaluate what others are doing in this area.
  - Dan: We could enter chart items of length 0 to indicate chunk borders, with an effect similar to that of punctuation right now.
  - Stephan: Something similar is currently done to preserve italic markup.
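
As a very naive illustration of the divide-and-conquer idea (not the Alpino or topoparser machinery), the sketch below splits an over-long token sequence at the punctuation token closest to its middle and recurses until every chunk fits a length limit. The border token set and the default limit are assumptions.

```python
# Assumed set of tokens after which a chunk border may be placed.
BORDER_TOKENS = {",", ";", ":"}


def chunk(tokens, max_len=40):
    """Recursively split tokens at the border closest to the middle."""
    if len(tokens) <= max_len:
        return [tokens]
    # Split positions lie right after a border token; the final token is
    # excluded so that both halves are always non-empty.
    borders = [i + 1 for i, tok in enumerate(tokens[:-1]) if tok in BORDER_TOKENS]
    if not borders:
        return [tokens]  # no border available; pass the sequence on unsplit
    middle = len(tokens) / 2
    split = min(borders, key=lambda i: abs(i - middle))
    return chunk(tokens[:split], max_len) + chunk(tokens[split:], max_len)


tokens = "a b c , d e f g ; h i".split()
for piece in chunk(tokens, max_len=5):
    print(" ".join(piece))
```

The positions found this way could also serve as chunk border markers for the parser instead of hard splits, which is closer to the zero-length chart item idea above.
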
- citations
  - Dan: planning to implement token mapping rules for treating references and citations pretty soon (with a focus on WeScience). This should not be difficult, since they follow a fairly restricted language, although it may vary a bit between sources. They should definitely be part of the grammar.
  - Ulrich: using ParseKit so far for finding references in the text. Problem: the output of ParseKit is a regenerated reference that seldom matches the input.
  - Richard (answering Dan): in SciBorg, references and citations were ignored.
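
The point that citations follow a fairly restricted language can be illustrated with two regular expressions. This is a Python stand-in, not token mapping rules as they would be written for the grammar; the two citation styles covered and the `_CITATION_` placeholder are assumptions.

```python
import re

# Numeric style, e.g. [3], [3, 4], [3-5]
NUMERIC = r"\[\d+(?:\s*[,-]\s*\d+)*\]"
# Author-year style, e.g. (Smith, 2008) or (Smith et al., 2008; Jones and Lee, 2009a)
AUTHOR_YEAR = (
    r"\((?:[A-Z][\w-]+"
    r"(?:\s+(?:and|&)\s+[A-Z][\w-]+|\s+et\s+al\.)?"
    r",?\s+\d{4}[a-z]?(?:;\s*)?)+\)"
)
CITATION = re.compile(NUMERIC + "|" + AUTHOR_YEAR)


def mask_citations(text, placeholder="_CITATION_"):
    """Replace each citation with a placeholder and return the removed spans."""
    spans = []

    def replace(match):
        spans.append(match.group(0))
        return placeholder

    return CITATION.sub(replace, text), spans


# Made-up example sentence with invented citations.
sentence = ("HPSG parsing is robust (Smith et al., 2008; Jones and Lee, 2009a), "
            "as shown in [3, 4].")
masked, found = mask_citations(sentence)
print(masked)
print(found)
```

Keeping the original spans alongside the masked text avoids the mismatch problem mentioned above, since nothing is regenerated.
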
- coreference resolution
  - Problem: losing a lot of sentences in which pronouns refer back to a previously mentioned entity. Relevant to all sorts of IE tasks.
  - Ulrich: currently evaluating different tools for coreference resolution
  - Dan: Ann has been experimenting with semantics-based anaphoric binding in SciBorg. Supposed to be ready in a couple of months.
  - Stephan: a doctoral fellow in Oslo will look into coreference resolution soon. Considered building on the [http://www.sfs.uni-tuebingen.de/~versley/BART/ BART toolkit].
  - Bart: Charniak told him that the BART toolkit didn't return good results for pronoun resolution. Bart also pointed out that it's not his fault.
  - Rebecca: Charniak is currently working on coreference resolution.
  - Dan: Ann was planning to do some implementation of coreference resolution on top of MRS.
- word-sense disambiguation
  - Francis: disambiguation of WordNet senses:
    - [http://www.d.umn.edu/~tpederse/similarity.html WordNet::Similarity] by [http://www.d.umn.edu/~tpederse/ Ted Pedersen]
    - [http://ixa2.si.ehu.es/ukb/ UKB: Graph Based Word Sense Disambiguation and Similarity], written in C, GPL
    - Francis himself plans to do something in the KYOTO project on disambiguation of WordNet senses, based on bilingual resources; willing to cooperate with other DELPH-IN members
    - chicken-and-egg problem: people have shown that a) disambiguated word senses help parsing and b) parsing helps word sense disambiguation