polifonia-project/Polifonia-Knowledge-Extractor

component-id: Polifonia-Knowledge-Extractor
name: Polifonia Knowledge Extractor
description: Polifonia Knowledge Extractor pipeline documentation, output AMR graph bank's description and search script.
type: SoftwareLibrary
release-date: 28/06/2022
release-number: v1.0
work-package: WP4
licence: CC-BY_v4
related-components:
  persona: Valeriana, Carolina
  extends: Polifonia-Corpus
bibliography:
  deliverable-document: D4.5 Software for knowledge extraction from text – context – 1st version
contributors:

Polifonia Knowledge Extractor

The Polifonia Knowledge Extractor is software for extracting knowledge from text. It uses Abstract Meaning Representation (AMR) to parse sentences into semantic graphs and makes it possible to search within large AMR graph banks.

Pipeline

The pipeline of the Polifonia Knowledge Extractor is depicted in the figure below.

[Figure: pipeline of the Polifonia Knowledge Extractor]

Step 1 - corpus (input). The pipeline takes a textual corpus as input. For our work, we used the Polifonia Textual Corpus (PTC).

Step 2 - sampling. The large size of the PTC makes it difficult to analyze the results of our knowledge extraction methodology. For this reason, we conducted our experiments on a sample of the PTC, the Polifonia Mini Textual Corpus (PMTC). PMTC is downloadable from DOI

Step 3 - pre-processing (cleaning). The PTC contains a large portion of historical documents obtained using Optical Character Recognition (OCR) technologies. It thus contains errors that must be corrected in order to avoid parsing inaccuracies. As a first step towards post-OCR correction, we focused on fixing the major errors. Because the input unit of the Text2AMR parsing step, as implemented in our pipeline, is the sentence, we paid specific attention to OCR errors in the source texts that affect sentence cohesion. We noticed that format peculiarities of periodicals, such as the arrangement of text into two or more columns on a single page, systematically caused incorrect sentence breaks. Driven by the hypothesis that reconstructing sentence cohesion could improve the output quality of the Text2AMR parsing step, we implemented a minimal rule-based strategy that supports the reconstruction of sentence cohesion while preserving the paragraph breaks of the source text. This strategy is released and described in the Post-OCR Correction Rules repository.

Furthermore, the PTC contains long documents that cannot be parsed as a whole and have to be split into sentences. This splitting can lose information, especially where coreferences cross sentence boundaries. For this reason, we adopted a coreference resolution module and replaced pronouns with the names they refer to. As the coreference resolution module we adopted SpaCy neuralcoref. We focused only on nouns and pronouns mentioned within a span of 5 sentences.
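To make the idea of rule-based sentence-cohesion repair concrete, here is a minimal sketch. It is not the rule set released in the Post-OCR Correction Rules repository: the function name and the three rules (keep blank lines as paragraph breaks, glue end-of-line hyphens, turn other line breaks into spaces) are assumptions chosen for illustration.

```python
import re


def rejoin_broken_lines(text: str) -> str:
    """Rejoin lines split by page or column layout, keeping paragraph breaks.

    Hypothetical rules (NOT the project's released rule set):
    - a blank line marks a paragraph break and is preserved;
    - a line ending in a hyphen is glued to the next line without a space;
    - any other line break inside a paragraph becomes a single space.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    repaired = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-"):
                # De-hyphenate a word split across lines: "orches-" + "tra".
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        repaired.append(joined)
    return "\n\n".join(repaired)
```

Note that the hyphen rule in this sketch would also merge genuine hyphenated compounds that happen to fall at a line end, so a real rule set would need additional checks.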

Step 4 - parsing. Once the sentences have been cleaned and pronouns have been replaced with proper names, they can be passed to an AMR parser. We used SPRING to obtain an initial set of AMR graphs from the sampled and pre-processed corpus.

Step 5 - filtering. Because we are using non-standard texts (historical documents), the output of the AMR parser may be inaccurate. For this reason, we use a back-translation approach that converts the generated graphs back into sentences; we used SPRING for this task as well. We then compute a similarity score between each original sentence and its back-translated counterpart and use this score as an estimate of the quality of the AMR graph. The rationale is that if the model can produce, from the generated graph, a sentence similar enough to the original, the graph is likely to be of good quality. Another strategy for validating the quality of the AMR graphs consists in developing a web application that allows human validators to check the correctness of specific portions of the graphs.

Step 6 - graph bank. The final step of our pipeline consists in obtaining a large set of (presumably) good-quality graphs that can be queried and used in other applications such as Knowledge Graph construction and/or Question Answering.
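The filtering step described above (back-translation followed by a similarity check) can be sketched as follows. This is only an illustration under stated assumptions: the similarity metric (a character-level ratio from Python's difflib), the 0.8 threshold, and the graph_to_text callable standing in for the back-translation model are all hypothetical and are not taken from the project's code.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple


def similarity(original: str, back_translated: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the real metric."""
    return SequenceMatcher(None, original.lower(), back_translated.lower()).ratio()


def filter_graphs(
    pairs: Iterable[Tuple[str, str]],
    graph_to_text: Callable[[str], str],
    threshold: float = 0.8,  # hypothetical cut-off, not the project's value
) -> List[Tuple[str, str]]:
    """Keep (sentence, graph) pairs whose back-translation resembles the sentence.

    graph_to_text is any callable that turns a serialized graph into a
    sentence; in the pipeline this role is played by the back-translation model.
    """
    kept = []
    for sentence, graph in pairs:
        regenerated = graph_to_text(graph)
        if similarity(sentence, regenerated) >= threshold:
            kept.append((sentence, graph))
    return kept
```

Passing the back-translation function as a parameter keeps the sketch independent of any particular model, so the same loop can be exercised with a stub during testing.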

The graph bank

The filtering procedure allowed us to retain 26,588 of the 125,683 parsed sentences. The high number of discarded sentences is due to noise in the input sentences, mainly caused by OCR errors. The filtered AMR graph bank is downloadable from DOI

Statistics

Some statistics of our graph bank can be found in the following table:

| Module | Lang | # parsed sentences | # triples |
|---|---|---|---|
| Encyclopedic | English | 10,015 | 298,369 |
| Books | English | 8,443 | 198,862 |
| Periodicals | English | 6,398 | 109,623 |
| Child (Pilots) | English | 1,227 | 24,049 |
| Meetups (Pilots) | English | 464 | 13,903 |
| MusicBO (Pilots) | English | 41 | 1,297 |
| Overall | English | 26,588 | 646,103 |
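The totals in the table can be cross-checked with a few lines of Python. The per-module figures and the number of parsed sentences before filtering are copied from this document; the variable names are only illustrative.

```python
# (module, parsed sentences, triples) as reported in the table above
rows = [
    ("Encyclopedic", 10015, 298369),
    ("Books", 8443, 198862),
    ("Periodicals", 6398, 109623),
    ("Child (Pilots)", 1227, 24049),
    ("Meetups (Pilots)", 464, 13903),
    ("MusicBO (Pilots)", 41, 1297),
]
parsed_before_filtering = 125683  # sentences parsed before the filtering step

total_sentences = sum(sentences for _, sentences, _ in rows)
total_triples = sum(triples for _, _, triples in rows)

# Share of parsed sentences that survive filtering, and graph size on average.
retention_rate = total_sentences / parsed_before_filtering
triples_per_sentence = total_triples / total_sentences

print(f"sentences kept: {total_sentences}")
print(f"triples: {total_triples}")
print(f"retention rate: {retention_rate:.1%}")
print(f"average triples per sentence: {triples_per_sentence:.1f}")
```

The sums match the Overall row, and roughly one parsed sentence in five is retained after filtering.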

Installation

git clone https://github.com/polifonia-project/Polifonia-Knowledge-Extractor

cd Polifonia-Knowledge-Extractor

pip install -r requirements.txt

How to use

The search script accepts the following parameters:

--propbank_predicate: the PropBank predicate to search for in the graph bank. This parameter is required. The complete list of PropBank predicates, together with their descriptions and role sets, can be found here.

--save_to_file: indicates whether the results of the search should be saved to a TSV file (located in the out folder).

The results

The results of each search can be saved to a file inside the out folder. The filename is the PropBank predicate used for the search, followed by the .tsv extension. The output file has 5 fields:

  1. document id: the ID of the document from which the information has been extracted
  2. sentence id: the ID of the sentence from which the information has been extracted
  3. number of triples: the number of triples (AMR node name, AMR relation, AMR node name) extracted from the sentence
  4. triples: the extracted triples
  5. sent: the sentence from which the triples have been extracted
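A saved result file can be loaded for further processing along these lines. This is a sketch under assumptions: the column keys in FIELDS are invented names for the five fields listed above, the columns are assumed to appear in that order, and since it is not stated whether the file starts with a header row, any row whose triple count is not a number is skipped. The triples field is kept as a raw string because its serialization is not described here.

```python
import csv

# Hypothetical keys for the five fields, in the order listed above.
FIELDS = ["document_id", "sentence_id", "n_triples", "triples", "sent"]


def read_results(path):
    """Read a tab-separated result file into a list of dictionaries."""
    rows = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for record in reader:
            if len(record) != len(FIELDS):
                continue  # skip malformed or empty lines
            row = dict(zip(FIELDS, record))
            if not row["n_triples"].isdigit():
                continue  # probably a header row
            row["n_triples"] = int(row["n_triples"])
            rows.append(row)
    return rows
```

For example, after a search for play-11 with saving enabled, read_results("out/play-11.tsv") would return one dictionary per matching sentence.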

Example

As an example, we can use the play-11 PropBank predicate:

python search.py --propbank_predicate play-11

With this search, 311 sentences are selected, and we can easily discover that Antonio_Buonomo plays an instrument, the clarinet, or that Georges Barrère played in the New York Symphony Orchestra.
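Conceptually, a predicate search over the graph bank amounts to selecting the sentences whose triples mention the predicate as a node. The sketch below illustrates this on toy data; it is not the code of search.py. The in-memory layout of the bank, the function name, the sentence ids, and the role labels in the toy triples are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (AMR node name, AMR relation, AMR node name)


def search_predicate(
    bank: Dict[str, List[Triple]], predicate: str
) -> Dict[str, List[Triple]]:
    """Return, per sentence id, the triples that mention the predicate as a node."""
    hits = {}
    for sentence_id, triples in bank.items():
        matching = [t for t in triples if predicate in (t[0], t[2])]
        if matching:
            hits[sentence_id] = matching
    return hits


# Toy bank: ids and role labels are placeholders, not real graph bank content.
toy_bank = {
    "s1": [
        ("play-11", ":ARG0", "Antonio_Buonomo"),
        ("play-11", ":ARG1", "clarinet"),
    ],
    "s2": [("write-01", ":ARG0", "composer")],
}

for sentence_id, triples in search_predicate(toy_bank, "play-11").items():
    print(sentence_id, triples)
```

Only the first toy sentence is selected here, because it is the only one whose triples contain the play-11 node.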

[Figure: pipeline]
