AclAnthologyCorpus
In July 2012, on the occasion of the 50th anniversary of the Association for Computational Linguistics (ACL), a special workshop was devoted to “issues related to preserving, analysing and exploiting the scientific heritage of the ACL”. Spurred by this event, a community effort emerged, aiming to provide the full ACL Anthology as a high-quality corpus with rich markup, following the TEI P5 guidelines. The goals of this initiative are threefold: (a) to provide a shared resource for experimentation on scientific text; (b) to serve as a basis for advanced search over the ACL Anthology, based on textual content and citations; and, by combining the aforementioned goals, (c) to present a showcase of the benefits of natural language processing to a broader audience.
This community effort, dubbed the ACL Anthology Corpus (AAC), continues the tradition of projects like the ACL Anthology Reference Corpus (ACL ARC), the ACL Anthology Network (AAN), and the ACL Anthology Searchbench (ACL ASB).
As of mid-2012 at least, AAC is in its early stages in many respects. This page aims to provide a stable on-line access point to AAC-related information, launched at the time of the 2012 Annual Meeting of the ACL.
For the time being, this page has been redirected into the wiki of the DELPH-IN community (whose members stand behind the ACL Anthology Searchbench). However, for long-term validity, please always refer to the canonical URL for the AAC initiative: http://www.delph-in.net/aac/.
The main idea behind what was called a contributed task at ACL 2012 was to combine techniques from Optical Character Recognition (OCR) with ‘native’ text stream extraction from born-digital PDF documents, so that the alternate approaches complement each other, with the aim of creating a rich XML format. This method would rely exclusively on OCR only where no born-digital PDFs are available, which in the case of the ACL Anthology mostly means papers published before the year 2000.
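The choice of extraction sources described above can be sketched roughly as follows. This is only an illustration, not part of the AAC tooling: the `Paper` record, its `born_digital` flag, and the source labels `"native_text"` and `"ocr"` are all hypothetical names introduced here, and how a PDF is actually judged to be born-digital is left open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paper:
    """Hypothetical record for one Anthology paper (illustration only)."""
    paper_id: str        # any identifier; the format is not prescribed here
    born_digital: bool   # assumed flag: True if the PDF has a native text stream


def extraction_sources(paper: Paper) -> List[str]:
    """Return the extraction approaches that could be combined for one paper."""
    if paper.born_digital:
        # Both sources exist, so they can complement each other.
        return ["native_text", "ocr"]
    # No native text stream (e.g. a scanned paper): OCR is the only source.
    return ["ocr"]
```

For a scanned paper, `extraction_sources(Paper("example-scan", born_digital=False))` yields only the OCR source, while a born-digital paper yields both. The flag is used instead of a publication-year cutoff because the text says only that scanned papers are mostly, not always, from before 2000.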
Details on specific sub-tasks and examples of challenges to high-quality text and format extraction from the ACL Anthology, for the time being, remain available from the ACL 2012 pages. However, we expect to migrate all relevant information and files to a more permanent communication infrastructure (see below), hence please monitor this page for updates.
svn co http://svn.delph-in.net/aac/trunk aac