
Commit

Bulk metadata corrections 2025-02-02 (#4541)
* Processed metadata corrections (closes #4538)

* Processed metadata corrections (closes #4537)

* Processed metadata corrections (closes #4536)

* Processed metadata corrections (closes #4534)

* Processed metadata corrections (closes #4533)

* Processed metadata corrections (closes #4532)

* Processed metadata corrections (closes #4531)

* Processed metadata corrections (closes #4530)

* Processed metadata corrections (closes #4526)

* Processed metadata corrections (closes #4525)

* Processed metadata corrections (closes #4524)

* Processed metadata corrections (closes #4522)

* Processed metadata corrections (closes #4520)

* Processed metadata corrections (closes #4519)

* Processed metadata corrections (closes #4518)

* Processed metadata corrections (closes #4517)

* Processed metadata corrections (closes #4515)

* Processed metadata corrections (closes #4514)

* Processed metadata corrections (closes #4513)

* Processed metadata corrections (closes #4511)

* Processed metadata corrections (closes #4510)

* Processed metadata corrections (closes #4508)

* Processed metadata corrections (closes #4502)

* Processed metadata corrections (closes #4501)

* Processed metadata corrections (closes #4500)

* Processed metadata corrections (closes #4499)

* Processed metadata corrections (closes #4495)

* Processed metadata corrections (closes #4493)

* Processed metadata corrections (closes #4490)

* Processed metadata corrections (closes #4483)

* Processed metadata corrections (closes #4481)

* Processed metadata corrections (closes #4479)

* Processed metadata corrections (closes #4478)

* Processed metadata corrections (closes #4477)

* Processed metadata corrections (closes #4476)

* Processed metadata corrections (closes #4473)

* Processed metadata corrections (closes #4471)

* Processed metadata corrections (closes #4470)

* Processed metadata corrections (closes #4468)

* Processed metadata corrections (closes #4466)

* Processed metadata corrections (closes #4464)

* Processed metadata corrections (closes #4463)

* Processed metadata corrections (closes #4459)

* Processed metadata corrections (closes #4451)

* Processed metadata corrections (closes #4449)

* Processed metadata corrections (closes #4448)

* Processed metadata corrections (closes #4403)

* Processed metadata corrections (closes #4324)

* Processed metadata corrections (closes #4463)

* Processed metadata corrections (closes #4463)

* Script changes:
   * Use existing branch if present
   * Handle frontmatter
mjpost authored Feb 3, 2025
1 parent 723825b commit c2fcfd6
Showing 29 changed files with 120 additions and 105 deletions.
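For context on the "Use existing branch if present" change listed in the commit message, here is a minimal standalone sketch of the branch-reuse pattern the script now follows, using GitPython; the repository path and base branch name below are placeholders for illustration, not the script's actual configuration:

```python
import sys
from datetime import datetime

from git import Repo  # GitPython

# Hypothetical local checkout of the anthology repository (illustration only).
repo = Repo("/path/to/acl-anthology")

base_branch = "master"  # assumed base branch name
new_branch_name = f"bulk-corrections-{datetime.now().strftime('%Y-%m-%d')}"

# If the branch exists, use it, else create it from the base branch.
if new_branch_name in repo.heads:
    ref = repo.heads[new_branch_name]
    print(f"Using existing branch {new_branch_name}", file=sys.stderr)
else:
    ref = repo.create_head(new_branch_name, base_branch)
    print(f"Created branch {new_branch_name} from {base_branch}", file=sys.stderr)

# Remember the current branch so the script can switch back later, then check out.
current_branch = repo.head.reference
ref.checkout()
```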
33 changes: 19 additions & 14 deletions bin/process_bulk_metadata.py
@@ -109,9 +109,12 @@ def _apply_changes_to_xml(self, xml_repo_path, anthology_id, changes):

_, volume_id, paper_id = deconstruct_anthology_id(anthology_id)

paper_node = tree.getroot().find(
f"./volume[@id='{volume_id}']/paper[@id='{paper_id}']"
)
if paper_id == "0":
paper_node = tree.getroot().find(f"./volume[@id='{volume_id}']/meta")
else:
paper_node = tree.getroot().find(
f"./volume[@id='{volume_id}']/paper[@id='{paper_id}']"
)
if paper_node is None:
raise Exception(f"-> Paper not found in XML file: {xml_repo_path}")

@@ -137,13 +140,15 @@ def _apply_changes_to_xml(self, xml_repo_path, anthology_id, changes):
real_ids = set()
for author in changes["authors"]:
id_ = author.get("id", None)

author_tag = "editor" if paper_id == "0" else "author"
if id_:
existing_author = paper_node.find(f"author[@id='{id_}']")
existing_author = paper_node.find(f"{author_tag}[@id='{id_}']")
if existing_author is not None:
real_ids.add(id_)

# remove existing author nodes
for author_node in paper_node.findall("author"):
for author_node in paper_node.findall(author_tag):
paper_node.remove(author_node)

prev_sibling = paper_node.find("title")
@@ -155,7 +160,7 @@ def _apply_changes_to_xml(self, xml_repo_path, anthology_id, changes):
attrib["id"] = author["id"]
# create author_node and add as sibling after insertion_point
author_node = make_simple_element(
"author", attrib=attrib, parent=paper_node, sibling=prev_sibling
author_tag, attrib=attrib, parent=paper_node, sibling=prev_sibling
)
prev_sibling = author_node
for key in ["first", "last", "affiliation", "variant"]:
@@ -198,15 +203,14 @@ def process_metadata_issues(
today = datetime.now().strftime("%Y-%m-%d")
new_branch_name = f"bulk-corrections-{today}"

# Check if branch already exists, and if so, remove it
# If the branch exists, use it, else create it
if new_branch_name in self.local_repo.heads:
if verbose:
print(f"Deleting existing branch {new_branch_name}", file=sys.stderr)
self.local_repo.delete_head(new_branch_name, force=True)

# Create new branch
ref = self.local_repo.create_head(new_branch_name, base_branch)
print(f"Created branch {new_branch_name} from {base_branch}", file=sys.stderr)
ref = self.local_repo.heads[new_branch_name]
print(f"Using existing branch {new_branch_name}", file=sys.stderr)
else:
# Create new branch
ref = self.local_repo.create_head(new_branch_name, base_branch)
print(f"Created branch {new_branch_name} from {base_branch}", file=sys.stderr)

# store the current branch
current_branch = self.local_repo.head.reference
@@ -292,6 +296,7 @@ def process_metadata_issues(
except Exception as e:
if verbose:
print(e, file=sys.stderr)
continue

if tree:
indent(tree.getroot())
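The "Handle frontmatter" change above routes corrections for paper id "0" to the volume's <meta> element and uses <editor> instead of <author> tags. A simplified sketch of that lookup, with helper names (find_metadata_node, people_tag) invented here for illustration rather than taken from the script:

```python
import xml.etree.ElementTree as ET


def find_metadata_node(tree: ET.ElementTree, volume_id: str, paper_id: str) -> ET.Element:
    """Locate the element that a correction applies to.

    Paper id "0" denotes the volume frontmatter, which lives in the
    volume's <meta> element; any other id maps to a <paper> element.
    """
    root = tree.getroot()
    if paper_id == "0":
        node = root.find(f"./volume[@id='{volume_id}']/meta")
    else:
        node = root.find(f"./volume[@id='{volume_id}']/paper[@id='{paper_id}']")
    if node is None:
        raise Exception(f"Paper not found for {volume_id}/{paper_id}")
    return node


def people_tag(paper_id: str) -> str:
    """Frontmatter lists editors rather than authors, so pick the tag accordingly."""
    return "editor" if paper_id == "0" else "author"
```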
4 changes: 2 additions & 2 deletions data/xml/2021.mtsummit.xml
@@ -55,9 +55,9 @@
<author><first>Aakash</first><last>Banerjee</last></author>
<author><first>Aditya</first><last>Jain</last></author>
<author><first>Shivam</first><last>Mhaskar</last></author>
<author><first>Sourabh</first><last>Dattatray Deoghare</last></author>
<author><first>Sourabh</first><last>Deoghare</last></author>
<author><first>Aman</first><last>Sehgal</last></author>
<author><first>Pushpak</first><last>Bhattacharya</last></author>
<author><first>Pushpak</first><last>Bhattacharyya</last></author>
<pages>35-47</pages>
<url hash="571ce526">2021.mtsummit-research.4</url>
<abstract>In this paper and we explore different techniques of overcoming the challenges of low-resource in Neural Machine Translation (NMT) and specifically focusing on the case of English-Marathi NMT. NMT systems require a large amount of parallel corpora to obtain good quality translations. We try to mitigate the low-resource problem by augmenting parallel corpora or by using transfer learning. Techniques such as Phrase Table Injection (PTI) and back-translation and mixing of language corpora are used for enhancing the parallel data; whereas pivoting and multilingual embeddings are used to leverage transfer learning. For pivoting and Hindi comes in as assisting language for English-Marathi translation. Compared to baseline transformer model and a significant improvement trend in BLEU score is observed across various techniques. We have done extensive manual and automatic and qualitative evaluation of our systems. Since the trend in Machine Translation (MT) today is post-editing and measuring of Human Effort Reduction (HER) and we have given our preliminary observations on Translation Edit Rate (TER) vs. BLEU score study and where TER is regarded as a measure of HER.</abstract>
10 changes: 7 additions & 3 deletions data/xml/2022.lrec.xml
@@ -46,7 +46,11 @@
<author><first>Serge</first><last>Gladkoff</last></author>
<author><first>Lifeng</first><last>Han</last></author>
<pages>13–21</pages>
<abstract>Traditional automatic evaluation metrics for machine translation have been widely criticized by linguists due to their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have always been carried out in real industry setting by both clients and translation service providers (TSPs). However, traditional human translation quality evaluations are costly to perform and go into great linguistic detail, raise issues as to inter-rater reliability (IRR) and are not designed to measure quality of worse than premium quality translations. In this work, we introduce <b>HOPE</b>, a task-oriented and <i><b>h</b></i>uman-centric evaluation framework for machine translation output based <i><b>o</b></i>n professional <i><b>p</b></i>ost-<i><b>e</b></i>diting annotations. It contains only a limited number of commonly occurring error types, and uses a scoring model with geometric progression of error penalty points (EPPs) reflecting error severity level to each translation unit. The initial experimental work carried out on English-Russian language pair MT outputs on marketing content type of text from highly technical domain reveals that our evaluation framework is quite effective in reflecting the MT output quality regarding both overall system-level performance and segment-level transparency, and it increases the IRR for error type interpretation. The approach has several key advantages, such as ability to measure and compare less than perfect MT output from different systems, ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR. Our experimental data is available at <url>https://github.com/lHan87/HOPE</url>.</abstract>
<abstract>Traditional automatic evaluation metrics for machine translation have been widely criticized by linguists due to their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have always been carried out in real industry setting by both clients and translation service providers (TSPs). However, traditional human translation quality evaluations are costly to perform and go into great linguistic detail, raise issues as to inter-rater reliability (IRR) and are not designed to measure quality of worse than premium quality translations. In this work, we introduce <b>HOPE</b>, a task-oriented and <i>
<b>h</b></i>uman-centric evaluation framework for machine translation output based <i>
<b>o</b></i>n professional <i>
<b>p</b></i>ost-<i>
<b>e</b></i>diting annotations. It contains only a limited number of commonly occurring error types, and uses a scoring model with geometric progression of error penalty points (EPPs) reflecting error severity level to each translation unit. The initial experimental work carried out on English-Russian language pair MT outputs on marketing content type of text from highly technical domain reveals that our evaluation framework is quite effective in reflecting the MT output quality regarding both overall system-level performance and segment-level transparency, and it increases the IRR for error type interpretation. The approach has several key advantages, such as ability to measure and compare less than perfect MT output from different systems, ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR. Our experimental data is available at <url>https://github.com/lHan87/HOPE</url>.</abstract>
<url hash="a50ccfaa">2022.lrec-1.2</url>
<bibkey>gladkoff-han-2022-hope</bibkey>
<pwccode url="https://github.com/lhan87/hope" additional="false">lhan87/hope</pwccode>
@@ -5034,9 +5038,9 @@
</paper>
<paper id="404">
<title>Comparing Annotated Datasets for Named Entity Recognition in <fixed-case>E</fixed-case>nglish Literature</title>
<author><first>Rositsa</first><last>Ivanova</last></author>
<author><first>Marieke</first><last>van Erp</last></author>
<author><first>Rositsa V.</first><last>Ivanova</last></author>
<author><first>Sabrina</first><last>Kirrane</last></author>
<author><first>Marieke</first><last>van Erp</last></author>
<pages>3788–3797</pages>
<abstract>The growing interest in named entity recognition (NER) in various domains has led to the creation of different benchmark datasets, often with slightly different annotation guidelines. To better understand the different NER benchmark datasets for the domain of English literature and their impact on the evaluation of NER tools, we analyse two existing annotated datasets and create two additional gold standard datasets. Following on from this, we evaluate the performance of two NER tools, one domain-specific and one general-purpose NER tool, using the four gold standards, and analyse the sources for the differences in the measured performance. Our results show that the performance of the two tools varies significantly depending on the gold standard used for the individual evaluations.</abstract>
<url hash="67ec8d36">2022.lrec-1.404</url>
2 changes: 1 addition & 1 deletion data/xml/2022.privatenlp.xml
@@ -20,7 +20,7 @@
<bibkey>privatenlp-2022-privacy</bibkey>
</frontmatter>
<paper id="1">
<title>Differential Privacy in Natural Language Processing The Story So Far</title>
<title>Differential Privacy in Natural Language Processing: The Story So Far</title>
<author><first>Oleksandra</first><last>Klymenko</last></author>
<author><first>Stephen</first><last>Meisenbacher</last></author>
<author><first>Florian</first><last>Matthes</last></author>
6 changes: 3 additions & 3 deletions data/xml/2022.wanlp.xml
@@ -649,9 +649,9 @@
</paper>
<paper id="53">
<title>Building an Ensemble of Transformer Models for <fixed-case>A</fixed-case>rabic Dialect Classification and Sentiment Analysis</title>
<author><first>Abdullah Salem</first><last>Khered</last><affiliation>The University of Manchester</affiliation></author>
<author><first>Ingy Yasser Hassan Abdou</first><last>Abdelhalim</last><affiliation>The University of Manchester</affiliation></author>
<author><first>Riza</first><last>Batista-Navarro</last><affiliation>Department of Computer Science, The University of Manchester</affiliation></author>
<author><first>Abdullah</first><last>Khered</last></author>
<author><first>Ingy Abdelhalim</first><last>Abdelhalim</last></author>
<author><first>Riza</first><last>Batista-Navarro</last></author>
<pages>479-484</pages>
<abstract>In this paper, we describe the approaches we developed for the Nuanced Arabic Dialect Identification (NADI) 2022 shared task, which consists of two subtasks: the identification of country-level Arabic dialects and sentiment analysis. Our team, UniManc, developed approaches to the two subtasks which are underpinned by the same model: a pre-trained MARBERT language model. For Subtask 1, we applied undersampling to create versions of the training data with a balanced distribution across classes. For Subtask 2, we further trained the original MARBERT model for the masked language modelling objective using a NADI-provided dataset of unlabelled Arabic tweets. For each of the subtasks, a MARBERT model was fine-tuned for sequence classification, using different values for hyperparameters such as seed and learning rate. This resulted in multiple model variants, which formed the basis of an ensemble model for each subtask. Based on the official NADI evaluation, our ensemble model obtained a macro-F1-score of 26.863, ranking second overall in the first subtask. In the second subtask, our ensemble model also ranked second, obtaining a macro-F1-PN score (macro-averaged F1-score over the Positive and Negative classes) of 73.544.</abstract>
<url hash="caaa9131">2022.wanlp-1.53</url>
2 changes: 1 addition & 1 deletion data/xml/2023.arabicnlp.xml
@@ -407,7 +407,7 @@
<author><first>Amr</first><last>Keleg</last></author>
<author><first>Walid</first><last>Magdy</last></author>
<pages>385-398</pages>
<abstract>Automatic Arabic Dialect Identification (ADI) of text has gained great popularity since it was introduced in the early 2010s. Multiple datasets were developed, and yearly shared tasks have been running since 2018. However, ADI systems are reported to fail in distinguishing between the micro-dialects of Arabic. We argue that the currently adopted framing of the ADI task as a single-label classification problem is one of the main reasons for that. We highlight the limitation of the incompleteness of the Dialect labels and demonstrate how it impacts the evaluation of ADI systems. A manual error analysis for the predictions of an ADI, performed by 7 native speakers of different Arabic dialects, revealed that <tex-math>\approx</tex-math> 67% of the validated errors are not true errors. Consequently, we propose framing ADI as a multi-label classification task and give recommendations for designing new ADI datasets.</abstract>
<abstract>Automatic Arabic Dialect Identification (ADI) of text has gained great popularity since it was introduced in the early 2010s. Multiple datasets were developed, and yearly shared tasks have been running since 2018. However, ADI systems are reported to fail in distinguishing between the micro-dialects of Arabic. We argue that the currently adopted framing of the ADI task as a single-label classification problem is one of the main reasons for that. We highlight the limitation of the incompleteness of the Dialect labels and demonstrate how it impacts the evaluation of ADI systems. A manual error analysis for the predictions of an ADI, performed by 7 native speakers of different Arabic dialects, revealed that <tex-math>\approx</tex-math> 66% of the validated errors are not true errors. Consequently, we propose framing ADI as a multi-label classification task and give recommendations for designing new ADI datasets.</abstract>
<url hash="396853b2">2023.arabicnlp-1.31</url>
<bibkey>keleg-magdy-2023-arabic</bibkey>
<doi>10.18653/v1/2023.arabicnlp-1.31</doi>