Source code for Extracting the author of news stories with Machine Learning and DOM-based segmentation
This repository contains
- Source code for Web2Text, including additional features specific to the Author Extraction task
- Feature representations for +70,000 news articles from All The News under
public/train_and_test
(23.39GB), in CSV format. The corresponding HTML files were not uploaded because it contains copyrighted material - Train Model Task
- Weights from a pre-trained model with the above dataset (under
public/trained_model_all_the_news
) - Inference Task
-
Install Scala and SBT. The code was tested with SBT 1.3.6.
-
Install Python 3 with Tensorflow (tested with 2.1.0), Keras, NumPy, sklearn, HuggingFace Transformers and wget. Running an Anaconda instance is recommended.
./extract_page_features.sh <html_file.html>
(This will generate a CSV file)
Extract feature representations of all HTML files located in public/html
:
./extract_corpus_features.sh
Both single-page and Corpus feature extraction will generate CSV files and store them under public/train_and_test
.
Both the page and the corpus feature extraction generate a file named /public/DOM/dom.html
which contains a visual DOM tree. This file is used for troubleshooting during implementation and on inference time.
Train the model with all the feature representations located in public/train_and_test
:
./train_model.sh
The true labels are expected to be in public/authors.csv
. The syntax of this file is URL Hash; Author name
.
This generates model files located under public/trained_model_all_the_news
.
./inference_from_html.sh <html_file.html>
./inference_from_csv.sh <csv_file.csv>
./inference_from_url.sh <URL>
All inference scripts are slow to run (about 3 minutes) due to the fact that there are several steps involved (load weights, load Tensorflow and BERT libraries) and a suboptimal switch back and forth between Scala and Python.