AuthorExtractor

Source code for Extracting the author of news stories with Machine Learning and DOM-based segmentation

Introduction

This repository contains:

Source code for Web2Text, including additional features specific to the Author Extraction task
Feature representations for more than 70,000 news articles from All The News under public/train_and_test (23.39 GB), in CSV format. The corresponding HTML files were not uploaded because they contain copyrighted material.
Train Model Task
Weights from a pre-trained model with the above dataset (under public/trained_model_all_the_news)
Inference Task

Installation

Install Scala and SBT. The code was tested with SBT 1.3.6.
Install Python 3 with TensorFlow (tested with 2.1.0), Keras, NumPy, scikit-learn (sklearn), HuggingFace Transformers and wget. Using an Anaconda environment is recommended.
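
Before running any recipe, it can help to confirm that the Python packages listed above are importable. The following is a minimal sketch and not part of the repository: the helper name `missing_modules` is hypothetical, and the module list is an assumption derived from the installation notes (`sklearn` is the import name of scikit-learn).

```python
import importlib.util

# Import names assumed from the installation notes above; the repository's
# own scripts may import a different or larger set of modules.
REQUIRED_MODULES = ["tensorflow", "keras", "numpy", "sklearn", "transformers", "wget"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    else:
        print("All listed Python packages were found.")
```

This only checks that the modules can be located; it does not verify versions, so TensorFlow 2.1.0 compatibility still has to be checked separately.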

Usage

Recipe: Extract Page Features of a single local HTML file

./extract_page_features.sh <html_file.html> (This will generate a CSV file)

Recipe: Extract Page Features of an entire Corpus

Extract feature representations of all HTML files located in public/html:

./extract_corpus_features.sh

Both single-page and Corpus feature extraction will generate CSV files and store them under public/train_and_test.
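
After an extraction run, a quick sanity check is to count the rows and columns of each generated CSV file. The sketch below is illustrative only: the function name `summarize_feature_files` is hypothetical, and it assumes comma-delimited, UTF-8 files. It makes no assumption about what the columns mean or whether the first row is a header, so every row is counted.

```python
import csv
from pathlib import Path


def summarize_feature_files(directory):
    """Map each CSV file name in `directory` to (row count, columns in first row)."""
    summary = {}
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            first_row = next(reader, None)
            width = len(first_row) if first_row is not None else 0
            # Stream the remaining rows instead of loading the whole file,
            # since the full dataset is described as roughly 23 GB.
            rows = (1 if first_row is not None else 0) + sum(1 for _ in reader)
        summary[path.name] = (rows, width)
    return summary


# Example usage (path taken from the text above):
# for name, (rows, cols) in summarize_feature_files("public/train_and_test").items():
#     print(name, rows, cols)
```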

Both the page and the corpus feature extraction generate a file named /public/DOM/dom.html, which contains a visual DOM tree. This file is used for troubleshooting during implementation and at inference time.

Recipe: Train Model

Train the model with all the feature representations located in public/train_and_test:

./train_model.sh

The true labels are expected to be in public/authors.csv. Each line of this file has the form URL Hash; Author name.
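
To illustrate the label file layout, the sketch below reads a semicolon-separated file into a dictionary. It is an assumption-laden example, not the repository's loader: the function name `load_author_labels` is hypothetical, and it assumes one record per line, no header row, UTF-8 encoding, and that only the first semicolon separates the hash from the name. How the URL hash is computed is not described here, so the sketch treats it as an opaque string.

```python
def load_author_labels(path):
    """Read 'URL Hash; Author name' lines into a {url_hash: author_name} dict."""
    labels = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first ';' only, so author names may contain ';'.
            url_hash, separator, author = line.partition(";")
            if not separator:
                raise ValueError(f"Missing ';' separator in line: {line!r}")
            labels[url_hash.strip()] = author.strip()
    return labels


# Example usage (path taken from the text above):
# labels = load_author_labels("public/authors.csv")
# print(len(labels), "labelled articles")
```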

This generates model files located under public/trained_model_all_the_news.

Recipe: Inference from a local HTML file

./inference_from_html.sh <html_file.html>

Recipe: Inference from a local CSV file

./inference_from_csv.sh <csv_file.csv>

Recipe: Inference from a URL

./inference_from_url.sh <URL>

All inference scripts are slow to run (about 3 minutes) because several steps are involved (loading the weights, loading the TensorFlow and BERT libraries) and because execution switches back and forth between Scala and Python in a suboptimal way.
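
Since every invocation pays the full startup cost, processing several pages takes roughly that cost multiplied by the number of pages. The wrapper below is a small sketch for running a script once per input and timing each run. The names `run_inference` and `run_batch` are hypothetical, and the format of the scripts' output is not documented here, so the sketch simply returns whatever is printed to standard output.

```python
import subprocess
import time


def run_inference(script, target):
    """Run one inference script on one input; return (stdout, elapsed seconds)."""
    started = time.perf_counter()
    # check=True raises CalledProcessError if the script exits with an error.
    result = subprocess.run(
        [script, target], capture_output=True, text=True, check=True
    )
    return result.stdout, time.perf_counter() - started


def run_batch(script, targets):
    """Run the script sequentially for each target, keyed by target."""
    return {target: run_inference(script, target) for target in targets}


# Example usage (script name taken from the recipes above; the HTML file
# names are placeholders):
# results = run_batch("./inference_from_html.sh", ["page1.html", "page2.html"])
# for target, (output, seconds) in results.items():
#     print(f"{target}: {seconds:.0f}s")
```

This does not make inference faster; it only makes the per-page cost visible when several inputs are processed in a row.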
