
Summarizing scientific articles

Comparing the ability of various ML/AI summarization methods on scientific articles. (Work in progress.)

Check out examples of summarizing the paper "The origins and genetic interactions of KRAS mutations are allele- and tissue-specific" in the examples directory.

Or you can compare the results of the different summary methods using the Streamlit web app: Streamlit App

The purpose of this project is to play around with various AI and ML summarization methods. To that end, I have created a system by which a scientific article is downloaded, parsed, and fed through various summarization models under different configurations. I chose scientific articles as the medium because I thought they would present an interesting, novel, and diverse set of test cases. Also, scientific articles follow standard conventions, such as the Abstract and the Results sub-section titles, that make scoring a summary's accuracy easier.

At the moment, I have a system for parsing Nature Communications articles from their webpages and summarizing a paper with the three methods listed below. My next step is to create a structured method for saving the results for easy comparison. I will run multiple articles through the methods with various parameters for the models. I may also standardize the system/API for getting a parsed article so that I can create parsing systems for multiple journals (though this is a low priority).

Built with Python, Streamlit, and PyTorch; code quality is checked with pre-commit, black, mypy, and pydocstyle.

Entrypoints

The article-parsing and summarization functions are exposed as CLI commands through the summarize.py script.

Summarizing a single scientific article

Here is an example of using the CLI to summarize a single article.

./summarize.py summarize "https://www.nature.com/articles/s41467-021-22125-z" "TEXTRANK"
#> 'The origins and genetic interactions of KRAS mutations are allele- and tissue-specific'
#>   summarization method: TEXTRANK
#> ========================================================================================
#>
#> Introduction
#> ------------
#> Importantly, the activating alleles found in KRAS vary ...
#> ...

There are some other options for this command that you can peruse using

./summarize.py summarize --help

Generate examples

I made a specific command to generate the example summarizations of my paper "The origins and genetic interactions of KRAS mutations are allele- and tissue-specific". These examples are available in the examples directory. The following command runs the paper through each summarization method with some specific configurations.

./summarize.py make-examples

Run the summarization pipeline for all URLs and configurations (work-in-progress)

This is still a work in progress, but this command runs a pipeline that summarizes many URLs with different summarization model configurations. The output is saved as pickle files so that the results can be re-read into Python and displayed in an interactive application for easier comparison.

./summarize.py summarize-all
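
Since the results are plain pickle files, they can be read back with Python's standard pickle module. Here is a minimal sketch, assuming a hypothetical pipeline-results/ directory of *.pkl files (adjust to wherever summarize-all actually writes its output):

from pathlib import Path
import pickle

# Hypothetical output directory; change this to the pipeline's real location.
output_dir = Path("pipeline-results")

results = []
for pkl_file in sorted(output_dir.glob("*.pkl")):
    with open(pkl_file, "rb") as f:
        results.append(pickle.load(f))

print(f"Loaded {len(results)} summarization results.")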

Parse article

This command just parses an article and is useful for checking if an article's webpage is processed properly.

./summarize.py parse-article "https://www.nature.com/articles/s41467-021-22125-z"

Streamlit app

This project has a web application built with Streamlit to make comparing two different summaries easier. It is available online, but you can also launch the Streamlit app locally using the following command:

streamlit run app.py
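
For illustration only, the comparison idea boils down to something like the sketch below; this is not the actual app.py, and the placeholder summaries stand in for real pipeline results.

import streamlit as st

# Placeholder data; the real app loads actual summarization results.
summaries = {
    "TEXTRANK": "TextRank summary would appear here...",
    "BART": "BART summary would appear here...",
    "GPT-3": "GPT-3 summary would appear here...",
}

left, right = st.columns(2)
with left:
    method_a = st.selectbox("First method", list(summaries), key="a")
    st.write(summaries[method_a])
with right:
    method_b = st.selectbox("Second method", list(summaries), key="b")
    st.write(summaries[method_b])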

Setup

Because of all the ML/AI libraries required for this project, I used conda to manage dependencies. The environment was created using the following commands:

conda env create --prefix ./.venv -f environment.yaml
conda activate ./.venv

You need an API key to use OpenAI. One can be created by making an account here, logging in, and going to "Personal/View API Keys". Make a file called ".env" and add your API key under the name OPENAI_API_KEY. It should look something like this:

OPENAI_API_KEY="your-key-here"

While ".env" is in the ".gitignore", it is worth double-checking that this file is not being tracked by git.


To-Do

  • break down the Results section into sub-sections - it will make it easier to read the summary.
  • look into different options from HuggingFace (more info here)
    • and other parameters for the HuggingFace models
  • system for:
    • different model configurations
    • multiple article URLs
    • structured output for later display and comparison of results

ML/AI methods

Model parameters

Each model has various parameters for tuning its behavior and output. Below are descriptions of the parameters I have included in my experimentation.

Textrank
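
As a rough illustration of the kind of knobs TextRank exposes, here is a minimal sketch using the summa package; this project may use a different TextRank implementation, and the text is a placeholder.

from summa import summarizer

text = "Long article text goes here..."

# `ratio` sets the fraction of sentences kept; `words` can instead cap
# the summary length in words.
summary = summarizer.summarize(text, ratio=0.2)
print(summary)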

BART
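
As a rough illustration, a BART summarizer can be run through the Hugging Face transformers pipeline as sketched below; the checkpoint and parameter values are illustrative, not necessarily the ones used in this project.

from transformers import pipeline

# Summarization pipeline backed by a BART checkpoint fine-tuned for summarization.
bart = pipeline("summarization", model="facebook/bart-large-cnn")

text = "Long article section goes here..."  # must fit the model's maximum input length

# `min_length` and `max_length` bound the number of generated tokens.
result = bart(text, min_length=30, max_length=150)
print(result[0]["summary_text"])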

GPT-3

https://beta.openai.com/docs/api-reference/completions/create
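
As a rough illustration of the completions endpoint linked above, here is a minimal sketch using the pre-v1 openai client; the prompt and parameter values are placeholders, not this project's actual configuration.

import openai  # expects OPENAI_API_KEY to be configured as described in Setup

prompt = "Summarize the following section of a scientific article:\n\n..."

response = openai.Completion.create(
    engine="davinci",   # model choice
    prompt=prompt,
    max_tokens=150,     # upper bound on the summary length
    temperature=0.3,    # lower values give more deterministic output
    top_p=1.0,
)
print(response["choices"][0]["text"])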
