- Introduction
- Downloading the Data
- Loading into a database
- Database design and organization
- Data generation and metrics
- Developer documentation
- Limitations
- Contact
OnSIDES is an international, comprehensive database of drugs and their adverse events using data from drug product labels. Information was extracted by fine-tuning a PubMedBERT language model on 200 manually curated labels available from Demner-Fushman et al. The database will be updated quarterly, and currently contains more than 7.1 million drug-ADE pairs for 4,097 drug ingredients extracted from 51,460 labels, processed from all of the labels available to download from DailyMed (USA), EMA (EU), EMC (UK), and KEGG (Japan) as of April 2025.
If you use the OnSIDES database, results, or methods in your work, please reference our paper:
Tanaka Y, Chen HY, Belloni P, Gisladottir U, Kefeli J, Patterson J, Srinivasan A, Zietz M, Sirdeshmukh G, Berkowitz J, LaRow Brown K, Tatonetti NP. OnSIDES database: Extracting adverse drug events from drug labels using natural language processing models. Med. 2025 Mar 27:100642. doi: 10.1016/j.medj.2025.100642. PMID: 40179876.
OnSIDES is released as a set of flat files (comma-separated) and as an interactive dataset at onsidesdb.org. Downloads are available in the Releases section.
We provide utilities to simplify loading OnSIDES into a relational database of your choice (e.g. MySQL or PostgreSQL). Pre-built schema files are available for MySQL, PostgreSQL, and SQLite, along with an example bash script for each that creates a database and loads the files in the proper order. These example scripts use Podman to run the databases in containers, but they can be easily adapted to work with existing databases. We also provide OnSIDES as a single, pre-built SQLite database, which you can import into another database with specialized tools such as pgloader or sqlite3-to-mysql.
The OnSIDES database has the following seven tables:
- product_label - Individual drug products
- product_adverse_effect - Extracted adverse effects for drug products
- product_to_rxnorm - Mapping from drug products to RxNorm products (many-to-many)
- vocab_rxnorm_product - RxNorm products
- vocab_rxnorm_ingredient_to_product - Mapping from RxNorm products to RxNorm ingredients (many-to-many)
- vocab_rxnorm_ingredient - RxNorm ingredients
- vocab_meddra_adverse_effect - MedDRA adverse effect terms
Tables starting with product_ contain data from drug labels, while vocab_ tables contain vocabulary mapping information from the UMLS Metathesaurus and the OMOP vocabularies (from Athena).
Here's a diagram of the database schema:
In addition to these tables, there is one more CSV file called high_confidence.csv.
This file contains ingredient-adverse effect pairs that were observed in all four sources (US, UK, EU, Japan).
We found 562,281 high-confidence ADEs across 5,761 ingredients and 733 effects.
Since this table is derived, it has not been included in the database schemas. A suitable definition (SQLite here) would be:
CREATE TABLE high_confidence (
    ingredient_id TEXT,
    effect_meddra_id INTEGER,
    FOREIGN KEY(ingredient_id) REFERENCES vocab_rxnorm_ingredient(rxnorm_id),
    FOREIGN KEY(effect_meddra_id) REFERENCES vocab_meddra_adverse_effect(meddra_id)
);
After creating the table, import as follows (SQLite again):
.mode csv
.import --skip 1 'csv/high_confidence.csv' high_confidence
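For anyone scripting the load rather than using the sqlite3 shell, the same import can be sketched in Python with only the standard library. This is a minimal sketch, not part of the OnSIDES tooling: the function name `load_high_confidence` is hypothetical, the CSV is assumed to have exactly two columns in the order of the table definition above, and the sample rows are illustrative values rather than data from a release.

```python
import csv
import io
import sqlite3

# Table definition copied from the SQLite example above.
SCHEMA = """
CREATE TABLE high_confidence (
    ingredient_id TEXT,
    effect_meddra_id INTEGER,
    FOREIGN KEY(ingredient_id) REFERENCES vocab_rxnorm_ingredient(rxnorm_id),
    FOREIGN KEY(effect_meddra_id) REFERENCES vocab_meddra_adverse_effect(meddra_id)
);
"""


def load_high_confidence(conn, csv_file):
    """Create the table and insert every data row; return the row count.

    Assumption: the file has a header row followed by two columns,
    ingredient_id and effect_meddra_id, in that order.
    """
    # Foreign keys are left unenforced so this sketch also works in a
    # database where the vocab_ tables have not been loaded yet.
    conn.execute("PRAGMA foreign_keys = OFF")
    conn.executescript(SCHEMA)
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row, like `.import --skip 1`
    rows = [(ingredient_id, int(meddra_id)) for ingredient_id, meddra_id in reader]
    conn.executemany("INSERT INTO high_confidence VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# Illustrative in-memory example with made-up rows.
sample = io.StringIO("ingredient_id,effect_meddra_id\n1191,10019211\n161,10028813\n")
conn = sqlite3.connect(":memory:")
n_loaded = load_high_confidence(conn, sample)
n_effects = conn.execute(
    "SELECT COUNT(DISTINCT effect_meddra_id) FROM high_confidence"
).fetchone()[0]
```

To load the real file, the `io.StringIO` sample would be replaced by `open("csv/high_confidence.csv", newline="")` and the in-memory connection by a connection to the target SQLite database.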
Starting with the v3.0.0 release, we also include a set of manual annotations for download. For more information, see the annotation description.
OnSIDES is generated through the following steps:
- Find and download drug labels for all human prescription drugs from four sources (DailyMed, EMA, EMC, KEGG).
- Parse raw formats (e.g. PDF, XML), extract relevant sections, and format all label texts.
- Identify string matches of MedDRA terms in the extracted label texts (MedDRA Japan is used for Japanese labels).
- For non-Japanese labels, apply PubMedBERT to score each matched term, and apply a cutoff threshold on the predictions.
- Map all drug products to RxNorm.
- Gather extracted data into a database (products, adverse effects identified). Combine with product-to-ingredient information from RxNorm and MedDRA concept mappings.
- Validate the database, and export tables as flat files.
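The term-matching and thresholding steps above can be illustrated with a toy example. This is only a sketch under stated assumptions: the function names, the whole-word regular-expression matching, the 0.5 cutoff, and the `toy_score` function are all hypothetical. In particular, `toy_score` is a stand-in for the fine-tuned PubMedBERT classifier and does not reflect how the real model scores a mention.

```python
import re


def find_term_matches(text, terms):
    """Return (term, start, end) for each case-insensitive whole-word match."""
    matches = []
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            matches.append((term, m.start(), m.end()))
    return sorted(matches, key=lambda match: match[1])


def toy_score(text, start):
    """Placeholder for the classifier: low score if a negation cue precedes the term."""
    window = text[max(0, start - 20):start].lower()
    return 0.1 if ("no " in window or "not " in window) else 0.9


def apply_threshold(scored_terms, threshold):
    """Keep the terms whose score is at or above the cutoff."""
    return [term for term, score in scored_terms if score >= threshold]


text = "Nausea and headache were reported. There was no evidence of hepatotoxicity."
terms = ["nausea", "headache", "hepatotoxicity"]  # stand-ins for MedDRA terms

matches = find_term_matches(text, terms)
scored = [(term, toy_score(text, start)) for term, start, _ in matches]
kept = apply_threshold(scored, threshold=0.5)  # 0.5 is an arbitrary example cutoff
```

The point of the two-stage design is that string matching alone would report "hepatotoxicity" here even though the label negates it; the scoring step is what filters such mentions out.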
The fine-tuned PubMedBERT model used for OnSIDES achieved the following performance on 200 manually annotated FDA labels:
Section | F1 | Precision | Recall | AUROC | N |
---|---|---|---|---|---|
Boxed Warning | 0.964 | 0.971 | 0.957 | 0.977 | 686 |
Warnings and Precautions | 0.882 | 0.883 | 0.881 | 0.933 | 10254 |
Adverse Reactions | 0.935 | 0.946 | 0.924 | 0.956 | 12122 |
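As a quick sanity check when reading the table above, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall for each section; the helper name `f1_score` is ours, and the values are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per label section, copied from the table.
reported = {
    "Boxed Warning": (0.971, 0.957, 0.964),
    "Warnings and Precautions": (0.883, 0.881, 0.882),
    "Adverse Reactions": (0.946, 0.924, 0.935),
}

recomputed = {
    section: round(f1_score(precision, recall), 3)
    for section, (precision, recall, _) in reported.items()
}
```

For all three sections, the recomputed value agrees with the reported F1 to three decimal places.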
In a separate comparison, this modeling approach was compared to the TAC 2017 dataset, and achieved the following performance:
Performance metrics evaluated against the TAC gold standard
Metric | TAC (Best Model†) | SIDER 4.1 | OnSIDES v1.0.0 | OnSIDES v2/3.0.0 |
---|---|---|---|---|
F1 Score | 82.19 | 74.36 | 82.01 | 87.54 |
Precision | 80.69 | 43.49 | 88.76 | 91.29 |
Recall | 85.05 | 52.89 | 77.12 | 84.08 |
This project uses Snakemake to reproducibly download and build the database.
Snakemake is a workflow management tool that ensures work is not unnecessarily duplicated.
The various steps are stored as Snakefiles in the snakemake directory.
By default, the first rule in a file is run.
Each file's first rule is a catch-all rule called all.
To run a Snakefile, ensure Snakemake is installed (e.g. via nix develop, or install it manually), then run e.g. snakemake -s snakemake/us/download/Snakefile.
Adding -n makes it a dry run, meaning it tells you what it would do without actually doing it.
Because Snakefiles can overwrite things, I recommend an initial dry run before running any other command, just to verify that what you want it to do is what it will do.
The snakemake directory is organized as follows, with each sub-directory having a Snakefile and any additional scripts that are needed.
You should run these Snakefiles from the project root directory.
snakemake
├── eu
│ ├── download
│ └── parse
├── jp
│ ├── download
│ └── parse
├── onsides
│ ├── evaluate
│ └── export
├── uk
│ ├── download
│ └── parse
└── us
    ├── download
    └── parse
To run everything, you'll want to download each source, then parse, then evaluate (onsides/evaluate), then export (onsides/export).
Here is a minimal set of commands to generate everything:
snakemake -s snakemake/us/download/Snakefile
snakemake -s snakemake/uk/download/Snakefile
snakemake -s snakemake/eu/download/Snakefile
snakemake -s snakemake/jp/download/Snakefile
snakemake -s snakemake/us/parse/Snakefile
snakemake -s snakemake/uk/parse/Snakefile
snakemake -s snakemake/eu/parse/Snakefile
snakemake -s snakemake/jp/parse/Snakefile
snakemake -s snakemake/onsides/evaluate/Snakefile
snakemake -s snakemake/onsides/export/Snakefile
To be transparent, I didn't package everything into a single command because these scripts will inevitably become stale in various ways (source URLs change, etc.), and keeping the steps separate makes debugging significantly easier for the inheritors of this project.
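If a local convenience wrapper is still wanted, one option that keeps every step as a separate, individually debuggable command is sketched below. This is not part of the repository: the function names `build_commands` and `run_all` are hypothetical, and only the `-s` and `-n` options mentioned above are used. It defaults to a dry run, in line with the recommendation to dry-run first.

```python
import subprocess

SOURCES = ["us", "uk", "eu", "jp"]


def build_commands(dry_run=True):
    """Return the snakemake invocations in order: download, parse, evaluate, export."""
    snakefiles = [f"snakemake/{src}/download/Snakefile" for src in SOURCES]
    snakefiles += [f"snakemake/{src}/parse/Snakefile" for src in SOURCES]
    snakefiles += [
        "snakemake/onsides/evaluate/Snakefile",
        "snakemake/onsides/export/Snakefile",
    ]
    commands = []
    for snakefile in snakefiles:
        cmd = ["snakemake", "-s", snakefile]
        if dry_run:
            cmd.append("-n")  # dry run: report what would be done, change nothing
        commands.append(cmd)
    return commands


def run_all(dry_run=True):
    """Run each step from the project root, stopping at the first failure."""
    for cmd in build_commands(dry_run):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


commands = build_commands(dry_run=True)
```

Because `check=True` raises on the first non-zero exit code, a failure still points at a single Snakefile, which can then be rerun and debugged on its own.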
To speed development and avoid dependency headaches, I built a development environment as a nix flake.
Nix is a declarative, reproducible package and environment manager.
What this means is that, after you install nix, all you need to run is nix develop to enter the development shell, and every prerequisite needed for OnSIDES will be available.
If you prefer not to use nix, you'll need to install all the dependencies yourself, including Python (I recommend uv), Java, tabula, pandoc, and DuckDB, then create the virtual environment for Python (e.g. uv sync).
Simple answer?
Install nix, then run nix develop.
Here's how the v3.0.0 archive was created.
cd database
zip -r ../onsides-v3.0.0.zip annotations csv database_scripts schema
OnSIDES is strictly intended for academic research purposes. The adverse drug event term extraction method is far from perfect: some side effects will be missed, and some terms predicted to be true adverse events will be incorrect.
Patients and healthcare professionals seeking health information should not trust or use this data, and should instead refer to the information available from their regions' respective drug regulatory agencies, such as the FDA (USA), EMA (EU), MHRA (UK), and PMDA (Japan), and consult their healthcare providers for information.
Additionally, this project is under active development. We are continuing to independently validate the performance of the models used and to improve the extraction methodology. As such, the data, methods, and statistics are subject to change at any time. Any updates to the database will be reflected on this page and in this repository.
If you would like to contribute to this project or have any suggestions on how the methods, data, or evaluation can be improved, please reach out to Dr. Tatonetti via email or Twitter.