This repository will host tidyverse code for exploratory analysis and predictive modelling of how sequences are cited in the literature. The sequences originate from the European Nucleotide Archive (ENA); the literature records originate from Europe PMC (ePMC).
- Code to parse EMBL-flatfiles (
- Code to query ePMC API (
- preprocessing/database/initdb/seqref.sql.gz
- ENA release 143 (03/04/2020):
- files total 208G compressed and 3.5T uncompressed.
- The release contains 263,421,789 sequence entries comprising 408,005,271,872 nucleotides.

Division | Entries |
--- | ---: |
ENV:Environmental Samples | 16,765,544 |
FUN:Fungi | 7,511,473 |
HUM:Human | 27,520,827 |
INV:Invertebrates | 40,534,979 |
MAM:Other Mammals | 16,578,137 |
MUS:Mus musculus | 10,479,013 |
PHG:Bacteriophage | 17,393 |
PLN:Plants | 85,618,575 |
PRO:Prokaryotes | 3,589,696 |
ROD:Rodents | 3,263,952 |
SYN:Synthetic | 10,049,087 |
TGN:Transgenic | 286,472 |
UNC:Unclassified | 15,943,630 |
VRL:Viruses | 3,198,057 |
VRT:Other Vertebrates | 22,064,954 |
Total | 263,421,789 |
- Sequence entry must have the /country qualifier, which gives the locality where the sequenced organism was isolated, expressed as the political name of a nation, ocean or sea, followed by regions and localities
- Sequence entry must be a non-WGS sequence
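The two criteria above can be sketched as a small filter over flat-file entries. This is a minimal illustration rather than the repository's parser: it assumes entries end with a `//` line, that the data class (for example `STD` or `WGS`) is the fifth `;`-separated field of the `ID` line, and that qualifiers appear on `FT` lines. The sample records and their `XX...` accessions are made up.

```python
from typing import Iterable, Iterator, List


def split_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of lines per flat-file entry; entries end with '//'."""
    entry: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:  # tolerate a final entry without a terminator
        yield entry


def id_fields(entry: List[str]) -> List[str]:
    """Return the ';'-separated fields of the ID line, or [] if it is missing."""
    for line in entry:
        if line.startswith("ID"):
            return [field.strip() for field in line[2:].split(";")]
    return []


def is_wgs(entry: List[str]) -> bool:
    """Assumption: the data class is the fifth field of the ID line."""
    fields = id_fields(entry)
    return len(fields) > 4 and fields[4] == "WGS"


def has_country(entry: List[str]) -> bool:
    """True if any feature-table (FT) line carries a /country qualifier."""
    return any(line.startswith("FT") and "/country=" in line for line in entry)


def keep(entry: List[str]) -> bool:
    """Apply both selection criteria."""
    return has_country(entry) and not is_wgs(entry)


# Made-up records for illustration only.
SAMPLE = """\
ID   XX000001; SV 1; linear; genomic DNA; STD; INV; 120 BP.
FT   source          1..120
FT                   /country="Japan:Okinawa"
//
ID   XX000002; SV 1; linear; genomic DNA; WGS; PRO; 300 BP.
FT   source          1..300
FT                   /country="France"
//
ID   XX000003; SV 1; linear; mRNA; STD; HUM; 90 BP.
FT   source          1..90
FT                   /organism="Homo sapiens"
//
"""

kept = [id_fields(e)[0] for e in split_entries(SAMPLE.splitlines()) if keep(e)]
print(kept)
```

Only the first record passes: the second is a WGS entry and the third has no /country qualifier.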

Id | Description |
--- | --- |
accession | Sequence accession |
primary_pmid | Primary PubMed id |
primary_doi | Primary DOI, extracted from the flat file |
primary_pmcid | Primary ePMC id, extracted from the flat file |
origin | Locality of sequence isolation |
country | Country of sequence isolation |
submission_date | Sequence submission date |
first_created | Date the sequence entry was first created |
lat_lon | Geolocation (latitude and longitude) |
organism | Sequence organism name |
taxid | Sequence taxonomic id |
code | Sequence taxon |
project_acc | Sequence project accession |
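As an illustration of how the `country`, `origin` and `lat_lon` fields might be derived from qualifier values, here is a minimal sketch. It assumes a /country value shaped like `Country:region, locality` and a /lat_lon value shaped like `47.94 N 28.12 W`; mapping the part after the colon to `origin` is an assumption, and the function names are hypothetical.

```python
import re
from typing import Optional, Tuple


def split_country(value: str) -> Tuple[str, Optional[str]]:
    """Split a /country value at the first colon into (country, origin)."""
    country, sep, locality = value.partition(":")
    origin = locality.strip() or None if sep else None
    return country.strip(), origin


# Decimal degrees followed by a hemisphere letter, twice.
_LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s+([NS])\s+(\d+(?:\.\d+)?)\s+([EW])\s*$"
)


def parse_lat_lon(value: str) -> Optional[Tuple[float, float]]:
    """Return signed (latitude, longitude), or None if the value does not match."""
    match = _LAT_LON.match(value)
    if match is None:
        return None
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    return lat, lon


print(split_country("Japan:Okinawa, Naha"))
print(split_country("France"))
print(parse_lat_lon("47.94 N 28.12 W"))
```

South and west are returned as negative numbers, so the pair can be used directly as decimal coordinates.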

Division | Counts |
--- | ---: |
INV | 4895011 |
ENV | 4191787 |
VRL | 2623044 |
PLN | 1952666 |
VRT | 1824633 |
FUN | 877183 |
PRO | 868913 |
MAM | 575165 |
HUM | 129549 |
ROD | 79785 |
PHG | 7958 |
MUS | 7458 |
SYN | 737 |
UNC | 246 |
TGN | 57 |
Total | 18034192 |
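To put these counts in proportion, the share of each division that remains can be computed against the release totals. The sketch below copies the figures from the two tables and assumes that the second table counts the entries passing the selection criteria; the variable and function names are illustrative.

```python
from typing import Dict

# Entry counts per division in ENA release 143 (first table).
RELEASE: Dict[str, int] = {
    "ENV": 16_765_544, "FUN": 7_511_473, "HUM": 27_520_827,
    "INV": 40_534_979, "MAM": 16_578_137, "MUS": 10_479_013,
    "PHG": 17_393, "PLN": 85_618_575, "PRO": 3_589_696,
    "ROD": 3_263_952, "SYN": 10_049_087, "TGN": 286_472,
    "UNC": 15_943_630, "VRL": 3_198_057, "VRT": 22_064_954,
}

# Counts per division from the second table (assumed to be the entries kept).
SELECTED: Dict[str, int] = {
    "INV": 4_895_011, "ENV": 4_191_787, "VRL": 2_623_044,
    "PLN": 1_952_666, "VRT": 1_824_633, "FUN": 877_183,
    "PRO": 868_913, "MAM": 575_165, "HUM": 129_549,
    "ROD": 79_785, "PHG": 7_958, "MUS": 7_458,
    "SYN": 737, "UNC": 246, "TGN": 57,
}


def retained_share(selected: Dict[str, int], release: Dict[str, int]) -> Dict[str, float]:
    """Fraction of each division's release entries present in `selected`."""
    return {division: selected.get(division, 0) / total
            for division, total in release.items()}


shares = retained_share(SELECTED, RELEASE)
overall = sum(SELECTED.values()) / sum(RELEASE.values())

# Print divisions from the highest to the lowest retained share.
for division, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{division}\t{share:.2%}")
print(f"Total\t{overall:.2%}")
```

On these figures the viral division (VRL) retains the largest share, roughly 82%, while about 6.8% of all release entries remain overall.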
- Europe PubMed Central (ePMC)
- Sequence accession, e.g. AB013190
- Primary PubMed id, e.g. 11050544
- Project accession, e.g. PRJDB3373
- Europe PMC Articles RESTful API: `search` method
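A query for one of the identifiers listed above could be assembled as in the following sketch. It assumes the public ePMC REST search endpoint at `https://www.ebi.ac.uk/europepmc/webservices/rest/search` and its `query`, `format`, `resultType` and `pageSize` parameters. The identifier is sent as a plain search term, since the exact query syntax used by this repository is not stated here. The function name is hypothetical, and the code only builds the URL without contacting the service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed public endpoint of the ePMC search method.
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(term: str, page_size: int = 25) -> str:
    """Return a search URL for one identifier (accession, PubMed id or project)."""
    params = {
        "query": term,          # plain search term; no field-specific syntax assumed
        "format": "json",       # ask for a JSON response
        "resultType": "core",   # full metadata rather than the lite record
        "pageSize": page_size,  # number of hits per page
    }
    return BASE_URL + "?" + urlencode(params)


# One URL per example identifier from the list above.
urls = [build_search_url(term) for term in ("AB013190", "11050544", "PRJDB3373")]
for url in urls:
    print(url)

# Round-trip check: the query string decodes back to the original term.
decoded = parse_qs(urlsplit(urls[0]).query)
print(decoded["query"])
```

Building the URL separately from the HTTP call keeps the request easy to log and to test without network access.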

Retrieved Field | Description |
--- | --- |
accession | Sequence id |
idpmc | Unique ePMC id |
source | Literature source, e.g. MEDLINE |
pubtype | Publication type |
issn | ISSN |
isopenaccess | Whether the publication is open access |
secondary_pmid | PubMed id of the literature hit |
secondary_pmcid | PMC id of the literature hit |
secondary_doi | DOI of the literature hit |
author | Author name |
affiliation | Author affiliation |
country | Author country |
first_pubdate | First publication date |
first_epubdate | First electronic publication date |
orcid | Author ORCID |
language | Publication language |
grantid | Grant identifier |
grant_agency | Grant Agency |
grant_acronym | Grant Acronym |
receipt_date | Publication reception date |
revision_date | Publication revision date |

Source | # accessions | Definition |
--- | ---: | --- |
MED | 534039 | PubMed/MEDLINE NLM |
AGR | 4981 | Agricola |
PMC | 1756 | PubMed Central |
CBA | 77 | Chinese Biological Abstracts |
PPR | 70 | Preprints |
PAT | 24 | Biological Patents |
CTX | 5 | CiteXplore |