Scripts to generate XML files and to submit them to the European Nucleotide Archive (ENA) server

Submissions to ENA can be made using programmatic submission service using cURL. Submissions of different types (STUDY, SAMPLE, EXPERIMENT, RUN) can be made using XML files. The script generate_xml.py is used to generate these files. Informations used to generate XML are extracted from a more convenient Excel spreadsheet template (spreadsheet_template.xlsx). The XML receipt is further parsed using parse-receipt.py to extract accession numbers assigned by the service.

We encourage you to read the ENA training modules.

Installation

Requirements

cURL:

install:

sudo apt-get install curl

python3:
- hashlib
- yattag
- untangle
- pandas

install:

pip3 install -r requirements.txt

Clone the project

The easiest option is to clone the repository:

git clone https://github.com/bigey/ena-submit.git

Usage

Generate XML files

generate_xml.py [-h] [--data_dir DATA_DIR] [--out_dir OUT_DIR] SPREADSHEET_FILE

Generate xml files to submit to ENA server

positional arguments:

  SPREADSHEET_FILE      Excel spreadsheet file (xls, xlsx, ods)

optional arguments:

  -h, --help            show this help message and exit

  --data_dir DATA_DIR, -d DATA_DIR
                        directory containing the data (reads)
                        Default: current dir

  --out_dir OUT_DIR, -o OUT_DIR
                        output directory containing the generated xml files
                        Default: current dir

Parse the XML receipt of the server

parse-receipt.py [-h] [--tsv] [--out OUT_FILE] RECEIPT_XML

Parse the XML data received from the submission server

positional arguments:

  RECEIPT_XML           receipt xml file from ENA server

optional arguments:

  -h, --help            show this help message and exit

  --tsv, -t             output in tabular format (space separated value)

  --out OUT_FILE, -o OUT_FILE
                        optional output file
                        Default: stdout

Description

Use Excel to edit the submission informations in the spreadsheet template file (spreadsheet_template.xlsx). Use one sheat for each type of data:

Project

A project (also referred to as a study) is used to group other objects together, so we will look into creating a project as a first step towards to submit ENA objects.

alias: this is a unique code to refer to your study, eg proj_0000
TITLE: a descriptive short title
DESCRIPTION: an abstract detailing the project

Sample

Use one line per sample. ENA provides sample checklists which define all the mandatory and recommended attributes for specific types of samples. We do not define a checklist here, then the samples will be validated against the ENA default checklist ERC000011. This checklist has virtually no mandatory fields but contains many optional attributes that can help you to annotate your samples to the highest possible standard. You can find all the checklists here.

Mandatory

alias: a unique code, eg: sam_0000
TITLE: sample name eg: Saccharomyces cerevisiae S288C
TAXON_ID: eg: 4932
SCIENTIFIC_NAME: eg: Saccharomyces cerevisiae
COMMON_NAME: eg: baker's yeast (optional)

Optional

strain
sample_description
collected_by
geographic location (country and/or sea)
geographic location (region and locality)
isolation_source

Experiment

An experiment object represents the library that is created from a sample and used in a sequencing experiment. The experiment object contains details about the sequencing platform and library protocols. An experiment is part of a study and is assocated with a sample. It is common to have multiple libraries and sequencing experiments for a single sample. Experiments point to samples to allow sharing of sample information between multiple experiments.

alias: a unique code, eg: exp_0000
TITLE: a short title
STUDY_REF: code of the project, eg: proj_0000
SAMPLE_DESCRIPTOR: code of the sample, eg: sample_0000
LIBRARY_NAME: a unique code, eg: lib_0000
LIBRARY_STRATEGY: controlled value fields
LIBRARY_SOURCE: controlled value fields
LIBRARY_SELECTION: controlled value fields
PAIRED: yes|no
NOMINAL_LENGTH: average insert size
NOMINAL_SDEV: standard deviation of insert size (optional)
LIBRARY_CONSTRUCTION_PROTOCOL: detailed protocol
INSTRUMENT_MODEL: eg: Illumina HiSeq 2000

Run

The run points to the experiment using the experiment’s alias.

alias: a unique code, eg: run_0000
EXPERIMENT_REF: code of the experiment, eg: exp_0000
filetype: eg: fastq
filename_r1: path to read file 1
filename_r2: path to read file 2 (optional if single)

Using the script `ena-submit.sh`

Utility

This script can be used to:

upload run files (reads) to the ftp server (using curl),
generate XML files (using generate_xml.py),
submit the XML files to server (using curl),
parse the XML receipt (using parse-receipt.py)

Customization

Please give the following parameters:

Type of submission

Select:

TEST="true" for testing. You'll use the testing server. We encourage you to first validate your data using this server,
TEST="false" to submit your data. You'll use the production server. This is a real submission!

Select the appropriate action

ACTION="ADD" is used to submit new data,
ACTION="MODIFY" is used to update submited data

ENA credentials

If you have not submitted to Webin before, please register a submission account here.

Create a file containing the credentials used to connect to the ENA server: one line with user and password separated by a space character:

user password

Update this line accordingly:

CREDENDIAL=".credential"

LibreOffice spreadsheet

The name of the spreadsheet file containing your data. You would start using the template spreadsheet given with the project.

LIBREOFFICE_ODS="spreadsheet_template.xlsx"

Directory containing data/reads

This directory should contain the sequencing read files. Generally *.fastq.gz files.

DATA_IN_DIR="data"

Directory containing the generated XML files

Update if you want the XMLs to be generated in another directory.

XML_OUT_DIR="xml"

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
data		data
xml		xml
.credential		.credential
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
ena-submit.sh		ena-submit.sh
generate-xml.py		generate-xml.py
parse-receipt.py		parse-receipt.py
requirements.txt		requirements.txt
server-receipt.txt		server-receipt.txt
server-receipt.xml		server-receipt.xml
spreadsheet_template.xlsx		spreadsheet_template.xlsx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Scripts to generate XML files and to submit them to the European Nucleotide Archive (ENA) server

Installation

Requirements

Clone the project

Usage

Generate XML files

Parse the XML receipt of the server

Description

Project

Sample

Experiment

Run

Using the script `ena-submit.sh`

Utility

Customization

Type of submission

Select the appropriate action

ENA credentials

LibreOffice spreadsheet

Directory containing data/reads

Directory containing the generated XML files

About

Releases

Packages

Languages

License

bigey/ena-submit

Folders and files

Latest commit

History

Repository files navigation

Scripts to generate XML files and to submit them to the European Nucleotide Archive (ENA) server

Installation

Requirements

Clone the project

Usage

Generate XML files

Parse the XML receipt of the server

Description

Project

Sample

Experiment

Run

Using the script ena-submit.sh

Utility

Customization

Type of submission

Select the appropriate action

ENA credentials

LibreOffice spreadsheet

Directory containing data/reads

Directory containing the generated XML files

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Using the script `ena-submit.sh`

Packages