- What is SELECTA?
- SELECTA framework Docker-Compose version
- Installation
- Retrieving the SELECTA docker-compose version
- Usage on a single computer
- SELECTA API
- Minimum requirement for a SELECTA processing
- Retrieve SELECTA database data
- Run SELECTA Workflow manually
- Run SELECTA Workflow automatically
- Usage on a SWARM cluster
- Background
SELECTA is a rule-based workflow engine that runs analytical pipelines, first developed to manage pipelines specifically for the COMPARE community. However, SELECTA can also be set up to manage other user-provided analytical pipelines. SELECTA automatically submits the analysis results output by the analytical pipelines back to the European Nucleotide Archive (ENA), mitigating the burden on a submitter when handling bulk submissions and ensuring a consistent and accurate record of the analysis results.
This repository provides the SELECTA toolkits as a set of Dockerfiles and docker-compose configuration files, so the SELECTA framework can be launched without installing every SELECTA component individually. The repository also provides a script (init_swarm.sh) to initialize a docker swarm environment and a stack file to launch the framework in a docker swarm cluster, as well as a script (rm_swarm.sh) to remove the created swarm cluster environment entirely.
The open-source software Docker and docker-compose must be pre-installed for the SELECTA docker-compose version to function properly:
Add the GPG key for the official Docker repository to the operating system.
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
Add the Docker repository to the APT (Advanced Package Tool) sources.
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
Next, update the Ubuntu package database with the newly added Docker package repository.
sudo apt-get update
Install the Docker community edition:
sudo apt-get install -y docker-ce
Verify that the Docker daemon is running with the following command:
sudo systemctl status docker
Next, install docker-compose and make it executable.
sudo curl -L https://github.com/docker/compose/releases/download/1.18.0/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose && sudo chmod +x /usr/local/bin/docker-compose
Finally, verify that the docker-compose installation was successful by checking its version.
docker-compose --version
Alternatively, install the executable following the instructions on the docker-compose page:
https://docs.docker.com/compose/install/
To launch the docker-compose version of the SELECTA framework, first retrieve it from GitHub.
Clone this repository:
git clone <repo>
If utilising SELECTA for COMPARE, you will also want to obtain the following codebases for the integrated COMPARE pipelines:
- DTU_CGE
- EMC_SLIM
- UAntwerp_Bacpipe
- RIVM_Jovian (private, request by email)
Alternatively, obtain the codebases of the analysis pipelines that you wish to implement in SELECTA.
To launch the SELECTA framework on a single machine, access the SELECTA docker repository:
cd </path/to/ebi-selecta>
docker-compose up --build -d
The last command runs in detached mode: it pulls the required SELECTA image dependencies from Docker Hub, builds the SELECTA-tools docker images where necessary, and finally launches all the SELECTA containers. docker-compose also creates a docker network that allows communication between the SELECTA docker containers.
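To check that the containers came up, list them with docker-compose (the service names shown will depend on the docker-compose.yml):
docker-compose ps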
This repository provides SELECTA API documentation detailing the SELECTA endpoints. The SELECTA API implements both the GET and POST methods.
The SELECTA framework is database-centric. Two main tables, Account and Process_Selection, contain the mandatory requirements to enact SELECTA processing. The SELECTA API facilitates populating these tables with the required attributes. The API listens on port 5002 of localhost; this port is defined in the accompanying docker-compose.yml file.
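As a quick sanity check that the API is listening on that port, you can request the /accounts endpoint (documented below) and print only the HTTP status code:
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:5002/accounts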
The following command will create an account for a datahub dcc_xxxx (dcc_test in the example below):
curl -d '{"account_id":"dcc_test","account_type":"datahub","email": "[email protected]","password": "letmein"}' -H "Content-Type: application/json" -X POST http://localhost:5002/account
The PROCESS_SELECTION table stores the processing rules for each datahub. Here we define the name of the analytical workflow that is responsible for analyzing the data from a specific datahub, study, or run.
The following command creates a rule in the PROCESS_SELECTION table for processing data from the datahub dcc_dvorak with the analysis workflow UAntwerp_Bacpipe; the rule ensures that the analysis of the run is an ongoing process by setting the CONTINUITY flag to YES. This specific rule continually analyzes the data of a specific run in the dcc_dvorak datahub.
curl -d '{"datahub": "dcc_dvorak","run_accession": "ERR1102130","pipeline_name": "UAntwerp_Bacpipe","public": "NO","webin": "Webin-45433","continuity": "YES","process_type": "run"}' -H "Content-Type: application/json" -X POST http://localhost:5002/input2selection
curl -H "Content-Type: application/json" -X GET http://localhost:5002/accounts
curl -H "Content-Type: application/json" -X GET http://localhost:5002/selections
The database can also be accessed at http://localhost:5050 with the following credentials: username [email protected], password letmein.
When manually running each SELECTA stage, run the following scripts sequentially: selection_to_attribute.py, data_provider.py, core_executor.py, analysis_reporter.py, and process_archival.py. Each script requires the SELECTA configuration file (properties.txt) as an argument:
docker run -ti --rm --network=ebi-selecta_postgres embl-ebi/selecta_selection_to_attribute:1.0 /usr/scr/app/scripts/selection_to_attribute.py -p /usr/scr/app/resources/properties.txt
docker run -ti --rm --network=ebi-selecta_postgres embl-ebi/selecta_data_provider:1.0 /usr/scr/app/scripts/data_provider.py -p /usr/scr/app/resources/properties.txt
docker run -ti --rm --network=ebi-selecta_postgres embl-ebi/selecta_core_executor:1.0 /usr/scr/app/scripts/core_executor.py -p /usr/scr/app/resources/properties.txt
docker run -ti --rm --network=ebi-selecta_postgres embl-ebi/selecta_analysis_reporter:1.0 /usr/scr/app/submission/analysis_reporter.py -p /usr/scr/app/resources/properties.txt
docker run -ti --rm --network=ebi-selecta_postgres embl-ebi/selecta_process_archival:1.0 /usr/scr/app/scripts/process_archival.py -p /usr/scr/app/resources/properties.txt
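To run all five stages in one go, the commands above can be chained in a small wrapper script. The sketch below (the file name run_selecta_stages.sh is hypothetical) drops the -t flag so it also works without a TTY, and aborts if a stage fails:
#!/bin/bash
# run_selecta_stages.sh - hypothetical wrapper running the five SELECTA stages in order
set -e  # stop at the first failing stage
docker run -i --rm --network=ebi-selecta_postgres embl-ebi/selecta_selection_to_attribute:1.0 /usr/scr/app/scripts/selection_to_attribute.py -p /usr/scr/app/resources/properties.txt
docker run -i --rm --network=ebi-selecta_postgres embl-ebi/selecta_data_provider:1.0 /usr/scr/app/scripts/data_provider.py -p /usr/scr/app/resources/properties.txt
docker run -i --rm --network=ebi-selecta_postgres embl-ebi/selecta_core_executor:1.0 /usr/scr/app/scripts/core_executor.py -p /usr/scr/app/resources/properties.txt
docker run -i --rm --network=ebi-selecta_postgres embl-ebi/selecta_analysis_reporter:1.0 /usr/scr/app/submission/analysis_reporter.py -p /usr/scr/app/resources/properties.txt
docker run -i --rm --network=ebi-selecta_postgres embl-ebi/selecta_process_archival:1.0 /usr/scr/app/scripts/process_archival.py -p /usr/scr/app/resources/properties.txt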
To automate the steps described above, create individual cronjobs for each stage to run periodically.
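For example, a crontab along the following lines would run each stage once an hour on a staggered schedule (the timings are illustrative and should be adjusted to your workload; the -i/-t flags are dropped because cron provides no TTY):
# m  h dom mon dow  command
0  * * * * docker run --rm --network=ebi-selecta_postgres embl-ebi/selecta_selection_to_attribute:1.0 /usr/scr/app/scripts/selection_to_attribute.py -p /usr/scr/app/resources/properties.txt
10 * * * * docker run --rm --network=ebi-selecta_postgres embl-ebi/selecta_data_provider:1.0 /usr/scr/app/scripts/data_provider.py -p /usr/scr/app/resources/properties.txt
20 * * * * docker run --rm --network=ebi-selecta_postgres embl-ebi/selecta_core_executor:1.0 /usr/scr/app/scripts/core_executor.py -p /usr/scr/app/resources/properties.txt
30 * * * * docker run --rm --network=ebi-selecta_postgres embl-ebi/selecta_analysis_reporter:1.0 /usr/scr/app/submission/analysis_reporter.py -p /usr/scr/app/resources/properties.txt
40 * * * * docker run --rm --network=ebi-selecta_postgres embl-ebi/selecta_process_archival:1.0 /usr/scr/app/scripts/process_archival.py -p /usr/scr/app/resources/properties.txt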
Access the SELECTA docker repository:
cd </path/to/ebi-selecta>
Run the swarm initialization script:
./init_swarm.sh
Confirm cluster creation and inspect a given node:
docker node ls
docker node inspect <node-id>
To launch SELECTA in the swarm cluster, issue the following command:
docker stack deploy -c docker-stack-compose.yml selecta
The last command will pull the required SELECTA images from Docker Hub and launch all the SELECTA services as configured in the docker-stack-compose YAML file.
The SELECTA framework integrates a swarm visualizer that is accessible at
http://localhost:8080
To scale up the core_executor service, issue the following command:
docker service scale selecta_core_executor=10
Confirm that the service has scaled up via the swarm visualizer at http://localhost:8080
To scale down the core_executor service, issue the following command:
docker service scale selecta_core_executor=1
Confirm that the service has scaled down via the swarm visualizer at http://localhost:8080
To remove the swarm cluster, issue the following command:
./rm_swarm.sh
The following are some useful docker commands to interact with the containers in the SELECTA framework.
Show active containers:
docker-compose ps
Request an interactive shell in a running container:
docker exec -it container_name csh
List virtual volumes:
docker volume ls
List images:
docker images
Remove images:
docker rmi image_name
The SELECTA workflow includes the stages shown in the image above. Each stage runs one or more associated Python scripts. The workflow can be scheduled through cronjobs that run these scripts; note the direction of the flow in the SELECTA framework section of the diagram.
SELECTA utilises a backend PostgreSQL database for storing, accessing, and updating the key information required for tracking the various processes. The database consists of five tables:
- account
- process_selection
- process_stages
- process_attributes
- process_report
The image below presents the fields for each table:
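For a quick look at the live contents of these tables, a psql session can be opened inside the database container. This is only a sketch: the container name (ebi-selecta_postgres_1), database name (selecta), and user (postgres) are assumptions that depend on your docker-compose.yml:
docker exec -it ebi-selecta_postgres_1 psql -U postgres -d selecta -c 'SELECT * FROM process_selection;'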