Related article: PanSpace: Fast and Scalable Indexing for Massive Bacterial Databases
panspace is a library for creating and querying vector-based indexes for bacterial genome (draft) assemblies.
The panspace pipeline for querying works as follows:
- First, each genome is represented by the Frequency matrix of the Chaos Game Representation of its DNA (FCGR).
- Then, the FCGR is mapped to an n-dimensional vector, the embedding, using a Convolutional Neural Network called CNNFCGR, the Encoder.
- Finally, the embedding, a compressed representation of the input genome, is used to query an index of these vectors representing a bacterial pangenome.
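To make the first step concrete, here is a minimal pure-Python sketch of an FCGR. It is not the implementation used by panspace: the function names are hypothetical, and the assignment of nucleotides to corners is an assumption (tools differ on this). For k = 7, as in the encoder listed below, the matrix has 2^7 x 2^7 = 128 x 128 cells.

```python
from collections import Counter

# Assumed corner layout as (column bit, row bit); real tools may place
# the four nucleotides at different corners of the square.
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Map a k-mer to its (row, col) cell in the 2^k x 2^k FCGR grid."""
    row = col = 0
    for position, base in enumerate(kmer):
        col_bit, row_bit = CORNER_BITS[base]
        # As in the chaos game, later bases decide coarser quadrants,
        # so base i sets bit i of each coordinate.
        col |= col_bit << position
        row |= row_bit << position
    return row, col


def fcgr(sequence, k):
    """Count every k-mer over ACGT and place the counts on the grid."""
    sequence = sequence.upper()
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= set("ACGT")  # skip k-mers with N etc.
    )
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    for kmer, count in counts.items():
        row, col = kmer_to_cell(kmer)
        matrix[row][col] = count
    return matrix


matrix = fcgr("ACGTACGTTTGACN", k=2)
print(len(matrix), len(matrix[0]))  # 4 4
```

Each k-mer lands in exactly one cell, so the matrix entries sum to the number of valid k-mers in the sequence.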
The library is built on TensorFlow and Faiss.
Encoder | Kmer | Embedding Size | Download |
---|---|---|---|
CNNFCGR | 7 | 256 | Download Index |
We provide a snakemake pipeline to query a collection of genomes stored in a folder:
- Clone the repository
git clone https://github.com/pg-space/panspace.git
cd panspace
conda create -c conda-forge -c bioconda -n snakemake snakemake
conda activate snakemake
conda config --set channel_priority strict
- Set the parameters in scripts/config.yml:
  - directory with sequences (accepted extensions: .fa.gz, .fa, .fna)
  - output directory to save query results
  - gpu or cpu usage
  - path to the encoder (<path/to/encoder>.keras)
  - path to the index (<path/to/panspace-index>.index)

Finally, run
snakemake -s scripts/query.smk --cores 8 --use-conda
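Before launching the pipeline, it can help to check which files in the sequence directory carry one of the accepted extensions. The helper below is a hypothetical sketch, not part of panspace; it assumes a flat (non-recursive) directory, which may differ from how the pipeline itself collects inputs.

```python
from pathlib import Path

# Extensions listed above as accepted by the query pipeline.
ACCEPTED_EXTENSIONS = (".fa.gz", ".fa", ".fna")


def find_assemblies(directory):
    """Return the sorted files in `directory` with an accepted extension."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.name.endswith(ACCEPTED_EXTENSIONS)
    )


if __name__ == "__main__":
    # List candidate assemblies in the current directory.
    for path in find_assemblies("."):
        print(path)
```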
Optional: faster queries, recommended if you have hundreds or thousands of assemblies to query.
First install the FCGR extension to KMC3, put the path to the installed tool in the scripts/config.yml file, and run
snakemake -s scripts/query_fast.smk --cores 8 --use-conda
or pass the path directly on the command line
snakemake -s scripts/query_fast.smk --cores 8 --use-conda --config fcgr_bin=<path/to/fcgr>
NOTES
- Change the number of cores (--cores <NUM_CORES>) if you have more available; this allows the k-mer counting done by KMC3 to be parallelized across assemblies (by default kmc_threads: 2, see scripts/config.yml).
- This extension constructs FCGR representations with a C++ program that extends the KMC3 output. The default version parses the output of KMC as a dictionary of k-mer counts and then uses the Python library ComplexCGR to construct the FCGR.
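As an illustration of the default path described in the last note, the sketch below parses a text dump of k-mer counts into a dictionary. The function name is hypothetical, and the file layout (one k-mer and its count per line, separated by whitespace) is an assumption; check the actual KMC output you are working with.

```python
def parse_kmer_counts(lines):
    """Parse 'KMER COUNT' lines into a dict of k-mer counts."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields
        kmer = kmer.upper()
        counts[kmer] = counts.get(kmer, 0) + int(count)
    return counts


example = ["AAC\t3", "ACG\t1", "", "CGT\t7"]
print(parse_kmer_counts(example))  # {'AAC': 3, 'ACG': 1, 'CGT': 7}
```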
panspace requires Python >= 3.9, < 3.11.
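A quick way to check an interpreter against this range before installing is sketched below; the helper name is hypothetical and not part of panspace.

```python
import sys


def is_supported(version=None):
    """True if `version` (default: the running interpreter) is >= 3.9 and < 3.11."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 9) <= major_minor < (3, 11)


print(is_supported((3, 10)))  # True
print(is_supported((3, 11)))  # False
```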
Install with pip, with CPU support
pip install "panspace[cpu] @ git+https://github.com/pg-space/panspace.git"
with GPU support
pip install "panspace[gpu] @ git+https://github.com/pg-space/panspace.git"
Alternatively, create a conda environment, with CPU support
conda env create -f envs/cpu.yml
conda activate panspace-cpu
with GPU support
conda env create -f envs/gpu.yml
conda activate panspace-gpu
The panspace command-line interface provides commands for
- creating FCGRs from k-mer counts,
- training an encoder using metric learning (if labels are available) or an autoencoder,
- creating and querying an index of embeddings.
panspace --help
Usage: panspace [OPTIONS] COMMAND [ARGS]...
🐱 Welcome to panspace, a tool for Indexing and Querying a pan-genome in an embedding space
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or customize the installation. │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ data-curation Find outliers and mislabaled samples. │
│ docs Open documentation webpage. │
│ fcgr Create FCGRs from fasta file or from txt file with kmers and counts. │
│ index Create and query index. Utilities to test index. │
│ stats-assembly N50, number of contigs, avg length, total length. │
│ trainer Train Autoencoder/Metric Learning. Utilities. │
│ utils Extract info from text or log files │
│ what-to-do 🐱 If you are new here, check this step-by-step guide │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
- Split dataset into train, validation and test sets
panspace trainer split-dataset --help
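For intuition, a dataset split can be sketched as a seeded shuffle followed by slicing. This is only an illustration: the function name, the 80/10/10 ratios and the plain random shuffle are assumptions, and the actual command may behave differently (for example, by stratifying on labels).

```python
import random


def split_dataset(items, train=0.8, val=0.1, seed=42):
    """Shuffle `items` and split them into train, validation and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],  # the remainder becomes the test set
    )


paths = [f"assembly_{i}.fa" for i in range(10)]
train_set, val_set, test_set = split_dataset(paths)
print(len(train_set), len(val_set), len(test_set))  # 8 1 1
```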
- Train
Options
- Do you have labels for each assembly? Then use the CNNFCGR architecture with
  - metric learning with the triplet loss, or
  - metric learning with the contrastive loss.
- If you do not have labels, use unsupervised learning with the AutoencoderFCGR architecture.
panspace trainer metric-learning --help # triplet loss
panspace trainer one-shot --help # contrastive loss
panspace trainer autoencoder --help
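The two metric-learning objectives can be written down in a few lines. The sketch below uses the textbook forms of the triplet and contrastive losses on plain Python lists; the Euclidean distance, the margin of 1.0 and the function names are assumptions, and the losses used inside panspace may differ in detail.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def contrastive_loss(u, v, same_label, margin=1.0):
    """Penalize distant same-label pairs and close different-label pairs."""
    d = euclidean(u, v)
    if same_label:
        return d ** 2
    return max(margin - d, 0.0) ** 2


anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
print(triplet_loss(anchor, positive, negative))   # 0.0
print(contrastive_loss(anchor, positive, True))   # 1.0
print(contrastive_loss(anchor, negative, False))  # 0.0
```

The triplet loss needs three assemblies per training example (anchor, same label, different label), whereas the contrastive loss works on labeled pairs.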
Get the Encoder
- If using the triplet loss, the output model is the encoder.
- If using the contrastive loss, you can get the encoder with
panspace trainer extract-backbone-one-shot
- If using the autoencoder, you can get the encoder with
panspace trainer split-autoencoder
- Create Index
panspace index create --help
- Query Index
If querying is done from FCGR in numpy format, then use
panspace index query --help
To query the index directly from assemblies instead, we encourage you to use the snakemake pipelines provided above.
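Conceptually, querying the index is a nearest-neighbour search over the stored embeddings. The sketch below does this by exhaustive comparison in pure Python. It stands in for what a vector index does and is not the panspace or Faiss API: the function name, the toy 2-dimensional embeddings and the squared Euclidean distance are all assumptions.

```python
import heapq


def query_index(index, query, k=3):
    """Return the k (distance, label) pairs closest to `query`.

    `index` is a list of (label, embedding) pairs; distances are squared
    Euclidean, computed by exhaustive comparison.
    """
    scored = (
        (sum((a - b) ** 2 for a, b in zip(embedding, query)), label)
        for label, embedding in index
    )
    return heapq.nsmallest(k, scored)


index = [
    ("genome_a", [0.0, 0.0]),
    ("genome_b", [1.0, 1.0]),
    ("genome_c", [5.0, 5.0]),
]
for distance, label in query_index(index, [0.9, 1.2], k=2):
    print(label, round(distance, 2))
```

A dedicated library such as Faiss performs the same kind of search far more efficiently on large collections of high-dimensional vectors.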
panspace is developed by Jorge Avila Cartes.