pnc-rs

Fast multithreaded Rust implementation of PNC (Parallel Neighbourhood Correlation). EXPERIMENTAL.

Based on pnc-py, the PNC variant of the Neighbourhood Correlation (NC) algorithm [1]. The PNC algorithm is described in my M.Sc. thesis [2]. Its time complexity is better than that of snc and snc-cpp by a factor of log(m), where m is the number of alignments to the reference database per query sequence (assuming an equal number of alignments per query). See my thesis for details.

Performance

Aligner settings (Diamond v. 2.1.3): diamond blastp --max-target-seqs 100 --sensitive --min-score 30

CPU: 8 cores, 16 threads.

| Sample | Proteins | Alignments | snc-py | snc-cpp | pnc-py | pnc-rs |
|---|---|---|---|---|---|---|
| S. cerevisiae | 6 k | 36 k | 6 s | 1 s | 0 s | 0 s |
| HSA+MMU | 38 k | 1.2 M | 701 s | 69 s | 32 s | 1 s |
| 13_proteomes | 411 k | 24 M | est. 36540 s | 8864 s | 880 s | 25 s |
| UniRef50 [subsampled] | 4 M | 176 M | N/A | N/A | N/A | 696 s |

(Aligner running time is not included. snc-py progresses at a steady, linear rate during its main computation step, so the estimated time for 13_proteomes is a reliable lower bound for the actual time required.)

"13_proteomes" refers to: E. coli, S. cerevisiae, M. musculus, D. melanogaster, B. subtilis, R. norvegicus, H. sapiens, A. thaliana, D. rerio, C. elegans, D. discoideum, B. taurus, and O. sativa; downloaded from UniProt (2024).

Peak memory usage for pnc-rs on the subsampled UniRef50 dataset was about 58 GB. The dominating term is the 4+4+4 bytes (two u32 query ids and one 32-bit float) used to store each cross-term (query_x_score * query_y_score) needed to complete the final NC calculation step.
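To make the 12-bytes-per-cross-term figure concrete, here is a minimal sketch of what such a record could look like and how its size translates into memory. The struct name, field names, and helper functions are hypothetical and are not taken from the pnc-rs source; `repr(C)` is used here only to pin the layout so the 12-byte size is guaranteed.

```rust
use std::mem::size_of;

/// Hypothetical layout of one stored cross-term (names are illustrative,
/// not taken from pnc-rs): two query ids and one product.
/// `repr(C)` fixes the field order; with 4-byte alignment there is no padding.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct CrossTerm {
    pub query_x: u32,
    pub query_y: u32,
    pub product: f32, // query_x_score * query_y_score
}

/// Bytes needed to hold `n` cross-terms in a flat array,
/// ignoring any container or hash map overhead.
pub fn cross_term_bytes(n: u64) -> u64 {
    n * size_of::<CrossTerm>() as u64
}

/// Upper bound on the number of cross-terms that fit in `bytes` of memory.
pub fn max_cross_terms(bytes: u64) -> u64 {
    bytes / size_of::<CrossTerm>() as u64
}
```

Under these assumptions, `max_cross_terms(58_000_000_000)` gives a rough upper bound of about 4.8 billion cross-terms for the 58 GB run. The real count is lower, since the peak also includes other data structures.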

The outputs of all 4 tools were sorted and compared (for HSA+MMU). Each entry deviated from the mean by at most 0.00075 (the valid range of NC-scores is 0-1), which is reasonable given the number of significant digits and the use of 32-bit floats during the calculations in both PNC versions.
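A comparison of this kind can be sketched as follows. This is not the script that produced the figure above; it assumes the scores for each sequence pair have already been matched up across tools, so that each row holds one NC-score per tool. The function names are hypothetical.

```rust
/// For one sequence pair: the largest absolute deviation of any tool's
/// score from the mean over all tools. Returns 0.0 for an empty row.
pub fn max_deviation_from_mean(scores: &[f64]) -> f64 {
    if scores.is_empty() {
        return 0.0;
    }
    let mean = scores.iter().sum::<f64>() / scores.len() as f64;
    scores.iter().map(|s| (s - mean).abs()).fold(0.0, f64::max)
}

/// Largest deviation over all entries. Each row holds the scores that the
/// different tools reported for the same sequence pair.
pub fn max_deviation(rows: &[Vec<f64>]) -> f64 {
    rows.iter()
        .map(|row| max_deviation_from_mean(row))
        .fold(0.0, f64::max)
}
```

Calling `max_deviation` on the matched rows and checking the result against a tolerance such as 0.00075 reproduces the kind of check described above.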

Usage (on a single computer)

Compile with Rust: cargo build -r.

Run pnc ALIGNMENTS_FILE > output.nc, where ALIGNMENTS_FILE is a file containing alignments in blast-tab format (entry format: QUERY_ACC<whitespace>REF_ACC<whitespace>SCORE<newline>).

pnc outputs the computed NC-scores to stdout and various info-messages to stderr.

There are no command line options available at this time.
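The ALIGNMENTS_FILE entry format described above (QUERY_ACC, REF_ACC, SCORE, separated by whitespace) can be illustrated with a small parser. This is a sketch, not the parser used by pnc: the `Alignment` type and `parse_alignment` function are hypothetical, and treating the score as a 32-bit float and rejecting lines with more or fewer than three fields are assumptions.

```rust
/// One alignment record: query accession, reference accession, score.
/// (Illustrative type, not taken from pnc-rs.)
#[derive(Debug, Clone, PartialEq)]
pub struct Alignment {
    pub query: String,
    pub reference: String,
    pub score: f32,
}

/// Parse one line of the form QUERY_ACC<whitespace>REF_ACC<whitespace>SCORE.
/// Returns None if the line does not have exactly three fields or if the
/// score is not a valid number.
pub fn parse_alignment(line: &str) -> Option<Alignment> {
    let mut fields = line.split_whitespace();
    let query = fields.next()?.to_string();
    let reference = fields.next()?.to_string();
    let score = fields.next()?.parse::<f32>().ok()?;
    if fields.next().is_some() {
        return None; // assumption: extra columns are treated as an error
    }
    Some(Alignment { query, reference, score })
}
```

Because `split_whitespace` accepts both tabs and spaces and ignores the trailing newline, each line of the file can be passed to `parse_alignment` as read.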

Usage (as a distributed computation over TCP)

Compile with Rust: cargo build -r.

Create a newline-separated list of memory node process IP:PORT entries. Save to e.g. ip_addrs.

Allocate N+1 nodes for the job, where N is the number of memory nodes. I recommend experimenting with subsamples of your dataset to determine memory requirements.

Start at least one pnc-tcphashmap-node <IP:PORT> process on each memory node, so that each entry in ip_addrs has a corresponding process running. pnc assumes each node has an equal amount of memory.

DO NOT EVER EXPOSE THIS PROCESS TO THE INTERNET. USE A FIREWALL IF NECESSARY.

Run pnc ALIGNMENTS_FILE ip_addrs > output.nc on the main node. Use the fastest available file system to store the output - it is a likely bottleneck even though the writing is single-threaded.

pnc will shut down the pnc-tcphashmap-node processes one by one as they're drained of data.

There are no command line options available at this time.
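It can be useful to validate the ip_addrs file from the steps above before starting a job. The sketch below parses a newline-separated list of IP:PORT entries using the standard library's `SocketAddr`. It is not the code pnc uses to read the file; the function name is hypothetical, and skipping blank lines is an assumption.

```rust
use std::net::SocketAddr;

/// Parse a newline-separated list of IP:PORT entries.
/// Blank lines are skipped (an assumption); any other malformed entry
/// yields an error that names the offending line.
pub fn parse_node_list(contents: &str) -> Result<Vec<SocketAddr>, String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| {
            line.parse::<SocketAddr>()
                .map_err(|e| format!("bad entry {line:?}: {e}"))
        })
        .collect()
}
```

Reading the file with `std::fs::read_to_string` and passing the result to `parse_node_list` gives one address per memory node process, or an error for an entry such as a bare IP with no port.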

References

[1] Song N, Joseph JM, Davis GB, Durand D (May 2008). Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins. PLOS Computational Biology. URL: https://doi.org/10.1371/journal.pcbi.1000063.

[2] Eldfjell Y (Dec 2023). Neighborhood Correlation in the age of megagenomics: scaling down to scale up [M.Sc. thesis]. Stockholm University. URL: https://kurser.math.su.se/pluginfile.php/105616/mod_folder/content/0/2023/Yrin_Eldfjell_MSc_datalogi_2023.pdf
