Merge branch 'release/v1.0.1'
ACEnglish committed Dec 10, 2024
2 parents fdb8f62 + 48abb01 commit 64f7c78
Showing 13 changed files with 108 additions and 251 deletions.
143 changes: 9 additions & 134 deletions Cargo.lock

4 changes: 2 additions & 2 deletions Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "kanpig"
version = "1.0.0"
version = "1.0.1"
edition = "2021"

[dependencies]
@@ -20,7 +20,7 @@ page_size = { version = "0.6.0" }
petgraph = { version = "0.6.5" }
pretty_env_logger = { version = "0.5.0" }
rand = "0.8.5"
rust-htslib = { version = "0.49.0" }
rust-htslib = { version = "0.46.0" }
rust-lapper = { version = "1.1.0" }
serde = { version = "1.0", features = ["derive"] }
serde_json = { version = "1.0" }
38 changes: 22 additions & 16 deletions README.md
@@ -85,6 +85,12 @@ When performing path-finding, this threshold limits the number of paths which ar
speed up runtime but may come at a cost of recall. A higher `maxpaths` is slower and may come at a cost to
specificity.

### `--maxnodes`
If a neighborhood has too many variants, its graph will become large in memory and slow to traverse. This parameter
will turn off path-finding in favor of `--one-to-one` haplotype to variant comparison (see Experimental Parameters
below), reducing runtime and memory usage. This may reduce recall in regions with many SVs, but these regions are
problematic anyway.

### `--hapsim`
After performing kmeans clustering on reads to determine the two haplotypes, if the two haplotypes have a size similarity
above `hapsim`, they are consolidated into a homozygous allele.
@@ -121,29 +127,29 @@ Details of `FT`
# 🔌 Compute Resources

Kanpig is highly parallelized and will fully utilize all threads it is given. However, hyperthreading doesn't seem to
help and therefore the number of threads should probably be limited to the number of physical processors available.
help and therefore the number of threads should probably be limited to the number of physical processors available. For
memory, giving kanpig 2GB per-core is usually more than enough.

The actual runtime and memory usage of a kanpig run will depend on the read coverage and the number of SVs in the input
VCF. As an example of kanpig's resource usage with 16 cores available, genotyping a 30x long-read bam against a 2,199
sample VCF (4.3 million SVs) took 13 minutes with a maximum memory usage of 12GB. Converting the bam to a plup file took
4 minutes (8GB of memory) and genotyping with this plup file took 3 minutes (12GB memory).

For memory, a general rule is kanpig will need about 20x the size of the compressed `.vcf.gz`. The minimum required
memory is also dependent on the number of threads running as each will need space for its processing. For example,
a 1.6Gb vcf (~5 million SVs) using 16 cores needs at least 32Gb of RAM. That same vcf with 8 or 4 cores needs at least
24Gb and 20Gb of RAM, respectively.
While genotyping against a plup file is usually faster, bam to plup conversion is most useful for:
* genotyping a large VCF or super-high (>50x) coverage bam.
* a sample that will be genotyped multiple times (e.g. N+1 pipelines)
* long-term access to reads (a plup file is up to ~2,000x smaller than a bam)

# 🔬 Experimental Parameter Details

These parameters have a varying effect on the results and are not guaranteed to be stable across releases.

### `--try-exact`
Before performing the path-finding algorithm that applies a haplotype to the variant graph, perform a 1-to-1 comparison
of the haplotype to each node in the variant graph. If a single node matches above `sizesim` and `seqsim`, the
path-finding is skipped and haplotype applied to the node.

This parameter will boost the specificity and speed of kanpig at the cost of recall.

### `--prune`
Similar to `try-exact`, a 1-to-1 comparison is performed before path-finding. If any matches are found, all paths
which do not traverse the matching nodes are pruned from the variant graph.
### `--one-to-one`
Instead of performing the path-finding algorithm that applies a haplotype to the variant graph, perform a 1-to-1
comparison of the haplotype to each node in the variant graph. If a single node matches above `sizesim` and `seqsim`,
the haplotype is applied to it.

This parameter will boost the specificity and speed of kanpig at the cost of recall.
This parameter will boost the specificity, increase speed, and lower memory usage of kanpig at the cost of recall.
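
The `--one-to-one` comparison described above can be sketched roughly as follows. This is not kanpig's implementation: `size_similarity`, `one_to_one`, and the `(size, seq_similarity)` tuple layout are hypothetical names chosen for illustration. The sketch assumes size similarity is the ratio of the smaller to the larger length, that sequence similarity has already been computed per node, that "above" means greater than or equal to the threshold, and that "a single node" means exactly one node passes both thresholds.

```rust
/// Assumed definition: ratio of the smaller length to the larger one.
/// Two zero-length alleles are treated as identical.
fn size_similarity(a: u64, b: u64) -> f64 {
    if a == 0 && b == 0 {
        return 1.0;
    }
    a.min(b) as f64 / a.max(b) as f64
}

/// Hypothetical 1-to-1 comparison of one haplotype against every node.
/// `nodes` holds (node_size, precomputed_sequence_similarity_to_haplotype).
/// Returns the index of the node the haplotype would be applied to,
/// or `None` when zero or more than one node passes both thresholds.
fn one_to_one(hap_size: u64, nodes: &[(u64, f64)], sizesim: f64, seqsim: f64) -> Option<usize> {
    let mut hits = nodes
        .iter()
        .enumerate()
        .filter(|(_, (size, sim))| {
            size_similarity(hap_size, *size) >= sizesim && *sim >= seqsim
        })
        .map(|(i, _)| i);

    // Accept only an unambiguous match: a first hit with no second hit.
    match (hits.next(), hits.next()) {
        (Some(i), None) => Some(i),
        _ => None,
    }
}
```

Because no path through multiple nodes is ever considered, a haplotype that is only explained by a combination of variants finds no match here, which is consistent with the recall cost noted above.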

### `--maxhom`

7 changes: 6 additions & 1 deletion src/genotype_main.rs
@@ -127,9 +127,14 @@ fn task_thread(

let (haps, coverage) =
m_reads.find_pileups(&m_graph.chrom, m_graph.start, m_graph.end);

let haps = ploidy.cluster(haps, coverage, &m_args.kd);

// Only need to build the full graph sometimes
let should_build = !haps.is_empty()
&& !m_args.kd.one_to_one
&& (m_graph.node_indices.len() - 2) <= m_args.kd.maxnodes;
m_graph.build(should_build);

let mut paths: Vec<PathScore> = haps
.iter()
.map(|h| m_graph.apply_coverage(h, &m_args.kd))
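
The `should_build` condition in this hunk can be isolated as a small pure function for testing. The sketch below is a standalone illustration, not code from the repository: the function name and its flattened parameters are hypothetical. It assumes the `- 2` in the diff excludes two bookkeeping nodes from the count, and it uses `saturating_sub` as a defensive choice so that a node count below 2 cannot underflow `usize`, which the original expression would do if that case were ever reachable.

```rust
/// Hypothetical standalone version of the decision made in `task_thread`.
/// The full graph is only built when:
///   * at least one haplotype was found,
///   * `--one-to-one` is not set, and
///   * the number of variant nodes does not exceed `--maxnodes`.
fn should_build_graph(n_haps: usize, one_to_one: bool, n_nodes: usize, maxnodes: usize) -> bool {
    // Assumption: two of the nodes are not variants, mirroring the `- 2` in the diff.
    // saturating_sub clamps at zero instead of panicking or wrapping on underflow.
    let variant_nodes = n_nodes.saturating_sub(2);
    n_haps > 0 && !one_to_one && variant_nodes <= maxnodes
}
```

With `maxnodes = 10`, a graph of 12 total nodes is still built (10 variant nodes), while 13 total nodes falls back to the cheaper comparison.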
2 changes: 1 addition & 1 deletion src/kplib/annotator.rs
@@ -29,7 +29,7 @@ pub struct GenotypeAnno {
pub filt: FiltFlags,
pub sq: i32,
pub gq: i32,
pub ps: Option<u16>,
pub ps: Option<u32>,
pub dp: i32,
pub ad: IntG,
pub ks: IntG,
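
The commit does not state why `ps` was widened from `Option<u16>` to `Option<u32>`. What the change does guarantee is a larger representable range: `u16` tops out at 65,535 and `u32` at 4,294,967,295. The sketch below only demonstrates that range difference; `fits_old_ps` is a hypothetical helper, not part of kanpig.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: try to store a value in the old 16-bit field.
/// Returns `None` when the value does not fit in a `u16`.
fn fits_old_ps(value: u32) -> Option<u16> {
    u16::try_from(value).ok()
}
```

Any value above `u16::MAX` would have been unrepresentable in the old field, whereas the new `Option<u32>` holds it directly.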