Merge branch 'release/v0.3.0'

ACEnglish · Jun 11, 2024 · 705f05d · 705f05d
2 parents 9e07a25 + 3b10b52
commit 705f05d
Show file tree

Hide file tree

Showing 24 changed files with 840 additions and 2,578 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,24 +1,24 @@
 [package]
 name = "kanpig"
-version = "0.2.0"
+version = "0.3.0"
 edition = "2021"
 
 [dependencies]
-bitflags = "2.5.0"
+bitflags = { version = "2.5.0" }
 clap = { version = "4.0", features = ["derive"] }
 crossbeam-channel = { version = "0.5.12" }
-indexmap = "2.2.3"
-indicatif = "0.17.8"
+indexmap = { version = "2.2.3" }
+indicatif = { version = "0.17.8" }
 itertools = { version = "0.12.1" }
-lazy_static = "1.4.0"
+lazy_static = { version = "1.4.0" }
 log = { version = "0.4", features = ["std", "serde"] }
-noodles-vcf = { version = "0.49.0" }
+noodles-vcf = { version = "0.57.0" }
 ordered-float = { version = "4.0", default-features = false }
-page_size = "0.6.0"
+page_size = { version = "0.6.0" }
 petgraph = { version = "0.6.2" }
 pretty_env_logger = { version = "0.4.0" }
 rust-htslib = { version = "0.46.0" }
-rust-lapper = "1.1.0"
+rust-lapper = { version = "1.1.0" }
 
 [profile.release]
 opt-level = 3

diff --git a/README.md b/README.md
@@ -6,7 +6,7 @@ A fast tool for genotyping structural variants with long-reads.
 *Kanpig is currently under active research and development. We make no guarantees about its accuracy or the stability of features 
 before version 1.0.*
 
-# Install
+# 📥 Install
 ```
 git clone https://github.com/ACEnglish/kanpig
 cd kanpig
@@ -15,20 +15,20 @@ cargo build --release
 ```
 Alternatively, binaries are available in [releases](https://github.com/ACEnglish/kanpig/releases).
 
-# Quick Start
+# 🚀 Quick Start
 ```
 kanpig --input variant.vcf.gz --bam alignments.bam --reference ref.fa --out output.vcf
 ```
 See `kanpig -h` for all available parameters, most of which are detailed below.
 
-# Current Limitations
+# ⚠️ Current Limitations
 * Kanpig expects sequence resolved SVs. Variants with symbolic alts (e.g. `<DEL>`) and BNDs are not parsed.
 * Kanpig only looks at read pileups and does not consider split or soft-clipped alignment information. This means
   variants above ~10kbp should be skipped with the `--sizemax` parameter.
-* Please do not publish manuscripts with kanpig results until we've completed our manuscript. We're aiming to have a preprint 
+* Please do not publish manuscripts benchmarking kanpig results until we've completed our manuscript. We're aiming to have a preprint 
 available early Q3 2024.
 
-# Core Parameter Details
+# 🔧 Core Parameter Details
 
 The default parameters are tuned to work generally well for genotyping a single sample's VCF, meaning the variants are
 all expected to be present in the sample. For a multi-sample VCF (a.k.a. a project-level VCF), the optimal parameters
@@ -41,85 +41,96 @@ Sorted bed file restricts kanpig to only analyzing variants with starts and ends
 ### `--ploidy-bed`
 This bed file informs kanpig of special regions within chromosomes that should have non-diploid genotypes. For example, a female
 human sample shouldn't have any genotypes on chrY. A male human sample should have hemizygous genotypes on chrY and the
-non-pseudoautosomal regions of chrX. The `ploidy_beds/` directory has example bed files for GRCh38. All regions not
-within the `ploidy-bed` (or if no bed is provided) are assumed to be diploid.
-
-### `--hapsim`
-After performing kmeans clustering on reads to determine the two haplotypes, if the two haplotypes have a size similarity above `hapsim`, they
-are consolidated into a homozygous allele.
+non-pseudoautosomal regions of chrX. The [ploidy_beds/](https://github.com/ACEnglish/kanpig/tree/develop/ploidy_beds) directory 
+has example bed files for GRCh38. All regions not within the `--ploidy-bed` (or if no bed is provided) are assumed to be diploid.
 
 ### `--chunksize`
-Kanpig will build local variant graphs from windows of the genome. These windows are determined by `chunksize` where
-the maximum end position of an upstream window's variants is at least `chunksize` base-pairs away from the next window's
-variants' minimum start position.
+Kanpig will build local variant graphs from windows of the genome. These windows are determined by making the maximum end position
+of an upstream window's variants at least `chunksize` base-pairs away from the next window's variants' minimum start position.
 
-This chunksize also determins the region over which read pileups are generated. Only reads which pass the `--mapq` and
-`--mapflag` filter are considered. Also, reads must fully span the minimum variant start and maximum variant end. 
+This chunksize also determines the region over which read pileups are generated. Only reads with at least `mapq` mapping quality, 
+passing the `mapflag` filter, and which fully span the minimum variant start and maximum variant end are considered. 
 
-This is an important parameter because too small of a `chunksize` may not recruit read pileups which support variants
-but are far away. Similarly, too large of a value may create windows with many SVs which are also too large for reads to have a
-fully-spanning alignment. 
+This is an important parameter because too small of a `chunksize` may not recruit distant read pileups which support variants. Similarly, 
+too large of a value may create windows with many SVs which are also too large for reads to fully-span. 
 
 ### `--sizemin` and `--sizemax`
-Variant sizes are determined by `INFO/SVLEN`. If `INFO/SVLEN` tag is not in the VCF entry, the variant's size is set as
-`abs(length(ALT) - length(REF))`.
+Variant sizes are determined by `abs(length(ALT) - length(REF))`. Genotypes of variants not within the size boundaries are set to missing (`./.`).
 
 ### `--sizesim` and `--seqsim`
 When applying a haplotype to a variant graph, only paths above these two thresholds are allowed. If there are multiple
-paths above the threshold, the one with the higher `(sizesim + seqsim) / 2` is kept. Generally, `0.90` is well balanced
+paths above the threshold, the one with the highest `(sizesim + seqsim) / 2` is kept. Generally, `0.90` is well balanced
 whereas lower thresholds will boost recall at the cost of precision and vice versa for higher thresholds.
 
 ### `--maxpaths`
-When performing path-finding, this threshold limits the number of paths which are checked. A lower `--maxpaths` will
-speed up runtime but may come at a cost of recall. A higher `--maxpaths` is slower and may come at a cost to
+When performing path-finding, this threshold limits the number of paths which are checked. A lower `maxpaths` will
+speed up runtime but may come at a cost of recall. A higher `maxpaths` is slower and may come at a cost to
 specificity.
 
+### `--hapsim`
+After performing kmeans clustering on reads to determine the two haplotypes, if the two haplotypes have a size similarity 
+above `hapsim`, they are consolidated into a homozygous allele.
+
 ### `--threads`
 Number of analysis threads to use. Note that in addition to the analysis threads, kanpig keeps one dedicated IO thread
 for VCF reading and writing.
 
-# Annotations
+# 📝 Annotations
 
 The `SAMPLE` column fields populated by kanpig are:
 
-* FT - Bit flag for properties of the variant's genotyping. Flags == 0 are considered PASS. The bits definitions are:
-  * 0x1 - The genotype observed from variants matching paths is not equal to the genotype observed from measuring the
-  proportions of reads supporting the two alleles.
-  * 0x2 - The genotype quality is less than 5
-  * 0x4 - The depth (DP) is less than 5
-  * 0x8 - The sample quality (SQ) is less than 5 (only present on non-ref variants)
-  * 0x16 - The number of reads supporting the alternate allele less than 5 (only present on non-ref variants)
-  * 0x32 - The best scoring path through the variant graph only used part of the haplotype. This may be indicative of a
-    false-negative in the variant graph.
-* SQ - Phred scaled likelihood variant alternate is present in the sample
-* GQ - Phred scale difference between most and second-most likely genotypes
-* PG - Each chunk of variants is assigned a phase group
-* DP - Read coverage over the region
-* AD - Read coverage supporting the reference and alternate alleles.
-* SZ - Size similarity of the two haplotypes to this variant
-* SS - Sequence similarity of the two haplotypes to this variant
-
-# Experimental Parameter Details
+| Field   | Description |
+|---------|-------------|
+| **FT**  | Bit flag for properties of the variant's genotyping. Flags == 0 are considered PASS. |
+| **SQ**  | Phred scaled likelihood variant alternate is present in the sample |
+| **GQ**  | Phred scale difference between most and second-most likely genotypes |
+| **PS**  | Each chunk of variants is assigned a phase set |
+| **DP**  | Read coverage over the region |
+| **AD**  | Read coverage supporting the reference and alternate alleles. |
+| **SZ**  | Size similarity of the two haplotypes to this variant |
+| **SS**  | Sequence similarity of the two haplotypes to this variant |
+
+Details of `FT`
+| Flag   | Description |
+|--------|-------------|
+| 0x1    | The genotype observed from variants matching paths is not equal to the genotype observed from measuring the proportions of reads supporting the two alleles. |
+| 0x2    | The genotype quality is less than 5 |
+| 0x4    | The depth (DP) is less than 5 |
+| 0x8    | The sample quality (SQ) is less than 5 (only present on non-ref variants) |
+| 0x16   | The number of reads supporting the alternate allele less than 5 (only present on non-ref variants) |
+| 0x32   | The best scoring path through the variant graph only used part of the haplotype. This may be indicative of a false-negative in the variant graph. |
+
+# 🔌 Compute Resources
+
+Kanpig is highly parallelized and will fully utilize all threads it is given. However, hyperthreading doesn't seem to
+help and therefore the number of threads should probably be limited to the number of physical processors available. 
+
+For memory, a general rule is kanpig will need about 20x the size of the compressed `.vcf.gz`. The minimum required 
+memory is also dependent on the number of threads running as each will need space for its processing. For example, 
+a 1.6Gb vcf (~5 million SVs) using 16 cores needs at least 32Gb of RAM. That same vcf with 8 or 4 cores needs at least
+ 24Gb and 20Gb of RAM, respectively. 
+
+# 🔬 Experimental Parameter Details
 
 These parameters have a varying effect on the results and are not guaranteed to be stable across releases. 
 
 ### `--try-exact`
-Before performing the path-finding algorithm that applies haplotypes to the variant graph, perform a 1-to-1 comparison
-of the haplotypes to each node in the variant graph. If a single node matches above `--sizesim` and `--seqsim`, the
-path-finding is skipped and haplotype applied to support the node. 
+Before performing the path-finding algorithm that applies a haplotype to the variant graph, perform a 1-to-1 comparison
+of the haplotype to each node in the variant graph. If a single node matches above `sizesim` and `seqsim`, the
+path-finding is skipped and haplotype applied to the node. 
 
 This parameter will boost the specificity and speed of kanpig at the cost of recall.
 
 ### `--prune`
-Similar to `--try-exact`, a 1-to-1 comparison is performed before path-finding. If any matches are found, all paths
+Similar to `try-exact`, a 1-to-1 comparison is performed before path-finding. If any matches are found, all paths
 which do not traverse the matching nodes are pruned from the variant graph. 
 
 This parameter will boost the specificity and speed of kanpig at the cost of recall.
 
 ### `--maxhom`
 
-When performing kmer-featurization of sequences (from reads or variants), homopolymer runs above `--maxhom` are trimmed
-to `--maxhom`. For example, `--maxhom 5` will only count two four-mers in all homopolymer runs above 5bp.
+When performing kmer-featurization of sequences (from reads or variants), homopolymer runs above `maxhom` are trimmed
+to `maxhom`. For example, `--maxhom 5` will only count two four-mers in homopolymer runs above 5bp.
 
 ### `--spanoff`
 

diff --git a/changelog.md b/changelog.md