Adam English edited this page Jan 11, 2025 · 84 revisions

Quick start

Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs.

truvari bench -b base_calls.vcf -c comp_calls.vcf -o output_dir/

Matching Parameters

Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow them such that it is a fair comparison.

For example, depth-of-coverage callers (such as CNVnator) have very 'fuzzy' boundaries and report only approximate regions rather than the exact deleted sequence. So thresholds of pctseq=0, pctsize=.5, pctovl=.5, refdist=1000 may seem fair.

BioGraph and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds.

If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility.
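
As a rough sketch of such a sweep, the snippet below only builds the command lines for a grid of thresholds, starting from the strict defaults. The helper name `bench_commands`, the output directory naming, and the specific lenient values are illustrative choices, not recommendations from Truvari.

```python
import itertools

def bench_commands(base_vcf, comp_vcf, out_prefix, pctseqs, pctsizes, refdists):
    """Build one `truvari bench` argument list per threshold combination."""
    commands = []
    for pctseq, pctsize, refdist in itertools.product(pctseqs, pctsizes, refdists):
        # Hypothetical naming scheme: one output directory per combination.
        out_dir = f"{out_prefix}_seq{pctseq}_size{pctsize}_ref{refdist}"
        commands.append([
            "truvari", "bench",
            "-b", base_vcf, "-c", comp_vcf, "-o", out_dir,
            "--pctseq", str(pctseq),
            "--pctsize", str(pctsize),
            "--refdist", str(refdist),
        ])
    return commands

# Strict defaults first, then progressively more lenient settings.
cmds = bench_commands("base_calls.vcf", "comp_calls.vcf", "sweep",
                      pctseqs=[0.7, 0.5, 0.0],
                      pctsizes=[0.7, 0.5],
                      refdists=[500, 1000])
for cmd in cmds:
    print(" ".join(cmd))
```

Each argument list could then be passed to `subprocess.run(cmd, check=True)`, and the resulting summary.json files compared across runs.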

Here is a rundown of each matching parameter.

Parameter Default Definition
refdist 500 Maximum distance comparison calls must be within from base call's start/end
pctseq 0.7 Edit distance ratio between the REF/ALT haplotype sequences of base and comparison call. See "Comparing Sequences of Variants" below.
pctsize 0.7 Ratio of min(base_size, comp_size)/max(base_size, comp_size)
pctovl 0.0 Ratio of two calls' (overlapping bases)/(longest span)
typeignore False Types don't need to match to compare calls.

Below are matching parameter diagrams to illustrate (approximately) how they work.

 █ = Deletion ^ = Insertion

--refdist REFDIST (500)
  Max reference location distance

    ACTGATCATGAT
     |--████--|    
          █████      
  
  Calls are within reference distance of 2

--pctsize PCTSIZE (0.7)
  Min pct allele size similarity

    ACTGATCATGA    sizes
      █████     -> 5bp
        ████    -> 4bp

  variants have 0.8 size similarity


--pctovl PCTOVL (0.0)
  Min pct reciprocal overlap

    ACTGATCATGA  ranges
      █████      [2,7)
        ████     [4,8)

  variants have 0.6 reciprocal overlap


--pctseq PCTSEQ (0.7)
  Min percent allele sequence similarity

    A-CTG-ACTG
     ^   ^       haplotypes
     |   └ACTG -> CTGACTGA
     └CTGA     -> CTGACTGA

  haplotypes have 100% sequence similarity
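
The size and overlap diagrams above can be reproduced with a few lines of Python. This is a minimal sketch of the two ratios as defined in the parameter table, not Truvari's own code; the function names are made up for illustration and ranges are assumed to be half-open `[start, end)`.

```python
def size_similarity(size_a, size_b):
    """min/max ratio of two variant sizes (the pctsize measure).

    Assumes both sizes are positive.
    """
    return min(size_a, size_b) / max(size_a, size_b)

def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Overlapping bases divided by the longest span (the pctovl measure)."""
    overlap = max(0, min(end_a, end_b) - max(start_a, start_b))
    longest = max(end_a - start_a, end_b - start_b)
    # Returning 0.0 for zero-length spans is this sketch's own choice.
    return overlap / longest if longest > 0 else 0.0

# The 5bp and 4bp deletions from the pctsize diagram.
print(size_similarity(5, 4))
# The ranges [2,7) and [4,8) from the pctovl diagram.
print(reciprocal_overlap(2, 7, 4, 8))
```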

Outputs

Truvari bench writes the following files to the --output directory.

File Description
tp-base.vcf.gz True positive calls from the base VCF
tp-comp.vcf.gz True positive calls from the comparison VCF
fp.vcf.gz False positive calls from comparison
fn.vcf.gz False negative calls from base
summary.json Json output of performance stats
params.json Json output of parameters used
candidate.refine.bed Bed file of regions for refine
log.txt Run's log

summary.json

Stats generated by benchmarking are written to summary.json.

Metric Definition
TP-base Number of matching calls from the base vcf
TP-comp Number of matching calls from the comp vcf
FP Number of non-matching calls from the comp vcf
FN Number of non-matching calls from the base vcf
precision TP-comp / (TP-comp + FP)
recall TP-base / (TP-base + FN)
f1 2 * ((recall * precision) / (recall + precision))
base cnt Number of calls in the base vcf
comp cnt Number of calls in the comp vcf
TP-comp_TP-gt TP-comp with genotype match
TP-comp_FP-gt TP-comp without genotype match
TP-base_TP-gt TP-base with genotype match
TP-base_FP-gt TP-base without genotype match
gt_concordance TP-comp_TP-gt / (TP-comp_TP-gt + TP-comp_FP-gt)
gt_matrix Base GT x Comp GT Matrix of all Base calls' best, TP match
weighted Metrics weighted by variant sequence/size similarity
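
The precision, recall, and f1 formulas in the table can be checked by hand from the four counts. The sketch below is only a restatement of those formulas; the helper names are hypothetical, and returning `None` when a denominator is zero is a choice made here rather than documented Truvari behavior.

```python
def safe_div(num, den):
    """Divide, returning None instead of raising when the denominator is 0."""
    return num / den if den else None

def bench_stats(tp_base, tp_comp, fp, fn):
    """Recompute precision, recall, and f1 from the summary counts."""
    precision = safe_div(tp_comp, tp_comp + fp)
    recall = safe_div(tp_base, tp_base + fn)
    f1 = None
    if precision is not None and recall is not None and (precision + recall) > 0:
        f1 = 2 * ((recall * precision) / (recall + precision))
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up counts, purely for illustration.
print(bench_stats(tp_base=90, tp_comp=80, fp=20, fn=10))
```

The counts could be read from summary.json with the standard `json` module, using the metric names listed in the table as keys.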

The gt_matrix is a table. For example:

"gt_matrix": {
    "(0, 1)": {
        "(0, 1)": 500,
        "(1, 1)": 10
    },
    "(1, 1)": {
        "(1, 1)": 800,
        "(0, 1)": 20
    }
}

Represents ->

comp    (0,1)   (1,1)
base
(0,1)   500     10
(1,1)   20      800 
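
A short sketch of how the nested dictionary might be flattened into the table above. The helper names are hypothetical. The diagonal fraction computed at the end is derived from the base-side matrix only, so it is not necessarily the same number as gt_concordance, which is defined on TP-comp calls.

```python
gt_matrix = {
    "(0, 1)": {"(0, 1)": 500, "(1, 1)": 10},
    "(1, 1)": {"(1, 1)": 800, "(0, 1)": 20},
}

def matrix_rows(matrix):
    """Return (comp genotype columns, [(base genotype, counts per column)])."""
    comp_gts = sorted({comp for row in matrix.values() for comp in row})
    rows = [(base, [matrix[base].get(comp, 0) for comp in comp_gts])
            for base in sorted(matrix)]
    return comp_gts, rows

def same_gt_fraction(matrix):
    """Fraction of counted base calls whose best match has the same genotype."""
    total = sum(count for row in matrix.values() for count in row.values())
    same = sum(row.get(base, 0) for base, row in matrix.items())
    return same / total if total else None

columns, rows = matrix_rows(gt_matrix)
print("base \\ comp", *columns, sep="\t")
for base, counts in rows:
    print(base, *counts, sep="\t")
print(same_gt_fraction(gt_matrix))
```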

Added annotations

The output vcfs are annotated with INFO fields and then sorted, compressed, and indexed inside of the output directory.

Anno Definition
TruScore Truvari score for similarity of match. ((pctseq + pctsize + pctovl) / 3 * 100)
PctSeqSimilarity Pct sequence similarity between this variant and its closest match
PctSizeSimilarity Pct size similarity between this variant and its closest match
PctRecOverlap Percent reciprocal overlap of the two calls
StartDistance Distance of the base call's start from comparison call's start
EndDistance Distance of the base call's end from comparison call's end
SizeDiff Difference in size of base and comp calls
GTMatch Base/comp calls' Genotypes match
MatchId Id to help tie base/comp calls together: {chunkid}.{baseid}.{compid}. See MatchIds wiki for details.

Refining bench output

As described in the refine wiki, a limitation of Truvari bench is 1-to-1 variant comparison. However, truvari refine can harmonize the variants to give them more consistent representations. A bed file named candidate.refine.bed is created by truvari bench and holds a set of regions which may benefit from refinement. To use it, run:

truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/
truvari refine --regions result/candidate.refine.bed \
               --reference reference.fasta \
               --recount --use-region-coords \
               result/

See refine wiki for details.

Comparing Sequences of Variants

Two SV insertions could theoretically be at the exact same position and exact same size, but have completely different sequences. Truvari's high specificity when matching variants is due in large part to comparing sequences. Truvari uses edlib to estimate sequence similarity of two alleles. In addition to comparing the direct (a.k.a. unmanipulated) similarity of sequences, truvari also checks rotations of sequences. The first rotation is based on the lexicographically minimum rotation of sequences, and the second is 'unrolled' sequences, which is formally described in this gist. These 'rolling' operations can be turned off by using --no-roll.

The main idea of rotating sequences is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in the 'ab' example used to describe "Reference context" below, we only need to unroll the insertion at a position by the distance between it and another variant e.g. the INS ab at POS 2 becomes identical to the INS ba at POS 1 by rolling 2-1 = 1 bases from the start to the end. Finally, to increase matching sensitivity, the lexicographically minimum rotation is also compared as we've found many instances of smaller (<100bp) insertions which should match but fail direct and unrolled sequence similarity, but are captured in this third rotation.
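
The rotation idea can be illustrated with plain string slicing. This is a toy sketch rather than Truvari's implementation: `roll` and `min_rotation` are hypothetical names, and the minimum rotation is found by brute force.

```python
def roll(seq, n):
    """Move the first n bases of seq to its end, treating seq as circular."""
    if not seq:
        return seq
    n %= len(seq)
    return seq[n:] + seq[:n]

def min_rotation(seq):
    """Lexicographically smallest rotation of seq (brute force, O(n^2))."""
    if not seq:
        return seq
    return min(roll(seq, i) for i in range(len(seq)))

# The 'ab' example: rolling 2 - 1 = 1 base turns INS ab into INS ba.
print(roll("ab", 1))
# The two insertions from the pctseq diagram are rotations of each other,
# so they share the same minimum rotation.
print(min_rotation("CTGA"), min_rotation("ACTG"))
```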

As of truvari v5.0, the unroll method is the only supported sequence comparison technique.

Symbolic alleles

As of truvari v5.0, some symbolic alts can be compared with sequence similarity turned on. <DEL> and <INV> sequences will be filled in when a --reference is provided if the variant's size is smaller than --max-resolve (default 25kbp). Additionally, <DUP> variants can have their sequences filled in if --dup-to-ins is also provided. The duplications are assumed to be perfect copies of the reference range spanned by the SV and are placed as a point insertion at the start position.

Variants that are symbolic alts (even after attempting to resolve) can be compared with sequence resolved SVs, so there is no need to specify --pctseq 0. Note, however, that the PctSeqSimilarity field will not be populated.

Symbolic Alt sub-types are ignored e.g. <DUP:TANDEM> is considered just <DUP>.

Truvari can replace the symbolic alt of resolved SVs in the output VCF with the parameter --write-resolved.

BND Comparison

Breakend (BND) variants are compared by checking a few conditions using a single threshold, --bnddist, which holds the maximum distance around a breakpoint position to search for a match. Similar to the --refdist parameter, truvari looks for overlaps between the dist 'buffered' boundaries, e.g. overlaps(POS_base - dist, POS_base + dist, POS_comp - dist, POS_comp + dist). Additionally, if the CIPOS and CIEND info tags are available in the entry, the e.g. POS is further buffered by -abs(CIPOS[0]) and +abs(CIPOS[1]).
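
A minimal sketch of the buffered overlap test described above. The function names are hypothetical, and treating the buffered ranges as closed intervals is an assumption of this sketch; Truvari's boundary handling may differ.

```python
def buffered_range(pos, dist, cipos=None):
    """Range around pos, widened by dist and, if given, by the CIPOS interval."""
    start, end = pos - dist, pos + dist
    if cipos is not None:
        start -= abs(cipos[0])
        end += abs(cipos[1])
    return start, end

def bnd_positions_overlap(base_pos, comp_pos, dist, base_cipos=None, comp_cipos=None):
    """True if the buffered ranges of the two breakpoint positions overlap."""
    b_start, b_end = buffered_range(base_pos, dist, base_cipos)
    c_start, c_end = buffered_range(comp_pos, dist, comp_cipos)
    # Closed-interval overlap (an assumption of this sketch).
    return b_start <= c_end and c_start <= b_end

print(bnd_positions_overlap(100, 150, dist=30))
print(bnd_positions_overlap(100, 200, dist=30))
print(bnd_positions_overlap(100, 200, dist=30, base_cipos=(-50, 50)))
```

The same check would apply to the joined position, using CIEND in place of CIPOS.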

The baseline and comparison BNDs' POS and their joined position must both be within --bnddist to be a match candidate (i.e. no partial matches). Furthermore, the direction and strand of the two BNDs must match, for example t[p[ (piece extending to the right of p is joined after t) only matches with t[p[ and won't match to [p[t (reverse comp piece extending right of p is joined before t).

BNDs are annotated in the truvari output with fields: StartDistance (baseline minus comparison POS); EndDistance (baseline minus comparison join position); TruScore which describes the percent of the allowed distance needed to find this match ((1 - ((abs(StartDistance) + abs(EndDistance)) / 2) / (bnddist*2)) * 100). For example, two BNDs 20bp apart with bnddist of 100 makes a score of 90.
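
The BND TruScore formula can be written out directly. The sketch below restates the formula from the text; `bnd_truscore` is a hypothetical name, and any rounding Truvari applies to the reported value is not modeled.

```python
def bnd_truscore(start_distance, end_distance, bnddist):
    """Percent of the allowed distance left over after matching two BNDs."""
    mean_distance = (abs(start_distance) + abs(end_distance)) / 2
    return (1 - mean_distance / (bnddist * 2)) * 100

# Two BNDs 20bp apart at both the POS and the join position, bnddist of 100.
# The result is approximately 90, as in the example above.
print(bnd_truscore(20, 20, 100))
```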

BND comparison can be turned off by setting --bnddist -1. Single-end BNDs (e.g. ALT=TTT.) are still ignored.

Cross-Representation Matching

Truvari considers there to be three possible representation styles of SVs.

  1. Resolved: SVs with the full REF and ALT sequences, most frequently representing INS and DEL.
  2. Symbolic: SVs without the REF or ALT sequences having an ALT of e.g. <DEL>, <DUP>, etc.
  3. BNDs: SV breakends represented with the e.g. t[p[ ALT field.

Comparing SVs across these representation styles has the following caveats:

  1. When comparing Resolved and Symbolic SVs, sequence similarity is turned off for thresholding matches. If a user provides a --reference, symbolic SVs shorter than the --max-resolve parameter (default 25kbp) can be turned into Resolved SVs (details in the API docs), and therefore the sequence similarity thresholds are still enforced.
  2. When a BND is compared with a Resolved or Symbolic SV, the SV is 'decomposed' into a set of BNDs and each is compared with the original BND. If any of the decomposed BNDs matches the original BND, the Resolved/Symbolic SV and BND are considered matching. Details of SV decomposition are in the API docs.

Note that only Deletions (symbolic or resolved), INV (symbolic or resolved), and symbolic DUPs can be decomposed into BNDs. DUPs are always decomposed into DUP:TANDEM breakends.

Because SVs decompose into multiple BNDs (2 for DEL/DUP, 4 for INV), and because --pick single is the default, a decomposed SV will only match to one BND and the BND's 'mate' will be a FN. To enable all BNDs to match to a decomposed SV, specify --pick multi.

SV decomposition into BNDs can be turned off with --no-decompose.

Controlling the number of matches

How many matches a variant is allowed to participate in is controlled by the --pick parameter. The available pickers are single, ac, and multi.

  • single (the default option) allows each variant to participate in up to one match.
  • ac uses the genotype allele count to control how many matches a variant can have. This means a homozygous alternate variant can participate in two matches (its GT is 1/1 so AC=2). A heterozygous variant can only participate in one match (GT 0/1, AC=1). And, a homozygous reference variant cannot be matched. Note that missing genotypes are considered reference alleles and do not add to the AC e.g. (GT ./1, AC=1).
  • multi variants can participate in all matches available.
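
The per-variant match limits described above can be sketched as follows. This is only an illustration of the rules in the list, with hypothetical function names; genotypes are given as tuples with `None` for a missing allele, and counting every non-zero allele toward AC is an assumption of this sketch.

```python
def allele_count(gt):
    """Alternate alleles in a genotype tuple; None (missing) counts as reference."""
    return sum(1 for allele in gt if allele is not None and allele > 0)

def max_matches(picker, gt):
    """Matches a variant may take part in; None means no limit."""
    if picker == "single":
        return 1
    if picker == "ac":
        return allele_count(gt)
    if picker == "multi":
        return None
    raise ValueError(f"unknown picker: {picker}")

for gt in [(1, 1), (0, 1), (0, 0), (None, 1)]:
    print(gt, max_matches("ac", gt))
```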

As an example, imagine we have three variants in a pVCF with two samples we want to compare.

CHROM POS      ID  REF ALT           base comp
chr20 17785968 ID1 A   ACGCGCGCGCG   1/1  1/0
chr20 17785968 ID2 A   ACGCGCGCGCGCG 0/0  0/1
chr20 17785969 ID3 C   CGCGCGCGCGCGC 0/0  1/1

To compare samples inside the same vcf, we would use the command:

truvari bench -b input.vcf.gz -c input.vcf.gz -o output/ --bSample base --cSample comp --no-ref a

This VCF produces different results depending on the --pick parameter:

Parameter ID1 State ID2 State ID3 State
single TP FP FP
ac TP TP FP
multi TP TP TP

--dup-to-ins

Most SV benchmarks only report DEL and INS SVTYPEs. The flag --dup-to-ins will interpret SVs with SVTYPE == DUP as SVTYPE == INS. Note that DUPs generally aren't sequence resolved (i.e. the ALT isn't a sequence) like INS. Therefore, --dup-to-ins typically should be used without sequence comparison via --pctseq 0.

Size filtering

--sizemax is the maximum size of a base or comparison call to be considered.

--sizemin is the minimum size of a base call to be considered.

--sizefilt is the minimum size of a comparison call that will be matched to base calls. It can be less than sizemin for edge case variants.

For example: Imagine sizemin is set at 50 and sizefilt at 30, and a 50bp base call is 98% similar to a 49bp comparison call at the same position.

These two calls could be considered matching. However, if we removed comparison calls less than sizemin, we'd incorrectly classify the 50bp base call as a false negative. Instead, we allow comparison calls between [sizefilt,sizemin) to find matches.

This has the side effect of artificially inflating specificity. For example, if that same 49bp call described above were below the similarity threshold, it would not be classified as a FP since it is below the sizemin threshold. So we're giving the call a better chance to be useful and less chance to be detrimental to final statistics.
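
The three size rules can be summarized as small predicates. This is a sketch of the behavior described in this section, not Truvari's code: the function names are hypothetical, the 50 and 30 values follow the example above, the `sizemax` value of 50000 is only an illustrative number, and treating `sizemax` as inclusive is an assumption.

```python
def base_is_considered(size, sizemin=50, sizemax=50000):
    """Base calls must fall within [sizemin, sizemax]."""
    return sizemin <= size <= sizemax

def comp_can_match(size, sizefilt=30, sizemax=50000):
    """Comparison calls as small as sizefilt may still be matched to base calls."""
    return sizefilt <= size <= sizemax

def comp_counts_as_fp(size, sizemin=50, sizemax=50000):
    """An unmatched comparison call is only reported as FP if it is at least sizemin."""
    return sizemin <= size <= sizemax

# The 49bp comparison call from the example: it may match the 50bp base call,
# but it would not be counted as a FP if it failed to match.
print(comp_can_match(49), comp_counts_as_fp(49))
```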

Include Bed & VCF Header Contigs

If an --includebed is provided, only base and comp calls contained within the defined regions are used for comparison. This is similar to pre-filtering your base/comp calls using:

(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz

with the exception that Truvari requires the start and the end to be contained in the same includebed region whereas bedtools intersect does not.
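
The difference between the two behaviors can be sketched with a containment check next to a plain overlap check. The function names are hypothetical, and regions and calls are assumed to be half-open `[start, end)` intervals as in BED files; Truvari's exact boundary handling is not shown here.

```python
def within_one_region(chrom, start, end, regions):
    """True if [start, end) lies entirely inside a single include region."""
    return any(r_chrom == chrom and r_start <= start and end <= r_end
               for r_chrom, r_start, r_end in regions)

def overlaps_any_region(chrom, start, end, regions):
    """True if [start, end) overlaps any include region at all."""
    return any(r_chrom == chrom and start < r_end and r_start < end
               for r_chrom, r_start, r_end in regions)

regions = [("chr1", 100, 200), ("chr1", 200, 300)]
# A call spanning two adjacent regions overlaps them but is contained in neither.
print(overlaps_any_region("chr1", 150, 250, regions))
print(within_one_region("chr1", 150, 250, regions))
```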

If an --includebed is not provided, the comparison is restricted to only the contigs present in the base VCF header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary statistics and will not be present in any output vcfs.

Extending an Include Bed

The option --extend extends the regions of interest (set in the --includebed argument) by the given number of bases on each side, allowing base variants to match comparison variants that are just outside of the original region. If a comparison variant is in the extended regions, it can potentially match a base variant that is in the original regions, turning it into a TP. Comparison variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching, where only the base variants longer than sizemin (equal to 50 by default) are considered, but they are allowed to match shorter comparison variants of sizefilt (30bp by default) or longer.

See this discussion for details.

Methodology

Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs.

* zip the Base and Comp calls together in sorted order
* create chunks of all calls overlapping within ±`--chunksize` basepairs
* make a |BaseCall| x |CompCall| match matrix for each chunk
* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds 
* if the chunk has no Base or Comp calls
** return them all as FNs/FPs 
* use `--pick` method to sort and annotate variants with their best match
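
The first steps of the outline can be sketched in Python as below. This is a simplified illustration and not Truvari's implementation: calls are plain `(chrom, start, end, source)` tuples, the function names are hypothetical, and the rule used to close a chunk (next call starts more than `chunksize` bp past the chunk's furthest end) is an assumption.

```python
def make_chunks(calls, chunksize):
    """Group sorted calls that lie within chunksize bp of each other."""
    chunks = []
    current, cur_chrom, cur_end = [], None, None
    for call in sorted(calls):
        chrom, start, end, _source = call
        if current and chrom == cur_chrom and start <= cur_end + chunksize:
            current.append(call)
            cur_end = max(cur_end, end)
        else:
            if current:
                chunks.append(current)
            current, cur_chrom, cur_end = [call], chrom, end
    if current:
        chunks.append(current)
    return chunks

def match_matrix(chunk):
    """|BaseCall| x |CompCall| matrix of call pairs for one chunk."""
    base = [c for c in chunk if c[3] == "base"]
    comp = [c for c in chunk if c[3] == "comp"]
    return [[(b, c) for c in comp] for b in base]

calls = [
    ("chr1", 100, 150, "base"),
    ("chr1", 120, 160, "comp"),
    ("chr1", 5000, 5050, "comp"),
    ("chr2", 100, 150, "base"),
]
for chunk in make_chunks(calls, chunksize=1000):
    matrix = match_matrix(chunk)
    # An empty matrix means the chunk lacks base or comp calls,
    # so its calls would all be FNs or FPs.
    print(len(chunk), "calls,", sum(len(row) for row in matrix), "pairs")
```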