SV VRS Wave 1: Ambiguity due to intrinsic sequence-based homology #429

Mrinal-Thomas-Epic · 2023-07-05T21:45:22Z

Problem statement

As Daniel laid out in issue #425, one of our goals for SV-VRS wave 1 is to come up with a simple model for low-level structural changes from a reference sequence. Part of this goal includes support for multiple kinds of ambiguity, including intrinsic sequence-based homology (i.e., ambiguous cases where the two sequences on either end of a breakpoint are identical).

Daniel drew up the following example of sequence-based homology in his 6/1 SV SIG presentation:

There are two possible approaches we are considering to handle this ambiguity and would like to get feedback on which one is preferred.

Option 1: Extending VRS style ambiguity representation to breakpoints

Starting with the simple breakend model from #425 (which excludes property data types, since those will be decided later):

"Breakend": {
  "properties": [
    "sequence",
    "pos",
    "orientation"
  ]
}

We can create a breakpoint model that includes a field ambiguous_nucleotides for a sequence of nucleotides that are considered ambiguous due to homology.

"Breakpoint": {
  "properties": [
    "breakend_1",
    "ambiguous_nucleotides",
    "breakend_2"
  ]
}

Then, we can use a simple VOCA-like algorithm to take a potentially over-precise Breakpoint representation and produce the precision-corrected representation.

Rough Algorithm

from bioutils.sequences import reverse_complement

class Breakend:
    def __init__(self, sequence: str, pos: int, orientation: str) -> None:
        self.sequence = sequence
        self.pos = pos
        self.orientation = orientation

class Breakpoint:
    def __init__(self, breakend_1: Breakend, breakend_2: Breakend, ambiguous_nucleotides: str) -> None:
        self.breakend_1 = breakend_1
        self.breakend_2 = breakend_2
        self.ambiguous_nucleotides = ambiguous_nucleotides


def normalize_breakpoint(breakpoint:Breakpoint):
    be_1 = standardize_breakend_orientation(breakpoint.breakend_1, "connectedAfterPos")
    be_2 = standardize_breakend_orientation(breakpoint.breakend_1, "connectedBeforePos")
    ambiguous_nucleotides = breakpoint.ambiguous_nucleotides

    # shift left
    dist_left = 0
    while be_1.pos - dist_left > 0 and be_2.pos - dist_left > 0 and \
            be_1.sequence[be_1.pos - dist_left - 1] == be_2.sequence[be_2.pos - dist_left - 1]:
        dist_left += 1
        ambiguous_nucleotides = be_1.sequence[be_1.pos - dist_left] + ambiguous_nucleotides
    
    # shift right
    dist_right = 0
    while be_1.pos < len(be_1.sequence) and be_2.pos < len(be_2.sequence) and \
            be_1.sequence[be_1.pos + dist_right] == be_2.sequence[be_2.pos + dist_right]:
        ambiguous_nucleotides = ambiguous_nucleotides + be_1.sequence[be_1.pos + dist_right]
        dist_right += 1
    
    if be_1.orientation == "connectedBeforePos":
        dist_left *= -1
    if be_2.orientation == "connectedAfterPos":
        dist_right *= -1
    
    be_1_normalized = Breakend(be_1.sequence, be_1.pos - dist_left, be_1.orientation)
    be_2_normalized = Breakend(be_2.sequence, be_2.pos + dist_right, be_2.orientation)
    return Breakpoint(be_1_normalized, be_2_normalized, ambiguous_nucleotides)

def standardize_breakend_orientation(breakend: Breakend, standard_orientation: str) -> Breakend:
    if breakend.orientation != standard_orientation:
        standardized_sequence = reverse_complement(breakend.sequence)
        standardized_pos = len(breakend.sequence) - breakend.pos
        return Breakend(standardized_sequence, standardized_pos, standard_orientation)
    else:
        return breakend

So the example from Daniel's presentation, could be represented as:

"breakpoint":{
  "breakend_1": {
    "sequence": "chr1",
    "pos": 123,
    "orientation": "connectedAfterPos",
  },
  "ambiguous_nucleotides": "AAAAAA",
  "breakend_2": {
    "sequence": "chr2",
    "pos": 200,
    "orientation": "connectedAfterPos",
  }
}

Option 2: Using VCF-style ambiguity

Another approach we could go with is a VCF-style uncertainty bounds. In VCF 4.4, sequence-based homology is represented with the HOMSEQ and CIPOS info keys, where CIPOS is a integer pair representing the confidence interval bounds in either direction. Section 5.4.8 of the spec gives an example of how CIPOS can be used to represent uncertainty around breakpoint position.

Adding this to our model might look something like

"Breakend": {
  "properties": [
    "sequence",
    "pos",
    "orientation",
    "confidence_interval"
  ]
}

"Breakpoint": {
  "properties": [
    "breakend_1",
    "breakend_2"
  ]
}

With this approach, we still have the issue of deciding position should be used for the "center" of the interval? For example, these two breakend representations are equivalent:

"breakend_1": {
  "sequence": "chr1",
  "pos": 123,
  "orientation": "connectedAfterPos",
  "confidence_interval": [0, 6]
}


"breakend_1": {
  "sequence": "chr1",
  "pos": 126,
  "orientation": "connectedAfterPos",
  "confidence_interval": [-3, 3]
}

Questions

What are everyone's thoughts on these two options? (Or feel free to propose a third option)
Does anyone know of existing normalization algorithms in the community that can be applied to breakpoints, or will we have to create a new one?
Should we represent this type of ambiguity using the nucleotides or just the number of nucleotides? Can the actual sequence always be calculated from the number?
If using a string of nucleotides, which strand are ambiguous nucleotides reported with respect to?
How to handle a SNP in the nucleotide sequence?
Is there a reason to do full justification? Would either just left-shift or right-shift be simpler?
How would we combine this with other types of ambiguity?

The text was updated successfully, but these errors were encountered:

d-cameron · 2023-07-06T12:23:49Z

I think we can get away without any left/right normalisation because we can directly report the entire intervals. E.g. the example above would be represented as:

breakpoint: {
  breakends: [
    {chr1:123-129, DivergesAfter},
    {chr2:200-206, DivergesAfter}],
  homology: true }

This avoids the VCF issue of having to report a 'nominal' position - all positions are equally valid.

What we should do is define a normalisation process for eliminating inserted bases. That is:

breakpoint: {
  breakends: [
    {chr1:123, DivergesAfter, sequence: AAAAAA},
    {chr2:200, DivergesAfter}],
  homology: true }

should be normalised such that if the sequence (AAAAAA) matches the reference then the breakpoint position should be adjusted accordingly.

That is, the above example should be normalised through the following operations:

Anchor inserted base:

breakpoint: {
  breakends: [
    {chr1:124, DivergesAfter, sequence: AAAAA},
    {chr2:206, DivergesAfter}],
}

Anchor inserted base:

breakpoint: {
  breakends: [
    {chr1:125, DivergesAfter, sequence: AAAA},
    {chr2:200, DivergesAfter}],
 }

Anchor inserted base:

breakpoint: {
  breakends: [
    {chr1:126, DivergesAfter, sequence: AAA},
    {chr2:200, DivergesAfter}],
}

Anchor inserted base:

breakpoint: {
  breakends: [
    {chr1:127, DivergesAfter, sequence: AA},
    {chr2:200, DivergesAfter}],
}

Anchor inserted base:

breakpoint: {
  breakends: [
    {chr1:128, DivergesAfter, sequence: A},
    {chr2:200, DivergesAfter}],
  }

Anchor inserted base:

breakpoint: {
  breakends: [
    {chr1:129, DivergesAfter},
    {chr2:200, DivergesAfter}],
}

Calculate homology bounds:

breakpoint: {
  breakends: [
    {chr1:129, DivergesAfter},
    {chr2:200, DivergesAfter}],
  homology: true }

Repeatedly shift & consume inserted base on either breakend if it matches the closest inserted base
Calculate the homology interval for any breakpoint with no inserted bases

Mrinal-Thomas-Epic · 2023-07-17T14:32:24Z

I agree that we would need to delete the inserted bases that were equal to the reference (on either side). I think the VOCA algorithm has to deal with the same issue and they call that step "trimming". So our overall algorithm, could be:

Trim along first breakend
Trim along second breakend
Expand "left" where the sequence along first breakend is equal to sequence along second breakend
Expand "right" where the sequence along first breakend is equal to sequence along second breakend

The algorithm in my example was just handling the expand steps, but it would pretty easy to also add trim. Also, maybe "left" and "right" aren't the best words to use here, so feel free to suggest better terminology.

I also think your proposal to report the whole range of possible positions would work well, but I'm curious to hear what others would prefer.

One question I had about your approach is how you would handle cases where there is some homology, but also inserted nucleotides. For example, how would you handle this?

breakpoint: {
  breakends: [
    {chr1:123, DivergesAfter, sequence: AAAAAAA},
    {chr2:200, DivergesAfter}
  ]
}

I think in the proposed algorithm, it would go something like this:

Trim along first breakend

breakpoint: {
  breakends: [
    {chr1:129, DivergesAfter, sequence: A},
    {chr2:200, DivergesAfter}
  ]
}

Trim along second breakend (no change)

breakpoint: {
  breakends: [
    {chr1:129, DivergesAfter, sequence: A},
    {chr2:200, DivergesAfter}
  ]
}

Expand "left"

breakpoint: {
  breakends: [
    {chr1:123-129, DivergesAfter, sequence: A},
    {chr2:200-206, DivergesAfter}
  ],
  homology: true
}

Expand "right" (no change)

breakpoint: {
  breakends: [
    {chr1:123-129, DivergesAfter, sequence: A},
    {chr2:200-206, DivergesAfter}
  ],
  homology: true
}

Is this what you were thinking?

ahwagner added Docs Requirements issue for gathering technical requirements SV labels Nov 27, 2023

ahwagner mentioned this issue Apr 19, 2024

Epic #480 "Create Adjacency" #488

Merged

ahwagner closed this as completed Aug 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SV VRS Wave 1: Ambiguity due to intrinsic sequence-based homology #429

SV VRS Wave 1: Ambiguity due to intrinsic sequence-based homology #429

Mrinal-Thomas-Epic commented Jul 5, 2023 •

edited

Loading

d-cameron commented Jul 6, 2023

Mrinal-Thomas-Epic commented Jul 17, 2023 •

edited

Loading

SV VRS Wave 1: Ambiguity due to intrinsic sequence-based homology #429

SV VRS Wave 1: Ambiguity due to intrinsic sequence-based homology #429

Comments

Mrinal-Thomas-Epic commented Jul 5, 2023 • edited Loading

Problem statement

Option 1: Extending VRS style ambiguity representation to breakpoints

Option 2: Using VCF-style ambiguity

Questions

d-cameron commented Jul 6, 2023

Mrinal-Thomas-Epic commented Jul 17, 2023 • edited Loading

Mrinal-Thomas-Epic commented Jul 5, 2023 •

edited

Loading

Mrinal-Thomas-Epic commented Jul 17, 2023 •

edited

Loading