Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SV VRS Wave 1: Ambiguity due to intrinsic sequence-based homology #429

Closed
Mrinal-Thomas-Epic opened this issue Jul 5, 2023 · 2 comments
Closed
Labels
Docs Requirements issue for gathering technical requirements SV

Comments

@Mrinal-Thomas-Epic
Copy link
Contributor

Mrinal-Thomas-Epic commented Jul 5, 2023

Problem statement

As Daniel laid out in issue #425, one of our goals for SV-VRS wave 1 is to come up with a simple model for low-level structural changes from a reference sequence. Part of this goal includes support for multiple kinds of ambiguity, including intrinsic sequence-based homology (i.e., ambiguous cases where the two sequences on either end of a breakpoint are identical).

Daniel drew up the following example of sequence-based homology in his 6/1 SV SIG presentation:
image

There are two possible approaches we are considering to handle this ambiguity and would like to get feedback on which one is preferred.

Option 1: Extending VRS style ambiguity representation to breakpoints

Starting with the simple breakend model from #425 (which excludes property data types, since those will be decided later):

"Breakend": {
  "properties": [
    "sequence",
    "pos",
    "orientation"
  ]
}

We can create a breakpoint model that includes a field ambiguous_nucleotides for a sequence of nucleotides that are considered ambiguous due to homology.

"Breakpoint": {
  "properties": [
    "breakend_1",
    "ambiguous_nucleotides",
    "breakend_2"
  ]
}

Then, we can use a simple VOCA-like algorithm to take a potentially over-precise Breakpoint representation and produce the precision-corrected representation.

Rough Algorithm
from bioutils.sequences import reverse_complement

class Breakend:
    def __init__(self, sequence: str, pos: int, orientation: str) -> None:
        self.sequence = sequence
        self.pos = pos
        self.orientation = orientation

class Breakpoint:
    def __init__(self, breakend_1: Breakend, breakend_2: Breakend, ambiguous_nucleotides: str) -> None:
        self.breakend_1 = breakend_1
        self.breakend_2 = breakend_2
        self.ambiguous_nucleotides = ambiguous_nucleotides


def normalize_breakpoint(breakpoint:Breakpoint):
    be_1 = standardize_breakend_orientation(breakpoint.breakend_1, "connectedAfterPos")
    be_2 = standardize_breakend_orientation(breakpoint.breakend_1, "connectedBeforePos")
    ambiguous_nucleotides = breakpoint.ambiguous_nucleotides

    # shift left
    dist_left = 0
    while be_1.pos - dist_left > 0 and be_2.pos - dist_left > 0 and \
            be_1.sequence[be_1.pos - dist_left - 1] == be_2.sequence[be_2.pos - dist_left - 1]:
        dist_left += 1
        ambiguous_nucleotides = be_1.sequence[be_1.pos - dist_left] + ambiguous_nucleotides
    
    # shift right
    dist_right = 0
    while be_1.pos < len(be_1.sequence) and be_2.pos < len(be_2.sequence) and \
            be_1.sequence[be_1.pos + dist_right] == be_2.sequence[be_2.pos + dist_right]:
        ambiguous_nucleotides = ambiguous_nucleotides + be_1.sequence[be_1.pos + dist_right]
        dist_right += 1
    
    if be_1.orientation == "connectedBeforePos":
        dist_left *= -1
    if be_2.orientation == "connectedAfterPos":
        dist_right *= -1
    
    be_1_normalized = Breakend(be_1.sequence, be_1.pos - dist_left, be_1.orientation)
    be_2_normalized = Breakend(be_2.sequence, be_2.pos + dist_right, be_2.orientation)
    return Breakpoint(be_1_normalized, be_2_normalized, ambiguous_nucleotides)

def standardize_breakend_orientation(breakend: Breakend, standard_orientation: str) -> Breakend:
    if breakend.orientation != standard_orientation:
        standardized_sequence = reverse_complement(breakend.sequence)
        standardized_pos = len(breakend.sequence) - breakend.pos
        return Breakend(standardized_sequence, standardized_pos, standard_orientation)
    else:
        return breakend

So the example from Daniel's presentation, could be represented as:

"breakpoint":{
  "breakend_1": {
    "sequence": "chr1",
    "pos": 123,
    "orientation": "connectedAfterPos",
  },
  "ambiguous_nucleotides": "AAAAAA",
  "breakend_2": {
    "sequence": "chr2",
    "pos": 200,
    "orientation": "connectedAfterPos",
  }
}

Option 2: Using VCF-style ambiguity

Another approach we could go with is a VCF-style uncertainty bounds. In VCF 4.4, sequence-based homology is represented with the HOMSEQ and CIPOS info keys, where CIPOS is a integer pair representing the confidence interval bounds in either direction. Section 5.4.8 of the spec gives an example of how CIPOS can be used to represent uncertainty around breakpoint position.

Adding this to our model might look something like

"Breakend": {
  "properties": [
    "sequence",
    "pos",
    "orientation",
    "confidence_interval"
  ]
}

"Breakpoint": {
  "properties": [
    "breakend_1",
    "breakend_2"
  ]
}

With this approach, we still have the issue of deciding position should be used for the "center" of the interval? For example, these two breakend representations are equivalent:

"breakend_1": {
  "sequence": "chr1",
  "pos": 123,
  "orientation": "connectedAfterPos",
  "confidence_interval": [0, 6]
}


"breakend_1": {
  "sequence": "chr1",
  "pos": 126,
  "orientation": "connectedAfterPos",
  "confidence_interval": [-3, 3]
}

Questions

  • What are everyone's thoughts on these two options? (Or feel free to propose a third option)
  • Does anyone know of existing normalization algorithms in the community that can be applied to breakpoints, or will we have to create a new one?
  • Should we represent this type of ambiguity using the nucleotides or just the number of nucleotides? Can the actual sequence always be calculated from the number?
  • If using a string of nucleotides, which strand are ambiguous nucleotides reported with respect to?
  • How to handle a SNP in the nucleotide sequence?
  • Is there a reason to do full justification? Would either just left-shift or right-shift be simpler?
  • How would we combine this with other types of ambiguity?
@d-cameron
Copy link

I think we can get away without any left/right normalisation because we can directly report the entire intervals. E.g. the example above would be represented as:

breakpoint: {
  breakends: [
    {chr1:123-129, DivergesAfter},
    {chr2:200-206, DivergesAfter}],
  homology: true }

This avoids the VCF issue of having to report a 'nominal' position - all positions are equally valid.

What we should do is define a normalisation process for eliminating inserted bases. That is:

breakpoint: {
  breakends: [
    {chr1:123, DivergesAfter, sequence: AAAAAA},
    {chr2:200, DivergesAfter}],
  homology: true }

should be normalised such that if the sequence (AAAAAA) matches the reference then the breakpoint position should be adjusted accordingly.

That is, the above example should be normalised through the following operations:

Anchor inserted base:

breakpoint: {
  breakends: [
    {chr1:124, DivergesAfter, sequence: AAAAA},
    {chr2:206, DivergesAfter}],
}

Anchor inserted base:

breakpoint: {
  breakends: [
    {chr1:125, DivergesAfter, sequence: AAAA},
    {chr2:200, DivergesAfter}],
 }

Anchor inserted base:

breakpoint: {
  breakends: [
    {chr1:126, DivergesAfter, sequence: AAA},
    {chr2:200, DivergesAfter}],
}

Anchor inserted base:

breakpoint: {
  breakends: [
    {chr1:127, DivergesAfter, sequence: AA},
    {chr2:200, DivergesAfter}],
}

Anchor inserted base:

breakpoint: {
  breakends: [
    {chr1:128, DivergesAfter, sequence: A},
    {chr2:200, DivergesAfter}],
  }

Anchor inserted base:

breakpoint: {
  breakends: [
    {chr1:129, DivergesAfter},
    {chr2:200, DivergesAfter}],
}

Calculate homology bounds:

breakpoint: {
  breakends: [
    {chr1:129, DivergesAfter},
    {chr2:200, DivergesAfter}],
  homology: true }
  • Repeatedly shift & consume inserted base on either breakend if it matches the closest inserted base
  • Calculate the homology interval for any breakpoint with no inserted bases

@Mrinal-Thomas-Epic
Copy link
Contributor Author

Mrinal-Thomas-Epic commented Jul 17, 2023

I agree that we would need to delete the inserted bases that were equal to the reference (on either side). I think the VOCA algorithm has to deal with the same issue and they call that step "trimming". So our overall algorithm, could be:

  1. Trim along first breakend
  2. Trim along second breakend
  3. Expand "left" where the sequence along first breakend is equal to sequence along second breakend
  4. Expand "right" where the sequence along first breakend is equal to sequence along second breakend

The algorithm in my example was just handling the expand steps, but it would pretty easy to also add trim. Also, maybe "left" and "right" aren't the best words to use here, so feel free to suggest better terminology.

I also think your proposal to report the whole range of possible positions would work well, but I'm curious to hear what others would prefer.

One question I had about your approach is how you would handle cases where there is some homology, but also inserted nucleotides. For example, how would you handle this?

breakpoint: {
  breakends: [
    {chr1:123, DivergesAfter, sequence: AAAAAAA},
    {chr2:200, DivergesAfter}
  ]
}

I think in the proposed algorithm, it would go something like this:

  1. Trim along first breakend
breakpoint: {
  breakends: [
    {chr1:129, DivergesAfter, sequence: A},
    {chr2:200, DivergesAfter}
  ]
}
  1. Trim along second breakend (no change)
breakpoint: {
  breakends: [
    {chr1:129, DivergesAfter, sequence: A},
    {chr2:200, DivergesAfter}
  ]
}
  1. Expand "left"
breakpoint: {
  breakends: [
    {chr1:123-129, DivergesAfter, sequence: A},
    {chr2:200-206, DivergesAfter}
  ],
  homology: true
}
  1. Expand "right" (no change)
breakpoint: {
  breakends: [
    {chr1:123-129, DivergesAfter, sequence: A},
    {chr2:200-206, DivergesAfter}
  ],
  homology: true
}

Is this what you were thinking?

@ahwagner ahwagner added Docs Requirements issue for gathering technical requirements SV labels Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Requirements issue for gathering technical requirements SV
Projects
None yet
Development

No branches or pull requests

3 participants