SV VRS Wave 1: Ambiguity due to intrinsic sequence-based homology #429
Comments
I think we can get away without any left/right normalisation because we can directly report the entire intervals. E.g. the example above would be represented as:
This avoids the VCF issue of having to report a 'nominal' position - all positions are equally valid. What we should do is define a normalisation process for eliminating inserted bases. That is:
should be normalised such that if the sequence (
That is, the above example should be normalised through the following operations:
Anchor inserted base:
Anchor inserted base:
Anchor inserted base:
Anchor inserted base:
Anchor inserted base:
Anchor inserted base:
Calculate homology bounds:
I agree that we would need to delete the inserted bases that are equal to the reference (on either side). I think the VOCA algorithm has to deal with the same issue, and they call that step "trimming". So our overall algorithm could be:
The algorithm in my example was just handling the expand steps, but it would be pretty easy to also add trim. Also, maybe "left" and "right" aren't the best words to use here, so feel free to suggest better terminology. I also think your proposal to report the whole range of possible positions would work well, but I'm curious to hear what others would prefer. One question I had about your approach is how you would handle cases where there is some homology but also inserted nucleotides. For example, how would you handle this?
I think in the proposed algorithm, it would go something like this:
Is this what you were thinking?
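For readers following along, here is a minimal sketch of what the trim step discussed above might look like. It is an illustration only, not the algorithm from either comment: the function name `trim_inserted`, the interbase (0-based) coordinates, the restriction to a single reference sequence with both breakends in the same orientation, and the choice to absorb bases on the left side first are all assumptions.

```python
def trim_inserted(ref: str, start: int, end: int, inserted: str):
    """Absorb inserted bases that match the reference next to either breakend.

    Hypothetical helper. `start` and `end` are interbase positions on `ref`;
    the junction joins ref[:start] + inserted + ref[end:].
    """
    # Left side: an inserted base equal to the next reference base can be
    # dropped by moving the left breakend one position to the right.
    while inserted and start < len(ref) and inserted[0] == ref[start]:
        start += 1
        inserted = inserted[1:]
    # Right side: an inserted base equal to the preceding reference base can
    # be dropped by moving the right breakend one position to the left.
    while inserted and end > 0 and inserted[-1] == ref[end - 1]:
        end -= 1
        inserted = inserted[:-1]
    return start, end, inserted


ref = "ACGTACGT"
# Junction ref[:2] + "GA" + ref[6:]; the inserted "G" matches ref[2].
print(trim_inserted(ref, 2, 6, "GA"))  # (3, 6, 'A')
```

Both the untrimmed and trimmed forms spell the same derived sequence (`ACGAGT` here), which is the property a trim step needs to preserve. Absorbing on the left first is an arbitrary convention; a real normalisation rule would have to fix one.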
Problem statement
As Daniel laid out in issue #425, one of our goals for SV-VRS wave 1 is to come up with a simple model for low-level structural changes from a reference sequence. Part of this goal includes support for multiple kinds of ambiguity, including intrinsic sequence-based homology (i.e., ambiguous cases where the two sequences on either end of a breakpoint are identical).
Daniel drew up the following example of sequence-based homology in his 6/1 SV SIG presentation:

There are two possible approaches we are considering to handle this ambiguity and would like to get feedback on which one is preferred.
Option 1: Extending VRS style ambiguity representation to breakpoints
Starting with the simple breakend model from #425 (which excludes property data types, since those will be decided later):
We can create a breakpoint model that includes a field ambiguous_nucleotides for a sequence of nucleotides that are considered ambiguous due to homology. Then, we can use a simple VOCA-like algorithm to take a potentially over-precise Breakpoint representation and produce the precision-corrected representation.
Rough Algorithm
So the example from Daniel's presentation could be represented as:
Option 2: Using VCF-style ambiguity
Another approach we could go with is VCF-style uncertainty bounds. In VCF 4.4, sequence-based homology is represented with the HOMSEQ and CIPOS info keys, where CIPOS is an integer pair representing the confidence interval bounds in either direction. Section 5.4.8 of the spec gives an example of how CIPOS can be used to represent uncertainty around breakpoint position.
Adding this to our model might look something like
With this approach, we still have the issue of deciding which position should be used for the "center" of the interval. For example, these two breakend representations are equivalent:
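To make the "center" problem concrete, here is a small sketch with made-up coordinates. The `Breakend` class, its field names, and the `same_locus` helper are hypothetical; the only thing borrowed from VCF is the idea that CIPOS holds a pair of offsets relative to the nominal position.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Breakend:
    """Hypothetical breakend with a VCF-style confidence interval."""
    pos: int
    cipos: Tuple[int, int]  # (lower offset <= 0, upper offset >= 0)

    def interval(self) -> Tuple[int, int]:
        # Absolute bounds implied by the nominal position and its offsets.
        return (self.pos + self.cipos[0], self.pos + self.cipos[1])


def same_locus(a: Breakend, b: Breakend) -> bool:
    """Two breakends describe the same ambiguity if their intervals match."""
    return a.interval() == b.interval()


a = Breakend(pos=100, cipos=(0, 4))
b = Breakend(pos=102, cipos=(-2, 2))
print(a.interval(), b.interval())  # (100, 104) (100, 104)
print(same_locus(a, b), a == b)    # True False
```

The two objects cover the same interval but compare unequal field by field, so a model built this way would need a rule for picking the nominal position (for example, always the leftmost) before representations could be compared or digested directly.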
Questions