Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change bounds checking in probaln_glocal #1616

Merged
merged 2 commits into from
Jun 27, 2023

Conversation

jkbonfield
Copy link
Contributor

In 3 places when filling out forwards and backwards arrays, the "u" array index has bounds checks of "u < 3 || u >= i_dim-3".

Understanding this code is tricky however! My hypothesis that the upper bounds check here is because we use u, u+1 and u+2 in array indices, and we iterate with "k <= l_ref" so we can access one beyond the end of the array.

However the arrays are allocated to be dimension (l_query+1)*i_dim, so (assuming correctness of l_ref vs l_query in bw/i_dim calculation) we have compensated for this over-step already. This has been validated with address sanitiser.

The effect of the i_dim-3 limit is that having band width equal to query length causes the final state element to be incorrectly labelled as an insertion.

This hypothesis may however be incorrect, as the lower bound "u < 3" also seems redundant, yet changing this to "u < 0" does give different quality scores in about 1 in 4000 sequences (tested on 10 million illumina short read BAQ calculations). Hence for now this is left unchanged. In normal behaviour using a band, tested using "samtools calmd -r -E" to generate BQ tags, this commit does not change output.

Fixes #1605

In 3 places when filling out forwards and backwards arrays, the "u"
array index has bounds checks of "u < 3 || u >= i_dim-3".

Understanding this code is tricky however!  My hypothesis that the
upper bounds check here is because we use u, u+1 and u+2 in array
indices, and we iterate with "k <= l_ref" so we can access one beyond
the end of the array.

However the arrays are allocated to be dimension (l_query+1)*i_dim, so
(assuming correctness of l_ref vs l_query in bw/i_dim calculation) we
have compensated for this over-step already.  This has been validated
with address sanitiser.

The effect of the i_dim-3 limit is that having band width equal to
query length causes the final state element to be incorrectly labelled
as an insertion.

This hypothesis may however be incorrect, as the lower bound "u < 3"
also seems redundant, yet changing this to "u < 0" does give different
quality scores in about 1 in 4000 sequences (tested on 10 million
illumina short read BAQ calculations).  Hence for now this is left
unchanged.  In normal behaviour using a band, tested using "samtools
calmd -r -E" to generate BQ tags, this commit does not change output.

Fixes samtools#1605
@jkbonfield
Copy link
Contributor Author

If @pd3 or @lh3 care to review this change it would help. Trying to understand the indexing here is non-trivial, so someone more familier with the code should give it a glance over.

@daviesrob
Copy link
Member

I think the fix may be correct, but the comment isn't. The loop being altered looks like it's trying to sum over the values set in the previous loop when i == l_query. The confusing part is that instead of working out which values were set in the same way (from beg to end inclusive), it instead iterates over the entire range 1 ... l_ref and then rejects the ones that weren't set by doing a check on the calculated value of u. So it's not trying to prevent an out-of-bounds access, but instead working out which values it's supposed to be summing over.

The tricky part of this is trying to work out if the restriction on u is actually doing the same job as iterating from beg to end in the earlier loop. It looks like it does from experimentation, but it's a bit difficult to prove. However, it probably doesn't matter if it goes over as anything unset should have been zeroed by calloc() and so shouldn't affect the final sum. Hopefully.

@jkbonfield
Copy link
Contributor Author

Feel free to correct the comment and push back a change. The code is convoluted so any improvement to documentation is most welcome!

@daviesrob daviesrob self-assigned this Jun 6, 2023
Adds a comment explaining that the f[] and b[] arrays count
positions from 1, allowing 0 to be used to more easily handle
the edges of the alignment matrix.

Changes the comment explaining the line:
           if (u < 3 || u >= i_dim) continue;
used in some of the loops over f[] and b[].  While it does
prevent overstepping the array boundaries, its main
function is to select the parts over which the scores have
previously been calculated.  A change in 5d7a782 to fix
excess memory usage got the high end slightly wrong (using
i_dim - 3).  When the query sequence length was less than the
band width, this could lead to the last column being
incorrectly missed out from parts of the calculation.
@daviesrob
Copy link
Member

Comments adjusted. Hopefully the new ones make what's going on a bit clearer.

This code needs more units tests, and refactoring.

@jkbonfield
Copy link
Contributor Author

Refactoring is out of scope for this PR IMO and would just muddy the evaluation and dramatically slow down acceptance. I'd prefer one PR to fix bugs, and a completely different one to refactor the code which may take much longer to get done.

Testing is more interesting. I'd extend the tests if I could find any, but it seems to be entirely outside of our testing framework bar user-tests in things like "samtools mpileup". That however has no way of setting things like band widths, so again I think full unit tests are probably best punted to a future PR, along side refactoring.

As for the comments you added, the first is correct, but you're stating the bug in 1605 only existed between 1.8 and 1.17. Have you explicitly checked this?

For the "normal" case of bw2 < l_ref, u >= i_dim - 3 is identical to the 1.7 code of u >= bw2*3+3. It was changed in 5d7a782 to prevent excessive memory usage. So there was no bug before or after 1.8 in that situation. For bw2 >= l_ref, as in this bug, it previously still used i_dim = bw2*3+6 too as the ?: conditional only appeared with your memory-reduction PR, so the change of u check has no impact there either. Is it the case that the loop termination (k <= l_ref) coupled to the over-sized memory means the off-by-one error is basically avoided through luck?

I think the comment is correct therefore, but can't be fully sure.

It also makes me wonder whether the bug was actually the ?: initialisatio of i_dim, and whether it should have just been l_ref*3+9 instead so it accounted for additional termination space. Although I think what we have is equivalent anyway.

@daviesrob daviesrob merged commit b52f3fa into samtools:develop Jun 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

probaln_glocal returns suboptimal alignment when bandwidth is too large
2 participants