Change bounds checking in probaln_glocal #1616

jkbonfield · 2023-05-04T11:40:56Z

In 3 places when filling out forwards and backwards arrays, the "u" array index has bounds checks of "u < 3 || u >= i_dim-3".

Understanding this code is tricky however! My hypothesis that the upper bounds check here is because we use u, u+1 and u+2 in array indices, and we iterate with "k <= l_ref" so we can access one beyond the end of the array.

However the arrays are allocated to be dimension (l_query+1)*i_dim, so (assuming correctness of l_ref vs l_query in bw/i_dim calculation) we have compensated for this over-step already. This has been validated with address sanitiser.

The effect of the i_dim-3 limit is that having band width equal to query length causes the final state element to be incorrectly labelled as an insertion.

This hypothesis may however be incorrect, as the lower bound "u < 3" also seems redundant, yet changing this to "u < 0" does give different quality scores in about 1 in 4000 sequences (tested on 10 million illumina short read BAQ calculations). Hence for now this is left unchanged. In normal behaviour using a band, tested using "samtools calmd -r -E" to generate BQ tags, this commit does not change output.

Fixes #1605

In 3 places when filling out forwards and backwards arrays, the "u" array index has bounds checks of "u < 3 || u >= i_dim-3". Understanding this code is tricky however! My hypothesis that the upper bounds check here is because we use u, u+1 and u+2 in array indices, and we iterate with "k <= l_ref" so we can access one beyond the end of the array. However the arrays are allocated to be dimension (l_query+1)*i_dim, so (assuming correctness of l_ref vs l_query in bw/i_dim calculation) we have compensated for this over-step already. This has been validated with address sanitiser. The effect of the i_dim-3 limit is that having band width equal to query length causes the final state element to be incorrectly labelled as an insertion. This hypothesis may however be incorrect, as the lower bound "u < 3" also seems redundant, yet changing this to "u < 0" does give different quality scores in about 1 in 4000 sequences (tested on 10 million illumina short read BAQ calculations). Hence for now this is left unchanged. In normal behaviour using a band, tested using "samtools calmd -r -E" to generate BQ tags, this commit does not change output. Fixes samtools#1605

jkbonfield · 2023-05-04T11:41:40Z

If @pd3 or @lh3 care to review this change it would help. Trying to understand the indexing here is non-trivial, so someone more familier with the code should give it a glance over.

daviesrob · 2023-06-05T16:48:52Z

I think the fix may be correct, but the comment isn't. The loop being altered looks like it's trying to sum over the values set in the previous loop when i == l_query. The confusing part is that instead of working out which values were set in the same way (from beg to end inclusive), it instead iterates over the entire range 1 ... l_ref and then rejects the ones that weren't set by doing a check on the calculated value of u. So it's not trying to prevent an out-of-bounds access, but instead working out which values it's supposed to be summing over.

The tricky part of this is trying to work out if the restriction on u is actually doing the same job as iterating from beg to end in the earlier loop. It looks like it does from experimentation, but it's a bit difficult to prove. However, it probably doesn't matter if it goes over as anything unset should have been zeroed by calloc() and so shouldn't affect the final sum. Hopefully.

jkbonfield · 2023-06-06T11:25:13Z

Feel free to correct the comment and push back a change. The code is convoluted so any improvement to documentation is most welcome!

Adds a comment explaining that the f[] and b[] arrays count positions from 1, allowing 0 to be used to more easily handle the edges of the alignment matrix. Changes the comment explaining the line: if (u < 3 || u >= i_dim) continue; used in some of the loops over f[] and b[]. While it does prevent overstepping the array boundaries, its main function is to select the parts over which the scores have previously been calculated. A change in 5d7a782 to fix excess memory usage got the high end slightly wrong (using i_dim - 3). When the query sequence length was less than the band width, this could lead to the last column being incorrectly missed out from parts of the calculation.

daviesrob · 2023-06-13T17:13:19Z

Comments adjusted. Hopefully the new ones make what's going on a bit clearer.

This code needs more units tests, and refactoring.

jkbonfield · 2023-06-27T11:58:41Z

Refactoring is out of scope for this PR IMO and would just muddy the evaluation and dramatically slow down acceptance. I'd prefer one PR to fix bugs, and a completely different one to refactor the code which may take much longer to get done.

Testing is more interesting. I'd extend the tests if I could find any, but it seems to be entirely outside of our testing framework bar user-tests in things like "samtools mpileup". That however has no way of setting things like band widths, so again I think full unit tests are probably best punted to a future PR, along side refactoring.

As for the comments you added, the first is correct, but you're stating the bug in 1605 only existed between 1.8 and 1.17. Have you explicitly checked this?

For the "normal" case of bw2 < l_ref, u >= i_dim - 3 is identical to the 1.7 code of u >= bw2*3+3. It was changed in 5d7a782 to prevent excessive memory usage. So there was no bug before or after 1.8 in that situation. For bw2 >= l_ref, as in this bug, it previously still used i_dim = bw2*3+6 too as the ?: conditional only appeared with your memory-reduction PR, so the change of u check has no impact there either. Is it the case that the loop termination (k <= l_ref) coupled to the over-sized memory means the off-by-one error is basically avoided through luck?

I think the comment is correct therefore, but can't be fully sure.

It also makes me wonder whether the bug was actually the ?: initialisatio of i_dim, and whether it should have just been l_ref*3+9 instead so it accounted for additional termination space. Although I think what we have is equivalent anyway.

daviesrob self-assigned this Jun 6, 2023

daviesrob merged commit b52f3fa into samtools:develop Jun 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change bounds checking in probaln_glocal #1616

Change bounds checking in probaln_glocal #1616

jkbonfield commented May 4, 2023

jkbonfield commented May 4, 2023

daviesrob commented Jun 5, 2023

jkbonfield commented Jun 6, 2023

daviesrob commented Jun 13, 2023

jkbonfield commented Jun 27, 2023

Change bounds checking in probaln_glocal #1616

Change bounds checking in probaln_glocal #1616

Conversation

jkbonfield commented May 4, 2023

jkbonfield commented May 4, 2023

daviesrob commented Jun 5, 2023

jkbonfield commented Jun 6, 2023

daviesrob commented Jun 13, 2023

jkbonfield commented Jun 27, 2023