Change bounds checking in probaln_glocal #1616
In 3 places when filling out the forwards and backwards arrays, the "u" array index has bounds checks of "u < 3 || u >= i_dim-3".
Understanding this code is tricky, however! My hypothesis is that the upper bounds check is here because we use u, u+1 and u+2 in array indices, and we iterate with "k <= l_ref", so we can access one beyond the end of the array.
However, the arrays are allocated with dimension (l_query+1)*i_dim, so (assuming correctness of l_ref vs l_query in the bw/i_dim calculation) we have compensated for this over-step already. This has been validated with address sanitiser.
The effect of the i_dim-3 limit is that having band width equal to query length causes the final state element to be incorrectly labelled as an insertion.
This hypothesis may however be incorrect, as the lower bound "u < 3" also seems redundant, yet changing this to "u < 0" does give different quality scores in about 1 in 4000 sequences (tested on 10 million Illumina short read BAQ calculations). Hence for now the lower bound is left unchanged. In normal behaviour using a band, tested using "samtools calmd -r -E" to generate BQ tags, this commit does not change output.
Fixes samtools#1605
I think the fix may be correct, but the comment isn't. The loop being altered looks like it's trying to sum over the values set in the previous loop when The tricky part of this is trying to work out if the restriction on |
Feel free to correct the comment and push back a change. The code is convoluted so any improvement to documentation is most welcome!
Adds a comment explaining that the f[] and b[] arrays count positions from 1, allowing 0 to be used to more easily handle the edges of the alignment matrix. Changes the comment explaining the line: if (u < 3 || u >= i_dim) continue; used in some of the loops over f[] and b[]. While it does prevent overstepping the array boundaries, its main function is to select the parts over which the scores have previously been calculated. A change in 5d7a782 to fix excess memory usage got the high end slightly wrong (using i_dim - 3). When the query sequence length was less than the band width, this could lead to the last column being incorrectly missed out from parts of the calculation.
Comments adjusted. Hopefully the new ones make what's going on a bit clearer. This code needs more unit tests and refactoring.
Refactoring is out of scope for this PR IMO and would just muddy the evaluation and dramatically slow down acceptance. I'd prefer one PR to fix bugs, and a completely different one to refactor the code, which may take much longer to get done.
Testing is more interesting. I'd extend the tests if I could find any, but this code seems to be entirely outside of our testing framework, bar user-tests in things like "samtools mpileup". That however has no way of setting things like band widths, so again I think full unit tests are probably best punted to a future PR, alongside refactoring.
As for the comments you added, the first is correct, but you're stating the bug in 1605 only existed between 1.8 and 1.17. Have you explicitly checked this? For the "normal" case of I think the comment is correct therefore, but can't be fully sure. It also makes me wonder whether the bug was actually the ?: initialisation of