-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compare str comparison improvements #151
Compare str comparison improvements #151
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Just two comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noticed a few things that I should've caught in the first go round, but hopefully they're not a big deal to address.
trtools/utils/tr_harmonizer.py
Outdated
@@ -680,6 +682,15 @@ def __init__(self, | |||
self.has_fabricated_ref_allele = False | |||
self.ref_allele_length = len(ref_allele) / len(motif) | |||
|
|||
# declaration of end_pos variables | |||
self.end_pos = int(self.pos + self.ref_allele_length * len(motif) - 1) | |||
self.full_alleles_end_pos = self.end_pos |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments:
- I would rather not define this here and then redefine it later if we have full_alleles, that could lead to confusion for a reader. Let's just have "if full_alleles is None then .... else ...."
- If we're going to define this, can we document this (in the same way you documented end_pos). It can go in the attributes section.
- It seems
full_alleles_pos
is already defined but not documented, do you mind documenting that likewise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've applied your suggestions.
trtools/utils/tr_harmonizer.py
Outdated
@@ -680,6 +682,15 @@ def __init__(self, | |||
self.has_fabricated_ref_allele = False | |||
self.ref_allele_length = len(ref_allele) / len(motif) | |||
|
|||
# declaration of end_pos variables | |||
self.end_pos = int(self.pos + self.ref_allele_length * len(motif) - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cast is unsafe, int(1.9999)
goes to 1, and it's possible that with finite precision this doesn't properly evaluate to an exact integer. Let's use round instead. We can have a comment here explaining why we're using round when all we want is a cast because in theory it should be a whole integer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I didn't realize that self.ref_allele_length can sometimes be a float because of partial repeats, so I thought something like this can't happen. I changed this cast to round and added the comment.
trtools/compareSTR/compareSTR.py
Outdated
right_start, right_end = right.pos, right.end_pos | ||
|
||
overlap = min(left_end, right_end) - max(left_start, right_start) | ||
# This calculation contains max() - 1 to compensate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right implementation. Shouldn't we be adding one to overlap above instead of subtracting one from the denominator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right, this implementation only calculates the right score for completely overlapping sequences, otherwise, it would calculate a score that is slightly lower than intended. Thank you for pointing that out!
Changes all seem to be going in the right direction! |
Sorry for letting this hang - life got busy. This looks done! Please add an 'unreleased changes' section to RELEASE_NOTES.rst in the main directory with a bullet or two about what's changed here and I'll merge this into develop. |
Changes in this pull request are based on suggestions from this comment
#147 (comment)
My changes change what determines whether two records are comparable. Before, everything was based on function, which provided a list that said which records should be skipped in the next iteration. If this list was only made up of True values, records were compared. I've decided to change that, this list is still produced, but another, more configurable function determines the comparability of the two records.