Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare str comparison improvements #151

Conversation

Kulivox
Copy link
Contributor

@Kulivox Kulivox commented Feb 15, 2022

Changes in this pull request are based on suggestions from this comment
#147 (comment)
My changes change what determines whether two records are comparable. Before, everything was based on function, which provided a list that said which records should be skipped in the next iteration. If this list was only made up of True values, records were compared. I've decided to change that, this list is still produced, but another, more configurable function determines the comparability of the two records.

@Kulivox Kulivox marked this pull request as ready for review February 15, 2022 17:40
Copy link
Contributor

@LiterallyUniqueLogin LiterallyUniqueLogin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Just two comments

Copy link
Contributor

@LiterallyUniqueLogin LiterallyUniqueLogin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed a few things that I should've caught in the first go round, but hopefully they're not a big deal to address.

@@ -680,6 +682,15 @@ def __init__(self,
self.has_fabricated_ref_allele = False
self.ref_allele_length = len(ref_allele) / len(motif)

# declaration of end_pos variables
self.end_pos = int(self.pos + self.ref_allele_length * len(motif) - 1)
self.full_alleles_end_pos = self.end_pos
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments:

  • I would rather not define this here and then redefine it later if we have full_alleles, that could lead to confusion for a reader. Let's just have "if full_alleles is None then .... else ...."
  • If we're going to define this, can we document this (in the same way you documented end_pos). It can go in the attributes section.
  • It seems full_alleles_pos is already defined but not documented, do you mind documenting that likewise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've applied your suggestions.

@@ -680,6 +682,15 @@ def __init__(self,
self.has_fabricated_ref_allele = False
self.ref_allele_length = len(ref_allele) / len(motif)

# declaration of end_pos variables
self.end_pos = int(self.pos + self.ref_allele_length * len(motif) - 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cast is unsafe, int(1.9999) goes to 1, and it's possible that with finite precision this doesn't properly evaluate to an exact integer. Let's use round instead. We can have a comment here explaining why we're using round when all we want is a cast because in theory it should be a whole integer.

Copy link
Contributor Author

@Kulivox Kulivox Feb 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I didn't realize that self.ref_allele_length can sometimes be a float because of partial repeats, so I thought something like this can't happen. I changed this cast to round and added the comment.

right_start, right_end = right.pos, right.end_pos

overlap = min(left_end, right_end) - max(left_start, right_start)
# This calculation contains max() - 1 to compensate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right implementation. Shouldn't we be adding one to overlap above instead of subtracting one from the denominator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, this implementation only calculates the right score for completely overlapping sequences, otherwise, it would calculate a score that is slightly lower than intended. Thank you for pointing that out!

@LiterallyUniqueLogin
Copy link
Contributor

Changes all seem to be going in the right direction!

@LiterallyUniqueLogin
Copy link
Contributor

Sorry for letting this hang - life got busy. This looks done! Please add an 'unreleased changes' section to RELEASE_NOTES.rst in the main directory with a bullet or two about what's changed here and I'll merge this into develop.

@LiterallyUniqueLogin LiterallyUniqueLogin merged commit 8dd9abf into gymrek-lab:develop Mar 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants