Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial version of spliced sequence retrieval #127

Merged
merged 3 commits into from
Oct 12, 2017

Conversation

simonvh
Copy link
Contributor

@simonvh simonvh commented Oct 11, 2017

Hi Matt, thanks for an extremely useful module. One feature that would be useful (at least to me) is the ability to retrieve spliced sequences from a FASTA file. Use-case would be to convert bed12 records to sequences, for instance. This PR is initial attempt at doing this. I would be happy to further discuss and adapt code/naming/style.

Some points:

  • Would an iterable of intervals be the right type of argument?
  • I personally prefer the option to get the reverse complement directly in one call. However, get_seq() doesn't have this. So at the moment this is not consistent.
  • Would it be worth adding this to the command line script?

@mdshw5
Copy link
Owner

mdshw5 commented Oct 11, 2017

Thanks for the contribution. I like this idea very much, and it actually fits with some of my recent day-job work :).

Would an iterable of intervals be the right type of argument?

I think that an iterable of intervals makes sense. The spliced sequence should always be on the same contig, so I like your current interface.

I personally prefer the option to get the reverse complement directly in one call. However, get_seq() doesn't have this. So at the moment this is not consistent.

I'm okay with some inconsistency. We can add rc=False to get_seq() to keep things consistent - don't see anything wrong with that.

Would it be worth adding this to the command line script?

Certainly. I think we would need BED12 parsing to make this useful. I've also considered how to better integrate with gffutils. Currently there is a Feature.sequence method but that does not handle splicing of child introns in transcript feature types. I'm doing this GTF + FASTA -> transcriptome FASTA transformation quite a bit lately so moving the sequence splicing interface into pyfaidx would be helpful. Maybe @daler would have some thoughts about whether spliced sequence retrieval would be something appropriate for his module as well?

Let's get the tests passing and go from there!

@simonvh
Copy link
Contributor Author

simonvh commented Oct 11, 2017

I see that there is one test failing relating to my work, I'll get fix that. However, this is not something that I did, right?

======================================================================

FAIL: Samtools BGZF index should be identical to pyfaidx BGZF index

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/home/travis/build/mdshw5/pyfaidx/tests/test_Fasta_bgzip.py", line 48, in test_build_issue_126

    assert result_index == expect_index

AssertionError

I would be happy to add BED12 parsing. Not sure if there's actually a good (lightweight?) module that we could use there, otherwise I'll add it based on what I have for myself so far. It would indeed be great to have GFF/GTF support and if we could profit from @daler's work there.

@mdshw5
Copy link
Owner

mdshw5 commented Oct 11, 2017

I think master was broken before your contribution. I was working on figuring our how samtools faidx works with BGZF compressed files, see #126. I abandoned the work and never came back :). I'll fix up master so the other tests pass.

mdshw5 added a commit that referenced this pull request Oct 11, 2017
mdshw5 added a commit that referenced this pull request Oct 11, 2017
mdshw5 added a commit that referenced this pull request Oct 11, 2017
@mdshw5 mdshw5 closed this in 592f480 Oct 11, 2017
@mdshw5 mdshw5 reopened this Oct 11, 2017
@mdshw5 mdshw5 closed this in 259b222 Oct 11, 2017
@mdshw5 mdshw5 reopened this Oct 11, 2017
@mdshw5
Copy link
Owner

mdshw5 commented Oct 11, 2017

Alright, @simonvh. Thanks for the work on your end. After far too much messing around on my side, it look like master is passing tests as well. If you're okay with it I'll cut a new release with this PR merged. Just give a thumbs up if you're okay, otherwise I'll wait if you want to add anything else.

@daler
Copy link
Contributor

daler commented Oct 11, 2017

@simonvh, @mdshw5 I'd love to have spliced sequence retrieval in gffutils. Like BED12 conversion (http://daler.github.io/gffutils/autodocs/gffutils.FeatureDB.bed12.html) it should actually be a method on FeatureDB rather than Feature. I'll add an issue over there to remind myself.

mdshw5 added a commit that referenced this pull request Oct 12, 2017
@mdshw5 mdshw5 merged commit b9fdf83 into mdshw5:master Oct 12, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants