# FastaReader returns empty pandas dataframe #96
Thanks for filing this, and sorry for the trouble! I think the issue is due to the
I see, that makes sense. It works with
Cool, glad that part works. BioBear only supports indexed BAM/VCF/GFF files at the moment (https://www.wheretrue.dev/docs/exon/sql-reference#indexed-scan-functions)... I've been meaning to add indexed FASTAs, and the underlying bits are in place, so I'll add it today and follow up on this ticket.
Here's an example of using a faidx index with a FASTA file... it's a bit limited in that it only supports a single region. Is that similar to how you're using (py)faidx, or is it with multiple regions and/or a regions file? The builds for this to go to PyPI are on their way, and should be up in ~2 hrs... https://github.com/wheretrue/biobear/actions/runs/8051376387

> biobear/python/tests/test_session.py, lines 77 to 88 in 7357fb8
Cool, let me give it a try and I'll get back to you soon. Thanks for your quick responses!

What I'm trying to do is simply (!) load FASTQ data as a dataframe and then use the "groupby" function from polars to count all unique sequences. Then, I want to map the counted sequences to a reference. I also need to ask your opinion about multi-threading on FASTQ files to run these kinds of tasks with safe memory usage (I can open a new issue to discuss that separately).
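The counting step described above (group identical sequences and count them) can be sketched in plain Python with `collections.Counter`; the four-line FASTQ parsing and the 10-base prefix here are illustrative assumptions, not biobear's or polars' API:

```python
from collections import Counter

def count_prefixes(fastq_text: str, length: int = 10) -> Counter:
    """Count unique leading subsequences across FASTQ records.

    Assumes well-formed 4-line FASTQ records; a stand-in for the
    polars group-by/count the comment describes.
    """
    lines = fastq_text.strip().splitlines()
    sequences = lines[1::4]  # the sequence is line 2 of every 4-line record
    return Counter(seq[:length] for seq in sequences)

records = (
    "@r1\nACGTACGTACGTTT\n+\nIIIIIIIIIIIIII\n"
    "@r2\nACGTACGTACGTAA\n+\nIIIIIIIIIIIIII\n"
    "@r3\nTTTTACGTACGTAA\n+\nIIIIIIIIIIIIII\n"
)
counts = count_prefixes(records)
# "ACGTACGTAC" appears twice, "TTTTACGTAC" once
```

In a real pipeline the grouping would run inside polars or SQL rather than in Python, but the logic is the same.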
Hey @tshauck, I'm not sure I understand your FASTA indexing implementation. How should I create
Thanks for sharing the notebook. I think I generally understand the linked part, and it looks interesting overall. This comment got a little long, but it's mostly code snippets... there's a paragraph on reproducing parts of your notebook with SQL, one on preprocessing to Parquet for good I/O performance, and one on your last comment.

### SQL

To the workflow, it's not super clear to me if

```python
df = session.sql(
    """
    SELECT substr(f.sequence, 1, 10) AS sequence, COUNT(*) AS count
    FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz') f
    GROUP BY substr(f.sequence, 1, 10)
    """
).to_polars()
```

which would do the counting and reading at the same time. Or, if you have the start and stop from another place (like a CSV), you could do something like:

```python
session = bb.connect()

session.execute(
    """
    CREATE EXTERNAL TABLE positions
    STORED AS CSV
    WITH HEADER ROW
    LOCATION './examples/positions.csv'
    """
)

df = session.sql(
    """
    SELECT sequence, COUNT(*) AS count
    FROM (
        SELECT substr(f.sequence, positions.start, positions.stop - positions.start) AS sequence
        FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz') f
        JOIN positions
            ON f.name = positions.name
    )
    GROUP BY sequence
    """
).to_polars()
```

I realize SQL may not be totally in your wheelhouse, but just to put it out there, since it can actually be pretty good for stuff like this. On a FASTQ file with about 2.5M records, this takes about 45 seconds, with the vast majority spent on I/O:
Depending on how

### Parquet

To the last point: because I/O takes up the majority of the time, if you write the file to Parquet format first, you get much better I/O for subsequent reads, and polars can work really well with Parquet. E.g....
### pyfaidx

My fault for not understanding your workflow. For now, you'd have to use pyfaidx (or samtools) to create the index, but biobear will recognize it and use it on the underlying files to control which bytes actually get read.

Anyways, thanks for raising the issues. Let me know if you have thoughts, and I'll check the other issue you raised tomorrow.
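For context on what that index contains: a `.fai` file has five tab-separated columns per record (name, sequence length, byte offset of the first base, bases per line, bytes per line including the newline), which is what lets a reader seek directly to the bytes of one region. A minimal sketch of building one in plain Python, assuming each sequence sits on a single line (real indexers like pyfaidx and samtools handle wrapped sequences too):

```python
def build_fai(fasta_text: str) -> str:
    """Build a minimal faidx (.fai) index for a FASTA string.

    Assumes well-formed input with one sequence line per record.
    Columns: name, length, byte offset of first base,
    bases per line, bytes per line (including newline).
    """
    entries = []
    offset = 0
    name = None
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            name = line[1:].split()[0]  # name is the first word of the header
            offset += len(line)
            seq_offset = offset  # first base starts right after the header
        else:
            seq_len = len(line.rstrip("\n"))
            entries.append((name, seq_len, seq_offset, seq_len, len(line)))
            offset += len(line)
    return "".join("\t".join(map(str, e)) + "\n" for e in entries)

index = build_fai(">chr1 test\nACGTACGT\n>chr2\nAC\n")
# index == "chr1\t8\t11\t8\t9\nchr2\t2\t26\t2\t3\n"
```

In practice you'd just run `pyfaidx.Fasta("ref.fa")` or `samtools faidx ref.fa` to produce the `.fai` file; the sketch only shows what the tool that reads the index relies on.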
Hi @tshauck, thanks for all the feedback! I really liked using SQL for I/O tasks on FASTQ files. It seems to outperform the other methods I've tried so far! I don't even need the FASTA indexing for now. I'm closing this issue and will be in touch again, since I really liked your tool!
I'm trying to load this FASTA data using the `FastaReader` module, but it seems something is wrong! Any thoughts?