# FastaReader returns empty pandas dataframe #96
Thanks for filing this, and sorry for the trouble! I think the issue is due to the
I see, that makes sense. It works with
Cool, glad that part works. BioBear only supports indexed BAM/VCF/GFF files at the moment (https://www.wheretrue.dev/docs/exon/sql-reference#indexed-scan-functions)... I've been meaning to add indexed FASTAs, and the underlying bits are in place, so I'll add it today and follow up on this ticket.
Here's an example of using a faidx index with a FASTA file... it's a bit limited in that it only supports a single region. Is that similar to how you're using (py)faidx, or is it with multiple regions and/or a regions file? The builds for this to go to PyPI are on their way, and should be up in ~2 hrs... https://github.com/wheretrue/biobear/actions/runs/8051376387

> biobear/python/tests/test_session.py, lines 77 to 88 in 7357fb8
Cool, let me give it a try and I'll get back to you soon. Thanks for your quick responses!

What I'm trying to do is simply (!) load FASTQ data as a dataframe and then use the "groupby" function from polars to count all unique sequences. Then, I want to map the counted sequences to a reference. I also need to ask your opinion about multi-threading on FASTQ files to run these kinds of tasks with safe memory usage (I can open a new issue to discuss that separately).
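The counting step described above (group identical sequences and count them) can be sketched in plain Python with `collections.Counter`; the four-line FASTQ parsing and the 10-base prefix here are illustrative assumptions, not biobear's or polars' API:

```python
from collections import Counter

def count_prefixes(fastq_text: str, length: int = 10) -> Counter:
    """Count unique leading subsequences across FASTQ records.

    Assumes well-formed 4-line FASTQ records; a stand-in for the
    polars group-by/count the comment describes.
    """
    lines = fastq_text.strip().splitlines()
    sequences = lines[1::4]  # the sequence is line 2 of every 4-line record
    return Counter(seq[:length] for seq in sequences)

records = (
    "@r1\nACGTACGTACGTTT\n+\nIIIIIIIIIIIIII\n"
    "@r2\nACGTACGTACGTAA\n+\nIIIIIIIIIIIIII\n"
    "@r3\nTTTTACGTACGTAA\n+\nIIIIIIIIIIIIII\n"
)
counts = count_prefixes(records)
# "ACGTACGTAC" appears twice, "TTTTACGTAC" once
```

In a real pipeline the grouping would run inside polars or SQL rather than in Python, but the logic is the same.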
Hey @tshauck, I'm not sure I understand your FASTA indexing implementation. How should I create
Thanks for sharing the notebook. I think I generally understand the linked part, and it looks interesting overall. This comment got a little long, but it's mostly code snippets... there's a paragraph on reproducing parts of your notebook with SQL, one on preprocessing to Parquet for good I/O performance, and one on your last comment.

### SQL

To the workflow, it's not super clear to me if

```python
df = session.sql(
    """
    SELECT substr(f.sequence, 1, 10) AS sequence, COUNT(*) AS count
    FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz') f
    GROUP BY substr(f.sequence, 1, 10)
    """
).to_polars()
```

which would do the counting and reading at the same time. Or, if you have the start and stop from another place (like a CSV), you could do something like:

```python
session = bb.connect()

session.execute(
    """
    CREATE EXTERNAL TABLE positions
    STORED AS CSV
    WITH HEADER ROW
    LOCATION './examples/positions.csv'
    """
)

df = session.sql(
    """
    SELECT sequence, COUNT(*) AS count
    FROM (
        SELECT substr(f.sequence, positions.start, positions.stop - positions.start) AS sequence
        FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz') f
        JOIN positions
            ON f.name = positions.name
    )
    GROUP BY sequence
    """
).to_polars()
```

I realize SQL may not be totally in your wheelhouse, but just to put it out there, since it can actually be pretty good for stuff like this. On a FASTQ file with about 2.5M records, this takes about 45 seconds, with the vast majority spent on I/O:
Depending on how

### Parquet

To the last point: because I/O takes up the majority of the time, if you write the file to Parquet format first, you get much better I/O for subsequent reads, and polars can work really well with Parquet. E.g....
### pyfaidx

My fault for not understanding your workflow. For now, you'd have to use pyfaidx (or samtools) to create the index, but biobear will recognize it and use it on the underlying files to control which bytes actually get read.

Anyways, thanks for raising the issues. Let me know if you have thoughts, and I'll check the other issue you raised tomorrow.
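For context on what that index contains: a `.fai` file has five tab-separated columns per record (name, sequence length, byte offset of the first base, bases per line, bytes per line including the newline), which is what lets a reader seek directly to the bytes of one region. A minimal sketch of building one in plain Python, assuming each sequence sits on a single line (real indexers like pyfaidx and samtools handle wrapped sequences too):

```python
def build_fai(fasta_text: str) -> str:
    """Build a minimal faidx (.fai) index for a FASTA string.

    Assumes well-formed input with one sequence line per record.
    Columns: name, length, byte offset of first base,
    bases per line, bytes per line (including newline).
    """
    entries = []
    offset = 0
    name = None
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            name = line[1:].split()[0]  # name is the first word of the header
            offset += len(line)
            seq_offset = offset  # first base starts right after the header
        else:
            seq_len = len(line.rstrip("\n"))
            entries.append((name, seq_len, seq_offset, seq_len, len(line)))
            offset += len(line)
    return "".join("\t".join(map(str, e)) + "\n" for e in entries)

index = build_fai(">chr1 test\nACGTACGT\n>chr2\nAC\n")
# index == "chr1\t8\t11\t8\t9\nchr2\t2\t26\t2\t3\n"
```

In practice you'd just run `pyfaidx.Fasta("ref.fa")` or `samtools faidx ref.fa` to produce the `.fai` file; the sketch only shows what the tool that reads the index relies on.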
Hi @tshauck, thanks for all the feedback! I really liked using SQL for I/O tasks on FASTQ files. It seems to outperform the other methods I've tried so far! I don't even need the FASTA indexing for now. I'm closing this issue and will be in touch again, since I really liked your tool!
I'm trying to load this FASTA data using the `FastaReader` module, but it seems something is wrong! Any thoughts?