Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No entry in lineage_report.csv for very short inputs #409

Closed
garfinjm opened this issue Apr 5, 2022 · 7 comments
Closed

No entry in lineage_report.csv for very short inputs #409

garfinjm opened this issue Apr 5, 2022 · 7 comments

Comments

@garfinjm
Copy link

garfinjm commented Apr 5, 2022

Hello!

Thanks for all your work developing this great tool! I've been testing V4 and noticed that when I give pangolin V4 a short sequence it gets filtered out at the preprocessing align_to_reference step, resulting in a lineage_report.csv that contains only the header line. Would it be feasable to add a warning when this happens?

Below is a walkthrough for replicating the issue. Thanks for your help!

My versions:

Singularity> pangolin --all-versions
pangolin: 4.0.1
pangolin-data: 1.2.133
constellations: v0.1.4
scorpio: 0.3.16
pangolin-assignment: v1.2.133

Example short fasta (the first 6 lines of pangolin/tests/test-data/sequence1.fasta):

Singularity> cat short.fasta
>sequence1
AACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTA
ATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTTTT
GCAGCCGATCATCAGCACATCTAGGTTTTGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCC
TGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGT
GGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCT

Running Pangolin

Singularity> pangolin --no-temp short.fasta
****
Pangolin running in usher mode.
****

--no-temp: all intermediate files will be written to /data

Query file:     /data/short.fasta
****
Data files found:
usher_pb:       /opt/conda/envs/pangolin/lib/python3.8/site-packages/pangolin_data/data/lineageTree.pb
****
Job stats:
job                      count    min threads    max threads
---------------------  -------  -------------  -------------
align_to_reference           1              1              1
all                          1              1              1
cache_sequence_assign        1              1              1
create_seq_hash              1              1              1
get_constellations           1              1              1
merged_info                  1              1              1
scorpio                      1              1              1
sequence_qc                  1              1              1
total                        8              1              1

****
Query sequences collapsed from 0 to 0 unique sequences.
****
0 sequences assigned via designations.
****
Running sequence QC
Total passing QC: 0
Job stats:
job                count    min threads    max threads
---------------  -------  -------------  -------------
all                    1              1              1
usher_cache            1              1              1
usher_inference        1              1              1
usher_to_report        1              1              1
total                  4              1              1

Using UShER as inference engine.
****
Output file written to: /data/lineage_report.csv

Gives a lineage_report.csv that looks like:

Singularity> cat lineage_report.csv
taxon,lineage,conflict,ambiguity_score,scorpio_call,scorpio_support,scorpio_conflict,scorpio_notes,version,pangolin_version,scorpio_version,constellation_version,is_designated,qc_status,qc_notes,note

Singularity> wc -l lineage_report.csv
1 lineage_report.csv

Here's the mapped.sam contents:

Singularity> cat mapped.sam
@SQ     SN:outgroup_A   LN:29903
@PG     ID:minimap2     PN:minimap2     VN:2.24-r1122   CL:minimap2 -a -x asm20 --sam-hit-only --secondary=no --score-N=0 -t 1 -o /data/mapped.sam /opt/conda/envs/pangolin/lib/python3.8/site-packages/pangolin/data/reference.fasta -

And the file sizes:

Singularity> ls -lahrt
total 511M
-rw-r--r--.  1 garf0012 mdh  366 Apr  5 11:11 short.fasta
-rwxr-xr-x.  1 garf0012 mdh 463M Apr  5 11:11 pangolin_latest.sif
drwxr-xr-x.  1 garf0012 mdh   60 Apr  5 11:11 ..
drwxr-sr-x. 11 garf0012 mdh 4.0K Apr  5 11:12 .snakemake
-rw-r--r--.  1 garf0012 mdh  646 Apr  5 11:12 get_constellations.txt
-rw-r--r--.  1 garf0012 mdh  245 Apr  5 11:12 mapped.sam
-rw-r--r--.  1 garf0012 mdh    0 Apr  5 11:12 alignment.fasta
-rw-r--r--.  1 garf0012 mdh    0 Apr  5 11:12 hashed.aln.fasta
-rw-r--r--.  1 garf0012 mdh   10 Apr  5 11:12 hash_map.csv
-rw-r--r--.  1 garf0012 mdh   24 Apr  5 11:12 designation_status.csv
-rw-r--r--.  1 garf0012 mdh   24 Apr  5 11:12 seq_status.csv
-rw-r--r--.  1 garf0012 mdh    0 Apr  5 11:12 pass_qc.fasta
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.A.23.1-like_E484K_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.A.23.1-like_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.AV.1-like_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Delta__AY.4.2-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Delta__AY.4-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.B.1.1.318-like_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Omicron__Unassigned__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.B.1.1.7-like_E484K_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Alpha__B.1.1.7-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Beta__B.1.351-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Epsilon__B.1.427-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Epsilon__B.1.429-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Eta__B.1.525-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Iota__B.1.526-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.B.1.617.1-like_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Delta__B.1.617.2-like___K417N_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Delta__B.1.617.2-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.B.1.617.3-like_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Mu__B.1.621-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Omicron__BA.1-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Omicron__BA.2-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Omicron__BA.3-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Lambda__C.37-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Gamma__P.1-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Zeta__P.2-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Theta__P.3-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  147 Apr  5 11:12 VOC_report.scorpio.csv
-rw-r--r--.  1 garf0012 mdh  169 Apr  5 11:12 preprocessing.csv
-rw-r--r--.  1 garf0012 mdh  199 Apr  5 11:12 cache_assigned.csv
-rw-r--r--.  1 garf0012 mdh    0 Apr  5 11:12 not_cache_assigned.fasta
drwxr-sr-x.  2 garf0012 mdh 4.0K Apr  5 11:12 logs
-rw-r--r--.  1 garf0012 mdh    0 Apr  5 11:12 clades.txt
-rw-r--r--.  1 garf0012 mdh   33 Apr  5 11:12 inference_report.csv
drwxr-sr-x.  4 garf0012 mdh 8.0K Apr  5 11:12 .
-rw-r--r--.  1 garf0012 mdh  200 Apr  5 11:12 lineage_report.csv
@EricFournier3
Copy link

Hi,
it seems that sequences having N content of 1.0 are also filtered out from the report. With version 3, results for those sequence are saved as fail in the status column. Is this expected ?

Thank you

@aineniamh
Copy link
Member

It's unexpected, thanks for flagging! I'll follow up and see why this is happening.

@KatSteinke
Copy link

Looking at the logs sequences with N content of 1.0 are dropped fairly early, before sequences are hashed - in a test run with 7 samples, two of which have N content of 1.0, pangolin reports Query sequences collapsed from 5 to 5 unique sequences.

@aineniamh
Copy link
Member

So I think I've fixed this now- just making sure tests pass on the branch!

@aineniamh
Copy link
Member

Resolved in pangolin v4.0.3!

@EricFournier3
Copy link

Thank you @aineniamh, it works now

@garfinjm
Copy link
Author

garfinjm commented Apr 8, 2022

Works perfect! Thanks @aineniamh.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants