No entry in lineage_report.csv for very short inputs #409

garfinjm · 2022-04-05T16:39:24Z

Hello!

Thanks for all your work developing this great tool! I've been testing V4 and noticed that when I give pangolin V4 a short sequence it gets filtered out at the preprocessing align_to_reference step, resulting in a lineage_report.csv that contains only the header line. Would it be feasable to add a warning when this happens?

Below is a walkthrough for replicating the issue. Thanks for your help!

My versions:

Singularity> pangolin --all-versions
pangolin: 4.0.1
pangolin-data: 1.2.133
constellations: v0.1.4
scorpio: 0.3.16
pangolin-assignment: v1.2.133

Example short fasta (the first 6 lines of pangolin/tests/test-data/sequence1.fasta):

Singularity> cat short.fasta
>sequence1
AACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTA
ATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTTTT
GCAGCCGATCATCAGCACATCTAGGTTTTGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCC
TGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGT
GGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCT

Running Pangolin

Singularity> pangolin --no-temp short.fasta
****
Pangolin running in usher mode.
****

--no-temp: all intermediate files will be written to /data

Query file:     /data/short.fasta
****
Data files found:
usher_pb:       /opt/conda/envs/pangolin/lib/python3.8/site-packages/pangolin_data/data/lineageTree.pb
****
Job stats:
job                      count    min threads    max threads
---------------------  -------  -------------  -------------
align_to_reference           1              1              1
all                          1              1              1
cache_sequence_assign        1              1              1
create_seq_hash              1              1              1
get_constellations           1              1              1
merged_info                  1              1              1
scorpio                      1              1              1
sequence_qc                  1              1              1
total                        8              1              1

****
Query sequences collapsed from 0 to 0 unique sequences.
****
0 sequences assigned via designations.
****
Running sequence QC
Total passing QC: 0
Job stats:
job                count    min threads    max threads
---------------  -------  -------------  -------------
all                    1              1              1
usher_cache            1              1              1
usher_inference        1              1              1
usher_to_report        1              1              1
total                  4              1              1

Using UShER as inference engine.
****
Output file written to: /data/lineage_report.csv

Gives a lineage_report.csv that looks like:

Singularity> cat lineage_report.csv
taxon,lineage,conflict,ambiguity_score,scorpio_call,scorpio_support,scorpio_conflict,scorpio_notes,version,pangolin_version,scorpio_version,constellation_version,is_designated,qc_status,qc_notes,note

Singularity> wc -l lineage_report.csv
1 lineage_report.csv

Here's the mapped.sam contents:

Singularity> cat mapped.sam
@SQ     SN:outgroup_A   LN:29903
@PG     ID:minimap2     PN:minimap2     VN:2.24-r1122   CL:minimap2 -a -x asm20 --sam-hit-only --secondary=no --score-N=0 -t 1 -o /data/mapped.sam /opt/conda/envs/pangolin/lib/python3.8/site-packages/pangolin/data/reference.fasta -

And the file sizes:

Singularity> ls -lahrt
total 511M
-rw-r--r--.  1 garf0012 mdh  366 Apr  5 11:11 short.fasta
-rwxr-xr-x.  1 garf0012 mdh 463M Apr  5 11:11 pangolin_latest.sif
drwxr-xr-x.  1 garf0012 mdh   60 Apr  5 11:11 ..
drwxr-sr-x. 11 garf0012 mdh 4.0K Apr  5 11:12 .snakemake
-rw-r--r--.  1 garf0012 mdh  646 Apr  5 11:12 get_constellations.txt
-rw-r--r--.  1 garf0012 mdh  245 Apr  5 11:12 mapped.sam
-rw-r--r--.  1 garf0012 mdh    0 Apr  5 11:12 alignment.fasta
-rw-r--r--.  1 garf0012 mdh    0 Apr  5 11:12 hashed.aln.fasta
-rw-r--r--.  1 garf0012 mdh   10 Apr  5 11:12 hash_map.csv
-rw-r--r--.  1 garf0012 mdh   24 Apr  5 11:12 designation_status.csv
-rw-r--r--.  1 garf0012 mdh   24 Apr  5 11:12 seq_status.csv
-rw-r--r--.  1 garf0012 mdh    0 Apr  5 11:12 pass_qc.fasta
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.A.23.1-like_E484K_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.A.23.1-like_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.AV.1-like_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Delta__AY.4.2-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Delta__AY.4-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.B.1.1.318-like_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Omicron__Unassigned__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.B.1.1.7-like_E484K_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Alpha__B.1.1.7-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Beta__B.1.351-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Epsilon__B.1.427-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Epsilon__B.1.429-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Eta__B.1.525-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Iota__B.1.526-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.B.1.617.1-like_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Delta__B.1.617.2-like___K417N_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Delta__B.1.617.2-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.B.1.617.3-like_counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Mu__B.1.621-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Omicron__BA.1-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Omicron__BA.2-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Omicron__BA.3-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Lambda__C.37-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Gamma__P.1-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Zeta__P.2-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  107 Apr  5 11:12 VOC_report.scorpio.Theta__P.3-like__counts.csv
-rw-r--r--.  1 garf0012 mdh  147 Apr  5 11:12 VOC_report.scorpio.csv
-rw-r--r--.  1 garf0012 mdh  169 Apr  5 11:12 preprocessing.csv
-rw-r--r--.  1 garf0012 mdh  199 Apr  5 11:12 cache_assigned.csv
-rw-r--r--.  1 garf0012 mdh    0 Apr  5 11:12 not_cache_assigned.fasta
drwxr-sr-x.  2 garf0012 mdh 4.0K Apr  5 11:12 logs
-rw-r--r--.  1 garf0012 mdh    0 Apr  5 11:12 clades.txt
-rw-r--r--.  1 garf0012 mdh   33 Apr  5 11:12 inference_report.csv
drwxr-sr-x.  4 garf0012 mdh 8.0K Apr  5 11:12 .
-rw-r--r--.  1 garf0012 mdh  200 Apr  5 11:12 lineage_report.csv

The text was updated successfully, but these errors were encountered:

EricFournier3 · 2022-04-05T18:00:06Z

Hi,
it seems that sequences having N content of 1.0 are also filtered out from the report. With version 3, results for those sequence are saved as fail in the status column. Is this expected ?

Thank you

aineniamh · 2022-04-05T18:22:28Z

It's unexpected, thanks for flagging! I'll follow up and see why this is happening.

KatSteinke · 2022-04-06T07:27:48Z

Looking at the logs sequences with N content of 1.0 are dropped fairly early, before sequences are hashed - in a test run with 7 samples, two of which have N content of 1.0, pangolin reports Query sequences collapsed from 5 to 5 unique sequences.

aineniamh · 2022-04-06T16:11:00Z

So I think I've fixed this now- just making sure tests pass on the branch!

aineniamh · 2022-04-07T13:20:42Z

Resolved in pangolin v4.0.3!

EricFournier3 · 2022-04-07T14:02:15Z

Thank you @aineniamh, it works now

garfinjm · 2022-04-08T16:33:22Z

Works perfect! Thanks @aineniamh.

KatSteinke mentioned this issue Apr 6, 2022

"Sequences" consisting only of N's disappear in pangolin output - desired behavior? #414

Closed

aineniamh added a commit that referenced this issue Apr 6, 2022

fix for issue #409 with failed to map seqs written out

31b0c22

aineniamh closed this as completed Apr 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No entry in lineage_report.csv for very short inputs #409

No entry in lineage_report.csv for very short inputs #409

garfinjm commented Apr 5, 2022

EricFournier3 commented Apr 5, 2022

aineniamh commented Apr 5, 2022

KatSteinke commented Apr 6, 2022

aineniamh commented Apr 6, 2022

aineniamh commented Apr 7, 2022

EricFournier3 commented Apr 7, 2022

garfinjm commented Apr 8, 2022

No entry in lineage_report.csv for very short inputs #409

No entry in lineage_report.csv for very short inputs #409

Comments

garfinjm commented Apr 5, 2022

EricFournier3 commented Apr 5, 2022

aineniamh commented Apr 5, 2022

KatSteinke commented Apr 6, 2022

aineniamh commented Apr 6, 2022

aineniamh commented Apr 7, 2022

EricFournier3 commented Apr 7, 2022

garfinjm commented Apr 8, 2022