Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How can I use gtf file run qapa? #4

Closed
Trandamere opened this issue Jun 7, 2018 · 13 comments
Closed

How can I use gtf file run qapa? #4

Trandamere opened this issue Jun 7, 2018 · 13 comments

Comments

@Trandamere
Copy link

Hi:
I'm working on transcriptome of fungi. I have the transcriptome gtf files.Like this:

1	PacBio	transcript	6264	9330	.	-	.	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	exon	6264	6676	.	-	.	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	exon	6728	6812	.	-	.	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	exon	6865	6996	.	-	.	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	exon	7041	7358	.	-	.	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	exon	7417	7676	.	-	.	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	exon	7727	8063	.	-	.	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	exon	8116	8523	.	-	.	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	exon	8595	8609	.	-	.	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	exon	8658	9330	.	-	.	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	CDS	8658	9122	100	-	0	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	CDS	8595	8609	100	-	0	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	CDS	8116	8523	100	-	0	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	CDS	7727	8063	100	-	0	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	CDS	7417	7676	100	-	1	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	CDS	7041	7358	100	-	0	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	CDS	6865	6996	100	-	0	gene_id "PB.1"; transcript_id "PB.1.1";
1	PacBio	CDS	6728	6812	100	-	0	gene_id "PB.1"; transcript_id "PB.1.1";
1	FG	CDS	6321	6676	100	-	1	gene_id "PB.1"; transcript_id "PB.1.1";

polyAsite file

gene	strand	aligned reads	num sites	locations		
PB.8888	+	1	1	1784934		
PB.301	-	2	2	972218	972259	
PB.10914	+	2	2	6573785	6574749	
PB.8192	+	6	3	150622	150649	151336

Is this enough data to run QAPA?

@kcha
Copy link
Collaborator

kcha commented Jun 8, 2018

Hi @Trandamere,

Thank you for your interest in QAPA. Unfortunately, QAPA currently doesn’t support GTF input at this time, but I may add a parser for it in the future. For the time being, you can get around this by converting your GTF file to genePred format using the gtfToGenePred tool from UCSC, as described here.

Then, download and install the QAPA version from this branch. Be sure to add the option -H (tells QAPA that you are using a using a converted GTF file) to qapa build when building your reference library.

Also, you will need to convert your poly(A) site file into a standard BED format.

Hope this helps,
Kevin

@Trandamere
Copy link
Author

Trandamere commented Jun 9, 2018

@kcha
Hi:
I already convert GTF file to genePred

./gtfToGenePred.txt -genePredExt short-fix-ORF-hlq-flag-fix-fusion5.gtf ORF.genedPred

The result like this:

PB.1.1	1	-	6263	9330	6320	9122	8	6263,6727,6864,7040,7416,7726,8115,8657,	6676,6812,6996,7358,7676,8063,8523,9330,	0	PB.1	incmpl	incmpl	2,0,0,0,2,0,0,0,
PB.2.1	1	-	23373	25221	23518	25108	8	23373,23689,23941,24146,24381,24549,24761,24922,	23631,23879,24100,24330,24494,24709,24868,25221,	0	PB.2	incmpl	incmpl	2,0,0,1,0,1,0,0,
PB.4.1	1	+	26947	30719	26999	30578	1	26947,	30719,	0	PB.4	incmpl	incmpl	0,
PB.5.1	1	+	36382	36882	36524	36854	1	36382,	36882,	0	PB.5	incmpl	incmpl	0,
PB.6.1	1	+	53833	55769	53924	55610	4	53833,54563,55057,55235,	54485,54994,55181,55769,	0	PB.6	incmpl	incmpl	0,0,1,0,
PB.6.2	1	+	53884	54469	53924	54467	1	53884,	54469,	0	PB.6	incmpl	incmpl	0,
PB.6.3	1	+	53887	55646	53924	55610	4	53887,54563,55057,55235,	54485,54994,55181,55646,	0	PB.6	incmpl	incmpl	0,0,1,0,
PB.6.4	1	+	54921	55430	55430	55430	3	54921,55057,55235,	54994,55181,55430,	0	PB.6	none	none	-1,-1,-1,

And the identifiers file

Gene stable ID	Transcript stable ID	Gene type	Transcript type	Gene name
PB.1	PB.1.1	protein_coding	processed_transcript	PB.1
PB.2	PB.2.1	protein_coding	processed_transcript	PB.2
PB.4	PB.4.1	protein_coding	processed_transcript	PB.4
PB.5	PB.5.1	protein_coding	processed_transcript	PB.5
PB.6	PB.6.1	protein_coding	processed_transcript	PB.6
PB.6	PB.6.2	protein_coding	processed_transcript	PB.6
PB.6	PB.6.3	protein_coding	processed_transcript	PB.6
PB.6	PB.6.4	protein_coding	processed_transcript	PB.6

I got a error when running qapa

[luping@centos gtf2gff]$ qapa build -H -N --db FG.identifiers.gff ORF.genedPred 
[qapa] Annotation step will be skipped
[qapa] Extracting 3' UTRs from table
Warning: More than 75% of entries skipped. Are you using the correct database?
[qapa] Annotating 3' UTRs
[qapa] Error: 
Command was:

	bedtools groupby -o collapse,collapse,collapse,collapse -i /tmp/pybedtools.YiKae_.tmp -g 1,2,3,6,7,8,9 -delim | -c 4,5,10,11

Error message was:

*****
***** ERROR: Requested column 4, but database file /tmp/pybedtools.YiKae_.tmp only has fields 1 - 0.

Tool:    bedtools groupby 
Version: v2.25.0
Summary: Summarizes a dataset column based upon
	 common column groupings. Akin to the SQL "group by" command.

Usage:	 bedtools groupby -g [group_column(s)] -c [op_column(s)] -o [ops] 
	 cat [FILE] | bedtools groupby -g [group_column(s)] -c [op_column(s)] -o [ops] 

Options: 
	-i		Input file. Assumes "stdin" if omitted.

	-g -grp		Specify the columns (1-based) for the grouping.
			The columns must be comma separated.
			- Default: 1,2,3

	-c -opCols	Specify the column (1-based) that should be summarized.
			- Required.

	-o -ops		Specify the operation that should be applied to opCol.
			Valid operations:
			    sum, count, count_distinct, min, max,
			    mean, median, mode, antimode,
			    stdev, sstdev (sample standard dev.),
			    collapse (i.e., print a comma separated list (duplicates allowed)), 
			    distinct (i.e., print a comma separated list (NO duplicates allowed)), 
			    distinct_sort_num (as distinct, but sorted numerically, ascending), 
			    distinct_sort_num_desc (as distinct, but sorted numerically, descending), 
			    concat   (i.e., merge values into a single, non-delimited string), 
			    freqdesc (i.e., print desc. list of values:freq)
			    freqasc (i.e., print asc. list of values:freq)
			    first (i.e., print first value)
			    last (i.e., print last value)
			- Default: sum

		If there is only column, but multiple operations, all operations will be
		applied on that column. Likewise, if there is only one operation, but
		multiple columns, that operation will be applied to all columns.
		Otherwise, the number of columns must match the the number of operations,
		and will be applied in respective order.
		E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5,
		the mean of column 4, and the count of column 6.
		The order of output columns will match the ordering given in the command.


	-full		Print all columns from input file.  The first line in the group is used.
			Default: print only grouped columns.

	-inheader	Input file has a header line - the first line will be ignored.

	-outheader	Print header line in the output, detailing the column names. 
			If the input file has headers (-inheader), the output file
			will use the input's column names.
			If the input file has no headers, the output file
			will use "col_1", "col_2", etc. as the column names.

	-header		same as '-inheader -outheader'

	-ignorecase	Group values regardless of upper/lower case.

	-prec	Sets the decimal precision for output (Default: 5)

	-delim	Specify a custom delimiter for the collapse operations.
		- Example: -delim "|"
		- Default: ",".

Examples: 
	$ cat ex1.out
	chr1 10  20  A   chr1    15  25  B.1 1000    ATAT
	chr1 10  20  A   chr1    25  35  B.2 10000   CGCG

	$ groupBy -i ex1.out -g 1,2,3,4 -c 9 -o sum
	chr1 10  20  A   11000

	$ groupBy -i ex1.out -grp 1,2,3,4 -opCols 9,9 -ops sum,max
	chr1 10  20  A   11000   10000

	$ groupBy -i ex1.out -g 1,2,3,4 -c 8,9 -o collapse,mean
	chr1 10  20  A   B.1,B.2,    5500

	$ cat ex1.out | groupBy -g 1,2,3,4 -c 8,9 -o collapse,mean
	chr1 10  20  A   B.1,B.2,    5500

	$ cat ex1.out | groupBy -g 1,2,3,4 -c 10 -o concat
	chr1 10  20  A   ATATCGCG

Notes: 
	(1)  The input file/stream should be sorted/grouped by the -grp. columns
	(2)  If -i is unspecified, input is assumed to come from stdin.

What's wrong with this? Thank you

@kcha
Copy link
Collaborator

kcha commented Jun 11, 2018

Hi @Trandamere,

Can you try two things:

QAPA filters for protein-coding transcripts. For your purposes, I think you can get around this by manually replacing "processed_transcript" with "protein_coding", like this:

Gene stable ID	Transcript stable ID	Gene type	Transcript type	Gene name
PB.1	PB.1.1	protein_coding	protein_coding	PB.1
PB.2	PB.2.1	protein_coding	protein_coding	PB.2
PB.4	PB.4.1	protein_coding	protein_coding	PB.4
PB.5	PB.5.1	protein_coding	protein_coding	PB.5
PB.6	PB.6.1	protein_coding	protein_coding	PB.6
PB.6	PB.6.2	protein_coding	protein_coding	PB.6
PB.6	PB.6.3	protein_coding	protein_coding	PB.6
PB.6	PB.6.4	protein_coding	protein_coding	PB.6

QAPA requires chromosomes to begin with "chr" and skips random chromosomes. Can you modify your genePred file to include this and try again?

@Trandamere
Copy link
Author

Trandamere commented Jun 12, 2018

@kcha
Hi I already run the qapa build and get the utrs.bed like this

chr1	26947	30719	PB.4.1_PB.4_unk_chr1_26947_30719_+_utr_30578_30719	141	+	PB.4
chr1	36382	36882	PB.5.1_PB.5_unk_chr1_36382_36882_+_utr_36854_36882	28	+	PB.5
chr1	55235	55646	PB.6.3_PB.6_unk_chr1_55235_55646_+_utr_55610_55646	36	+	PB.6
chr1	55235	55769	PB.6.1_PB.6_unk_chr1_55235_55769_+_utr_55610_55769	159	+	PB.6
chr1	68741	69032	PB.7.1_PB.7_unk_chr1_68741_69032_+_utr_68938_69032	94	+	PB.7
chr1	74060	76426	PB.8.1_PB.8_unk_chr1_74060_76426_+_utr_76352_76426	74	+	PB.8
chr1	78550	78995	PB.9.1_PB.9_unk_chr1_78550_78995_+_utr_78886_78995	109	+	PB.9

But I get another error for qapa fasta

qapa fasta -f genome.fa utrs.bed utr.fasa

Traceback (most recent call last):
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/bin/qapa", line 11, in <module>
    load_entry_point('qapa==1.1.1', 'console_scripts', 'qapa')()
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/qapa-1.1.1-py2.7.egg/qapa/qapa.py", line 258, in main
    args.func(args)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/qapa-1.1.1-py2.7.egg/qapa/qapa.py", line 226, in fetch_sequences
    fasta.main(args)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/qapa-1.1.1-py2.7.egg/qapa/fasta.py", line 38, in main
    seqs = get_sequences(args.bed_file[0], args.genome)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/qapa-1.1.1-py2.7.egg/qapa/fasta.py", line 10, in get_sequences
    return bed.sequence(genome, s=True, name=True, fullHeader=True, split=True)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/pybedtools-0.7.10-py2.7-linux-x86_64.egg/pybedtools/bedtool.py", line 806, in decorated
    result = method(self, *args, **kwargs)
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/pybedtools-0.7.10-py2.7-linux-x86_64.egg/pybedtools/bedtool.py", line 337, in wrapped
    decode_output=decode_output,
  File "/disk/luping/tools/third-seq/pitchfork-isoseq_sa5.0.0/lib/python2.7/site-packages/pybedtools-0.7.10-py2.7-linux-x86_64.egg/pybedtools/helpers.py", line 356, in call_bedtools
    raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError: 
Command was:

	bedtools getfasta -split -s -name -fullHeader -fo /tmp/pybedtools.pdfY8S.tmp -fi genome.fa -bed test-utr.txt

Error message was:
Warning: the index file is older than the FASTA file.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.


The genome file like this

>1 dna:chromosome chromosome:RR:1:1:11760950:1 REF

But I get fasat when run bedtools

bedtools getfasta -fi genome.fa -bed utrs.bed

Warning: the index file is older than the FASTA file.
>chr1:26947-30719
TCTACAAAGTGAACTTTATTATAACATCACTTCACTCCTCCTCAAAGCAATGCAAGGCCTGCCAACCAGAACTGTCAGATGGATGCTCTTTAGCGACCTTCATTTTAAGCATCATGACTTGAATCGAGTCTGGCAGACAGCGAAGTGGATTGTTGCTGAGGCAGAGCGACATCAGGTCAGGAGGGTAGTAGTATGCGGCGATCTGCTAACGAGCCGCACCATGCAGCCAACACATGTTTTGTCTACATGTTATCGCTTCATTGGCTTGCTCAGTGATATCATACCCCACGTCCATATCCTACTTGGGAACCATGACCTTGCCTACCGTCGTGACTACCAAACCACGGCCCTCGACGCCTTCAACCTTAATCGCCTAGCCCCCTACGTCTCTATCCATAAAGAAATTACCCAACACGAGTGGGACGGCCGACGTGTTATACTTTTGCCATTTCGCGAGGACCAGAGTGGGTTGACAGATGCTGTGGCTTCTCTGAGTCCTACAGAGGCGAGCAAAACCGTTGCCTTCGCCCATCTTGCGATCAACAAGGCCATCACACAGCGGTACGTGGTCAGTGATGGCTTCAAAAACTTATCGGCGGCGAAATCCATCATATACAATGGATTGACGGGTCCGGATCAATTTGCCTCACTGGCCCGAACTTTTACGGGACATTTCCACAGCCACCAGACCATCACCCAGAAGCAACCAAATACCAACAAGACAGATCTTCGAGGAAGCATCACGTACCTGGGCTCACCATTGCAACTGAACTGGAGCGACCTCTACGATGAGAAGCGTGGCATTCTGTTATTCGATCCAGAAACGCTCGAGCACGAAATGCTTATTAACCCATATGTCATCGGCTATACAACTGTAAATTTACAGGAAGTACTCAACGACCAGGCCAACGAAGAAGCTTTAAAGGATAAGCACGTAATGCTTCTAGGTGATTTGAGCCATCTCAAGTATGTTATGGCTCGTGACAAGCTCCTTTCACTTGGGGCGCGGAGTGTACGGGACTGGAACCCTATGGGCTTTAAATCGCATTCTGACCGTGCGGCTTTGGGATCATCGGTCCCAGCCAGCGATGTATCCTTTCAGCCCTTAGAGGAGCCCAGGGAAATTGAGGATACAACAGGCAATAGGGTATCCAGCTCTGCCCTTGACTACCAACCTCGAGCCAATAAACTCGATTTTGCGGCGGAGGCTCGGGAGTATGTCACATCACTAGAATTAGATAACCCTCTCTCCTTGCGACGAGAAGAGTTAATTCGAGTTGGCCAGAGAATGATTCAGGTTTCCCATGGGCTAGTAGATCAGGATGAAGACGAAGACCTCCAGCTGAACTATGAGGTCTTTCTCGACAGGTCAACTCAGGCCATCAGCACGAGAAGCGCTACCGATCTGAAAGGTGCCTCCGGTCATAATTTCGCCGCAGTGCCATCCACTCTTACGATCACCAATTTCCTGAGCGTGCAGAACACCATCACCATTGACTTCCGGTATGATCTTGCGCGTGGACTAACTTTCCTTGTTGGCGAGAATGGCTCTGGCAAGAGCATGCTGATTGAGGCCATGGTTTGGTGCCAGTTTGGGCAGTGTGTTCGCAGCGGCATGGCGGTTGACGATGTTGTCAACGATATTATAGGTAAGAATTGCTGTGTCACGCTGGAGTTCGCCAACGGTTACGCGATCTCGCGTCATCGAAAGCACGAGAAAAACGGTAACCGCGTTATAATTCTTTTACATGGCGAGGCCCAGCTGCAATATGATCGCCCAGGCAAAGGCACGACACAGGAGGCTATCGTCGAGTTGCTTGGTATCAACTACGAAACATACATCAAGACAGTCGTATTGAGCCACGAAAGCGCTGCGAGCTTCTTGAGCTCAAAGGCAGCGGAGAAGCGTAAATTGACAGAGGCAGCTCTTGGAATGTCGATACTAGATACCTATGGGAAGGTATCGAGATTACTCCTCAAGGACATTGATGAAAACGCGAACGCGGTGGATAAGAATATGCAGGATGTGGCTCGCACAATGGAACATACTGAAAAGCATATCGAAGTTCTGGACCGGAAACGGAAAATATGTGAAATGGAGGCGTATAACGCTGAGGCATCTCTTGAGTTGGCGATACAGGACCATGCAACCGCCAGATTACGAAGTGATGGGCAGTTTGGAGACGGAGACCTTGAGCTTTCTATAGACTCCGCAGAGTATGAACAGAAGATGTCGGCCTCAAGCCACGGATTGCAGCATGCAATGGAAGTGGCACACCTCGCTGAGCTTACAATGGGTTCCCACGCAAAGGTCTCGGCATTGACGGATCAGATCAGTATGGAGAAGAAAAGCCGGCAAAGCCTTGAAAACGAATATACCCAAATACGAGCAATGGTGAATGCCGATTTCATATCATGGCTCGTCGGACTGAAGCAGAACCTCGGTCAAAATCTGGAGGCCGTGCCAGCAGATCGTCCCGCCATACTCCAAAGACTTTTATATGCTATAAGAGCAGTGGTTACCAGGGGTGCGGTGGGCGTATTGGAATTCTTGATAAGCAACTATCAAGCAATCGACACCAAACATCAACAACGAAGAGTGGTCATTGACGGTCTCTGCGAGGACATTAAAGACAGCAAGTTGCGGCTGCAAAGCCTCCATCTCGAGGTAACGGATATTGTTCATCGAAATCGGCTTGCCACTAGCCAGGCCCTTCGGGTGATCGACAAGCGAATCATGGAGGTCGAGCAGGCTCAAAAAACATGCGCAGATGTTATGTTCAAGCATCAACAGGCCATGTTTAGGCAGCAGGAAGAGCTCATATTTAAACAACAGCAAGAGGCTAAACTCAAGCACAAGCAAGAAGAGGCGATGCTTAAGCGACACGAAGCAGCTATCTATACATGCCTTATCGAAGCCGAGCGATCATCTCTACGCTCAATTAGCCAGAAGTACGATAACCTGGCTCTCAAACTTAGAGAACTTGCAGCCGATCGCGAGGTCTTTGCGTTCTGGTCCTCTGCTTTAACAAAACGTAACAAGCGCACTAGTTCCTCAGCCAAATCCAGGGGGAAATCAACGGCCAACTTCCCCGAATATGTCCTCGAAAAGTCAATATCACATCTCAATAGCTTACTTGCACAGATCCTTACGGCACTGTACGACGACACACGCCACACAAACGCCATCGCAACTGGAATGCTGAGTTCGCTGTTCGACTCTGAATCTATTGGCACCATGGAACACGAATCTTGTTCGCCTGGGCCTGTCCTCGATCAGAGCCTTGGTATCCACCAGTCGCTAGCCTATGGGAAGCGATCAGGTGGCGAGCGCAAGCGTCTCGACCTAGCGCTCTTCTTCGCACTGCTACATCTGGGTTGGGCAGAGAGCGCGCATCGGGCACATTACCTGCTCATCGATGAGGTGTTTGACAGTCTAGACGAGGCGGGTCAAGAGGCAGTTGTCAGGTGGTGCATGACAATGCTACAGTCGATGGTCAGTTGGATTGTGATGGTCACTCATAGCCGATTCTTGGCTGAGCGAGATCCGGAAAGGGATGCGGGTAAGGTCATGGTCATGCGGGTAAAGATAGGGAGCCAAGGTACGGAGCTTGTTAAAGATAGATAGAGGATCGGTATCTAAGAGAGAGCTTGAGGCCATTAAGGATAACAGTTTACCTATAGTGTTTAAAAAACTTCCTATTAAATTGTATTTTACTAGTATAGCCTAAGCTAACAGCTTTGTAAAATGGGAGAATACGGATGGATTTA
>chr1:36382-36882
TTTGGTTCAGGCAATCACTAATAAAGGATTTCCTTCGATCTGTAAATGAGTTTCCGTAAGAGTATCAGTTACTCTCTATACTCTTTATAATCGTGTTCATACTAAAGCTTGGCAAATTAAACTGACTAGGAATCAGAAGATGTCCGCAACCCAAGGAGATATAAAGGCTACAATCGAACTCCTACGCCTAAAACAAACCGGAAGCGCCAGGGATTACAGTACCAAGTTTCTCGAACTTTTAAGCAAGACGACAAAGGAAACATATTTGGCAGCACGATTCTTTCTGGGACTAAAAGAAGAGATCCAAGAGGCTATACATAAGGATGGAGAACTGCCGGCTACATTCGAAGATATGGCGAGAAAAGCGACGACCATTGACAACTACCTCCACGATAAACGGAGAAAACGTGGACTCTATTATGTATGCGGTGCAAGCGGCCATATAGCAAAGGACTGTAAGACGGAGTGATAGACATAATTGAGATAAATGCAGTTTGCTT
>chr1:55235-55646
TGTAGAAGGCCAAGTCAAACCAACGAGCATTTACTCTCTTACTTCGGTTCCAAAGACATAGGCATCTCTCACACACTATTCCGCCGTTTCTTCTGGGCGGACAACGTATTATGGAAAGAGGATATCCAAGATCGCCCGGCGACTGTGGTCTTGGCTGGAAGAGATTGTGTAATTGACACAAAAACAATTCGGGCATACCTGCTGGGCTTAGACAAGTGGGCCCTGGTAACTACAGATTTGATGGACCGCGGAAGGAAAGACGAAGGATTGGATGTGATCTGGTTCCAAGATCTGGACCACGGCCAGGTATTTGATGAAAAGAGGACGAGAAGAAGTTTAGTAAAGGTTATCTGGAACCTCTGCAAGCAATAGGAGGTAGCGACAGAGGTATTTGTTGTTTGGTTCGGTTTA
>chr1:55235-55769
TGTAGAAGGCCAAGTCAAACCAACGAGCATTTACTCTCTTACTTCGGTTCCAAAGACATAGGCATCTCTCACACACTATTCCGCCGTTTCTTCTGGGCGGACAACGTATTATGGAAAGAGGATATCCAAGATCGCCCGGCGACTGTGGTCTTGGCTGGAAGAGATTGTGTAATTGACACAAAAACAATTCGGGCATACCTGCTGGGCTTAGACAAGTGGGCCCTGGTAACTACAGATTTGATGGACCGCGGAAGGAAAGACGAAGGATTGGATGTGATCTGGTTCCAAGATCTGGACCACGGCCAGGTATTTGATGAAAAGAGGACGAGAAGAAGTTTAGTAAAGGTTATCTGGAACCTCTGCAAGCAATAGGAGGTAGCGACAGAGGTATTTGTTGTTTGGTTCGGTTTAATTGGCAACGACCCCTCATAGTAGTTTGATTGTGGCCTTATGAGGTCTTGAAAGCTATGCACTATTTTACTTGGGTATATGAGAGCCCATGCTGGCACATTCAAATCCGGCTTTGTTTACGAA
>chr1:68741-69032
GAGCTTTGCCTGGCTGGAGAAAGCGACTTGGACAATTTATTGAGATCACACAGCGCAAAGAAACGCAACAGGATGGGAACGATGTTGTTATGGAAACTTTGCTTCCGTTTAACCCTACTGCGGGAGGCCAGCGCGATGAAAGAGAGTGGCGTACTATGGCAGCGAGGCGATCTTGGAGCAGAATAGATAATTAGCCACGGATTCTATGGGATATGCGCAAGTCAAGAGAATTCTCTGTATATAGGATCCTAATATTAAATTGGTAATTTACTGAAAGGTGTTATAGCGGCA
>chr1:74060-76426
TATATACCTCCAAGTAGATATCTCATACGCAGCTCAGATAGTCACCCACCTCTTCAATCCATTAACCCCATTTACTACACTTCTTATTTGCTCCCAACCATGCATCAGCATTGGAAGACACTCCTCACGCTGTCAACTCTCACATCAAGCAGTGTGGGAAAGGATCTTCCATTCGATGCCCATATCCTCCTTCAACAGCCTGGCCTGCCTTCTGTGTTTAGCAAGAATGGTACAGGCTCGGGGCTCAACACCCCGTCCACTCGCACACACAGGTCCATTGATCTGAGTGAGATTGTCCCTGCCCGAGATTCTGCTGGCAATCATCGCTGCTGTCCCATTGGAACTGTCAACGACGGCACCGGCTGCGTGTTCCCTCAATCTTCAGTGTGCCCTGAGGGTACCTACCTTGATGGCAACAACTGCATCAGTATGAAACCACCGAGTTGTCCCGAAGGTCTTTTCTACAATGGCCGAAACTGCATTGGCCTTCCGCCTAGCTGCCCCACTGGAAGCTGGTTTAATGGCGACGCCTGTGAGTCTGAGGCAGGGCCTGGGTGTAGTCAAGGCTTCAGGTATGACGGCGGTGTTTGCATTTCCCTCACTGGGCCGGGATGCCCTCCTGGATTTACTCTCAAGAATCGGCTTTGCACTTCTAGTGTCCCACCACAGTGTCCCTCCCCTCTCATACTTCAGGACGGGAGCTGTGTTGGTTCCAAGCCCCCATCTTGCCCTCCCAACACAAAGCTTTCCGATGGCAACTGCATCGGCAAAGTCCCTCCGTCTTGCCCTCCTGGGTACGTCATGGAGAACAATGTTTGTACTCACAAGACCAAGCCAGAGTGCCCTATTGGCTCCTACTTTGATAGCTCTGCTTGTGTATCTGAGAAGCCCCCTTCTTGTGAGCCCGGCTTCTTCAACGGCAAGGTCTGTGAAGACAAACCGTCCTGCCCACCCGGCTTTAACTTGAGGGGAAGAGTGTGTTCTTCTGAAAGTAAGGCCACTTGCGAGGTTGGCACCTTTTTCTCTACAGATGATACCACTGGAAAGTCTGCTTGTTGCCACAAGAACTTCCAGTTCAATGGAGACAGGTGTTACATGAAAGCTGAGAATGAGGGCGACTGCCCTCCAGGATCCCACTTTGATAAGGGGCGGTGCTGGATTACTCCGTTGAAGGAGCCCGAATGCCCTACCGGAGGAAGGTTCAACGGCAAGGTCTGCATCGTTCAGTCTGTCCCTGAATGCCGATCTGGGACATACAATAATGGGAAATGTCTTATCTCTGAGGATCCCACTTGCCAGGGTGGTGCTGTCTTTGACGGCGAAAAATGCGTTATGGAAAAGCGCGCGACTTGTCCTCCAGGCTCATCCCTCAGTGGCAAGAAATGTATCTCGAGCCATGATCCAACATGCGAGGCAGGCCAGATCCTGATCGGAGATTACTGTACTTCGGAGAAGCGCCCCGAATGTCCCGATCCCTCAATACTTGTAGGTCGATACTGCAAGTATCCGGATCTCCCTTACTGTGAGGATGGCACCGACCGCAACGGCAACGACTGCACATCCCCAGTGACACCAGATTGCCCAACTGGAAGCTCTTTCAACGGCAAAAACTGTGTTTCGAACGTGCGTCCTGATTGTAACCCTGGGTACTTCTTTAATGGGTCCACTTGTGTTTCCAAAAAAGATGGTCCGGGTTGTGCAAAGGGCATGTACTTTGACGGCACCAACTGTGTATCCAAGGAGCGTCCTGGATGCAAGGATGGTAGTACTTTCAACGGAATTGCTTGCGTTGACGAAGCCGACCCCAAATGCCGCAAAGGTCTCGTCTTCGATGGCACCAGCTGCGTGTCCAAGTCCCCTCCTAACTGTGTGTCAGGCACAACCTGGAATGGAAGCAAGTGTACTTCAAACAAGAGTCCAGGATGTACCCCAGGAATGACATATAATTCCGAGACTGAGACCTGCGACTCCGATACCAAACCGAAATGCCCTACGGGTACTGAACTTAGAGGCGATATCTGTGTATCGATTAAGCCGCCTGCGTGCCCTGAAGGAACGACTCGCAAGGGCGGCAAGTGCGTGTCTACTGACCCGCCCGAGTGCCCTGACGGTCTTACTCCCAGCAATGGGCTTTGTGTCTCTACCAAGCAGCCCAGTTGTGGTGAAGGTACCATCTTTGACGGCAAGCGCTGCGTTATTGGTAAAACCCAGGAATGTTATACCTTGCATACCTGTCCACCTGTTGGTGCTGCTGGACTGACTGCCCAGCCTGCTCTAGCTTCTCAGTAGATAATAGTCTCATGCAATTATGTATCACTGTAGCTCTAGTCTACATTTAATGCAAAGTCATATCATTTCTCCTTTCA
>chr1:78550-78995
AATGAGGACGGTGCTGGCCCCCTGATGGTAGACGTCGACTTCACTTCGGGAGGCACTGATCCATCAGCGTTCAAATCCGTAGAAGTGGTTCATAATATCATTGGCGTATTGGGATTTTCAACCGTAAGCTCAACTGACTTTCCGGTCGTTGTCAAAGTGCCTGCCGGTAAAGTTTGTTCCGGAAAGGTAGCCGGCATGTCTGGTGTTTGCATTGCGAGGGTGCGAAACAGTGCCACTGCTGGCCCCTTTGGCGGAGCTGCCGCCTTTACACATTCTCCTATAACTTCTACGGGAAAGAAGCCCAGCGCAAAGTTTCGTAACCGCCATATCTAGTTTCGTATTCCCTACATTTACTCAGAATGGCGTTATGTTACATTATTATCACTTATAGGCTGGCATCATTGTTAGTGTCATATTCAGCAATAAAATAGATAAATAATCGCTA

@kcha
Copy link
Collaborator

kcha commented Jun 12, 2018

Hi @trandemere,

This appears to be a bedtools warning and not from QAPA. My guess is that because after adding the "chr" prefix to the chromosome, you will have to do the same for your fasta file. Sorry for the inconvenience. In the future, I will look into removing this restriction.

The qapa fasta module is essentially a wrapper for the bedtools getfasta command (e.g. bedtools getfasta -split -s -name -fullHeader ... with a few extra filtering steps (short sequences less than 100 nt are removed). I am bit surprised that your bedtools command worked given the discrepancy in chromosome names.

@Trandamere
Copy link
Author

@kcha
Is this option requierd? -fullHeader
It doesn't work although I change the chr

>chr1
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC

Warning: the index file is older than the FASTA file.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping.

It works when this parameter is removed -fullHeader

>PB.4.1_PB.4_unk_chr1_26947_30719_+_utr_30578_30719(+)
AATGC
>PB.5.1_PB.5_unk_chr1_36382_36882_+_utr_36854_36882(+)
GAGTTTCCGTAA
>PB.6.3_PB.6_unk_chr1_55235_55646_+_utr_55610_55646(+)
TTCCAAAGACATAGGCA
>PB.6.1_PB.6_unk_chr1_55235_55769_+_utr_55610_55769(+)
TTCCAAAGACATAGGCATCTC
>PB.7.1_PB.7_unk_chr1_68741_69032_+_utr_68938_69032(+)
ACACAGCGCAAAGAAACGCAA
>PB.8.1_PB.8_unk_chr1_74060_76426_+_utr_76352_76426(+)
ACC
>PB.9.1_PB.9_unk_chr1_78550_78995_+_utr_78886_78995(+)
GGG

@kcha
Copy link
Collaborator

kcha commented Jun 13, 2018

Yes, you may be right that -fullHeader is unnecessary. Although I'm not sure why it wouldn't work if you included it.

@Trandamere
Copy link
Author

@kcha
How can I fix it in qapa?

@kcha
Copy link
Collaborator

kcha commented Jun 14, 2018

@Trandamere, as this is a bedtools error, I don't have a good sense of what's causing the problem. Can you send me your genome.fa via e-mail or a public URL that I can try to replicate the problem on my end? The chr1 sequence you provided above seems too short so I assume you posted a fake sequence?

@Trandamere
Copy link
Author

@kcha
The URL

ftp://ftp.ensemblgenomes.org/pub/fungi/release-32/fasta/fusarium_graminearum/dna/

@kcha
Copy link
Collaborator

kcha commented Jun 14, 2018

I was able to replicate the issue. Thanks for your help. I have removed the
requirement for fullHeader and things seem to work now. I also have removed the
need for chromosome names to begin with "chr", so in the future you don't need to modify your files. Please download the latest changes from this branch and try again. I will merge these changes to the master branch soon.

@Trandamere
Copy link
Author

@kcha
I already run the qapa fasta and get the utrs.fasta
But I get a error for qapa quant

qapa quant --db FG.identifiers.gff ./FG*/quant.sf 
Merging samples by TPM
  |======================================================================| 100%
Separating Ensembl IDs
Error in separate_ensembl_field(merged_data) : 
  Unable to find Ensembl IDs by regex
Execution halted
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  no lines available in input
Calls: data.table -> read.csv -> read.table
Execution halted

identifier file

Gene stable ID	Transcript stable ID	Gene type	Transcript type	Gene name
PB.1	PB.1.1	protein_coding	protein_coding	PB.1
PB.2	PB.2.1	protein_coding	protein_coding	PB.2
PB.4	PB.4.1	protein_coding	protein_coding	PB.4
PB.5	PB.5.1	protein_coding	protein_coding	PB.5
PB.6	PB.6.1	protein_coding	protein_coding	PB.6
PB.6	PB.6.2	protein_coding	protein_coding	PB.6
PB.6	PB.6.3	protein_coding	protein_coding	PB.6
PB.6	PB.6.4	protein_coding	protein_coding	PB.6

quant.sf file1

Name	Length	EffectiveLength	TPM	NumReads
PB.4.1_PB.4_unk_1_26947_30719_+_utr_30578_30719(+)	3772	3550.25	5.09521	185
PB.5.1_PB.5_unk_1_36382_36882_+_utr_36854_36882(+)	500	279.346	3.15028	9
PB.6.3_PB.6_unk_1_55235_55646_+_utr_55610_55646(+)	411	191.038	0	0
PB.6.1_PB.6_unk_1_55235_55769_+_utr_55610_55769(+)	534	313.214	0	0

quant.sf file2

Name	Length	EffectiveLength	TPM	NumReads
PB.4.1_PB.4_unk_1_26947_30719_+_utr_30578_30719(+)	3772	3544.14	0.297814	24
PB.5.1_PB.5_unk_1_36382_36882_+_utr_36854_36882(+)	500	273.796	0.642508	4
PB.6.3_PB.6_unk_1_55235_55646_+_utr_55610_55646(+)	411	185.802	0	0
PB.6.1_PB.6_unk_1_55235_55769_+_utr_55610_55769(+)	534	307.613	0	0

kcha added a commit that referenced this issue Jun 15, 2018
 - Update regex for matching Ensembl IDs and species
 - Fix bug caused by input data coming from only one strand
@kcha
Copy link
Collaborator

kcha commented Jun 15, 2018

@Trandamere, thank you for sharing your result. The issue is due to QAPA not recognizing the format of the gene IDs in your data. I have now fixed this to be more flexible. Please pull the latest changes and try again.

kcha added a commit that referenced this issue Jun 18, 2018
 - Update regex for matching Ensembl IDs and species
 - Fix bug caused by input data coming from only one strand
@kcha kcha mentioned this issue Jun 18, 2018
kcha added a commit that referenced this issue Jun 18, 2018
* Add support for custom genePred files and
bypassing annotation step

* Add option to stop filtering of random chromosomes

 - chromosomes also don't need to begin with 'chr'

* Disable fullHeader

* More changes to support other species (#4)

 - Update regex for matching Ensembl IDs and species
 - Fix bug caused by input data coming from only one strand

* Update README and add helper script for installing R packages

* Restore indents

* Fix regex matching of multi-transcript cases

* Add optional --species option

* Minor checks
@kcha kcha closed this as completed Jun 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants