Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NA annotations in pau table (drosophila, dm6) #15

Closed
mirax87 opened this issue May 10, 2019 · 10 comments
Closed

NA annotations in pau table (drosophila, dm6) #15

mirax87 opened this issue May 10, 2019 · 10 comments

Comments

@mirax87
Copy link

mirax87 commented May 10, 2019

After successfully running the QAPA workflow for drosophila (ensembl release 91, dm6) and quantification using mRNA-seq using Salmon (please note, there is no 3'-seq involved) - I found that the the PAU table (output of qapa quant) contains a substantial number of NA entries in the APA_ID column, which I intepret as 3'UTRs discarded by QAPA.

Short version:

After double checking some of the NA entries, I found genes that contain only non-ambiguous UTR3 and can't explain why they were discarded.

Thus, my questions:

  1. Where do these NA entries come from
  2. How to deal with them downstream?
  3. How to reduce the amount of NA entries in the first place?

Long version:

The output of qapa build, i.e. total number of UTR3 before filtering:

$ wc -l qapa_utr3.bed 
16725 qapa_utr3.bed

Number of UTR3s assessed and the number of NA entries

> nrow(qapatable)
[1] 16505
> sum(grepl('^NA_', qapatable$APA_ID))
[1] 7294

The Num_Events columns shows for all NA entries the total number of NA entries, i.e. 7294.

> qapatable[grepl('^NA_', qapatable$APA_ID), c(1,2,5:12)]
# A tibble: 7,294 x 10
   APA_ID Transcript  Chr   LastExon.Start LastExon.End Strand UTR3.Start UTR3.End Length Num_Events
   <chr>  <chr>       <chr>          <int>        <int> <chr>       <int>    <int>  <int>      <int>
 1 NA_1_S FBtr0006151 chr2L       14618525     14618902 +        14618840 14618902     62       7294
 2 NA_1_S FBtr0070002 chrX        20051293     20052519 +        20052085 20052519    434       7294
 3 NA_1_S FBtr0070007 chrX        20167881     20168080 +        20167977 20168080    103       7294
 4 NA_1_S FBtr0070025 chrX        20171444     20171648 -        20171444 20171594    150       7294
 5 NA_1_D FBtr0070028 chrX        20156478     20157836 -        20156478 20157585   1107       7294
 6 NA_1_S FBtr0070032 chrX        20091427     20092408 -        20091427 20091605    178       7294
 7 NA_1_S FBtr0070036 chrX        19958392     19958900 -        19958392 19958502    110       7294
 8 NA_1_S FBtr0070040 chr3L       23341456     23341597 +        23341516 23341597     81       7294
 9 NA_1_P FBtr0070058 chrX        21600599     21601274 -        21600599 21600845    246       7294
10 NA_1_D FBtr0070059 chrX        21600501     21601274 -        21600501 21600845    344       7294
# ... with 7,284 more rows

Checking some NA entries through IGV, next to some ambiguous overlaps (overlapping 3'UTR with other transcripts, regions contained within intron, ...), there are unambiguous tandem 3'UTR - without overlap or other ambiguities, which are nontheless discarded.

@kcha
Copy link
Collaborator

kcha commented May 13, 2019

In the APA_ID, the NA values usually contain an Ensembl gene ID. So this looks like a possible issue during the build step. It might be that the Ensembl gene ID could not be probably retrieved. Can you share a sample of what your ensembl_identifiers.txt table and your genePred file looks like?

@mirax87
Copy link
Author

mirax87 commented May 13, 2019

Hi @kcha,

thanks for the quick reply.

I checked the agreement of gene IDs between ensembl_identifiers.txt and the genePred file. All gene IDs were found in both files.

In case you want to reproduce anything: I used the ensembl-91, dm6 annotation.

Below are the examples for the requested files:

ensembl identifiers:

> head  ensembl91_identifiers.txt 
Gene stable ID	Transcript stable ID	Gene type	Transcript type	Gene name
FBgn0036531	FBtr0075502	protein_coding	protein_coding	CG6244
FBgn0037375	FBtr0300738	protein_coding	protein_coding	kat-60L1
FBgn0037375	FBtr0300739	protein_coding	protein_coding	kat-60L1
FBgn0037375	FBtr0300737	protein_coding	protein_coding	kat-60L1
FBgn0037375	FBtr0300736	protein_coding	protein_coding	kat-60L1
FBgn0037375	FBtr0078628	protein_coding	protein_coding	kat-60L1
FBgn0037375	FBtr0300740	protein_coding	protein_coding	kat-60L1
FBgn0037375	FBtr0078627	protein_coding	protein_coding	kat-60L1
FBgn0037375	FBtr0300741	protein_coding	protein_coding	kat-60L1

Here is a sample of the genePred file

> head dm6_ensembl91.chr.genePred 
FBtr0392909	chr3R	+	567075	2532932	835377	2532854	22	567075,835375,869485,895785,947425,1025039,1138729,1191609,1342195,1454571,1467145,1588294,1845560,1861193,1938066,1961119,2136429,2307284,2413804,2461181,2503676,2532797,	567268,835491,869548,895893,947570,1025614,1138972,1191975,1342317,1455091,1467258,1588538,1846163,1861417,1938205,1961515,2137402,2307583,2414131,2461592,2503880,2532932,	0	FBgn0267431	cmpl	incmpl	-1,0,0,0,0,1,0,0,0,2,0,2,0,0,2,0,0,1,0,0,0,0,
FBtr0114258	chr3R	-	722369	722621	722621	722621	1	722369,	722621,	0	FBgn0085804	none	none	-1,
FBtr0302440	chr3R	+	1031170	1031354	1031354	1031354	1	1031170,	1031354,	0	FBgn0039987	none	none	-1,
FBtr0347367	chr3R	-	1366233	1366601	1366601	1366601	1	1366233,	1366601,	0	FBgn0267798	none	none	-1,
FBtr0347366	chr3R	-	1865107	1866008	1866008	1866008	1	1865107,	1866008,	0	FBgn0267797	none	none	-1,
FBtr0302347	chr3R	-	2156915	2157206	2157206	2157206	1	2156915,	2157206,	0	FBgn0058182	none	none	-1,
FBtr0445191	chr3R	-	2554161	3263582	2554161	3263573	21	2554161,2554469,2557977,2690089,2690286,2743110,2743225,2743786,2940912,3031405,3060392,3061234,3061772,3168994,3170468,3262113,3262572,3262749,3262922,3263194,3263512,	2554398,2555023,2558354,2690237,2690442,2743167,2743729,2744070,2941229,3031536,3061173,3061718,3062322,3169542,3170666,3262480,3262692,3262870,3263146,3263456,3263582,	0	FBgn0267430	incmpl	cmpl	1,2,0,2,2,2,2,0,1,2,1,0,2,0,0,2,2,1,2,1,0,
FBtr0345282	chr3R	-	2744303	2744800	2744800	2744800	2	2744303,2744757,	2744698,2744800,	0	FBgn0266747	none	none	-1,-1,
FBtr0345281	chr3R	-	2744304	2744800	2744800	2744800	1	2744304,	2744800,	0	FBgn0266747	none	none	-1,
FBtr0300207	chr3R	+	3322809	3354486	3322809	3354378	2	3322809,3353128,	3323218,3354486,	0	FBgn0086917	cmpl	incmpl	0,1,

List of chromosome names in the genePred file, with the number of occurrences.

> $ cut -f 2 dm6_ensembl91.chr.genePred  | sort | uniq -c | sort -k1hr
   8079 chr3R
   6903 chr2R
   6676 chr3L
   6628 chr2L
   5949 chrX
    339 chr4
    115 chrY
     38 chrmitochondrion_genome
     21 chrrDNA
      3 chrUnmapped_Scaffold_8
      2 chr211000022280494
      14 chr2110000222* [summarized manually]

For our internal annotations we remove the chr prefix, which I added back to the annotation. I forgot to exclude the mitochondrial_genome and rDNA from it. However, they shouldnt be responsible for the majority of the missing APA_IDs, if at all.

@kcha
Copy link
Collaborator

kcha commented May 13, 2019

Hi @mirax87, thanks for sharing the information. Can you share what your build command was? I'm wondering if the -H option was missing, which is required when the genePred file was custom built (e.g. using genePred --genePredExt.

@mirax87
Copy link
Author

mirax87 commented May 13, 2019

here are the commands - following the guide lines.

gtfToGenePred -genePredExt {input.gtf} {output.genePred} > {log}

qapa build -N -H --db {input.database} {input.genePred} 1> {output.bed} 2> {log}

both log files are empty - no warning nor error.

@kcha
Copy link
Collaborator

kcha commented May 13, 2019

The command looks good to me so I'm not exactly sure what the issue is. Would you be able to send me {input.database} and {input.genePred} for me to investigate? You can e-mail them to k.ha -at- mail.utoronto.ca.

It's been great hearing from users building their own libraries, but I don't have much experience working with other species outside of human and mouse!

@mirax87
Copy link
Author

mirax87 commented May 14, 2019

I just check issue #13 reporting an issues with data.table version 1.12.2 and seemingly introducing NA values.

It turned out, that my R environment is using exactly that version - maybe related? I am checking on my side by downgrading data.table.

What's the version that worked for you @kcha ?

@kcha
Copy link
Collaborator

kcha commented May 15, 2019

I have been using an older version, which was working fine. I have not investigated #13 as the user did not provide any further information. Perhaps with the data you e-mailed me I can finally resolve it.

@mirax87
Copy link
Author

mirax87 commented May 16, 2019

I tried downgrading the CRAN package data.table to versions 1.11.0 and 1.10.0.

No effect on the amount NA counts in APA_ID.

@mirax87
Copy link
Author

mirax87 commented May 22, 2019

Hi @kcha ,

your latest update fixing data.table(...quote = NULL) in qapa quant seems to have fixed the NA issue. I observed no further NA in the APA_ID column, as well as the Num_Events seems alright:

> range(qapa$Num_Events)
[1] 1 7

On top of that, I used data.table 1.12.2, indicating that the version works for me.

Further, I observed, that transcript IDs are now separated by a comma. But this is - if at all - a separate issue.

Thank you for your time and energy looking into it! From my side, it's okay to close this issue now.

@kcha
Copy link
Collaborator

kcha commented May 23, 2019

HI @mirax87,

Thanks for sending some sample data for me to investigate and resolve the issue. That was really helpful! As we discussed offline, the issue was related to an apostrophe in the database file and read.table() treating it as a quoting character. It has now been fixed such that this shouldn't be an issue any further.

@kcha kcha closed this as completed May 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants