PA annotation details #2

Open
esebesty opened this issue Aug 2, 2023 · 5 comments
Comments

@esebesty

esebesty commented Aug 2, 2023

Just started using REPAC and was wondering about the annotation details, found in

library("REPAC")

data("hg38_pa")
data("mm10_pa")

Both the human and the mouse annotations contain 3UTR, CDS and IN annotation types. However, I can't find any description of them or of how exactly the data was generated. For example, the paper gives 67509 human 3' UTR PAS, but the above dataset contains 68423 hg38 3UTRs. Is there a more detailed description or code somewhere that I can check? Thanks!

@eddieimada
Owner

Hi Endre,

Thank you for your interest in REPAC. These annotations were derived from the polyAsite database. We took the annotations provided in the database and overlapped them with current hg38 and mm10 annotations from the annotatr package. We also removed sites that overlapped more than one gene. The differences in the number of sites might be due to a newer annotation version than the one we used when the paper was written. If you find anything odd, please let me know and I will look into it!
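
For illustration only, the "remove sites that overlap more than one gene" step can be sketched in base R on toy data. This is not the script used to build `hg38_pa` or `mm10_pa`; the table `overlaps` and its columns `site_id`, `gene_id` and `type` are hypothetical names chosen for the example, and the real overlap with annotatr annotations is not reproduced here.

```r
# Toy data: each row is one overlap between a poly(A) site and a gene.
# Table and column names are hypothetical, not REPAC's actual structure.
overlaps <- data.frame(
  site_id = c("s1", "s2", "s2", "s3", "s4", "s4"),
  gene_id = c("geneA", "geneA", "geneB", "geneC", "geneD", "geneD"),
  type    = c("3UTR", "3UTR", "CDS", "IN", "3UTR", "3UTR"),
  stringsAsFactors = FALSE
)

# Count the number of distinct genes each site overlaps.
genes_per_site <- tapply(overlaps$gene_id, overlaps$site_id,
                         function(g) length(unique(g)))

# Keep only sites that map to exactly one gene, then drop duplicate rows.
unique_sites <- names(genes_per_site)[genes_per_site == 1]
filtered <- unique(overlaps[overlaps$site_id %in% unique_sites, ])

# Tally the remaining sites by annotation type.
site_counts <- table(filtered$type)
print(site_counts)
```

A filter like this changes the per-type totals whenever the underlying gene annotation changes, which is one way the site counts could drift between annotation versions.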

@esebesty
Author

esebesty commented Aug 3, 2023

Hi, I would be interested in replicating the annotations available in the package. Are the scripts and R package versions used to generate the hg38_pa and mm10_pa datasets available somewhere? For example, the polyAsite database reports "Number of poly(A) site clusters: 569,005". How exactly did this lead to the 68423 3' UTRs? Which annotatr package version, which annotation version, and which exact filtering steps were used? Something similar to what is usually described for Bioconductor package external data would be ideal.

@eddieimada
Copy link
Owner

I believe the closest script to obtain the current annotations would be:
https://github.com/eddieimada/REPAC_paper/blob/main/code/Bcell/00createBED.R

The input bed file used in this script was obtained using QAPA build with ENSEMBL v102 and PolyaDB v2 annotations.

I'm currently working on putting the package on Bioconductor. When I do, I will update the annotations and log the versions.

@esebesty
Author

Looking forward to the Bioconductor package! I just checked the linked R script, and it seems that you are further processing the output of the qapa build command for the mouse data.

Is this also true for the human data? It looks like there is another 00createBED.R script, referencing an hg38 UTR list here, that might be coming from QAPA or elsewhere.

@reck999

reck999 commented Mar 29, 2024

Hi Eddie,

Thank you for an excellent package! I've had a lot of success using it to study my mouse and human sets with my own BAM files. I work in C. elegans too and would love to use REPAC to probe APA changes in C. elegans. Would you be able to create a reference file from PolyASite for C. elegans? I've taken a few stabs at it myself, and even though C. elegans is supported by PolyASite, I haven't gotten it to work. Any advice would be great too. Thank you!

Randall


3 participants