-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
common_sites.min3rep.bed file is empty #13
Comments
Hi Lu, It is difficult to say without seeing the date. However, I think I can see the problem. In your last provided line : I would first merge sites found in each DART sample with
Then you can find the common sites across all replicates:
The same results can also be obtained using a combination of bedtools intersect, merge and unix sort. if the script is not working for you somehow. Please let me know if that fixes the problem. Best, |
Hi Lu, I am glad it is now working! I am don't really have premade scripts for MetaplotR and Homer. for MetaplotR, I follow the instructions here https://github.com/olarerin/metaPlotR. with some custom plotting in R afterwards. if you have already generated the prerequisite files from the genome fasta and genepred, you can then run:
the output file can then be read in R for plotting: here is an example code plotting the results :
For homer, you should use findMotifsGenome.pl. A good starting point could be:
this will look for motifs of length 6 and 8 in a region (size) of 10nt around each site in the out.RAC.bed file. This means that the motif is not necessarily centered on the edit site. For this you could run with the same motif length (-len 6) and RNA size (-size 6). I think homer uses the ncbi notation for assembly names (eg. 1, not chr1) so you may need to change the bed file to this format to reflect this before you run it. You can do this with:
Let me know if you have another question |
Hey Mflamand, Thanks so much for your so detailed information! I tried both the metaPlotR and Homer, and still have something want to discuss with you. For the metaPlotR, I used hg38 reference genome, but I got some error because there was no chr1_KN196472v1_fix.fa in the hg38 chroms files (I already contact the authors, but haven't got the response). so I only can use the hg19 instead? In addition, I tried the package Guitar (https://rdrr.io/bioc/Guitar/man/GuitarPlot.html) using the out.RAC.bed as input:
Does this plot make sense to you? Homer uses the UCSC notations (chr1...), so I just use the out.RAC.bed or the out.bed (no RAC motif enrich) files as input. Then I successfully got the plots In addition, I have another new question. I loaded the bam files (Dmut2Aligned.sortedByCoord.out.bam, Dmut3Aligned.sortedByCoord.out.bam, Dyth1Aligned.sortedByCoord.out.bam, Dyth2Aligned.sortedByCoord.out.bam, Dyth3Aligned.sortedByCoord.out.bam) which I got from the STAR alignment in the IGV browser, and searched for the ACTB gene, I could tell that there were some differences between yth files and mut files, and then I zoomed in, and clicked on one of the different positions, and got the percentage of the A, C, T, G, no U. my question is how should I observe the C to U using the IGV browser? |
Hi, For metaplotR, hg38 can be used. The problem is with the alternate and fixed chromosome sequences. If there is a mismatch between the provided genome.fasta file and the genepred file, there will be an error. For this kind of analysis, it can be best to use the "analysis set" genome files from UCSC which excludes those sequences (which can also interfere with alignement in RNA-seq by causing multi mapping). This would remove all lines with an underscore (_) in the chromosome name and would remove these from the final file. It should fix your problem. Otherwise, the output from guitar plot looks fine to me with a nice enrichment near the stop codon. The motifs in Homer looks OK, you could get something even better by trying more stringent size/length parameter are using a background list of shuffled sites. For IGV, I think it looks great. A few things to keep in mind, IGV is a genome browser, it will only show DNA bases, but we can assume that for RNA-seq, a "T" means a "U". Moreover, ActB is on the reverse strand so C-to-T (or C-to-U when it comes from RNA-se) will appear as G-to-A. By default IGV shows mutation (in blue/red or green/orange) when there is more than 20% mismatches. you can change this in : View/Preferences/Alignements and then Coverage Track Options: Coverage allele-fraction threshold: default, 0.2 is 20%, you could change to 0.05 for 5%. Best, |
Hi Lu, I am glad you have it working. Indeed those R lines were for comparing different analysis set and I forgot to removed them when I copied it here. For the perl script, it seems I did something wrong when I uploaded it last time. I have now fixed it. The new file is: make_annot_bed_bullseye.pl in the same directory. This one will also work with the unfiltered genepred file (directly from table browser without filter) so feel free to use in the future if you need to. It seems you have it working anyways. Best, |
Hey Mflamand, I just tried the new perl script, and it works! Awesome! Thanks so much for your help and patience! Best, Lu |
Hi, yes it is possible (and a good control). I would suggest to identify the sites in your 4th-mut matrix against the genome. Using something like this for each yth_mut file:
you can then find all sites common to both replicates:
or found in any replicate
and then use the resulting bed file to run metaplotR as above. |
The shift of edits towards the 3'UTR is what we have seen in the original DART-seq paper, so it looks fine. smoothing on the density plot could be applied to remove the peaks in each curve, which are unlikely to represent real signal. As for the error, I cannot tell without seeing what is in the original bed file. Did you run summarize_sites.pl on these files? If not, the header line could be causing this. Otherwise it seems that bedtools intersect is throwing an error because there is a mismatch between the order of chromosomes the bed file and the hg38_annot.sorted.bed file. if you run the following, do you get the same chromosome order?
If not, you can sort each file with : Alternatively you can run a mock RAC filtering as so:
Then the files would be treated exactly the same with or without RAC filtering so it could remove the error. Best, |
Hi Mflamand,
I'm running the bulk DART data, I have 2 replicates for YTH-MUT, 3 replicates for YTH.
Firstly, I got bam files using STAR, then I ran parseBAM.pl to get Dmut2.matrix.gz, Dmut3.matrix.gz, Dyth1.matrix.gz, Dyth2.matrix.gz, Dyth3.matrix.gz:
perl /usr/local/bullseye/1.0.0/Code/parseBAM.pl --input /data/epifluidlab/lulu/hek293t/3_star/output/Dmut2Aligned.sortedByCoord.out.bam --output /data/epifluidlab/lulu/hek293t/3_star/parseBam/Dmut2.matrix --cpu 4 --minCoverage 10 –removeDuplicates
Then I ran Find_edit_site.pl on 3 YTH files against the 2 YTH-MUT files to get Dyth1_Dmut2.bed, Dyth2_Dmut2.bed, Dyth3_Dmut2.bed, Dyth1_Dmut3.bed, Dyth2_Dmut3.bed, Dyth3_Dmut3.bed:
perl /usr/local/bullseye/1.0.0/Code/Find_edit_site.pl --annotationFile /data/epifluidlab/lulu/hek293t/4_ref/annotation.refFlat --EditedMatrix Dyth3.matrix.gz --controlMatrix Dmut3.matrix.gz --minEdit 5 --maxEdit 90 --editFoldThreshold 1.5 --MinEditSites 3 --cpu 4 --outfile Dyth3_Dmut3.bed --fallback /data/epifluidlab/lulu/gencode/GRCh38.primary_assembly.genome.fa --verbose
Then I ran summarize_sites.pl to get dythmut2.bed and dythmut3.bed, and then I merged the replicates files commonsites.bed: perl /usr/local/bullseye/1.0.0/Code/summarize_sites.pl --MinRep 3 --mut 3 dythmut2.bed dythmut3.bed > commonsites.bed. But my commonsites.bed is empty. Do you know how to fix it? Thanks so much!
Best,
Lu
The text was updated successfully, but these errors were encountered: