-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C2U sites #12
Comments
Forgot to mention, but it may be relevant. The m6a list was obtained using a YTH-mut sample as the control matrix. |
Hi, Indeed when comparing to a YTH-mut sample, Find_edit_site.pl is not aware of the sequence at the annotated genome (unless the --fallback option is used and there is not enough coverage in the control YTH-mut sample). Instead, it only checks at the sequence in the control file. So it it possible to see a C2U if a T is found at the reference position, but when there is a mutation/SNP in you cell/mouse line at that position. For example the alignement coverage in the YTH-mut sample may show that a "C" is found instead of a "T". You can look at the coverage at this position using tabix.
this will output: "chr position #A #T #C #G #G #total" and you can see what is the coverage in your control sample. If this is because of a known SNP, you can filter them out using the "--filterBed SNP.bed" option using a bed files of annotated SNPs. Please let me know if that answer your question or if you have another question. If it doesn't, it may be a bug and I will need to look into it. Best, |
Hi, Thanks for the detailed response. Indeed, I have checked and there's an indel variant at that position. However, when I look at the coverage of the position in the YTH-mut sample, I can see that there are relatively few reads that read a C. Specifically, 948,084 reads contain a T, and 3,325 reads contain a C, out of a total of 15,434 cells. Similarly, in the YTH sample, 1,416,981 reads show a T, while 4,849 reads show a C. Based on that, I don't think it can confidently be said that there is a C2U at that position, do you? Could this Bullseye's behavior in calling variants be due to the variability that exists in the YTH-mut cells? Do you think that it would be safer to run Bullseye individually on the YTH and YTH-mut samples, and then manually discard those present in YTH-mut from the YTH sample? Thank you. |
Hi, Based on the information you gave me, I would say there is definitely something odd going on and I would not say that there is a C2U at this position.
If we look at the results you provided:
It seems that there is no coverage in the YTHmut at this position: column 7 is the number of mutations in control sample and column 8 is the coverage (or 10 if it is less then -ControlMinCoverage if specified). However, you say there are lots of reads at that position. This means that there is either a bug here or the control file used was not from a pseudobulk sample. For single single cell analysis, we treated the Control (YTHmut) samples as a "pseudobulk" sample: we ran parseBam.pl in "bulk" mode and not in "SingleCell" mode. Why? The default behaviour in Find_edit_site.pls is to compare the lines in the DART matrix file with those in the Control matrix file. If there are the same position, it compares coverage, mutations. Then it will move to the next line on the DART matrix only. It will go over each line at this genomic position (each individual cell) until it hits a new genomic position (ie. from chr2:121983394 to chr2:121983395) and then reads the next line in the Control matrix file until it matches the new position in the DART matrix. So this means that it if there are multiple lines in the Control matrix, only the first one will be used. If the Control matrix was processed in "SingleCell" mode, the first line had low coverage (as expected in single cell cell datasets), and Find_edit_sites would assume 0 mutation and 10 coverage as seen above instead. This could explain why it called those sites in some cells. If it had used the pseudobulk, it should have seen: 948,084 (T) / ( 3,325 (C)+ 948,084 (T) ) = 99.6% editing in YTHmut and the sites would have been filtered out. (It would actually have skipped that position) Can you confirm that a pseudobulk matrix was used? Does the tabix command above reports a single line per position, or multiple lines (one per cell) for the YTHmut? If you did use the pseudobulk, then there is a bug and I will need to look at in more details. Please let me know. If you used a single cell dataset instead of pseudobulk, you can use collapse_matrix.pl in /scripts to collapse the matrix to a pseudobulk, or rerun parseBAM.pl in bulk mode.
Running the single cell analysis on both YTH and YTHmut against the genome is not a bad idea. You could also compare single cell YTH and YTHmut to pseudobulk YTHmut and to identify likely background edits. This would allow a more complete picture of editing profiles (% editing, % cells with edits) in both population. Because of the variability in single cell dataset, you would need to choose the good cutoff for removing sites. I would agree with your statement and with being careful when calling edits. Some sites are more prone to background editing. For example, in human cells we routinely filter out sites obtained from cells expressing Apobec1 alone. Please let me know if it is working or if you have more questions. |
Hi there,
I have a question regarding the conversion observed in the output of Bullseye for single-cell data. It might be a silly question, and I might be missing something, but I can't seem to find the answer.
In the output of Find_edit_site.pl, I observe that the chr2:121983394 position has a C2U change:
However, when I refer to the reference genome I used with parseBAM.pl, I see that I have a T at that position:
My question is: how can I be observing a C2U when I have a T at the reference position?
Thank you.
The text was updated successfully, but these errors were encountered: