-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathLoopMS_16S_Fecal.Rmd
518 lines (454 loc) · 27.1 KB
/
LoopMS_16S_Fecal.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
---
title: "Evaluationg LoopSeq 16S on Human Fecal Samples"
author: "BJC"
output: html_document
---
## Setup
Load libraries, set the working directory to the location of this Rmarkdown file (only necessary when running by hand), and read in filenames:
```{r, echo=FALSE}
library(dada2, quietly=TRUE); packageVersion("dada2") # Should be 1.15.4 or later
library(Biostrings, quietly=TRUE)
library(ShortRead, quietly=TRUE)
library(ggplot2, quietly=TRUE)
library(reshape2, quietly=TRUE)
library(vegan, quietly=TRUE)
library(ape, quietly=TRUE)
set.seed(100)
setwd("~/LoopManuscript") # CHANGE ME to the location of this file
path <- "~/LoopData/16S/CallahanFecal" # CHANGE ME to the location of the fastq files
path.fig <- "Figures" # Relative path to where figures should be saved
path.rds <- "RDS" # Relative path to where RDS should be saved
fn <- list.files(path, pattern=".fq$", full.names=TRUE)
sapply(fn, function(f) length(getSequences(f)))
```
## QA, Filtering and Trimming
First do basic QA on these data, and then apply the DADA2 filtering and trimming workflow with the parameters selected from the detailed inspection of the Zymo mock community.
```{r}
plotComplexity(fn)
```
```{r}
plotQualityProfile(fn)
```
These profiles look consistent with what we saw in the Zymo mock -- no low complexity issues, and high qualities throughout the 16S sequence (i.e. ignoring the >1500nt tail). The one difference, especially in the R3.1 is that the length distribution is different, around a half of the reads in that library appear not to extend to the full-length of the 16S gene. This may be a common way in which quality variation in Loop libraries becomes evidence, as molecules that are sequenced to insufficient coverage are likely to end up being assembled into incomplete contigs that don't cover the entire targeted amplicon.
Enter primers, and confirm their presence and the overall orientation of the reads. Code is adapated from [the DADA2 ITS tutorial workflow](https://benjjneb.github.io/dada2/ITS_workflow.html#identify-primers):
```{r}
FWD <- "AGAGTTTGATCMTGGC" # Loop 16S forward primer
REV <- "TACCTTGTTACGACTT" # Loop 16S reverse primer
allOrients <- function(primer) {
# Create all orientations of the input sequence
require(Biostrings)
dna <- DNAString(primer)
orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = Biostrings::reverse(dna),
RevComp = Biostrings::reverseComplement(dna))
return(sapply(orients, toString)) # Convert back to character vector
}
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
for(f in fn) {
cat("Primers detected in", basename(f), "\n")
print(rbind(FWD.Primer = sapply(allOrients(FWD), primerHits, fn = f),
REV.Primer = sapply(allOrients(REV), primerHits, fn = f)))
cat("Out of", length(readFastq(f)), "total reads.\n\n")
}
```
Primers in the expected orientations, and yes it looks like the R3.1 library will take a big hit from incomplete amplicon reconstruction, with the REV primer only being detected in <20\% of the total reads, whereas the other libraries are closer to 50\%. It is worth noting that all of these libraries have a much lower fraction of primer-complete reads than the Zymo mock community data had (~90\%).
Remove the primers (and any flanking sequence) from the reads, and filter out reads that don't contain both primers:
```{r}
nop <- file.path(path, "nop", basename(fn))
out <- removePrimers(fn, nop, FWD, rc(REV), verbose=TRUE)
```
Filter the sequences and enforce minimum/maximum lengths appropriate for full-length 16S. Note that we are enforcing `maxEE=1`, as that was determined to be a better filter than `maxEE=2` in the Zymo mock community data.
```{r}
filt <- file.path(path, "filtered", basename(fn))
track <- filterAndTrim(nop, filt, maxEE=1, minLen=1400, maxLen=1600, verbose=TRUE)
```
Final inspection of the quality profile:
```{r, message=FALSE}
plotQualityProfile(filt)
```
Final progress of reads through filtering and trimming:
```{r}
cbind(raw=out[,1], primers=out[,2], filtered=track[,2])
```
The large loss of reads due to a lack of detectable primers, especially in the R3.1 library, is notable. It seems likely that optimizing the ratio of Loop molecules to total output reads will be an important part of maximizing throughput in the form of the most high-quality Loop reads per run.
# Denoising
Learn the error rates:
```{r, warning=FALSE}
err.rds <- file.path(path.rds, "err_16S_Fecal.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(err.rds)) {
err <- learnErrors(filt, multi=TRUE, verbose=0)
saveRDS(err, err.rds)
}
err <- readRDS(err.rds)
plotErrors(err, nominalQ=TRUE)
```
Denoise the filtered data into ASVs using current DADA2 defaults:
```{r}
dd <- dada(filt, err, multi=TRUE, verbose=0)
dd
```
Make sequence table and remove chimeras:
```{r}
sta <- makeSequenceTable(dd)
st <- removeBimeraDenovo(sta, minFoldParentOverAbundance=4.5, multi=TRUE, verbose=TRUE)
```
Assign taxonomy:
```{r}
tax.rds <- file.path(path.rds, "tax_16S_Fecal.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(tax.rds)) {
tax <- assignTaxonomy(st, "~/tax/silva_nr_v132_train_set.fa.gz", minBoot=80, multi=TRUE)
saveRDS(tax, tax.rds)
}
tax <- readRDS(tax.rds)
if(!identical(getSequences(tax), getSequences(st))) stop("Taxonomy mismatch.")
table(tax[,"Phylum"], useNA="ifany")
```
The usual suspects at the Phylum level at least.
## Comparison to PacBio full-length 16S sequencing results
Currently, PacBio is the gold standard for full-length 16S gene sequencing. Previously, we sequenced these same three fecal samples as part of [our earlier investigation of the accuracy of PacBio full-length 16S sequencing](https://doi.org/10.1093/nar/gkz569). We processed those samples through the DADA2 workflow in that paper, and have included the sequence table obtained from a set of 12 fecal samples (including the 3 resequenced here using Loop full-length 16S) as part of this repository.
Importing PacBio results, and coordinating sample names between the Loop and PacBio results:
```{r}
pb1.rds <- file.path(path.rds, "PacBio_Fecal_st1.rds")
st1.pb <- removeBimeraDenovo(readRDS(pb1.rds), minFoldParentOverAbundance=4.5,
multi=TRUE, verbose=TRUE)
pb2.rds <- file.path(path.rds, "PacBio_Fecal_st2.rds")
st2.pb <- removeBimeraDenovo(readRDS(pb2.rds), minFoldParentOverAbundance=4.5,
multi=TRUE, verbose=TRUE)
sam.names <- sapply(strsplit(rownames(st), "_"), `[`, 1) # e.g. R3.1_contig_list_trimmed.fq
rownames(st) <- sam.names; rownames(sta) <- sam.names
rownames(st1.pb) <- gsub("^Callahan_16S_2pM-Cell1_", "", rownames(st1.pb)) # Remove prefix
rownames(st1.pb) <- gsub("[.]fastq[.]gz$", "", rownames(st1.pb)) # Remove suffix
rownames(st1.pb) <- gsub("_", "", rownames(st1.pb)) # e.g. R_3.1
rownames(st2.pb) <- gsub("_", "", rownames(st2.pb)) # e.g. R_3.1
if(!all(rownames(st) %in% rownames(st1.pb))) stop("Sample name mismatch (1).")
if(!all(rownames(st) %in% rownames(st2.pb))) stop("Sample name mismatch (2).")
st1.pb <- st1.pb[rownames(st),]
st2.pb <- st2.pb[rownames(st),]
```
In the PacBio protocol, a slightly different set of primers is used:
```{r}
FWD.pacb <- "AGRGTTYGATYMTGGCTCAG"
FWD.loop <- "AGAGTTTGATCMTGGC"
REV.pacb <- "RGYTACCTTGTTACGACTT"
REV.loop <- "TACCTTGTTACGACTT"
```
The result is that the Loop sequences have an additional 4 nucleotides at the start of the reads that are absent in the PacBio data. To remedy that, we will simply remove the first 4nts from all the Loop sequences:
```{r}
sq.loop.full <- getSequences(st)
sq.loop.trunc <- substr(sq.loop.full, 5, nchar(sq.loop.full))
any(duplicated(sq.loop.trunc)) # FALSE
if(mean(grepl("^TCAG", sq.loop.full)) < 0.95) stop("Trim issue, expected 4nt sequence not at start of these sequences.")
st.loop <- st; tax.loop <- tax
colnames(st.loop) <- sq.loop.trunc; rownames(tax.loop) <- sq.loop.trunc
```
As a first pass, looking at the overlap in ASVs and reads between Loop and PacBio full-length 16S data (replicate 2) on each sample:
```{r}
setASV <- function(unq1, unq2, nms) {
unq1 <- getUniques(unq1); unq1 <- unq1[unq1 > 0]
unq2 <- getUniques(unq2); unq2 <- unq2[unq2 > 0]
rval <- c(sum(!names(unq1) %in% names(unq2)),
sum(names(unq1) %in% names(unq2)),
sum(!names(unq2) %in% names(unq1)))
names(rval) <- c(nms[[1]], "Shared", nms[[2]])
rval
}
t(sapply(sam.names, function(nm) setASV(st.loop[nm,], st1.pb[nm,], nms=c("Loop", "PB1"))))
t(sapply(sam.names, function(nm) setASV(st.loop[nm,], st2.pb[nm,], nms=c("Loop", "PB2"))))
t(sapply(sam.names, function(nm) setASV(st1.pb[nm,], st2.pb[nm,], nms=c("PB1", "PB2"))))
```
Half or more of the denoised ASVs are shared across the methods. How about on a per-read basis?
```{r}
setReads <- function(unq1, unq2, nms) {
unq1 <- getUniques(unq1); unq1 <- unq1[unq1 > 0]
unq2 <- getUniques(unq2); unq2 <- unq2[unq2 > 0]
rval <- c(sum(unq1[!names(unq1) %in% names(unq2)]),
sum(unq1[names(unq1) %in% names(unq2)]),
sum(unq2[names(unq2) %in% names(unq1)]),
sum(unq2[!names(unq2) %in% names(unq1)]))
names(rval) <- c(nms[[1]], paste0("Shared", nms[[1]]), paste0("Shared", nms[[2]]), nms[[2]])
rval
}
totReads <- function(st1, st2) { as.matrix(cbind(rowSums(st1), rowSums(st1), rowSums(st2), rowSums(st2))) }
tab.loop.pb1 <- t(sapply(sam.names, function(nm) setReads(st.loop[nm,], st1.pb[nm,], nms=c("Loop", "PB1"))))
tab.loop.pb1
tab.loop.pb1/totReads(st.loop, st1.pb)
tab.loop.pb2 <- t(sapply(sam.names, function(nm) setReads(st.loop[nm,], st2.pb[nm,], nms=c("Loop", "PB2"))))
tab.loop.pb2
tab.loop.pb2/totReads(st.loop, st2.pb)
tab.pb1.pb2 <- t(sapply(sam.names, function(nm) setReads(st1.pb[nm,], st2.pb[nm,], nms=c("PB1", "PB2"))))
tab.pb1.pb2
tab.pb1.pb2/totReads(st1.pb, st2.pb)
```
A very high fraction of reads (>85\%) occur in ASVs identified consistently by both methods. How do things look at the whole community level, when we ordinate these three samples?
Combine the sequence tables together:
```{r}
st.loop.lab <- st.loop
rownames(st.loop.lab) <- gsub("$", ".loop", rownames(st.loop.lab))
st1.pb.lab <- st1.pb
rownames(st1.pb.lab) <- gsub("$", ".pb1", rownames(st1.pb.lab))
st2.pb.lab <- st2.pb
rownames(st2.pb.lab) <- gsub("$", ".pb2", rownames(st2.pb.lab))
st.both.lab <- mergeSequenceTables(st.loop.lab, st1.pb.lab, st2.pb.lab)
ft.both.lab <- sweep(st.both.lab, 1, rowSums(st.both.lab), "/")
```
Perform ordination:
```{r}
bc <- vegdist(ft.both.lab, "bray")
mds <- pcoa(bc)
df <- data.frame(mds$vectors)
df$Sample <- substr(rownames(df), 1, 4)
df$Technology <- sapply(strsplit(rownames(df), "[.]"), `[`, 3)
gg_color_hue <- function(n) {
hues = seq(15, 375, length = n + 1)
hcl(h = hues, l = 65, c = 100)[1:n]
}
clr3 <- gg_color_hue(3)
df$Technology <- c(loop="LoopSeq", pb1="PacBio CCS\n(S/P2-C2/5.0)\n", pb2="PacBio CCS\n(S/P3-C3/5.0)")[df$Technology]
tech.color.scale <- c("LoopSeq"=clr3[[1]], "PacBio CCS\n(S/P2-C2/5.0)\n"=clr3[[2]],
"PacBio CCS\n(S/P3-C3/5.0)"="darkgreen")
ppb <- ggplot(data=df, aes(x=Axis.1, y=Axis.2, label=Sample, color=Technology)) + theme_bw() +
scale_color_manual(values=tech.color.scale)
ppb + geom_text(alpha=0.8)
labdf <- df[df$Technology=="LoopSeq",]
labdf$Axis.1 <- 0.8*labdf$Axis.1 # pull labels towards middle of plot
labdf$Axis.2 <- 0.8*labdf$Axis.2
ppb + geom_point(shape="x", size=6) + geom_label(data=labdf, color="black")
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord.pdf"),
ppb + geom_point(shape="x", size=6) + geom_label(data=labdf, color="black"),
width=5, height=3, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord.png"),
ppb + geom_point(shape="x", size=6) + geom_label(data=labdf, color="black"),
width=5, height=3, units="in")
```
They fall almost exactly on top of each other. Worth calling out that two of these samples were longitudinal samples from the same person too, so inter-subject and within-subject temporal variability is represented here, as well as the technical replication amongst the same technology (PacBio).
Comparing the measurement dissimilarities between Loop and PacBio and between the PacBio technical replicates:
```{r}
as.matrix(bc)[df$Sample=="R3.1", df$Sample=="R3.1"]
as.matrix(bc)[df$Sample=="R9.3", df$Sample=="R9.3"]
as.matrix(bc)[df$Sample=="R9.4", df$Sample=="R9.4"]
```
In Sample R3.1 the PacBio technical replicates are highly similar, but in the other two samples the Loop measurement is about as similar to each PacBio technical replicate as the PacBio technical replicates are to each other!
Also testing Jaccard distance to ensure this pattern is not entirely driven by high abundance ASVs.
```{r}
jc <- vegdist(ft.both.lab, "jaccard")
mds.jc <- pcoa(jc)
df.jc <- df # Just replace the vectors with the new jaccard derived numbers
df.jc[,paste0("Axis.", seq(8))] <- data.frame(mds.jc$vectors)
ppj <- ggplot(data=df.jc, aes(x=Axis.1, y=Axis.2, label=Sample, color=Technology)) + theme_bw() +
scale_color_manual(values=tech.color.scale)
ppj + geom_text(alpha=0.8)
labdf.jc <- df.jc[df.jc$Technology=="LoopSeq",]
labdf.jc$Axis.1 <- 0.8*labdf.jc$Axis.1 # pull labels towards middle of plot
labdf.jc$Axis.2 <- 0.8*labdf.jc$Axis.2
ppj + geom_point(shape="x", size=6) + geom_label(data=labdf.jc, color="black")
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord_jaccard.pdf"),
ppj + geom_point(shape="x", size=6) + geom_label(data=labdf.jc, color="black"),
width=5, height=3, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord_jaccard.png"),
ppj + geom_point(shape="x", size=6) + geom_label(data=labdf.jc, color="black"),
width=5, height=3, units="in")
```
Exact same qualitative patterns with Jaccard distance as well.
## Demonstration of LoopSeq Accuracy using 16S conservation patterns
Natural samples re closer in complexity and composition thatn are mock communities to the types of samples in which people want to employ technologies like LoopSeq full-length. However, unlike in mock communities, we do not know the true compposition of natural samples, and thus accuracy is more difficult to determine in natural samples. Here we will attempt to evaluate the accuracy of LoopSeq FL 16S in natural human fecal samples by leveraging the variable levels at which different sections of the 16S rRNA gene are evolutionarily conserved. In short, we expect the differences between legitimate biological variants to preferentially be found in the variable regions of the 16S rRNA gene, while artefactual variation introduced by technical error processes will not.
We'll use [the external program ssu-align(http://eddylab.org/software/ssu-align/) for this purpose, which uses Infernal to align potentially large collectiosn of SSU rRNA sequences.
First we'll run the program on an example sequence, to show its output:
```{r}
ssu.exe <- "~/Desktop/bin/ssu-align-0.1.1/src/ssu-align" # CHANGE ME...
ssu.dir <- "/usr/local/share/ssu-align-0.1.1"
export.str <- paste0('export SSUALIGNDIR="', ssu.dir, '";') # Prepend to all ssu-align calls
# i <- 2 has just 1 sub, for testing
d1 <- dd[[1]]
i <- 2
bs <- d1$birth_subs[d1$birth_subs$clust==i,]
sq <- d1$sequence[d1$clustering$birth_from[i]] # Positions are in birth (ref) seq
if(!dir.exists("temp")) dir.create("temp")
tf <- "temp/tempfile.fa"
writeFasta(sq, tf)
td <- "temp/tmp"
system(paste(export.str, ssu.exe, '-f', tf, td))
```
At the end of the program, the key file we are interested in is the bacterial alignment, which is the alignment of the input sequences against the bacterial SSU RNA model. Inspecting the formate of that file:
```{r}
print(substr(readLines("temp/tmp/tmp.bacteria.stk"), 1, 80))
```
There are four lines of output. The first is the aligned sequence itself, converted to the RNA ACGU code. The second line gives "posterior probabilities" of the confidence of the alignment at that position. The third gives the secondary structure consensus at that position, i.e. is there expected to be a basepairing (indicated by various brackets) or not (gaps). Finally, the fourth line is what we are interested in, this shows the level of conservation of that nucleotide position in the model, with well-conserved positions in uppercase, and less-conserved positions in lower-case.
So, we can use this file to determine the fraction of differences between the second dada-denoised ASV and the first that occurred in conserved positions.
```{r}
bps <- c("A", "C", "G", "U", "a", "c", "g", "u")
caps <- c("A", "C", "G", "U")
ln <- readLines("temp/tmp/tmp.bacteria.stk")
sqa <- strsplit(ln[[4]], "")[[1]]
refcons <- strsplit(ln[[7]], "")[[1]]
keep <- sqa %in% bps
sqa <- sqa[keep]
if(!gsub("U", "T", toupper(paste0(sqa, collapse=""))) == sq) { # TRUE
stop("Aligned sequence not matching input sequence!")
}
sqcons <- refcons[keep] %in% caps
bs$cons <- sqcons[bs$pos]
table(sqcons)/sum(table(sqcons))
table(bs$cons)/sum(table(bs$cons))
```
So in this example, there are a much higher fraction of subsitutions in non-conserved read positions in the differences between the top two ASVs than would be expected by chance!
Functionalizing this operation, and then extending it to a larger set of denoised ASVs:
```{r}
get_subcons <- function(i, dd, tf = "temp/tempfile.fa", td = "temp/tmp", ssu=ssu.exe, export=export.str, return.all.pos=FALSE) {
bs <- dd$birth_subs[dd$birth_subs$clust==i,]
i.from <- dd$clustering$birth_from[[i]] # Sub positions are in ASV that this one was born from
bs$from <- i.from
sq <- dd$sequence[[i.from]]
writeLines(c(paste0(">Sq", i.from), sq), tf)
system(paste(export, ssu, '-f', tf, td), ignore.stdout=TRUE)
bps <- c("A", "C", "G", "U", "a", "c", "g", "u")
caps <- c("A", "C", "G", "U")
alfile <- file.path(td, paste0(basename(td), ".bacteria.stk"))
if(file.exists(alfile)) {
ln <- readLines(file.path(td, "tmp.bacteria.stk"))
sqa <- strsplit(ln[[4]], "")[[1]]
refcons <- strsplit(ln[[7]], "")[[1]]
keep <- sqa %in% bps
sqa <- sqa[keep]
sqa.tr <- gsub("U", "T", toupper(paste0(sqa, collapse="")))
PAD <- 0
if(!sqa.tr == sq) {
### cat("\n", sq, sqa.tr, sep="\n")
if(sqa.tr == substr(sq,1,nchar(sqa.tr))) {
cat("Aligned sequence is missing some trailing nucleotides:", nchar(sq)-nchar(sqa.tr), "\n")
} else if(sqa.tr == substr(sq,1+nchar(sq)-nchar(sqa.tr),nchar(sq))) {
PAD <- nchar(sq)-nchar(sqa.tr)
cat("Aligned sequence is missing some beginning nucleotides:", PAD, "\n")
} else {
stop("Input sequence didn't match aligned sequence.")
}
}
sqcons <- refcons[keep] %in% caps
if(PAD > 0) { # NEED to fix for when aligned sequence starts later...
sqcons <- c(rep(NA, PAD), sqcons)
}
bs$cons <- sqcons[bs$pos]
} else { # Didn't match database, give NAs to all
bs$cons <- NA
}
if(return.all.pos) { # Return vector of conservation at all positions, not those associated with the birth substitutions
return(sqcons)
}
else { return(bs) } # Return birth_subs data.frame, now with conservation state of each position added in the $cons column
}
nseqs <- function(x) length(getSequences(x))
```
Testing the function:
```{r}
rbind(get_subcons(3, d1), get_subcons(4, d1), get_subcons(5, d1))
```
Output is matching the expected format, this is a `data.frame` with information on the position, substitution type and quality score associated with various dada-denoised ASVs. Interestingly, all of the substitutions between these denoised ASVs and the more abundant ASVs from which they were disciminated occurrsed in non-conserved base positions! Now to see if that pattern is something that holds up over the dataset.
First, running `dada` with a very aggressive sensitivity setting. Our goal here is to use `dada` to order ASVs by the diagnostic likelihood that they are true (under the DADA2 error model). We will then evaluate the conservation patterns of those ordered ASVs. First, to generate them:
```{r}
dd2 <- dada(filt, err, OMEGA_A=1e-2, DETECT_SINGLETONS=TRUE, multi=TRUE, verbose=0)
dd2
```
Now let's run our identication of conservation patterns on all these ASVs:
```{r}
cons.rds <- file.path(path.rds, "cons_16S_Fecal.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(cons.rds)) {
system.time(cons <- lapply(names(dd2), function(nm) {
di <- dd2[[nm]]
lapply(seq(2, nseqs(di)), get_subcons, dd=di)
}))
names(cons) <- names(dd2)
saveRDS(cons, cons.rds)
system('rm -Rf temp/tmp')
system('rm temp/*')
}
```
```{r}
cons <- readRDS(cons.rds)
if(!all(sapply(cons, length) == sapply(dd2, function(x) nseqs(x)-1))) stop("Number of cons ASVs mismatch.")
```
Now to curate the SSU conservation data into a data.frame:
```{r}
clustdfs <- lapply(names(dd2), function(nm) {
di <- dd2[[nm]]; coni <- cons[[nm]]
rdf <- di$clustering[-1,] # Dropping first ASV for simplicity
rdf$cons <- sapply(coni, function(x) sum(x$cons, na.rm=TRUE))
# NAs occur when only part of the ASVs aligns to the ssu-align RNA model
rdf$vars <- sapply(coni, function(x) sum(!x$cons, na.rm=TRUE))
rdf$nsub <- sapply(coni, function(x) nrow(x))
rdf$ncat <- rdf$cons + rdf$vars # The number of substitutions categorized as conserved/variable
rdf$var.frac <- rdf$vars/rdf$ncat
rdf$index <- seq(2,nrow(rdf)+1)
rdf$sample <- nm
rdf
})
names(clustdfs) <- names(dd2)
```
For comparison, we also want to know the average fraction of positions that are conserved if chosen at random (rather than at the positions of birth substutitions). To calculate that:
```{r}
ii <- sample(nseqs(d1), 20)
cons.all <- lapply(ii, get_subcons, dd=d1, return.all.pos=TRUE)
sapply(cons.all, mean)
```
Yep. Quite consistent, a bit over 80\% of the substitutions are in conserved positions.
Now some exploration of the distribution of the ratio of substitutions in conserved and variable regions, as a function of the order of the ASVs (which reflects decreasing confidence by DADA2 they are not seqeuencing error) and by the abundance of the denoised ASVs:
```{r}
# Use the median over our random sample as a plugin estimate for the fraction of conserved sites in each sequence
raw.cons.frac <- median(sapply(cons.all, mean))
raw.var.frac <- 1-raw.cons.frac
clustdf <- do.call(rbind, clustdfs)
clustdf$nnt <- nchar(clustdf$sequence)
clustdf$frac.sub.cons <- clustdf$cons/(clustdf$nnt*raw.cons.frac)
clustdf$frac.sub.var <- clustdf$vars/(clustdf$nnt*raw.var.frac)
clustdf$logp <- log10(clustdf$birth_pval)
clustdf$rlogp <- log10(clustdf$birth_pval + min(clustdf$birth_pval[clustdf$birth_pval>0]))
clustdf$sample <- paste("Sample", sapply(strsplit(clustdf$sample, "_"), `[`, 1))
clustdf$abundance <- as.numeric(clustdf$abundance)
theme_set(theme_bw())
hist(clustdf$ncat, 100)
p <- ggplot(data=clustdf, aes(x=frac.sub.cons, y=frac.sub.var)) +
xlab("Substitution Rate at Conserved Positions") + ylab("Substitution Rate at\nVariable Positions") +
geom_abline(col="grey80", intercept=0, slope=1) + theme(aspect.ratio=1)
sz.scale <- scale_size_continuous(breaks=c("1"=1, "10"=10, "100"=100, "1000"=1000), name="ASV abundance")
p + aes(size=abundance) + geom_point(alpha=0.2) + sz.scale
p + aes(size=abundance) + geom_point(alpha=0.2) + sz.scale + scale_x_log10() + scale_y_log10()
p4 <- p + aes(size=abundance) + geom_point(alpha=0.2) + facet_wrap(~sample) + sz.scale
p4
ggsave(file.path(path.fig, "Fecal_VarVsCons.pdf"), p4, width=7, heigh=3, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Fecal_VarVsCons.png"), p4, width=7, heigh=3, units="in")
#require(hexbin)
#p + geom_hex()
#p + geom_bin2d()
```
There are two modes here. The first mode contains ASVs with very large numbers of substitution differences (>100) from the ASV from which they were discriminated. In this large-difference mode, the number of substitutiosn at conserved positions exceed the number of substitutions at variable positions by about a factor of 1.5x. The second mode contains ASVs with small to moderate numbers of substitution differences (<100, mostly <10) from the ASV from which they were disciminated. In this small-difference mode, the number of substitutions at conserved positions seems to be slightly less than at variable positions. Remember that ~80\% of the positions are conserved (red-dashed line) so this is strongly skewed against what would be expected by chance.
Does this enrichment of substitutions at variable positions hold for even the most difficult to identify ASVs, singletons and single-substitution variants?
```{r}
is.var.1diff <- as.logical(clustdf$vars[clustdf$ncat==1])
table(is.var.1diff)
```
```{r}
frac.var.1abund <- clustdf$var.frac[clustdf$abundance==1]
hist(frac.var.1abund, n=100)
```
Yep, for both categories the number of substitutions at variable positions is about equal to those at conserved positions.
Our goals here are to determine the level to which biological differences can be discriminated from Loop sequencing errors in real complex data from these human fecal samples. Thus, we are primarily interested in that small-substitution part of the data -- it has already been established that the per-nucleotide error rate in LoopSeq data is extremely low and is not introducing hundreds of substitutions. So, focusing in on that part of the data, is there a trend towards decreasing accuracy (i.e. a lower fraction of substutions at variable positions) as we push the sensitivity of ASV inference (i.e. look at lower-certainty ASVs)?
```{r}
p1 <- ggplot(data=clustdf[clustdf$ncat==1,]) + aes(x=index, y=var.frac, size=abundance, weight=abundance) + sz.scale
p1 + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
p2 <- ggplot(data=clustdf[clustdf$ncat==2,]) + aes(x=index, y=var.frac, size=abundance, weight=abundance) + sz.scale
p2 + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
pdef <- ggplot(data=clustdf[clustdf$abundance>2 & clustdf$birth_pval < 1e-40,]) + aes(x=index, y=var.frac, size=abundance, weight=abundance) + sz.scale
pdef + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$ncat>10,], aes(x=index, y=var.frac, size=abundance, weight=abundance)) + geom_point(alpha=0.4) + geom_smooth() + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$ncat<10,], aes(x=index, y=var.frac, size=abundance, weight=abundance)) + geom_point(alpha=0.4) + geom_smooth() + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$abundance > 1 & clustdf$ncat>5 & clustdf$ncat<80,], aes(x=index, y=var.frac, size=abundance, weight=ncat)) + geom_point() + geom_smooth() + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$abundance == 1 & clustdf$ncat<80,], aes(x=index, y=var.frac, size=abundance)) + geom_point(alpha=0.4) + geom_smooth() + facet_wrap(~sample)
```
```{r}
ggplot(data=clustdf[clustdf$ncat<2,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$ncat<3,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$ncat<4,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$ncat<5,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
```
There is some dropoff in the fraction of substitutions at variable positions as we get to the later ASVs identified using these very high sensitivity parameters, but not that much really.