---
title: "DADA2 Tutorial: BIOME Worshop"
output: slidy_presentation
---
## Processing marker-gene data with...
## The DADA2 Workflow
1. Preprocessing
2. Filter and Trim
3. Learn Error Rates
4. Denoise/Sample Inference
5. Merge (if paired-end)
6. Remove Chimeras
7. Assign Taxonomy
Throughout: Sanity checks!
## Preprocessing, Filtering and Trimming
## Preprocessing
**This workflow assumes that your sequencing data meets certain criteria:**
- Samples have been demultiplexed, i.e. split into individual per-sample fastq files.
- Non-biological nucleotides have been removed, e.g. primers, adapters, linkers, etc.
- If paired-end sequencing data, the forward and reverse fastq files contain reads in matched order.
See [the DADA2 FAQ](https://benjjneb.github.io/dada2/faq.html) for tips on dealing with non-demultiplexed files and on primer removal.
## Load package and set path
Load the `dada2` package. If you don't already have it installed, see the [dada2 installation instructions](dada-installation.html):
```{r libraries, message=FALSE, warning=FALSE}
library(dada2); packageVersion("dada2")
```
Set the path to the fastq files:
```{r path}
path <- "data/fastqs"
head(list.files(path))
```
## Forward, Reverse, Sample Names
Get matched lists of the forward and reverse fastq.gz files:
```{r filenames}
# Forward and reverse fastq filenames have format: SAMPLENAME_R1.fastq.gz and SAMPLENAME_R2.fastq.gz
fnFs <- sort(list.files(path, pattern="_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2.fastq.gz", full.names = TRUE))
fnFs[[1]]; fnRs[[1]]
```
Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq.gz
```{r sample.names}
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
head(sample.names)
```
## Check the amplicon design
We are using the 515F/806R primer set. The primers are not sequenced. The sequencing technology is 2x250 paired end Illumina.
**What does this mean for later? Artifacts? Trimming?**
## What is your amplicon design?
COI? trnL? amoA? ITS1? ITS2? cpn60? 18S? ...
16S: v1v2? v1v3? v4? v3v4? v4v5? ...
How long is it? Length variation?
Did you sequence your primers? Are you sure?
## Inspect forward read quality profiles
```{r plotqF}
plotQualityProfile(fnFs[c(1,11)])
```
**Where to truncate?**
## Inspect reverse read quality profiles
```{r plotqR}
plotQualityProfile(fnRs[c(2,12)])
```
**Where to truncate?**
## Filter and trim
Assign filenames for the filtered fastq.gz files in the filtered/ subdirectory.
```{r filt-names}
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
```
The critical parameters we chose are the truncation lengths of **240** (forward) and **160** (reverse). *Why did we choose these values?*
```{r filter, message=FALSE, warning=FALSE}
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen=c(240,160), maxEE=c(2,2), # maxEE=2 is the default
                     compress=TRUE, multithread=TRUE) # Set multithread=TRUE to use all cores
```
In most cases, the key quality filtering parameter is `maxEE`, which sets the maximum number of expected errors allowed in each read. This has been shown to be a better quality filter than an average quality score filter.
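To make `maxEE` concrete: the expected errors of a read are the sum of its per-base error probabilities, p = 10^(-Q/10). A minimal sketch of that arithmetic (the quality vector is made up for illustration):
```{r, eval=FALSE}
# Expected errors = sum of per-base error probabilities, p = 10^(-Q/10)
q <- c(38, 38, 35, 30, 25, 20, 15, 12)  # hypothetical quality scores for one read
sum(10^(-q/10))  # ~0.11 expected errors; the read passes maxEE=2 if this is <= 2
```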
## Quality filtering options
- `maxEE`: Maximum expected errors, usually the only quality filter needed.
- `truncQ`: Truncate at first occurrence of this quality score.
- `maxLen`: Remove sequences greater than this length (mostly for pyrosequencing).
- `minLen`: Remove sequences less than this length.
- `maxN`: Remove sequences with more than this many Ns. `dada2` requires no Ns, so `maxN=0` by default.
- `minQ`: Remove reads with a quality score less than this value.
Usually `maxEE` is enough, but for non-Illumina sequencing technologies, or less standard setups, the other options can be useful as well. Remember that help is your friend! `?filterAndTrim`
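As one hedged illustration of how these might combine (the values are illustrative, not recommendations), a length-variable non-Illumina run could filter single-end reads like so:
```{r, eval=FALSE}
# Illustrative single-end filter combining length bounds with maxEE
out2 <- filterAndTrim(fnFs, filtFs, maxEE=2, truncQ=2,
                      minLen=50, maxLen=500,
                      compress=TRUE, multithread=TRUE)
```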
## SANITY CHECK: Filtering Stats
```{r filter-stats}
head(out)
```
- What fraction of reads were kept?
- Was that fraction reasonably consistent among samples?
- Were enough reads kept to achieve your analysis goals?
**The truncation lengths are the most likely parameters you might want to revisit.**
Basic strategy: truncate away the quality crashes at the ends of reads, while preserving at least 12nts of overlap plus room for biological length variation.
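For this dataset the arithmetic is comfortable; a quick back-of-the-envelope check (the sequenced amplicon length used here is approximate):
```{r, eval=FALSE}
# Sequenced V4 amplicon here is ~250-256nt (primers not sequenced)
sum(c(240, 160)) - 256  # ~144nt of overlap remains, well above 12nt + length variation
```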
## Primer removal
For common primer designs, in which a primer of fixed length is at the start of the forward (and reverse) reads, primers can be removed by dada2 in the `filterAndTrim` step.
```
# Single-end reads
filterAndTrim(..., trimLeft=FWD_PRIMER_LENGTH)
# Paired-end reads
filterAndTrim(..., trimLeft=c(FWD_PRIMER_LENGTH, REV_PRIMER_LENGTH))
```
However! There are other scenarios that this won't handle, in particular when the amplicon length is so variable that reads sometimes read through into the other primer at the end.

In that case, you will need to use an outside program to remove primers prior to running the dada2 workflow. If you are in that scenario, please see the [DADA2 ITS workflow](https://benjjneb.github.io/dada2/ITS_workflow.html).
## Exercise: Pick truncation and trimming values
**Sequenced** amplicon length: 400-420nts. Primers are sequenced.

> - `trimLeft=c(17, 21)`
> - `truncLen=c(245, 195)`
## Exercise: Pick truncation and trimming values
**Sequenced** amplicon length: 220-320nts. Primers are not sequenced.

> - `trimLeft=0`
> - `truncLen=c(210, 160)`
## Exercise: Pick truncation and trimming values
**Sequenced** amplicon length: 250-260nts. Primers are sequenced.

> - `trimLeft=c(14, 17)`
> - `truncLen=c(220, 140)`
## Learn error rates and Denoise
## Learn the Error Rates
```{r learn-errors}
errF <- learnErrors(filtFs, multithread=TRUE) # Set multithread=TRUE to use all cores
errR <- learnErrors(filtRs, multithread=TRUE)
```
The DADA2 algorithm makes use of a parametric error model (`err`) and every amplicon dataset has a different set of error rates. The `learnErrors` method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution.
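If you want to inspect the learned model directly, `getErrors` extracts the matrix of estimated error rates (rows are the 16 nucleotide transitions, columns are quality scores); a quick look:
```{r, eval=FALSE}
dim(getErrors(errF))       # 16 transition types x quality-score bins
getErrors(errF)[1:3, 1:5]  # a corner of the error-rate matrix
```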
## SANITY CHECK: Error Rates
```{r plot-errors, warning=FALSE}
plotErrors(errF, nominalQ=TRUE)
```
- Does the model (black line) reasonably fit the observations (black points)?
- Do the error rates mostly decrease with quality score?
The goal here is good, not perfect, so don't sweat the small stuff (or non-convergence).
## Dereplicate
Dereplication combines all identical sequencing reads into "unique sequences" with a corresponding "abundance" equal to the number of reads with that unique sequence.
```{r dereplicate, message=FALSE}
derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)
# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names
```
**Big Data**: The tutorial dataset is small enough to easily load into memory. If your dataset exceeds available RAM, it is preferable to process samples one-by-one in a streaming fashion: see the [DADA2 Workflow on Big Data](bigdata.html) for an example.
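A minimal sketch of that streaming pattern, assuming the filtered files are named by sample (see the Big Data workflow for the full, tested version):
```{r, eval=FALSE}
# Process one sample at a time to keep memory use flat
names(filtFs) <- sample.names
ddF <- vector("list", length(sample.names)); names(ddF) <- sample.names
for(sam in sample.names) {
  derep <- derepFastq(filtFs[[sam]])
  ddF[[sam]] <- dada(derep, err=errF, multithread=TRUE)
}
```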
## Sample Inference
We are now ready to apply [the core sample inference algorithm](https://www.nature.com/articles/nmeth.3869#methods) to the dereplicated data.
```{r dada}
dadaFs <- dada(derepFs, err=errF, multithread=TRUE) # Set multithread=TRUE to use all cores
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
```
## Inspect the `dada-class` object
```{r see-dada}
dadaFs[[1]]
```
The `getSequences` and `getUniques` functions work on just about any dada2-created object. `getUniques` returns an integer vector, named by the sequences and valued by their abundances. `getSequences` just returns the sequences.
```{r}
head(getSequences(dadaFs[[1]]))
```
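And for comparison, `getUniques` on the same object returns the abundance-valued, sequence-named integer vector:
```{r, eval=FALSE}
head(getUniques(dadaFs[[1]]))  # integer abundances, named by the denoised sequences
```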
## DADA2 Options: Multithreading
All computation-intensive functions in the dada2 R package have optional multithreading via the `multithread` argument.
- `multithread = FALSE`: No multithreading. The default.
- `multithread = TRUE`: Detect the number of available threads, and use that many. The fastest.
- `multithread = N`: Use N threads. A way to be a good citizen and leave capacity for other tasks.
Usually you will want to turn multithreading on!
## DADA2 Options: Pooling
Pooling can [increase sensitivity to rare per-sample variants](https://benjjneb.github.io/dada2/pool.html#pooling-for-sample-inference). `dada(..., pool=TRUE)`
The cost of pooling is increased memory and computation time. Pooled sample inference scales quadratically in the number of samples, while the default independent sample inference scales linearly.
Pseudo-pooling [approximates pooling in linear time](https://benjjneb.github.io/dada2/pseudo.html#pseudo-pooling). `dada(..., pool="pseudo")`

## DADA2 Options: Non-Illumina sequencing technologies
For pyrosequencing data (e.g. 454 or Ion Torrent) we recommend a slight change in the alignment parameters to better handle those technologies' tendency to make homopolymer errors.
```{r pyro, eval=FALSE}
foo <- dada(..., HOMOPOLYMER_GAP_PENALTY=-1, BAND_SIZE=32)
```
Recent versions of dada2 also support PacBio CCS amplicon reads, which can capture entire barcode genes with single-nucleotide resolution. See [our preprint](https://doi.org/10.1101/392332) for evaluation, and [the reproducible workflows from the preprint](https://github.com/benjjneb/LRASManuscript) for information on appropriate PacBio processing options.
## DADA2 Options: Sensitivity and Heuristics
Sensitivity options
- `OMEGA_A`: The key sensitivity parameter; controls the p-value threshold at which to call new ASVs.
- `OMEGA_C`: The error-correction threshold. One alternative is to turn off error-correction.
- `MIN_ABUNDANCE`: Sets a minimum abundance threshold to call new ASVs.
- `MIN_FOLD`: Minimum fold overabundance relative to error model for new ASVs.
**See also the `priors` argument to raise sensitivity (at no cost to specificity) for sequences you expect might be present; a sketch follows at the end of this slide.**
Heuristics
- `BAND_SIZE`: Sets the width of the banded alignment.
- `USE_KMERS`: Avoids most pairwise alignments by using a kmer-distance screen beforehand.
- `GREEDY`: A very modest form of greediness: if a sequence is at lower abundance than expected from errors, lock it.
*The adventurous can see `?setDadaOpt` for more algorithmic parameters.*
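A hedged sketch of the two sensitivity levers mentioned above (the threshold and the prior sequence below are purely illustrative):
```{r, eval=FALSE}
# Relax the new-ASV p-value threshold globally (illustrative value, not a recommendation)
setDadaOpt(OMEGA_A = 1e-20)
# Or raise sensitivity only for sequences you expect a priori (hypothetical example sequence)
expected <- c("TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGG")
dadaFs.prior <- dada(derepFs, err=errF, priors=expected, multithread=TRUE)
```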
## Merge, Table, Remove Chimeras, Sanity Check
## Merge Paired Reads
```{r merge, message=FALSE}
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
```
**Most reads should pass the merging step! If that isn't the case, are you sure your truncated reads still overlap sufficiently?**
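A hedged sketch of the merge diagnostics and knobs (`minOverlap` and `maxMismatch` are shown at their defaults):
```{r, eval=FALSE}
# Per-sample merge diagnostics: nmatch, nmismatch, and whether each pair was accepted
head(mergers[[1]])
# The defaults require a 12nt overlap with 0 mismatches; relax only if justified
mergers2 <- mergePairs(dadaFs, derepFs, dadaRs, derepRs,
                       minOverlap=12, maxMismatch=0, verbose=TRUE)
```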
## Merging options
*(Figure: the three read-merging scenarios: (a) the reads overlap; (b) the reads read through into an overhang past their partner's start; (c) the reads do not overlap.)*
- If (a): Use normally.
- If (b or a+b): `mergePairs(..., trimOverhang=TRUE)` *(but you probably should have trimmed away the overhang earlier, see ITS workflow)*
- If (c): `mergePairs(..., justConcatenate=TRUE)`.
- If (a+c or a+b+c): Trouble.
## Construct Sequence Table (ASV Table)
```{r seqtab}
seqtab <- makeSequenceTable(mergers)
```
The sequence table is a `matrix` with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.
```{r dim}
dim(seqtab)
```
```{r seqlens}
table(nchar(getSequences(seqtab)))
```
The lengths of the merged sequences all fall in the expected range for this amplicon.
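Had sequences far outside the expected range shown up (often a sign of non-specific priming), one could screen them out by length; a hedged sketch with an illustrative window:
```{r, eval=FALSE}
# Keep only ASVs within the expected length window for this amplicon (window illustrative)
seqtab2 <- seqtab[, nchar(colnames(seqtab)) %in% 250:256]
```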
## Remove chimeras
Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant "parent" sequences.
```{r chimeras, message=FALSE}
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
# Set multithread=TRUE to use all cores
sum(seqtab.nochim)/sum(seqtab)
```
**In some cases, most sequences will be chimeric. But most reads should not be. If they are, you probably have unremoved primers.**
If you used `pool=TRUE` during sample inference, you should use `method="pooled"` for chimera removal.
## Track reads through the pipeline
Look at the number of reads that made it through each step in the pipeline:
```{r track}
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
```
Looks good! We kept the majority of our raw reads, and there is no over-large drop associated with any single step.
## SANITY CHECK: Read Tracking
```{r track2}
head(track)
```
- If a majority of reads failed to merge, you may need to revisit `truncLen` to ensure overlap.
- If a majority of reads were removed as chimeric, you may have unremoved primers.
**This is the single most important place to inspect your workflow to make sure everything went as expected!**
## Assign Taxonomy
The `assignTaxonomy` function takes as input a set of sequences to be classified and a training set of reference sequences with known taxonomy, and outputs taxonomic assignments with at least `minBoot` bootstrap confidence.
```{r taxify}
taxa <- assignTaxonomy(seqtab.nochim, "~/tax/rdp_train_set_16.fa.gz", multithread=TRUE)
```
**I recommend [the Silva database](https://benjjneb.github.io/dada2/training.html) for 16S data. We are using the RDP database here to keep file sizes down.**
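If you want stricter assignments, the `minBoot` argument (default 50) raises the required bootstrap confidence; a hedged sketch:
```{r, eval=FALSE}
# Require 80% bootstrap confidence before accepting an assignment at each rank
taxa.strict <- assignTaxonomy(seqtab.nochim, "~/tax/rdp_train_set_16.fa.gz",
                              minBoot=80, multithread=TRUE)
```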
## Taxonomic assignment methods
The dada2 `assignTaxonomy` function is a reimplementation of the naive Bayesian classifier developed as part of the RDP project. It shreds reads into kmers, matches them against a reference database, and assigns a taxonomy when the classification is consistent across subsets of the shredded reads.
This method has held up well over the years, but additional options are now available. For classification based on exact matching, consider `assignSpecies`. For general purpose classification with reported higher accuracy, consider the recently published `IDTaxa` method in the DECIPHER package. You can see how to use `IDTaxa` in [the DADA2 tutorial](https://benjjneb.github.io/dada2/tutorial.html#assign-taxonomy).
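As a hedged sketch of the exact-matching route, `addSpecies` augments an `assignTaxonomy` table with species-level exact matches (the reference filename is a placeholder for whichever species-level training fasta you downloaded):
```{r, eval=FALSE}
# Add species-level assignments by exact matching against a species reference fasta
taxa.sp <- addSpecies(taxa, "~/tax/silva_species_assignment.fa.gz")
```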
## Taxonomic assignment databases
Having a good reference database is usually **much more important** than the difference between the good taxonomic assignment methods.
What is the reference database for your metabarcoding locus? Is it comprehensive? Appropriate for the environments you are sampling? Do you need to augment or construct your own?
## SANITY CHECK: Taxonomic Assignments
```{r tax-look}
head(unname(taxa))
```
**Do the taxonomies assigned to the top ASVs make sense in the sampled environment?**
## Handoff to Phyloseq
```{r phyloseq}
library("phyloseq"); packageVersion("phyloseq")
```
Create a phyloseq object from the ASV table and taxonomy assigned by DADA2.
```{r make-ps}
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE),
               tax_table(taxa))
ps
```
Usually you'll want to add sample metadata at this point as well.
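A minimal sketch, assuming a hypothetical `metadata.csv` whose row names match the sample names:
```{r, eval=FALSE}
# Attach per-sample metadata to the phyloseq object (file and columns are hypothetical)
samdf <- read.csv("metadata.csv", row.names=1)
ps <- merge_phyloseq(ps, sample_data(samdf))
```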
```{r cleanup, echo=FALSE}
foo <- file.remove(list.files(file.path(path, "filtered"), full.names=TRUE))
```