LoopMS_16S_Fecal.Rmd

---
title: "Evaluationg LoopSeq 16S on Human Fecal Samples"
author: "BJC"
output: html_document
---

## Setup

Load libraries, set the working directory to the location of this Rmarkdown file (only necessary when running by hand), and read in filenames:
```{r, echo=FALSE}
library(dada2, quietly=TRUE); packageVersion("dada2") # Should be 1.15.4 or later
library(Biostrings, quietly=TRUE)
library(ShortRead, quietly=TRUE)
library(ggplot2, quietly=TRUE)
library(reshape2, quietly=TRUE)
library(vegan, quietly=TRUE)
library(ape, quietly=TRUE)
set.seed(100)
setwd("~/LoopManuscript") # CHANGE ME to the location of this file
path <- "~/LoopData/16S/CallahanFecal" # CHANGE ME to the location of the fastq files
path.fig <- "Figures" # Relative path to where figures should be saved
path.rds <- "RDS" # Relative path to where RDS should be saved
fn <- list.files(path, pattern=".fq$", full.names=TRUE)
sapply(fn, function(f) length(getSequences(f)))
```

## QA, Filtering and Trimming

First do basic QA on these data, and then apply the DADA2 filtering and trimming workflow with the parameters selected from the detailed inspection of the Zymo mock community.

```{r}
plotComplexity(fn)
```

```{r}
plotQualityProfile(fn)
```

These profiles look consistent with what we saw in the Zymo mock -- no low complexity issues, and high qualities throughout the 16S sequence (i.e. ignoring the >1500nt tail). The one difference, especially in the R3.1 is that the length distribution is different, around a half of the reads in that library appear not to extend to the full-length of the 16S gene. This may be a common way in which quality variation in Loop libraries becomes evidence, as molecules that are sequenced to insufficient coverage are likely to end up being assembled into incomplete contigs that don't cover the entire targeted amplicon.

Enter primers, and confirm their presence and the overall orientation of the reads. Code is adapated from [the DADA2 ITS tutorial workflow](https://benjjneb.github.io/dada2/ITS_workflow.html#identify-primers):
```{r}
FWD <- "AGAGTTTGATCMTGGC" # Loop 16S forward primer
REV <- "TACCTTGTTACGACTT" # Loop 16S reverse primer
allOrients <- function(primer) {
    # Create all orientations of the input sequence
    require(Biostrings)
    dna <- DNAString(primer)
    orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = Biostrings::reverse(dna), 
        RevComp = Biostrings::reverseComplement(dna))
    return(sapply(orients, toString))  # Convert back to character vector
}
primerHits <- function(primer, fn) {
    # Counts number of reads in which the primer is found
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}
for(f in fn) {
  cat("Primers detected in", basename(f), "\n")
  print(rbind(FWD.Primer = sapply(allOrients(FWD), primerHits, fn = f), 
              REV.Primer = sapply(allOrients(REV), primerHits, fn = f)))
  cat("Out of", length(readFastq(f)), "total reads.\n\n")
}
```

Primers in the expected orientations, and yes it looks like the R3.1 library will take a big hit from incomplete amplicon reconstruction, with the REV primer only being detected in <20\% of the total reads, whereas the other libraries are closer to 50\%. It is worth noting that all of these libraries have a much lower fraction of primer-complete reads than the Zymo mock community data had (~90\%).

Remove the primers (and any flanking sequence) from the reads, and filter out reads that don't contain both primers:
```{r}
nop <- file.path(path, "nop", basename(fn))
out <- removePrimers(fn, nop, FWD, rc(REV), verbose=TRUE)
```

Filter the sequences and enforce minimum/maximum lengths appropriate for full-length 16S. Note that we are enforcing `maxEE=1`, as that was determined to be a better filter than `maxEE=2` in the Zymo mock community data.
```{r}
filt <- file.path(path, "filtered", basename(fn))
track <- filterAndTrim(nop, filt, maxEE=1, minLen=1400, maxLen=1600, verbose=TRUE)
```

Final inspection of the quality profile:
```{r, message=FALSE}
plotQualityProfile(filt)
```

Final progress of reads through filtering and trimming:
```{r}
cbind(raw=out[,1], primers=out[,2], filtered=track[,2])
```

The large loss of reads due to a lack of detectable primers, especially in the R3.1 library, is notable. It seems likely that optimizing the ratio of Loop molecules to total output reads will be an important part of maximizing throughput in the form of the most high-quality Loop reads per run.

# Denoising

Learn the error rates:
```{r, warning=FALSE}
err.rds <- file.path(path.rds, "err_16S_Fecal.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(err.rds)) {
  err <- learnErrors(filt, multi=TRUE, verbose=0)
  saveRDS(err, err.rds)
}
err <- readRDS(err.rds)
plotErrors(err, nominalQ=TRUE)
```

Denoise the filtered data into ASVs using current DADA2 defaults:
```{r}
dd <- dada(filt, err, multi=TRUE, verbose=0)
dd
```

Make sequence table and remove chimeras:
```{r}
sta <- makeSequenceTable(dd)
st <- removeBimeraDenovo(sta, minFoldParentOverAbundance=4.5, multi=TRUE, verbose=TRUE)
```

Assign taxonomy:
```{r}
tax.rds <- file.path(path.rds, "tax_16S_Fecal.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(tax.rds)) {
  tax <- assignTaxonomy(st, "~/tax/silva_nr_v132_train_set.fa.gz", minBoot=80, multi=TRUE)
  saveRDS(tax, tax.rds)
}
tax <- readRDS(tax.rds)
if(!identical(getSequences(tax), getSequences(st))) stop("Taxonomy mismatch.")
table(tax[,"Phylum"], useNA="ifany")
```

The usual suspects at the Phylum level at least.

## Comparison to PacBio full-length 16S sequencing results

Currently, PacBio is the gold standard for full-length 16S gene sequencing. Previously, we sequenced these same three fecal samples as part of [our earlier investigation of the accuracy of PacBio full-length 16S sequencing](https://doi.org/10.1093/nar/gkz569). We processed those samples through the DADA2 workflow in that paper, and have included the sequence table obtained from a set of 12 fecal samples (including the 3 resequenced here using Loop full-length 16S) as part of this repository.

Importing PacBio results, and coordinating sample names between the Loop and PacBio results:
```{r}
pb1.rds <- file.path(path.rds, "PacBio_Fecal_st1.rds")
st1.pb <- removeBimeraDenovo(readRDS(pb1.rds), minFoldParentOverAbundance=4.5, 
                             multi=TRUE, verbose=TRUE)
pb2.rds <- file.path(path.rds, "PacBio_Fecal_st2.rds")
st2.pb <- removeBimeraDenovo(readRDS(pb2.rds), minFoldParentOverAbundance=4.5, 
                             multi=TRUE, verbose=TRUE)
sam.names <- sapply(strsplit(rownames(st), "_"), `[`, 1) # e.g. R3.1_contig_list_trimmed.fq
rownames(st) <- sam.names; rownames(sta) <- sam.names
rownames(st1.pb) <- gsub("^Callahan_16S_2pM-Cell1_", "", rownames(st1.pb)) # Remove prefix
rownames(st1.pb) <- gsub("[.]fastq[.]gz$", "", rownames(st1.pb)) # Remove suffix
rownames(st1.pb) <- gsub("_", "", rownames(st1.pb)) # e.g. R_3.1
rownames(st2.pb) <- gsub("_", "", rownames(st2.pb)) # e.g. R_3.1
if(!all(rownames(st) %in% rownames(st1.pb))) stop("Sample name mismatch (1).")
if(!all(rownames(st) %in% rownames(st2.pb))) stop("Sample name mismatch (2).")
st1.pb <- st1.pb[rownames(st),]
st2.pb <- st2.pb[rownames(st),]
```

In the PacBio protocol, a slightly different set of primers is used:
```{r}
FWD.pacb <- "AGRGTTYGATYMTGGCTCAG" 
FWD.loop <- "AGAGTTTGATCMTGGC"
REV.pacb <- "RGYTACCTTGTTACGACTT"
REV.loop <-    "TACCTTGTTACGACTT"
```

The result is that the Loop sequences have an additional 4 nucleotides at the start of the reads that are absent in the PacBio data. To remedy that, we will simply remove the first 4nts from all the Loop sequences:
```{r}
sq.loop.full <- getSequences(st)
sq.loop.trunc <- substr(sq.loop.full, 5, nchar(sq.loop.full))
any(duplicated(sq.loop.trunc)) # FALSE
if(mean(grepl("^TCAG", sq.loop.full)) < 0.95) stop("Trim issue, expected 4nt sequence not at start of these sequences.")
st.loop <- st; tax.loop <- tax
colnames(st.loop) <- sq.loop.trunc; rownames(tax.loop) <- sq.loop.trunc
```

As a first pass, looking at the overlap in ASVs and reads between Loop and PacBio full-length 16S data (replicate 2) on each sample:
```{r}
setASV <- function(unq1, unq2, nms) {
  unq1 <- getUniques(unq1); unq1 <- unq1[unq1 > 0]
  unq2 <- getUniques(unq2); unq2 <- unq2[unq2 > 0]
  rval <- c(sum(!names(unq1) %in% names(unq2)), 
            sum(names(unq1) %in% names(unq2)),
            sum(!names(unq2) %in% names(unq1)))
  names(rval) <- c(nms[[1]], "Shared", nms[[2]])
  rval
}
t(sapply(sam.names, function(nm) setASV(st.loop[nm,], st1.pb[nm,], nms=c("Loop", "PB1"))))
t(sapply(sam.names, function(nm) setASV(st.loop[nm,], st2.pb[nm,], nms=c("Loop", "PB2"))))
t(sapply(sam.names, function(nm) setASV(st1.pb[nm,], st2.pb[nm,], nms=c("PB1", "PB2"))))
```

Half or more of the denoised ASVs are shared across the methods. How about on a per-read basis?
```{r}
setReads <- function(unq1, unq2, nms) {
  unq1 <- getUniques(unq1); unq1 <- unq1[unq1 > 0]
  unq2 <- getUniques(unq2); unq2 <- unq2[unq2 > 0]
  rval <- c(sum(unq1[!names(unq1) %in% names(unq2)]), 
            sum(unq1[names(unq1) %in% names(unq2)]),
            sum(unq2[names(unq2) %in% names(unq1)]),
            sum(unq2[!names(unq2) %in% names(unq1)]))
  names(rval) <- c(nms[[1]], paste0("Shared", nms[[1]]), paste0("Shared", nms[[2]]), nms[[2]])
  rval
}
totReads <- function(st1, st2) { as.matrix(cbind(rowSums(st1), rowSums(st1), rowSums(st2), rowSums(st2))) }

tab.loop.pb1 <- t(sapply(sam.names, function(nm) setReads(st.loop[nm,], st1.pb[nm,], nms=c("Loop", "PB1"))))
tab.loop.pb1
tab.loop.pb1/totReads(st.loop, st1.pb)
tab.loop.pb2 <- t(sapply(sam.names, function(nm) setReads(st.loop[nm,], st2.pb[nm,], nms=c("Loop", "PB2"))))
tab.loop.pb2
tab.loop.pb2/totReads(st.loop, st2.pb)
tab.pb1.pb2 <- t(sapply(sam.names, function(nm) setReads(st1.pb[nm,], st2.pb[nm,], nms=c("PB1", "PB2"))))
tab.pb1.pb2
tab.pb1.pb2/totReads(st1.pb, st2.pb)
```

A very high fraction of reads (>85\%) occur in ASVs identified consistently by both methods. How do things look at the whole community level, when we ordinate these three samples?

Combine the sequence tables together:
```{r}
st.loop.lab <- st.loop
rownames(st.loop.lab) <- gsub("$", ".loop", rownames(st.loop.lab))
st1.pb.lab <- st1.pb
rownames(st1.pb.lab) <- gsub("$", ".pb1", rownames(st1.pb.lab))
st2.pb.lab <- st2.pb
rownames(st2.pb.lab) <- gsub("$", ".pb2", rownames(st2.pb.lab))
st.both.lab <- mergeSequenceTables(st.loop.lab, st1.pb.lab, st2.pb.lab)
ft.both.lab <- sweep(st.both.lab, 1, rowSums(st.both.lab), "/")
```

Perform ordination:
```{r}
bc <- vegdist(ft.both.lab, "bray")
mds <- pcoa(bc)
df <- data.frame(mds$vectors)
df$Sample <- substr(rownames(df), 1, 4)
df$Technology <- sapply(strsplit(rownames(df), "[.]"), `[`, 3)
gg_color_hue <- function(n) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}
clr3 <- gg_color_hue(3)
df$Technology <- c(loop="LoopSeq", pb1="PacBio CCS\n(S/P2-C2/5.0)\n", pb2="PacBio CCS\n(S/P3-C3/5.0)")[df$Technology]
tech.color.scale <- c("LoopSeq"=clr3[[1]], "PacBio CCS\n(S/P2-C2/5.0)\n"=clr3[[2]], 
                      "PacBio CCS\n(S/P3-C3/5.0)"="darkgreen")
ppb <- ggplot(data=df, aes(x=Axis.1, y=Axis.2, label=Sample, color=Technology)) + theme_bw() + 
  scale_color_manual(values=tech.color.scale)
ppb + geom_text(alpha=0.8)
labdf <- df[df$Technology=="LoopSeq",]
labdf$Axis.1 <- 0.8*labdf$Axis.1 # pull labels towards middle of plot
labdf$Axis.2 <- 0.8*labdf$Axis.2
ppb + geom_point(shape="x", size=6) + geom_label(data=labdf, color="black")
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord.pdf"), 
       ppb + geom_point(shape="x", size=6) + geom_label(data=labdf, color="black"), 
       width=5, height=3, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord.png"), 
       ppb + geom_point(shape="x", size=6) + geom_label(data=labdf, color="black"), 
       width=5, height=3, units="in")
```

They fall almost exactly on top of each other. Worth calling out that two of these samples were longitudinal samples from the same person too, so inter-subject and within-subject temporal variability is represented here, as well as the technical replication amongst the same technology (PacBio).

Comparing the measurement dissimilarities between Loop and PacBio and between the PacBio technical replicates:
```{r}
as.matrix(bc)[df$Sample=="R3.1", df$Sample=="R3.1"]
as.matrix(bc)[df$Sample=="R9.3", df$Sample=="R9.3"]
as.matrix(bc)[df$Sample=="R9.4", df$Sample=="R9.4"]
```

In Sample R3.1 the PacBio technical replicates are highly similar, but in the other two samples the Loop measurement is about as similar to each PacBio technical replicate as the PacBio technical replicates are to each other!

Also testing Jaccard distance to ensure this pattern is not entirely driven by high abundance ASVs.
```{r}
jc <- vegdist(ft.both.lab, "jaccard")
mds.jc <- pcoa(jc)
df.jc <- df # Just replace the vectors with the new jaccard derived numbers
df.jc[,paste0("Axis.", seq(8))] <- data.frame(mds.jc$vectors)
ppj <- ggplot(data=df.jc, aes(x=Axis.1, y=Axis.2, label=Sample, color=Technology)) + theme_bw() + 
  scale_color_manual(values=tech.color.scale)
ppj + geom_text(alpha=0.8)
labdf.jc <- df.jc[df.jc$Technology=="LoopSeq",]
labdf.jc$Axis.1 <- 0.8*labdf.jc$Axis.1 # pull labels towards middle of plot
labdf.jc$Axis.2 <- 0.8*labdf.jc$Axis.2
ppj + geom_point(shape="x", size=6) + geom_label(data=labdf.jc, color="black")
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord_jaccard.pdf"), 
       ppj + geom_point(shape="x", size=6) + geom_label(data=labdf.jc, color="black"), 
       width=5, height=3, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Fecal_LoopVsPacBio_ord_jaccard.png"), 
       ppj + geom_point(shape="x", size=6) + geom_label(data=labdf.jc, color="black"), 
       width=5, height=3, units="in")
```

Exact same qualitative patterns with Jaccard distance as well.

## Demonstration of LoopSeq Accuracy using 16S conservation patterns

Natural samples re closer in complexity and composition thatn are mock communities to the types of samples in which people want to employ technologies like LoopSeq full-length. However, unlike in mock communities, we do not know the true compposition of natural samples, and thus accuracy is more difficult to determine in natural samples. Here we will attempt to evaluate the accuracy of LoopSeq FL 16S in natural human fecal samples by leveraging the variable levels at which different sections of the 16S rRNA gene are evolutionarily conserved. In short, we expect the differences between legitimate biological variants to preferentially be found in the variable regions of the 16S rRNA gene, while artefactual variation introduced by technical error processes will not.

We'll use [the external program ssu-align(http://eddylab.org/software/ssu-align/) for this purpose, which uses Infernal to align potentially large collectiosn of SSU rRNA sequences. 

First we'll run the program on an example sequence, to show its output:
```{r}
ssu.exe <- "~/Desktop/bin/ssu-align-0.1.1/src/ssu-align" # CHANGE ME...
ssu.dir <- "/usr/local/share/ssu-align-0.1.1"
export.str <- paste0('export SSUALIGNDIR="', ssu.dir, '";') # Prepend to all ssu-align calls
# i <- 2 has just 1 sub, for testing
d1 <- dd[[1]]
i <- 2
bs <- d1$birth_subs[d1$birth_subs$clust==i,]
sq <- d1$sequence[d1$clustering$birth_from[i]] # Positions are in birth (ref) seq
if(!dir.exists("temp")) dir.create("temp")
tf <- "temp/tempfile.fa"
writeFasta(sq, tf)
td <- "temp/tmp"
system(paste(export.str, ssu.exe, '-f', tf, td))
```

At the end of the program, the key file we are interested in is the bacterial alignment, which is the alignment of the input sequences against the bacterial SSU RNA model. Inspecting the formate of that file:
```{r}
print(substr(readLines("temp/tmp/tmp.bacteria.stk"), 1, 80))
```

There are four lines of output. The first is the aligned sequence itself, converted to the RNA ACGU code. The second line gives "posterior probabilities" of the confidence of the alignment at that position. The third gives the secondary structure consensus at that position, i.e. is there expected to be a basepairing (indicated by various brackets) or not (gaps). Finally, the fourth line is what we are interested in, this shows the level of conservation of that nucleotide position in the model, with well-conserved positions in uppercase, and less-conserved positions in lower-case.

So, we can use this file to determine the fraction of differences between the second dada-denoised ASV and the first that occurred in conserved positions.

```{r}
bps <- c("A", "C", "G", "U", "a", "c", "g", "u")
caps <- c("A", "C", "G", "U")
ln <- readLines("temp/tmp/tmp.bacteria.stk")
sqa <- strsplit(ln[[4]], "")[[1]]
refcons <- strsplit(ln[[7]], "")[[1]]
keep <- sqa %in% bps
sqa <- sqa[keep]
if(!gsub("U", "T", toupper(paste0(sqa, collapse=""))) == sq) { # TRUE
  stop("Aligned sequence not matching input sequence!")
}
sqcons <- refcons[keep] %in% caps
bs$cons <- sqcons[bs$pos]
table(sqcons)/sum(table(sqcons))
table(bs$cons)/sum(table(bs$cons))
```

So in this example, there are a much higher fraction of subsitutions in non-conserved read positions in the differences between the top two ASVs than would be expected by chance!

Functionalizing this operation, and then extending it to a larger set of denoised ASVs:
```{r}
get_subcons <- function(i, dd, tf = "temp/tempfile.fa", td = "temp/tmp", ssu=ssu.exe, export=export.str, return.all.pos=FALSE) {
  bs <- dd$birth_subs[dd$birth_subs$clust==i,]
  i.from <- dd$clustering$birth_from[[i]] # Sub positions are in ASV that this one was born from
  bs$from <- i.from
  sq <- dd$sequence[[i.from]]
  writeLines(c(paste0(">Sq", i.from), sq), tf)
  system(paste(export, ssu, '-f', tf, td), ignore.stdout=TRUE)
  bps <- c("A", "C", "G", "U", "a", "c", "g", "u")
  caps <- c("A", "C", "G", "U")
  alfile <- file.path(td, paste0(basename(td), ".bacteria.stk"))
  if(file.exists(alfile)) {
    ln <- readLines(file.path(td, "tmp.bacteria.stk"))
    sqa <- strsplit(ln[[4]], "")[[1]]
    refcons <- strsplit(ln[[7]], "")[[1]]
    keep <- sqa %in% bps
    sqa <- sqa[keep]
    sqa.tr <- gsub("U", "T", toupper(paste0(sqa, collapse="")))
    PAD <- 0
    if(!sqa.tr == sq) {
      ### cat("\n", sq, sqa.tr, sep="\n")
      if(sqa.tr == substr(sq,1,nchar(sqa.tr))) {
        cat("Aligned sequence is missing some trailing nucleotides:", nchar(sq)-nchar(sqa.tr), "\n")
      } else if(sqa.tr == substr(sq,1+nchar(sq)-nchar(sqa.tr),nchar(sq))) {
        PAD <- nchar(sq)-nchar(sqa.tr)
        cat("Aligned sequence is missing some beginning nucleotides:", PAD, "\n")
      } else { 
        stop("Input sequence didn't match aligned sequence.")
      }
    }
    sqcons <- refcons[keep] %in% caps
    if(PAD > 0) { # NEED to fix for when aligned sequence starts later...
      sqcons <- c(rep(NA, PAD), sqcons)
    }
    bs$cons <- sqcons[bs$pos] 
  } else { # Didn't match database, give NAs to all
    bs$cons <- NA
  }
  if(return.all.pos) { # Return vector of conservation at all positions, not those associated with the birth substitutions
    return(sqcons)
  }
  else { return(bs) } # Return birth_subs data.frame, now with conservation state of each position added in the $cons column
}
nseqs <- function(x) length(getSequences(x))
```

Testing the function:
```{r}
rbind(get_subcons(3, d1), get_subcons(4, d1), get_subcons(5, d1))
```

Output is matching the expected format, this is a `data.frame` with information on the position, substitution type and quality score associated with various dada-denoised ASVs. Interestingly, all of the substitutions between these denoised ASVs and the more abundant ASVs from which they were disciminated occurrsed in non-conserved base positions! Now to see if that pattern is something that holds up over the dataset.

First, running `dada` with a very aggressive sensitivity setting. Our goal here is to use `dada` to order ASVs by the diagnostic likelihood that they are true (under the DADA2 error model). We will then evaluate the conservation patterns of those ordered ASVs. First, to generate them:

```{r}
dd2 <- dada(filt, err, OMEGA_A=1e-2, DETECT_SINGLETONS=TRUE, multi=TRUE, verbose=0)
dd2
```

Now let's run our identication of conservation patterns on all these ASVs:
```{r}
cons.rds <- file.path(path.rds, "cons_16S_Fecal.rds") # RDS save/load to speed up reruns of the code
if(!file.exists(cons.rds)) {
  system.time(cons <- lapply(names(dd2), function(nm) {
    di <- dd2[[nm]]
    lapply(seq(2, nseqs(di)), get_subcons, dd=di)
  }))
  names(cons) <- names(dd2)
  saveRDS(cons, cons.rds)
  system('rm -Rf temp/tmp')
  system('rm temp/*')
}
```

```{r}
cons <- readRDS(cons.rds)
if(!all(sapply(cons, length) == sapply(dd2, function(x) nseqs(x)-1))) stop("Number of cons ASVs mismatch.")
```

Now to curate the SSU conservation data into a data.frame:
```{r}
clustdfs <- lapply(names(dd2), function(nm) {
  di <- dd2[[nm]]; coni <- cons[[nm]]
  rdf <- di$clustering[-1,] # Dropping first ASV for simplicity
  rdf$cons <- sapply(coni, function(x) sum(x$cons, na.rm=TRUE)) 
  # NAs occur when only part of the ASVs aligns to the ssu-align RNA model
  rdf$vars <- sapply(coni, function(x) sum(!x$cons, na.rm=TRUE))
  rdf$nsub <- sapply(coni, function(x) nrow(x))
  rdf$ncat <- rdf$cons + rdf$vars # The number of substitutions categorized as conserved/variable
  rdf$var.frac <- rdf$vars/rdf$ncat
  rdf$index <- seq(2,nrow(rdf)+1)
  rdf$sample <- nm
  rdf
})
names(clustdfs) <- names(dd2)
```

For comparison, we also want to know the average fraction of positions that are conserved if chosen at random (rather than at the positions of birth substutitions). To calculate that:
```{r}
ii <- sample(nseqs(d1), 20)
cons.all <- lapply(ii, get_subcons, dd=d1, return.all.pos=TRUE)
sapply(cons.all, mean)
```

Yep. Quite consistent, a bit over 80\% of the substitutions are in conserved positions.

Now some exploration of the distribution of the ratio of substitutions in conserved and variable regions, as a function of the order of the ASVs (which reflects decreasing confidence by DADA2 they are not seqeuencing error) and by the abundance of the denoised ASVs:
```{r}
# Use the median over our random sample as a plugin estimate for the fraction of conserved sites in each sequence
raw.cons.frac <- median(sapply(cons.all, mean))
raw.var.frac <- 1-raw.cons.frac
clustdf <- do.call(rbind, clustdfs)
clustdf$nnt <- nchar(clustdf$sequence)
clustdf$frac.sub.cons <- clustdf$cons/(clustdf$nnt*raw.cons.frac)
clustdf$frac.sub.var <- clustdf$vars/(clustdf$nnt*raw.var.frac)
clustdf$logp <- log10(clustdf$birth_pval)
clustdf$rlogp <- log10(clustdf$birth_pval + min(clustdf$birth_pval[clustdf$birth_pval>0]))
clustdf$sample <- paste("Sample", sapply(strsplit(clustdf$sample, "_"), `[`, 1))
clustdf$abundance <- as.numeric(clustdf$abundance)
theme_set(theme_bw())
hist(clustdf$ncat, 100)
p <- ggplot(data=clustdf, aes(x=frac.sub.cons, y=frac.sub.var)) + 
  xlab("Substitution Rate at Conserved Positions") + ylab("Substitution Rate at\nVariable Positions") +
  geom_abline(col="grey80", intercept=0, slope=1) + theme(aspect.ratio=1)
sz.scale <- scale_size_continuous(breaks=c("1"=1, "10"=10, "100"=100, "1000"=1000), name="ASV abundance")
p + aes(size=abundance) + geom_point(alpha=0.2) + sz.scale
p + aes(size=abundance) + geom_point(alpha=0.2) + sz.scale + scale_x_log10() + scale_y_log10()
p4 <- p + aes(size=abundance) + geom_point(alpha=0.2) + facet_wrap(~sample) + sz.scale
p4
ggsave(file.path(path.fig, "Fecal_VarVsCons.pdf"), p4, width=7, heigh=3, units="in", useDingbats=FALSE)
ggsave(file.path(path.fig, "Fecal_VarVsCons.png"), p4, width=7, heigh=3, units="in")
#require(hexbin)
#p + geom_hex()
#p + geom_bin2d()
```

There are two modes here. The first mode contains ASVs with very large numbers of substitution differences (>100) from the ASV from which they were discriminated. In this large-difference mode, the number of substitutiosn at conserved positions exceed the number of substitutions at variable positions by about a factor of 1.5x. The second mode contains ASVs with small to moderate numbers of substitution differences (<100, mostly <10) from the ASV from which they were disciminated. In this small-difference mode, the number of substitutions at conserved positions seems to be slightly less than at variable positions. Remember that ~80\% of the positions are conserved (red-dashed line) so this is strongly skewed against what would be expected by chance.

Does this enrichment of substitutions at variable positions hold for even the most difficult to identify ASVs, singletons and single-substitution variants?
```{r}
is.var.1diff <- as.logical(clustdf$vars[clustdf$ncat==1])
table(is.var.1diff)
```

```{r}
frac.var.1abund <- clustdf$var.frac[clustdf$abundance==1]
hist(frac.var.1abund, n=100)
```

Yep, for both categories the number of substitutions at variable positions is about equal to those at conserved positions.

Our goals here are to determine the level to which biological differences can be discriminated from Loop sequencing errors in real complex data from these human fecal samples. Thus, we are primarily interested in that small-substitution part of the data -- it has already been established that the per-nucleotide error rate in LoopSeq data is extremely low and is not introducing hundreds of substitutions. So, focusing in on that part of the data, is there a trend towards decreasing accuracy (i.e. a lower fraction of substutions at variable positions) as we push the sensitivity of ASV inference (i.e. look at lower-certainty ASVs)?

```{r}
p1 <- ggplot(data=clustdf[clustdf$ncat==1,]) + aes(x=index, y=var.frac, size=abundance, weight=abundance) + sz.scale
p1 + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
p2 <- ggplot(data=clustdf[clustdf$ncat==2,]) + aes(x=index, y=var.frac, size=abundance, weight=abundance) + sz.scale
p2 + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
pdef <- ggplot(data=clustdf[clustdf$abundance>2 & clustdf$birth_pval < 1e-40,]) + aes(x=index, y=var.frac, size=abundance, weight=abundance) + sz.scale
pdef + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$ncat>10,], aes(x=index, y=var.frac, size=abundance, weight=abundance)) + geom_point(alpha=0.4) + geom_smooth() + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$ncat<10,], aes(x=index, y=var.frac, size=abundance, weight=abundance)) + geom_point(alpha=0.4) + geom_smooth() + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$abundance > 1 & clustdf$ncat>5 & clustdf$ncat<80,], aes(x=index, y=var.frac, size=abundance, weight=ncat)) + geom_point() + geom_smooth() + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$abundance == 1 & clustdf$ncat<80,], aes(x=index, y=var.frac, size=abundance)) + geom_point(alpha=0.4) + geom_smooth() + facet_wrap(~sample)
```

```{r}
ggplot(data=clustdf[clustdf$ncat<2,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$ncat<3,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$ncat<4,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
ggplot(data=clustdf[clustdf$ncat<5,], aes(x=index, y=var.frac, size=abundance)) + geom_smooth() + geom_point(alpha=0.4) + facet_wrap(~sample)
```

There is some dropoff in the fraction of substitutions at variable positions as we get to the later ASVs identified using these very high sensitivity parameters, but not that much really.