stimulus-timings.Rmd

---
title: "Stimulus Timings"
author: "TJ Mahr"
date: "May 2, 2017"
output:
  rmarkdown::github_document: default
  ghdown::github_html_document: default
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  comment = "#>", 
  collapse = TRUE)
```

Over the course of the longitudinal study, we re-recorded the stimuli used in 
our word recognition experiments. I would like double-check when the stimuli
were changed in each experiment. This scripts iterates over the wav files in 
each year of study x experiment x dialect version and computes the durations of 
the files.

## Helpers

```{r}
library(purrr)
library(dplyr, warn.conflicts = FALSE)

# Measure duration of a wav file in ms
get_wave_duration <- function(filepath) {
  info <- tuneR::readWave(filepath, header = TRUE)
  # milliseconds = (1000 ms per second / samples per second) * num samples
  (1000 / info$sample.rate) * info$samples
}

find_waves <- function(dirpath) {
  list.files(dirpath, pattern = ".wav|.WAV", 
             recursive = TRUE, full.names = TRUE)
}
```

List all the wav files in each of the longitudinal study folders and compute 
the duration.

```{r}
wav_files <- list(tp1 = "tp1", tp2 = "tp2", tp3 = "tp3") %>% 
  # find the files in each study, creating a data-frame for each study
  map(find_waves) %>% 
  map(~ data_frame(file = .x)) %>% 
  # collapse to a single data-frame
  bind_rows(.id = "study") %>% 
  mutate(
    token = basename(file),
    duration = map_dbl(file, get_wave_duration)) %>% 
  select(study, token, duration, file)
wav_files
```

Deduce dialect and experiment name.

```{r}
wav_files <- wav_files %>% 
  mutate(
    task = file %>% stringr::str_extract("MP|RWL"),
    dialect = file %>% stringr::str_extract("AAE|SAE")) %>% 
  select(study, task, dialect, token, duration, file)
wav_files
```

For each experiment, count the number of token _durations_. This count will
reveal whether the tokens in an experiment were normalized to have the same 
duration.

```{r}
wav_files %>% 
  group_by(study, task, dialect) %>% 
  summarise(num_durations = n_distinct(duration)) 
```

It looks like the RWL experiments did *not* have normalized tokens.

Now, I would like to exclude the files that were not the main nouns used in the
experiments. Our trials involved three parts: a carrier phrase, a noun, and an
attention-getter phrase. For example, _find the_ [carrier phrase] _dog_ [noun] 
... _check it out_ [attention getter]. I'm going to clean up the token filenames 
and figure out a pattern that will filter out nouns versus non-nouns.


```{r}
# Strip off junk from file name to see the main content of it
words <- wav_files %>% 
  getElement("token") %>% 
  stringr::str_replace_all("AAE|SAE", "") %>% 
  stringr::str_replace_all("\\d", "") %>% 
  stringr::str_replace_all("_[A-Z]_", "") %>% 
  stringr::str_replace_all("_", "") %>% 
  stringr::str_replace_all(".wav", "") %>% 
  unique() %>% 
  sort()
words

non_noun <- "check|Check|Fin|Fun|look|Look|See|this"

noun_files <- wav_files %>%
  filter(!stringr::str_detect(token, non_noun))
```

Now, there are fewer file durations per experiment.

```{r}
noun_files %>% 
  group_by(study, task, dialect) %>% 
  summarise(num_durations = n_distinct(duration)) 
```

And I can get what I would like, which is the duration of the noun tokens in
each experiment.

```{r}
noun_files %>% 
  group_by(study, task, dialect) %>% 
  summarise(
    n_durations = n_distinct(duration),
    min_dur = min(duration) %>% round(1),
    med_dur = median(duration) %>% round(1),
    max_dur = max(duration) %>% round(1)) %>% 
  ungroup() %>% 
  knitr::kable()
```

Okay, that confirms what I thought. The stimuli were re-recorded after TP1, so
the same files are used in TP2 and TP3.