Part4-WhyTidyData.Rmd

---
title: "Tidy Data: Why and How"
author: "Ted Laderas"
date: "5/24/2017"
output: 
  html_document:
    code_download: true
    code_folding: hide
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```

## What is Tidy Data?

- each row corresponds to an observation
- each variable is a column
- each type of observation is in a different table

![](figure/tidy-1.png)

## Why Tidy Data?

Tidy data enables us to do lots of things!

1) Great ggplots
2) Summarize/slice the data in multiple ways
3) Enable Exploratory Data Analysis
4) Ensure assumptions are met for methods
5) Enable Confirmatory Data Analysis

## Beware of columns masquerading as variables!

These columns are actually categories!

Ask yourself: do these columns go together as a single observation for your analysis?

Also ask yourself: What is the unit of observation?

```{r}
library(tidyr)
fertilityData <- read.csv("data/total_fertility.csv", check.names = FALSE)

fertilityData[1:10, 1:10]
```

## Making data tidy: `gather()`

Use `gather()` when you need to make a bunch of columns into one column. 

```{r}
library(tidyr)
fertilityData <- read.csv("data/total_fertility.csv", check.names = FALSE)

#gather() takes three arguments: data, key, and value
#key is what you want your new categorical column to be named
#value is for the actual values in the columns

#We don't want the `Total fertility rate` column to be included as part of the
#gather() operation, so we use the `-` notation to exclude it.

fertilityDataTidy <- 
  gather(fertilityData, "Year", "fertilityRate", -`Total fertility rate`) %>% 
  select(Country = `Total fertility rate`, Year, fertilityRate) %>% 
  #remove na values (there are countries that have no information)
  filter(!is.na(fertilityRate))

fertilityDataTidy[1:10,]
```

## Your Task: using tidy data

How would we find the average fertility within a year? How about from 1860 on?

```{r}

```

## Making one column into many: `spread()`

Sometimes, you will need to go the other direction: take a long format dataset and make it into a more matrix-like format. This is necessary for such functions such as `heatmap()`.

Let's change things around and make the `Country` column into the variables (columns) in the dataset. 

```{r}
fertilityCountryColumns <- fertilityDataTidy %>% 
  #spread takes a key (Country) and value (fertilityRate) argument
  spread(Country, fertilityRate) 

fertilityCountryColumns[1:10, 1:10]
```

## Your Task - Who is the most democratic?

Load the `dem_score.csv` dataset in the `data/` folder. Tidy it up. Which countries had the highest democracy score in 2007?

Hint: you'll have to use your `dplyr` skills as well.

```{r}
#enter your answer here

demScore <- read.csv("data/dem_score.csv")

```

## Challenge - if that was too easy...

Take a look at the `who` dataset (it's built into `tidyverse`)

```{r}
data(who)

who
```

## Make it look like this:

```{r}
load("data/who_tidy.rda")

who_tidy
```

## Some Hints on the Challenge

Look at the documentation for `separate()`. You will first have to gather a bunch of the columns into a single column. Then you will have to apply `separate()` twice, with different parameters.

What does each column mean? Here's some info from the data dictionary:

1) The first three letters of entries in the key column correspond to new or old cases of TB.
2) The next two letters (after the _) correspond to TB type:
    + `rel` for relapse, 
    + `ep` for extrapulmonary TB
    + `sn` for smear negative, 
    + `sp` for smear positive
3) The next letter after the second _ corresponds to the sex of the TB patient.
4) The remaining numbers correspond to age group:
    + `014` for 0 to 14 years
    + `65` for 65 or older
    + etc.

## What's Next?

We've showed you the bare basics of the tidyverse. There's a ton more!

Where to go next?

- <http://tidyverse.org>
    - `lubridate` for dealing with dates
    - `stringr` for manipulating strings
    - `haven`/`readr`/`readxl` for importing data
- Modern Dive (by Chester and Albert Kim): http://moderndive.com
- R for Data Science: http://r4ds.had.co.nz
- [Variety of courses on DataCamp](https://www.datacamp.com/) 

## Keep in Touch!

Ted: @tladeras https://laderast.github.io
Chester: @old_man_chester https://ismayc.github.io

## Give us Feedback!

Let us know what you thought of the workshop! 

Was it too easy? Too hard?

What else would you like to see? [Add Feedback here](https://github.com/Cascadia-R/gRadual-intRoduction-tidyverse/issues/1)