excel-extensions-which-witch.Rmd

---
title: "Tutorial on Pivot Tables and other Excel things you can also do in R - Witch Trials data"
author: "by Kim Fitter"
output: 
  html_document: 
    toc: true
    toc_float: true
---

## Introduction

This tutorial was inspired by the [R Curious tutorial at useR! 2018](https://user2018.r-project.org/tutorials/), and follows on thematically from the R Curious workshop [notes](https://github.com/softloud/rcurious/tree/master/workshop) as an extension.

It is aimed at those with a background in Excel, who would also like to use R for data analysis. This tutorial compares the things you would normally do in Excel, but with an equivalent function in R. 

This introductory level tutorial assumes you have already installed R and R studio and had a brief introduction to the R basics and R Markdown. 

From the [R Markdown](https://github.com/kimnewzealand/R-tutorials/blob/master/excel-extensions-which-witch.Rmd) run each line of code (Ctrl + Enter) or code chunk (Ctrl + Shift +Enter) yourself or Knit -> Knit to HTML to read through the HTML version.


```{r setup, include=FALSE}
# This is a comment on the setup chunk, we are using echo=TRUE so that the code is included in the R Markdown document
knitr::opts_chunk$set(cache = TRUE, echo = TRUE)
```

## Particularly useful packages for this exercise

```{r load packages}
# Load packages
library(tidyverse)
library(rpivotTable)
```

**Tip** Install the package from CRAN the first time you use the package using  `install.packages("name of package")` else you will get an `Error` that there is no package.

**Tip** The latest package version is sometimes available from GitHub so alternatively install from GitHub using the`devtools` package. A package's GitHub README file typically has installation instructions.

## Import Witch Trial data

The dataset used in this tutorial is of 43,000 people tried for witchcraft across
21 European countries over a period of five-and-a-half centuries used for this [economic journal](https://www.researchgate.net/publication/315533575_Witch_Trials). 

This is what the csv file looks like in Excel:

<center>
<img src="https://user-images.githubusercontent.com/26913197/42921926-90eae0b6-8b72-11e8-928d-c0e014bb7230.PNG" style="width:60%">
</center>

In R, import the data using the `read_csv` function from the [readr](https://cran.r-project.org/web/packages/readr/index.html) R package which is loaded with the tidyverse, to import the raw data to a data frame called `witchdat`.

A data frame, which is a 2-dimensional object where contents can be different data types.

```{r download data}
# Download raw data from the author's GitHub url https://github.com/JakeRuss/witch-trials/tree/master/data
url <- "https://raw.githubusercontent.com/JakeRuss/witch-trials/master/data/trials.csv"
# The read_csv function also produces default messages describing the parsing of columns which is converting the columns into different types.
witchdat <- read_csv(url)
```
  

Let's take a quick look at the `witchdat` data frame in R to get an idea of it's size and contents.

In R there are some packages that are included by default; take a look at the packages available in the Global Environment drop down. One such package is called [base](https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html) which includes dim function to retrieve the dimensions, the row and column numbers.

```{r dim}
# Retrieve the number of rows and columns
dim(witchdat)
```

Another included package is [utils](https://cran.r-project.org/web/packages/R.utils/index.html) R package which includes the `str` and `structure` functions to view the basic structure. 

```{r structure}
# Use str function to 'Compactly Display the Structure of an Arbitrary R Object'. All R objects store additional attributes to store metadata about an object. In this str function we set the give.attr to FALSE to exclude these attributes.
str(witchdat, give.attr=FALSE)
# Next we will use structure function to view witchdat 
structure(witchdat)
# There are other ways of looking at your data. Try the glimpse function from the tibble package. Take a look at the Help description with ?glimpse to find out more
glimpse(witchdat)
```
  
This is a fairly large data frame with columns that are characters, doubles and integers.

##  Format "cells" in R

*In Excel we can rename and format "cells". *

<center>
<img src="https://user-images.githubusercontent.com/26913197/42921927-9114e546-8b72-11e8-864d-625d594fa297.PNG" style="width:60%">

</center>


<center>
<img src="https://user-images.githubusercontent.com/26913197/42921928-913ec1e0-8b72-11e8-8fc7-adefe3320e83.PNG" style="width:60%">
</center>


*Which functions can we use in R?*

From now on we will refer to the columns as variables, with column headers as the variable names.

First we will look at the format of the variable names. 

```{r variable names}
# Take a look at the column or variable names
names(witchdat)
```

A tool for data frame data manipulation is the  [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html) R package. R packages typically have vignettes which are tutorials and worked examples using the package. There is useful documentation in the [Introduction to dplyr](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html)  vignette.

In order to access the contents of the columns we can use a `$` for example `witchdat$gadm.adm0`. Alternatively we can use another method that joins objects and functions using a pipe `%>%`. This can be useful if we would like to use multiple `dplyr` verbs or functions on an object.

We will be use `<-` to assign the formatting changes to `witchdat`.

```{r format}
# From inspection of witchdat this cryptic name gadm.adm0 refers to country. Use dplyr to rename a variable. 
witchdat <- witchdat %>% 
  rename(country=gadm.adm0)
```

```{r format2}
# Perform data type conversion of the deaths variable - use dplyr mutate to change this variable
witchdat <- witchdat %>% 
  mutate(deaths=as.numeric(deaths))
# or we can use dplyr mutate_all to change types of all integer variables (is.integer) to numeric (as.numeric). With mutate_if we don't need the () with is.integer and as.numeric functions as mutate_if will operate on all the columns for which the first test function is TRUE.
witchdat <- witchdat %>% 
  mutate_if(is.integer,as.numeric)
```

Try typing 'as.' in the R Console to see what other options are available to change data types.

```{r formatting changes structure}
# View the structure of eventdat again to see these formatting changes
glimpse(witchdat)
```

For other formatting we can also format the end products (such as plots), using the functions available in those R packages.

##  "Sort" data in R

*In Excel we can sort by columns.*

<center>
<img src="https://user-images.githubusercontent.com/26913197/42921929-9166b63c-8b72-11e8-973f-2e218cc7391e.PNG" style="width:60%">
</center>

*Which functions can we use in R?*

R is different to Excel that we can also view data manipulations without assigning changes to an object. In these sorting examples we will just view the changes in-line.

```{r arrange}
# Sort by character using arrange from dplyr. We can also pipe other packages functions such as head from the utils package.
witchdat %>% 
  arrange(city) %>% 
  head()
# This sorting also works on numeric variables, sort by year using arrange from dplyr
witchdat %>% 
  arrange(year) %>% 
  head()
# Or we can sort in descending order
witchdat %>% 
  arrange(desc(year)) %>% 
  head()
```


##  Create a "formula" in R

*In Excel we can create formulas in cells.*

<center>
<img src="https://user-images.githubusercontent.com/26913197/42921930-91cad61c-8b72-11e8-8718-de26f5af4ca8.PNG" style="width:60%">

</center>

<center>
<img src="https://user-images.githubusercontent.com/26913197/42921931-91f7bae2-8b72-11e8-9e12-5742fa8befef.PNG" style="width:60%">
</center>

*Which functions can we use in R?*

```{r formula}
# Remember R as a calculator? This is a simple formula
3+2
# Create a new variable in our dataframe called triedbutnotdied as the difference between two variables (formula after the =). Use the mutate function from dplyr
witchdat <- witchdat %>% 
  mutate(triedbutnotdied = tried-deaths)
```


```{r formula2}
# In R missing values are typically NA values. A handy base function to know is is.na(). Try typing ?is.na in the console.  Let's use this function in a new formula to test if a variable is populated
witchdat <- witchdat %>% 
  mutate(missingyear = ifelse(is.na(year),"Yup missing","Not missing"))
# View the structure of witchdat again to see these new variables
glimpse(witchdat)
```


##  Filter a "column"

*In Excel we can filter by columns.* 

<center>
<img src="https://user-images.githubusercontent.com/26913197/42921932-922124c2-8b72-11e8-93b0-d3d9888e5b47.PNG" style="width:60%">

</center>

*Which functions can we use in R?*

```{r filter}
# Filter which witches were tried in decade 1520 by using dplyr filter. Remember to use == for equals in R, whereas = is an assignment operator in R functions.
witchdat %>% 
  filter(decade==1520) %>% 
  head()
```


##  Calculate on "columns"

*In Excel we can add a calculation such as sum or a mean to a column.*

<center>
<img src="https://user-images.githubusercontent.com/26913197/42921933-924e4c72-8b72-11e8-9086-f7df45709e12.PNG" style="width:60%">
</center>

*Which functions can we use in R?*

In R Markdown documents we can "highlight" and comment on insights, calculations or data manipulations.

```{r calculation}
# Sum the total deaths. Remember there are NA values so we need to tell this sum function to ignore these with argument na.rm=TRUE. Using sum on the 'whole column' ie variable witchdat$deaths is intuitively the same way of calculating this sum in Excel
sum(witchdat$deaths,na.rm=TRUE)
# Now use dplyr with the select function to select the deaths variable (also the same as deleting the other columns in Excel) then apply the sum on this object
witchdat %>% 
  select(deaths) %>% 
  sum(na.rm = TRUE)
# Now try the dplyr summarise function and sum function then name this sum object as sum_deaths
witchdat %>% 
  summarise(sum_deaths = sum(deaths, na.rm = TRUE))
```


##  Create a "Pivot Table"

*In Excel you can create pivot tables.*

<center>
<img src="https://user-images.githubusercontent.com/26913197/42921934-927aebf6-8b72-11e8-9cf7-b907e14fcbaa.PNG" style="width:60%">
</center>

*Which functions can we use in R?*

In R we can use the table function to create a simple pivot table.

```{r table}
# Use th table function to find a simple count by country
witchdat %>% 
  select(country) %>% 
      table()
```


A neat R package that creates interactive pivot tables is [rpivotTable](https://cran.r-project.org/web/packages/rpivotTable/).   Have a look at this package's [vignette](https://cran.r-project.org/web/packages/rpivotTable/vignettes/rpivotTableIntroduction.html).

This example shows the sum of the "witches"" tried by country.

```{r pivot}
# Create an interactive pivot table using the rpivotTable function and package. 
rpivotTable(witchdat,rows="country",vals="tried",aggregator="Sum")
```
  
  
**Challenge 1** Can you recreate the pivot table as a new R data frame using `dplyr` functions? (Hint ?group_by)

**Challenge 2** On the pivot table widget move the `country` above the table so the `country` is a pivot table column. Can you extend your challenge 1 answer with `dplyr`  to recreate.

**Challenge 3** On the pivot table widget move `decade` and `country` as pivot table rows. Can you extend your challenge 1 answer with `dplyr`  to recreate.

## Save the "spreadsheet"

*In Excel you can save your workbooks and spreadsheets. Which functions can we use in R?*

Here are some options in R, depending on what output is needed:  

- The `base` R `write.csv` function to write to a csv file  
- The `readr` package in the tidyverse, includes the function `write_csv`, which is faster than write.csv  
- The `xlsx` package `write.xlsx` to write to an Excel workbook  


```{r save}
# Save dataframe with write_csv from readr package
write_csv(witchdat,"witchdat.csv")
```
  
  
***  

## Acknowledgments

  
Thanks to Marcia Ferreira @DrMarciaFe, Chris Beltz @BeltzEcology, Duncan Garmonsway @nacnudus and Ivan Leung @urganmax for reviewing and providing suggestions on this tutorial through Twitter.

If you have any further suggestions or corrections please raise an issue in [GitHub](https://github.com/kimnewzealand/R-tutorials)