-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathexcel-extensions-which-witch.Rmd
309 lines (203 loc) · 12.2 KB
/
excel-extensions-which-witch.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
---
title: "Tutorial on Pivot Tables and other Excel things you can also do in R - Witch Trials data"
author: "by Kim Fitter"
output:
html_document:
toc: true
toc_float: true
---
## Introduction
This tutorial was inspired by the [R Curious tutorial at useR! 2018](https://user2018.r-project.org/tutorials/), and follows on thematically from the R Curious workshop [notes](https://github.com/softloud/rcurious/tree/master/workshop) as an extension.
It is aimed at those with a background in Excel, who would also like to use R for data analysis. This tutorial compares the things you would normally do in Excel, but with an equivalent function in R.
This introductory level tutorial assumes you have already installed R and R studio and had a brief introduction to the R basics and R Markdown.
From the [R Markdown](https://github.com/kimnewzealand/R-tutorials/blob/master/excel-extensions-which-witch.Rmd) run each line of code (Ctrl + Enter) or code chunk (Ctrl + Shift +Enter) yourself or Knit -> Knit to HTML to read through the HTML version.
```{r setup, include=FALSE}
# This is a comment on the setup chunk, we are using echo=TRUE so that the code is included in the R Markdown document
knitr::opts_chunk$set(cache = TRUE, echo = TRUE)
```
## Particularly useful packages for this exercise
```{r load packages}
# Load packages
library(tidyverse)
library(rpivotTable)
```
**Tip** Install the package from CRAN the first time you use the package using `install.packages("name of package")` else you will get an `Error` that there is no package.
**Tip** The latest package version is sometimes available from GitHub so alternatively install from GitHub using the`devtools` package. A package's GitHub README file typically has installation instructions.
## Import Witch Trial data
The dataset used in this tutorial is of 43,000 people tried for witchcraft across
21 European countries over a period of five-and-a-half centuries used for this [economic journal](https://www.researchgate.net/publication/315533575_Witch_Trials).
This is what the csv file looks like in Excel:
<center>
<img src="https://user-images.githubusercontent.com/26913197/42921926-90eae0b6-8b72-11e8-928d-c0e014bb7230.PNG" style="width:60%">
</center>
In R, import the data using the `read_csv` function from the [readr](https://cran.r-project.org/web/packages/readr/index.html) R package which is loaded with the tidyverse, to import the raw data to a data frame called `witchdat`.
A data frame, which is a 2-dimensional object where contents can be different data types.
```{r download data}
# Download raw data from the author's GitHub url https://github.com/JakeRuss/witch-trials/tree/master/data
url <- "https://raw.githubusercontent.com/JakeRuss/witch-trials/master/data/trials.csv"
# The read_csv function also produces default messages describing the parsing of columns which is converting the columns into different types.
witchdat <- read_csv(url)
```
Let's take a quick look at the `witchdat` data frame in R to get an idea of it's size and contents.
In R there are some packages that are included by default; take a look at the packages available in the Global Environment drop down. One such package is called [base](https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html) which includes dim function to retrieve the dimensions, the row and column numbers.
```{r dim}
# Retrieve the number of rows and columns
dim(witchdat)
```
Another included package is [utils](https://cran.r-project.org/web/packages/R.utils/index.html) R package which includes the `str` and `structure` functions to view the basic structure.
```{r structure}
# Use str function to 'Compactly Display the Structure of an Arbitrary R Object'. All R objects store additional attributes to store metadata about an object. In this str function we set the give.attr to FALSE to exclude these attributes.
str(witchdat, give.attr=FALSE)
# Next we will use structure function to view witchdat
structure(witchdat)
# There are other ways of looking at your data. Try the glimpse function from the tibble package. Take a look at the Help description with ?glimpse to find out more
glimpse(witchdat)
```
This is a fairly large data frame with columns that are characters, doubles and integers.
## Format "cells" in R
*In Excel we can rename and format "cells". *
<center>
<img src="https://user-images.githubusercontent.com/26913197/42921927-9114e546-8b72-11e8-864d-625d594fa297.PNG" style="width:60%">
</center>
<center>
<img src="https://user-images.githubusercontent.com/26913197/42921928-913ec1e0-8b72-11e8-8fc7-adefe3320e83.PNG" style="width:60%">
</center>
*Which functions can we use in R?*
From now on we will refer to the columns as variables, with column headers as the variable names.
First we will look at the format of the variable names.
```{r variable names}
# Take a look at the column or variable names
names(witchdat)
```
A tool for data frame data manipulation is the [dplyr](https://cran.r-project.org/web/packages/dplyr/index.html) R package. R packages typically have vignettes which are tutorials and worked examples using the package. There is useful documentation in the [Introduction to dplyr](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) vignette.
In order to access the contents of the columns we can use a `$` for example `witchdat$gadm.adm0`. Alternatively we can use another method that joins objects and functions using a pipe `%>%`. This can be useful if we would like to use multiple `dplyr` verbs or functions on an object.
We will be use `<-` to assign the formatting changes to `witchdat`.
```{r format}
# From inspection of witchdat this cryptic name gadm.adm0 refers to country. Use dplyr to rename a variable.
witchdat <- witchdat %>%
rename(country=gadm.adm0)
```
```{r format2}
# Perform data type conversion of the deaths variable - use dplyr mutate to change this variable
witchdat <- witchdat %>%
mutate(deaths=as.numeric(deaths))
# or we can use dplyr mutate_all to change types of all integer variables (is.integer) to numeric (as.numeric). With mutate_if we don't need the () with is.integer and as.numeric functions as mutate_if will operate on all the columns for which the first test function is TRUE.
witchdat <- witchdat %>%
mutate_if(is.integer,as.numeric)
```
Try typing 'as.' in the R Console to see what other options are available to change data types.
```{r formatting changes structure}
# View the structure of eventdat again to see these formatting changes
glimpse(witchdat)
```
For other formatting we can also format the end products (such as plots), using the functions available in those R packages.
## "Sort" data in R
*In Excel we can sort by columns.*
<center>
<img src="https://user-images.githubusercontent.com/26913197/42921929-9166b63c-8b72-11e8-973f-2e218cc7391e.PNG" style="width:60%">
</center>
*Which functions can we use in R?*
R is different to Excel that we can also view data manipulations without assigning changes to an object. In these sorting examples we will just view the changes in-line.
```{r arrange}
# Sort by character using arrange from dplyr. We can also pipe other packages functions such as head from the utils package.
witchdat %>%
arrange(city) %>%
head()
# This sorting also works on numeric variables, sort by year using arrange from dplyr
witchdat %>%
arrange(year) %>%
head()
# Or we can sort in descending order
witchdat %>%
arrange(desc(year)) %>%
head()
```
## Create a "formula" in R
*In Excel we can create formulas in cells.*
<center>
<img src="https://user-images.githubusercontent.com/26913197/42921930-91cad61c-8b72-11e8-8718-de26f5af4ca8.PNG" style="width:60%">
</center>
<center>
<img src="https://user-images.githubusercontent.com/26913197/42921931-91f7bae2-8b72-11e8-9e12-5742fa8befef.PNG" style="width:60%">
</center>
*Which functions can we use in R?*
```{r formula}
# Remember R as a calculator? This is a simple formula
3+2
# Create a new variable in our dataframe called triedbutnotdied as the difference between two variables (formula after the =). Use the mutate function from dplyr
witchdat <- witchdat %>%
mutate(triedbutnotdied = tried-deaths)
```
```{r formula2}
# In R missing values are typically NA values. A handy base function to know is is.na(). Try typing ?is.na in the console. Let's use this function in a new formula to test if a variable is populated
witchdat <- witchdat %>%
mutate(missingyear = ifelse(is.na(year),"Yup missing","Not missing"))
# View the structure of witchdat again to see these new variables
glimpse(witchdat)
```
## Filter a "column"
*In Excel we can filter by columns.*
<center>
<img src="https://user-images.githubusercontent.com/26913197/42921932-922124c2-8b72-11e8-93b0-d3d9888e5b47.PNG" style="width:60%">
</center>
*Which functions can we use in R?*
```{r filter}
# Filter which witches were tried in decade 1520 by using dplyr filter. Remember to use == for equals in R, whereas = is an assignment operator in R functions.
witchdat %>%
filter(decade==1520) %>%
head()
```
## Calculate on "columns"
*In Excel we can add a calculation such as sum or a mean to a column.*
<center>
<img src="https://user-images.githubusercontent.com/26913197/42921933-924e4c72-8b72-11e8-9086-f7df45709e12.PNG" style="width:60%">
</center>
*Which functions can we use in R?*
In R Markdown documents we can "highlight" and comment on insights, calculations or data manipulations.
```{r calculation}
# Sum the total deaths. Remember there are NA values so we need to tell this sum function to ignore these with argument na.rm=TRUE. Using sum on the 'whole column' ie variable witchdat$deaths is intuitively the same way of calculating this sum in Excel
sum(witchdat$deaths,na.rm=TRUE)
# Now use dplyr with the select function to select the deaths variable (also the same as deleting the other columns in Excel) then apply the sum on this object
witchdat %>%
select(deaths) %>%
sum(na.rm = TRUE)
# Now try the dplyr summarise function and sum function then name this sum object as sum_deaths
witchdat %>%
summarise(sum_deaths = sum(deaths, na.rm = TRUE))
```
## Create a "Pivot Table"
*In Excel you can create pivot tables.*
<center>
<img src="https://user-images.githubusercontent.com/26913197/42921934-927aebf6-8b72-11e8-9cf7-b907e14fcbaa.PNG" style="width:60%">
</center>
*Which functions can we use in R?*
In R we can use the table function to create a simple pivot table.
```{r table}
# Use th table function to find a simple count by country
witchdat %>%
select(country) %>%
table()
```
A neat R package that creates interactive pivot tables is [rpivotTable](https://cran.r-project.org/web/packages/rpivotTable/). Have a look at this package's [vignette](https://cran.r-project.org/web/packages/rpivotTable/vignettes/rpivotTableIntroduction.html).
This example shows the sum of the "witches"" tried by country.
```{r pivot}
# Create an interactive pivot table using the rpivotTable function and package.
rpivotTable(witchdat,rows="country",vals="tried",aggregator="Sum")
```
**Challenge 1** Can you recreate the pivot table as a new R data frame using `dplyr` functions? (Hint ?group_by)
**Challenge 2** On the pivot table widget move the `country` above the table so the `country` is a pivot table column. Can you extend your challenge 1 answer with `dplyr` to recreate.
**Challenge 3** On the pivot table widget move `decade` and `country` as pivot table rows. Can you extend your challenge 1 answer with `dplyr` to recreate.
## Save the "spreadsheet"
*In Excel you can save your workbooks and spreadsheets. Which functions can we use in R?*
Here are some options in R, depending on what output is needed:
- The `base` R `write.csv` function to write to a csv file
- The `readr` package in the tidyverse, includes the function `write_csv`, which is faster than write.csv
- The `xlsx` package `write.xlsx` to write to an Excel workbook
```{r save}
# Save dataframe with write_csv from readr package
write_csv(witchdat,"witchdat.csv")
```
***
## Acknowledgments
Thanks to Marcia Ferreira @DrMarciaFe, Chris Beltz @BeltzEcology, Duncan Garmonsway @nacnudus and Ivan Leung @urganmax for reviewing and providing suggestions on this tutorial through Twitter.
If you have any further suggestions or corrections please raise an issue in [GitHub](https://github.com/kimnewzealand/R-tutorials)