-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy pathdata_processing_with_R_1.Rmd
527 lines (282 loc) · 15.7 KB
/
data_processing_with_R_1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
---
title: "Data Processing with R - part 1"
author: "John Pinney"
date: "January 2020"
output:
html_notebook:
css: rmd.css
number_sections: yes
toc: yes
toc_depth: 3
---
# Introduction
This workshop introduces the [tidyverse](https://www.tidyverse.org/) packages, which support data processing and visualisation in R.
All of these packages share the same concepts of data handling, which makes it easier to build workflows that move between different kinds of task:
- Reading tabular data (with `readr`)
- Tidying data (with `tidyr`)
- Transforming data (with `dplyr`)
- Data visualisation (with `ggplot2`)
- Applying statistical tests
This workshop assumes a basic familiarity with the R language, as covered in the introductory *R Programming* workshop.
## Getting started in RStudio
![](rconsole.png)
- Bottom left: console window (type commands here and see the output directly)
- Top left: editor window (for editing R notebooks and scripts)
- Top right: workspace / history window (examine the current workspace, import datasets, see commands entered previously)
- Bottom right: files / plots / packages / help window (change working directory, install packages, see graphics output, browse help)
## R notebooks
This document is an example of an *R notebook*, which combines text and code and makes it easy to embed R analysis within a report (ouput as HTML, PDF, Word document or presentation slides). This can help to make research more reproducible by allowing you to share an entire analysis workflow together with a narrative.
When you open the notebook's source code (*.Rmd* file) in RStudio, you can view and edit it in the editor window.
Text is formatted using the [R markdown](https://rmarkdown.rstudio.com/) notation, which is derived from [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet). This is a simple way to apply styling to text and indicate the structure of your document.
R code is included as *chunks*, which look like this:
```{r}
## R code lives here
print("Hello RStudio!")
```
When the cursor is inside a chunk, you can execute the code using *Ctrl-Shift-Enter*. The commands and output appear in the console window as if the chunk had been copy-pasted there. The output also appears in the notebook just after the chunk.
A notebook is a living document. You are encouraged to make use of this notebook to try out the example code, alter it, complete the exercises and add your own notes and code chunks. You can insert a new code chunk using the shortcut _Ctrl+Alt+I_.
## Working directory
Before you start work, please check your working directory:
```{r eval=FALSE}
getwd()
```
You can change the working directory using the files window in RStudio (bottom right).
Navigate to the directory containing the workshop data and click the cog icon for the "Set as Working Directory" option.
## About packages
Many useful R functions are developed by individuals and research groups and made available to the community as *packages*.
You can find a full list of R packages at the [CRAN](https://cran.r-project.org/) page.
The tidyverse packages are installed with:
```{r eval=FALSE}
install.packages("tidyverse")
```
and loaded into the workspace with:
```{r eval=FALSE}
library(tidyverse)
```
Note that this command loads all of the core tidyverse packages, but you can also load them individually as necessary.
## The data
Today's session will focus on data taken from the [GapMinder](https://www.gapminder.org/data/) project.
# Loading data (**readr**)
## Excel spreadsheets
To begin, we will use the [readxl](https://readxl.tidyverse.org/) package to load data from the relevant sheets in the Excel workbook.
This package is not part of the core tidyverse, so we need to load it directly:
```{r}
library(readxl)
```
The command `read_excel()` reads both xls and xlsx files and detects the format from the extension.
Today we will only use the sheet called `list-of-countries-etc` from this workbook.
```{r}
read_excel("data_geographies_v1.xlsx", sheet = "list-of-countries-etc")
```
RStudio shows you the output as a table.
Let's capture the loaded data using the variable `countries`:
```{r}
countries <- read_excel("data_geographies_v1.xlsx", sheet = "list-of-countries-etc")
```
We can get a quick overview of the data using the `str()` function:
```{r}
str(countries)
```
## Tibbles
Notice that as well the class `data.frame`, the object `countries` also belongs to the classes `tbl_df` and `tbl`.
This shows that the data has been loaded in the form of a [tibble](https://tibble.tidyverse.org/).
Tibbles are part of the tidyverse architecture. They are like data frames, but they do **less** (e.g., they don't attempt to coerce variables into different types) and complain **more** (e.g., when a variable does not exist). The idea is to force the programmer to deal with issues earlier, and so make it harder to write confusing or broken R code.
## CSV files
Let's look at another, related data set, this time loaded from a CSV file using `read_csv()` from the [readr](https://readr.tidyverse.org/) package:
```{r}
co2 <- read_csv("yearly_co2_emissions_1000_tonnes.csv")
co2
```
This is a straightforward numerical tibble which contains a lot of missing values.
The table shows annual CO2 emissions from burning fossil fuels for each country.
NB `read_csv()` is a different function to `read.csv()` from base R (which outputs a base `data.frame` rather than a `tbl_df`).
Tabular data in other formats (e.g. with tabs as delimiters, or using fixed-width fields) can also be read using other functions from `readr`.
# Tidying data (**tidyr**)
The tidyverse packages put a lot of emphasis on working with **tidy** data. What do we mean by that?
Tidy data has the three following attributes:
1. Every column is a variable
2. Every row is an observation (also known as a "case")
3. Every cell holds a single value
When data is tidy, we can visualise and analyse it more easily.
However, most of the data tables that you encounter "in the wild" will not be tidy by this definition, so the [tidyr](https://tidyr.tidyverse.org/) package provides functions to help reshape them into a tidy form.
Look at the `co2` tibble. What are the observations and what are the variables?
## gather()
We can use the `gather()` function to tidy the `co2` tibble:
```{r}
co2 <- gather(co2, -c(country,geo), key=year, value=kt, na.rm=TRUE)
co2
```
`gather()` works to **lengthen** the data table by collecting observations from multiple columns.
We specify the columns to use (all columns *except* the country name and geo code) and provide the names of two new variables, one to hold the old variable names ("year") and one to hold
the observations collected ("tonnes_per_person").
Notice that the tidyverse functions usually allow you to omit quotes when referring to column names.
By setting `na.rm=TRUE`, we discard the cells that did not contain observations.
## Changing data type
One complication here is that the `year` variable is shown as having a character data type. This is because the years have been derived from column names (strings) in the previous version of the table. Let's fix this before going any further:
```{r}
co2$year <- parse_integer(co2$year)
co2
```
## CSV without headers
Now that `co2` is in a tidy form, let's look at another example. `1997_stats.csv` is a CSV file containing GDP and population for various countries for the year 1997.
Actually this file is not in a correct CSV format, because it is missing a header row. You can open it in Excel to verify this. However, we can still load it using `read_csv()` as follows:
```{r}
stats97 <- read_csv("1997_stats.csv", col_names=FALSE)
stats97
```
You can see that the two columns in this file have been given the names `X1` and `X2`.
This file looks a bit harder to tidy. What are the variables in this data set and what are the observations?
## separate()
First, we need to split the `X1` data into two columns: one for the three-letter country code (the variable named `geo` in our `countries` tibble) and one for the type of measurement (GDP or population).
We can do this with the function `separate()`:
```{r}
stats97 <- separate(stats97, col=X1, sep="-", into=c("geo","measurement"))
stats97
```
Take a look at the command above. The arguments to `separate()` give the tibble to be processed, the name of the column to be split, the string to use as the delimiter, and a vector of strings giving the names for the resulting columns.
## spread()
Now we need to separate the GDP and population values into two different columns. This is done using the `spread()` function:
```{r}
stats97 <- spread(stats97, key=measurement, value=X2)
stats97
```
Notice that there are fewer rows than before; The `spread()` action **shortens** the length of the table.
Now each column is a variable, each row is an observation, and each cell is a single vale, so we have successfully tidied `stats97`.
The [readr/tidyr cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf) gives a useful summary of the tidyverse functions for loading and tidying data.
# Visualising data (**ggplot2**)
Having loaded some numerical data, a sensible first step is to visualise the distributions of variables to check for any issues.
Although there are many useful plotting functions available in base R, in this session we will focus on making plots using the tidyverse package `ggplot2`. This is a very powerful set of functions that support a consistent approach to data visualisation. `ggplot2` makes it much easier to create high-quality graphics for presentations and publications.
## Histogram
Let's start with a histogram for the GDP data from the `stats97` tibble:
```{r}
stats97
```
Some of the GDP values are missing, so let's remove those rows as a first step, using `drop_na()` from `readr`:
```{r}
gdp97 <- drop_na(stats97, gdp)
gdp97
```
```{r}
a <- ggplot(gdp97, aes(x=gdp))
hist <- a + geom_histogram(bins=100)
hist
```
Notice that there are two steps to creating the plot.
The first line constructs the data space by specifying the dataset (`gdp97`) and the variable(s) of interest
(in this case, just `gdp`). The `aes()` function controls the *aesthetic mappings* used by `ggplot2`, i.e. the way in which the data will be mapped to visual elements.
In the second step, we add in a histogram *geom* that paints in the histogram bars and constructs the y-axis. Geoms are the actual visual marks that
encode the data. `ggplot2` provides geoms for all the commonly-used plot types, but also graphical primitives such as curves, lines and polygons, from which other visualisations can be built.
The `ggplot2` defaults for axes, labels etc. are usually informative enough for data exploration, but of course everything can be modified if necessary.
For example, to add at title, change the histogram colour and show population on a log scale:
```{r}
hist <- a + geom_histogram(bins=20, fill="blue", alpha=0.5) + labs(x="GDP", title="Countries in 1997") + scale_x_log10()
hist
```
## Saving plots to file
We can save the last plot made to a file using
```{r}
ggsave("my_histogram.pdf")
```
The file type is determined by the file extension and the size of the image can be changed using `width` and `height` options.
## Violin plot
The same data can be plotted using the `geom_violin()` geom. This time we will map `gdp` to the y coordinate:
```{r}
ggplot(gdp97, aes(y=gdp, x="")) + geom_violin(fill="blue", alpha=0.5, linetype=0) + scale_y_log10()
```
## Scatter plot
We can visualise covariation between variables using a scatter plot, for example GDP vs population. This uses `geom_point()`:
```{r}
ggplot(gdp97, aes(x=pop, y=gdp)) +
geom_point() +
scale_x_log10() +
scale_y_log10()
```
The [ggplot2 cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf) gives much more information about plotting options.
# Manipulating data (**dplyr**)
The [dplyr]() package deals with many types of data manipulation that are needed as part of any analysis workflow.
## mutate()
It might be more useful to compare countries' GDP on a per-capita basis. We need to make a new variable to show per-capita GDP.
To do this, we will use the `mutate()` function, which adds new columns to the tibble:
```{r}
gdp97 <- mutate(gdp97, gdp_pp=gdp/pop)
gdp97
```
Note that like all dplyr functions (and despite its name), the `mutate()` function does not actually change the original tibble used as input, so we still need to capture the output using `<-`.
#### Exercise {-}
Visualise the distribution of the new variable `gdp_pp`.
```{r}
```
## Selecting and filtering data
### filter()
With tidy data, it is easy to filter cases according to whatever conditions we specify, e.g.:
```{r}
filter(gdp97, gdp_pp > 30000, pop < 1000000)
```
### select()
We can extract a subset of variables using select(), e.g.:
```{r}
select(gdp97, c(geo,gdp_pp))
```
### pull()
If you just want the values from a single column, use `pull()` to extract them as a vector:
```{r}
pull(gdp97, geo)
```
### arrange()
Use `arrange()` to sort a tibble by ascending column value:
```{r}
arrange(gdp97, pop)
```
...or by descending value using `desc()`:
```{r}
arrange(gdp97, desc(pop))
```
The [dplyr cheat sheet](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf) gives useful summaries of these and other functions for data manipulation.
#### Exercise {-}
Use manipulations of the `countries` tibble to complete the following tasks:
1. Find the Asian countries that are south of the equator.
```{r}
```
2. Find the first five African countries to join the UN.
```{r}
```
3. Make a vector of OECD country names, sorted from East to West.
```{r}
```
#### Exercise {-}
Starting with the `co2` tibble, plot the annual emissions of a country of your choice.
*Hint*: use the `geom_line()` geom.
```{r}
```
## Joining tables
To compare emissions between countries in a fair way, it would make sense to convert them to a per-capita basis.
Let's start with the figures for 1997 to see how this can be done.
First we will make a new tibble containing only the 1997 emissions:
```{r}
co2_1997 <- filter(co2, year==1997)
co2_1997
```
However, the population data is not yet in the co2 tibble, so we will need to look it up from another tibble by matching the country name.
This type of **relational data**, where information must be collected from multiple tables, requires careful handling to make sure that rows in different tables are correctly associated with each other. The country name acts as a **key** to unlock the correct data from the associated table.
The relevant population data is in the stats97 table:
```{r}
stats97
```
This is indexed by the `geo` code for each country.
Taking the `co2_1997` data, we apply a `left_join()` to relate its `geo` variable to the `geo` variable in `stats97`:
```{r}
co2_1997 <- left_join(co2_1997, stats97[,c("geo","pop")], by="geo")
co2_1997
```
For every row in the original table, `left_join()` tries to match its `country` with a `name` in `stats97`. Notice that we have specified only the columns that we need from `stats97`.
The resulting table imports the additional column `pop` from the `stats97` tibble, so now we can associate each country and year with the correct CO2 emissions and population.
`left_join()` is just one of several `dplyr` functions for working with relational data.
You can read more about relational data in the [R for Data Science](https://r4ds.had.co.nz/relational-data.html) online textbook.
#### Exercise {-}
Calculate the per-capita emissions for 1997 as a new column in `co2_1997` and plot these on a histogram.
```{r}
```
Which country emitted the most CO2 per capita in 1997? Which emitted the least?
```{r}
```
---