-
Notifications
You must be signed in to change notification settings - Fork 1.7k
/
Copy pathIntroduction.Rmd
285 lines (207 loc) · 17.7 KB
/
Introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
\mainmatter
```{r, include = FALSE}
source("common.R")
```
# Introduction
I have now been programming in R for over 15 years, and have been doing it full-time for the last five years. This has given me the luxury of time to examine how the language works. This book is my attempt to pass on what I've learned so that you can understand the intricacies of R as quickly and painlessly as possible. Reading it will help you avoid the mistakes I've made and dead ends I've gone down, and will teach you useful tools, techniques, and idioms that can help you to attack many types of problems. In the process, I hope to show that, despite its sometimes frustrating quirks, R is, at its heart, an elegant and beautiful language, well tailored for data science.
## Why R?
If you are new to R, you might wonder what makes learning such a quirky language worthwhile. To me, some of the best features are:
* It's free, open source, and available on every major platform. As a result, if
you do your analysis in R, anyone can easily replicate it, regardless of
where they live or how much money they earn.
* R has a diverse and welcoming community, both online (e.g.
[the #rstats twitter community][rstats-twitter]) and in person (like the
[many R meetups][r-meetups]). Two particularly inspiring community groups are
[rweekly newsletter][rweekly] which makes it easy to keep up to date with
R, and [R-Ladies][r-ladies] which has made a wonderfully welcoming community
for women and other minority genders.
* A massive set of packages for statistical modelling, machine learning,
visualisation, and importing and manipulating data. Whatever model or
graphic you're trying to do, chances are that someone has already tried
to do it and you can learn from their efforts.
* Powerful tools for communicating your results. [RMarkdown][rmarkdown] makes
it easy to turn your results into HTML files, PDFs, Word documents,
PowerPoint presentations, dashboards and more. [Shiny][shiny] allows you to
make beautiful interactive apps without any knowledge of HTML or javascript.
* RStudio, [the IDE](http://www.rstudio.com/ide/), provides an integrated
development environment, tailored to the needs of data science, interactive
data analysis, and statistical programming.
* Cutting edge tools. Researchers in statistics and machine learning will often
publish an R package to accompany their articles. This means immediate
access to the very latest statistical techniques and implementations.
* Deep-seated language support for data analysis. This includes features
like missing values, data frames, and vectorisation.
* A strong foundation of functional programming. The ideas of functional
programming are well suited to the challenges of data science, and the
R language is functional at heart, and provides many primitives needed
for effective functional programming.
* RStudio, [the company](https://www.rstudio.com), which makes money by
selling professional products to teams of R users, and turns around and
invests much of that money back into the open source community (over 50%
of software engineers at RStudio work on open source projects). I work for
RStudio because I fundamentally believe in its mission.
* Powerful metaprogramming facilities. R's metaprogramming capabilities allow
you to write magically succinct and concise functions and provide an excellent
environment for designing domain-specific languages like ggplot2, dplyr,
data.table, and more.
* The ease with which R can connect to high-performance programming languages
like C, Fortran, and C++.
Of course, R is not perfect. R's biggest challenge (and opportunity!) is that most R users are not programmers. This means that:
* Much of the R code you'll see in the wild is written in haste to solve
a pressing problem. As a result, code is not very elegant, fast, or easy to
understand. Most users do not revise their code to address these shortcomings.
* Compared to other programming languages, the R community is more focussed on
results than processes. Knowledge of software engineering best practices is
patchy. For example, not enough R programmers use source code control or
automated testing.
* Metaprogramming is a double-edged sword. Too many R functions use
tricks to reduce the amount of typing at the cost of making code that
is hard to understand and that can fail in unexpected ways.
* Inconsistency is rife across contributed packages, and even within base R.
You are confronted with over 25 years of evolution every time you use R,
and this can make learning R tough because there are so many special cases to
remember.
* R is not a particularly fast programming language, and poorly written R code
can be terribly slow. R is also a profligate user of memory.
Personally, I think these challenges create a great opportunity for experienced programmers to have a profound positive impact on R and the R community. R users do care about writing high quality code, particularly for reproducible research, but they don't yet have the skills to do so. I hope this book will not only help more R users to become R programmers, but also encourage programmers from other languages to contribute to R.
## Who should read this book {#who-should-read}
This book is aimed at two complementary audiences:
* Intermediate R programmers who want to dive deeper into R, understand how
the language works, and learn new strategies for solving diverse problems.
* Programmers from other languages who are learning R and want to understand
why R works the way it does.
To get the most out of this book, you'll need to have written a decent amount of code in R or another programming language. You should be familiar with the basics of data analysis (i.e. data import, manipulation, and visualisation), have written a number of functions, and be familiar with the installation and use of CRAN packages.
This book walks the narrow line between being a reference book (primarily used for lookup), and being linearly readable. This involves some tradeoffs, because it's difficult to linearise material while still keeping related materials together, and some concepts are much easier to explain if you're already familiar with specific technical vocabulary. I've tried to use footnotes and cross-references to make sure you can still make sense even if you just dip your toes in a chapter.
## What you will get out of this book {#what-you-will-get}
This book delivers the knowledge that I think an advanced R programmer should possess: a deep understanding of the fundamentals coupled with a broad vocabulary that means that you can tactically learn more about a topic when needed.
After reading this book, you will:
* Be familiar with the foundations of R. You will understand complex data types
and the best ways to perform operations on them. You will have a deep
understanding of how functions work, you'll know what environments are, and
how to make use of the condition system.
* Understand what functional programming means, and why it is a useful tool for
data science. You'll be able to quickly learn how to use existing tools, and
have the knowledge to create your own functional tools when needed.
* Know about R's rich variety of object-oriented systems. You'll be most
familiar with S3, but you'll know of S4 and R6 and where to look for more
information when needed.
* Appreciate the double-edged sword of metaprogramming. You'll be able to
create functions that use tidy evaluation, saving typing and creating elegant
code to express important operations. You'll also understand the dangers
and when to avoid it.
* Have a good intuition for which operations in R are slow or use a lot of
memory. You'll know how to use profiling to pinpoint performance
bottlenecks, and you'll know enough C++ to convert slow R functions to
fast C++ equivalents.
## What you will not learn
This book is about R the programming language, not R the data analysis tool. If you are looking to improve your data science skills, I instead recommend that you learn about the [tidyverse](https://www.tidyverse.org/), a collection of consistent packages developed by me and my colleagues. In this book you'll learn the techniques used to develop the tidyverse packages; if you want to instead learn how to use them, I recommend [_R for Data Science_](http://r4ds.had.co.nz/).
If you want to share your R code with others, you will need to make an R package. This allows you to bundle code along with documentation and unit tests, and easily distribute it via CRAN. In my opinion, the easiest way to develop packages is with [devtools](http://devtools.r-lib.org), [roxygen2](http://roxygen2.r-lib.org/), [testthat](http://testthat.r-lib.org), and [usethis](http://usethis.r-lib.org). You can learn about using these packages to make your own package in [_R packages_](http://r-pkgs.had.co.nz/).
## Meta-techniques {#meta-techniques}
There are two meta-techniques that are tremendously helpful for improving your skills as an R programmer: reading source code and adopting a scientific mindset.
Reading source code is important because it will help you write better code. A great place to start developing this skill is to look at the source code of the functions and packages you use most often. You'll find things that are worth emulating in your own code and you'll develop a sense of taste for what makes good R code. You will also see things that you don't like, either because its virtues are not obvious or it offends your sensibilities. Such code is nonetheless valuable, because it helps make concrete your opinions on good and bad code.
A scientific mindset is extremely helpful when learning R. If you don't understand how something works, you should develop a hypothesis, design some experiments, run them, and record the results. This exercise is extremely useful since if you can't figure something out and need to get help, you can easily show others what you tried. Also, when you learn the right answer, you'll be mentally prepared to update your world view.
## Recommended reading {#recommended-reading}
Because the R community mostly consists of data scientists, not computer scientists, there are relatively few books that go deep in the technical underpinnings of R. In my personal journey to understand R, I've found it particularly helpful to use resources from other programming languages. R has aspects of both functional and object-oriented (OO) programming languages. Learning how these concepts are expressed in R will help you leverage your existing knowledge of other programming languages, and will help you identify areas where you can improve.
To understand why R's object systems work the way they do, I found _The Structure and Interpretation of Computer Programs_[^SICP] [@SICP] (SICP) to be particularly helpful. It's a concise but deep book, and after reading it, I felt for the first time that I could actually design my own object-oriented system. The book was my first introduction to the encapsulated paradigm of object-oriented programming found in R, and it helped me understand the strengths and weaknesses of this system. SICP also teaches the functional mindset where you create functions that are simple individually, and which become powerful when composed together.
[^SICP]: You can read it online for free at <https://mitpress.mit.edu/sites/default/files/sicp/full-text/book/book.html>
To understand the trade-offs that R has made compared to other programming languages, I found _Concepts, Techniques and Models of Computer Programming_ [@ctmcp] extremely helpful. It helped me understand that R's copy-on-modify semantics make it substantially easier to reason about code, and that while its current implementation is not particularly efficient, it is a solvable problem.
If you want to learn to be a better programmer, there's no place better to turn than _The Pragmatic Programmer_ [@pragprog]. This book is language agnostic, and provides great advice for how to be a better programmer.
## Getting help {#getting-help}
\index{help}
\index{reprex}
Currently, there are three main venues to get help when you're stuck and can't figure out what's causing the problem: [RStudio Community](https://community.rstudio.com/), [StackOverflow](http://stackoverflow.com) and the [R-help mailing list][r-help]. You can get fantastic help in each venue, but they do have their own cultures and expectations. It's usually a good idea to spend a little time lurking, learning about community expectations, before you put up your first post.
Some good general advice:
* Make sure you have the latest version of R and of the package (or packages)
you are having problems with. It may be that your problem is the result of
a recently fixed bug.
* Spend some time creating a **repr**oducible **ex**ample, or reprex.
This will help others help you, and often leads to a solution without
asking others, because in the course of making the problem reproducible you
often figure out the root cause. I highly recommend learning and using
the [reprex](https://reprex.tidyverse.org/) package.
<!-- GVW: is someone going to go through once you're done and create a glossary? If you've flagged things like "reprex" in bold, it ought to be easy to find terms. -->
If you are looking for specific help solving the exercises in this book, solutions from Malte Grosser and Henning Bumann are available at <https://advanced-r-solutions.rbind.io>.
## Acknowledgments {#intro-ack}
I would like to thank the many contributors to R-devel and R-help and, more recently, Stack Overflow and RStudio Community. There are too many to name individually, but I'd particularly like to thank Luke Tierney, John Chambers, JJ Allaire, and Brian Ripley for generously giving their time and correcting my countless misunderstandings.
This book was [written in the open](https://github.com/hadley/adv-r/), and chapters were advertised on [twitter](https://twitter.com/hadleywickham) when complete. It is truly a community effort: many people read drafts, fixed typos, suggested improvements, and contributed content. Without those contributors, the book wouldn't be nearly as good as it is, and I'm deeply grateful for their help. Special thanks go to Jeff Hammerbacher, Peter Li, Duncan Murdoch, and Greg Wilson, who all read the book from cover-to-cover and provided many fixes and suggestions.
```{r, eval = FALSE, echo = FALSE}
library(tidyverse)
contribs_all_json <- gh::gh("/repos/:owner/:repo/contributors",
owner = "hadley",
repo = "adv-r",
.limit = Inf
)
contribs_all <- tibble(
login = contribs_all_json %>% map_chr("login"),
n = contribs_all_json %>% map_int("contributions")
)
contribs_old <- read_csv("contributors.csv", col_types = list())
contribs_new <- contribs_all %>% anti_join(contribs_old, by = "login")
# Get info for new contributors
needed_json <- map(
contribs_new$login,
~ gh::gh("/users/:username", username = .x)
)
info_new <- tibble(
login = contribs_new$login,
name = map_chr(needed_json, "name", .default = NA),
blog = map_chr(needed_json, "blog", .default = NA)
)
info_old <- contribs_old %>% select(login, name, blog)
info_all <- bind_rows(info_old, info_new)
contribs_all <- contribs_all %>%
left_join(info_all, by = "login") %>%
arrange(login)
write_csv(contribs_all, "contributors.csv")
```
```{r, results = "asis", echo = FALSE, message = FALSE}
library(dplyr)
contributors <- read.csv("contributors.csv", stringsAsFactors = FALSE)
contributors <- contributors %>%
filter(login != "hadley") %>%
mutate(
login = paste0("\\@", login),
desc = ifelse(is.na(name), login, paste0(name, " (", login, ")"))
)
cat("A big thank you to all ", nrow(contributors), " contributors (in alphabetical order by username): ", sep = "")
cat(paste0(contributors$desc, collapse = ", "))
cat(".\n")
```
## Conventions {#conventions}
Throughout this book I use `f()` to refer to functions, `g` to refer to variables and function parameters, and `h/` to paths.
Larger code blocks intermingle input and output. Output is commented (`#>`) so that if you have an electronic version of the book, e.g., <https://adv-r.hadley.nz/>, you can easily copy and paste examples into R.
Many examples use random numbers. These are made reproducible by `set.seed(1014)`, which is executed automatically at the start of each chapter.
\newpage
## Colophon {#colophon}
This book was written in [bookdown](http://bookdown.org/) inside [RStudio](http://www.rstudio.com/ide/). The [website](https://adv-r.hadley.nz/) is hosted with [netlify](http://netlify.com/), and automatically updated after every commit by [travis-ci](https://travis-ci.org/). The complete source is available from [GitHub](https://github.com/hadley/adv-r). Code in the printed book is set in [inconsolata](http://levien.com/type/myfonts/inconsolata.html). Emoji images in the printed book come from the open-licensed [Twitter Emoji](https://github.com/twitter/twemoji).
This version of the book was built with `r R.version.string` and the following packages.
```{r, echo = FALSE, results="asis"}
deps <- desc::desc_get_deps()$package[-1]
pkgs <- sessioninfo::package_info(deps, dependencies = FALSE)
df <- tibble(
package = pkgs$package,
version = pkgs$ondiskversion,
source = gsub("@", "\\\\@", pkgs$source)
)
knitr::kable(df, format = "markdown")
```
```{r, include = FALSE}
ruler <- function(width = getOption("width")) {
x <- seq_len(width)
y <- case_when(
x %% 10 == 0 ~ as.character((x %/% 10) %% 10),
x %% 5 == 0 ~ "+",
TRUE ~ "-"
)
cat(y, "\n", sep = "")
cat(x %% 10, "\n", sep = "")
}
ruler()
```
[r-help]: https://stat.ethz.ch/mailman/listinfo/r-help
[rstats-twitter]: https://twitter.com/search?q=%23rstats
[r-meetups]: https://www.meetup.com/topics/r-programming-language/
[rweekly]: https://rweekly.org
[r-ladies]: http://r-ladies.org
[rmarkdown]: https://rmarkdown.rstudio.com
[shiny]: http://shiny.rstudio.com