---
title: "Chapter 4 - Tibbles and dplyr"
author: "Government Analysis Function and ONS Data Science Campus"
engine: knitr
execute:
  echo: true
  eval: false
  freeze: auto # re-render only when source changes
---
> To switch between light and dark modes, use the toggle in the top left
# Learning Objectives
* Understand the importance of clean variable names.
* Be able to clean column names using the janitor package.
* Understand the use of the pipe operator.
* Be able to sort data with dplyr's **arrange** verb.
* Be able to select data with dplyr's **select** verb.
* Be able to filter data with dplyr's **filter** verb.
* Be able to transform data with dplyr's **mutate** verb.
* Be able to join datasets together.
# Packages and Data
Remember, the first steps when starting a new script are:
* Load in the packages required for the work.
* Read in datasets required and assign them to a variable in memory.
## Exercise
::: {.panel-tabset}
### **Exercise**{-}
1. Load the following packages:
* Tidyverse
* janitor
2. Read in the **titanic.csv** file and assign it to the name "titanic_data". Remember to assign null values properly (as in Chapter 3) using the "na" parameter.
* Remember that you are in your R project, which sets the working directory to be inside the **Course_content** folder.
3. Have a glimpse of your dataset.
### **Show Answer**{-}
```{r message = FALSE, warning = FALSE}
# Load packages
library(tidyverse)
library(janitor)
# Read in titanic.csv and set null values to be specific symbols
titanic_data <- read_csv("Data/titanic.csv",
na = c("*", ".", "", "NULL"))
# Have a peek
glimpse(titanic_data)
```
:::
As a reminder, in the titanic dataset our columns are:
* **Pclass**: Passenger’s class, 1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower)
* **Survived**: Survived (1) or died (0)
* **Name**: Passenger’s name
* **Sex**: Passenger’s sex
* **Age of Passenger**: Passenger’s age
* **SibSp**: Number of siblings/spouses aboard (excluding the person)
* **Parch**: Number of parents/children aboard (excluding the person)
* **Ticket**: Ticket number
* **Fare**: Fare
* **Cabin**: Cabin number
* **Embarked**: Port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
We can see more details on the [Data Dictionary](https://www.kaggle.com/c/titanic/data)
# Column Names
In the previous session we stated that every column in a tibble is a variable and that it is good practice not to have spaces within variable names, as spaces make it harder for us to call on the variables when we need to use them.
When you enter data in Excel, you most often don’t think too much about what you call each column. After all, you just label them once and as long as they are documented, this isn't given too much thought.
When you are working with variables in R though, you need to type the name of each variable, every time you want to work with it. So, it makes sense to make your column names as simple, but meaningful as possible.
## Returning columns by name
In base R, to call a column by name and return its contents as a single vector (remember, each column in a tibble is a vector) we use the dollar sign **$** operator.
You will notice the list of column names will pop up and you can move through them with arrow keys and select the one you want.
### Example
Let's return the column "Pclass".
```{r}
# Return Pclass with base R
titanic_data$Pclass
```
This returns the entire vector (usually up to 1000 entries) so it would be useful to use **glimpse()** or other inspection functions for a sense check.
```{r}
# Return Pclass and glimpse
glimpse(titanic_data$Pclass)
```
This can already prove frustrating, as we need to remember the capital letters, particularly if the autocomplete is slow.
However, it can get worse if spaces are included in the column name.
### Example - Returning column with spaces
Let's take the "name of Passenger" column and try to return it without the auto-complete.
```{r, eval = FALSE}
titanic_data$name Of Passenger
```
This will throw an error because spaces are not allowed in bare names: R treats the space as the end of the name and cannot parse the rest of the line.
To get around this we enclose column names with spaces in backticks **\` \`** and you will notice that the autocomplete does the same.
This allows the entire column name to be read as one entity.
```{r}
# Selecting a column with spaces in the names
glimpse(titanic_data$`name Of Passenger`)
```
Whilst this works, it is bad practice to use capitalisation and spaces, as it complicates things for us as well as others we collaborate with.
### A word of warning
Whilst this is completely fine in Base R, there are unexpected consequences of using this technique, particularly when making changes to a column using assignment.
* With the tidyverse, when we use its myriad of functions, we can check manipulations on the data without overwriting the underlying variable unless we **explicitly** ask it to do so with **<-**.
* However, with the **$** method, we can overwrite a column for example and this will **implicitly** and permanently change the underlying variable, which we must be careful of.
Should you make a mistake this way, you would have to read the original data back in and start from scratch, as recreating overwritten data is a nightmare.
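As a quick sketch of the danger (not run here, and assuming the raw fare column is simply called Fare, as in the data dictionary above): overwriting a column with **$** changes the data immediately, and the only way back is to re-read the file.
```{r, eval = FALSE}
# Overwriting a column with $ changes titanic_data immediately and permanently
titanic_data$Fare <- titanic_data$Fare * 2
# The only way back is to re-read the original file from disk
titanic_data <- read_csv("Data/titanic.csv",
                         na = c("*", ".", "", "NULL"))
```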
### Accessing column names
We can see the column names by using the "names()" function to print a character vector of the column names.
```{r}
# Getting the column names using the names function
names(titanic_data)
```
We will need to do some work on these to remove the use of capitalisation and spaces.
## Cleaning Column Names
### The Janitor Package
The **janitor** package offers many functions used to manipulate data, such as finding duplicates. In this chapter we will use it to clean column names.
The function to use is called "clean_names()" and automatically formats the column names as **snake_case**, but this can be altered with an additional parameter.
```{r}
# Clean the column names and overwrite the variable
titanic_data <- clean_names(titanic_data)
# Getting the column names of the dataset
names(titanic_data)
```
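As a sketch of that additional parameter: clean_names() accepts a "case" argument, so something like the following should produce a different naming style (shown for illustration only; we keep the default snake_case for the rest of the chapter).
```{r, eval = FALSE}
# Illustration only: clean_names() can produce other naming styles
# via its case argument, e.g. UpperCamel
names(clean_names(titanic_data, case = "upper_camel"))
```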
# The Data Manipulation Package dplyr
This is one of the most powerful packages in the **tidyverse**, which makes data manipulation simple and code easy to read.
We will look at how to perform the following actions:
1. arrange/sort
2. select
3. filter
4. mutate
5. joining data
The aim of the package is to provide a function for each basic **verb** of data manipulation, which is why these functions are referred to as verbs in the documentation.
Each of the verbs have the same structure:
> **verb(.data, info, ....)**, note the full stop, which is syntax that lets us reference variables from the dataset (and enables auto-completion too!)
and the [cheat sheet](https://nyu-cdsc.github.io/learningr/assets/data-transformation.pdf) is incredibly useful for a reference piece.
Before we jump into the verbs, let's see dplyr's version of renaming columns!
## Rename
We may wish to remain within the tidyverse when cleaning column names (say, janitor is not available to you), which invites the use of **rename()**.
This allows you to change column names one at a time using the following syntax:
> **rename(.data, new_name = old_name)**
As an example, let's rename the age_of_passenger column to simply "age".
```{r}
# Rename the age_of_passenger column to age
rename(.data = titanic_data,
age = age_of_passenger)
```
If we glimpse the data again:
```{r}
# Take a peek at the titanic data
glimpse(titanic_data)
```
We see that age has not carried through to the underlying dataset.
This is an excellent feature of the tidyverse, in that if we do not:
* Directly overwrite the variable
* Or create a new one using assignment **<-**
then it just shows us what that process will do, so we can then make the decision on whether it's what we want.
### Renaming multiple columns
One of the best things about dplyr's functions is that the .data argument clues the function in to the columns themselves, which means we can just keep listing them, or making changes to them, without wrapping them in a vector with **c()**, as many other libraries require.
As an example, let's create a new variable and rename the "of_passenger" columns.
```{r}
# Rename of_passenger columns and create a new variable
titanic_renamed <- rename(.data = titanic_data,
name = name_of_passenger,
sex = sex_of_passenger,
age = age_of_passenger) # Notice autocompletion
glimpse(titanic_renamed)
```
We also didn't need to give them in their original order; the function automatically knows which columns we are referring to in the underlying dataset.
There is much more we can say about the structure of these functions, but we will see this as we proceed through the verbs themselves.
## Arrange
Our data is displayed in the same order as the source data.
We may want to sort our data based on specific columns.
To do so, we use the verb **arrange()**.
### Example - Single column sort
From here, we will only create a new variable or overwrite the existing one when a change we wish to permanently keep is performed.
We highly encourage you to do the same, as creating multiple new variables you will never use again will clutter up your environment.
```{r}
# Sort titanic by fare
arrange(.data = titanic_data,
fare)
```
We get a larger output here that does not show us all columns at a glance, so checking that the outcome is what we expect is more difficult.
We should instead use **glimpse()** to check the output, which needs to be written first in the code.
```{r}
# Sort by fare and then glimpse
glimpse(arrange(.data = titanic_data,
fare))
```
Notice that by default, arrange() sorted the fare column in **ascending** order.
### **Small Aside - Functions**{-}
Notice that glimpse() is **written first** but **executed last**, since R evaluates functions from the inside out.
After all, if glimpse() has no input, we can't see anything at all!
Keep this in mind as it will become very important later.
### Example - Descending Order Sort
To sort a column in descending order, we use the **desc()** function and the column name as its input.
```{r}
# Sort titanic in descending order
glimpse(arrange(.data = titanic_data,
desc(fare)))
```
### Example - Multi-Column Sort
We can also sort by multiple columns, but this creates a chain of **dependence**, in that the first column sort is maintained before the second sort is performed.
This means that the second sort won't change the order from the first sort.
* This becomes useful when there are numerous instances of the same value, such as the 0.0000 values in the fare column.
* Whilst the position of the 0.0000 rows will not change, the second column will be sorted within those tied rows. Some of those paying £0 fare may have been younger than others, after all.
```{r}
# Multi-column sort
glimpse(arrange(.data = titanic_data,
fare,
age_of_passenger))
```
Notice that we were able to just continually reference columns from the data as if they were arguments to the function, and the verb understands that immediately, just like with rename().
### Exercise
::: {.panel-tabset}
### **Exercise**{-}
Sort the titanic data set by age in **descending** order, then fare in **ascending** order.
Glimpse the result.
### **Show Answer**{-}
```{r}
# Sort by age desc, fare asc
glimpse(arrange(.data = titanic_data,
desc(age_of_passenger),
fare))
```
:::
## Select
Sometimes we will want to work with smaller tibbles that contain just a subset of available columns.
The **select()** verb is perfect for this, and it takes the arguments:
* The first is our dataset, made even simpler with ".data"
* From here we list as many columns by name as we would like to be retained in our selecting process.
* The order in which we specify them is the order they will be in the smaller tibble.
### Example - Single Column
```{r}
# Selecting data
glimpse(select(.data = titanic_data,
name_of_passenger))
```
Notice that this has returned a tibble with one column, not the underlying vector itself that **$** would have returned.
### Example - Multiple Columns
If we want to select multiple columns, we can continue to list them, separating with commas, as we did with **arrange()**.
```{r}
# Selecting data
glimpse(select(.data = titanic_data,
name_of_passenger,
age_of_passenger,
pclass))
```
### Example - Columns in a range
We can select consecutive columns (next to one another) with the syntax we used for indexing in chapter 2, the colon **:** operator.
```{r}
# Selecting from passenger class to the age of passenger in order
glimpse(select(.data = titanic_data,
pclass:age_of_passenger))
```
Notice that we return the columns from pclass to age_of_passenger in the order of the source data.
### Selecting with exclusion
Up to this point we have selected with **inclusion**, i.e. we specify the columns we want to include in our smaller tibble.
The real power of select comes from its flexibility: we can reduce the amount of code to write by using **exclusion** instead, i.e. specifying the columns we don't want to include.
Say we want to keep 7 of 9 columns, instead of typing out the names of all 7, we can just exclude the 2 we don't want instead.
### **Example**{-}
To exclude, we use the minus sign operator **-** which signifies "do not select this" or "select, not this".
```{r}
# Selecting by excluding columns we don't want
glimpse(select(.data = titanic_data,
-name_of_passenger,
-age_of_passenger,
-pclass))
```
There is an even more streamlined way to do this, by wrapping the columns in a vector with the **c()** function and using one minus sign outside it.
You can think of this as expanding brackets in mathematics:
> \-c(col1, col2) = c(\-col1, \-col2) = \-col1, \-col2
```{r}
# Using a vector for exclusion
glimpse(select(.data = titanic_data,
-c(name_of_passenger,
age_of_passenger,
pclass)))
```
### Selecting with Index Position
We can also select the columns using their index position, starting from 1, just like we did with data structures in chapter 2.
Let's select columns 1 to 4, and also column 7.
```{r}
# Selecting sequential and out of sequence columns with index position
glimpse(select(.data = titanic_data,
1:4,
7))
```
Exclusion works similarly here with the minus sign.
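For example, a brief sketch of excluding by position (here, dropping the first four columns):
```{r}
# Excluding columns 1 to 4 by index position
glimpse(select(.data = titanic_data,
               -(1:4)))
```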
Note that whilst this may be required in some cases, it is usually better to be **explicit** with the exact column names, if possible.
### Select Helper Functions
As if **select()** wasn't already helpful enough, it even has helper functions that allow us to select on specific patterns, such as a prefix or suffix in a column name.
They are as follows:
* starts_with(match): Starts with a prefix.
* ends_with(match): Ends with a suffix.
* contains(match): Contains a literal string.
* matches(match): Matches a [regular expression](https://www.regular-expressions.info/quickstart.html).
* everything(): Selects all variables.
* last_col(): Selects the last column.
These can be passed instead of column names **and** alongside selected column names as well.
### **Examples**{-}
Let's return columns that start with an "s".
```{r}
# Selecting columns
glimpse(select(.data = titanic_data,
starts_with("s")))
```
As a second example, let's return everything.
```{r}
# Selecting everything
glimpse(select(.data = titanic_data,
everything()))
```
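We can also mix helpers with named columns in the same call; for instance, a quick sketch keeping pclass alongside every column whose name contains "passenger":
```{r}
# Combining a named column with a helper function
glimpse(select(.data = titanic_data,
               pclass,
               contains("passenger")))
```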
### Exercise
::: {.panel-tabset}
### **Exercise**{-}
1. Select the second, third and fourth columns from titanic_data, without typing all three.
2. Select all columns except "fare", "cabin" and "embarked" from titanic_data. Note that these are consecutive.
3. Select just the last column from titanic_data using a helper function.
4. Select the columns that end in the suffix "passenger" using a helper function.
### **Show Answer**{-}
1.
```{r}
# Select second, third and fourth column
glimpse(select(.data = titanic_data,
survived:sex_of_passenger))
```
2.
```{r}
# Using exclusion on a range
glimpse(select(.data = titanic_data,
-fare:-embarked))
# OR
# glimpse(select(.data = titanic_data,
# -c(fare, cabin, embarked)))
# OR
# glimpse(select(.data = titanic_data,
# -fare,
# -cabin,
# -embarked))
```
3.
```{r}
# Selecting last column only
glimpse(select(.data = titanic_data,
last_col()))
```
4.
```{r}
# Selecting on a suffix
glimpse(select(.data = titanic_data,
ends_with("passenger")))
```
:::
## The Pipe Operator
Up until now, we have run verbs one at a time on our dataset, be it to sort or select columns.
The real power of the tidyverse comes from the ability to chain these functions together in a sequence for more complex data manipulation tasks.
However, out of the box, this becomes laborious quickly due to the nature of functions.
### Returning to Functions - Composition
As we saw earlier with glimpse(), to apply several functions in sequence to a single object, known as **composition** of functions, we must nest them: the function applied last is written first, and the expression is evaluated from the inside to the outside.
### **Example**{-}
For example, let's take the **sqrt()** and **round()** mathematical functions and apply them together, so that we obtain the square root of a number and hence round it to a specified number of decimal places.
```{r}
# Compose mathematical functions
round(sqrt(2))
```
Notice that the functions are evaluated in reverse order (from the inside, out), just as you would when writing this mathematically. You would perform the square root operation first, **then** round the result.
Should we wish to use an extra parameter, we need to start thinking about the brackets, as they are the opening and closing doors of a function itself.
Thus, the **digits** parameter needs to go inside the round() brackets, not the sqrt() ones.
```{r}
# Compose mathematical functions with extra parameters
round(sqrt(2), digits = 2)
```
Notice that the input to the function round() was the output from the sqrt() function.
### Composition of functions in the tidyverse
This becomes tricky to comprehend when working with the tidyverse functions, as we are effectively performing actions in sequence, such as selecting the columns we want and **then** sorting the result.
As such, the output of one function is the input to another:
```{r}
# Select passenger columns and fare then arrange by fare
glimpse(arrange(.data = select(.data = titanic_data,
ends_with("passenger"),
fare),
desc(fare)))
```
Notice that the input to arrange was the result of a select operation, so we had to keep track of our brackets and ensure that the desc(fare) was placed within arrange(), **not** select().
Going forward, we will see more verbs that can be applied on top of this, and it becomes difficult to manage.
This is where the pipe operator comes in.
### Introducing the Pipe
The Pipe Operator makes it possible to chain a sequence of functions starting with the base data as an input to the sequence.
This removes the need to type the **.data** argument each time.
* Prior to R 4.1, the pipe operator came from a package known as **magrittr**, which would be loaded alongside the tidyverse, and it took the form **%>%**.
* From R 4.1 onwards, the operator comes as standard with base R, as the pipe has become a standard tool for data analysis. It now takes the form **|>**, and this option must be turned on in RStudio under Tools --> Global Options --> Code.
![](Images/native_pipe.png){fig-alt="The Code, Editing pane with the native pipe operator tick box."}
The shortcut for this operator is **CTRL + SHIFT + M** and is one you will use a lot from here on.
### **Example - Single Value**{-}
Let's return to our square root and rounding functions. Let's apply one, and then both in sequence using our new operator.
* First, we write the value or dataset we wish to apply functions to, it is at the entrance to the "pipe" if you will.
* Then, we write the function we wish to use and any additional parameters we may need.
* The pipe operator **passes** the input to the function, so we do not need to write it within the brackets.
```{r}
# Our first pipe - with a single value
2 |> sqrt()
```
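For comparison, the older magrittr form (loaded with the tidyverse) behaves identically here:
```{r}
# The same pipe written with the magrittr operator
2 %>% sqrt()
```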
You can read the pipe operator as **AND THEN**.
It takes the output of one function AND THEN uses that as the input of the next function, and so on.
This means that to chain a sequence of functions, we will use the pipe operator again to pipe our output to our next function as its input.
```{r}
# Our first sequence of pipes
2 |> sqrt() |>
round(digits = 2)
```
This is quite the shift from what we have done so far, but it is more readable in plain English:
>**To the value 2, apply the square root function, and then, round that output to 2 decimal places.**
Notice that following the second pipe, we insert a new line, which automatically indents us a few spaces. This is optional but is easier to read as a sequence.
### Using the Pipe Operator with the tidyverse
What we did above is an oversimplification of why the pipe was created; in reality, it was designed to pipe datasets into functions, rather than a single value.
Even in the case of datasets, we do not need to pipe if we are only applying a single function such as glimpse().
It is when we are performing a chain of steps that the pipe completely simplifies our code, enhancing readability.
### **Example**{-}
Let's return to our previous example, where we selected passenger demographics and sorted by fare.
* First, we write the dataset we want to apply this sequence of steps to.
* AND THEN pipe that into the **select()** function, specifying what columns we wish to retain.
* AND THEN pipe that output to the **arrange()** function, so that it can be sorted by whatever column(s) we wish.
```{r}
# Repeat earlier operation with pipes
titanic_data |>
select(ends_with("passenger"),
fare) |>
arrange(desc(fare))
```
This way, the code is written in the order that it executes, as opposed to the reverse order without the pipe operator.
>**To the titanic dataset, we select columns that end with "passenger" as well as the fare column, and then, sort that in descending order of fare paid.**
We could also pipe the output to a glimpse() for easier sense checking.
```{r}
# Pipeline with glimpse
titanic_data |>
select(ends_with("passenger"),
fare) |>
arrange(desc(fare)) |>
glimpse()
```
### **Variables with tidyverse chains**{-}
Notice that we didn't create a variable in the previous pipelines.
This is because the assignment operator **<-** and the pipe operator **|>** can cause confusion if they are not covered separately.
In front of the previous sequence, known as a **pipeline**, we would write our variable name and assignment operator, which saves the final output in memory.
```{r}
# Save pipeline as a variable
titanic_demographics_sort <- titanic_data |>
select(ends_with("passenger"),
fare) |>
arrange(desc(fare))
titanic_demographics_sort |>
glimpse()
```
Be careful not to confuse the meaning of the two operators, this will take some practice to get used to.
### Exercise
::: {.panel-tabset}
### **Exercise**{-}
Speaking of practice, let's try out our new tool!
Create a pipeline that:
* Selects the first five columns of the data
* Arranges them in ascending order of age.
Assign this output to a new variable and glimpse it.
### **Show Answer**{-}
```{r}
# Selecting and ordering by age with pipes
titanic_age_sort <- titanic_data |>
select(pclass:age_of_passenger) |>
arrange(age_of_passenger)
titanic_age_sort |>
glimpse()
```
:::
## Filter
Often, we are only interested in groups of rows that adhere to a specific condition, such as:
* Passengers that paid over or under a certain fare.
* Passengers who are in a particular age threshold.
* Passengers who embarked from a particular port.
and of course, combinations of the above.
The next verb, **filter()** allows us to subset our rows in this way. To understand this section, we first need to consider conditions.
### Conditional Statements
A conditional statement is one that returns **TRUE** or **FALSE** dependent on the outcome.
We saw examples of these back in Chapter 1, when we were producing logicals with comparisons. For example:
* 4 < 5 is a conditional statement that evaluates to TRUE
* 4 != 4 is a conditional statement that evaluates to FALSE
Logical Operator| Description
:--------------:|:------------:
< | Less Than
<= | Less Than or Equal To
\> | Greater Than
>= | Greater Than or Equal To
== | Equal To
!= | Not Equal To
\| | Or
& | And
! | Not
any() | Checks if any value in a logical vector is TRUE
all() | Checks if all values in a logical vector are TRUE
is.na() | Is the value missing (NA)?
between() | Is the value between two numbers?
>**Note: The "!" allows us to flip or invert an expression. Basically, if an expression returns c(TRUE, TRUE, FALSE), the inverted expression (place ! in front of it) will return c(FALSE, FALSE, TRUE).**
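A few quick illustrations of some of these operators on simple vectors (the values here are chosen purely for demonstration; between() comes from dplyr, which we have already loaded):
```{r}
# Trying out some of the logical operators from the table
c(10, 25, NA) > 20             # FALSE  TRUE    NA
!c(TRUE, TRUE, FALSE)          # FALSE FALSE  TRUE
is.na(c(10, 25, NA))           # FALSE FALSE  TRUE
between(c(10, 25, 40), 1, 30)  #  TRUE  TRUE FALSE
any(c(10, 25, 40) > 30)        # TRUE
all(c(10, 25, 40) > 30)        # FALSE
```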
These statements can of course be much more complex than comparing two single numbers, we can apply them across columns with dplyr's **filter()** verb!
### Single Conditional Filtering
We will first look at filtering by a single condition, which are constructed as follows:
* We begin with the column to compare with.
* Next is the logical operator of choice, such as **<**, **>**.
* Last up is the value to compare each entry in the column to, which generates the set of TRUEs and FALSEs.
This is generated in the background, and **filter()** will keep only the rows that return **TRUE** from this comparison.
> **For example, to subset down to only second class passengers, we would write: "pclass == 2".**
### **Example - Categorical**{-}
```{r}
# Filter to retain only second class passengers
second_class <- titanic_data |>
filter(pclass == 2) |>
glimpse()
```
A quick way to check that the filter has worked is to use base R's **unique()** function on the column.
We need to use our **$** here to reference columns by name as this is not a tidyverse method.
Let's check the original data first.
```{r}
# Return unique values from the original data
unique(titanic_data$pclass)
```
We see that the pclass column has 3 unique levels of 1, 2 and 3.
```{r}
# Return unique values from the filtered data
unique(second_class$pclass)
```
We see that in the filtered data, we have just second class passengers, as expected.
### **Example - Numeric**{-}
Let's filter to passengers who paid above £200.
```{r}
# Select passengers who paid more than 200
titanic_data |>
filter(fare > 200) |>
glimpse()
```
Notice that we didn't save this as a variable, as this was just an example, one we will not carry forward in our analysis.
However, if you are being asked to answer specific questions, then saving the outputs of your manipulation pipelines as variables is very useful, to prevent overwriting the base data.
### Exercise
::: {.panel-tabset}
### **Exercise**{-}
1. Use filter to return the row for the passenger named: 'Birkeland, Mr. Hans Martin Monsen'
2. Filter for passengers that are male and save it as a variable. Can you count how many there were?
3. Filter for passengers that are under 18 years of age and save it as a variable. Can you count how many there were?
4. **Extension - Harder**: What percentage of passengers in the dataset survived? Remember, to compute a percentage, we must multiply the proportion by 100.
### **Show Answer**{-}
1.
```{r}
# Filtering to a specific passenger
titanic_data |>
filter(name_of_passenger ==
'Birkeland, Mr. Hans Martin Monsen') |>
glimpse()
```
2.
To display the number of males in the filtered data, we simply need the number of rows of the data!
```{r}
# Males only
titanic_male <- titanic_data |>
filter(sex_of_passenger == 'male')
# Counting the number of males
nrow(titanic_male)
```
3.
To display the number of passengers below 18 years of age in the filtered data, we simply need the number of rows of the data!
```{r}
# Underage passengers
titanic_underage <- titanic_data |>
filter(age_of_passenger < 18)
# Counting the number of underage passengers
nrow(titanic_underage)
```
4.
Firstly, we need to filter the dataset to those who survived, and the percentage would be calculated as:
> (Number of passengers who survived / Total number of passengers) * 100
Remember, we can compute the number who survived and the number of passengers, we just need the number of rows in each subsequent tibble.
```{r}
# Passengers who survived
titanic_survived <- titanic_data |>
filter(survived == 1)
# Survival percentage
(nrow(titanic_survived) / nrow(titanic_data)) * 100
```
So we see that, tragically, only around 38% of the passengers aboard the Titanic survived.