-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdiscrete-distributions.Rmd
735 lines (558 loc) · 26.2 KB
/
discrete-distributions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
# Discrete Probability Distributions
## Learning Objectives {-#objectives-discrete-distributions}
1. Define and explain the key characteristics of the discrete distributions:
- geometric,
- binomial,
- negative binomial,
- hypergeometric,
- Poisson and
- uniform on a finite set.
2. Evaluate probabilities and quantiles associated with such distributions.
## Theory {-#theory-discrete-distributions}
`R` was designed to be used for statistical computing - so it handles **randomness** well!
Using `R` we can guarantee reproducibility (and enhance sharability) by using the function `set.seed(seed)`{.R} where `seed` is a single value integer. Using this approach we guarantee the generation of the same sequence of random numbers everytime we call this function. Use `?set.seed`{.R} to learn more about this function.
Let's see `set.seed` in action:
```{r discrete-distributions-randomness}
# We make five random draws from the integer range [1, 10]
# We cannot guarantee reproducing this outcome when sharing the code:
sample(1:10, 5)
# Now we set a seed value before making the five random draws
# We guarantee a fixed output which enhances reproducibility and sharability:
set.seed(42) # Fixes result
# Using set.seed(42) we guarantee five random draws within the integer range [1, 10] will be:
# 1, 5, 10, 8, 2
sample(1:10, 5)
# We can re-initialise the seed with the same value
# Observe that the same sequence of random numbers are generated:
set.seed(42) # Fixes result
# Guarantee we draw 1, 5, 10, 8, 2 again
sample(1:10, 5)
```
## In-built probability distributions
`R` has in-built functions for probability distributions:
- **d***\<distribution-name\>* $:=$ **d**ensity ("_PDF_"), *i.e.* $f_X(x)$
- **p***\<distribution-name\>* $:=$ **p**robability distribution cumulative function ("_CDF_"), *i.e.* $F_X(x) =\boldsymbol{P}(X \leq x)$
- **q***\<distribution-name\>* $:=$ **q**uantile function, *i.e.* return $x$ such that $\boldsymbol{P}(X \leq x) = p$
- **r***\<distribution-name\>* $:=$ **r**andom deviates, *i.e.* (*psuedo*) random number generator for a given distribution
- *Where \<distribution-name\>* $=$ Normal, uniform, lognormal, Student's $t$, Poisson, binormal, Weibull ... see `?distributions()`{.R} for more information
To give some quick examples (we will explore these in more detail later in this chapter and the next chapter):
| R Code | Definition |
|-|-|
| `rnorm(1)`{.R} | Generates $x_1$ where $X \sim \mathcal{N}(0,\,1)$ |
| `rnorm(y, mean=10, sd=2)`{.R} | Generates $\{y_1,\,y_2,\,\dots\}$ with $Y \sim \mathcal{N}(10,\,2^2)$ |
| `runif(3, min=5, max=10)`{.R} | Generates $\{z_1,\,z_2,\,z_3\}$ where $Z \sim \mathcal{U}(5,\,10)$ |
| `dbinom(4, size=5, prob=0.5)`{.R} | Computes $\boldsymbol{P}(X = 4)$ where $X \sim Bin(5,\,0.5)$ |
| `pgamma(0.2, shape=2, rate=2)`{.R} | Computes $F_Y(0.2)$ where $Y \sim \mathcal{\Gamma}(2,\,2)$, i.e. $\boldsymbol{P}(Y\leq 0.2)$|
| `qexp(0.5, rate = 2)`{.R} | Determines smallest value of $z$ for $\boldsymbol{P}(Z \leq z) = 0.5$ where $Z \sim Exp(2)$ |
## Discrete probability distributions covered {-}
We will consider how to interact with the following discrete probability distributions in `R`:
- Geometric
- Binomial
- Negative binomial
- Hypergeometric
- Poisson
- Uniform (on a finite set)
For each distribution above we will determine how to calculate:
- A random deviate following the discrete distribution $X$,
- The probability mass function ("_PMF_"), $P(X = k)$ for distribution $X$ and $-\infty < k < \infty$ (noting the _PMF_ $= 0$ for most values of $k$),
- The cumulative distribution function ("_CDF_"), $P(X \leq k)$,
- A range of PMFs, i.e. $P(k_1 \leq X \leq k_2)$, and
- The quantile function to find $k$ representing the value such that $P(X \leq k) = p$, i.e. the _pth_ percentile.
We will finish off with a plot of the distribution.
## Geometric distribution
The geometric probability distribution ($X$) can be defined by the number of failures ($k$) in Bernoulli trials before achieving a success. A Bernoulli trial is a random experiment with exactly two outcomes, \{"_success_", "_failure_"\}, in which the probability of success is constant in every experiment. More formally, if we have $k \in \{0,\,1,\,2,\, \dots \}$ independent, identically distributed ("_i.i.d._") Bernoulli trials before a "_success_" with the $P($"_success_"$)$ on each trial defined as $p$, then:
$$P(X = k) = (1 - p)^{k}p$$
We have $X \sim Geo(p)$ with $p \in (0, 1]$.
In `R` we can generate random deviates following a geometric distribution using the function `rgeom(n, prob)`{.R} where:
- _n_ is the _number_ of random deviates we want to generate, and
- _prob_ is the _probability_ of success in any given Bernoulli trial.
```{r discrete-distributions-geometric-1}
# Guarantee reproducibility
set.seed(42)
# Generate 5 random deviates following X~Geo(0.25)
rgeom(5, 0.25)
```
If we wanted to find the x<sup>th</sup> percentile of $X \sim Geo(p)$ we would use the quantile function, `qgeom(p, prob)`{.R}.
```{r discrete-distributions-geometric-2}
# Find the 99th percentile of X~Geo(0.25)
percentile_99 <- qgeom(0.99, 0.25)
paste0(
"The 99th percentile of X~Geo(0.25) is ",
percentile_99,
"."
)
```
If we wish to find $P(X \leq k)$ then we need to use the function for the cumulative distribution function, `pgeom(q, prob)`{.R} where:
- _q_ equates to $k$ failures before success in a sequence of i.i.d. Bernoulli trials,
- _prob_ is the _probability_ of success in any given Bernoulli trial.
```{r discrete-distributions-geometric-3}
# Find P(X <= 7) with X~Geo(0.25)
prob_geom <- pgeom(7, 0.25)
paste0(
"P(X <= 7) with X~Geo(0.25) is ",
format(prob_geom, digits = 4),
"."
)
```
We can find $P(X = k)$ using the `dgeom()`{.R} function call:
```{r discrete-distributions-geometric-4}
# Find P(X = 7) with X~Geo(0.25)
prob_geom <- dgeom(7, 0.25)
```
We find that $P(X = 7) =$ `r toString(prob_geom)`.
It helps to visualise the probability distribution. We _can_ do this in base `R` with the `plot()`{.R} function. However for illustrative purposes we will use the `ggplot2` package from the [`tidyverse`](https://www.tidyverse.org/packages/) suite of packages. `ggplot2` allows for rich visualisations^[For inspiration consider the BBC's [cookbook](https://bbc.github.io/rcookbook/) that highlights how to create their distinctive graphics predominately using `ggplot2`.] using a consistent syntax.
```{r discrete-distributions-geometric-5}
# Plot the distribution function of X~Geo(0.25)
library(ggplot2)
df <- data.frame(x = rgeom(1000, 0.25))
ggplot(df, aes(x=x)) + geom_histogram(binwidth = 0.5)
```
## Binomial distribution
The binomial probability distribution ($Y$) can be defined by the number of successes in a sequence of $n$ i.i.d. Bernoulli trials. A Bernoulli trial is a random experiment with exactly two outcomes, \{"_success_", "_failure_"\}, wherein the probability of success is $p$. We often state the probability of failure as $q = 1 - p$ for convenience. More formally, if we have $k \in \{0,\,1,\,2,\, \dots,\, n \}$ successes given $n$ i.i.d. Bernoulli trials with the $P($"_success_"$)$ on each trial defined as $p$, then:
$$P(Y = k) = \binom{n}{k}p^k(1 - p)^{n - k}$$
It can help to think of this as:
- $p^k$ chance of $k$ successes,
- $(1 - p)^{n - k}$ chance of $n - k$ failures,
- the ordering of the $k$ successes can occur anywhere within the $n$ trials, hence $\binom{n}{k}$.
We can generate random deviates following the binomial distribution $Y \sim Bin(n,\,p)$ using `rbinom(n, size, prob)`{.R} where:
- _n_ is the _number_ of random deviates we want to generate,
- _size_ is the number of i.i.d. Bernoulli trials performed, and
- _prob_ is the probability of success in any given trial.
```{r discrete-distributions-binomial-1}
# Generate 5 random deviates on Y~Bin(10, 0.55)
set.seed(42)
rbinom(5, 10, 0.55)
```
To find $P(Y \leq k)$ we can use the function `pbinom(p, size, prob)`{.R} where:
- _p_ is the probability of interest,
- _size_ is the number of i.i.d. Bernoulli trials performed, and
- _prob_ is the probability of success in any given trial.
Suppose we were told the probability of having a boy at birth was 0.5131451^[This probability was chosen based on _Table 1_ from [Sex ratios at birth in the United Kingdom, 2014-18](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/925066/Sex_ratios_at_birth_in_the_United_Kingdom__2014-18.pdf) from the Department of Health & Social Care.]. Additionally, suppose we were told that there were 1,264 births^[This turns out to be the number of live births in Guildford in 2019, as per _Table 3_ of the ONS' [Birth Summary Tables, England and Wales, 2019](https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/birthsummarytables). Source: Office for National Statistics under the Open Government Licence.]. Given this, we are curious to know what is the probability of there being less than 632 boys (_i.e. < 50%_) amongst the new babies.
```{r discrete-distributions-binomial-2}
# Find P(Y <= 632) with Y~Binom(1264, 0.5131451)
prob_binom <- pbinom(632, 1264, 0.5131451)
paste0(
"P(Y <= 632) with Y~Binom(1264, 0.5131451) is ",
format(prob_binom, digits = 4),
"."
)
```
To find $P(Y = k)$ we can use the function `dbinom(x, size, prob)`{.R} where:
- _x_ is the quantile of interest,
- _size_ is the number of i.i.d Bernoulli trials performed, and
- _prob_ is the probability of success in any given trial.
Throwing a fair coin, we are curious to compute the probability of observing 4 heads in a sequence of 5 tosses:
```{r discrete-distributions-binomial-3}
# Find P(Y = 4) where Y~Binom(5, 0.5)
prob_binom <- dbinom(4, 5, 0.5)
paste0(
"P(Y = 4) with Y~Binom(5, 0.5) is ",
format(prob_binom, digits = 4),
"."
)
```
If we wanted to find $P(k_1 \leq Y \leq k_2)$ we can also use the function `dbinom(x, size, prob)`{.R} with _x_ entered as a vector and summing over the output.
Continuing the fair coin example, we are curious what is the probability of observing at least 1 head and at most 3 heads in a sequence of 5 tosses:
```{r discrete-distributions-binomial-4}
# Find P(1 <= Y <= 3) where Y~Binom(5, 0.5)
prob_binom <- sum(dbinom(1:3, 5, 0.5))
paste0(
"P(1 <= Y <= 3) with Y~Binom(5, 0.5) is ",
format(prob_binom, digits = 4),
"."
)
```
When we are interested in finding the _yth_ percentile we can use the quantile function, `qbinom(p, size, prob)`{.R} where:
- _p_ is the percentile of interest,
- _size_ is the number of i.i.d Bernoulli trials performed, and
- _prob_ is the probability of success in any given Bernoulli trial.
Continuing the baby boys example, we now are curious what is the 99th percentile of baby births being boys. Recall that the probability (_p_) of having a boy at birth was 0.5131451 and that there were 1,264 births (_n_) in Guildford in 2019.
```{r discrete-distributions-binomial-5}
# Find the 99th percentile of Y~Binom(1264, 0.5131451)
percentile_99 <- qbinom(0.99, 1264, 0.5131451)
paste0(
"The 99th percentile of Y~Binom(1264, 0.5131451) is ",
percentile_99,
"."
)
```
Finally, we finish with a plot of binomial distributions and note that with large $n$ the shape of the distribution begins to form the normal distribution (as long as $p$ is not near the extremes of 0 or 1).
```{r discrete-distributions-binomial-6}
# Plot the distribution function of X, Y, Z with:
# X,Y,Z~Binom(n, p) for various {n, p}
library(ggplot2)
set.seed(42)
df <- data.frame(
dist_types = factor(
rep(c(
"n = 20, p = 0.5",
"n = 20, p = 0.7",
"n = 40, p = 0.5"
), each = 1000
)
),
dist_values = c(
rbinom(1000, 20, 0.5),
rbinom(1000, 20, 0.7),
rbinom(1000, 40, 0.5)
)
)
ggplot(df, aes(x = dist_values, fill = dist_types)) +
geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") +
xlab("Distribution values") +
ylab("Count") +
theme_minimal() +
labs(fill = "") +
ggtitle("Histogram of binomial distributions for various n, p")
```
## Negative binomial distribution
The negative binomial distribution ($Z$) can be defined by the number of failures in a sequence of $n$ independent, identically distributed ("_i.i.d._") Bernoulli trials before a specified number of successes, $r$, occurs. More formally, if we have, $r >0$ number of successes, $k$ number of failures, $p$ probability of success then:
$$P(Z = k) = \binom{k+r-1}{k}(1-p)^{k}p^{r}$$
We can see that with $r = 1$, the binomial distribution is a special case of the more general negative binomial distribution.
Let us begin by generate random deviates from the negative binomial distribution, $Z \sim NegBin(r, p)$, using `rnbinom(n, size, prob)`{.R} where:
- _n_ is the _number_ of random deviates we want to generate,
- _size_ is the target number of successes from i.i.d. Bernoulli trials, and
- _prob_ is the probability of success in any given trial
```{r discrete-distributions-negative-binomial-1}
# Generate random deviates from the negative binomial distribution
set.seed(42)
rnbinom(5, 10, 0.5)
```
To find $P(Z \leq k)$ we can use the function `pnbinom(q, size, prob)`{.R} where:
- _q_ is the quantile of interest,
- _size_ is the target number of successes from i.i.d. Bernoulli trials,
- _prob_ is the probability of success in any given trial.
```{r discrete-distributions-negative-binomial-2}
# Find P(Z <= 2) with Z~NegBin(5, 0.5)
prob_nbinom <- pnbinom(2, 5, 0.5)
paste0(
"P(Z <= 2) with Z~NegBin(5, 0.5) is ",
format(prob_nbinom, digits = 4),
"."
)
```
To find $P(Z = k)$ we can use the function `dnbinom(x, size, prob)`{.R} where:
- _x_ is the quantile of interest,
- _size_ is the target number of successes from i.i.d. Bernoulli trials, and
- _prob_ is the probability of success in any given trial.
```{r discrete-distributions-negative-binomial-3}
# Find P(Z = 2) where Z~NegBin(5, 0.5)
prob_nbinom <- dnbinom(2, 5, 0.5)
paste0(
"P(Z = 2) with Z~NegBin(5, 0.5) is ",
format(prob_nbinom, digits = 4),
"."
)
```
We can also calculate $P(k_1 \leq Z \leq k_2)$ with `dnbinom(x, size, prob)`{.R} by entering _x_ as a vector input and summing over the output.
```{r discrete-distributions-negative-binomial-4}
# Find P(1 <= Z <= 3) with Z~NegBin(5, 0.5)
prob_nbinom <- sum(dnbinom(1:3, 5, 0.5))
paste0(
"P(1 <= Z <= 3) with Z~NegBin(5, 0.5) is ",
format(prob_nbinom, digits = 4),
"."
)
```
When we are interested in finding the _zth_ percentile we can make use of the quantile function, `qnbinom(p, size, prob)`{.R} where:
- _p_ is the percentile of interest,
- _size_ is the target number of successes from i.i.d. Bernoulli trials, and
- _prob_ is the probability of success in any given trial.
```{r discrete-distributions-negative-binomial-5}
# Find the 99th percentile of Z~NegBin(100, 0.5)
percentile_99 <- qnbinom(0.99, 100, 0.5)
paste0(
"The 99th percentile of Z~NegBin(100, 0.5) is ",
percentile_99,
"."
)
```
Finally, we finish with a plot of the negative binomial distribution:
```{r discrete-distributions-negative-binomial-6}
# Plot the distribution function of X, Y, Z with:
# X,Y,Z~NegBin(r, p) for various {r, p}
library(ggplot2)
set.seed(42)
df <- data.frame(
dist_types = factor(
rep(c(
"r = 5, p = 0.6",
"r = 5, p = 0.5",
"r = 10, p = 0.4"
), each = 1000
)
),
dist_values = c(
rnbinom(1000, 5, 0.6),
rnbinom(1000, 5, 0.5),
rnbinom(1000, 10, 0.4)
)
)
ggplot(df, aes(x = dist_values, fill = dist_types)) +
geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") +
xlab("Distribution values") +
ylab("Count") +
theme_minimal() +
labs(fill = "") +
ggtitle("Histogram of negative binomial distributions for various r, p")
```
## Hypergeometric distribution
The hypergeometric distribution ($X$) can be defined by the probability of $k$ successes in $n$ draws **without** replacement from a finite population $N$ that contains exactly $K$ "_success_" objects/states.
$$P(X = k) = \frac{ \binom{K}{k} \binom{N - K}{n - k} }{ \binom{N}{n} }$$
We begin by generating random deviates from the hypergeometric distribution, $X \sim Hyper(N, K, n)$, using `rhyper(nn, m, n, k)`{.R} where:
- _nn_ is _number_ of observations we want to generate,
- _m_ is number of "success" objects/states in the population $N$,
- _n_ is number of "failure" objects/states in the population $N$, and
- _k_ is the number of objects drawn / states observed
Traditionally we have defined the population $N$ as an urn containing _m_ white balls and _n_ black balls and this is how the distribution arguments are specified in the `stats`{.R} package. Note that _m_ $= K$ and _n_ $= N - K$.
```{r discrete-distributions-hypergeometric-1}
# Generate random deviates on X~Hyper(N = 100, K = 90, n = 10)
set.seed(42)
rhyper(5, 90, 100 - 90, 10)
```
In the above random deviate generation we specified a 90% ($\frac{90}{90 + 10}$) chance of drawing a "success" object and we made 10 draws from this population of 100 objects. With such a high proportion of "_success_" objects it was not suprising to note we generated random deviates from this distribution close to 10/10.
Next we wish to find $P(X \leq k)$ which involves using `phyper(q, m, n, k)`{.R} where:
- _q_ is the quantile of interest (representing the number of "success" objects in the finite population $N$),
- _m_ is the number of "success" objects in the population $N$,
- _n_ is number of "failure" objects in the population $N$, and
- _k_ is the number of objects drawn
```{r discrete-distributions-hypergeometric-2}
# Find P(X <= 5) with X~Hyper(N = 100, K = 90, n = 10)
prob_hyper <- phyper(5, 90, 100 - 90, 10)
paste0(
"With 10 objects drawn from a total 100 possible, ",
"P(X <= 3) with X~Hyper() is ",
format(prob_hyper, digits = 4),
"."
)
```
To find $P(X = k)$ we can use the function `dhyper(x, m, n, k)`{.R} where:
- _x_ is the quantile of interest (representing the number of "success" objects in the finite population $N$),
- _m_ is the number of "success" objects in the population $N$,
- _n_ is number of "failure" objects in the population $N$, and
- _k_ is the number of objects drawn
```{r discrete-distributions-hypergeometric-3}
# Find P(X = 5) with X~Hyper(N = 100, K = 90, n = 10)
prob_hyper <- dhyper(5, 90, 100 - 90, 10)
```
We can also calculate $P(k_1 \leq X \leq k_2)$ with `dhyper(x, m, n, k)`{.R} by entering _x_ as a vector input and summing over the output.
```{r discrete-distributions-hypergeometric-4}
# Find P(7 <= X <= 9) with X~Hyper(N = 100, K = 90, n = 10)
prob_hyper <- sum(dhyper(7:9, 90, 100 - 90, 10))
```
When we are interested in finding the _xth_ percentile we can make use of the quantile function, `qhyper(p, m, n, k)`{.R} where:
- _p_ is the percentile of interest,
- _m_ is the number of "success" objects in the population $N$,
- _n_ is number of "failure" objects in the population $N$, and
- _k_ is the number of objects drawn
```{r discrete-distributions-hypergeometric-5}
# Find the 99th percentile of X~Hyper(N = 100, K = 90, n = 50)
percentile_99 <- qhyper(0.99, 90, 100 - 90, 50)
paste0(
"When we draw 50 objects from a population of ",
"90 'success' objects and 10 'failure' objects, ",
"we expect the 99th percentile outcome to be ",
percentile_99,
"."
)
```
Finally, we finish with a plot of the hypergeometric distribution:
```{r discrete-distributions-hypergeometric-6}
# Plot the distribution function of X, Y, Z with:
# X,Y,Z~Hyper(N, K, n) for various {N, K, n}
library(ggplot2)
set.seed(42)
inputs <- c(
"N = 500, K = 50, n = 100",
"N = 500, K = 60, n = 200",
"N = 500, K = 70, n = 300"
)
df <- data.frame(
dist_types = rep(inputs, each = 1000),
dist_values = c(
rhyper(1000, 50, 500 - 50, 100),
rhyper(1000, 60, 500 - 60, 200),
rhyper(1000, 70, 500 - 70, 300)
)
)
ggplot(df, aes(x = dist_values, fill = dist_types)) +
geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") +
xlab("Distribution values") +
ylab("Count") +
theme_minimal() +
labs(fill = "") +
scale_fill_discrete(breaks = inputs) +
ggtitle("Histogram of Hypergeometric distributions for various inputs")
```
## Poisson distribution
The Poisson distribution ($Y$) is usually defined as the probability of a given number of events occurring ($k$) in a fixed interval of time where events occur with a **constant mean rate** ($\lambda$) and **independently** of the time since the last event. More formally, for $\lambda > 0$ and $k = 0,\,1,\,2,\,\dots\,$:
$$P(Y = k) = \frac{\lambda^k e^{-k}}{k!}$$
We can generate random deviates from the Poisson distribution $Y \sim Pois(\lambda)$ using `rpois(n, lambda)`{.R} where:
- _n_ is the _number_ of random deviates we want to generate, and
- _lambda_ is the constant _mean_ rate.
```{r discrete-distributions-poisson-1}
# Generate 5 random deviates on Y~Pois(8)
set.seed(42)
rpois(5, 8)
```
To find $P(Y \leq k)$ we can use the function `ppois(q, lambda)`{.R} where:
- _q_ is the quantile of interest, and
- _lambda_ is the constant _mean_ rate.
```{r discrete-distributions-poisson-2}
# Find P(Y <= 5) with Y~Pois(8)
prob_pois <- ppois(5, 8)
paste0(
"P(Y <= 5) with Y~Pois(5, 8) is ",
format(prob_pois, digits = 4),
"."
)
```
To find $P(Y = k)$ we can use the function `dpois(x, lambda)`{.R} where:
- _x_ is the quantile of interest, and
- _lambda_ is the constant _mean_ rate.
```{r discrete-distributions-poisson-3}
# Find P(Y = 5) with Y~Pois(8)
prob_pois <- dpois(5, 8)
paste0(
"P(Y = 5) with Y~Pois(5, 8) is ",
format(prob_pois, digits = 4),
"."
)
```
If we wanted to find $P(k_1 \leq Y \leq k_2)$ we can also use the function `dpois(x, lambda)`{.R} with _x_ entered as a vector and summing over the output.
```{r discrete-distributions-poisson-4}
# Find P(4 <= Y <= 6) where Y~Pois(8)
prob_pois <- sum(dpois(4:6, 5))
paste0(
"P(4 <= Y <= 6) with Y~Pois(8) is ",
format(prob_pois, digits = 4),
"."
)
```
When we are interested in finding the _yth_ percentile we can use the quantile function, `qpois(p, lambda)`{.R} where:
- _p_ is the _percentile_ of interest, and
- _lambda_ is the constant _mean_ rate.
```{r discrete-distributions-poisson-5}
# Find the 99th percentile of Y~Poisson(8)
percentile_99 <- qpois(0.99, 8)
paste0(
"The 99th percentile of Y~Pois(8) is ",
percentile_99,
"."
)
```
Finally, we finish with a plot of Poisson distributions. We note that for large $\lambda$ the Poisson distribution is well approximated by the normal distribution with $\mu = \lambda$ and $\sigma^2 = \lambda$. We can approximate the Poisson distribution with this normal distribution after allowing for a continuity correction:
$$F_{Pois}(x; \lambda) \approx F_{normal}(x + 0.5; \mu = \lambda, \sigma^2 = \lambda)$$
Here we have replaced $P(X \leq x)$ with $P(X \leq x + 0.5)$ on account of the continuity correction.
```{r discrete-distributions-poisson-6}
# Plot the distribution function of X, Y, Z with:
# X,Y,Z~Pois(lambda) for various lambda
library(ggplot2)
set.seed(42)
lambdas <- c(
"lambda = 8",
"lambda = 16",
"lambda = 32"
)
df <- data.frame(
dist_types = rep(lambdas, each = 1000),
dist_values = c(
rpois(1000, 8),
rpois(1000, 16),
rpois(1000, 32)
)
)
ggplot(df, aes(x = dist_values, fill = dist_types)) +
geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") +
xlab("Distribution values") +
ylab("Count") +
theme_minimal() +
labs(fill = "") +
scale_fill_discrete(breaks = lambdas) +
ggtitle("Histogram of Poisson distributions for various lambda")
```
## Uniform distribution {#discrete-uniform}
Our final discrete probability distribution concerns the uniform distribution ($Z$) defined over a finite set ${a, b}$. This is one discrete probability distribution which is **not** part of the source code for the `stats` package in base `R`. We can use `rdunif(n, b, a)` function within the `purrr` package^[[`purrr`](https://purrr.tidyverse.org/) is a part of the tidyverse suite of packages] to generate random deviates following the discrete uniform distribution.
We have:
$$P(Z = k) = \frac{1}{b - a + 1}$$
As an example we can simulate a die throw using $Z \sim \mathcal{U}(1, 6)$ as follows:
```{r discrete-distributions-uniform-1}
# Generate 5 random deviates on Z~Unif(1, 6)
set.seed(42)
purrr::rdunif(5, 6, 1)
```
To find $P(Z \leq k)$ we can calculate this using the CDF:
$$P(Z \leq k) = \frac{\lfloor k \rfloor - a + 1}{b - a + 1}$$
```{r discrete-distributions-uniform-2}
# Find P(Z <= 2) with Z~Unif(1, 6)
prob_uniform <- (2 - 1 + 1) / (6 - 1 + 1)
paste0(
"P(Z <= 2) with Z~Unif(1, 6) is ",
format(prob_uniform, digits = 4),
"."
)
```
To find $P(Z = k)$ we can use the PMF, $P(Z = k) = \frac{1}{b - a + 1}$:
```{r discrete-distributions-uniform-3}
# Find P(Z = 2) with Z~Unif(1, 6)
prob_uniform <- 1 / (6 - 1 + 1)
paste0(
"P(Z = 2) with Z~Unif(1, 6) is ",
format(prob_uniform, digits = 4),
"."
)
```
To find $P(k_1 \leq Z \leq k_2)$ we can sum over the PMFs:
```{r discrete-distributions-uniform-4}
# Find P(3 <= Z <= 5) where Z~Unif(1, 6)
# P(Z = k) is constant for all k:
prob_uniform <- 1 / (6 - 1 + 1)
# We have k = 3, 4, 5:
prob_uniform <- prob_uniform * 3
paste0(
"P(3 <= Z <= 5) with Z~Unif(1, 6) is ",
format(prob_uniform, digits = 4),
"."
)
```
When we are interested in finding the _zth_ percentile we can use the quantile function $F^{-1}_{Z}(p)$:
$$F^{-1}(p) = \lfloor (b - a + 1)p \rfloor - 1$$
```{r discrete-distributions-uniform-5}
# Plot the distribution function of Z~Uniform(1, 6)
percentile_99 <- floor((6 - 1 + 1)*0.99)
paste0(
"The 99th percentile of Z~Unif(1, 6) is ",
percentile_99,
"."
)
```
Finally, we finish with a plot of uniform distributions for various inputs.
```{r discrete-distributions-uniform-6}
# Plot the distribution function of X, Y, Z with:
# X,Y,Z~Unif(a, b) for various {a, b}
library(ggplot2)
set.seed(42)
finite_sets <- c(
"a = 1, b = 6", # 1 die
"a = 1, b = 12", # 2 dice
"a = 1, b = 20" # d20 die from D&D
)
df <- data.frame(
dist_types = rep(finite_sets, each = 5000),
dist_values = c(
purrr::rdunif(5000, 1 , 6),
purrr::rdunif(5000, 1, 12),
purrr::rdunif(5000, 1, 20)
)
)
ggplot(df, aes(x = dist_values, fill = dist_types)) +
geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") +
xlab("Distribution values") +
ylab("Count") +
theme_minimal() +
labs(fill = "") +
scale_fill_discrete(breaks = finite_sets) +
ggtitle("Histogram of uniform distributions for various inputs")
```
## `R` Practice {-#practice-discrete-distributions}
We finish with a comprehensive example of an univariate discrete distribution question in `R`.