discrete-distributions.Rmd

# Discrete Probability Distributions

## Learning Objectives {-#objectives-discrete-distributions}

1. Define and explain the key characteristics of the discrete distributions:
  - geometric, 
  - binomial, 
  - negative binomial, 
  - hypergeometric, 
  - Poisson and 
  - uniform on a finite set.
2. Evaluate probabilities and quantiles associated with such distributions.

## Theory {-#theory-discrete-distributions}

`R` was designed to be used for statistical computing - so it handles **randomness** well!

Using `R` we can guarantee reproducibility (and enhance sharability) by using the function `set.seed(seed)`{.R} where `seed` is a single value integer. Using this approach we guarantee the generation of the same sequence of random numbers everytime we call this function. Use `?set.seed`{.R} to learn more about this function.

Let's see `set.seed` in action:

```{r discrete-distributions-randomness}
# We make five random draws from the integer range [1, 10]
# We cannot guarantee reproducing this outcome when sharing the code: 
sample(1:10, 5)

# Now we set a seed value before making the five random draws
# We guarantee a fixed output which enhances reproducibility and sharability:
set.seed(42) # Fixes result

# Using set.seed(42) we guarantee five random draws within the integer range [1, 10] will be:
# 1, 5, 10, 8, 2
sample(1:10, 5)

# We can re-initialise the seed with the same value
# Observe that the same sequence of random numbers are generated:
set.seed(42) # Fixes result
# Guarantee we draw 1, 5, 10, 8, 2 again
sample(1:10, 5)
```

## In-built probability distributions

`R` has in-built functions for probability distributions:

- **d***\<distribution-name\>* $:=$ **d**ensity ("_PDF_"), *i.e.* $f_X(x)$
- **p***\<distribution-name\>* $:=$ **p**robability distribution cumulative function ("_CDF_"), *i.e.* $F_X(x) =\boldsymbol{P}(X \leq x)$
- **q***\<distribution-name\>* $:=$ **q**uantile function, *i.e.* return $x$ such that $\boldsymbol{P}(X \leq x) = p$
- **r***\<distribution-name\>* $:=$ **r**andom deviates, *i.e.* (*psuedo*) random number generator for a given distribution
- *Where \<distribution-name\>* $=$ Normal, uniform, lognormal, Student's $t$, Poisson, binormal, Weibull ... see `?distributions()`{.R} for more information

To give some quick examples (we will explore these in more detail later in this chapter and the next chapter):

| R Code | Definition |
|-|-|
| `rnorm(1)`{.R} | Generates $x_1$ where $X \sim \mathcal{N}(0,\,1)$ |
| `rnorm(y, mean=10, sd=2)`{.R} | Generates $\{y_1,\,y_2,\,\dots\}$ with $Y \sim \mathcal{N}(10,\,2^2)$ |
| `runif(3, min=5, max=10)`{.R} | Generates $\{z_1,\,z_2,\,z_3\}$  where $Z \sim \mathcal{U}(5,\,10)$ |
| `dbinom(4, size=5, prob=0.5)`{.R} | Computes $\boldsymbol{P}(X = 4)$ where $X \sim Bin(5,\,0.5)$ |
| `pgamma(0.2, shape=2, rate=2)`{.R} | Computes $F_Y(0.2)$ where $Y \sim \mathcal{\Gamma}(2,\,2)$, i.e.  $\boldsymbol{P}(Y\leq 0.2)$|
| `qexp(0.5, rate = 2)`{.R} | Determines smallest value of $z$ for $\boldsymbol{P}(Z \leq z) = 0.5$ where $Z \sim Exp(2)$ |

## Discrete probability distributions covered {-}

We will consider how to interact with the following discrete probability distributions in `R`:

- Geometric
- Binomial
- Negative binomial
- Hypergeometric
- Poisson
- Uniform (on a finite set)

For each distribution above we will determine how to calculate:

- A random deviate following the discrete distribution $X$,
- The probability mass function ("_PMF_"), $P(X = k)$ for distribution $X$ and $-\infty < k < \infty$ (noting the _PMF_ $= 0$ for most values of $k$),
- The cumulative distribution function ("_CDF_"), $P(X \leq k)$,
- A range of PMFs, i.e. $P(k_1 \leq X \leq k_2)$, and
- The quantile function to find $k$ representing the value such that $P(X \leq k) = p$, i.e. the _pth_ percentile.

We will finish off with a plot of the distribution.

## Geometric distribution

The geometric probability distribution ($X$) can be defined by the number of failures ($k$) in Bernoulli trials before achieving a success. A Bernoulli trial is a random experiment with exactly two outcomes, \{"_success_", "_failure_"\}, in which the probability of success is constant in every experiment. More formally, if we have $k \in \{0,\,1,\,2,\, \dots \}$ independent, identically distributed ("_i.i.d._") Bernoulli trials before a "_success_" with the $P($"_success_"$)$ on each trial defined as $p$, then:

$$P(X = k) = (1 - p)^{k}p$$

We have $X \sim Geo(p)$ with $p \in (0, 1]$.

In `R` we can generate random deviates following a geometric distribution using the function `rgeom(n, prob)`{.R} where: 

- _n_ is the _number_ of random deviates we want to generate, and
- _prob_ is the _probability_ of success in any given Bernoulli trial. 

```{r discrete-distributions-geometric-1}
# Guarantee reproducibility
set.seed(42)

# Generate 5 random deviates following X~Geo(0.25)
rgeom(5, 0.25)
```

If we wanted to find the x<sup>th</sup> percentile of $X \sim Geo(p)$ we would use the quantile function, `qgeom(p, prob)`{.R}.

```{r discrete-distributions-geometric-2}
# Find the 99th percentile of X~Geo(0.25)
percentile_99 <- qgeom(0.99, 0.25)

paste0(
  "The 99th percentile of X~Geo(0.25) is ",
  percentile_99,
  "."
)
```

If we wish to find $P(X \leq k)$ then we need to use the function for the cumulative distribution function, `pgeom(q, prob)`{.R} where:

- _q_ equates to $k$ failures before success in a sequence of i.i.d. Bernoulli trials,
- _prob_ is the _probability_ of success in any given Bernoulli trial. 

```{r discrete-distributions-geometric-3}
# Find P(X <= 7) with X~Geo(0.25)
prob_geom <- pgeom(7, 0.25)

paste0(
  "P(X <= 7) with X~Geo(0.25) is ",
  format(prob_geom, digits = 4),
  "."
)
```

We can find $P(X = k)$ using the `dgeom()`{.R} function call:

```{r discrete-distributions-geometric-4}
# Find P(X = 7) with X~Geo(0.25)
prob_geom <- dgeom(7, 0.25)
```
We find that $P(X = 7) =$ `r toString(prob_geom)`.

It helps to visualise the probability distribution. We _can_ do this in base `R` with the `plot()`{.R} function. However for illustrative purposes we will use the `ggplot2` package from the [`tidyverse`](https://www.tidyverse.org/packages/) suite of packages. `ggplot2` allows for rich visualisations^[For inspiration consider the BBC's [cookbook](https://bbc.github.io/rcookbook/) that highlights how to create their distinctive graphics predominately using `ggplot2`.] using a consistent syntax. 

```{r discrete-distributions-geometric-5}
# Plot the distribution function of X~Geo(0.25)
library(ggplot2)
df <- data.frame(x = rgeom(1000, 0.25)) 
ggplot(df, aes(x=x)) + geom_histogram(binwidth = 0.5)
```

## Binomial distribution

The binomial probability distribution ($Y$) can be defined by the number of successes in a sequence of $n$ i.i.d. Bernoulli trials. A Bernoulli trial is a random experiment with exactly two outcomes, \{"_success_", "_failure_"\}, wherein the probability of success is $p$. We often state the probability of failure as $q = 1 - p$ for convenience. More formally, if we have $k \in \{0,\,1,\,2,\, \dots,\, n \}$ successes given $n$ i.i.d. Bernoulli trials with the $P($"_success_"$)$ on each trial defined as $p$, then:

$$P(Y = k) = \binom{n}{k}p^k(1 - p)^{n - k}$$

It can help to think of this as:

- $p^k$ chance of $k$ successes,
- $(1 - p)^{n - k}$ chance of $n - k$ failures,
- the ordering of the $k$ successes can occur anywhere within the $n$ trials, hence $\binom{n}{k}$.

We can generate random deviates following the binomial distribution $Y \sim Bin(n,\,p)$ using `rbinom(n, size, prob)`{.R} where:

- _n_ is the _number_ of random deviates we want to generate, 
- _size_ is the number of i.i.d. Bernoulli trials performed, and
- _prob_ is the probability of success in any given trial.

```{r discrete-distributions-binomial-1}
# Generate 5 random deviates on Y~Bin(10, 0.55)
set.seed(42)
rbinom(5, 10, 0.55)
```

To find $P(Y \leq k)$ we can use the function `pbinom(p, size, prob)`{.R} where:

- _p_ is the probability of interest,
- _size_ is the number of i.i.d. Bernoulli trials performed, and
- _prob_ is the probability of success in any given trial.

Suppose we were told the probability of having a boy at birth was 0.5131451^[This probability was chosen based on _Table 1_ from [Sex ratios at birth in the United Kingdom, 2014-18](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/925066/Sex_ratios_at_birth_in_the_United_Kingdom__2014-18.pdf) from the Department of Health & Social Care.]. Additionally, suppose we were told that there were 1,264 births^[This turns out to be the number of live births in Guildford in 2019, as per _Table 3_ of the ONS' [Birth Summary Tables, England and Wales, 2019](https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/birthsummarytables). Source: Office for National Statistics under the Open Government Licence.]. Given this, we are curious to know what is the probability of there being less than 632 boys (_i.e. < 50%_) amongst the new babies.

```{r discrete-distributions-binomial-2}
# Find P(Y <= 632) with Y~Binom(1264, 0.5131451)
prob_binom <- pbinom(632, 1264, 0.5131451)

paste0(
  "P(Y <= 632) with Y~Binom(1264, 0.5131451) is ",
  format(prob_binom, digits = 4),
  "."
)
```

To find $P(Y = k)$ we can use the function `dbinom(x, size, prob)`{.R} where:

- _x_ is the quantile of interest,
- _size_ is the number of i.i.d Bernoulli trials performed, and
- _prob_ is the probability of success in any given trial.

Throwing a fair coin, we are curious to compute the probability of observing 4 heads in a sequence of 5 tosses:

```{r discrete-distributions-binomial-3}
# Find P(Y = 4) where Y~Binom(5, 0.5)
prob_binom <- dbinom(4, 5, 0.5)

paste0(
  "P(Y = 4) with Y~Binom(5, 0.5) is ",
  format(prob_binom, digits = 4),
  "."
)
```

If we wanted to find $P(k_1 \leq Y \leq k_2)$ we can also use the function `dbinom(x, size, prob)`{.R} with _x_ entered as a vector and summing over the output.

Continuing the fair coin example, we are curious what is the probability of observing at least 1 head and at most 3 heads in a sequence of 5 tosses:

```{r discrete-distributions-binomial-4}
# Find P(1 <= Y <= 3) where Y~Binom(5, 0.5)
prob_binom <- sum(dbinom(1:3, 5, 0.5))

paste0(
  "P(1 <= Y <= 3) with Y~Binom(5, 0.5) is ",
  format(prob_binom, digits = 4),
  "."
)
```

When we are interested in finding the _yth_ percentile we can use the quantile function, `qbinom(p, size, prob)`{.R} where:

- _p_ is the percentile of interest,
- _size_ is the number of i.i.d Bernoulli trials performed, and
- _prob_ is the probability of success in any given Bernoulli trial.

Continuing the baby boys example, we now are curious what is the 99th percentile of baby births being boys. Recall that the probability (_p_) of having a boy at birth was 0.5131451 and that there were 1,264 births (_n_) in Guildford in 2019.

```{r discrete-distributions-binomial-5}
# Find the 99th percentile of Y~Binom(1264, 0.5131451)
percentile_99 <- qbinom(0.99, 1264, 0.5131451)

paste0(
  "The 99th percentile of Y~Binom(1264, 0.5131451) is ",
  percentile_99,
  "."
)
```

Finally, we finish with a plot of binomial distributions and note that with large $n$ the shape of the distribution begins to form the normal distribution (as long as $p$ is not near the extremes of 0 or 1). 

```{r discrete-distributions-binomial-6}
# Plot the distribution function of X, Y, Z with:
# X,Y,Z~Binom(n, p) for various {n, p}
library(ggplot2)
set.seed(42)

df <- data.frame(
  dist_types = factor(    
    rep(c(
      "n = 20, p = 0.5",
      "n = 20, p = 0.7",
      "n = 40, p = 0.5"
      ), each = 1000
    )
  ),
  dist_values = c(
    rbinom(1000, 20, 0.5),
    rbinom(1000, 20, 0.7),
    rbinom(1000, 40, 0.5)
  )
) 

ggplot(df, aes(x = dist_values, fill = dist_types)) +
  geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") +
  xlab("Distribution values") +
  ylab("Count") +
  theme_minimal() +
  labs(fill = "") +
  ggtitle("Histogram of binomial distributions for various n, p")
```

## Negative binomial distribution

The negative binomial distribution ($Z$) can be defined by the number of failures in a sequence of $n$ independent, identically distributed ("_i.i.d._") Bernoulli trials before a specified number of successes, $r$, occurs. More formally, if we have, $r >0$ number of successes, $k$ number of failures, $p$ probability of success then:

$$P(Z = k) = \binom{k+r-1}{k}(1-p)^{k}p^{r}$$

We can see that with $r = 1$, the binomial distribution is a special case of the more general negative binomial distribution.

Let us begin by generate random deviates from the negative binomial distribution, $Z \sim NegBin(r, p)$, using `rnbinom(n, size, prob)`{.R} where:

- _n_ is the _number_ of random deviates we want to generate,
- _size_ is the target number of successes from i.i.d. Bernoulli trials, and
- _prob_ is the probability of success in any given trial

```{r discrete-distributions-negative-binomial-1}
# Generate random deviates from the negative binomial distribution
set.seed(42)
rnbinom(5, 10, 0.5)
```

To find $P(Z \leq k)$ we can use the function `pnbinom(q, size, prob)`{.R} where:

- _q_ is the quantile of interest,
- _size_ is the target number of successes from i.i.d. Bernoulli trials,
- _prob_ is the probability of success in any given trial.

```{r discrete-distributions-negative-binomial-2}
# Find P(Z <= 2) with Z~NegBin(5, 0.5)
prob_nbinom <- pnbinom(2, 5, 0.5)

paste0(
  "P(Z <= 2) with Z~NegBin(5, 0.5) is ",
  format(prob_nbinom, digits = 4),
  "."
)
```

To find $P(Z = k)$ we can use the function `dnbinom(x, size, prob)`{.R} where:

- _x_ is the quantile of interest,
- _size_ is the target number of successes from i.i.d. Bernoulli trials, and
- _prob_ is the probability of success in any given trial.

```{r discrete-distributions-negative-binomial-3}
# Find P(Z = 2) where Z~NegBin(5, 0.5)
prob_nbinom <- dnbinom(2, 5, 0.5)

paste0(
  "P(Z = 2) with Z~NegBin(5, 0.5) is ",
  format(prob_nbinom, digits = 4),
  "."  
)
```

We can also calculate $P(k_1 \leq Z \leq k_2)$ with `dnbinom(x, size, prob)`{.R} by entering _x_ as a vector input and summing over the output.

```{r discrete-distributions-negative-binomial-4}
# Find P(1 <= Z <= 3) with Z~NegBin(5, 0.5)
prob_nbinom <- sum(dnbinom(1:3, 5, 0.5))

paste0(
  "P(1 <= Z <= 3) with Z~NegBin(5, 0.5) is ",
  format(prob_nbinom, digits = 4),
  "."
)
```

When we are interested in finding the _zth_ percentile we can make use of the quantile function, `qnbinom(p, size, prob)`{.R} where:

- _p_ is the percentile of interest,
- _size_ is the target number of successes from i.i.d. Bernoulli trials, and
- _prob_ is the probability of success in any given trial.

```{r discrete-distributions-negative-binomial-5}
# Find the 99th percentile of Z~NegBin(100, 0.5)
percentile_99 <- qnbinom(0.99, 100, 0.5)

paste0(
  "The 99th percentile of Z~NegBin(100, 0.5) is ",
  percentile_99,
  "."
)
```

Finally, we finish with a plot of the negative binomial distribution:

```{r discrete-distributions-negative-binomial-6}
# Plot the distribution function of X, Y, Z with: 
# X,Y,Z~NegBin(r, p) for various {r, p}
library(ggplot2)
set.seed(42)

df <- data.frame(
  dist_types = factor(
    rep(c(
      "r = 5, p = 0.6",
      "r = 5, p = 0.5",
      "r = 10, p = 0.4"
    ), each = 1000
    )
  ),
  dist_values = c(
    rnbinom(1000, 5, 0.6),
    rnbinom(1000, 5, 0.5),
    rnbinom(1000, 10, 0.4)
  )
)

ggplot(df, aes(x = dist_values, fill = dist_types)) +
  geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") +
  xlab("Distribution values") +
  ylab("Count") +
  theme_minimal() +
  labs(fill = "") +
  ggtitle("Histogram of negative binomial distributions for various r, p")
```

## Hypergeometric distribution

The hypergeometric distribution ($X$) can be defined by the probability of $k$ successes in $n$ draws **without** replacement from a finite population $N$ that contains exactly $K$ "_success_" objects/states.

$$P(X = k) = \frac{ \binom{K}{k} \binom{N - K}{n - k} }{ \binom{N}{n} }$$

We begin by generating random deviates from the hypergeometric distribution, $X \sim Hyper(N, K, n)$, using `rhyper(nn, m, n, k)`{.R} where:

- _nn_ is _number_ of observations we want to generate,
- _m_ is number of "success" objects/states in the population $N$,
- _n_ is number of "failure" objects/states in the population $N$, and
- _k_ is the number of objects drawn / states observed

Traditionally we have defined the population $N$ as an urn containing _m_ white balls and _n_ black balls and this is how the distribution arguments are specified in the `stats`{.R} package. Note that _m_ $= K$ and _n_ $= N - K$.

```{r discrete-distributions-hypergeometric-1}
# Generate random deviates on X~Hyper(N = 100, K = 90, n = 10)
set.seed(42)
rhyper(5, 90, 100 - 90, 10)
```

In the above random deviate generation we specified a 90% ($\frac{90}{90 + 10}$) chance of drawing a "success" object and we made 10 draws from this population of 100 objects. With such a high proportion of "_success_" objects it was not suprising to note we generated random deviates from this distribution close to 10/10.

Next we wish to find $P(X \leq k)$ which involves using `phyper(q, m, n, k)`{.R} where:

- _q_ is the quantile of interest (representing the number of "success" objects in the finite population $N$),
- _m_ is the number of "success" objects in the population $N$,
- _n_ is number of "failure" objects in the population $N$, and
- _k_ is the number of objects drawn

```{r discrete-distributions-hypergeometric-2}
# Find P(X <= 5) with X~Hyper(N = 100, K = 90, n = 10)
prob_hyper <- phyper(5, 90, 100 - 90, 10)

paste0(
  "With 10 objects drawn from a total 100 possible, ",
  "P(X <= 3) with X~Hyper() is ",
  format(prob_hyper, digits = 4),
  "."
)
```

To find $P(X = k)$ we can use the function `dhyper(x, m, n, k)`{.R} where:

- _x_ is the quantile of interest (representing the number of "success" objects in the finite population $N$),
- _m_ is the number of "success" objects in the population $N$,
- _n_ is number of "failure" objects in the population $N$, and
- _k_ is the number of objects drawn

```{r discrete-distributions-hypergeometric-3}
# Find P(X = 5) with X~Hyper(N = 100, K = 90, n = 10)
prob_hyper <- dhyper(5, 90, 100 - 90, 10)
```

We can also calculate $P(k_1 \leq X \leq k_2)$ with `dhyper(x, m, n, k)`{.R} by entering _x_ as a vector input and summing over the output.

```{r discrete-distributions-hypergeometric-4}
# Find P(7 <= X <= 9) with X~Hyper(N = 100, K = 90, n = 10)
prob_hyper <- sum(dhyper(7:9, 90, 100 - 90, 10))
```

When we are interested in finding the _xth_ percentile we can make use of the quantile function, `qhyper(p, m, n, k)`{.R} where:

- _p_ is the percentile of interest,
- _m_ is the number of "success" objects in the population $N$,
- _n_ is number of "failure" objects in the population $N$, and
- _k_ is the number of objects drawn

```{r discrete-distributions-hypergeometric-5}
# Find the 99th percentile of X~Hyper(N = 100, K = 90, n = 50)
percentile_99 <- qhyper(0.99, 90, 100 - 90, 50)

paste0(
  "When we draw 50 objects from a population of ",
  "90 'success' objects and 10 'failure' objects, ",
  "we expect the 99th percentile outcome to be ",
  percentile_99,
  "."
)
```
Finally, we finish with a plot of the hypergeometric distribution:

```{r discrete-distributions-hypergeometric-6}
# Plot the distribution function of X, Y, Z with:
# X,Y,Z~Hyper(N, K, n) for various {N, K, n}
library(ggplot2)
set.seed(42)

inputs <- c(
  "N = 500, K = 50, n = 100",
  "N = 500, K = 60, n = 200",
  "N = 500, K = 70, n = 300"
)

df <- data.frame(
  dist_types = rep(inputs, each = 1000),
  dist_values = c(
    rhyper(1000, 50, 500 - 50, 100),
    rhyper(1000, 60, 500 - 60, 200),
    rhyper(1000, 70, 500 - 70, 300)
  )
)

ggplot(df, aes(x = dist_values, fill = dist_types)) +
  geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") +
  xlab("Distribution values") +
  ylab("Count") +
  theme_minimal() +
  labs(fill = "") +
  scale_fill_discrete(breaks = inputs) +
  ggtitle("Histogram of Hypergeometric distributions for various inputs")
```

## Poisson distribution

The Poisson distribution ($Y$) is usually defined as the probability of a given number of events occurring ($k$) in a fixed interval of time where events occur with a **constant mean rate** ($\lambda$) and **independently** of the time since the last event. More formally, for $\lambda > 0$ and $k = 0,\,1,\,2,\,\dots\,$:

$$P(Y = k) = \frac{\lambda^k e^{-k}}{k!}$$

We can generate random deviates from the Poisson distribution $Y \sim Pois(\lambda)$ using `rpois(n, lambda)`{.R} where:

- _n_ is the _number_ of random deviates we want to generate, and
- _lambda_ is the constant _mean_ rate.

```{r discrete-distributions-poisson-1}
# Generate 5 random deviates on Y~Pois(8)
set.seed(42)
rpois(5, 8)
```

To find $P(Y \leq k)$ we can use the function `ppois(q, lambda)`{.R} where:

- _q_ is the quantile of interest, and
- _lambda_ is the constant _mean_ rate.

```{r discrete-distributions-poisson-2}
# Find P(Y <= 5) with Y~Pois(8)
prob_pois <- ppois(5, 8)

paste0(
  "P(Y <= 5) with Y~Pois(5, 8) is ",
  format(prob_pois, digits = 4),
  "."
)
```

To find $P(Y = k)$ we can use the function `dpois(x, lambda)`{.R} where:

- _x_ is the quantile of interest, and
- _lambda_ is the constant _mean_ rate.

```{r discrete-distributions-poisson-3}
# Find P(Y = 5) with Y~Pois(8)
prob_pois <- dpois(5, 8)

paste0(
  "P(Y = 5) with Y~Pois(5, 8) is ",
  format(prob_pois, digits = 4),
  "."
)
```

If we wanted to find $P(k_1 \leq Y \leq k_2)$ we can also use the function `dpois(x, lambda)`{.R} with _x_ entered as a vector and summing over the output.

```{r discrete-distributions-poisson-4}
# Find P(4 <= Y <= 6) where Y~Pois(8)
prob_pois <- sum(dpois(4:6, 5))

paste0(
  "P(4 <= Y <= 6) with Y~Pois(8) is ",
  format(prob_pois, digits = 4),
  "."
)
```

When we are interested in finding the _yth_ percentile we can use the quantile function, `qpois(p, lambda)`{.R} where:

- _p_ is the _percentile_ of interest, and
- _lambda_ is the constant _mean_ rate.

```{r discrete-distributions-poisson-5}
# Find the 99th percentile of Y~Poisson(8)
percentile_99 <- qpois(0.99, 8)

paste0(
  "The 99th percentile of Y~Pois(8) is ",
  percentile_99,
  "."
)
```

Finally, we finish with a plot of Poisson distributions. We note that for large $\lambda$ the Poisson distribution is well approximated by the normal distribution with $\mu = \lambda$ and $\sigma^2 = \lambda$. We can approximate the Poisson distribution with this normal distribution after allowing for a continuity correction:

$$F_{Pois}(x; \lambda) \approx F_{normal}(x + 0.5; \mu = \lambda, \sigma^2 = \lambda)$$

Here we have replaced $P(X \leq x)$ with $P(X \leq x + 0.5)$ on account of the continuity correction.

```{r discrete-distributions-poisson-6}
# Plot the distribution function of X, Y, Z with:
# X,Y,Z~Pois(lambda) for various lambda
library(ggplot2)
set.seed(42)

lambdas <- c(
  "lambda = 8",
  "lambda = 16",
  "lambda = 32"
)

df <- data.frame(
  dist_types = rep(lambdas, each = 1000),
  dist_values = c(
    rpois(1000, 8),
    rpois(1000, 16),
    rpois(1000, 32)
  )
) 

ggplot(df, aes(x = dist_values, fill = dist_types)) +
  geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") +
  xlab("Distribution values") +
  ylab("Count") +
  theme_minimal() +
  labs(fill = "") +
  scale_fill_discrete(breaks = lambdas) +
  ggtitle("Histogram of Poisson distributions for various lambda")
```

## Uniform distribution {#discrete-uniform}

Our final discrete probability distribution concerns the uniform distribution ($Z$) defined over a finite set ${a, b}$. This is one discrete probability distribution which is **not** part of the source code for the `stats` package in base `R`. We can use `rdunif(n, b, a)` function within the `purrr` package^[[`purrr`](https://purrr.tidyverse.org/) is a part of the tidyverse suite of packages] to generate random deviates following the discrete uniform distribution.

We have:

$$P(Z = k) = \frac{1}{b - a + 1}$$

As an example we can simulate a die throw using $Z \sim \mathcal{U}(1, 6)$ as follows:

```{r discrete-distributions-uniform-1}
# Generate 5 random deviates on Z~Unif(1, 6)
set.seed(42)
purrr::rdunif(5, 6, 1)
```

To find $P(Z \leq k)$ we can calculate this using the CDF:

$$P(Z \leq k) = \frac{\lfloor k \rfloor - a + 1}{b - a + 1}$$

```{r discrete-distributions-uniform-2}
# Find P(Z <= 2) with Z~Unif(1, 6)
prob_uniform <- (2 - 1 + 1) / (6 - 1 + 1)

paste0(
  "P(Z <= 2) with Z~Unif(1, 6) is ",
  format(prob_uniform, digits = 4),
  "."
)
```

To find $P(Z = k)$ we can use the PMF, $P(Z = k) = \frac{1}{b - a + 1}$:

```{r discrete-distributions-uniform-3}
# Find P(Z = 2)  with Z~Unif(1, 6)
prob_uniform <- 1 / (6 - 1 + 1)

paste0(
  "P(Z = 2) with Z~Unif(1, 6) is ",
  format(prob_uniform, digits = 4),
  "."
)
```

To find $P(k_1 \leq Z \leq k_2)$ we can sum over the PMFs:

```{r discrete-distributions-uniform-4}
# Find P(3 <= Z <= 5) where Z~Unif(1, 6)
# P(Z = k) is constant for all k:
prob_uniform <- 1 / (6 - 1 + 1)
# We have k = 3, 4, 5:
prob_uniform <- prob_uniform * 3

paste0(
  "P(3 <= Z <= 5) with Z~Unif(1, 6) is ",
  format(prob_uniform, digits = 4),
  "."
)
```

When we are interested in finding the _zth_ percentile we can use the quantile function $F^{-1}_{Z}(p)$:

$$F^{-1}(p) = \lfloor (b - a + 1)p \rfloor - 1$$

```{r discrete-distributions-uniform-5}
# Plot the distribution function of Z~Uniform(1, 6)
percentile_99 <- floor((6 - 1 + 1)*0.99)

paste0(
  "The 99th percentile of Z~Unif(1, 6) is ",
  percentile_99,
  "."
)
```

Finally, we finish with a plot of uniform distributions for various inputs.

```{r discrete-distributions-uniform-6}
# Plot the distribution function of X, Y, Z with:
# X,Y,Z~Unif(a, b) for various {a, b}
library(ggplot2)
set.seed(42)

finite_sets <- c(
 "a = 1, b = 6", # 1 die
 "a = 1, b = 12", # 2 dice
 "a = 1, b = 20" # d20 die from D&D
)

df <- data.frame(
  dist_types = rep(finite_sets, each = 5000),
  dist_values = c(
    purrr::rdunif(5000, 1 , 6),
    purrr::rdunif(5000, 1, 12),
    purrr::rdunif(5000, 1, 20)
  )
) 

ggplot(df, aes(x = dist_values, fill = dist_types)) +
  geom_histogram(binwidth = 1, alpha = 0.75, position = "identity") +
  xlab("Distribution values") +
  ylab("Count") +
  theme_minimal() +
  labs(fill = "") +
  scale_fill_discrete(breaks = finite_sets) +
  ggtitle("Histogram of uniform distributions for various inputs")
```

## `R` Practice {-#practice-discrete-distributions}

We finish with a comprehensive example of an univariate discrete distribution question in `R`.