404: File not found
The requested file was not found.
Please click here to go to the home page.
diff --git a/.JuliaFormatter.toml b/.JuliaFormatter.toml new file mode 100644 index 00000000..c7439503 --- /dev/null +++ b/.JuliaFormatter.toml @@ -0,0 +1 @@ +style = "blue" \ No newline at end of file diff --git a/.devcontainer/devcontainer.json b/.devcontainer/devcontainer.json new file mode 100644 index 00000000..7f81a55a --- /dev/null +++ b/.devcontainer/devcontainer.json @@ -0,0 +1,17 @@ +// For format details, see https://aka.ms/devcontainer.json. For config options, see the README at: +// https://github.com/microsoft/vscode-dev-containers/tree/v0.166.0/containers/julia +// See https://github.com/julia-vscode/julia-devcontainer/blob/master/Dockerfile for image contents +{ + "name": "Julia (Community)", + "image": "ghcr.io/julia-vscode/julia-devcontainer", + "extensions": [ + "julialang.language-julia", + "eamodio.gitlens", + "visualstudioexptteam.vscodeintellicode", + "ms-vscode.vscode-js-profile-flame", + "vscode-icons-team.vscode-icons", + "yzhang.markdown-all-in-one" + ], + "postCreateCommand": "/julia-devcontainer-scripts/postcreate.jl", + "remoteUser": "vscode" +} diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 00000000..f58a1763 --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,9 @@ +repos: + - repo: https://github.com/crate-ci/typos + rev: v1.16.23 + hooks: + - id: typos + - repo: https://github.com/qiaojunfeng/pre-commit-julia-format + rev: v0.1.1 # use the most recent version + hooks: + - id: julia-format # formatter for Julia code diff --git a/.typos.toml b/.typos.toml new file mode 100644 index 00000000..e3858ade --- /dev/null +++ b/.typos.toml @@ -0,0 +1,31 @@ +[default] +extend-ignore-re = [ + "textcite\\{.*\\}", # citation keys + "\\b[0-9A-Za-z+/]{91}(=|==)?\\b", # base64 strings + "[0-9a-fA-F]{7,}", # git commit hashes + "\\b[0-9A-Za-z+/]{33,}(=|==)?\\b", # SHA/tpub/adresses etc strings +] + +[default.extend-words] +# code stuff +lik = "lik" +quation = "quation" +# surnames +Yau = "Yau" + +[files] +extend-exclude = [ + "datasets/*", + "_assets/*", + "_libs/*", + "images/*", + "pages/images/*", +] + +[type.bib] +check-file = false +extend-glob = ["*.bib"] + +[type.gitignore] +check-file = false +extend-glob = [".gitignore"] diff --git a/404.html b/404.html new file mode 100644 index 00000000..424ab8d5 --- /dev/null +++ b/404.html @@ -0,0 +1 @@ +
The requested file was not found.
Please click here to go to the home page.
Welcome to the repository of tutorials on how to do Bayesian Statistics using Julia and Turing. Tutorials are available at storopoli.io/Bayesian-Julia.
Bayesian statistics is an approach to inferential statistics based on Bayes' theorem, where available knowledge about parameters in a statistical model is updated with the information in observed data. The background knowledge is expressed as a prior distribution and combined with observational data in the form of a likelihood function to determine the posterior distribution. The posterior can also be used for making predictions about future events.
Bayesian statistics is a departure from classical inferential statistics that prohibits probability statements about parameters and is based on asymptotically sampling infinite samples from a theoretical population and finding parameter values that maximize the likelihood function. Mostly notorious is null-hypothesis significance testing (NHST) based on p-values. Bayesian statistics incorporate uncertainty (and prior knowledge) by allowing probability statements about parameters, and the process of parameter value inference is a direct result of the Bayes' theorem.
Julia is a fast dynamic-typed language that just-in-time (JIT) compiles into native code using LLVM. It "runs like C but reads like Python", meaning that is blazing fast, easy to prototype and to read/write code. It is multi-paradigm, combining features of imperative, functional, and object-oriented programming. I won't cover Julia basics and any sort of data manipulation using Julia in the tutorials, instead please take a look into the following resources which covers most of the introduction to Julia and how to work with tabular data in Julia:
Julia Documentation: Julia documentation is a very friendly and well-written resource that explains the basic design and functionality of the language.
Julia Data Science: open source and open access book on how to do Data Science using Julia.
Thinking Julia: introductory beginner-friendly book that explains the main concepts and functionality behind the Julia language.
Julia High Performance: book by two of the creators of the Julia Language (Avik Sengupta and Alan Edelman), it covers how to make Julia even faster with some principles and tricks of the trade.
An Introduction DataFrames: the package DataFrames.jl
provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas
(in Python) and data.frame
, data.table
and dplyr
(in R), making it a great general purpose data science tool, especially for those coming to Julia from R or Python.This is a collection of notebooks that introduces DataFrames.jl
made by one of its core contributors Bogumił Kamiński.
Turing is an ecosystem of Julia packages for Bayesian Inference using probabilistic programming. Models specified using Turing are easy to read and write — models work the way you write them. Like everything in Julia, Turing is fast.
José Eduardo Storopoli, PhD - Lattes CV - ORCID - https://storopoli.io
The content is licensed under a very permissive Creative Commons license (CC BY-SA). You are mostly welcome to contribute with issues and pull requests. My hope is to have more people into Bayesian statistics. The content is aimed towards social scientists and PhD candidates in social sciences. I chose to provide an intuitive approach rather than focusing on rigorous mathematical formulations. I've made it to be how I would have liked to be introduced to Bayesian statistics.
To configure a local environment:
Download and install Julia
Clone the repository from GitHub: git clone https://github.com/storopoli/Bayesian-Julia.git
Access the directory: cd Bayesian-Julia
Activate the environment by typing in the Julia REPL:
using Pkg
+Pkg.activate(".")
+Pkg.instantiate()
kidiq
(linear regression): data from a survey of adult American women and their children (a subsample from the National Longitudinal Survey of Youth). Source: Gelman and Hill (2007).
wells
(logistic regression): a survey of 3200 residents in a small area of Bangladesh suffering from arsenic contamination of groundwater. Respondents with elevated arsenic levels in their wells had been encouraged to switch their water source to a safe public or private well in the nearby area and the survey was conducted several years later to learn which of the affected residents had switched wells. Source: Gelman and Hill (2007).
esoph
(ordinal regression): data from a case-control study of (o)esophageal cancer in Ille-et-Vilaine, France. Source: Breslow and Day (1980).
roaches
(Poisson regression): data on the efficacy of a pest management system at reducing the number of roaches in urban apartments. Source: Gelman and Hill (2007).
duncan
(robust regression): data from occupation's prestige filled with outliers. Source: Duncan (1961).
cheese
(hierarchical models): data from cheese ratings. A group of 10 rural and 10 urban raters rated 4 types of different cheeses (A, B, C and D) in two samples. Source: Boatwright, McCulloch and Rossi (1999).
Despite not being the only Turing tutorial that exists, this tutorial aims to introduce Bayesian inference along with how to use Julia and Turing. Here is a (not complete) list of other Turing tutorials:
Official Turing Tutorials: tutorials on how to implement common models in Turing
Statistical Rethinking - Turing Models: Julia versions of the Bayesian models described in Statistical Rethinking Edition 1 (McElreath, 2016) and Edition 2 (McElreath, 2020)
Håkan Kjellerstrand Turing Tutorials: a collection of Julia Turing models
I also have a free and open source graduate course on Bayesian Statistics with Turing and Stan code. You can find it at storopoli/Bayesian-Statistics
.
To cite these tutorials, please use:
Storopoli (2021). Bayesian Statistics with Julia and Turing. https://storopoli.io/Bayesian-Julia.
+Or in BibTeX format :
+@misc{storopoli2021bayesianjulia,
+ author = {Storopoli, Jose},
+ title = {Bayesian Statistics with Julia and Turing},
+ url = {https://storopoli.io/Bayesian-Julia},
+ year = {2021}
+ }
+The references are divided in books, papers, software, and datasets.
+Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC.
+ +McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC press.
+ +Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press.
+ +Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press. https://books.google.com?id=qfRsAIKZ4rIC
+Geyer, C. J. (2011). Introduction to markov chain monte carlo. In S. Brooks, A. Gelman, G. L. Jones, & X.-L. Meng (Eds.), Handbook of markov chain monte carlo.
+ +van de Schoot, R., Depaoli, S., King, R., Kramer, B., Märtens, K., Tadesse, M. G., Vannucci, M., Gelman, A., Veen, D., Willemsen, J., & Yau, C. (2021). Bayesian statistics and modelling. Nature Reviews Methods Primers, 1(1, 1), 1–26. https://doi.org/10.1038/s43586-020-00001-2
+ +Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., & Gelman, A. (2019). Visualization in Bayesian workflow. Journal of the Royal Statistical Society: Series A (Statistics in Society), 182(2), 389–402. https://doi.org/10.1111/rssa.12378
+ +Gelman, A., Vehtari, A., Simpson, D., Margossian, C. C., Carpenter, B., Yao, Y., Kennedy, L., Gabry, J., Bürkner, P.-C., & Modr’ak, M. (2020, November 3). Bayesian Workflow. http://arxiv.org/abs/2011.01808
+ +Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C., Cesarini, D., Chambers, C. D., Clyde, M., Cook, T. D., De Boeck, P., Dienes, Z., Dreber, A., Easwaran, K., Efferson, C., … Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10. https://doi.org/10.1038/s41562-017-0189-z
+ +McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon Statistical Significance. American Statistician, 73, 235–245. https://doi.org/10.1080/00031305.2018.1527253
+ +Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. Nature, 567(7748), 305–307. https://doi.org/10.1038/d41586-019-00857-9
+ +van de Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J. B., Neyer, F. J., & van Aken, M. A. G. (2014). A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research. Child Development, 85(3), 842–860. https://doi.org/10.1111/cdev.12169
+ +Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1), 65–98.
+ +Ge, H., Xu, K., & Ghahramani, Z. (2018). Turing: A Language for Flexible Probabilistic Inference. International Conference on Artificial Intelligence and Statistics, 1682–1690. http://proceedings.mlr.press/v84/ge18b.html
+ +Tarek, M., Xu, K., Trapp, M., Ge, H., & Ghahramani, Z. (2020). DynamicPPL: Stan-like Speed for Dynamic Probabilistic Models. ArXiv:2002.02702 [Cs, Stat]. http://arxiv.org/abs/2002.02702
+ +Xu, K., Ge, H., Tebbutt, W., Tarek, M., Trapp, M., & Ghahramani, Z. (2020). AdvancedHMC.jl: A robust, modular and efficient implementation of advanced HMC algorithms. Symposium on Advances in Approximate Bayesian Inference, 1–10. http://proceedings.mlr.press/v118/xu20a.html
+ +Revels, J., Lubin, M., & Papamarkou, T. (2016). Forward-Mode Automatic Differentiation in Julia. ArXiv:1607.07892 [Cs]. http://arxiv.org/abs/1607.07892
+ +Boatwright, P., McCulloch, R., & Rossi, P. (1999). Account-level modeling for trade promotion: An application of a constrained parameter hierarchical model. Journal of the American Statistical Association, 94(448), 1063–1073.
+ +Breslow, N. E. & Day, N. E. (1980). Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. IARC Lyon / Oxford University Press.
+ +Duncan, O. D. (1961). A socioeconomic index for all occupations. Class: Critical Concepts, 1, 388–426.
+ +Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge university press.
+ +This content is licensed under Creative Commons Attribution-ShareAlike 4.0 International.
+This website is built with Julia 1.10.0 and
+ +AlgebraOfGraphics 0.6.18
+BenchmarkTools 1.4.0
+Bijectors 0.13.8
+CSV 0.10.12
+CairoMakie 0.11.6
+CategoricalArrays 0.10.8
+Chain 0.5.0
+Colors 0.12.10
+DataFrames 1.6.1
+DifferentialEquations 7.12.0
+Distributions 0.25.107
+Downloads 1.6.0
+FillArrays 1.9.3
+ForwardDiff 0.10.36
+Franklin 0.10.95
+HTTP 1.10.1
+LazyArrays 1.8.3
+Literate 2.16.1
+MCMCChains 6.0.4
+NodeJS 2.0.0
+PDMats 0.11.31
+StatsBase 0.34.2
+StatsFuns 1.3.0
+Turing 0.30.3
+
+I will try to exemplify what would be the "two language problem" by showing you how I would code a simple Metropolis algorithm for a bivariate normal distribution. I would mostly prototype it in a dynamically-typed language such as R or Python. Then, deploy the algorithm using a fast but hard to code language such as C++. This is exactly what I'll do now. The algorithm will be coded in Julia, R, C++ and Stan
. There are two caveats. First, I am coding the original 1950s Metropolis version, not the 1970s Metropolis-Hastings, which implies symmetrical proposal distributions just for the sake of the example. Second, the proposals are based on a uniform distribution on the current proposal values of the proposal values ± a certain width
.
Let's start with Julia which uses the Distributions.jl
package for its probabilistic distributions along with logpdf()
defined methods for all of the distributions.
using Distributions
+function metropolis(S::Int64, width::Float64, ρ::Float64;
+ μ_x::Float64=0.0, μ_y::Float64=0.0,
+ σ_x::Float64=1.0, σ_y::Float64=1.0)
+ binormal = MvNormal([μ_x; μ_y], [σ_x ρ; ρ σ_y]);
+ draws = Matrix{Float64}(undef, S, 2);
+ x = randn(); y = randn();
+ accepted = 0::Int64;
+ for s in 1:S
+ x_ = rand(Uniform(x - width, x + width));
+ y_ = rand(Uniform(y - width, y + width));
+ r = exp(logpdf(binormal, [x_, y_]) - logpdf(binormal, [x, y]));
+
+ if r > rand(Uniform())
+ x = x_;
+ y = y_;
+ accepted += 1;
+ end
+ @inbounds draws[s, :] = [x y];
+ end
+ println("Acceptance rate is $(accepted / S)")
+ return draws
+end
Now let's go to the R version (from now on no more fancy names like μ
or σ
😭). Since this is a bivariate normal I am using the package {mnormt}
which allows for very fast (FORTRAN code) computation of multivariate normal distributions' pdf and logpdf.
metropolis <- function(S, width,
+ mu_X = 0, mu_Y = 0,
+ sigma_X = 1, sigma_Y = 1,
+ rho) {
+ Sigma <- diag(2)
+ Sigma[1, 2] <- rho
+ Sigma[2, 1] <- rho
+ draws <- matrix(nrow = S, ncol = 2)
+ x <- rnorm(1)
+ y <- rnorm(1)
+ accepted <- 0
+ for (s in 1:S) {
+ x_ <- runif(1, x - width, x + width)
+ y_ <- runif(1, y - width, y + width)
+ r <- exp(mnormt::dmnorm(c(x_, y_), mean = c(mu_X, mu_Y), varcov = Sigma, log = TRUE) -
+ mnormt::dmnorm(c(x, y), mean = c(mu_X, mu_Y), varcov = Sigma, log = TRUE))
+ if (r > runif(1, 0, 1)) {
+ x <- x_
+ y <- y_
+ accepted <- accepted + 1
+ }
+ draws[s, 1] <- x
+ draws[s, 2] <- y
+ }
+ print(paste0("Acceptance rate is ", accepted / S))
+ return(draws)
+}
Now C++. Here I am using the Eigen
library. Note that, since C++ is a very powerful language to be used as "close to the metal" as possible, I don't have any convenient predefined multivariate normal to use. So I will have to create this from zero[2].
Ok, be ready! This is a mouthful:
#include <Eigen/Eigen>
+#include <cmath>
+#include <iostream>
+#include <random>
+
+using std::cout;
+std::random_device rd{};
+std::mt19937 gen{rd()};
+
+// Random Number Generator Stuff
+double random_normal(double mean = 0, double std = 1) {
+ std::normal_distribution<double> d{mean, std};
+ return d(gen);
+};
+
+double random_unif(double min = 0, double max = 1) {
+ std::uniform_real_distribution<double> d{min, max};
+ return d(gen);
+};
+
+// Multivariate Normal
+struct Mvn {
+ Mvn(const Eigen::VectorXd &mu, const Eigen::MatrixXd &s)
+ : mean(mu), sigma(s) {}
+ double pdf(const Eigen::VectorXd &x) const;
+ double lpdf(const Eigen::VectorXd &x) const;
+ Eigen::VectorXd mean;
+ Eigen::MatrixXd sigma;
+};
+
+double Mvn::pdf(const Eigen::VectorXd &x) const {
+ double n = x.rows();
+ double sqrt2pi = std::sqrt(2 * M_PI);
+ double quadform = (x - mean).transpose() * sigma.inverse() * (x - mean);
+ double norm = std::pow(sqrt2pi, -n) * std::pow(sigma.determinant(), -0.5);
+
+ return norm * exp(-0.5 * quadform);
+}
+
+double Mvn::lpdf(const Eigen::VectorXd &x) const {
+ double n = x.rows();
+ double sqrt2pi = std::sqrt(2 * M_PI);
+ double quadform = (x - mean).transpose() * sigma.inverse() * (x - mean);
+ double norm = std::pow(sqrt2pi, -n) * std::pow(sigma.determinant(), -0.5);
+
+ return log(norm) + (-0.5 * quadform);
+}
+
+Eigen::MatrixXd metropolis(int S, double width, double mu_X = 0,
+ double mu_Y = 0, double sigma_X = 1,
+ double sigma_Y = 1, double rho = 0.8) {
+ Eigen::MatrixXd sigma(2, 2);
+ sigma << sigma_X, rho, rho, sigma_Y;
+ Eigen::VectorXd mean(2);
+ mean << mu_X, mu_Y;
+ Mvn binormal(mean, sigma);
+
+ Eigen::MatrixXd out(S, 2);
+ double x = random_normal();
+ double y = random_normal();
+ double accepted = 0;
+ for (size_t i = 0; i < S - 1; i++) {
+ double xmw = x - width;
+ double xpw = x + width;
+ double ymw = y - width;
+ double ypw = y + width;
+
+ double x_ = random_unif(xmw, xpw);
+ double y_ = random_unif(ymw, ypw);
+
+ double r = std::exp(binormal.lpdf(Eigen::Vector2d(x_, y_)) -
+ binormal.lpdf(Eigen::Vector2d(x, y)));
+ if (r > random_unif()) {
+ x = x_;
+ y = y_;
+ accepted++;
+ }
+ out(i, 0) = x;
+ out(i, 1) = y;
+ }
+ cout << "Acceptance rate is " << accepted / S << '\n';
+
+ return out;
+}
note that the PDF for a multivariate normal is:
where is a vector of means, is the number of dimensions, is a covariance matrix, is the determinant and is a vector of values that the PDF is evaluated for.
SPOILER ALERT: Julia will beat this C++ Eigen implementation by being almost 100x faster. So I will try to help C++ beat Julia (😂) by making a bivariate normal class BiNormal
in order to avoid the expensive operation of inverting a covariance matrix and computing determinants in every logpdf proposal evaluation. Also since we are not doing linear algebra computations I've removed Eigen and used C++ STL's <vector>
:
#define M_PI 3.14159265358979323846 /* pi */
+
+// Bivariate Normal
+struct BiNormal {
+ BiNormal(const std::vector<double> &mu, const double &rho)
+ : mean(mu), rho(rho) {}
+ double pdf(const std::vector<double> &x) const;
+ double lpdf(const std::vector<double> &x) const;
+ std::vector<double> mean;
+ double rho;
+};
+
+double BiNormal::pdf(const std::vector<double> &x) const {
+ double x_ = x[0];
+ double y_ = x[1];
+ return std::exp(-((std::pow(x_, 2) - (2 * rho * x_ * y_) + std::pow(y_, 2)) /
+ (2 * (1 - std::pow(rho, 2))))) /
+ (2 * M_PI * std::sqrt(1 - std::pow(rho, 2)));
+}
+
+double BiNormal::lpdf(const std::vector<double> &x) const {
+ double x_ = x[0];
+ double y_ = x[1];
+ return (-((std::pow(x_, 2) - (2 * rho * x_ * y_) + std::pow(y_, 2))) /
+ (2 * (1 - std::pow(rho, 2)))) -
+ std::log(2) - std::log(M_PI) - log(std::sqrt(1 - std::pow(rho, 2)));
+}
This means that I've simplified the PDF [3] from equation (1) into:
Since , equation (2) boils down to:
no more determinants or matrix inversions. Easy-peasy for C++.
Now let's go to the last, but not least: Stan
is a probabilistic language for specifying probabilistic models (does the same as Turing.jl
does) and comes also with a very fast C++-based MCMC sampler. Stan
is a personal favorite of mine and I have a whole graduate course of Bayesian statistics using Stan
. Here's the Stan
implementation:
functions {
+ real binormal_lpdf(real [] xy, real mu_X, real mu_Y, real sigma_X, real sigma_Y, real rho) {
+ real beta = rho * sigma_Y / sigma_X; real sigma = sigma_Y * sqrt(1 - square(rho));
+ return normal_lpdf(xy[1] | mu_X, sigma_X) +
+ normal_lpdf(xy[2] | mu_Y + beta * (xy[1] - mu_X), sigma);
+ }
+
+ matrix metropolis_rng(int S, real width,
+ real mu_X, real mu_Y,
+ real sigma_X, real sigma_Y,
+ real rho) {
+ matrix[S, 2] out; real x = normal_rng(0, 1); real y = normal_rng(0, 1); real accepted = 0;
+ for (s in 1:S) {
+ real xmw = x - width; real xpw = x + width; real ymw = y - width; real ypw = y + width;
+ real x_ = uniform_rng(xmw, xpw); real y_ = uniform_rng(ymw, ypw);
+ real r = exp(binormal_lpdf({x_, y_} | mu_X, mu_Y, sigma_X, sigma_Y, rho) -
+ binormal_lpdf({x , y } | mu_X, mu_Y, sigma_X, sigma_Y, rho));
+ if (r > uniform_rng(0, 1)) {
+ x = x_; y = y_; accepted += 1;
+ }
+ out[s, 1] = x; out[s, 2] = y;
+ }
+ print("Acceptance rate is ", accepted / S);
+ return out;
+ }
+}
Wow, that was lot... Not let's go to the results. I've benchmarked R and Stan
code using {bench}
and {rstan}
packages, C++ using catch2
, Julia using BenchmarkTools.jl
. For all benchmarks the parameters were: S = 10_000
simulations, width = 2.75
and ρ = 0.8
. From fastest to slowest:
Stan
- 3.6ms
Julia - 6.3ms
C++ BiNormal
- 17ms
C++ MvNormal
- 592ms
R - 1.35s which means 1350ms
Conclusion: a naïve Julia implementation beats C++ (while also beating a C++ math-helped faster implementation using bivariate normal PDFs) and gets very close to Stan
, a highly specialized probabilistic language that compiles and runs on C++ with lots of contributors, funding and development time invested.
Despite being blazing fast, Julia also codes very easily.
You can write and read code without much effort.
I think that this is the real game changer of Julia language: The ability to define function behavior across many combinations of argument types via multiple dispatch. Multiple dispatch is a feature that allows a function or method to be dynamically dispatched based on the run-time (dynamic) type or, in the more general case, some other attribute of more than one of its arguments. This is a generalization of single-dispatch polymorphism where a function or method call is dynamically dispatched based on the derived type of the object on which the method has been called. Multiple dispatch routes the dynamic dispatch to the implementing function or method using the combined characteristics of one or more arguments.
Most languages have single-dispatch polymorphism that rely on the first parameter of a method in order to determine which method should be called. But what Julia differs is that multiple parameters are taken into account. This enables multiple definitions of similar functions that have the same initial parameter. I think that this is best explained by one of the creators of Julia, Stefan Karpinski, at JuliaCon 2019 (see the video below):
I will reproduce Karpinski's example. In the talk, Karpinski designs a structure of classes which are very common in object-oriented programming (OOP). In Julia, we don't have classes but we have structures (struct
) that are meant to be "structured data": they define the kind of information that is embedded in the structure, that is a set of fields (aka "properties" or "attributes" in other languages), and then individual instances (or "objects") can be produced each with its own specific values for the fields defined by the structure.
We create an abstract type
called Pet
. Then, we proceed by creating two derived struct
from Pet
that has one field name
(a String
). These derived struct
are Dog
and Cat
. We also define some methods for what happens in an "encounter" by defining a generic function meets()
and several specific methods of meets()
that will be multiple dispatched by Julia in runtime to define the action that one type Pet
takes when it meets another Pet
:
abstract type Pet end
+struct Dog <: Pet
+ name::String
+end
+struct Cat <: Pet
+ name::String
+end
+
+function encounter(a::Pet, b::Pet)
+ verb = meets(a, b)
+ return println("$(a.name) meets $(b.name) and $verb")
+end
+
+meets(a::Dog, b::Dog) = "sniffs";
+meets(a::Dog, b::Cat) = "chases";
+meets(a::Cat, b::Dog) = "hisses";
+meets(a::Cat, b::Cat) = "slinks";
Let's see what happens when we instantiate objects from Dog
and Cat
and call encounter
on them in Julia:
fido = Dog("Fido");
+rex = Dog("Rex");
+whiskers = Cat("Whiskers");
+spots = Cat("Spots");
+
+encounter(fido, rex)
+encounter(rex, whiskers)
+encounter(spots, fido)
+encounter(whiskers, spots)
Fido meets Rex and sniffs
+Rex meets Whiskers and chases
+Spots meets Fido and hisses
+Whiskers meets Spots and slinks
+
It works as expected. Now let's translate this to modern C++ as literally as possible. Let's define a class Pet
with a member variable name
– in C ++ we can do this. Then we define a base function meets()
, a function encounter()
for two objects of the type Pet
, and finally, define derived classes Dog
and Cat
overload meets()
for them:
#include <iostream>
+#include <string>
+
+using std::string;
+using std::cout;
+
+class Pet {
+ public:
+ string name;
+};
+
+string meets(Pet a, Pet b) { return "FALLBACK"; } // If we use `return meets(a, b)` doesn't work
+
+void encounter(Pet a, Pet b) {
+ string verb = meets(a, b);
+ cout << a.name << " meets "
+ << b. name << " and " << verb << '\n';
+}
+
+class Cat : public Pet {};
+class Dog : public Pet {};
+
+string meets(Dog a, Dog b) { return "sniffs"; }
+string meets(Dog a, Cat b) { return "chases"; }
+string meets(Cat a, Dog b) { return "hisses"; }
+string meets(Cat a, Cat b) { return "slinks"; }
Now we add a main()
function to the C++ script:
int main() {
+ Dog fido; fido.name = "Fido";
+ Dog rex; rex.name = "Rex";
+ Cat whiskers; whiskers.name = "Whiskers";
+ Cat spots; spots.name = "Spots";
+
+ encounter(fido, rex);
+ encounter(rex, whiskers);
+ encounter(spots, fido);
+ encounter(whiskers, spots);
+
+ return 0;
+}
And this is what we get:
g++ main.cpp && ./a.out
+
+Fido meets Rex and FALLBACK
+Rex meets Whiskers and FALLBACK
+Spots meets Fido and FALLBACK
+Whiskers meets Spots and FALLBACK
Doesn't work... 🤷🏼
Now let's change to another nice example of creating a one-hot vector. One-hot vector is a vector of integers in which all indices are zero (0) except for one single index that is one (1). In machine learning, one-hot encoding is a frequently used method to deal with categorical data. Because many machine learning models need their input variables to be numeric, categorical variables need to be transformed in the pre-processing part. The example below is heavily inspired by a post from Vasily Pisarev[4].
How we would represent one-hot vectors in Julia? Simple: we create a new type OneHotVector
in Julia using the struct
keyword and define two fields len
and ind
, which represents the OneHotVector
length and which index is the entry 1 (i.e. which index is "hot"). Then, we define new methods for the Base
functions size()
and getindex()
for our newly defined OneHotVector
.
import Base: size, getindex
+
+struct OneHotVector <: AbstractVector{Int}
+ len::Int
+ ind::Int
+end
+
+size(v::OneHotVector) = (v.len,)
+
+getindex(v::OneHotVector, i::Integer) = Int(i == v.ind)
getindex (generic function with 928 methods)
+Since OneHotVector
is a struct
derived from AbstractVector
we can use all of the methods previously defined for AbstractVector
and it simply works right off the bat. Here we are constructing an Array
with a list comprehension:
onehot = [OneHotVector(3, rand(1:3)) for _ in 1:4]
4-element Vector{OneHotVector}:
+ [1, 0, 0]
+ [1, 0, 0]
+ [1, 0, 0]
+ [0, 0, 1]
+Now I define a new function inner_sum()
that is basically a recursive dot product with a summation. Here A – this is something matrix-like (although I did not indicate the types, and you can guess something only by the name), and vs
is a vector of some vector-like elements. The function proceeds by taking the dot product of the "matrix" with all vector-like elements of vs
and returning the accumulated values. This is all given generic definition without specifying any types. Generic programming here consists in this very function call inner()
in a loop.
using LinearAlgebra
+
+function inner_sum(A, vs)
+ t = zero(eltype(A))
+ for v in vs
+ t += inner(v, A, v) # multiple dispatch!
+ end
+ return t
+end
+
+inner(v, A, w) = dot(v, A * w) # very general definition
inner (generic function with 1 method)
+So, "look mom, it works":
+A = rand(3, 3)
+vs = [rand(3) for _ in 1:4]
+inner_sum(A, vs)
2.4426500326493357
+Since OneHotVector
is a subtype of AbstractVector
:
supertype(OneHotVector)
AbstractVector{Int64} (alias for AbstractArray{Int64, 1})
+We can use inner_sum
and it will do what it is supposed to do:
inner_sum(A, onehot)
1.604358084381523
+But this default implementation is slow:
+using BenchmarkTools
+
+@btime inner_sum($A, $onehot);
189.279 ns (4 allocations: 320 bytes)
+
+We can greatly optimize this procedure. Now let's redefine matrix multiplication by OneHotVector
with a simple column selection. We do this by defining a new method of the *
function (multiplier function) of Base
Julia. Additionally we also create a new optimized method of inner()
for dealing with OneHotVector
:
import Base: *
+
+*(A::AbstractMatrix, v::OneHotVector) = A[:, v.ind]
+inner(v::OneHotVector, A, w::OneHotVector) = A[v.ind, w.ind]
inner (generic function with 2 methods)
+That's it! Simple, huh? Now let's benchmark:
+@btime inner_sum($A, $onehot);
4.648 ns (0 allocations: 0 bytes)
+
+Huge gains of speed! 🚀
+Here are some more thoughts on why I believe Julia is the right approach to scientific computation.
+Below is a very opinionated image that divides the scientific computing languages that we've spoken so far in a 2x2 diagram with two axes: Slow-Fast and Easy-Hard. I've put C++ and FORTRAN in the hard and fast quadrant. R and Python goes into the easy and slow quadrant. Julia is the only language in the easy and fast quadrant. I don't know any language that would want to be hard and slow, so this quadrant is empty.
+What I want to say with this image is that if you want to code fast and easy use Julia.
+One other thing to note that I find quite astonishing is that Julia packages are all written in Julia. This does not happen in other scientific computing languages. For example, the whole {tidyverse}
ecosystem of R packages are based on C++. NumPy
and SciPy
are a mix of FORTRAN and C. Scikit-Learn
is also coded in C.
See the figure below where I compare the GitHub's "Languages" stack bar of PyTorch
, TensorFlow
and Flux.jl
(Julia's Deep Learning package). This figure I would call "Python my a**!" 😂:
ccall
function (to call C and Fortran libraries) to JuliaInterop[5] packages that connect Julia to Python, R, Matlab, C++, and more.Another example comes from a Julia podcast that unfortunately I cannot recollect either what podcast was nor who was being interviewed. While being asked about how he joined the Julia bandwagon, he replied something in the likes:
+++"Well, I was doing some crazy calculation using a library that was a wrapper to an algorithm coded in FORTRAN and I was trying to get help with a bug. I opened an issue and after 2 weeks of no reply I've dived into the FORTRAN code (despite having no knowledge of FORTRAN). There I saw a comment from the original author describing exactly the same bug that I was experiencing and saying that he would fix this in the future. The comment was dated from 1992. At the same time a colleague of mine suggested that I could try to code the algorithm in some new language called Julia. I thought 'me?! code an algorithm?!'. So, I coded the algorithm in Julia and it was faster than the FORTRAN implementation and also without the evil bug. One thing to note that it was really easy to code the algorithm in Julia."
+
Having stuff in different language and wrappers can hinder further research as you can see from this example.
+As you saw from the Karpinski's talk above, multiple dispatch empower users to define their own types (if necessary) and also allows them to extend functions and types from other users to their own special use. This results in an ecosystem that stimulates code sharing and code reuse in scientific computing that is unmatched. For instance, if I plug a differential equation from DifferentialEquations.jl
into a Turing.jl
model I get a Bayesian stochastic differential equation model, e.g. Bayesian SIR model for infectious disease. If I plug a Flux.jl
neural network into a Turing.jl
model I get a Bayesian neural network! When I saw this type of code sharing I was blown away (and I still am).
[1] + | please note that I've used updated versions for all languages and packages as of April, 2021. DataFrames.jl version 1.0.1, Pandas version 1.2.4, NumPy version 1.20.2, {dplyr} version 1.0.5. We did not cover R's {data.table} here. Further benchmarking information is available for example here: Tabular data benchmarking
+
+ |
[2] + | which of course I did not. The Mvn class is inspired by Iason Sarantopoulos' implementation.
+
+ |
[3] + | you can find all the math here. + + |
[4] + | the post in Russian, I've "Google Translated" it to English. + + |
[5] + | Julia has a lot of interoperability between languages. Check out: PyCall.jl and JuliaPy for Python; RCall.jl for Java; Cxx.jl and CxxWrap.jl for C++; Clang.jl for libclang and C; ObjectiveC.jl for Objective-C; JavaCall.jl for Java; MATLAB.jl for MATLAB; MathLink.jl for Mathematica/Wolfram Engine; OctCall.jl for GNU Octave; and ZMQ.jl for ZeroMQ.
+
+ |
Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1), 65–98.
+ +Bayesian statistics is an approach to inferential statistics based on Bayes' theorem, where available knowledge about parameters in a statistical model is updated with the information in observed data. The background knowledge is expressed as a prior distribution and combined with observational data in the form of a likelihood function to determine the posterior distribution. The posterior can also be used for making predictions about future events.
Bayesian statistics is a departure from classical inferential statistics that prohibits probability statements about parameters and is based on asymptotically sampling infinite samples from a theoretical population and finding parameter values that maximize the likelihood function. Mostly notorious is null-hypothesis significance testing (NHST) based on p-values. Bayesian statistics incorporate uncertainty (and prior knowledge) by allowing probability statements about parameters, and the process of parameter value inference is a direct result of the Bayes' theorem.
Bayesian statistics is revolutionizing all fields of evidence-based science[1] (van de Schoot et al., 2021). Dissatisfaction with traditional methods of statistical inference (frequentist statistics) and the advent of computers with exponential growth in computing power[2] provided a rise in Bayesian statistics because it is an approach aligned with the human intuition of uncertainty, robust to scientific malpractices, but computationally intensive.
But before we get into Bayesian statistics, we have to talk about probability: the engine of Bayesian inference.
PROBABILITY DOES NOT EXIST!
de Finetti (1974)[3]
These are the first words in the preface to the famous book by Bruno de Finetti (figure below), one of the most important probability mathematician and philosopher. Yes, probability does not exist. Or rather, probability as a physical quantity, objective chance, does NOT exist. De Finetti showed that, in a precise sense, if we dispense with the question of objective chance nothing is lost. The mathematics of inductive reasoning remains exactly the same.
Consider tossing a weighted coin. The attempts are considered independent and, as a result, exhibit another important property: the order does not matter. To say that order does not matter is to say that if you take any finite sequence of heads and tails and exchange the results however you want, the resulting sequence will have the same probability. We say that this probability is invariant under permutations.
Or, to put it another way, the only thing that matters is the relative frequency. Result that have the same frequency of heads and tails consequently have the same probability. The frequency is considered a sufficient statistic. Saying that order doesn't matter or saying that the only thing that matters is frequency are two ways of saying exactly the same thing. This property is called exchangeability by de Finetti. And it is the most important property of probability that makes it possible for us to manipulate it mathematically (or philosophically) even if it does not exist as a physical "thing".
Still developing the argument:
"Probabilistic reasoning - always understood as subjective - stems merely from our being uncertain about something. It makes no difference whether the uncertainty relates to an unforeseeable future [4], or to an unnoticed past, or to a past doubtfully reported or forgotten [5]... The only relevant thing is uncertainty - the extent of our own knowledge and ignorance. The actual fact of whether or not the events considered are in some sense determined, or known by other people, and so on, is of no consequence." (de Finetti, 1974)
In conclusion: no matter what the probability is, you can use it anyway, even if it is an absolute frequency (ex: probability that I will ride my bike naked is ZERO because the probability that an event that never occurred will occur in the future it is ZERO) or a subjective guess (ex: maybe the probability is not ZERO, but 0.00000000000001; very unlikely, but not impossible).
With the philosophical intuition of probability elaborated, we move on to mathematical intuitions. The probability of an event is a real number[6], between 0 and 1, where, roughly, 0 indicates the impossibility of the event and 1 indicates the certainty of the event. The greater the likelihood of an event, the more likely it is that the event will occur. A simple example is the tossing of a fair (impartial) coin. Since the coin is fair, both results ("heads" and "tails") are equally likely; the probability of "heads" is equal to the probability of "tails"; and since no other result is possible[7], the probability of "heads" or "tails" is (which can also be written as 0.5 or 50%).
Regarding notation, we define as an event and as the probability of event , thus:
This means the "probability of the event to occur is the set of all real numbers between 0 and 1; including 0 and 1". In addition, we have three axioms[8], originated from Kolmogorov(1933) (figure below):
Non-negativity: For all , . Every probability is positive (greater than or equal to zero), regardless of the event.
Additivity: For two mutually exclusive and (cannot occur at the same time[9]): and .
Normalization: The probability of all possible events must add up to 1: .
With these three simple (and intuitive) axioms, we are able to derive and construct all the mathematics of probability.
An important concept is the conditional probability that we can define as the "probability that one event will occur if another has occurred or not". The notation we use is , which reads as "the probability that we have observed given we have already observed ".
A good example is the Texas Hold'em Poker game, where the player receives two cards and can use five "community cards" to set up his "hand". The probability that you are dealt a King () is :
And the probability of being dealt an Ace is also the same as (2), :
However, the probability that you are dealt a King as a second card since you have been dealt an Ace as a first card is:
Since we have one less card () because you have been dealt already an Ace (thus has been observed), we have 4 Kings still in the deck, so the (4) is .
Conditional probability leads us to another important concept: joint probability. Joint probability is the "probability that two events will both occur". Continuing with our Poker example, the probability that you will receive two cards, Ace () and a King () as two starting cards is:
Note that :
But this symmetry does not always exist (in fact it very rarely exists). The identity we have is as follows:
So this symmetry only exists when the baseline rates for conditional events are equal:
Which is what happens in our example.
Let's see a practical example. For example, I’m feeling good and start coughing in line at the supermarket. What do you think will happen? Everyone will think I have COVID, which is equivalent to thinking about . Seeing the most common symptoms of COVID, if you have COVID, the chance of coughing is very high. But we actually cough a lot more often than we have COVID – , so:
This is the last concept of probability that we need to address before diving into Bayesian statistics, but it is the most important. Note that it is not a semantic coincidence that Bayesian statistics and Bayes' theorem have the same prefix.
Thomas Bayes (1701 - 1761, figure below) was an English Presbyterian statistician, philosopher and minister known for formulating a specific case of the theorem that bears his name: Bayes' theorem. Bayes never published what would become his most famous accomplishment; his notes were edited and published after his death by his friend Richard Price[10]. In his later years, Bayes was deeply interested in probability. Some speculate that he was motivated to refute David Hume's argument against belief in miracles based on evidence from the testimony in "An Inquiry Concerning Human Understanding".
Let's move on to Theorem. Remember that we have the following identity in probability:
Ok, now move in the right of (11) to the left as a division:
And the final result is:
Bayesian statistics uses this theorem as inference engine of parameters of a model conditioned on observed data.
Everything that has been exposed so far is based on the assumption that the parameters are discrete. This was done in order to provide a better intuition of what is probability. We do not always work with discrete parameters. The parameters can be continuous, for example: age, height, weight, etc. But don't despair, all the rules and axioms of probability are also valid for continuous parameters. The only thing we have to do is to exchange all the sums for integrals . For example, the third axiom of Normalization for continuous random variables becomes:
Now that you know what probability is and what Bayes' theorem is, I will propose the following model:
where:
– parameter(s) of interest;
– observed data;
Prior – previous probability of the parameter value(s)[11] ;
Likelihood – probability of the observed data conditioned on the parameter value(s) ;
Posterior – posterior probability of the parameter value(s) after observing the data ; and
Normalizing Constant – does not make intuitive sense. This probability is transformed and can be interpreted as something that exists only so that the result of is somewhere between 0 and 1 – a valid probability by the axioms. We will talk more about this constant in 5. Markov Chain Monte Carlo (MCMC).
Bayesian statistics allow us to directly quantify the uncertainty related to the value of one or more parameters of our model conditioned to the observed data. This is the main feature of Bayesian statistics, for we are directly estimating using Bayes' theorem. The resulting estimate is totally intuitive: it simply quantifies the uncertainty we have about the value of one or more parameters conditioned on the data, the assumptions of our model (likelihood) and the previous probability(prior) we have about such values.
To contrast with Bayesian statistics, let's look at the frequentist statistics, also known as "classical statistics". And already take notice: it is not something intuitive like the Bayesian statistics.
For frequentist statistics, the researcher is prohibited from making probabilistic conjectures about parameters. Because they are not uncertain, quite the contrary they are determined quantities. The only issue is that we do not directly observe the parameters, but they are deterministic and do not allow any margin of uncertainty. Therefore, for the frequentist approach, parameters are unobserved amounts of interest in which we do not make probabilistic conjectures.
What, then, is uncertain in frequentist statistics? Short answer: the observed data. For the frequentist approach, the sample is uncertain. Thus, we can only make probabilistic conjectures about our sample. Therefore, the uncertainty is expressed in the probability that I will obtain data similar to those that I obtained if I sampled from a population of interest infinite samples of the same size as my sample[12]. Uncertainty is conditioned by a frequentist approach, in other words, uncertainty only exists only if I consider an infinite sampling process and extract a frequency from that process. The probability only exists if it represents a frequency. Frequentist statistics is based on an "infinite sampling process of a population that I have never seen", strange that it may sounds.
For the frequentist approach, there is no posterior or prior probability since both involve parameters, and we saw that this is a no-no on frequentist soil. Everything that is necessary for statistical inference is contained within likelihood[13].
In addition, for reasons of ease of computation, since most of these methods were invented in the first half of the 20th century (without any help of a computer), only the parameter value(s) that maximize(s) the likelihood function is(are) computed[14]. From this optimization process we extracted the mode of the likelihood function (i.e. maximum value). The maximum likelihood estimate (MLE) is(are) the parameter value(s) so that a -sized sample randomly sampled from a population (i.e. the data you've collected) is the most likely -sized sample from that population. All other potential samples that could be extracted from this population will have a worse estimate than the sample you actually have[15]. In other words, we are conditioning the parameter value(s) on the observed data from the assumption that we are sampling infinite -sized samples from a theoretical population and treating the parameter values as fixed and our sample as random (or uncertain).
The mode works perfectly in the fairytale world, which assumes that everything follows a normal distribution, in which the mode is equal to the median and to the mean – . There is only one problem, this assumption is rarely true (see figure below), especially when we are dealing with multiple parameters with complex relationships between them (complex models).
A brief sociological and computational explanation is worthwhile of why frequentist (classical) statistics prohibit probabilistic conjectures about parameters and we are only left with optimizing (finding the maximum value of a function) rather than approximating or estimating a complete likelihood density (in other words, "to pull up the whole file" of the likelihood verisimilitude instead of just the mode).
On the sociological side of things, science at the beginning of the 20th century assumed that it should be strictly objective and all subjectivity must be banned. Therefore, since the estimation of the a posterior probability of parameters involves elucidating an a prior probability of parameters, such a method should not be allowed in science, as it brings subjectivity (we know today that nothing in human behavior is purely objective, and subjectivity permeates all human endeavors).
Regarding the computational side of things, in the 1930s without computers it was much easier to use strong assumptions about the data to get a value from a statistical estimation using mathematical derivations than to calculate the statistical estimation by hand without depending on such assumptions. For example: Student's famous test is a test that indicates when we can reject that the mean of a certain parameter of interest between two groups is equal (famous null hypothesis - ). This test starts from the assumption that if the parameter of interest is distributed according to a normal distribution (assumption 1 – normality of the dependent variable), if the variance of the parameter of interest varies homogeneously between groups (assumption 2 – homogeneity of the variances), and if the number of observations in the two groups are similar (assumption 3 – homogeneity of the size of the groups) the difference between the groups weighted by the variance of the groups follows a Student- distribution (hence the name of the test).
So statistical estimation comes down to calculating the average of two groups, the variance of both groups for a parameter of interest and looking for the associated -value in a table and see if we can reject the . This was valid when everything we had to do was calculated by hand. Today, with a computer 1 million times more powerful than the Apollo 11 computer (one that took humanity to the moon) in your pocket [2], I don't know if it is still valid.
-values are hard to understand, .
Since I've mentioned the word, let me explain what it is. First, the correct[16] statistics textbook definition:
-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.
Unfortunately with frequentist statistics you have to choose one of two qualities for explanations: intuitive or accurate[17].
If you write this definition in any test, book or scientific paper, you are 100% accurate and correct in defining what a -value is. Now, understanding this definition is complicated. For that, let's break this definition down into parts for a better understanding:
"probability of obtaining test results..": notice that -values are related your data and not your theory or hypothesis.
"...at least as extreme as the results actually observed...": "at least as extreme" implies defining a threshold for the characterization of some relevant finding, which is commonly called . We generally stipulate alpha at 5% () and anything more extreme than alpha (ie less than 5%) we characterize as significant.
"... under the assumption that the null hypothesis is correct.": every statistical test that has a -value has a null hypothesis (usually written as ). Null hypotheses, always have to do with some null effect. For example, the null hypothesis of the Shapiro-Wilk and Komolgorov-Smirnov test is "the data is distributed according to a Normal distribution" and that of the Levene test is "the group variances are equal". Whenever you see a -value, ask yourself: "What is the null hypothesis that this test assumes is correct?".
To understand the -value any statistical test first find out what is the null hypothesis behind that test. The definition of -value will not change. In every test it is always the same. What changes with the test is the null hypothesis. Each test has its . For example, some common statistical tests ( = data at least as extreme as the results actually observed):
Test:
ANOVA:
Regression:
Shapiro-Wilk:
-value is the probability of the data you obtained given that the null hypothesis is true. For those who like mathematical formalism: . In English, this expression means "the probability of conditioned to ". Before moving on to some examples and attempts to formalize an intuition about -values, it is important to note that -values say something about data and not hypotheses. For -value, the null hypothesis is true, and we are only evaluating whether the data conforms to this null hypothesis or not. If you leave this tutorial armed with this intuition, the world will be rewarded with researchers better prepared to qualify and interpret evidence ().
Intuitive Example of a -value:
Imagine that you have a coin that you suspect is biased towards a higher probability of flipping "heads" than "tails". (Your null hypothesis is then that the coin is fair.) You flip the coin 100 times and get more heads than tails. The -value will not tell you if the coin is fair, but it will tell you the probability that you will get at least as many heads as if the coin were fair. That's it - nothing more.
There is no way to understand -values if we do not understand its origins and historical trajectory. The first mention of the term was made by statistician Ronald Fisher[18] (figure below). In 1925, Fisher (Fisher, 1925) defined the -value as an "index that measures the strength of the evidence against the null hypothesis". To quantify the strength of the evidence against the null hypothesis, Fisher defended " (5% significance) as a standard level to conclude that there is evidence against the tested hypothesis, although not as an absolute rule". Fisher did not stop there but rated the strength of the evidence against the null hypothesis. He proposed "if is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05". Since Fisher made this statement almost 100 years ago, the 0.05 threshold has been used by researchers and scientists worldwide and it has become ritualistic to use 0.05 as a threshold as if other thresholds could not be used or even considered.
After that, the threshold of 0.05, now established as unquestionable, strongly influenced statistics and science. But there is no reason against adopting other thresholds () like 0.1 or 0.01 (Lakens et al., 2018). If well argued, the choice of thresholds other than 0.05 can be welcomed by editors, reviewers and advisors. Since the -value is a probability, it is a continuous quantity. There is no reason to differentiate a of 0.049 against a of 0.051. Robert Rosenthal, a psychologist already said "God loves 0.06 as much as a 0.05" (Rosnow & Rosenthal, 1989).
In the last year of his life, Fisher published an article (Fisher, 1962) examining the possibilities of Bayesian methods, but with a prior probabilities to be determined experimentally. Some authors even speculate (Jaynes, 2003) that if Fisher were alive today, he would probably be a "Bayesian".
With the definition and a well-anchored intuition of what -value is, we can move on to what the -value is not!
-value is not the probability of the null hypothesis - Famous confusion between and . -value is not the probability of the null hypothesis, but the probability of the data you obtained. To get the you need Bayesian statistics.
-value is not the probability of the data being produced by chance - No! Nobody said anything of "being produced by chance". Again: -value is the probability to get results at least as extreme as those that were observed, given that the null hypothesis is true.
-value measures the size of the effect of a statistical test - Also wrong... -value does not say anything about the size of the effect. Just about whether the observed data differs from what is expected under the null hypothesis. It is clear that large effects are more likely to be statistically significant than small effects. But this is not a rule and never judge a finding by its -value, but by its effect size. In addition, -values can be "hacked" in a number of ways (Head et al., 2015) and often their value is a direct consequence of the sample size.
To conclude, let's talk about the famous confidence intervals, which are not a measure that quantifies the uncertainty of the value of a parameter (remember probabilistic conjectures about parameters are prohibited in frequentist-land). Wait for it! Here is the definition of confidence intervals:
"An X% confidence interval for a parameter is an interval generated by a procedure that in repeated sampling has an X% probability of containing the true value of , for all possible values of ."
Jerzy Neyman, the "father" of confidence intervals (see figure below) (Neyman, 1937).
Again the idea of sampling an infinite number of times from a population you have never seen. For example: let's say that you performed a statistical analysis to compare the effectiveness of a public policy in two groups and you obtained the difference between the average of those groups. You can express this difference as a confidence interval. We generally choose 95% confidence (since it is analogous as ). You then write in your paper that the "observed difference between groups is 10.5 - 23.5 (95% CI)." This means that approximately 95 studies out of 100 would compute a confidence interval that contains the true mean difference –- but it says nothing about which ones those are (whereas the data might). In other words, 95% is not the probability of obtaining data such that the estimate of the true parameter is contained in the interval that we obtained, it is the probability of obtaining data such that, if we compute another confidence interval in the same way, it contains the true parameter. The interval that we got in this particular instance is irrelevant and might as well be thrown away.
Bayesian statistics have a concept similar to the confidence intervals of frequentist statistics. This concept is called credibility interval. And, unlike the confidence interval, its definition is intuitive. Credibility interval measures an interval in which we are sure that the value of the parameter of interest is, based on the likelihood conditioned on the observed data - ; and the prior probability of the parameter of interest - . It is basically a "slice" of the posterior probability of the parameter restricted to a certain level of certainty. For example: a 95% credibility interval shows the interval that we are 95% sure that captures the value of our parameter of interest. That simple...
For example, see figure below, which shows a Log-Normal distribution with mean 0 and standard deviation 2. The green dot shows the maximum likelihood estimation (MLE) of the value of which is simply the mode of distribution. And in the shaded area we have the 50% credibility interval of the value of , which is the interval between the 25% percentile and the 75% percentile of the probability density of . In this example, MLE leads to estimated values that are not consistent with the actual probability density of the value of .
using CairoMakie
+using Distributions
+
+d = LogNormal(0, 2)
+range_d = 0:0.001:4
+q25 = quantile(d, 0.25)
+q75 = quantile(d, 0.75)
+credint = range(q25; stop=q75, length=100)
+f, ax, l = lines(
+ range_d,
+ pdf.(d, range_d);
+ linewidth=3,
+ axis=(; limits=(-0.2, 4.2, nothing, nothing), xlabel=L"\theta", ylabel="Density"),
+)
+scatter!(ax, mode(d), pdf(d, mode(d)); color=:green, markersize=12)
+band!(ax, credint, 0.0, pdf.(d, credint); color=(:steelblue, 0.5))
Now an example of a multimodal distribution[19]. The figure below shows a bimodal distribution with two modes 2 and 10[20] The green dot shows the maximum likelihood estimation (MLE) of the value of which is the mode of distribution. See that even with 2 modes, maximum likelihood defaults to the highest mode[21]. And in the shaded area we have the 50% credibility interval of the value of , which is the interval between the 25% percentile and the 75% percentile of the probability density of . In this example, estimation by maximum likelihood again lead us to estimated values that are not consistent with the actual probability density of the value of .
d1 = Normal(10, 1)
+d2 = Normal(2, 1)
+mix_d = [0.4, 0.6]
+d = MixtureModel([d1, d2], mix_d)
+range_d = -2:0.01:14
+sim_d = rand(d, 10_000)
+q25 = quantile(sim_d, 0.25)
+q75 = quantile(sim_d, 0.75)
+credint = range(q25; stop=q75, length=100)
+
+f, ax, l = lines(
+ range_d,
+ pdf.(d, range_d);
+ linewidth=3,
+ axis=(;
+ limits=(-2, 14, nothing, nothing),
+ xticks=[0, 5, 10],
+ xlabel=L"\theta",
+ ylabel="Density",
+ ),
+)
+scatter!(ax, mode(d2), pdf(d, mode(d2)); color=:green, markersize=12)
+band!(ax, credint, 0.0, pdf.(d, credint); color=(:steelblue, 0.5))
What we've seen so fat can be resumed in the table below:
Bayesian Statistics | Frequentist Statistics | |
---|---|---|
Data | Fixed – Non-random | Uncertain – Random |
Parameters | Uncertain – Random | Fixed – Non-random |
Inference | Uncertainty over parameter values | Uncertainty over a sampling procedure from an infinite population |
Probability | Subjective | Objective (but with strong model assumptions) |
Uncertainty | Credible Interval – | Confidence Interval – |
Finally, I summarize the main advantages of Bayesian statistics:
Natural approach to express uncertainty
Ability to incorporate prior information
Greater model flexibility
Complete posterior distribution of parameters
Confidence Intervals vs Credibility Intervals
Natural propagation of uncertainty
And I believe that I also need to show the main disadvantage:
Slow model estimation speed (30 seconds instead of 3 seconds using the frequentist approach)
Dear reader, know that you are at a time in history when Statistics is undergoing major changes. I believe that frequentist statistics, especially the way we qualify evidence and hypotheses with -values, will transform in a "significant" way. Five years ago, the American Statistical Association (ASA, the world's largest professional statistical organization) published a statement on -values (Wasserstein & Lazar, 2016). The statement says exactly what we talk about here. The main concepts of the null hypothesis significance test, and in particular -values, fail to provide what researchers require of them. Despite what many statistical books, teaching materials and published articles say, -values below 0.05 do not "prove" the reality of anything. Nor, at this point, do the -values above 0.05 refute anything. ASA's statement has more than 3,600 citations causing significant impact. As an example, an international symposium was held in 2017 that led to a special open access edition of The American Statistician dedicated to practical ways to abandon (Wasserstein, Schirm & Lazar 2019).
Soon after, more attempts and claims followed. In September 2017, Nature Human Behavior published an editorial proposing that the significance level of the -value be reduced from to (Benjamin et al., 2018). Several authors, including many highly influential and important statisticians, have argued that this simple step would help tackle the problem of the science replicability crisis, which many believe is the main consequence of the abusive use of -values (Ioannidis, 2019). In addition, many have gone a step further and suggest that science discard once and for all -values (Nature, 2019). Many suggest (myself included) that the main inference tool be Bayesian statistics (Amrhein, Greenland & McShane, 2019; Goodman, 2016; van de Schoot et al., 2021)
Turing (Ge, Xu & Ghahramani, 2018) is a probabilistic programming interface written in Julia (Bezanson, Edelman, Karpinski & Shah, 2017). It enables intuitive modeling syntax with flexible composable probabilistic programming inference. Turing supports a wide range of sampling based inference algorithms by combining model inference with differentiable programming interfaces in Julia. Yes, Julia is that amazing. The same differentiable stuff that you develop for optimization in neural networks you can plug it in to a probabilistic programming framework and it will work without much effort and boiler-plate code. Most importantly, Turing inference is composable: it combines Markov chain sampling operations on subsets of model variables, e.g. using a combination of a Hamiltonian Monte Carlo (HMC) engine and a particle Gibbs (PG) engine. This composable inference engine allows the user to easily switch between black-box style inference methods such as HMC, and customized inference methods.
I believe Turing is the most important and popular probabilistic language framework in Julia. It is what PyMC3 and Stan are for Python and R, but for Julia. Furthermore, you don't have to do "cartwheels" with Theano backends and tensors like in PyMC3 or learn a new language to declare your models like in Stan (or even have to debug C++ stuff). Turing is all Julia. It uses Julia arrays, Julia distributions, Julia autodiff, Julia plots, Julia random number generator, Julia MCMC algorithms etc. I think that developing and estimating Bayesian probabilistic models using Julia and Turing is powerful, intuitive, fun, expressive and allows easily new breakthroughs simply by being 100% Julia and embedded in Julia ecosystem. As discussed in 1. Why Julia?, having multiple dispatch with LLVM's JIT compilation allows us to combine code, types and algorithms in a very powerful and yet simple way. By using Turing in this context, a researcher (or a curious Joe) can develop new methods and extend the frontiers of Bayesian inference with new models, samplers, algorithms, or any mix-match of those.
[1] | personally, like a good Popperian, I don't believe there is science without being evidence-based; what does not use evidence can be considered as logic, philosophy or social practices (no less or more important than science, just a demarcation of what is science and what is not; eg, mathematics and law). |
[2] | your smartphone (iPhone 12 - 4GB RAM) has 1,000,000x (1 million) more computing power than the computer that was aboard the Apollo 11 (4kB RAM) which took the man to the moon. Detail: this on-board computer was responsible for lunar module navigation, route and controls. |
[3] | if the reader wants an in-depth discussion see Nau (2001). |
[4] | my observation: related to the subjective Bayesian approach. |
[5] | my observation: related to the objective frequentist approach. |
[6] | a number that can be expressed as a point on a continuous line that originates from minus infinity and ends and plus infinity ; for those who like computing it is a floating point float ordouble . |
[7] | i.e. the events are "mutually exclusive". That means that only or can occur in the whole sample space. |
[8] | in mathematics, axioms are assumptions assumed to be true that serve as premises or starting points for the elaboration of arguments and theorems. Often the axioms are questionable, for example non-Euclidean geometry refutes Euclid's fifth axiom on parallel lines. So far there is no questioning that has supported the scrutiny of time and science about the three axioms of probability. |
[9] | for example, the result of a given coin is one of two mutually exclusive events: heads or tails. |
[10] | the formal name of the theorem is Bayes-Price-Laplace, as Thomas Bayes was the first to discover, Richard Price took his drafts, formalized in mathematical notation and presented to the Royal Society of London, and Pierre Laplace rediscovered the theorem without having had previous contact in the late 18th century in France by using probability for statistical inference with Census data in the Napoleonic era. |
[11] | I will cover prior probabilities in the content of tutorial 4. How to use Turing. |
[12] | I warned you that it was not intuitive... |
[13] | something worth noting: likelihood also carries a lot of subjectivity. |
[14] | for those who have a thing for mathematics (like myself), we calculate at which point of the derivative of the likelihood function is zero - . So we are talking really about an optimization problem that for some likelihood functions we can have a closed-form analytical solution. |
[15] | have I forgot to warn you that it is not so intuitive? |
[16] | there are several statistics textbooks that have wrong definitions of what a -value is. If you don't believe me, see Wasserstein & Lazar (2016). |
[17] | this duality is attributed to Andrew Gelman – Bayesian statistician. |
[18] | Ronald Fisher's personality and life controversy deserves a footnote. His contributions were undoubtedly crucial to the advancement of science and statistics. His intellect was brilliant and his talent already flourished young: before turning 33 years old he had proposed the maximum likelihood estimation method (MLE) (Stigler, 2007) and also created the concept of degrees of freedom when proposing a correction in Pearson's chi-square test (Baird, 1983). He also invented the Analysis of Variance (ANOVA) and was the first to propose randomization as a way of carrying out experiments, being considered the "father" of randomized clinical trials (RCTs). Not everything is golden in Fisher's life, he was a eugenicist and had a very strong view on ethnicity and race, advocating the superiority of certain ethnicities. Furthermore, he was extremely invariant, chasing, harming and mocking any critic of his theories and publications. What we see today in the monopoly of the Neyman-Pearson paradigm (Neyman & Pearson, 1933) with -values and null hypotheses the result of this Fisherian effort to silence critics and let only his voice echo. |
[19] | which is not uncommon to see in the real world. |
[20] | for the curious it is a mixture of two normal distributions both with standard deviation 1, but with different means. To complete it assigns the weights of 60% for the distribution with an average of 2 and 40% for the distribution with an average of 10. |
[21] | to be more precise, estimation by maximum likelihood in non-convex functions cannot find an analytical solution and, if we are going to use another iterative maximization procedure, there is a risk of it becoming stuck in the second – lower-valued – mode of distribution. |
Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. Nature, 567(7748), 305–307. https://doi.org/10.1038/d41586-019-00857-9
Baird, D. (1983). The fisher/pearson chi-squared controversy: A turning point for inductive inference. The British Journal for the Philosophy of Science, 34(2), 105–118.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10. https://doi.org/10.1038/s41562-017-0189-z
Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1), 65–98.
de Finetti, B. (1974). Theory of Probability. New York: John Wiley & Sons.
Eckhardt, R. (1987). Stan Ulam, John von Neumann, and the Monte Carlo Method. Los Alamos Science, 15(30), 131–136.
Fisher, R. A. (1925). Statistical methods for research workers. Oliver; Boyd.
Fisher, R. A. (1962). Some Examples of Bayes’ Method of the Experimental Determination of Probabilities A Priori. Journal of the Royal Statistical Society. Series B (Methodological), 24(1), 118–124. Retrieved from https://www.jstor.org/stable/2983751
Ge, H., Xu, K., & Ghahramani, Z. (2018). Turing: A Language for Flexible Probabilistic Inference. International Conference on Artificial Intelligence and Statistics, 1682–1690. http://proceedings.mlr.press/v84/ge18b.html
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC.
Goodman, S. N. (2016). Aligning statistical and scientific reasoning. Science, 352(6290), 1180–1181. https://doi.org/10.1126/science.aaf5406
Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLoS Biol, 13(3), e1002106.
Ioannidis, J. P. A. (2019). What Have We (Not) Learnt from Millions of Scientific Papers with <i>P</i> Values? The American Statistician, 73(sup1), 20–25. https://doi.org/10.1080/00031305.2018.1447512
It’s time to talk about ditching statistical significance. (2019). Nature, 567(7748, 7748), 283–283. https://doi.org/10.1038/d41586-019-00874-8
Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge university press.
Kolmogorov, A. N. (1933). Foundations of the Theory of Probability. Berlin: Julius Springer.
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2(3), 168–171. https://doi.org/10.1038/s41562-018-0311-x
Nau, R. F. (2001). De Finetti was Right: Probability Does Not Exist. Theory and Decision, 51(2), 89–124. https://doi.org/10.1023/A:1015525808214
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 236(767), 333–380.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706), 289–337.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276–1284.
Stigler, S. M. (2007). The epic story of maximum likelihood. Statistical Science, 22(4), 598–620.
van de Schoot, R., Depaoli, S., King, R., Kramer, B., Märtens, K., Tadesse, M. G., … Yau, C. (2021). Bayesian statistics and modelling. Nature Reviews Methods Primers, 1(1, 1), 1–26. https://doi.org/10.1038/s43586-020-00001-2
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s Statement on p-Values: Context, Process, and Purpose. American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond "p < 0.05." American Statistician, 73, 1–19. https://doi.org/10.1080/00031305.2019.1583913
Bayesian statistics uses probability distributions as the inference "engine" for the estimation of the parameter values along with their uncertainties.
Imagine that probability distributions are small pieces of "Lego". We can build whatever we want with these little pieces. We can make a castle, a house, a city; literally anything we want. The same is true for probabilistic models in Bayesian statistics. We can build models from the simplest to the most complex using probability distributions and their relationships to each other. In this tutorial we will give a brief overview of the main probabilistic distributions, their mathematical notation and their main uses in Bayesian statistics.
A probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon in terms of its sample space and the probabilities of events (subsets of the sample space).
We generally use the notation X ~ Dist (par1, par2, ...)
. Where X
is the variable,Dist
is the name of the distribution, and par
are the parameters that define how the distribution behaves. Any probabilistic distribution can be "parameterized" by specifying parameters that allow us to shape some aspects of the distribution for some specific purpose.
Let's start with discrete distributions and then we'll address the continuous ones.
Discrete probability distributions are those where the results are discrete numbers (also called whole numbers): and . In discrete distributions we say the probability that a distribution takes certain values as "mass". The probability mass function is the function that specifies the probability of the random variable taking the value :
The discrete uniform distribution is a symmetric probability distribution in which a finite number of values are equally likely to be observed. Each of the values has an equal probability . Another way of saying "discrete uniform distribution" would be "a known and finite number of results equally likely to happen".
The discrete uniform distribution has two parameters and its notation is :
Lower Bound ()
Upper Bound ()
Example: a 6-sided dice.
using CairoMakie
+using Distributions
+
+f, ax, b = barplot(
+ DiscreteUniform(1, 6);
+ axis=(;
+ title="6-sided Dice",
+ xlabel=L"\theta",
+ ylabel="Mass",
+ xticks=1:6,
+ limits=(nothing, nothing, 0, 0.3),
+ ),
+)
Bernoulli's distribution describes a binary event of a successful experiment. We usually represent as failure and as success, so the result of a Bernoulli distribution is a binary variable .
The Bernoulli distribution is widely used to model discrete binary outcomes in which there are only two possible results.
Bernoulli's distribution has only a single parameter and its notation is :
Success Probability ()
Example: Whether the patient survived or died or whether the customer completes their purchase or not.
f, ax1, b = barplot(
+ Bernoulli(0.5);
+ width=0.3,
+ axis=(;
+ title=L"p=0.5",
+ xlabel=L"\theta",
+ ylabel="Mass",
+ xticks=0:1,
+ limits=(nothing, nothing, 0, 1),
+ ),
+)
+ax2 = Axis(
+ f[1, 2]; title=L"p=0.2", xlabel=L"\theta", xticks=0:1, limits=(nothing, nothing, 0, 1)
+)
+barplot!(ax2, Bernoulli(0.2); width=0.3)
+linkaxes!(ax1, ax2)
The binomial distribution describes an event of the number of successes in a sequence of independent experiment(s), each asking a yes-no question with a probability of success . Note that the Bernoulli distribution is a special case of the binomial distribution where the number of experiments is .
The binomial distribution has two parameters and its notation is or :
Number of Experiment(s) ()
Probability of Success ()
Example: number of heads in 5 coin flips.
f, ax1, b = barplot(
+ Binomial(5, 0.5); axis=(; title=L"p=0.5", xlabel=L"\theta", ylabel="Mass")
+)
+ax2 = Axis(f[1, 2]; title=L"p=0.2", xlabel=L"\theta")
+barplot!(ax2, Binomial(5, 0.2))
+linkaxes!(ax1, ax2)
The Poisson distribution expresses the probability that a given number of events will occur in a fixed interval of time or space if those events occur with a known constant average rate and regardless of the time since the last event. The Poisson distribution can also be used for the number of events at other specified intervals, such as distance, area or volume.
The Poisson distribution has one parameter and its notation is :
Rate ()
Example: Number of emails you receive daily. Number of holes you find on the street.
f, ax1, b = barplot(
+ Poisson(1); axis=(; title=L"\lambda=1", xlabel=L"\theta", ylabel="Mass")
+)
+ax2 = Axis(f[1, 2]; title=L"\lambda=4", xlabel=L"\theta")
+barplot!(ax2, Poisson(4))
+linkaxes!(ax1, ax2)
The negative binomial distribution describes an event of the number of failures before the th success in a sequence of independent experiment(s), each asking a yes-no question with probability . Note that it becomes identical to the Poisson distribution at the limit of . This makes the negative binomial a robust option to replace a Poisson distribution to model phenomena with a overdispersion* (excess expected variation in data).
The negative binomial distribution has two parameters and its notation is or :
Number of Success(es) ()
Probability of Success ()
Any phenomenon that can be modeled with a Poisson distribution, can be modeled with a negative binomial distribution (Gelman et al., 2013; 2020).
Example: Annual count of tropical cyclones.
f, ax1, b = barplot(
+ NegativeBinomial(1, 0.5); axis=(; title=L"k=1", xlabel=L"\theta", ylabel="Mass")
+)
+ax2 = Axis(f[1, 2]; title=L"k=2", xlabel=L"\theta")
+barplot!(ax2, NegativeBinomial(2, 0.5))
+linkaxes!(ax1, ax2)
Continuous probability distributions are those where the results are values in a continuous range (also called real numbers): . In continuous distributions we call the probability that a distribution takes certain values as "density". As we are talking about real numbers we are not able to obtain the probability that a random variable takes the value of . This will always be , as there is no way to specify an exact value of . lives in the real numbers line, so we need to specify the probability that takes values in a range . The probability density function is defined as:
This distribution is generally used in the social and natural sciences to represent continuous variables in which its distributions are not known. This assumption is due to the central limit theorem. The central limit theorem states that, in some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable whose distribution converges to a normal distribution as the number of samples increases. Therefore, physical quantities that are expected to be the sum of many independent processes (such as measurement errors) often have distributions that are expected to be nearly normal.
The normal distribution has two parameters and its notation is or :
Mean (): distribution mean which is also both the mode and the median of the distribution
Standard Deviation (): the variance of the distribution () is a measure of the dispersion of the observations in relation to the mean
Example: Height, Weight, etc.
f, ax, l = lines(
+ Normal(0, 1);
+ label=L"\sigma=1",
+ linewidth=5,
+ axis=(; xlabel=L"\theta", ylabel="Density", limits=(-4, 4, nothing, nothing)),
+)
+lines!(ax, Normal(0, 0.5); label=L"\sigma=0.5", linewidth=5)
+lines!(ax, Normal(0, 2); label=L"\sigma=2", linewidth=5)
+axislegend(ax)
The Log-normal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if a random variable is normally distributed by its natural log, then will have a normal distribution.
A random variable with logarithmic distribution accepts only positive real values. It is a convenient and useful model for measurements in the physical sciences and engineering, as well as medicine, economics and other fields, eg. for energies, concentrations, lengths, financial returns and other values.
A log-normal process is the statistical realization of the multiplicative product of many independent random variables, each one being positive.
The log-normal distribution has two parameters and its notation is :
Mean (): natural logarithm of the mean the distribution
Standard Deviation (): natural logarithm of the variance of the distribution () is a measure of the dispersion of the observations in relation to the mean
f, ax, l = lines(
+ LogNormal(0, 1);
+ label=L"\sigma=1",
+ linewidth=5,
+ axis=(; xlabel=L"\theta", ylabel="Density", limits=(0, 3, nothing, nothing)),
+)
+lines!(ax, LogNormal(0, 0.25); label=L"\sigma=0.25", linewidth=5)
+lines!(ax, LogNormal(0, 0.5); label=L"\sigma=0.5", linewidth=5)
+axislegend(ax)
The exponential distribution is the probability distribution of time between events that occur continuously and independently at a constant average rate.
The exponential distribution has one parameter and its notation is :
Rate ()
Example: How long until the next earthquake. How long until the next bus arrives.
f, ax, l = lines(
+ Exponential(1);
+ label=L"\lambda=1",
+ linewidth=5,
+ axis=(; xlabel=L"\theta", ylabel="Density", limits=(0, 4.5, nothing, nothing)),
+)
+lines!(ax, Exponential(0.5); label=L"\lambda=0.5", linewidth=5)
+lines!(ax, Exponential(1.5); label=L"\lambda=2", linewidth=5)
+axislegend(ax)
Student- distribution appears when estimating the average of a population normally distributed in situations where the sample size is small and the population standard deviation is unknown.
If we take a sample of observations from a normal distribution, then the distribution Student- with degrees of freedom can be defined as the distribution of the location of the sample mean relative to the true mean, divided by the standard deviation of the sample, after multiplying by the standardizing term .
The Student- distribution is symmetrical and bell-shaped, like the normal distribution, but has longer tails, which means that it is more likely to produce values that are far from its mean.
The Student- distribution has one parameter and its notation is :
Degrees of Freedom (): controls how much it resembles a normal distribution
Example: A database full of outliers.
f, ax, l = lines(
+ TDist(2);
+ label=L"\nu=2",
+ linewidth=5,
+ axis=(; xlabel=L"\theta", ylabel="Density", limits=(-4, 4, nothing, nothing)),
+)
+lines!(ax, TDist(8); label=L"\nu=8", linewidth=5)
+lines!(ax, TDist(30); label=L"\nu=30", linewidth=5)
+axislegend(ax)
The beta distributions is a natural choice to model anything that is constrained to take values between 0 and 1. So it is a good candidate for probabilities and proportions.
The beta distribution has two parameters and its notation is :
Shape parameter ( or sometimes ): controls how much the shape is shifted towards 1
Shape parameter ( or sometimes ): controls how much the shape is shifted towards 0
Example: A basketball player has made already scored 5 free throws while missing 3 in a total of 8 attempts – .
f, ax, l = lines(
+ Beta(1, 1);
+ label=L"a=b=1",
+ linewidth=5,
+ axis=(; xlabel=L"\theta", ylabel="Density", limits=(0, 1, nothing, nothing)),
+)
+lines!(ax, Beta(3, 2); label=L"a=3, b=2", linewidth=5)
+lines!(ax, Beta(2, 3); label=L"a=2, b=3", linewidth=5)
+axislegend(ax)
I did not cover all existing distributions. There is a whole plethora of probabilistic distributions.
To access the entire "distribution zoo" use this tool from Ben Lambert (statistician from Imperial College of London): https://ben18785.shinyapps.io/distribution-zoo/
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC.
Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press.
Turing is an ecosystem of Julia packages for Bayesian Inference using probabilistic programming. Turing provides an easy and intuitive way of specifying models.
What is probabilistic programming (PP)? It is a programming paradigm in which probabilistic models are specified and inference for these models is performed automatically (Hardesty, 2015). In more clear terms, PP and PP Languages (PPLs) allows us to specify variables as random variables (like Normal, Binomial etc.) with known or unknown parameters. Then, we construct a model using these variables by specifying how the variables related to each other, and finally automatic inference of the variables' unknown parameters is then performed.
In a Bayesian approach this means specifying priors, likelihoods and letting the PPL compute the posterior. Since the denominator in the posterior is often intractable, we use Markov Chain Monte Carlo[1] and some fancy algorithm that uses the posterior geometry to guide the MCMC proposal using Hamiltonian dynamics called Hamiltonian Monte Carlo (HMC) to approximate the posterior. This involves, besides a suitable PPL, automatic differentiation, MCMC chains interface, and also an efficient HMC algorithm implementation. In order to provide all of these features, Turing has a whole ecosystem to address each and every one of these components.
Before we dive into how to specify models in Turing, let's discuss Turing's ecosystem. We have several Julia packages under Turing's GitHub organization TuringLang, but I will focus on 6 of those:
The first one is Turing.jl
(Ge, Xu & Ghahramani, 2018) itself, the main package that we use to interface with all the Turing ecosystem of packages and the backbone of the PPL Turing.
The second, MCMCChains.jl
, is an interface to summarizing MCMC simulations and has several utility functions for diagnostics and visualizations.
The third package is DynamicPPL.jl
(Tarek, Xu, Trapp, Ge & Ghahramani, 2020) which specifies a domain-specific language and backend for Turing (which itself is a PPL). The main feature of DynamicPPL.jl
is that is is entirely written in Julia and also it is modular.
AdvancedHMC.jl
(Xu, Ge, Tebbutt, Tarek, Trapp & Ghahramani, 2020) provides a robust, modular and efficient implementation of advanced HMC algorithms. The state-of-the-art HMC algorithm is the No-U-Turn Sampling (NUTS)[1] (Hoffman & Gelman, 2011) which is available in AdvancedHMC.jl
.
The fourth package, DistributionsAD.jl
defines the necessary functions to enable automatic differentiation (AD) of the logpdf
function from Distributions.jl
using the packages Tracker.jl
, Zygote.jl
, ForwardDiff.jl
and ReverseDiff.jl
. The main goal of DistributionsAD.jl
is to make the output of logpdf
differentiable with respect to all continuous parameters of a distribution as well as the random variable in the case of continuous distributions. This is the package that guarantees the "automatic inference" part of the definition of a PPL.
Finally, Bijectors.jl
implements a set of functions for transforming constrained random variables (e.g. simplexes, intervals) to Euclidean space. Note that Bijectors.jl
is still a work-in-progress and in the future we'll have better implementation for more constraints, e.g. positive ordered vectors of random variables.
Most of the time we will not be dealing with these packages directly, since Turing.jl
will take care of the interfacing for us. So let's talk about Turing.jl
.
Turing.jl
Turing.jl
is the main package in the Turing ecosystem and the backbone that glues all the other packages together. Turing's "workflow" begin with a model specification. We specify the model inside a macro @model
where we can assign variables in two ways:
using ~
: which means that a variable follows some probability distribution (Normal, Binomial etc.) and its value is random under that distribution
using =
: which means that a variable does not follow a probability distribution and its value is deterministic (like the normal =
assignment in programming languages)
Turing will perform automatic inference on all variables that you specify using ~
. Here is a simple example of how we would model a six-sided dice. Note that a "fair" dice will be distributed as a discrete uniform probability with the lower bound as 1 and the upper bound as 6:
Note that the expectation of a random variable is:
Graphically this means:
using CairoMakie
+using Distributions
+
+dice = DiscreteUniform(1, 6)
+f, ax, b = barplot(
+ dice;
+ label="six-sided Dice",
+ axis=(; xlabel=L"\theta", ylabel="Mass", xticks=1:6, limits=(nothing, nothing, 0, 0.3)),
+)
+vlines!(ax, [mean(dice)]; linewidth=5, color=:red, label=L"E(\theta)")
+axislegend(ax)
So let's specify our first Turing model. It will be named dice_throw
and will have a single parameter y
which is a -dimensional vector of integers representing the observed data, i.e. the outcomes of six-sided dice throws:
using Turing
+
+@model function dice_throw(y)
+ #Our prior belief about the probability of each result in a six-sided dice.
+ #p is a vector of length 6 each with probability p that sums up to 1.
+ p ~ Dirichlet(6, 1)
+
+ #Each outcome of the six-sided dice has a probability p.
+ for i in eachindex(y)
+ y[i] ~ Categorical(p)
+ end
+end;
Here we are using the Dirichlet distribution which is the multivariate generalization of the Beta distribution. The Dirichlet distribution is often used as the conjugate prior for Categorical or Multinomial distributions. Our dice is modelled as a Categorical distribution with six possible results with some probability vector . Since all mutually exclusive outcomes must sum up to 1 to be a valid probability, we impose the constraint that all s must sum up to 1 – . We could have used a vector of six Beta random variables but it would be hard and inefficient to enforce this constraint. Instead, I've opted for a Dirichlet with a weekly informative prior towards a "fair" dice which is encoded as a Dirichlet(6,1)
. This is translated as a 6-dimensional vector of elements that sum to one:
mean(Dirichlet(6, 1))
6-element Fill{Float64}, with entries equal to 0.16666666666666666
+And, indeed, it sums up to one:
+sum(mean(Dirichlet(6, 1)))
1.0
+Also, since the outcome of a Categorical distribution is an integer and y
is a -dimensional vector of integers we need to apply some sort of broadcasting here. We could use the familiar dot .
broadcasting operator in Julia: y .~ Categorical(p)
to signal that all elements of y
are distributed as a Categorical distribution. But doing that does not allow us to do predictive checks (more on this below). So, instead we use a for
-loop.
Now let's set a seed for the pseudo-random number generator and simulate 1,000 throws of a six-sided dice:
+using Random
+
+Random.seed!(123);
+
+my_data = rand(DiscreteUniform(1, 6), 1_000);
+The vector my_data
is a 1,000-length vector of Int
s ranging from 1 to 6, just like how a regular six-sided dice outcome would be:
first(my_data, 5)
5-element Vector{Int64}:
+ 4
+ 4
+ 6
+ 2
+ 4
+Once the model is specified we instantiate the model with the single parameter y
as the simulated my_data
:
model = dice_throw(my_data);
+Next, we call Turing's sample()
function that takes a Turing model as a first argument, along with a sampler as the second argument, and the third argument is the number of iterations. Here, I will use the NUTS()
sampler from AdvancedHMC.jl
and 1,000 iterations. Please note that, as default, Turing samplers will discard the first thousand (1,000) iterations as warmup. So the sampler will output 1,000 samples starting from sample 1,001 until sample 2,000:
chain = sample(model, NUTS(), 1_000);
+Now let's inspect the chain. We can do that with the function describe()
that will return a 2-element vector of ChainDataFrame
(this is the type defined by MCMCChains.jl
to store Markov chain's information regarding the inferred parameters). The first ChainDataFrame
has information regarding the parameters' summary statistics (mean
, std
, r_hat
, ...) and the second is the parameters' quantiles. Since describe(chain)
returns a 2-element vector, I will assign the output to two variables:
summaries, quantiles = describe(chain);
+We won't be focusing on quantiles, so let's put it aside for now. Let's then take a look at the parameters' summary statistics:
+summaries
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ p[1] 0.1571 0.0124 0.0003 2069.1239 657.7236 1.0047 536.1814
+ p[2] 0.1449 0.0115 0.0002 2261.9424 741.2396 1.0061 586.1473
+ p[3] 0.1401 0.0108 0.0002 1911.4282 868.5186 1.0069 495.3170
+ p[4] 0.1808 0.0120 0.0003 2087.7925 842.7605 1.0005 541.0190
+ p[5] 0.1971 0.0126 0.0003 1720.0950 673.6157 1.0019 445.7359
+ p[6] 0.1800 0.0126 0.0003 1786.2325 742.7542 1.0029 462.8745
+
+Here p
is a 6-dimensional vector of probabilities, which each one associated with a mutually exclusive outcome of a six-sided dice throw. As we expected, the probabilities are almost equal to , like a "fair" six-sided dice that we simulated data from (sampling from DiscreteUniform(1, 6)
). Indeed, just for a sanity check, the mean of the estimates of p
sums up to 1:
sum(summaries[:, :mean])
1.0
+In the future if you have some crazy huge models and you just want a subset of parameters from your chains? Just do group(chain, :parameter)
or index with chain[:, 1:6, :]
:
summarystats(chain[:, 1:3, :])
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ p[1] 0.1571 0.0124 0.0003 2069.1239 657.7236 1.0047 536.1814
+ p[2] 0.1449 0.0115 0.0002 2261.9424 741.2396 1.0061 586.1473
+ p[3] 0.1401 0.0108 0.0002 1911.4282 868.5186 1.0069 495.3170
+
+or chain[[:parameters,...]]
:
summarystats(chain[[:var"p[1]", :var"p[2]"]])
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ p[1] 0.1571 0.0124 0.0003 2069.1239 657.7236 1.0047 536.1814
+ p[2] 0.1449 0.0115 0.0002 2261.9424 741.2396 1.0061 586.1473
+
+And, finally let's compute the expectation of the estimated six-sided dice, , using the standard expectation definition of expectation for a discrete random variable:
+ +sum([idx * i for (i, idx) in enumerate(summaries[:, :mean])])
3.655798033295476
+Bingo! The estimated expectation is very close to the theoretical expectation of , as we've show in (2).
+Note that the type of our chain
is a Chains
object from MCMCChains.jl
:
typeof(chain)
MCMCChains.Chains{Float64, AxisArrays.AxisArray{Float64, 3, Array{Float64, 3}, Tuple{AxisArrays.Axis{:iter, StepRange{Int64, Int64}}, AxisArrays.Axis{:var, Vector{Symbol}}, AxisArrays.Axis{:chain, UnitRange{Int64}}}}, Missing, @NamedTuple{parameters::Vector{Symbol}, internals::Vector{Symbol}}, @NamedTuple{varname_to_symbol::OrderedCollections.OrderedDict{AbstractPPL.VarName, Symbol}, start_time::Float64, stop_time::Float64}}
+Since Chains
is a Tables.jl
-compatible data structure, we can use all of the plotting capabilities from AlgebraOfGraphics.jl
.
using AlgebraOfGraphics
+using AlgebraOfGraphics: density
+#exclude additional information such as log probability
+params = names(chain, :parameters)
+chain_mapping =
+ mapping(params .=> "sample value") *
+ mapping(; color=:chain => nonnumeric, row=dims(1) => renamer(params))
+plt1 = data(chain) * mapping(:iteration) * chain_mapping * visual(Lines)
+plt2 = data(chain) * chain_mapping * density()
+f = Figure(; resolution=(800, 600))
+draw!(f[1, 1], plt1)
+draw!(f[1, 2], plt2; axis=(; ylabel="density"))
+
On the figure above we can see, for each parameter in the model, on the left the parameter's traceplot and on the right the parameter's density[2].
+Predictive checks are a great way to validate a model. The idea is to generate data from the model using parameters from draws from the prior or posterior. Prior predictive check is when we simulate data using model parameter values drawn from the prior distribution, and posterior predictive check is is when we simulate data using model parameter values drawn from the posterior distribution.
+The workflow we do when specifying and sampling Bayesian models is not linear or acyclic (Gelman et al., 2020). This means that we need to iterate several times between the different stages in order to find a model that captures best the data generating process with the desired assumptions. The figure below demonstrates the workflow [3].
+This is quite easy in Turing. Our six-sided dice model already has a posterior distribution which is the object chain
. We need to create a prior distribution for our model. To accomplish this, instead of supplying a MCMC sampler like NUTS()
, we supply the "sampler" Prior()
inside Turing's sample()
function:
prior_chain = sample(model, Prior(), 2_000);
+Now we can perform predictive checks using both the prior (prior_chain
) or posterior (chain
) distributions. To draw from the prior and posterior predictive distributions we instantiate a "predictive model", i.e. a Turing model but with the observations set to missing
, and then calling predict()
on the predictive model and the previously drawn samples. First let's do the prior predictive check:
missing_data = similar(my_data, Missing) # vector of `missing`
+model_missing = dice_throw(missing_data) # instantiate the "predictive model
+prior_check = predict(model_missing, prior_chain);
+Here we are creating a missing_data
object which is a Vector
of the same length as the my_data
and populated with type missing
as values. We then instantiate a new dice_throw
model with the missing_data
vector as the y
argument. Finally, we call predict()
on the predictive model and the previously drawn samples, which in our case are the samples from the prior distribution (prior_chain
).
Note that predict()
returns a Chains
object from MCMCChains.jl
:
typeof(prior_check)
MCMCChains.Chains{Float64, AxisArrays.AxisArray{Float64, 3, Array{Float64, 3}, Tuple{AxisArrays.Axis{:iter, StepRange{Int64, Int64}}, AxisArrays.Axis{:var, Vector{Symbol}}, AxisArrays.Axis{:chain, UnitRange{Int64}}}}, Missing, @NamedTuple{parameters::Vector{Symbol}, internals::Vector{Symbol}}, @NamedTuple{varname_to_symbol::OrderedCollections.OrderedDict{AbstractPPL.VarName, Symbol}}}
+And we can call summarystats()
:
summarystats(prior_check[:, 1:5, :]) # just the first 5 prior samples
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
+
+ y[1] 3.4265 1.7433 0.0396 1966.1989 NaN 1.0016 missing
+ y[2] 3.4815 1.6974 0.0414 1672.3452 NaN 1.0024 missing
+ y[3] 3.4620 1.7149 0.0396 1878.7524 NaN 1.0022 missing
+ y[4] 3.4905 1.7177 0.0387 1960.3938 NaN 1.0006 missing
+ y[5] 3.5095 1.7101 0.0386 2002.7276 NaN 0.9998 missing
+
+We can do the same with chain
for a posterior predictive check:
posterior_check = predict(model_missing, chain);
+summarystats(posterior_check[:, 1:5, :]) # just the first 5 posterior samples
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
+
+ y[1] 3.8050 1.6637 0.0510 1065.1417 NaN 0.9992 missing
+ y[2] 3.6280 1.6638 0.0583 773.8349 NaN 1.0000 missing
+ y[3] 3.6430 1.7007 0.0554 921.8019 NaN 1.0044 missing
+ y[4] 3.5720 1.6845 0.0518 1052.7732 NaN 0.9993 missing
+ y[5] 3.6390 1.7400 0.0546 988.2165 NaN 1.0012 missing
+
+This is the basic overview of Turing usage. I hope that I could show you how simple and intuitive is to specify probabilistic models using Turing. First, specify a model with the macro @model
, then sample from it by specifying the data, sampler and number of interactions. All probabilistic parameters (the ones that you've specified using ~
) will be inferred with a full posterior density. Finally, you inspect the parameters' statistics like mean and standard deviation, along with convergence diagnostics like r_hat
. Conveniently, you can plot stuff easily if you want to. You can also do predictive checks using either the posterior or prior model's distributions.
[1] + | see 5. Markov Chain Monte Carlo (MCMC). + + |
[2] + | we'll cover those plots and diagnostics in 5. Markov Chain Monte Carlo (MCMC). + + |
[3] + | note that this workflow is a extremely simplified adaptation from the original workflow on which it was based. I suggest the reader to consult the original workflow of Gelman et al. (2020). + + |
Ge, H., Xu, K., & Ghahramani, Z. (2018). Turing: A Language for Flexible Probabilistic Inference. International Conference on Artificial Intelligence and Statistics, 1682–1690. http://proceedings.mlr.press/v84/ge18b.html
+Gelman, A., Vehtari, A., Simpson, D., Margossian, C. C., Carpenter, B., Yao, Y., … Modr’ak, M. (2020, November 3). Bayesian Workflow. Retrieved February 4, 2021, from http://arxiv.org/abs/2011.01808
+Hardesty (2015). "Probabilistic programming does in 50 lines of code what used to take thousands". phys.org. April 13, 2015. Retrieved April 13, 2015. https://phys.org/news/2015-04-probabilistic-lines-code-thousands.html
+Hoffman, M. D., & Gelman, A. (2011). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623. Retrieved from http://arxiv.org/abs/1111.4246
+Tarek, M., Xu, K., Trapp, M., Ge, H., & Ghahramani, Z. (2020). DynamicPPL: Stan-like Speed for Dynamic Probabilistic Models. ArXiv:2002.02702 [Cs, Stat]. http://arxiv.org/abs/2002.02702
+Xu, K., Ge, H., Tebbutt, W., Tarek, M., Trapp, M., & Ghahramani, Z. (2020). AdvancedHMC.jl: A robust, modular and efficient implementation of advanced HMC algorithms. Symposium on Advances in Approximate Bayesian Inference, 1–10. http://proceedings.mlr.press/v118/xu20a.html
+ +The main computational barrier for Bayesian statistics is the denominator of the Bayes formula:
In discrete cases we can turn the denominator into a sum of all parameters using the chain rule of probability:
This is also called marginalization:
However, in the continuous cases the denominator becomes a very large and complicated integral to calculate:
In many cases this integral becomes intractable (incalculable) and therefore we must find other ways to calculate the posterior probability in (1) without using the denominator .
Quick answer: to normalize the posterior in order to make it a valid probability distribution. This means that the sum of all probabilities of the possible events in the probability distribution must be equal to 1:
in the case of discrete probability distribution:
in the case of continuous probability distribution:
When we remove the denominator we have that the posterior is proportional to the prior multiplied by the likelihood [1].
This YouTube video explains the denominator problem very well, see below:
This is where Markov Chain Monte Carlo comes in. MCMC is a broad class of computational tools for approximating integrals and generating samples from a posterior probability (Brooks, Gelman, Jones & Meng, 2011). MCMC is used when it is not possible to sample directly from the subsequent probabilistic distribution . Instead, we sample in an iterative manner such that at each step of the process we expect the distribution from which we sample (here means simulated) becomes increasingly similar to the posterior . All of this is to eliminate the (often impossible) calculation of the denominator .
The idea is to define an ergodic Markov chain (that is to say that there is a single stationary distribution) of which the set of possible states is the sample space and the stationary distribution is the distribution to be approximated (or sampled). Let be a simulation of the chain. The Markov chain converges to the stationary distribution of any initial state after a large enough number of iterations , the distribution of the state will be similar to the stationary distribution, so we can use it as a sample. Markov chains have a property that the probability distribution of the next state depends only on the current state and not on the sequence of events that preceded: . This property is called Markovian, after the mathematician Andrey Markov (see figure below). Similarly, repeating this argument with as the starting point, we can use as a sample, and so on. We can then use the state sequence as almost independent samples of the stationary distribution of the Markov chain.
The effectiveness of this approach depends on:
how big must be to ensure a suitably good sample; and
computational power required for each iteration of the Markov chain.
In addition, it is customary to discard the first iterations of the algorithm as they are usually not representative of the distribution to be approximated. In the initial iterations of MCMC algorithms, generally the Markov current is in a warm-up process[2] and its state is far from ideal to start a reliable sampling. It is generally recommended to discard half of the iterations (Gelman, Carlin, Stern, Dunson, Vehtari, & Rubin, 2013a). For example: if the Markov chain has 4,000 iterations, we discard the first 2,000 as warm-up.
Stanislaw Ulam (figure below), who participated in the Manhattan project and when trying to calculate the neutron diffusion process for the hydrogen bomb ended up creating a class of methods called Monte Carlo.
Monte Carlo methods have the underlying concept of using randomness to solve problems that can be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches. Monte Carlo methods are used mainly in three classes of problems: optimization, numerical integration and generating sample from a probability distribution.
The idea for the method came to Ulam while playing solitaire during his recovery from surgery, as he thought about playing hundreds of games to statistically estimate the probability of a successful outcome. As he himself mentions in Eckhardt (1987):
"The first thoughts and attempts I made to practice [the Monte Carlo method] were suggested by a question which occurred to me in 1946 as I was convalescing from an illness and playing solitaires. The question was what are the chances that a Canfield solitaire laid out with 52 cards will come out successfully? After spending a lot of time trying to estimate them by pure combinatorial calculations, I wondered whether a more practical method than "abstract thinking" might not be to lay it out say one hundred times and simply observe and count the number of successful plays. This was already possible to envisage with the beginning of the new era of fast computers, and I immediately thought of problems of neutron diffusion and other questions of mathematical physics, and more generally how to change processes described by certain differential equations into an equivalent form interpretable as a succession of random operations. Later... [in 1946, I ] described the idea to John von Neumann and we began to plan actual calculations."
Because it was secret, von Neumann and Ulam's work required a codename. A colleague of von Neumann and Ulam, Nicholas Metropolis (figure below), suggested using the name "Monte Carlo", which refers to Casino Monte Carlo in Monaco, where Ulam's uncle (Michał Ulam) borrowed money from relatives to gamble.
The applications of the Monte Carlo method are numerous: physical sciences, engineering, climate change, computational biology, computer graphics, applied statistics, artificial intelligence, search and rescue, finance and business and law. In the scope of these tutorials we will focus on applied statistics and specifically in the context of Bayesian inference: providing a random sample of the posterior distribution.
I will do some simulations to illustrate MCMC algorithms and techniques. So, here's the initial setup:
using CairoMakie
+using Distributions
+using Random
+
+Random.seed!(123);
Let's start with a toy problem of a multivariate normal distribution of and , where
If we assign and (mean 0 and standard deviation 1 for both and ), we have the following formulation from (6):
All that remains is to assign a value for in (7) for the correlation between and . For our example we will use correlation of 0.8 ():
const N = 100_000
+const μ = [0, 0]
+const Σ = [1 0.8; 0.8 1]
+
+const mvnormal = MvNormal(μ, Σ)
FullNormal(
+dim: 2
+μ: [0.0, 0.0]
+Σ: [1.0 0.8; 0.8 1.0]
+)
+
In the figure below it is possible to see a contour plot of the PDF of a multivariate normal distribution composed of two normal variables and , both with mean 0 and standard deviation 1. The correlation between and is :
x = -3:0.01:3
+y = -3:0.01:3
+dens_mvnormal = [pdf(mvnormal, [i, j]) for i in x, j in y]
+f, ax, c = contourf(x, y, dens_mvnormal; axis=(; xlabel=L"X", ylabel=L"Y"))
+Colorbar(f[1, 2], c)
Also a surface plot can be seen below for you to get a 3-D intuition of what is going on:
f, ax, s = surface(
+ x,
+ y,
+ dens_mvnormal;
+ axis=(type=Axis3, xlabel=L"X", ylabel=L"Y", zlabel="PDF", azimuth=pi / 8),
+)
The first MCMC algorithm widely used to generate samples from Markov chain originated in physics in the 1950s (in a very close relationship with the atomic bomb at the Manhattan project) and is called Metropolis (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953) in honor of the first author Nicholas Metropolis (figure above). In summary, the Metropolis algorithm is an adaptation of a random walk with an acceptance/rejection rule to converge to the target distribution.
The Metropolis algorithm uses a proposal distribution ( stands for jumping distribution and indicates which state of the Markov chain we are in) to define next values of the distribution . This distribution must be symmetrical:
In the 1970s, a generalization of the Metropolis algorithm emerged that does not require that the proposal distributions be symmetric. The generalization was proposed by Wilfred Keith Hastings (Hastings, 1970) (figure below) and is called Metropolis-Hastings algorithm.
The essence of the algorithm is a random walk through the parameters' sample space, where the probability of the Markov chain changing state is defined as:
This means that the Markov chain will only change to a new state under two conditions:
When the probability of the parameters proposed by the random walk is greater than the probability of the parameters of the current state , we change with 100% probability. Note that if then the function chooses the value 1 which means 100%.
When the probability of the parameters proposed by the random walk is less than the probability of the parameters of the current state , we changed with a probability equal to the proportion of that difference. Note that if then the function does not choose the value 1, but the value which equates the proportion of the probability of the proposed parameters to the probability of the parameters at the current state.
Anyway, at each iteration of the Metropolis algorithm, even if the Markov chain changes state or not, we sample the parameter anyway. That is, if the chain does not change to a new state, will be sampled twice (or more if the current is stationary in the same state).
The Metropolis-Hastings algorithm can be described in the following way [3] ( is the parameter, or set of parameters, of interest and is the data):
Define a starting point of which , or sample it from an initial distribution . can be a normal distribution or a prior distribution of ().
For :
Sample a proposed from a proposal distribution in time , .
Calculate the ratio of probabilities:
Metropolis:
Metropolis-Hastings:
Assign:
The limitations of the Metropolis-Hastings algorithm are mainly computational. With randomly generated proposals, it usually takes a large number of iterations to enter areas of higher (more likely) posterior densities. Even efficient Metropolis-Hastings algorithms sometimes accept less than 25% of the proposals (Roberts, Gelman & Gilks, 1997). In lower-dimensional situations, the increased computational power can compensate for the lower efficiency to some extent. But in higher-dimensional and more complex modeling situations, bigger and faster computers alone are rarely enough to overcome the challenge.
In our toy example we will assume that is symmetric, thus , so I'll just implement the Metropolis algorithm (not the Metropolis-Hastings algorithm).
Below I created a Metropolis sampler for our toy example. At the end it prints the acceptance rate of the proposals. Here I am using the same proposal distribution for both and : a uniform distribution parameterized with a width
parameter:
I will use the already known Distributions.jl
MvNormal
from the plots above along with the logpdf()
function to calculate the PDF of the proposed and current s. It is easier to work with probability logs than with the absolute values[4]. Mathematically we will compute:
Here is a simple implementation in Julia:
function metropolis(
+ S::Int64,
+ width::Float64,
+ ρ::Float64;
+ μ_x::Float64=0.0,
+ μ_y::Float64=0.0,
+ σ_x::Float64=1.0,
+ σ_y::Float64=1.0,
+ start_x=-2.5,
+ start_y=2.5,
+ seed=123,
+)
+ rgn = MersenneTwister(seed)
+ binormal = MvNormal([μ_x; μ_y], [σ_x ρ; ρ σ_y])
+ draws = Matrix{Float64}(undef, S, 2)
+ accepted = 0::Int64
+ x = start_x
+ y = start_y
+ @inbounds draws[1, :] = [x y]
+ for s in 2:S
+ x_ = rand(rgn, Uniform(x - width, x + width))
+ y_ = rand(rgn, Uniform(y - width, y + width))
+ r = exp(logpdf(binormal, [x_, y_]) - logpdf(binormal, [x, y]))
+
+ if r > rand(rgn, Uniform())
+ x = x_
+ y = y_
+ accepted += 1
+ end
+ @inbounds draws[s, :] = [x y]
+ end
+ println("Acceptance rate is: $(accepted / S)")
+ return draws
+end
metropolis (generic function with 1 method)
+Now let's run our metropolis()
algorithm for the bivariate normal case with S = 10_000
, width = 2.75
and ρ = 0.8
:
const S = 10_000
+const width = 2.75
+const ρ = 0.8
+
+X_met = metropolis(S, width, ρ);
Acceptance rate is: 0.2116
+
+Take a quick peek into X_met
, we'll see it's a matrix of and values as columns and the time as rows:
X_met[1:10, :]
10×2 Matrix{Float64}:
+ -2.5 2.5
+ -2.5 2.5
+ -3.07501 1.47284
+ -2.60189 -0.990426
+ -2.60189 -0.990426
+ -0.592119 -0.34422
+ -0.592119 -0.34422
+ -0.592119 -0.34422
+ -0.592119 -0.34422
+ -0.592119 -0.34422
+Also note that the acceptance of the proposals was 21%, the expected for Metropolis algorithms (around 20-25%) (Roberts et. al, 1997).
+We can construct Chains
object using MCMCChains.jl
[5] by passing a matrix along with the parameters names as symbols inside the Chains()
constructor:
using MCMCChains
+
+chain_met = Chains(X_met, [:X, :Y]);
+Then we can get summary statistics regarding our Markov chain derived from the Metropolis algorithm:
+summarystats(chain_met)
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
+
+ X -0.0314 0.9984 0.0313 1022.8896 1135.7248 0.9999 missing
+ Y -0.0334 0.9784 0.0305 1028.0976 1046.3325 0.9999 missing
+
+Both of X
and Y
have mean close to 0 and standard deviation close to 1 (which are the theoretical values). Take notice of the ess
(effective sample size - ESS) is approximate 1,000. So let's calculate the efficiency of our Metropolis algorithm by dividing the ESS by the number of sampling iterations that we've performed:
mean(summarystats(chain_met)[:, :ess_tail]) / S
0.10910286584687137
+Our Metropolis algorithm has around 10.2% efficiency. Which, in my honest opinion, sucks...(😂)
+I believe that a good visual intuition, even if you have not understood any mathematical formula, is the key for you to start a fruitful learning journey. So I made some animations!
+The animation in figure below shows the first 100 simulations of the Metropolis algorithm used to generate X_met
. Note that in several iterations the proposal is rejected and the algorithm samples the parameters and from the previous state (which becomes the current one, since the proposal is refused). The blue ellipsis represents the 90% HPD of our toy example's bivariate normal distribution.
Note: HPD
stands for Highest Probability Density (which in our case the posterior's 90% probability range).
using LinearAlgebra: eigvals, eigvecs
+#source: https://discourse.julialang.org/t/plot-ellipse-in-makie/82814/4
+function getellipsepoints(cx, cy, rx, ry, θ)
+ t = range(0, 2 * pi; length=100)
+ ellipse_x_r = @. rx * cos(t)
+ ellipse_y_r = @. ry * sin(t)
+ R = [cos(θ) sin(θ); -sin(θ) cos(θ)]
+ r_ellipse = [ellipse_x_r ellipse_y_r] * R
+ x = @. cx + r_ellipse[:, 1]
+ y = @. cy + r_ellipse[:, 2]
+ return (x, y)
+end
+function getellipsepoints(μ, Σ; confidence=0.95)
+ quant = sqrt(quantile(Chisq(2), confidence))
+ cx = μ[1]
+ cy = μ[2]
+
+ egvs = eigvals(Σ)
+ if egvs[1] > egvs[2]
+ idxmax = 1
+ largestegv = egvs[1]
+ smallesttegv = egvs[2]
+ else
+ idxmax = 2
+ largestegv = egvs[2]
+ smallesttegv = egvs[1]
+ end
+
+ rx = quant * sqrt(largestegv)
+ ry = quant * sqrt(smallesttegv)
+
+ eigvecmax = eigvecs(Σ)[:, idxmax]
+ θ = atan(eigvecmax[2] / eigvecmax[1])
+ if θ < 0
+ θ += 2 * π
+ end
+
+ return getellipsepoints(cx, cy, rx, ry, θ)
+end
+
+f, ax, l = lines(
+ getellipsepoints(μ, Σ; confidence=0.9)...;
+ label="90% HPD",
+ linewidth=2,
+ axis=(; limits=(-3, 3, -3, 3), xlabel=L"\theta_1", ylabel=L"\theta_2"),
+)
+axislegend(ax)
+record(f, joinpath(@OUTPUT, "met_anim.gif"); framerate=5) do frame
+ for i in 1:100
+ scatter!(ax, (X_met[i, 1], X_met[i, 2]); color=(:red, 0.5))
+ linesegments!(X_met[i:(i + 1), 1], X_met[i:(i + 1), 2]; color=(:green, 0.5))
+ recordframe!(frame)
+ end
+end;
+
Now let's take a look how the first 1,000 simulations were, excluding 1,000 initial iterations as warm-up.
+const warmup = 1_000
+
+f, ax, l = lines(
+ getellipsepoints(μ, Σ; confidence=0.9)...;
+ label="90% HPD",
+ linewidth=2,
+ axis=(; limits=(-3, 3, -3, 3), xlabel=L"\theta_1", ylabel=L"\theta_2"),
+)
+axislegend(ax)
+scatter!(
+ ax,
+ X_met[warmup:(warmup + 1_000), 1],
+ X_met[warmup:(warmup + 1_000), 2];
+ color=(:red, 0.3),
+)
+
And, finally, lets take a look in the all 9,000 samples generated after the warm-up of 1,000 iterations.
+f, ax, l = lines(
+ getellipsepoints(μ, Σ; confidence=0.9)...;
+ label="90% HPD",
+ linewidth=2,
+ axis=(; limits=(-3, 3, -3, 3), xlabel=L"\theta_1", ylabel=L"\theta_2"),
+)
+axislegend(ax)
+scatter!(ax, X_met[warmup:end, 1], X_met[warmup:end, 2]; color=(:red, 0.3))
+
To circumvent the problem of low acceptance rate of the Metropolis (and Metropolis-Hastings) algorithm, the Gibbs algorithm was developed, which does not have an acceptance/rejection rule for new proposals to change the state of the Markov chain. All proposals are accepted.
+Gibbs' algorithm had an original idea conceived by physicist Josiah Willard Gibbs (figure below), in reference to an analogy between a sampling algorithm and statistical physics (a branch of physics which is based on statistical mechanics). The algorithm was described by brothers Stuart and Donald Geman in 1984 (Geman & Geman, 1984), about eight decades after Gibbs's death.
+The Gibbs algorithm is very useful in multidimensional sample spaces (in which there are more than 2 parameters to be sampled for the posterior probability). It is also known as alternating conditional sampling, since we always sample a parameter conditioned to the probability of the other parameters.
+The Gibbs algorithm can be seen as a special case of the Metropolis-Hastings algorithm because all proposals are accepted (Gelman, 1992).
+The essence of Gibbs' algorithm is an iterative sampling of parameters conditioned to other parameters .
+Gibbs's algorithm can be described in the following way[6] ( is the parameter, or set of parameters, of interest and is the data):
+Define : the prior probability of each of the parameters.
+ +Sample a starting point . We usually sample from a normal distribution or from a distribution specified as the prior distribution of .
+ +For :
+ + +The main limitation of the Gibbs algorithm is with regard to alternative conditional sampling.
+If we compare with the Metropolis algorithm (and consequently Metropolis-Hastings) we have random proposals from a proposal distribution in which we sample each parameter unconditionally to other parameters. In order for the proposals to take us to the posterior probability's correct locations to sample, we have an acceptance/rejection rule for these proposals, otherwise the samples of the Metropolis algorithm would not approach the target distribution of interest. The state changes of the Markov chain are then carried out multidimensionally [7]. As you saw in the Metropolis' figures, in a 2-D space (as is our example's bivariate normal distribution), when there is a change of state in the Markov chain, the new proposal location considers both and , causing diagonal movement in space 2-D sample. In other words, the proposal is done regarding all dimensions of the parameter space.
+In the case of the Gibbs algorithm, in our toy example, this movement occurs only in a single parameter, i.e single dimension, as we sample sequentially and conditionally to other parameters. This causes horizontal movements (in the case of ) and vertical movements (in the case of ), but never diagonal movements like the ones we saw in the Metropolis algorithm.
+Here are some new things compared to the Metropolis algorithm implementation. First to conditionally sample the parameters and , we need to create two new variables β
and λ
. These variables represent the correlation between and ( and respectively) scaled by the ratio of the variance of and . And then we use these variables in the sampling of and :
function gibbs(
+ S::Int64,
+ ρ::Float64;
+ μ_x::Float64=0.0,
+ μ_y::Float64=0.0,
+ σ_x::Float64=1.0,
+ σ_y::Float64=1.0,
+ start_x=-2.5,
+ start_y=2.5,
+ seed=123,
+)
+ rgn = MersenneTwister(seed)
+ binormal = MvNormal([μ_x; μ_y], [σ_x ρ; ρ σ_y])
+ draws = Matrix{Float64}(undef, S, 2)
+ x = start_x
+ y = start_y
+ β = ρ * σ_y / σ_x
+ λ = ρ * σ_x / σ_y
+ sqrt1mrho2 = sqrt(1 - ρ^2)
+ σ_YX = σ_y * sqrt1mrho2
+ σ_XY = σ_x * sqrt1mrho2
+ @inbounds draws[1, :] = [x y]
+ for s in 2:S
+ if s % 2 == 0
+ y = rand(rgn, Normal(μ_y + β * (x - μ_x), σ_YX))
+ else
+ x = rand(rgn, Normal(μ_x + λ * (y - μ_y), σ_XY))
+ end
+ @inbounds draws[s, :] = [x y]
+ end
+ return draws
+end
gibbs (generic function with 1 method)
+Generally a Gibbs sampler is not implemented in this way. Here I coded the Gibbs algorithm so that it samples a parameter for each iteration. To be more computationally efficient we would sample all parameters are on each iteration. I did it on purpose because I want to show in the animations the real trajectory of the Gibbs sampler in the sample space (vertical and horizontal, not diagonal). So to remedy this I will provide gibbs()
double the amount of S
(20,000 in total). Also take notice that we are now proposing new parameters' values conditioned on other parameters, so there is not an acceptance/rejection rule here.
X_gibbs = gibbs(S * 2, ρ);
+As before lets' take a quick peek into X_gibbs
, we'll see it's a matrix of and values as columns and the time as rows:
X_gibbs[1:10, :]
10×2 Matrix{Float64}:
+ -2.5 2.5
+ -2.5 -1.28584
+ 0.200236 -1.28584
+ 0.200236 0.84578
+ 0.952273 0.84578
+ 0.952273 0.523811
+ 0.0202213 0.523811
+ 0.0202213 0.604758
+ 0.438516 0.604758
+ 0.438516 0.515102
+Again, we construct a Chains
object by passing a matrix along with the parameters names as symbols inside the Chains()
constructor:
chain_gibbs = Chains(X_gibbs, [:X, :Y]);
+Then we can get summary statistics regarding our Markov chain derived from the Gibbs algorithm:
+summarystats(chain_gibbs)
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
+
+ X 0.0161 1.0090 0.0217 2152.7581 3985.5939 1.0005 missing
+ Y 0.0097 1.0071 0.0219 2114.7477 4038.1490 1.0003 missing
+
+Both of X
and Y
have mean close to 0 and standard deviation close to 1 (which are the theoretical values). Take notice of the ess
(effective sample size - ESS) that is around 2,100. Since we used S * 2
as the number of samples, in order for we to compare with Metropolis, we would need to divide the ESS by 2. So our ESS is between 1,000, which is similar to Metropolis' ESS. Now let's calculate the efficiency of our Gibbs algorithm by dividing the ESS by the number of sampling iterations that we've performed also accounting for the S * 2
:
(mean(summarystats(chain_gibbs)[:, :ess_tail]) / 2) / S
0.20059357221090376
+Our Gibbs algorithm has around 10.6% efficiency. Which, in my honest opinion, despite the small improvement still sucks...(😂)
+Oh yes, we have animations for Gibbs also!
+The animation in figure below shows the first 100 simulations of the Gibbs algorithm used to generate X_gibbs
. Note that all proposals are accepted now, so the at each iteration we sample new parameters values. The blue ellipsis represents the 90% HPD of our toy example's bivariate normal distribution.
f, ax, l = lines(
+ getellipsepoints(μ, Σ; confidence=0.9)...;
+ label="90% HPD",
+ linewidth=2,
+ axis=(; limits=(-3, 3, -3, 3), xlabel=L"\theta_1", ylabel=L"\theta_2"),
+)
+axislegend(ax)
+record(f, joinpath(@OUTPUT, "gibbs_anim.gif"); framerate=5) do frame
+ for i in 1:200
+ scatter!(ax, (X_gibbs[i, 1], X_gibbs[i, 2]); color=(:red, 0.5))
+ linesegments!(X_gibbs[i:(i + 1), 1], X_gibbs[i:(i + 1), 2]; color=(:green, 0.5))
+ recordframe!(frame)
+ end
+end;
+
Now let's take a look how the first 1,000 simulations were, excluding 1,000 initial iterations as warm-up.
+f, ax, l = lines(
+ getellipsepoints(μ, Σ; confidence=0.9)...;
+ label="90% HPD",
+ linewidth=2,
+ axis=(; limits=(-3, 3, -3, 3), xlabel=L"\theta_1", ylabel=L"\theta_2"),
+)
+axislegend(ax)
+scatter!(
+ ax,
+ X_gibbs[(2 * warmup):(2 * warmup + 1_000), 1],
+ X_gibbs[(2 * warmup):(2 * warmup + 1_000), 2];
+ color=(:red, 0.3),
+)
+
And, finally, lets take a look in the all 9,000 samples generated after the warm-up of 1,000 iterations.
+f, ax, l = lines(
+ getellipsepoints(μ, Σ; confidence=0.9)...;
+ label="90% HPD",
+ linewidth=2,
+ axis=(; limits=(-3, 3, -3, 3), xlabel=L"\theta_1", ylabel=L"\theta_2"),
+)
+axislegend(ax)
+scatter!(ax, X_gibbs[(2 * warmup):end, 1], X_gibbs[(2 * warmup):end, 2]; color=(:red, 0.3))
+
Since the Markov chains are independent, we can run them in parallel. The key to this is defining different starting points for each Markov chain (if you use a sample of a previous distribution of parameters as a starting point this is not a problem). We will use the same toy example of a bivariate normal distribution and that we used in the previous examples, but now with 4 Markov chains in parallel with different starting points[8].
+First, let's defined 4 different pairs of starting points using a nice Cartesian product from Julia's Base.Iterators
:
const starts = collect(Iterators.product((-2.5, 2.5), (2.5, -2.5)))
2×2 Matrix{Tuple{Float64, Float64}}:
+ (-2.5, 2.5) (-2.5, -2.5)
+ (2.5, 2.5) (2.5, -2.5)
+Also, I will restrict this simulation to 100 samples:
+const S_parallel = 100;
+Additionally, note that we are using different seed
s:
X_met_1 = metropolis(
+ S_parallel, width, ρ; seed=124, start_x=first(starts[1]), start_y=last(starts[1])
+);
+X_met_2 = metropolis(
+ S_parallel, width, ρ; seed=125, start_x=first(starts[2]), start_y=last(starts[2])
+);
+X_met_3 = metropolis(
+ S_parallel, width, ρ; seed=126, start_x=first(starts[3]), start_y=last(starts[3])
+);
+X_met_4 = metropolis(
+ S_parallel, width, ρ; seed=127, start_x=first(starts[4]), start_y=last(starts[4])
+);
Acceptance rate is: 0.24
+Acceptance rate is: 0.18
+Acceptance rate is: 0.18
+Acceptance rate is: 0.23
+
+There have been some significant changes in the approval rate for Metropolis proposals. All were around 13%-24%, this is due to the low number of samples (only 100 for each Markov chain), if the samples were larger we would see these values converge to close to 20% according to the previous example of 10,000 samples with a single stream (Roberts et. al, 1997).
+Now let's take a look on how those 4 Metropolis Markov chains sample the parameter space starting from different positions. Each chain will have its own marker and path color, so that we can see their different behavior:
+using Colors
+const logocolors = Colors.JULIA_LOGO_COLORS;
+f, ax, l = lines(
+ getellipsepoints(μ, Σ; confidence=0.9)...;
+ label="90% HPD",
+ linewidth=2,
+ axis=(; limits=(-3, 3, -3, 3), xlabel=L"\theta_1", ylabel=L"\theta_2"),
+)
+axislegend(ax)
+record(f, joinpath(@OUTPUT, "parallel_met.gif"); framerate=5) do frame
+ for i in 1:99
+ scatter!(ax, (X_met_1[i, 1], X_met_1[i, 2]); color=(logocolors.blue, 0.5))
+ linesegments!(
+ X_met_1[i:(i + 1), 1], X_met_1[i:(i + 1), 2]; color=(logocolors.blue, 0.5)
+ )
+ scatter!(ax, (X_met_2[i, 1], X_met_2[i, 2]); color=(logocolors.red, 0.5))
+ linesegments!(
+ X_met_2[i:(i + 1), 1], X_met_2[i:(i + 1), 2]; color=(logocolors.red, 0.5)
+ )
+ scatter!(ax, (X_met_3[i, 1], X_met_3[i, 2]); color=(logocolors.green, 0.5))
+ linesegments!(
+ X_met_3[i:(i + 1), 1], X_met_3[i:(i + 1), 2]; color=(logocolors.green, 0.5)
+ )
+ scatter!(ax, (X_met_4[i, 1], X_met_4[i, 2]); color=(logocolors.purple, 0.5))
+ linesegments!(
+ X_met_4[i:(i + 1), 1], X_met_4[i:(i + 1), 2]; color=(logocolors.purple, 0.5)
+ )
+ recordframe!(frame)
+ end
+end;
+
Now we'll do the the same for Gibbs, taking care to provide also different seed
s and starting points:
X_gibbs_1 = gibbs(
+ S_parallel * 2, ρ; seed=124, start_x=first(starts[1]), start_y=last(starts[1])
+);
+X_gibbs_2 = gibbs(
+ S_parallel * 2, ρ; seed=125, start_x=first(starts[2]), start_y=last(starts[2])
+);
+X_gibbs_3 = gibbs(
+ S_parallel * 2, ρ; seed=126, start_x=first(starts[3]), start_y=last(starts[3])
+);
+X_gibbs_4 = gibbs(
+ S_parallel * 2, ρ; seed=127, start_x=first(starts[4]), start_y=last(starts[4])
+);
+Now let's take a look on how those 4 Gibbs Markov chains sample the parameter space starting from different positions. Each chain will have its own marker and path color, so that we can see their different behavior:
+f, ax, l = lines(
+ getellipsepoints(μ, Σ; confidence=0.9)...;
+ label="90% HPD",
+ linewidth=2,
+ axis=(; limits=(-3, 3, -3, 3), xlabel=L"\theta_1", ylabel=L"\theta_2"),
+)
+axislegend(ax)
+record(f, joinpath(@OUTPUT, "parallel_gibbs.gif"); framerate=5) do frame
+ for i in 1:199
+ scatter!(ax, (X_gibbs_1[i, 1], X_gibbs_1[i, 2]); color=(logocolors.blue, 0.5))
+ linesegments!(
+ X_gibbs_1[i:(i + 1), 1], X_gibbs_1[i:(i + 1), 2]; color=(logocolors.blue, 0.5)
+ )
+ scatter!(ax, (X_gibbs_2[i, 1], X_gibbs_2[i, 2]); color=(logocolors.red, 0.5))
+ linesegments!(
+ X_gibbs_2[i:(i + 1), 1], X_gibbs_2[i:(i + 1), 2]; color=(logocolors.red, 0.5)
+ )
+ scatter!(ax, (X_gibbs_3[i, 1], X_gibbs_3[i, 2]); color=(logocolors.green, 0.5))
+ linesegments!(
+ X_gibbs_3[i:(i + 1), 1], X_gibbs_3[i:(i + 1), 2]; color=(logocolors.green, 0.5)
+ )
+ scatter!(ax, (X_gibbs_4[i, 1], X_gibbs_4[i, 2]); color=(logocolors.purple, 0.5))
+ linesegments!(
+ X_gibbs_4[i:(i + 1), 1], X_gibbs_4[i:(i + 1), 2]; color=(logocolors.purple, 0.5)
+ )
+ recordframe!(frame)
+ end
+end;
+
The problems of low acceptance rates of proposals for Metropolis techniques and the low performance of the Gibbs algorithm in multidimensional problems (in which the posterior's topology is quite complex) led to the emergence of a new MCMC technique using Hamiltonian dynamics (in honor of the Irish physicist William Rowan Hamilton (1805-1865) (figure below). The name for this technique is Hamiltonian Monte Carlo – HMC.
+The HMC is an adaptation of the Metropolis technique and employs a guided scheme for generating new proposals: this improves the proposal's acceptance rate and, consequently, efficiency. More specifically, the HMC uses the posterior log gradient to direct the Markov chain to regions of higher posterior density, where most samples are collected. As a result, a Markov chain with the well-adjusted HMC algorithm will accept proposals at a much higher rate than the traditional Metropolis algorithm (Roberts et. al, 1997).
+HMC was first described in the physics literature (Duane, Kennedy, Pendleton & Roweth, 1987) (which they called the "Hybrid" Monte Carlo – HMC). Soon after, HMC was applied to statistical problems by Neal (1994) (which he called Hamiltonean Monte Carlo – HMC). For an in-depth discussion regarding HMC (which is not the focus of this tutorial), I recommend Neal (2011) and Betancourt (2017).
+HMC uses Hamiltonian dynamics applied to particles exploring the topology of a posterior density. In some simulations Metropolis has an acceptance rate of approximately 23%, while HMC 65% (Gelman et al., 2013b). In addition to better exploring the posterior's topology and tolerating complex topologies, HMC is much more efficient than Metropolis and does not suffer from the Gibbs' parameter correlation problem.
+For each component , the HMC adds a momentum variable . The subsequent density is increased by an independent distribution of the momentum, thus defining a joint distribution:
+ +HMC uses a proposal distribution that changes depending on the current state in the Markov chain. The HMC discovers the direction in which the posterior distribution increases, called gradient, and distorts the distribution of proposals towards the gradient. In the Metropolis algorithm, the distribution of the proposals would be a (usually) Normal distribution centered on the current position, so that jumps above or below the current position would have the same probability of being proposed. But the HMC generates proposals quite differently.
+You can imagine that for high-dimensional posterior densities that have narrow diagonal valleys and even curved valleys, the HMC dynamics will find proposed positions that are much more promising than a naive symmetric proposal distribution, and more promising than the Gibbs sampling, which can get stuck in diagonal walls.
+The probability of the Markov chain changing state in the HMC algorithm is defined as:
+ +where is the momentum.
+We normally give a normal multivariate distribution with a mean of 0 and a covariance of , a "mass matrix". To keep things a little bit simpler, we use a diagonal mass matrix . This makes the components of independent with
+The HMC algorithm is very similar to the Metropolis algorithm but with the inclusion of the momentum as a way of quantifying the gradient of the posterior distribution:
+Sample from a
+ +Simultaneously sample and with leapfrog steps each scaled by a factor. In a leapfrog step, both and are changed, in relation to each other. Repeat the following steps times:
+ +2.1. Use the gradient of log posterior of to produce a half-step of :
+ +2.2 Use the momentum vector to update the parameter vector :
+ +2.3. Use again the gradient of log posterior of to another half-step of :
+ +Assign and as the values of the parameter vector and the momentum vector, respectively, at the beginning of the leapfrog process (step 2) and and as the values after steps. As an acceptance/rejection rule calculate:
+ +Assign:
+ +Alright let's code the HMC algorithm for our toy example's bivariate normal distribution:
+using ForwardDiff: gradient
+function hmc(
+ S::Int64,
+ width::Float64,
+ ρ::Float64;
+ L=40,
+ ϵ=0.001,
+ μ_x::Float64=0.0,
+ μ_y::Float64=0.0,
+ σ_x::Float64=1.0,
+ σ_y::Float64=1.0,
+ start_x=-2.5,
+ start_y=2.5,
+ seed=123,
+)
+ rgn = MersenneTwister(seed)
+ binormal = MvNormal([μ_x; μ_y], [σ_x ρ; ρ σ_y])
+ draws = Matrix{Float64}(undef, S, 2)
+ accepted = 0::Int64
+ x = start_x
+ y = start_y
+ @inbounds draws[1, :] = [x y]
+ M = [1.0 0.0; 0.0 1.0]
+ ϕ_d = MvNormal([0.0, 0.0], M)
+ for s in 2:S
+ x_ = rand(rgn, Uniform(x - width, x + width))
+ y_ = rand(rgn, Uniform(y - width, y + width))
+ ϕ = rand(rgn, ϕ_d)
+ kinetic = sum(ϕ .^ 2) / 2
+ log_p = logpdf(binormal, [x, y]) - kinetic
+ ϕ += 0.5 * ϵ * gradient(x -> logpdf(binormal, x), [x_, y_])
+ for l in 1:L
+ x_, y_ = [x_, y_] + (ϵ * M * ϕ)
+ ϕ += +0.5 * ϵ * gradient(x -> logpdf(binormal, x), [x_, y_])
+ end
+ ϕ = -ϕ # make the proposal symmetric
+ kinetic = sum(ϕ .^ 2) / 2
+ log_p_ = logpdf(binormal, [x_, y_]) - kinetic
+ r = exp(log_p_ - log_p)
+
+ if r > rand(rgn, Uniform())
+ x = x_
+ y = y_
+ accepted += 1
+ end
+ @inbounds draws[s, :] = [x y]
+ end
+ println("Acceptance rate is: $(accepted / S)")
+ return draws
+end
hmc (generic function with 1 method)
+In the hmc()
function above I am using the gradient()
function from ForwardDiff.jl
(Revels, Lubin & Papamarkou, 2016) which is Julia's package for forward mode auto differentiation (autodiff). The gradient()
function accepts a function as input and an array . It literally evaluates the function at and returns the gradient . This is one the advantages of Julia: I don't need to implement an autodiff for logpdf()
s of any distribution, it will be done automatically for any one from Distributions.jl
. You can also try reverse mode autodiff with ReverseDiff.jl
if you want to; and it will also very easy to get a gradient. Now, we've got carried away with Julia's amazing autodiff potential... Let me show you an example of a gradient of a log PDF evaluated at some value. I will use our mvnormal
bivariate normal distribution as an example and evaluate its gradient at and :
gradient(x -> logpdf(mvnormal, x), [1, -1])
2-element Vector{Float64}:
+ -5.000000000000003
+ 5.000000000000003
+So the gradient tells me that the partial derivative of with respect to our mvnormal
distribution is -5
and the partial derivative of with respect to our mvnormal
distribution is 5
.
Now let's run our HMC Markov chain. We are going to use and (don't ask me how I found out) :
+X_hmc = hmc(S, width, ρ; ϵ=0.0856, L=40);
Acceptance rate is: 0.2079
+
+Our acceptance rate is 20.79%. As before lets' take a quick peek into X_hmc
, we'll see it's a matrix of and values as columns and the time as rows:
X_hmc[1:10, :]
10×2 Matrix{Float64}:
+ -2.5 2.5
+ -2.5 2.5
+ -1.50904 1.52822
+ -1.50904 1.52822
+ -1.85829 -0.464346
+ -0.739967 -1.15873
+ -0.739967 -1.15873
+ -0.739967 -1.15873
+ -0.739967 -1.15873
+ -0.739967 -1.15873
+Again, we construct a Chains
object by passing a matrix along with the parameters names as symbols inside the Chains()
constructor:
chain_hmc = Chains(X_hmc, [:X, :Y]);
+Then we can get summary statistics regarding our Markov chain derived from the HMC algorithm:
+summarystats(chain_hmc)
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
+
+ X 0.0169 0.9373 0.0222 1738.2585 1537.0420 1.0014 missing
+ Y 0.0073 0.9475 0.0237 1556.0589 1640.5275 1.0003 missing
+
+Both of X
and Y
have mean close to 0 and standard deviation close to 1 (which are the theoretical values). Take notice of the ess
(effective sample size - ESS) that is around 1,600. Now let's calculate the efficiency of our HMC algorithm by dividing the ESS by the number of sampling iterations:
mean(summarystats(chain_hmc)[:, :ess_tail]) / S
0.15887847790933807
+We see that a simple naïve (and not well-calibrated[9]) HMC has 70% more efficiency from both Gibbs and Metropolis. ≈ 10% versus ≈ 17%. Great! 😀
+This wouldn't be complete without animations for HMC!
+The animation in figure below shows the first 100 simulations of the HMC algorithm used to generate X_hmc
. Note that we have a gradient-guided proposal at each iteration, so the animation would resemble more like a very lucky random-walk Metropolis [10]. The blue ellipsis represents the 90% HPD of our toy example's bivariate normal distribution.
f, ax, l = lines(
+ getellipsepoints(μ, Σ; confidence=0.9)...;
+ label="90% HPD",
+ linewidth=2,
+ axis=(; limits=(-3, 3, -3, 3), xlabel=L"\theta_1", ylabel=L"\theta_2"),
+)
+axislegend(ax)
+record(f, joinpath(@OUTPUT, "hmc_anim.gif"); framerate=5) do frame
+ for i in 1:100
+ scatter!(ax, (X_hmc[i, 1], X_hmc[i, 2]); color=(:red, 0.5))
+ linesegments!(X_hmc[i:(i + 1), 1], X_hmc[i:(i + 1), 2]; color=(:green, 0.5))
+ recordframe!(frame)
+ end
+end;
+
As usual, let's take a look how the first 1,000 simulations were, excluding 1,000 initial iterations as warm-up.
+f, ax, l = lines(
+ getellipsepoints(μ, Σ; confidence=0.9)...;
+ label="90% HPD",
+ linewidth=2,
+ axis=(; limits=(-3, 3, -3, 3), xlabel=L"\theta_1", ylabel=L"\theta_2"),
+)
+axislegend(ax)
+scatter!(
+ ax,
+ X_hmc[warmup:(warmup + 1_000), 1],
+ X_hmc[warmup:(warmup + 1_000), 2];
+ color=(:red, 0.3),
+)
+
And, finally, lets take a look in the all 9,000 samples generated after the warm-up of 1,000 iterations.
+f, ax, l = lines(
+ getellipsepoints(μ, Σ; confidence=0.9)...;
+ label="90% HPD",
+ linewidth=2,
+ axis=(; limits=(-3, 3, -3, 3), xlabel=L"\theta_1", ylabel=L"\theta_2"),
+)
+axislegend(ax)
+scatter!(ax, X_hmc[warmup:end, 1], X_hmc[warmup:end, 2]; color=(:red, 0.3))
+
There are cases where HMC will be much better than Metropolis or Gibbs. In particular, these cases focus on complicated and difficult-to-explore posterior topologies. In these contexts, an algorithm that can guide the proposals to regions of higher density (such as the case of the HMC) is able to explore much more efficient (less iterations for convergence) and effective (higher rate of acceptance of the proposals).
+See figure below for an example of a bimodal posterior density with also the marginal histogram of and :
+ +d1 = MvNormal([10, 2], [1 0; 0 1])
+d2 = MvNormal([0, 0], [8.4 2.0; 2.0 1.7])
+
+d = MixtureModel([d1, d2])
+
+x = -6:0.01:15
+y = -2.5:0.01:4.2
+dens_mixture = [pdf(d, [i, j]) for i in x, j in y]
+
+f, ax, s = surface(
+ x,
+ y,
+ dens_mixture;
+ axis=(type=Axis3, xlabel=L"X", ylabel=L"Y", zlabel="PDF", azimuth=pi / 4),
+)
+
And to finish an example of Neal's funnel (Neal, 2003) in the figure below. This is a very difficult posterior to be sampled even for HMC, as it varies in geometry in the dimensions and . This means that the HMC sampler has to change the leapfrog steps and the scaling factor every time, since at the top of the image (the top of the funnel) a large value of is needed along with a small ; and at the bottom (at the bottom of the funnel) the opposite: small and large .
+funnel_y = rand(Normal(0, 3), 10_000)
+funnel_x = rand(Normal(), 10_000) .* exp.(funnel_y / 2)
+
+f, ax, s = scatter(
+ funnel_x,
+ funnel_y;
+ color=(:steelblue, 0.3),
+ axis=(; xlabel=L"X", ylabel=L"Y", limits=(-100, 100, nothing, nothing)),
+)
+
If you haven't understood anything by now, don't despair. Skip all the formulas and get the visual intuition of the algorithms. See the limitations of Metropolis and Gibbs and compare the animations and figures with those of the HMC. The superiority of efficiency (more samples with low autocorrelation - ESS) and effectiveness (more samples close to the most likely areas of the target distribution) is self-explanatory by the images.
+In addition, you will probably never have to code your HMC algorithm (Gibbs, Metropolis or any other MCMC) by hand. For that there are packages like Turing's AdvancedHMC.jl
In addition, AdvancedHMC
has a modified HMC with a technique called No-U-Turn Sampling (NUTS)[11] (Hoffman & Gelman, 2011) that selects automatically the values of (scaling factor) and (leapfrog steps). The performance of the HMC is highly sensitive to these two "hyperparameters" (parameters that must be specified by the user). In particular, if is too small, the algorithm exhibits undesirable behavior of a random walk, while if is too large, the algorithm wastes computational efficiency. NUTS uses a recursive algorithm to build a set of likely candidate points that span a wide range of proposal distribution, automatically stopping when it starts to go back and retrace its steps (why it doesn't turn around - No U-turn), in addition NUTS also automatically calibrates simultaneously and .
All Markov chains have some convergence and diagnostics metrics that you should be aware of. Note that this is still an ongoing area of intense research and new metrics are constantly being proposed (e.g. Lambert & Vehtari (2020) or Vehtari, Gelman., Simpson, Carpenter & Bürkner (2021)) To show MCMC metrics let me recover our six-sided dice example from 4. How to use Turing:
+using Turing
+
+@model function dice_throw(y)
+ #Our prior belief about the probability of each result in a six-sided dice.
+ #p is a vector of length 6 each with probability p that sums up to 1.
+ p ~ Dirichlet(6, 1)
+
+ #Each outcome of the six-sided dice has a probability p.
+ return y ~ filldist(Categorical(p), length(y))
+end;
+Let's once again generate 1,000 throws of a six-sided dice:
+data_dice = rand(DiscreteUniform(1, 6), 1_000);
+Like before we'll sample using NUTS()
and 1,000 iterations. Note that you can use Metropolis with MH()
, Gibbs with Gibbs()
and HMC with HMC()
if you want to. You can check all Turing's different MCMC samplers in Turing's documentation.
model = dice_throw(data_dice)
+chain = sample(model, NUTS(), 1_000);
+summarystats(chain)
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ p[1] 0.1689 0.0118 0.0003 1965.3374 876.7033 0.9995 613.7843
+ p[2] 0.1661 0.0119 0.0003 1419.7589 739.3257 1.0001 443.3975
+ p[3] 0.1600 0.0108 0.0003 1594.9950 873.7550 0.9990 498.1246
+ p[4] 0.1700 0.0116 0.0003 1998.9151 884.6355 1.0045 624.2708
+ p[5] 0.1710 0.0117 0.0003 1670.1179 813.8896 0.9994 521.5858
+ p[6] 0.1641 0.0120 0.0003 2055.8182 792.1886 1.0011 642.0419
+
+We have the following columns that outputs some kind of MCMC summary statistics:
+mcse
: Monte Carlo Standard Error, the uncertainty about a statistic in the sample due to sampling error.
ess
: Effective Sample Size, a rough approximation of the number of effective samples sampled by the MCMC estimated by the value of rhat
.
rhat
: a metric of convergence and stability of the Markov chain.
The most important metric to take into account is the rhat
which is a metric that measures whether the Markov chains are stable and converged to a value during the total progress of the sampling procedure. It is basically the proportion of variation when comparing two halves of the chains after discarding the warmups. A value of 1 implies convergence and stability. By default, rhat
must be less than 1.01 for the Bayesian estimation to be valid (Brooks & Gelman, 1998; Gelman & Rubin, 1992).
Note that all of our model's parameters have achieve a nice ess
and, more important, rhat
in the desired range, a solid indicator that the Markov chain is stable and has converged to the estimated parameter values.
Depending on the model and data, it is possible that HMC (even with NUTS) will not achieve convergence. NUTS will not converge if it encounters divergences either due to a very pathological posterior density topology or if you supply improper parameters. To exemplify let me run a "bad" chain by passing the NUTS()
constructor a target acceptance rate of 0.3
with only 500 sampling iterations (including warmup of 1,000 iters):
bad_chain = sample(model, NUTS(0.3), 500)
+summarystats(bad_chain)
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ p[1] 0.1684 0.0098 0.0029 8.6088 16.0774 1.1119 128.4895
+ p[2] 0.1657 0.0130 0.0029 12.6348 NaN 1.0553 188.5793
+ p[3] 0.1637 0.0112 0.0025 20.4216 26.6760 1.2401 304.7994
+ p[4] 0.1719 0.0102 0.0022 27.0245 23.5726 1.0613 403.3510
+ p[5] 0.1728 0.0181 0.0056 13.6728 13.2203 1.1900 204.0718
+ p[6] 0.1575 0.0175 0.0059 6.4771 3.5114 1.1389 96.6725
+
+Here we can see that the ess
and rhat
of the bad_chain
are really bad! There will be several divergences that we can access in the column numerical_error
of a Chains
object. Here we have 64 divergences.
sum(bad_chain[:numerical_error])
17.0
+Also we can see the Markov chain acceptance rate in the column acceptance_rate
. This is the bad_chain
acceptance rate:
mean(bad_chain[:acceptance_rate])
0.018655899447476025
+And now the "good" chain
:
mean(chain[:acceptance_rate])
0.7934378228300696
+What a difference huh? 80% versus 1.5%.
+Besides the rhat
values, we can also visualize the Markov chain with a traceplot. The traceplot is the overlap of the MCMC chain sampling for each estimated parameter (vertical axis). The idea is that the chains mix and that there is no slope or visual pattern along the iterations (horizontal axis). This demonstrates that the chains have mixed converged to a certain value of the parameter and remained in that region during a good part (or all) of the Markov chain sampling. We can do that with the MCMCChains.jl
's function traceplot()
. Let's look the "good" chain
first:
using AlgebraOfGraphics
+params = names(chain, :parameters)
+chain_mapping =
+ mapping(params .=> "sample value") *
+ mapping(; color=:chain => nonnumeric, row=dims(1) => renamer(params))
+plt = data(chain) * mapping(:iteration) * chain_mapping * visual(Lines)
+f = Figure(; resolution=(1200, 900))
+draw!(f[1, 1], plt)
+
chain
And now the bad_chain
:
params = names(bad_chain, :parameters)
+chain_mapping =
+ mapping(params .=> "sample value") *
+ mapping(; color=:chain => nonnumeric, row=dims(1) => renamer(params))
+plt = data(bad_chain) * mapping(:iteration) * chain_mapping * visual(Lines)
+f = Figure(; resolution=(1200, 900))
+draw!(f[1, 1], plt)
+
bad_chain
We can see that the bad_chain
is all over the place and has definitely not converged or became stable.
First: Before making fine adjustments to the number of chains or the number of iterations (among others ...) know that Turing's NUTS sampler is very efficient and effective in exploring the most diverse and complex "crazy" topologies of posteriors' target distributions. The standard arguments NUTS()
work perfectly for 99% of cases (even in complex models). That said, most of the time when you have sampling and computational problems in your Bayesian model, the problem is in the model specification and not in the MCMC sampling algorithm. This is known as Folk Theorem (Gelman, 2008). In the words of Andrew Gelman:
++"When you have computational problems, often there's a problem with your model".
+
Andrew Gelman (Gelman, 2008)
If your Bayesian model has problems with convergence there are some steps that can be tried[12]. Listed here from the simplest to the most complex:
+Increase the number of iterations and chains: First option is to increase the number of MCMC iterations and it is also possible to increase the number of parallel chains to be sampled.
+ +Model reparameterization: the second option is to reparameterize the model. There are two ways to parameterize the model: the first with centered parameterization (CP) and the second with non-centered parameterization (NCP). NCP is most useful in Multilevel Models, therefore we will cover NCP in 10. Multilevel Models.
+ +Collect more data: sometimes the model is too complex and we need a larger sample to get stable estimates.
+ +Rethink the model: convergence failure when we have adequate sampling is usually due to a specification of priors and likelihood function that are not compatible with the data. In this case, it is necessary to rethink the data's generative process in which the model's assumptions are anchored.
+ +[1] + | the symbol (\propto ) should be read as "proportional to".
+
+ |
[2] + | some references call this process burnin. + + |
[3] + | if you want a better explanation of the Metropolis and Metropolis-Hastings algorithms I suggest to see Chib & Greenberg (1995). + + |
[4] + | Due to easier computational complexity and to avoid numeric overflow we generally use sum of logs instead of multiplications, specially when dealing with probabilities, i.e. . + + |
[5] + | this is one of the packages of Turing's ecosystem. I recommend you to take a look into 4. How to use Turing. + + |
[6] + | if you want a better explanation of the Gibbs algorithm I suggest to see Casella & George (1992). + + |
[7] + | this will be clear in the animations and images. + + |
[8] + | note that there is some shenanigans here to take care. You would also want to have different seeds for the random number generator in each Markov chain. This is why metropolis() and gibbs() have a seed parameter.
+
+ |
[9] + | as detailed in the following sections, HMC is quite sensitive to the choice of and and I haven't tried to get the best possible combination of those. + + |
[10] + | or a not-drunk random-walk Metropolis 😂. + + |
[11] + | NUTS is an algorithm that uses the symplectic leapfrog integrator and builds a binary tree composed of leaf nodes that are simulations of Hamiltonian dynamics using leapfrog steps in forward or backward directions in time where is the integer representing the iterations of the construction of the binary tree. Once the simulated particle starts to retrace its trajectory, the tree construction is interrupted and the ideal number of leapfrog steps is defined as in time from the beginning of the retrogression of the trajectory. So the simulated particle never "turns around" so "No U-turn". For more details on the algorithm and how it relates to Hamiltonian dynamics see Hoffman & Gelman (2011). + + |
[12] + | furthermore, if you have high-correlated variables in your model, you can try a QR decomposition of the data matrix . I will cover QR decomposition and other Turing's computational "tricks of the trade" in 11. Computational Tricks with Turing. + + |
Betancourt, M. (2017, January 9). A Conceptual Introduction to Hamiltonian Monte Carlo. Retrieved November 6, 2019, from http://arxiv.org/abs/1701.02434
+Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. Retrieved from https://books.google.com?id=qfRsAIKZ4rIC
+Brooks, S. P., & Gelman, A. (1998). General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics, 7(4), 434–455. https://doi.org/10.1080/10618600.1998.10474787
+Casella, G., & George, E. I. (1992). Explaining the gibbs sampler. The American Statistician, 46(3), 167–174. https://doi.org/10.1080/00031305.1992.10475878
+Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings Algorithm. The American Statistician, 49(4), 327–335. https://doi.org/10.1080/00031305.1995.10476177
+Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2), 216–222. https://doi.org/10.1016/0370-2693(87)91197-X
+Eckhardt, R. (1987). Stan Ulam, John von Neumann, and the Monte Carlo Method. Los Alamos Science, 15(30), 131–136.
+Gelman, A. (1992). Iterative and Non-Iterative Simulation Algorithms. Computing Science and Statistics (Interface Proceedings), 24, 457–511. PROCEEDINGS PUBLISHED BY VARIOUS PUBLISHERS.
+Gelman, A. (2008). The folk theorem of statistical computing. Retrieved from https://statmodeling.stat.columbia.edu/2008/05/13/thefolktheore/
+Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013a). Basics of Markov Chain Simulation. In Bayesian Data Analysis. Chapman and Hall/CRC.
+Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013b). Bayesian Data Analysis. Chapman and Hall/CRC.
+Gelman, A., & Rubin, D. B. (1992). Inference from Iterative Simulation Using Multiple Sequences. Statistical Science, 7(4), 457–472. https://doi.org/10.1214/ss/1177011136
+Geman, S., & Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6), 721–741. https://doi.org/10.1109/TPAMI.1984.4767596
+Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109. https://doi.org/10.1093/biomet/57.1.97
+Hoffman, M. D., & Gelman, A. (2011). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623. Retrieved from http://arxiv.org/abs/1111.4246
+Lambert, B., & Vehtari, A. (2020). : A robust MCMC convergence diagnostic with uncertainty using decision tree classifiers. ArXiv:2003.07900 [Stat]. http://arxiv.org/abs/2003.07900
+Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6), 1087–1092. https://doi.org/10.1063/1.1699114
+Neal, Radford M. (1994). An Improved Acceptance Procedure for the Hybrid Monte Carlo Algorithm. Journal of Computational Physics, 111(1), 194–203. https://doi.org/10.1006/jcph.1994.1054
+Neal, Radford M. (2003). Slice Sampling. The Annals of Statistics, 31(3), 705–741. Retrieved from https://www.jstor.org/stable/3448413
+Neal, Radford M. (2011). MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. L. Jones, & X.-L. Meng (Eds.), Handbook of markov chain monte carlo.
+Revels, J., Lubin, M., & Papamarkou, T. (2016). Forward-Mode Automatic Differentiation in Julia. ArXiv:1607.07892 [Cs]. http://arxiv.org/abs/1607.07892
+Roberts, G. O., Gelman, A., & Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability, 7(1), 110–120. https://doi.org/10.1214/aoap/1034625254
+Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2021). Rank-Normalization, Folding, and Localization: An Improved Rˆ for Assessing Convergence of MCMC. Bayesian Analysis, 1(1), 1–28. https://doi.org/10.1214/20-BA1221
+ +"All models are wrong but some are useful"
George Box (Box, 1976)
This tutorial begins with a very provocative quote from the statistician George Box (figure below) on statistical models. Yes, all models are somehow wrong. But they are very useful. The idea is that the reality is too complex for us to understand when analyzing it in a naked and raw way. We need to somehow simplify it into individual components and analyze their relationships. But there is a danger here: any simplification of reality promotes loss of information in some way. Therefore, we always have a delicate balance between simplifications of reality through models and the inherent loss of information. Now you ask me: "how are they useful?" Imagine that you are in total darkness and you have a very powerful flashlight but with a narrow beam. Are you going to throw the flashlight away because it can't light everything around you and stay in the dark? You must use the flashlight to aim at interesting places in the darkness in order to illuminate them. You will never find a flashlight that illuminates everything with the clarity you need to analyze all the fine details of reality. Just as you will never find a unique model that will explain the whole reality around you. You need different flashlights just like you need different models. Without them you will be in total darkness.
Let's talk about a class of model known as linear regression. The idea here is to model a continuous dependent variable with a linear combination of independent variables.
where:
– dependent variable
– intercept
– coefficient vector
– data matrix
– model error
To estimate the coefficients we use a Gaussian/normal likelihood function. Mathematically the Bayesian regression model is:
Here we see that the likelihood function is a normal distribution in which depends on the parameters of the model and , in addition to having an error . We condition onto the observed data by inserting as the linear predictor of the model (the mean parameter of the model's Normal likelihood function, and is the variance parameter). What remains is to specify which are the priors of the model parameters:
Prior Distribution of – Knowledge we possess regarding the model's intercept.
Prior Distribution of – Knowledge we possess regarding the model's independent variables' coefficients.
Prior Distribution of – Knowledge we possess regarding the model's error. Important that the error can only be positive. In addition, it is intuitive to place a distribution that gives greater weight to values close to zero, but that also allows values that are far from zero, so a distribution with a long tail is welcome. Candidate distributions are which is only supported on positive real numbers (so it solves the question of negative errors) or truncated to only positive numbers (remembering that the distribution Cauchy is Student's with degrees of freedom ).
Our goal is to instantiate a linear regression with the observed data ( and ) and find the posterior distribution of our model's parameters of interest ( and ). This means to find the full posterior distribution of:
This is easily accomplished with Turing:
using Turing
+using LinearAlgebra: I
+using Statistics: mean, std
+using Random: seed!
+seed!(123)
+
+@model function linreg(X, y; predictors=size(X, 2))
+ #priors
+ α ~ Normal(mean(y), 2.5 * std(y))
+ β ~ filldist(TDist(3), predictors)
+ σ ~ Exponential(1)
+
+ #likelihood
+ return y ~ MvNormal(α .+ X * β, σ^2 * I)
+end;
Here I am specifying very weakly informative priors:
– This means a normal distribution centered on y
's mean with a standard deviation 2.5 times the standard deviation of y
. That prior should with ease cover all possible values of . Remember that the normal distribution has support over all the real number line .
– The predictors all have a prior distribution of a Student- distribution centered on 0 with variance 1 and degrees of freedom . That wide-tailed distribution will cover all possible values for our coefficients. Remember the Student- also has support over all the real number line . Also the filldist()
is a nice Turing's function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.
– A wide-tailed-positive-only distribution perfectly suited for our model's error.
Also, we are using the MvNormal
construction where we specify both a vector of means (first positional argument) and a covariance matrix (second positional argument). Regarding the covariance matrix σ^2 * I
, it uses the model's errors σ
, here parameterized as a standard deviation, squares it to produce a variance paramaterization, and multiplies by I
, which is Julia's LinearAlgebra
standard module implementation to represent an identity matrix of any size.
For our example, I will use a famous dataset called kidiq
(Gelman & Hill, 2007), which is data from a survey of adult American women and their respective children. Dated from 2007, it has 434 observations and 4 variables:
kid_score
: child's IQ
mom_hs
: binary/dummy (0 or 1) if the child's mother has a high school diploma
mom_iq
: mother's IQ
mom_age
: mother's age
Ok let's read our data with CSV.jl
and output into a DataFrame
from DataFrames.jl
:
using DataFrames
+using CSV
+using HTTP
+
+url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/kidiq.csv"
+kidiq = CSV.read(HTTP.get(url).body, DataFrame)
+describe(kidiq)
4×7 DataFrame
+ Row │ variable mean min median max nmissing eltype
+ │ Symbol Float64 Real Float64 Real Int64 DataType
+─────┼──────────────────────────────────────────────────────────────────────
+ 1 │ kid_score 86.7972 20 90.0 144 0 Int64
+ 2 │ mom_hs 0.785714 0 1.0 1 0 Int64
+ 3 │ mom_iq 100.0 71.0374 97.9153 138.893 0 Float64
+ 4 │ mom_age 22.7857 17 23.0 29 0 Int64
As you can see from the describe()
output, the mean children's IQ is around 87 while the mother's is 100. Also the mother's range from 17 to 29 years with mean of around 23 years old. Finally, note that 79% of mothers have a high school diploma.
Now let's us instantiate our model with the data:
X = Matrix(select(kidiq, Not(:kid_score)))
+y = kidiq[:, :kid_score]
+model = linreg(X, y);
And, finally, we will sample from the Turing model. We will be using the default NUTS()
sampler with 1_000
samples, but now we will sample from 4 Markov chains using multiple threads MCMCThreads()
:
chain = sample(model, NUTS(), MCMCThreads(), 1_000, 4)
+summarystats(chain)
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ α 21.5126 8.6720 0.2217 1528.8886 1868.2612 1.0009 24.7261
+ β[1] 2.0014 1.7943 0.0419 2281.1940 1734.6512 1.0016 36.8928
+ β[2] 0.5788 0.0584 0.0013 2163.9754 2292.8814 1.0006 34.9971
+ β[3] 0.2566 0.3092 0.0074 1762.0214 2135.6795 1.0010 28.4965
+ σ 17.8859 0.6033 0.0106 3271.1669 2347.2435 1.0008 52.9033
+
We had no problem with the Markov chains as all the rhat
are well below 1.01
(or above 0.99
). Our model has an error σ
of around 18. So it estimates IQ±9. The intercept α
is the basal child's IQ. So each child has 22±9 IQ before we add the coefficients multiplied by the child's independent variables. And from our coefficients , we can see that the quantile()
tells us the uncertainty around their estimates:
quantile(chain)
Quantiles
+ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ Symbol Float64 Float64 Float64 Float64 Float64
+
+ α 4.7278 15.7633 21.2942 27.4322 38.4426
+ β[1] -0.5876 0.7324 1.6761 2.9919 6.3388
+ β[2] 0.4662 0.5392 0.5793 0.6184 0.6924
+ β[3] -0.3477 0.0440 0.2588 0.4733 0.8490
+ σ 16.7525 17.4685 17.8796 18.2703 19.1238
+
β[1]
– first column of X
, mom_hs
, has 95% credible interval that is all over the place, including zero. So its effect on child's IQ is inconclusive.
β[2]
– second column of X
, mom_iq
, has a 95% credible interval from 0.46 to 0.69. So we expect that every increase in the mother's IQ is associated with a 0.46 to 0.69 increase in the child's IQ.
β[3]
– third column of X
, mom_age
, has also 95% credible interval that is all over the place, including zero. Like mom_hs
, its effect on child's IQ is inconclusive.
That's how you interpret 95% credible intervals from a quantile()
output of a linear regression Chains
object.
Box, G. E. P. (1976). Science and Statistics. Journal of the American Statistical Association, 71(356), 791–799. https://doi.org/10.2307/2286841
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge university press.
Leaving the universe of linear models, we start to venture into generalized linear models (GLM). The first is logistic regression (also called binomial regression).
A logistic regression behaves exactly like a linear model: it makes a prediction simply by computing a weighted sum of the independent variables by the estimated coefficients , plus an intercept . However, instead of returning a continuous value , such as linear regression, it returns the logistic function of :
We use logistic regression when our dependent variable is binary. It has only two distinct values, usually encoded as or . See the figure below for a graphical intuition of the logistic function:
using CairoMakie
+
+function logistic(x)
+ return 1 / (1 + exp(-x))
+end
+
+f, ax, l = lines(-10 .. 10, logistic; axis=(; xlabel=L"x", ylabel=L"\mathrm{Logistic}(x)"))
+f
As we can see, the logistic function is basically a mapping of any real number to a real number in the range between 0 and 1:
That is, the logistic function is the ideal candidate for when we need to convert something continuous without restrictions to something continuous restricted between 0 and 1. That is why it is used when we need a model to have a probability as a dependent variable (remembering that any real number between 0 and 1 is a valid probability). In the case of a binary dependent variable, we can use this probability as the chance of the dependent variable taking a value of 0 or 1.
Linear regression follows the following mathematical formulation:
- model parameters
- intercept
- independent variables coefficients
- total number of independent variables
Logistic regression would add the logistic function to the linear term:
- predicted probability of the observation being the value 1
- predicted discrete value of
Example:
We can model logistic regression in two ways. The first option with a Bernoulli likelihood function and the second option with a binomial likelihood function.
With the Bernoulli likelihood we model a binary dependent variable which is the result of a Bernoulli trial with a certain probability .
In a binomial likelihood, we model a continuous dependent variable which is the number of successes of independent Bernoulli trials.
where:
– binary dependent variable.
– probability of taking the value of – success of an independent Bernoulli trial.
– logistic function.
– intercept.
– coefficient vector.
– data matrix.
where:
– binary dependent variable – successes of independent Bernoulli trials.
– number of independent Bernoulli trials.
– probability of taking the value of – success of an independent Bernoulli trial.
– logistic function.
– intercept.
– coefficient vector.
– data matrix.
In both likelihood options, what remains is to specify the model parameters' prior distributions:
Prior Distribution of – Knowledge we possess regarding the model's intercept.
Prior Distribution of – Knowledge we possess regarding the model's independent variables' coefficients.
Our goal is to instantiate a logistic regression with the observed data ( and ) and find the posterior distribution of our model's parameters of interest ( and ). This means to find the full posterior distribution of:
Note that contrary to the linear regression, which used a Gaussian/normal likelihood function, we don't have an error parameter in our logistic regression. This is due to neither the Bernoulli nor binomial distributions having a "scale" parameter such as the parameter in the Gaussian/normal distribution.
Also note that the Bernoulli distribution is a special case of the binomial distribution where :
This is easily accomplished with Turing:
using Turing
+using LazyArrays
+using Random: seed!
+seed!(123)
+
+@model function logreg(X, y; predictors=size(X, 2))
+ #priors
+ α ~ Normal(0, 2.5)
+ β ~ filldist(TDist(3), predictors)
+
+ #likelihood
+ return y ~ arraydist(LazyArray(@~ BernoulliLogit.(α .+ X * β)))
+end;
Here I am specifying very weakly informative priors:
– This means a normal distribution centered on 0 with a standard deviation of 2.5. That prior should with ease cover all possible values of . Remember that the normal distribution has support over all the real number line .
– The predictors all have a prior distribution of a Student- distribution centered on 0 with variance 1 and degrees of freedom . That wide-tailed distribution will cover all possible values for our coefficients. Remember the Student- also has support over all the real number line . Also the filldist()
is a nice Turing's function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.
Turing's arraydist()
function wraps an array of distributions returning a new distribution sampling from the individual distributions. And the LazyArrays' LazyArray()
constructor wrap a lazy object that wraps a computation producing an array to an array. Last, but not least, the macro @~
creates a broadcast and is a nice short hand for the familiar dot .
broadcasting operator in Julia. This is an efficient way to tell Turing that our y
vector is distributed lazily as a BernoulliLogit
broadcasted to α
added to the product of the data matrix X
and β
coefficient vector.
If your dependent variable y
is continuous and represents the number of successes of independent Bernoulli trials you can use the binomial likelihood in your model:
y ~ arraydist(LazyArray(@~ BinomialLogit.(n, α .+ X * β)))
+For our example, I will use a famous dataset called wells
(Gelman & Hill, 2007), which is data from a survey of 3,200 residents in a small area of Bangladesh suffering from arsenic contamination of groundwater. Respondents with elevated arsenic levels in their wells had been encouraged to switch their water source to a safe public or private well in the nearby area and the survey was conducted several years later to learn which of the affected residents had switched wells. It has 3,200 observations and the following variables:
switch
– binary/dummy (0 or 1) for well-switching.
arsenic
– arsenic level in respondent's well.
dist
– distance (meters) from the respondent's house to the nearest well with safe drinking water.
association
– binary/dummy (0 or 1) if member(s) of household participate in community organizations.
educ
– years of education (head of household).
Ok let's read our data with CSV.jl
and output into a DataFrame
from DataFrames.jl
:
using DataFrames
+using CSV
+using HTTP
+
+url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/wells.csv"
+wells = CSV.read(HTTP.get(url).body, DataFrame)
+describe(wells)
5×7 DataFrame
+ Row │ variable mean min median max nmissing eltype
+ │ Symbol Float64 Real Float64 Real Int64 DataType
+─────┼──────────────────────────────────────────────────────────────────
+ 1 │ switch 0.575166 0 1.0 1 0 Int64
+ 2 │ arsenic 1.65693 0.51 1.3 9.65 0 Float64
+ 3 │ dist 48.3319 0.387 36.7615 339.531 0 Float64
+ 4 │ assoc 0.422848 0 0.0 1 0 Int64
+ 5 │ educ 4.82848 0 5.0 17 0 Int64
+As you can see from the describe()
output 58% of the respondents switched wells and 42% percent of respondents somehow are engaged in community organizations. The average years of education of the household's head is approximate 5 years and ranges from 0 (no education at all) to 17 years. The distance to safe drinking water is measured in meters and averages 48m ranging from less than 1m to 339m. Regarding arsenic levels I cannot comment because the only thing I know that it is toxic and you probably would never want to have your well contaminated with it. Here, we believe that all of those variables somehow influence the probability of a respondent switch to a safe well.
Now let's us instantiate our model with the data:
+X = Matrix(select(wells, Not(:switch)))
+y = wells[:, :switch]
+model = logreg(X, y);
+And, finally, we will sample from the Turing model. We will be using the default NUTS()
sampler with 1_000
samples, with 4 Markov chains using multiple threads MCMCThreads()
:
chain = sample(model, NUTS(), MCMCThreads(), 1_000, 4)
+summarystats(chain)
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ α -0.1537 0.1008 0.0026 1487.6006 2091.0499 1.0020 26.5174
+ β[1] 0.4664 0.0419 0.0009 2237.3707 2405.0432 1.0010 39.8825
+ β[2] -0.0090 0.0010 0.0000 4269.6543 3192.4775 1.0009 76.1093
+ β[3] -0.1226 0.0777 0.0019 1680.2431 1877.4329 1.0002 29.9514
+ β[4] 0.0424 0.0096 0.0002 2656.3110 2520.0618 1.0020 47.3504
+
+We had no problem with the Markov chains as all the rhat
are well below 1.01
(or above 0.99
). Note that the coefficients are in log-odds scale. They are the natural log of the odds[1], and odds is defined as:
where is a probability. So log-odds is defined as:
+ +So in order to get odds from a log-odds we must undo the log operation with a exponentiation. This translates to:
+ +We can do this with a transformation in a DataFrame
constructed from a Chains
object:
using Chain
+
+@chain quantile(chain) begin
+ DataFrame
+ select(_, :parameters, names(_, r"%") .=> ByRow(exp); renamecols=false)
+end
5×6 DataFrame
+ Row │ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ │ Symbol Float64 Float64 Float64 Float64 Float64
+─────┼──────────────────────────────────────────────────────────────
+ 1 │ α 0.700906 0.801908 0.859547 0.91731 1.04274
+ 2 │ β[1] 1.47317 1.54865 1.59402 1.63747 1.7351
+ 3 │ β[2] 0.989048 0.990354 0.991039 0.991766 0.993111
+ 4 │ β[3] 0.75859 0.84017 0.885127 0.931194 1.03098
+ 5 │ β[4] 1.0235 1.03671 1.04353 1.0502 1.06242
+Our interpretation of odds is the same as in betting games. Anything below 1 signals a unlikely probability that will be . And anything above 1 increases the probability of being , while 1 itself is a neutral odds for being either or . Since I am not a gambling man, let's talk about probabilities. So I will create a function called logodds2prob()
that converts log-odds to probabilities:
function logodds2prob(logodds::Float64)
+ return exp(logodds) / (1 + exp(logodds))
+end
+
+@chain quantile(chain) begin
+ DataFrame
+ select(_, :parameters, names(_, r"%") .=> ByRow(logodds2prob); renamecols=false)
+end
5×6 DataFrame
+ Row │ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ │ Symbol Float64 Float64 Float64 Float64 Float64
+─────┼──────────────────────────────────────────────────────────────
+ 1 │ α 0.412078 0.445033 0.462234 0.478436 0.510462
+ 2 │ β[1] 0.59566 0.607635 0.614497 0.620848 0.634383
+ 3 │ β[2] 0.497247 0.497577 0.49775 0.497933 0.498272
+ 4 │ β[3] 0.431363 0.456572 0.469532 0.482186 0.507626
+ 5 │ β[4] 0.505807 0.509012 0.510649 0.512243 0.515133
+There you go, much better now. Let's analyze our results. The intercept α
is the basal switch
probability which has a median value of 46%. All coefficients whose 95% credible intervals captures the value tells that the effect on the propensity of switch
is inconclusive. It is pretty much similar to a 95% credible interval that captures the 0 in the linear regression coefficients. So this rules out β[3]
which is the third column of X
– assoc
. The other remaining 95% credible intervals can be interpreted as follows:
β[1]
– first column of X
, arsenic
, has 95% credible interval 0.595 to 0.634. This means that each increase in one unit of arsenic
is related to an increase of 9.6% to 13.4% propension of switch
being 1.
β[2]
– second column of X
, dist
, has a 95% credible interval from 0.497 to 0.498. So we expect that each increase in one meter of dist
is related to a decrease of 0.01% propension of switch
being 1.
β[4]
– fourth column of X
, educ
, has a 95% credible interval from 0.506 to 0.515. Each increase in one year of educ
is related to an increase of 0.6% to 1.5% propension of switch
being 1.
That's how you interpret 95% credible intervals from a quantile()
output of a logistic regression Chains
object converted from log-odds to probability.
[1] + | actually the logit function or the log-odds is the logarithm of the odds where is a probability. + + |
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge university press.
+ +Leaving the universe of linear models, we start to venture into generalized linear models (GLM). The second is ordinal regression.
A ordinal regression behaves exactly like a linear model: it makes a prediction simply by computing a weighted sum of the independent variables by the estimated coefficients , but now we have not only one intercept but several intercepts for .
We use ordinal regression when our dependent variable is ordinal. That means it has different that have a "natural order"**. Most important, the distance between values is not the same. For example, imagine a pain score scale that goes from 1 to 10. The distance between 1 and 2 is different from the distance 9 to 10. Another example is opinion pools with their ubiquitous disagree-agree range of plausible values. These are also known as Likert scale variables. The distance between "disagree" to "not agree or disagree" is different than the distance between "agree" and "strongly agree".
This assumption is what we call the "metric" assumption, also called as the equidistant assumption. Almost always when we model an ordinal dependent variable this assumption is violated. Thus, we cannot blindly employ linear regression here.
So, how we deal with ordered discrete responses in our dependent variable? This is similar with the previous logistic regression approach. We have to do a non-linear transformation of the dependent variable.
In the case of ordinal regression, we need to first transform the dependent variable into a cumulative scale. We need to calculate the cumulative distribution function (CDF) of our dependent variable:
The CDF is a monotonic increasing function that depicts the probability of our random variable taking values less than a certain value. In our case, the discrete ordinal case, these values can be represented as positive integers ranging from 1 to the length of possible values. For instance, a 6-categorical ordered response variable will have 6 possible values, and all their CDFs will lie between 0 and 1. Furthermore, their sum will be 1; since it represents the total probability of the variable taking any of the possible values, i.e. 100%.
That is still not enough, we need to apply the logit function to the CDF:
where is the natural logarithm function.
The logit is the inverse of the logistic transformation, it takes as a input any number between 0 and 1 (where a probability is the perfect candidate) and outputs a real number, which we call the log-odds.
Since we are taking the log-odds of the CDF, we can call this complete transformation as log-odds of the CDF, or log-cumulative-odds.
Now, the next question is: what do we do with the log-cumulative-odds?
We need the log-cumulative-odds because it allows us to construct different intercepts for the possible values our ordinal dependent variable. We create an unique intercept for each possible outcome .
Notice that the highest probable value of will always have a log-cumulative-odds of , since for :
Thus, we only need intercepts for a possible dependent variables' response values. These are known as cut points.
Each intercept implies a CDF for each value . This allows us to violate the equidistant assumption absent in most ordinal variables.
Each intercept implies a log-cumulative-odds for each . We also need to undo the cumulative nature of the intercepts. We can accomplish this by first converting a log-cumulative-odds back to a cumulative probability. This is done by undoing the logit transformation and applying the logistic function:
Then, finally, we can remove the cumulative from "cumulative probability" by subtraction of each of the cut points by their previous cut point:
where is the dependent variable and are the cut points for each intercept.
Let me show you an example with some synthetic data.
using DataFrames
+using CairoMakie
+using AlgebraOfGraphics
+using Distributions
+using StatsFuns: logit
Here we have a discrete variable x
with 6 possible ordered values as response. The values range from 1 to 6 having probability, respectively: 10%, 15%, 33%, 25%, 10%, and 7%; represented with the probs
vector. The underlying distribution is represented by a Categorical
distribution from Distributions.jl
, which takes a vector of probabilities as parameters (probs
).
For each value we are calculating:
Probability Mass Function with the pdf
function
Cumulative Distribution Function with the cdf
function
Log-cumulative-odds with the logit
transformation of the CDF
In the plot below there are 3 subplots:
Upper corner: histogram of x
Left lower corner: CDF of x
Right lower corner: log-cumulative-odds of x
let
+ probs = [0.10, 0.15, 0.33, 0.25, 0.10, 0.07]
+ dist = Categorical(probs)
+ x = 1:length(probs)
+ x_pmf = pdf.(dist, x)
+ x_cdf = cdf.(dist, x)
+ x_logodds_cdf = logit.(x_cdf)
+ df = DataFrame(; x, x_pmf, x_cdf, x_logodds_cdf)
+ labels = ["CDF", "Log-cumulative-odds"]
+ f = Figure()
+ plt1 = data(df) * mapping(:x, :x_pmf) * visual(BarPlot)
+ plt2 =
+ data(df) *
+ mapping(:x, [:x_cdf, :x_logodds_cdf]; col=dims(1) => renamer(labels)) *
+ visual(ScatterLines)
+ axis = (; xticks=1:6)
+ draw!(f[1, 2:3], plt1; axis)
+ draw!(f[2, 1:4], plt2; axis, facet=(; linkyaxes=:none))
+ f
+end
CairoMakie.Screen{SVG}
+
As we can see, we have (in our case ) intercept values in log-cumulative-odds. You can carly see that these values they violate the equidistant assumption for metric response values. The spacing between the cut points are not the same, they vary.
Ok, the intercepts are done. Now let's add coefficients to act as covariate effects in our ordinal regression model.
We transformed everything into log-odds scale so that we can add effects (coefficients multiplying a covariate) or basal rates (intercepts) together. For each , we calculate:
where is the log-cumulative-odds for the intercepts, is the coefficient for the th covariate . Finally, represents the linear predictor for the th intercept.
Observe that the coefficient is being added to a log-cumulative-odds, such that, it will be expressed in a log-cumulative-odds also.
We can express it in matrix form:
where , and are vectors and is the data matrix where each row is an observation and each column a covariate.
This still obeys the ordered constraint on the dependent variable possible values.
Now, suppose we have found our ideal value for our . How we would interpret our estimated values?
First, to recap, measures the strength of association between the covariate and dependent variable . But, is nested inside a non-linear transformation called logistic function:
So, our first step is to undo the logistic function. There is a function that is called logit function that is the inverse of the logistic function:
where is the natural logarithm function.
If we analyze closely the logit function we can find that inside the there is a disguised odds in the . Hence, our can be interpreted as the logarithm of the odds, or in short form: the log-odds.
We already saw how odds, log-odds, and probability are related in the previous logistic regression tutorial. So you might want to go back there to get the full explanation.
The log-odds are the key to interpret coefficient **. Any positive value of means that there exists a positive association between and , while any negative values indicate a negative association between and . Values close to 0 demonstrates the lack of association between and .
We have almost everything we need for our ordinal regression. We are only missing a final touch. Currently our logistic function outputs a vector of probabilities that sums to 1.
All of the intercepts along with the coefficients are in log-cumulative-odds scale. If we apply the logistic function to the linear predictors we get probabilities: one for each
We need a likelihood that can handle a vector of probabilities and outputs a single discrete value. The categorical distribution is the perfect candidate.
Now we have all the components for our Bayesian ordinal regression specification:
where:
– ordered discrete dependent variable.
– probability vector of length .
– number of possible values can take, i.e. number of ordered discrete values.
– log-cumulative-odds, i.e. cut points considering the intercepts and covariates effect.
– intercept in log-cumulative-odds for each .
– covariate data matrix.
– coefficient vector of the same length as the number of columns in .
– logistic function.
– cumulative distribution function.
What remains is to specify the model parameters' prior distributions:
Prior Distribution of – Knowledge we possess regarding the model's intercepts.
Prior Distribution of – Knowledge we possess regarding the model's independent variables' coefficients.
Our goal is to instantiate an ordinal regression with the observed data ( and ) and find the posterior distribution of our model's parameters of interest ( and ). This means to find the full posterior distribution of:
Note that contrary to the linear regression, which used a Gaussian/normal likelihood function, we don't have an error parameter in our ordinal regression. This is due to the Categorical distribution not having a "scale" parameter such as the parameter in the Gaussian/normal distribution.
This is easily accomplished with Turing:
using Turing
+using Bijectors
+using LazyArrays
+using LinearAlgebra
+using Random: seed!
+using Bijectors: transformed, OrderedBijector
+seed!(123)
+
+@model function ordreg(X, y; predictors=size(X, 2), ncateg=maximum(y))
+ #priors
+ cutpoints ~ transformed(filldist(TDist(3) * 5, ncateg - 1), OrderedBijector())
+ β ~ filldist(TDist(3) * 2.5, predictors)
+
+ #likelihood
+ return y ~ arraydist([OrderedLogistic(X[i, :] ⋅ β, cutpoints) for i in 1:length(y)])
+end;
First, let's deal with the new stuff in our model: the Bijectors.jl
's transformed
and OrderedBijector
. As I've said in the 4. How to use Turing, Turing has a rich ecosystem of packages. Bijectors implements a set of functions for transforming constrained random variables (e.g. simplexes, intervals) to Euclidean space. Here we are defining cutpoints
as a ncateg - 1
vector of Student- distributions with mean 0, standard deviation 5 and degrees of freedom . Remember that we only need cutpoints for all of our intercepts. And we are also constraining it to be an ordered vector with transformed(d, OrderedBijector)
, such that for all cutpoints we have .
As before, we are giving a very weakly informative priors of a Student- distribution centered on 0 with variance 1 and degrees of freedom . That wide-tailed distribution will cover all possible values for our coefficients. Remember the Student- also has support over all the real number line . Also the filldist()
is a nice Turing's function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.
Finally, in the likelihood, Turing's arraydist()
function wraps an array of distributions returning a new distribution sampling from the individual distributions. And we use some indexing inside an array literal.
For our example, I will use a famous dataset called esoph
(Breslow & Day, 1980), which is data from a case-control study of (o)esophageal cancer in Ille-et-Vilaine, France. It has records for 88 age/alcohol/tobacco combinations:
agegp
: Age group
1
: 25-34 years
2
: 35-44 years
3
: 45-54 years
4
: 55-64 years
5
: 65-74 years
6
: 75+ years
alcgp
: Alcohol consumption
1
: 0-39 g/day
2
: 40-79 g/day
3
: 80-119 g/day
4
: 120+ g/day
tobgp
: Tobacco consumption
1
: 0-9 g/day
2
: 10-19 g/day
3
: 20-29 g/day
4
: 30+ g/day
ncases
: Number of cases
ncontrols
: Number of controls
Ok let's read our data with CSV.jl
and output into a DataFrame
from DataFrames.jl
:
using DataFrames
+using CSV
+using HTTP
+
+url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/esoph.csv"
+esoph = CSV.read(HTTP.get(url).body, DataFrame)
88×5 DataFrame
+ Row │ agegp alcgp tobgp ncases ncontrols
+ │ String7 String15 String15 Int64 Int64
+─────┼─────────────────────────────────────────────────
+ 1 │ 25-34 0-39g/day 0-9g/day 0 40
+ 2 │ 25-34 0-39g/day 10-19 0 10
+ 3 │ 25-34 0-39g/day 20-29 0 6
+ 4 │ 25-34 0-39g/day 30+ 0 5
+ 5 │ 25-34 40-79 0-9g/day 0 27
+ 6 │ 25-34 40-79 10-19 0 7
+ 7 │ 25-34 40-79 20-29 0 4
+ 8 │ 25-34 40-79 30+ 0 7
+ 9 │ 25-34 80-119 0-9g/day 0 2
+ 10 │ 25-34 80-119 10-19 0 1
+ 11 │ 25-34 80-119 30+ 0 2
+ 12 │ 25-34 120+ 0-9g/day 0 1
+ 13 │ 25-34 120+ 10-19 1 0
+ 14 │ 25-34 120+ 20-29 0 1
+ 15 │ 25-34 120+ 30+ 0 2
+ 16 │ 35-44 0-39g/day 0-9g/day 0 60
+ 17 │ 35-44 0-39g/day 10-19 1 13
+ 18 │ 35-44 0-39g/day 20-29 0 7
+ 19 │ 35-44 0-39g/day 30+ 0 8
+ 20 │ 35-44 40-79 0-9g/day 0 35
+ 21 │ 35-44 40-79 10-19 3 20
+ 22 │ 35-44 40-79 20-29 1 13
+ 23 │ 35-44 40-79 30+ 0 8
+ 24 │ 35-44 80-119 0-9g/day 0 11
+ 25 │ 35-44 80-119 10-19 0 6
+ 26 │ 35-44 80-119 20-29 0 2
+ 27 │ 35-44 80-119 30+ 0 1
+ 28 │ 35-44 120+ 0-9g/day 2 1
+ 29 │ 35-44 120+ 10-19 0 3
+ 30 │ 35-44 120+ 20-29 2 2
+ 31 │ 45-54 0-39g/day 0-9g/day 1 45
+ 32 │ 45-54 0-39g/day 10-19 0 18
+ 33 │ 45-54 0-39g/day 20-29 0 10
+ 34 │ 45-54 0-39g/day 30+ 0 4
+ 35 │ 45-54 40-79 0-9g/day 6 32
+ 36 │ 45-54 40-79 10-19 4 17
+ 37 │ 45-54 40-79 20-29 5 10
+ 38 │ 45-54 40-79 30+ 5 2
+ 39 │ 45-54 80-119 0-9g/day 3 13
+ 40 │ 45-54 80-119 10-19 6 8
+ 41 │ 45-54 80-119 20-29 1 4
+ 42 │ 45-54 80-119 30+ 2 2
+ 43 │ 45-54 120+ 0-9g/day 4 0
+ 44 │ 45-54 120+ 10-19 3 1
+ 45 │ 45-54 120+ 20-29 2 1
+ 46 │ 45-54 120+ 30+ 4 0
+ 47 │ 55-64 0-39g/day 0-9g/day 2 47
+ 48 │ 55-64 0-39g/day 10-19 3 19
+ 49 │ 55-64 0-39g/day 20-29 3 9
+ 50 │ 55-64 0-39g/day 30+ 4 2
+ 51 │ 55-64 40-79 0-9g/day 9 31
+ 52 │ 55-64 40-79 10-19 6 15
+ 53 │ 55-64 40-79 20-29 4 13
+ 54 │ 55-64 40-79 30+ 3 3
+ 55 │ 55-64 80-119 0-9g/day 9 9
+ 56 │ 55-64 80-119 10-19 8 7
+ 57 │ 55-64 80-119 20-29 3 3
+ 58 │ 55-64 80-119 30+ 4 0
+ 59 │ 55-64 120+ 0-9g/day 5 5
+ 60 │ 55-64 120+ 10-19 6 1
+ 61 │ 55-64 120+ 20-29 2 1
+ 62 │ 55-64 120+ 30+ 5 1
+ 63 │ 65-74 0-39g/day 0-9g/day 5 43
+ 64 │ 65-74 0-39g/day 10-19 4 10
+ 65 │ 65-74 0-39g/day 20-29 2 5
+ 66 │ 65-74 0-39g/day 30+ 0 2
+ 67 │ 65-74 40-79 0-9g/day 17 17
+ 68 │ 65-74 40-79 10-19 3 7
+ 69 │ 65-74 40-79 20-29 5 4
+ 70 │ 65-74 80-119 0-9g/day 6 7
+ 71 │ 65-74 80-119 10-19 4 8
+ 72 │ 65-74 80-119 20-29 2 1
+ 73 │ 65-74 80-119 30+ 1 0
+ 74 │ 65-74 120+ 0-9g/day 3 1
+ 75 │ 65-74 120+ 10-19 1 1
+ 76 │ 65-74 120+ 20-29 1 0
+ 77 │ 65-74 120+ 30+ 1 0
+ 78 │ 75+ 0-39g/day 0-9g/day 1 17
+ 79 │ 75+ 0-39g/day 10-19 2 4
+ 80 │ 75+ 0-39g/day 30+ 1 2
+ 81 │ 75+ 40-79 0-9g/day 2 3
+ 82 │ 75+ 40-79 10-19 1 2
+ 83 │ 75+ 40-79 20-29 0 3
+ 84 │ 75+ 40-79 30+ 1 0
+ 85 │ 75+ 80-119 0-9g/day 1 0
+ 86 │ 75+ 80-119 10-19 1 0
+ 87 │ 75+ 120+ 0-9g/day 2 0
+ 88 │ 75+ 120+ 10-19 1 0
Now let's us instantiate our model with the data. But here I need to do some data wrangling to create the data matrix X
. Specifically, I need to convert all of the categorical variables to integer values, representing the ordinal values of both our independent and also dependent variables:
using CategoricalArrays
+
+DataFrames.transform!(
+ esoph,
+ :agegp =>
+ x -> categorical(
+ x; levels=["25-34", "35-44", "45-54", "55-64", "65-74", "75+"], ordered=true
+ ),
+ :alcgp =>
+ x -> categorical(x; levels=["0-39g/day", "40-79", "80-119", "120+"], ordered=true),
+ :tobgp =>
+ x -> categorical(x; levels=["0-9g/day", "10-19", "20-29", "30+"], ordered=true);
+ renamecols=false,
+)
+DataFrames.transform!(
+ esoph, [:agegp, :alcgp, :tobgp] .=> ByRow(levelcode); renamecols=false
+)
+
+X = Matrix(select(esoph, [:agegp, :alcgp]))
+y = esoph[:, :tobgp]
+model = ordreg(X, y);
And, finally, we will sample from the Turing model. We will be using the default NUTS()
sampler with 1_000
samples, with 4 Markov chains using multiple threads MCMCThreads()
:
chain = sample(model, NUTS(), MCMCThreads(), 1_000, 4)
+summarystats(chain)
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ cutpoints[1] -1.3806 0.6268 0.0119 2788.6718 2472.5534 1.0009 69.0744
+ cutpoints[2] -0.2021 0.6072 0.0113 2897.1099 2438.5542 1.0013 71.7604
+ cutpoints[3] 0.8493 0.6203 0.0114 2962.0272 2770.1618 1.0021 73.3684
+ β[1] -0.0696 0.1161 0.0021 3108.6746 2347.4018 1.0007 77.0008
+ β[2] -0.0658 0.1691 0.0030 3128.0391 2909.9287 1.0016 77.4804
+
We had no problem with the Markov chains as all the rhat
are well below 1.01
(or above 0.99
). Note that the coefficients are in log-odds scale. They are the natural log of the odds[1], and odds is defined as:
where is a probability. So log-odds is defined as:
So in order to get odds from a log-odds we must undo the log operation with a exponentiation. This translates to:
We can do this with a transformation in a DataFrame
constructed from a Chains
object:
using Chain
+
+@chain quantile(chain) begin
+ DataFrame
+ select(_, :parameters, names(_, r"%") .=> ByRow(exp); renamecols=false)
+end
5×6 DataFrame
+ Row │ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ │ Symbol Float64 Float64 Float64 Float64 Float64
+─────┼─────────────────────────────────────────────────────────────────
+ 1 │ cutpoints[1] 0.0738932 0.163178 0.252302 0.389008 0.845115
+ 2 │ cutpoints[2] 0.254298 0.534801 0.818166 1.2405 2.71577
+ 3 │ cutpoints[3] 0.704014 1.52676 2.32062 3.56916 8.04289
+ 4 │ β[1] 0.745092 0.863281 0.933127 1.00918 1.17439
+ 5 │ β[2] 0.677071 0.832166 0.935842 1.05535 1.29072
Our interpretation of odds is the same as in betting games. Anything below 1 signals a unlikely probability that will be . And anything above 1 increases the probability of being , while 1 itself is a neutral odds for being either or . Since I am not a gambling man, let's talk about probabilities. So I will create a function called logodds2prob()
that converts log-odds to probabilities:
function logodds2prob(logodds::Float64)
+ return exp(logodds) / (1 + exp(logodds))
+end
+
+@chain quantile(chain) begin
+ DataFrame
+ select(_, :parameters, names(_, r"%") .=> ByRow(logodds2prob); renamecols=false)
+end
5×6 DataFrame
+ Row │ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ │ Symbol Float64 Float64 Float64 Float64 Float64
+─────┼─────────────────────────────────────────────────────────────────
+ 1 │ cutpoints[1] 0.0688087 0.140286 0.201471 0.280062 0.458028
+ 2 │ cutpoints[2] 0.202741 0.34845 0.449995 0.553671 0.730877
+ 3 │ cutpoints[3] 0.41315 0.604236 0.698851 0.781141 0.889416
+ 4 │ β[1] 0.426964 0.463312 0.482703 0.502285 0.540101
+ 5 │ β[2] 0.403722 0.454198 0.483429 0.513466 0.563457
There you go, much better now. Let's analyze our results. The cutpoints
is the basal rate of the probability of our dependent variable having values less than a certain value. For example the cutpoint for having values less than 2
which its code represents the tobacco consumption of 10-19 g/day has a median value of 20%.
Now let's take a look at our coefficients All coefficients whose 95% credible intervals captures the value tells that the effect on the propensity of tobacco consumption is inconclusive. It is pretty much similar to a 95% credible interval that captures the 0 in the linear regression coefficients.
That's how you interpret 95% credible intervals from a quantile()
output of a ordinal regression Chains
object converted from log-odds to probability.
[1] | actually the logit function or the log-odds is the logarithm of the odds where is a probability. |
Breslow, N. E. & Day, N. E. (1980). Statistical Methods in Cancer Research. Volume 1: The Analysis of Case-Control Studies. IARC Lyon / Oxford University Press.
Leaving the universe of linear models, we start to venture into generalized linear models (GLM). The third of these is regression with count data (also called Poisson regression).
A regression with count data behaves exactly like a linear model: it makes a prediction simply by computing a weighted sum of the independent variables by the estimated coefficients , plus an intercept . However, instead of returning a continuous value , such as linear regression, it returns the natural log of .
We use regression with count data when our dependent variable is restricted to positive integers, i.e. . See the figure below for a graphical intuition of the exponential function:
using CairoMakie
+
+f, ax, l = lines(-6 .. 6, exp; axis=(xlabel=L"x", ylabel=L"e^x"))
As we can see, the exponential function is basically a mapping of any real number to a positive real number in the range between 0 and (non-inclusive):
That is, the exponential function is the ideal candidate for when we need to convert something continuous without restrictions to something continuous restricted to taking positive values only. That is why it is used when we need a model to have a positive-only dependent variable. This is the case of a dependent variable for count data.
Linear regression follows the following mathematical formulation:
- model parameters
- intercept
- independent variables coefficients
- total number of independent variables
Regression with count data would add the exponential function to the linear term:
which is the same as:
We can model regression with count data in two ways. The first option with a Poisson likelihood function and the second option with a negative binomial likelihood function.
With the Poisson likelihood we model a discrete and positive dependent variable by assuming that a given number of independent events will occur with a known constant average rate.
In a negative binomial likelihood, model a discrete and positive dependent variable by assuming that a given number of independent events will occur by asking a yes-no question for each with probability until success(es) is obtained. Note that it becomes identical to the Poisson likelihood when at the limit of . This makes the negative binomial a robust option to replace a Poisson likelihood to model phenomena with a overdispersion (excess expected variation in data). This occurs due to the Poisson likelihood making an assumption that the dependent variable has the same mean and variance, while in the negative binomial likelihood the mean and the variance do not need to be equal.
where:
– discrete and positive dependent variable.
– exponential.
– intercept.
– coefficient vector.
– data matrix.
As we can see, the linear predictor is the logarithm of the value of . So we need to apply the exponential function the values of the linear predictor:
The intercept and coefficients are only interpretable after exponentiation.
What remains is to specify the model parameters' prior distributions:
Prior Distribution of – Knowledge we possess regarding the model's intercept.
Prior Distribution of – Knowledge we possess regarding the model's independent variables' coefficients.
Our goal is to instantiate a regression with count data using the observed data ( and ) and find the posterior distribution of our model's parameters of interest ( and ). This means to find the full posterior distribution of:
Note that contrary to the linear regression, which used a Gaussian/normal likelihood function, we don't have an error parameter in our regression with count data. This is due to the Poisson not having a "scale" parameter such as the parameter in the Gaussian/normal distribution.
This is easily accomplished with Turing:
using Turing
+using LazyArrays
+using Random: seed!
+seed!(123)
+
+@model function poissonreg(X, y; predictors=size(X, 2))
+ #priors
+ α ~ Normal(0, 2.5)
+ β ~ filldist(TDist(3), predictors)
+
+ #likelihood
+ return y ~ arraydist(LazyArray(@~ LogPoisson.(α .+ X * β)))
+end;
Here I am specifying very weakly informative priors:
– This means a normal distribution centered on 0 with a standard deviation of 2.5. That prior should with ease cover all possible values of . Remember that the normal distribution has support over all the real number line .
– The predictors all have a prior distribution of a Student- distribution centered on 0 with variance 1 and degrees of freedom . That wide-tailed distribution will cover all possible values for our coefficients. Remember the Student- also has support over all the real number line . Also the filldist()
is a nice Turing's function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.
Turing's arraydist()
function wraps an array of distributions returning a new distribution sampling from the individual distributions. And the LazyArrays' LazyArray()
constructor wrap a lazy object that wraps a computation producing an array to an array. Last, but not least, the macro @~
creates a broadcast and is a nice short hand for the familiar dot .
broadcasting operator in Julia. This is an efficient way to tell Turing that our y
vector is distributed lazily as a LogPoisson
broadcasted to α
added to the product of the data matrix X
and β
coefficient vector. LogPoisson
is Turing's efficient distribution that already apply exponentiation to all the linear predictors.
where:
– discrete and positive dependent variable.
– exponential.
– dispersion.
– reciprocal dispersion.
– intercept.
– coefficient vector.
– data matrix.
Note that when we compare with the Poisson model, we have a new parameter that parameterizes the negative binomial likelihood. This parameter is the probability of successes of the negative binomial distribution and we generally use a Gamma distribution as prior so that the inverse of which is fulfills the function of a "reciprocal dispersion" parameter. Most of the time we use a weakly informative prior of the parameters shape and scale (Gelman et al., 2013; 2020). But you can also use as prior (McElreath, 2020).
Here is what a looks like:
using Distributions
+f, ax, l = lines(
+ Gamma(0.01, 0.01);
+ linewidth=2,
+ axis=(xlabel=L"\phi", ylabel="Density", limits=(0, 0.03, nothing, nothing)),
+)
In both likelihood options, what remains is to specify the model parameters' prior distributions:
Prior Distribution of – Knowledge we possess regarding the model's intercept.
Prior Distribution of – Knowledge we possess regarding the model's independent variables' coefficients.
Our goal is to instantiate a regression with count data using the observed data ( and ) and find the posterior distribution of our model's parameters of interest ( and ). This means to find the full posterior distribution of:
Note that contrary to the linear regression, which used a Gaussian/normal likelihood function, we don't have an error parameter in our regression with count data. This is due to neither the Poisson nor negative binomial distributions having a "scale" parameter such as the parameter in the Gaussian/normal distribution.
One last thing before we get into the details of the negative binomial distribution is to consider an alternative parameterization. Julia's Distributions.jl
and, consequently, Turing's parameterization of the negative binomial distribution follows the following the Wolfram reference:
where:
– number of failures before the th success in a sequence of independent Bernoulli trials
– number of successes
– probability of success in an individual Bernoulli trial
This is not ideal for most of the modeling situations that we would employ the negative binomial distribution. In particular, we want to have a parameterization that is more appropriate for count data. What we need is the familiar mean (or location) and variance (or scale) parameterization. If we look in Stan's documentation for the neg_binomial_2
function, we have the following two equations:
With a little bit of algebra, we can substitute the first equation of (11) into the right hand side of the second equation and get the following:
Then in (11) we have:
Hence, the resulting map is . I would like to point out that this implementation was done by Tor Fjelde in a COVID-19 model with the code available in GitHub. So we can use this parameterization in our negative binomial regression model. But first, we need to define an alternative negative binomial distribution function:
function NegativeBinomial2(μ, ϕ)
+ p = 1 / (1 + μ / ϕ)
+ p = p > 0 ? p : 1e-4 # numerical stability
+ r = ϕ
+
+ return NegativeBinomial(r, p)
+end
NegativeBinomial2 (generic function with 1 method)
+Now we create our Turing model with the alternative NegBinomial2
parameterization:
@model function negbinreg(X, y; predictors=size(X, 2))
+ #priors
+ α ~ Normal(0, 2.5)
+ β ~ filldist(TDist(3), predictors)
+ ϕ⁻ ~ Gamma(0.01, 0.01)
+ ϕ = 1 / ϕ⁻
+
+ #likelihood
+ return y ~ arraydist(LazyArray(@~ NegativeBinomial2.(exp.(α .+ X * β), ϕ)))
+end;
+Here I am also specifying very weakly informative priors:
+– This means a normal distribution centered on 0 with a standard deviation of 2.5. That prior should with ease cover all possible values of . Remember that the normal distribution has support over all the real number line .
+ + – The predictors all have a prior distribution of a Student- distribution centered on 0 with variance 1 and degrees of freedom . That wide-tailed distribution will cover all possible values for our coefficients. Remember the Student- also has support over all the real number line . Also the filldist()
is a nice Turing's function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.
– overdispersion parameter of the negative binomial distribution.
+ +Turing's arraydist()
function wraps an array of distributions returning a new distribution sampling from the individual distributions. And the LazyArrays' LazyArray()
constructor wrap a lazy object that wraps a computation producing an array to an array. Last, but not least, the macro @~
creates a broadcast and is a nice short hand for the familiar dot .
broadcasting operator in Julia. This is an efficient way to tell Turing that our y
vector is distributed lazily as a NegativeBinomial2
broadcasted to α
added to the product of the data matrix X
and β
coefficient vector. Note that NegativeBinomial2
does not apply exponentiation so we had to include the exp.()
broadcasted function to all the linear predictors.
For our example, I will use a famous dataset called roaches
(Gelman & Hill, 2007), which is data on the efficacy of a pest management system at reducing the number of roaches in urban apartments. It has 262 observations and the following variables:
y
– number of roaches caught.
roach1
– pretreatment number of roaches.
treatment
– binary/dummy (0 or 1) for treatment indicator.
senior
– binary/dummy (0 or 1) for only elderly residents in building.
exposure2
– number of days for which the roach traps were used
Ok let's read our data with CSV.jl
and output into a DataFrame
from DataFrames.jl
:
using DataFrames
+using CSV
+using HTTP
+
+url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/roaches.csv"
+roaches = CSV.read(HTTP.get(url).body, DataFrame)
+describe(roaches)
5×7 DataFrame
+ Row │ variable mean min median max nmissing eltype
+ │ Symbol Float64 Real Float64 Real Int64 DataType
+─────┼────────────────────────────────────────────────────────────────────
+ 1 │ y 25.6489 0 3.0 357 0 Int64
+ 2 │ roach1 42.1935 0.0 7.0 450.0 0 Float64
+ 3 │ treatment 0.603053 0 1.0 1 0 Int64
+ 4 │ senior 0.305344 0 0.0 1 0 Int64
+ 5 │ exposure2 1.02105 0.2 1.0 4.28571 0 Float64
+As you can see from the describe()
output the average number of roaches caught by the pest management system is around 26 roaches. The average number of roaches pretreatment is around 42 roaches (oh boy...). 30% of the buildings has only elderly residents and 60% of the buildings received a treatment by the pest management system. Also note that the traps were set in general for only 1 day and it ranges from 0.2 days (almost 5 hours) to 4.3 days (which is approximate 4 days and 7 hours).
Let's first run the Poisson regression. First, we instantiate our model with the data:
+X = Matrix(select(roaches, Not(:y)))
+y = roaches[:, :y]
+model_poisson = poissonreg(X, y);
+And, finally, we will sample from the Turing model. We will be using the default NUTS()
sampler with 1_000
samples, with 4 Markov chains using multiple threads MCMCThreads()
:
chain_poisson = sample(model_poisson, NUTS(), MCMCThreads(), 1_000, 4)
+summarystats(chain_poisson)
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ α 2.9867 0.1526 0.0257 307.5895 81.9752 1.0121 9.3558
+ β[1] 0.0065 0.0002 0.0000 3084.6402 2620.6876 1.0023 93.8237
+ β[2] -0.5007 0.1066 0.0179 277.4972 68.4987 1.0220 8.4405
+ β[3] -0.3620 0.1247 0.0215 319.4964 79.5812 1.0136 9.7179
+ β[4] 0.1265 0.2666 0.0452 310.0447 75.1405 1.0124 9.4304
+
+We had no problem with the Markov chains as all the rhat
are well below 1.01
(or above 0.99
). Note that the coefficients are in log scale. So we need to apply the exponential function to them. We can do this with a transformation in a DataFrame
constructed from a Chains
object:
using Chain
+
+@chain quantile(chain_poisson) begin
+ DataFrame
+ select(_, :parameters, names(_, r"%") .=> ByRow(exp); renamecols=false)
+end
5×6 DataFrame
+ Row │ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ │ Symbol Float64 Float64 Float64 Float64 Float64
+─────┼───────────────────────────────────────────────────────────────────
+ 1 │ α 17.8944 18.8715 19.4401 20.0227 21.6618
+ 2 │ β[1] 1.00636 1.00648 1.00655 1.00661 1.00672
+ 3 │ β[2] 0.569218 0.587878 0.598309 0.608326 0.642261
+ 4 │ β[3] 0.640725 0.669002 0.685442 0.701448 0.74567
+ 5 │ β[4] 1.07174 1.14706 1.17526 1.20443 1.26015
+Let's analyze our results. The intercept α
is the basal number of roaches caught y
and has a median value of 19.4 roaches caught. The remaining 95% credible intervals for the β
s can be interpreted as follows:
β[1]
– first column of X
, roach1
, has 95% credible interval 1.01 to 1.01. This means that each increase in one unit of roach1
is related to an increase of 1% more roaches caught.
β[2]
– second column of X
, treatment
, has 95% credible interval 0.57 to 0.63. This means that if an apartment was treated with the pest management system then we expect an decrease of around 40% roaches caught.
β[3]
– third column of X
, senior
, has a 95% credible interval from 0.64 to 0.73. This means that if an apartment building has only elderly residents then we expect an decrease of around 30% roaches caught.
β[4]
– fourth column of X
, exposure2
, has a 95% credible interval from 1.09 to 1.26. Each increase in one day for the exposure of traps in an apartment we expect an increase of between 9% to 26% roaches caught.
That's how you interpret 95% credible intervals from a quantile()
output of a regression with count data Chains
object converted from a log scale.
Let's now run the negative binomial regression.
+model_negbin = negbinreg(X, y);
+We will also default NUTS()
sampler with 1_000
samples, with 4 Markov chains using multiple threads MCMCThreads()
:
chain_negbin = sample(model_negbin, NUTS(), MCMCThreads(), 1_000, 4)
+summarystats(chain_negbin)
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ α 2.3935 0.3639 0.0077 2289.8720 1937.3708 1.0026 50.6474
+ β[1] 0.0127 0.0015 0.0000 4640.3636 2667.0438 1.0014 102.6357
+ β[2] -0.7327 0.1499 0.0027 3027.2982 2637.8716 1.0028 66.9578
+ β[3] -0.3199 0.1553 0.0027 3278.7372 2381.6903 1.0018 72.5192
+ β[4] 0.4266 0.3365 0.0073 2198.4749 1778.9698 1.0027 48.6259
+ ϕ⁻ 1.4058 0.0800 0.0014 3056.0376 2581.1948 1.0003 67.5935
+
+We had no problem with the Markov chains as all the rhat
are well below 1.01
(or above 0.99
). Note that the coefficients are in log scale. So we need to also apply the exponential function as we did before.
@chain quantile(chain_negbin) begin
+ DataFrame
+ select(_, :parameters, names(_, r"%") .=> ByRow(exp); renamecols=false)
+end
6×6 DataFrame
+ Row │ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ │ Symbol Float64 Float64 Float64 Float64 Float64
+─────┼─────────────────────────────────────────────────────────────────
+ 1 │ α 5.28389 8.5925 11.0566 13.9834 21.9849
+ 2 │ β[1] 1.00987 1.01173 1.01277 1.01381 1.016
+ 3 │ β[2] 0.358343 0.433873 0.481535 0.53245 0.645215
+ 4 │ β[3] 0.540157 0.650629 0.724575 0.806835 0.984437
+ 5 │ β[4] 0.829193 1.21638 1.50724 1.92373 3.05504
+ 6 │ ϕ⁻ 3.52309 3.85985 4.07161 4.30516 4.80385
+Our results show much more uncertainty in the coefficients than in the Poisson regression. So it might be best to use the Poisson regression in the roaches
dataset.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge university press.
+Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC.
+Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press.
+McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC press.
+ +Leaving the universe of linear models, we start to venture into generalized linear models (GLM). The fourth of these is robust regression.
A regression with count data behaves exactly like a linear model: it makes a prediction simply by computing a weighted sum of the independent variables by the estimated coefficients , plus an intercept . However, instead of using a Gaussian/normal likelihood function, it uses a Student- likelihood function.
We use robust regression in the same context as linear regression: our dependent variable is continuous. But robust regression allows us to better handle outliers in our data.
Before we dive in the nuts and bolts of robust regression let's remember the Gaussian/normal curve that has a bell shape (figure below). It does not have a "fat tail" (or sometimes known as "long tail"). In other words, the observations are not far from the mean. When we use this distribution as a likelihood function in the Bayesian models, we force that all estimates must be conditioned into a normal distribution of the dependent variable. If there are many outliers in the data (observations quite far from the mean), this causes the estimates of the independent variables' coefficients to be unstable. This is because the normal distribution cannot contemplate observations that are very spread away from the mean without having to change the mean's position (or location). In other words, the bell curve needs to "shift" to be able to contemplate outliers, thus making the inference unstable.
using CairoMakie
+using Distributions
+
+f, ax, l = lines(-4 .. 4, Normal(0, 1); linewidth=5, axis=(; xlabel=L"x", ylabel="Density"))
So we need a more "malleable" distribution as a likelihood function. A distribution that is more robust to outliers. A distribution similar to Normal but that has "fatter" (or "longer") tails to precisely contemplate observations very far from the average without having to "shift" the mean's position (or location). For that we have the Student- distribution. See the figure below to remember its shape.
f, ax, l = lines(-4 .. 4, TDist(2); linewidth=5, axis=(xlabel=L"x", ylabel="Density"))
+
Take a look at the tails in the comparison below:
+f, ax, l = lines(
+ -4 .. 4,
+ Normal(0, 1);
+ linewidth=5,
+ label="Normal",
+ axis=(; xlabel=L"x", ylabel="Density"),
+)
+lines!(ax, -4 .. 4, TDist(2); linewidth=5, label="Student")
+axislegend(ax)
+
The standard approach for modeling a continuous dependent variable is with a Gaussian/normal likelihood function. This implies that the model error, of the Gaussian/normal likelihood function is distributed as a normal distribution:
+ +From a Bayesian point of view, there is nothing special about Gaussian/normal likelihood function It is just a probabilistic distribution specified in a model. We can make the model more robust by using a Student- distribution as a likelihood function. This implies that the model error, does not follow a normal distribution, instead it follows a Student- distribution:
+ +Here are some differences:
+Student- likelihood function requires one additional parameter: , degrees of freedom. These control how "fat" (or "long") the tails will be. Values of forces the Student- distribution to practically become a normal distribution.
+ +There is nothing special about . It is just another parameter for the model to estimate. So just specify a prior on it. In this case, I am using a Log-Normal distribution with mean 2 and standard deviation 1.
+ +Note that there is also nothing special about the priors of the coefficients or the intercept . We could very well also specify other distributions as priors or even make the model even more robust to outliers by specifying priors as Student- distributions with degrees of freedom :
+ +Our goal is to instantiate a regression with count data using the observed data ( and ) and find the posterior distribution of our model's parameters of interest ( and ). This means to find the full posterior distribution of:
+ +This is easily accomplished with Turing:
+using Turing
+using Statistics: mean, std
+using StatsBase: mad
+using Random: seed!
+seed!(123)
+
+@model function robustreg(X, y; predictors=size(X, 2))
+ #priors
+ α ~ LocationScale(median(y), 2.5 * mad(y), TDist(3))
+ β ~ filldist(TDist(3), predictors)
+ σ ~ Exponential(1)
+ ν ~ LogNormal(2, 1)
+
+ #likelihood
+ return y ~ arraydist(LocationScale.(α .+ X * β, σ, TDist.(ν)))
+end;
+Here I am specifying very weakly informative priors:
+ – This means a Student- distribution with degrees of freedom ν = 3
centered on y
's median with variance 2.5 times the mean absolute deviation (MAD) of y
. That prior should with ease cover all possible values of . Remember that the Student- distribution has support over all the real number line . The LocationScale()
Turing's function adds location and scale parameters to distributions that doesn't have it. This is the case with the TDist()
distribution which only takes the ν
degrees of of freedom as parameter.
– The predictors all have a prior distribution of a Student- distribution with degrees of freedom ν = 3
centered on 0 with variance 1 and degrees of freedom . That wide-tailed distribution will cover all possible values for our coefficients. Remember the Student- also has support over all the real number line . Also the filldist()
is a nice Turing's function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.
– A wide-tailed-positive-only distribution perfectly suited for our model's error.
+ +Turing's arraydist()
function wraps an array of distributions returning a new distribution sampling from the individual distributions. It creates a broadcast and is a nice short hand for the familiar dot .
broadcasting operator in Julia. By specifying that y
vector is "broadcasted distributed" as a LocationScale
broadcasted to mean (location parameter) α
added to the product of the data matrix X
and β
coefficient vector along with a variance (scale parameter) σ
. To conclude, we place inside the LocationScale
a broadcasted TDist
with ν
degrees of freedom parameter.
For our example, I will use a famous dataset called duncan
(Duncan, 1961), which is data from occupation's prestige filled with outliers. It has 45 observations and the following variables:
profession
: name of the profession.
type
: type of occupation. A qualitative variable:
prof
- professional or management.
wc
- white-collar.
bc
- blue-collar.
income
: percentage of people in the occupation earning over U$ 3,500 per year in 1950 (more or less U$ 36,000 in 2017).
education
: percentage of people in the occupation who had a high school diploma in 1949 (which, being cynical, we can say is somewhat equivalent to a PhD degree in 2017).
prestige
: percentage of respondents in the survey who classified their occupation as at least "good" with respect to prestige.
Ok let's read our data with CSV.jl
and output into a DataFrame
from DataFrames.jl
:
using DataFrames
+using CSV
+using HTTP
+
+url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/duncan.csv"
+duncan = CSV.read(HTTP.get(url).body, DataFrame)
+describe(duncan)
5×7 DataFrame
+ Row │ variable mean min median max nmissing eltype
+ │ Symbol Union… Any Union… Any Int64 DataType
+─────┼──────────────────────────────────────────────────────────────────────────────
+ 1 │ profession RR.engineer welfare.worker 0 String31
+ 2 │ type bc wc 0 String7
+ 3 │ income 41.8667 7 42.0 81 0 Int64
+ 4 │ education 52.5556 7 45.0 100 0 Int64
+ 5 │ prestige 47.6889 3 41.0 97 0 Int64
+As you can see from the describe()
output the average occupation's percentage of respondents who classified their occupation as at least "good" with respect to prestige is around 41%. But prestige
variable is very dispersed and actually has a bimodal distribution:
f = Figure()
+plt = data(duncan) * mapping(:prestige) * AlgebraOfGraphics.density()
+draw!(f[1, 1], plt)
UndefVarError: `data` not defined
+
+// Image matching '/assets/pages/10_robust_reg/code/prestige_density' not found. //
prestige
Besides that, the mean prestige
per type
shows us where the source of variation might come from:
gdf = groupby(duncan, :type)
+f = Figure()
+plt =
+ data(combine(gdf, :prestige => mean)) * mapping(:type, :prestige_mean) * visual(BarPlot)
+draw!(f[1, 1], plt)
UndefVarError: `data` not defined
+
+// Image matching '/assets/pages/10_robust_reg/code/prestige_per_type' not found. //
prestige
per type
Now let's us instantiate our model with the data:
+X = Matrix(select(duncan, [:income, :education]))
+y = duncan[:, :prestige]
+model = robustreg(X, y);
+And, finally, we will sample from the Turing model. We will be using the default NUTS()
sampler with 1_000
samples, with 4 Markov chains using multiple threads MCMCThreads()
:
chain = sample(model, NUTS(), MCMCThreads(), 1_000, 4)
+summarystats(chain)
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ α -7.5790 3.0116 0.0628 2309.7654 1978.4710 1.0042 70.5574
+ β[1] 0.7719 0.1024 0.0025 1748.2750 1843.0804 1.0028 53.4053
+ β[2] 0.4464 0.0836 0.0020 1702.5880 1820.3078 1.0052 52.0097
+ σ 7.1578 1.7728 0.0477 1225.3095 826.2654 1.0027 37.4300
+ ν 2.8148 2.4661 0.0635 1452.4245 1333.5578 1.0009 44.3678
+
+We had no problem with the Markov chains as all the rhat
are well below 1.01
(or above 0.99
). Also note that all degrees of freedom parameters, the ν
stuff, have been estimated with mean around 3 to 5, which indeed signals that our model needed fat tails to make a robust inference. Our model has an error σ
of around 7. So it estimates occupation's prestige ±7. The intercept α
is the basal occupation's prestige value. So each occupation has -7±7 prestige before we add the coefficients multiplied by the occupations' independent variables. And from our coefficients , we can see that the quantile()
tells us the uncertainty around their estimates:
quantile(chain)
Quantiles
+ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ Symbol Float64 Float64 Float64 Float64 Float64
+
+ α -13.3670 -9.6056 -7.5776 -5.5600 -1.7272
+ β[1] 0.5567 0.7114 0.7743 0.8380 0.9659
+ β[2] 0.2796 0.3936 0.4460 0.4993 0.6170
+ σ 4.1120 5.8269 7.0177 8.3067 10.9849
+ ν 1.0269 1.6074 2.1733 3.1841 8.1579
+
+β[1]
– first column of X
, income
, has 95% credible interval from 0.55 to 0.96. This means that an increase of U$ 1,000 in occupations' annual income is associated with an increase in roughly 0.5 to 1.0 in occupation's prestige.
β[2]
– second column of X
, education
, has a 95% credible interval from 0.29 to 0.61. So we expect that an increase of 1% in occupations' percentage of respondents who had a high school diploma increases occupations' prestige roughly 0.3 to 0.6.
That's how you interpret 95% credible intervals from a quantile()
output of a robust regression Chains
object.
Duncan, O. D. (1961). A socioeconomic index for all occupations. Class: Critical Concepts, 1, 388–426.
+ +Bayesian hierarchical models (also called multilevel models) are a statistical model written at multiple levels (hierarchical form) that estimates the parameters of the posterior distribution using the Bayesian approach. The sub-models combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and to account for all the uncertainty that is present. The result of this integration is the posterior distribution, also known as an updated probability estimate, as additional evidence of the likelihood function is integrated together with the prior distribution of the parameters.
Hierarchical modeling is used when information is available at several different levels of observation units. The hierarchical form of analysis and organization helps to understand multiparameter problems and also plays an important role in the development of computational strategies.
Hierarchical models are mathematical statements that involve several parameters, so that the estimates of some parameters depend significantly on the values of other parameters. The figure below shows a hierarchical model in which there is a hyperparameter that parameterizes the parameters that are finally used to infer the posterior density of some variable of interest .
Multilevel models are particularly suitable for research projects where participant data is organized at more than one level, i.e. nested data. Units of analysis are usually individuals (at a lower level) that are nested in contextual/aggregate units (at a higher level). An example is when we are measuring the performance of individuals and we have additional information about belonging to different groups such as sex, age group, hierarchical level, educational level or housing status.
There is a main assumption that cannot be violated in multilevel models which is exchangeability (de Finetti, 1974; Nau, 2001). Yes, this is the same assumption that we discussed in 2. What is Bayesian Statistics?. This assumption assumes that groups are exchangeable. The figure below shows a graphical representation of the exchangeability. The groups shown as "cups" that contain observations shown as "balls". If in the model's inferences, this assumption is violated, then multilevel models are not appropriate. This means that, since there is no theoretical justification to support exchangeability, the inferences of the multilevel model are not robust and the model can suffer from several pathologies and should not be used for any scientific or applied analysis.
As the priors of the parameters are sampled from another prior of the hyperparameter (upper-level's parameter), which are called hyperpriors. This makes one group's estimates help the model to better estimate the other groups by providing more robust and stable estimates.
We call the global parameters as population effects (or population-level effects, also sometimes called fixed effects) and the parameters of each group as group effects (or group-level effects, also sometimes called random effects). That is why multilevel models are also known as mixed models in which we have both fixed effects and random effects.
Multilevel models generally fall into three approaches:
Random-intercept model: each group receives a different intercept in addition to the global intercept.
Random-slope model: each group receives different coefficients for each (or a subset of) independent variable(s) in addition to a global intercept.
Random-intercept-slope model: each group receives both a different intercept and different coefficients for each independent variable in addition to a global intercept.
Correlated-Random-intercept-slope model: each group receives both a different intercept and different coefficients for each independent variable in addition to a global intercept; while also taking into account the correlation/covariance amongst intercept/coefficients.
The first approach is the random-intercept model in which we specify a different intercept for each group, in addition to the global intercept. These group-level intercepts are sampled from a hyperprior.
To illustrate a multilevel model, I will use the linear regression example with a Gaussian/normal likelihood function. Mathematically a Bayesian multilevel random-slope linear regression model is:
The priors on the global intercept , global coefficients and error , along with the Gaussian/normal likelihood on are the same as in the linear regression model. But now we have new parameters. The first are the group intercepts prior that denotes that every group has its own intercept sampled from a normal distribution centered on 0 with a standard deviation . This group intercept is added to the linear predictor inside the Gaussian/normal likelihood function. The group intercepts' standard deviation have a hyperprior (being a prior of a prior) which is sampled from a positive-constrained Cauchy distribution (a special case of the Student- distribution with degrees of freedom ) with mean 0 and standard deviation . This makes the group-level intercept's dispersions being sampled from the same parameter which allows the model to use information from one group intercept to infer robust information regarding another group's intercept dispersion and so on.
This is easily accomplished with Turing:
using Turing
+using LinearAlgebra
+using Statistics: mean, std
+using Random: seed!
+seed!(123)
+
+@model function varying_intercept(
+ X, idx, y; n_gr=length(unique(idx)), predictors=size(X, 2)
+)
+ #priors
+ α ~ Normal(mean(y), 2.5 * std(y)) # population-level intercept
+ β ~ filldist(Normal(0, 2), predictors) # population-level coefficients
+ σ ~ Exponential(std(y)) # residual SD
+ #prior for variance of random intercepts
+ #usually requires thoughtful specification
+ τ ~ truncated(Cauchy(0, 2); lower=0) # group-level SDs intercepts
+ αⱼ ~ filldist(Normal(0, τ), n_gr) # group-level intercepts
+
+ #likelihood
+ ŷ = α .+ X * β .+ αⱼ[idx]
+ return y ~ MvNormal(ŷ, σ^2 * I)
+end;
The second approach is the random-slope model in which we specify a different slope for each group, in addition to the global intercept. These group-level slopes are sampled from a hyperprior.
To illustrate a multilevel model, I will use the linear regression example with a Gaussian/normal likelihood function. Mathematically a Bayesian multilevel random-slope linear regression model is:
Here we have a similar situation from before with the same hyperprior, but now it is a hyperprior for the the group coefficients' standard deviation prior: . This makes the group-level coefficients's dispersions being sampled from the same parameter which allows the model to use information from one group coefficients to infer robust information regarding another group's coefficients dispersion and so on.
In Turing we can accomplish this as:
@model function varying_slope(X, idx, y; n_gr=length(unique(idx)), predictors=size(X, 2))
+ #priors
+ α ~ Normal(mean(y), 2.5 * std(y)) # population-level intercept
+ σ ~ Exponential(std(y)) # residual SD
+ #prior for variance of random slopes
+ #usually requires thoughtful specification
+ τ ~ filldist(truncated(Cauchy(0, 2); lower=0), n_gr) # group-level slopes SDs
+ βⱼ ~ filldist(Normal(0, 1), predictors, n_gr) # group-level standard normal slopes
+
+ #likelihood
+ ŷ = α .+ X * βⱼ * τ
+ return y ~ MvNormal(ŷ, σ^2 * I)
+end;
The third approach is the random-intercept-slope model in which we specify a different intercept and slope for each group, in addition to the global intercept. These group-level intercepts and slopes are sampled from hyperpriors.
To illustrate a multilevel model, I will use the linear regression example with a Gaussian/normal likelihood function. Mathematically a Bayesian multilevel random-intercept-slope linear regression model is:
Here we have a similar situation from before with the same hyperpriors, but now we fused both random-intercept and random-slope together.
In Turing we can accomplish this as:
@model function varying_intercept_slope(
+ X, idx, y; n_gr=length(unique(idx)), predictors=size(X, 2)
+)
+ #priors
+ α ~ Normal(mean(y), 2.5 * std(y)) # population-level intercept
+ σ ~ Exponential(std(y)) # residual SD
+ #prior for variance of random intercepts and slopes
+ #usually requires thoughtful specification
+ τₐ ~ truncated(Cauchy(0, 2); lower=0) # group-level SDs intercepts
+ τᵦ ~ filldist(truncated(Cauchy(0, 2); lower=0), n_gr) # group-level slopes SDs
+ αⱼ ~ filldist(Normal(0, τₐ), n_gr) # group-level intercepts
+ βⱼ ~ filldist(Normal(0, 1), predictors, n_gr) # group-level standard normal slopes
+
+ #likelihood
+ ŷ = α .+ αⱼ[idx] .+ X * βⱼ * τᵦ
+ return y ~ MvNormal(ŷ, σ^2 * I)
+end;
In all of the models so far, we are using the MvNormal
construction where we specify both a vector of means (first positional argument) and a covariance matrix (second positional argument). Regarding the covariance matrix σ^2 * I
, it uses the model's errors σ
, here parameterized as a standard deviation, squares it to produce a variance parameterization, and multiplies by I
, which is Julia's LinearAlgebra
standard module implementation to represent an identity matrix of any size.
The third approach is the correlated-random-intercept-slope model in which we specify a different intercept and slope for each group, in addition to the global intercept. These group-level intercepts and slopes are sampled from hyperpriors. Finally, we also model the correlation/covariance amongst the intercepts and slopes. We assume that they are not independent anymore, rather they are correlated.
In order to model the correlation between parameters, we need a prior on correlation/covariance matrices. There are several ways to define priors on covariance matrices. In some ancient time, "Bayesian elders" used distributions such as the Wishart distribution and the Inverse Wishart distribution. However, priors on covariance matrices are not that intuitive, and can be hard to translate into real-world scenarios. That's why I much prefer to construct my covariances matrices from a correlation matrix and a vector of standard deviations. Remember that every covariance matrix can be decomposed into:
where is a correlation matrix with s in the diagonal and the off-diagonal elements between -1 e 1 . is a vector composed of the variables' standard deviation from (is is the 's diagonal).
These are much more intuitive and easy to work with, specially if you are not working together with some "Bayesian elders". This approach leaves us with the need to specify a prior on correlation matrices. Luckily, we have the LKJ distribution: distribution for positive definite matrices with unit diagonals (exactly what a correlation matrix is).
To illustrate a multilevel model, I will use the linear regression example with a Gaussian/normal likelihood function. Mathematically a Bayesian multilevel correlated-random-intercept-slope linear regression model is:
We don't have any s. If we want a varying intercept, we just insert a column filled with s in the data matrix . Mathematically, this makes the column behave like an "identity" variable (because the number in the multiplication operation is the identity element. It maps keeping the value of intact) and, consequently, we can interpret the column's coefficient as the model's intercept.
In Turing we can accomplish this as:
using PDMats
+
+@model function correlated_varying_intercept_slope(
+ X, idx, y; n_gr=length(unique(idx)), predictors=size(X, 2)
+)
+ #priors
+ Ω ~ LKJCholesky(predictors, 2.0) # Cholesky decomposition correlation matrix
+ σ ~ Exponential(std(y))
+
+ #prior for variance of random correlated intercepts and slopes
+ #usually requires thoughtful specification
+ τ ~ filldist(truncated(Cauchy(0, 2); lower=0), predictors) # group-level SDs
+ γ ~ filldist(Normal(0, 5), predictors, n_gr) # matrix of group coefficients
+
+ #reconstruct Σ from Ω and τ
+ Σ_L = Diagonal(τ) * Ω.L
+ Σ = PDMat(Cholesky(Σ_L + 1e-6 * I)) # numerical instability
+ #reconstruct β from Σ and γ
+ β = Σ * γ
+
+ #likelihood
+ return y ~ arraydist([Normal(X[i, :] ⋅ β[:, idx[i]], σ) for i in 1:length(y)])
+end;
In the correlated_varying_intercept_slope
model, we are using the efficient Cholesky decomposition version of the LKJ prior with LKJCholesky
. We put priors on all group coefficients γ
and reconstruct both the covariance matrix Σ
and the coefficient matrix β
.
For our example, I will use a famous dataset called cheese
(Boatwright, McCulloch & Rossi, 1999), which is data from cheese ratings. A group of 10 rural and 10 urban raters rated 4 types of different cheeses (A, B, C and D) in two samples. So we have observations and 4 variables:
cheese
: type of cheese from A
to D
rater
: id of the rater from 1
to 10
background
: type of rater, either rural
or urban
y
: rating of the cheese
OK, let's read our data with CSV.jl
and output into a DataFrame
from DataFrames.jl
:
using DataFrames
+using CSV
+using HTTP
+
+url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/cheese.csv"
+cheese = CSV.read(HTTP.get(url).body, DataFrame)
+describe(cheese)
4×7 DataFrame
+ Row │ variable mean min median max nmissing eltype
+ │ Symbol Union… Any Union… Any Int64 DataType
+─────┼───────────────────────────────────────────────────────────────
+ 1 │ cheese A D 0 String1
+ 2 │ rater 5.5 1 5.5 10 0 Int64
+ 3 │ background rural urban 0 String7
+ 4 │ y 70.8438 33 71.5 91 0 Int64
As you can see from the describe()
output, the mean cheese ratings is around 70 ranging from 33 to 91.
In order to prepare the data for Turing, I will convert the String
s in variables cheese
and background
to Int
s. Regarding cheese
, I will create 4 dummy variables one for each cheese type; and background
will be converted to integer data taking two values: one for each background type. My intent is to model background
as a group both for intercept and coefficients. Take a look at how the data will look like for the first 5 observations:
for c in unique(cheese[:, :cheese])
+ cheese[:, "cheese_$c"] = ifelse.(cheese[:, :cheese] .== c, 1, 0)
+end
+
+cheese[:, :background_int] = map(cheese[:, :background]) do b
+ if b == "rural"
+ 1
+ elseif b == "urban"
+ 2
+ else
+ missing
+ end
+end
+
+first(cheese, 5)
5×9 DataFrame
+ Row │ cheese rater background y cheese_A cheese_B cheese_C cheese_D background_int
+ │ String1 Int64 String7 Int64 Int64 Int64 Int64 Int64 Int64
+─────┼───────────────────────────────────────────────────────────────────────────────────────────
+ 1 │ A 1 rural 67 1 0 0 0 1
+ 2 │ A 1 rural 66 1 0 0 0 1
+ 3 │ B 1 rural 51 0 1 0 0 1
+ 4 │ B 1 rural 53 0 1 0 0 1
+ 5 │ C 1 rural 75 0 0 1 0 1
Now let's us instantiate our model with the data. Here, I will specify a vector of Int
s named idx
to represent the different observations' group memberships. This will be used by Turing when we index a parameter with the idx
, e.g. αⱼ[idx]
.
X = Matrix(select(cheese, Between(:cheese_A, :cheese_D)));
+y = cheese[:, :y];
+idx = cheese[:, :background_int];
The first model is the varying_intercept
:
model_intercept = varying_intercept(X, idx, y)
+chain_intercept = sample(model_intercept, NUTS(), MCMCThreads(), 1_000, 4)
+println(DataFrame(summarystats(chain_intercept)))
9×8 DataFrame
+ Row │ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ │ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+─────┼───────────────────────────────────────────────────────────────────────────────────────
+ 1 │ α 70.7489 5.02573 0.176587 1049.21 944.855 1.00255 19.8948
+ 2 │ β[1] 2.91501 1.30963 0.0257346 2596.92 2763.9 1.00097 49.2419
+ 3 │ β[2] -10.632 1.34056 0.028425 2224.99 2724.4 1.00189 42.1895
+ 4 │ β[3] 6.52144 1.3159 0.0265587 2450.36 2667.92 1.00052 46.463
+ 5 │ β[4] 1.09178 1.31719 0.0264376 2475.97 2627.54 1.00117 46.9485
+ 6 │ σ 7.3913 0.449209 0.00807096 3127.77 2530.83 1.00102 59.3078
+ 7 │ τ 6.11784 5.54993 0.152948 1863.81 1849.7 1.00172 35.341
+ 8 │ αⱼ[1] -3.39662 4.92084 0.174051 1057.29 929.725 1.00312 20.0479
+ 9 │ αⱼ[2] 3.63474 4.95267 0.175295 1054.17 920.198 1.00331 19.9889
+
Here we can see that the model has a population-level intercept α
along with population-level coefficients β
s for each cheese
dummy variable. But notice that we have also group-level intercepts for each of the groups αⱼ
s. Specifically, αⱼ[1]
are the rural raters and αⱼ[2]
are the urban raters.
Now let's go to the second model, varying_slope
:
model_slope = varying_slope(X, idx, y)
+chain_slope = sample(model_slope, NUTS(), MCMCThreads(), 1_000, 4)
+println(DataFrame(summarystats(chain_slope)))
12×8 DataFrame
+ Row │ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ │ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+─────┼────────────────────────────────────────────────────────────────────────────────────────
+ 1 │ α 71.6225 6.12929 0.80038 70.2758 24.8834 1.0542 1.3079
+ 2 │ σ 7.98871 0.435528 0.0142789 879.095 1907.61 1.00481 16.3607
+ 3 │ τ[1] 5.95148 5.05219 0.283585 297.117 1630.98 1.01456 5.52961
+ 4 │ τ[2] 7.13182 7.54656 1.22127 120.798 26.97 1.03962 2.24817
+ 5 │ βⱼ[1,1] 0.199936 0.839602 0.0512732 282.448 223.989 1.01652 5.25661
+ 6 │ βⱼ[2,1] -0.933251 1.06721 0.0676919 231.737 252.835 1.01661 4.31283
+ 7 │ βⱼ[3,1] 0.522532 0.883634 0.035598 636.854 1246.34 1.02562 11.8524
+ 8 │ βⱼ[4,1] 0.0664401 0.797414 0.0514516 278.881 154.864 1.02163 5.19022
+ 9 │ βⱼ[1,2] 0.290895 0.843615 0.0732187 135.896 84.0176 1.03096 2.52915
+ 10 │ βⱼ[2,2] -0.853134 1.05917 0.0568238 415.139 467.237 1.01995 7.7261
+ 11 │ βⱼ[3,2] 0.563184 0.876935 0.0375404 474.159 818.21 1.01366 8.82451
+ 12 │ βⱼ[4,2] 0.124051 0.787212 0.0525178 224.833 143.946 1.018 4.18434
+
Here we can see that the model has still a population-level intercept α
. But now our population-level coefficients β
s are replaced by group-level coefficients βⱼ
s along with their standard deviation τᵦ
s. Specifically βⱼ
's first index denotes the 4 dummy cheese
variables' and the second index are the group membership. So, for example βⱼ[1,1]
is the coefficient for cheese_A
and rural raters (group 1).
Now let's go to the third model, varying_intercept_slope
:
model_intercept_slope = varying_intercept_slope(X, idx, y)
+chain_intercept_slope = sample(model_intercept_slope, NUTS(), MCMCThreads(), 1_000, 4)
+println(DataFrame(summarystats(chain_intercept_slope)))
15×8 DataFrame
+ Row │ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ │ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+─────┼───────────────────────────────────────────────────────────────────────────────────────
+ 1 │ α 70.8973 7.06973 0.226273 1080.56 988.602 1.00354 13.1307
+ 2 │ σ 7.07716 0.403315 0.00763609 2791.06 2389.42 1.00236 33.9161
+ 3 │ τₐ 6.19068 5.86481 0.187349 1286.35 1007.7 1.00342 15.6313
+ 4 │ τᵦ[1] 6.19739 5.39243 0.202217 646.032 1159.5 1.00601 7.85039
+ 5 │ τᵦ[2] 6.24244 5.30258 0.215136 598.407 1527.72 1.00463 7.27167
+ 6 │ αⱼ[1] -3.62952 5.05254 0.176822 1253.95 761.597 1.00336 15.2377
+ 7 │ αⱼ[2] 3.42934 5.05582 0.176181 1271.82 839.894 1.00348 15.4548
+ 8 │ βⱼ[1,1] 0.273658 0.779529 0.0188846 1727.94 1703.37 1.00193 20.9975
+ 9 │ βⱼ[2,1] -0.88519 1.07495 0.0402493 794.926 375.533 1.00615 9.6597
+ 10 │ βⱼ[3,1] 0.540418 0.897065 0.0325307 849.719 418.537 1.00343 10.3255
+ 11 │ βⱼ[4,1] 0.119311 0.761924 0.0183527 1730.92 1000.11 1.00092 21.0336
+ 12 │ βⱼ[1,2] 0.230796 0.819673 0.0210441 1571.83 1681.27 1.00307 19.1004
+ 13 │ βⱼ[2,2] -0.916136 1.02804 0.0319528 1027.79 2087.51 1.00208 12.4894
+ 14 │ βⱼ[3,2] 0.577911 0.873303 0.0269948 1094.9 1738.12 1.00304 13.3049
+ 15 │ βⱼ[4,2] 0.115502 0.784112 0.0186258 1758.58 1981.55 1.0031 21.3698
+
Now we have fused the previous model in one. We still have a population-level intercept α
. But now we have in the same model group-level intercepts for each of the groups αⱼ
s and group-level along with their standard deviation τₐ
. We also have the coefficients βⱼ
s with their standard deviation τᵦ
s. The parameters are interpreted exactly as the previous cases.
Now let's go to the fourth model, varying_intercept_slope
. Here we are going to add a columns of into the data matrix X
:
X_correlated = hcat(fill(1, size(X, 1)), X)
+model_correlated = correlated_varying_intercept_slope(X_correlated, idx, y)
+chain_correlated = sample(model_correlated, NUTS(), MCMCThreads(), 1_000, 4)
+println(DataFrame(summarystats(chain_correlated)))
31×8 DataFrame
+ Row │ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ │ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+─────┼───────────────────────────────────────────────────────────────────────────────────────────────
+ 1 │ Ω.L[1,1] 1.0 0.0 NaN NaN NaN NaN NaN
+ 2 │ Ω.L[2,1] 0.0499316 0.31051 0.00698543 1975.54 2314.64 1.00168 1.52283
+ 3 │ Ω.L[3,1] -0.454537 0.249772 0.00571052 1977.49 2262.51 1.00096 1.52433
+ 4 │ Ω.L[4,1] 0.308835 0.25287 0.00599969 1843.28 2199.73 1.00014 1.42088
+ 5 │ Ω.L[5,1] -0.125632 0.307712 0.00654629 2239.37 2376.57 1.00314 1.72619
+ 6 │ Ω.L[2,2] 0.946451 0.0731193 0.00168809 2176.21 2182.87 1.00018 1.67751
+ 7 │ Ω.L[3,2] 0.00316296 0.326756 0.00512707 4010.69 2567.33 1.0003 3.0916
+ 8 │ Ω.L[4,2] 0.0351896 0.354115 0.00527131 4546.36 3111.1 1.00178 3.50452
+ 9 │ Ω.L[5,2] 0.0261588 0.366052 0.00538327 4345.52 2064.09 1.00291 3.3497
+ 10 │ Ω.L[3,3] 0.775999 0.148672 0.00348151 1845.72 2482.95 1.00082 1.42276
+ 11 │ Ω.L[4,3] -0.0330662 0.349329 0.00604209 3242.42 2090.97 1.00099 2.49939
+ 12 │ Ω.L[5,3] 0.0374594 0.360744 0.00516219 4901.98 2431.25 1.00176 3.77865
+ 13 │ Ω.L[4,4] 0.753896 0.150472 0.00357772 1850.93 2236.89 1.00165 1.42677
+ 14 │ Ω.L[5,4] 0.0116934 0.349883 0.00506856 4664.43 2772.27 1.00128 3.59553
+ 15 │ Ω.L[5,5] 0.684545 0.179677 0.00469272 1541.5 1667.6 1.00377 1.18825
+ 16 │ σ 7.08991 0.392342 0.00609954 4281.99 2994.66 1.00037 3.30073
+ 17 │ τ[1] 3.82687 0.792461 0.0243249 1071.64 1274.54 1.00157 0.82606
+ 18 │ τ[2] 0.515217 0.457244 0.0118942 1162.98 1701.28 1.00367 0.896471
+ 19 │ τ[3] 1.57042 0.675595 0.0194601 1075.05 1066.81 1.00327 0.82869
+ 20 │ τ[4] 0.79098 0.502001 0.0137349 1042.75 683.387 1.0037 0.803794
+ 21 │ τ[5] 0.536124 0.435909 0.0118638 1106.15 1058.94 1.00122 0.852668
+ 22 │ γ[1,1] 5.1869 2.19555 0.0618344 1249.75 1539.01 1.00063 0.963357
+ 23 │ γ[2,1] -0.104914 4.6473 0.0843589 3038.11 2765.66 0.999955 2.3419
+ 24 │ γ[3,1] -2.26726 3.24333 0.0677281 2307.05 2102.85 1.00399 1.77836
+ 25 │ γ[4,1] 0.786036 4.18682 0.0682086 3757.86 2648.24 1.00005 2.89671
+ 26 │ γ[5,1] 0.317867 4.53047 0.067744 4456.83 2826.46 1.0041 3.43551
+ 27 │ γ[1,2] 5.68517 2.38212 0.0679177 1209.5 1530.76 1.00015 0.932333
+ 28 │ γ[2,2] 0.406946 4.69823 0.0761723 3785.93 2816.48 1.00131 2.91835
+ 29 │ γ[3,2] -2.43749 3.41662 0.0695933 2451.7 2481.02 1.00083 1.88987
+ 30 │ γ[4,2] 1.40997 4.31934 0.0742324 3389.91 2550.09 1.00076 2.61308
+ 31 │ γ[5,2] -0.565495 4.69425 0.0822518 3281.19 2340.3 1.00369 2.52928
+
We can see that we also get all of the Cholesky factors of the correlation matrix Ω
which we can transform back into a correlation matrix by doing Ω = Ω.L * Ω.L'
.
Boatwright, P., McCulloch, R., & Rossi, P. (1999). Account-level modeling for trade promotion: An application of a constrained parameter hierarchical model. Journal of the American Statistical Association, 94(448), 1063–1073.
de Finetti, B. (1974). Theory of Probability (Volume 1). New York: John Wiley & Sons.
Nau, R. F. (2001). De Finetti was Right: Probability Does Not Exist. Theory and Decision, 51(2), 89–124. https://doi.org/10.1023/A:1015525808214
There are some computational tricks that we can employ with Turing. I will cover here two computational tricks:
QR Decomposition
Non-Centered Parametrization
Back in "Linear Algebra 101" we've learned that any matrix (even rectangular ones) can be factored into the product of two matrices:
: an orthogonal matrix (its columns are orthogonal unit vectors meaning .
: an upper triangular matrix.
This is commonly known as the QR Decomposition:
Let me show you an example with a random matrix :
A = rand(3, 2)
3×2 Matrix{Float64}:
+ 0.720103 0.295367
+ 0.573619 0.276597
+ 0.664468 0.983436
Now let's factor A
using LinearAlgebra
's qr()
function:
using LinearAlgebra: qr, I
+Q, R = qr(A)
LinearAlgebra.QRCompactWY{Float64, Matrix{Float64}, Matrix{Float64}}
+Q factor: 3×3 LinearAlgebra.QRCompactWYQ{Float64, Matrix{Float64}, Matrix{Float64}}
+R factor:
+2×2 Matrix{Float64}:
+ -1.13539 -0.902615
+ 0.0 0.562299
Notice that qr()
produced a tuple containing two matrices Q
and R
. Q
is a 3x3 orthogonal matrix. And R
is a 2x2 upper triangular matrix. So that (the transpose is equal the inverse):
Matrix(Q') ≈ Matrix(Q^-1)
MethodError: no method matching ^(::LinearAlgebra.AdjointQ{Float64, LinearAlgebra.QRCompactWYQ{Float64, Matrix{Float64}, Matrix{Float64}}}, ::Int64)
+
+Closest candidates are:
+ ^(!Matched::Float16, ::Integer)
+ @ Base math.jl:1283
+ ^(!Matched::Regex, ::Integer)
+ @ Base regex.jl:863
+ ^(!Matched::Float32, ::Integer)
+ @ Base math.jl:1277
+ ...
+
+
Also note that (identity matrix):
Q' * Q ≈ I(3)
true
+This is nice. But what can we do with QR decomposition? It can speed up Turing's sampling by a huge factor while also decorrelating the columns of , i.e. the independent variables. The orthogonal nature of QR decomposition alters the posterior's topology and makes it easier for HMC or other MCMC samplers to explore it. Let's see how fast we can get with QR decomposition. First, let's go back to the kidiq
example in 6. Bayesian Linear Regression:
using Turing
+using LinearAlgebra: I
+using Statistics: mean, std
+using Random: seed!
+seed!(123)
+
+@model function linreg(X, y; predictors=size(X, 2))
+ #priors
+ α ~ Normal(mean(y), 2.5 * std(y))
+ β ~ filldist(TDist(3), predictors)
+ σ ~ Exponential(1)
+
+ #likelihood
+ return y ~ MvNormal(α .+ X * β, σ^2 * I)
+end;
+
+using DataFrames
+using CSV
+using HTTP
+
+url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/kidiq.csv"
+kidiq = CSV.read(HTTP.get(url).body, DataFrame)
+X = Matrix(select(kidiq, Not(:kid_score)))
+y = kidiq[:, :kid_score]
+model = linreg(X, y)
+chain = sample(model, NUTS(), MCMCThreads(), 1_000, 4)
Chains MCMC chain (1000×17×4 Array{Float64, 3}):
+
+Iterations = 501:1:1500
+Number of chains = 4
+Samples per chain = 1000
+Wall duration = 6.24 seconds
+Compute duration = 23.56 seconds
+parameters = α, β[1], β[2], β[3], σ
+internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
+
+Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ α 21.5126 8.6720 0.2217 1528.8886 1868.2612 1.0009 64.8962
+ β[1] 2.0014 1.7943 0.0419 2281.1940 1734.6512 1.0016 96.8290
+ β[2] 0.5788 0.0584 0.0013 2163.9754 2292.8814 1.0006 91.8534
+ β[3] 0.2566 0.3092 0.0074 1762.0214 2135.6795 1.0010 74.7919
+ σ 17.8859 0.6033 0.0106 3271.1669 2347.2435 1.0008 138.8500
+
+Quantiles
+ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ Symbol Float64 Float64 Float64 Float64 Float64
+
+ α 4.7278 15.7633 21.2942 27.4322 38.4426
+ β[1] -0.5876 0.7324 1.6761 2.9919 6.3388
+ β[2] 0.4662 0.5392 0.5793 0.6184 0.6924
+ β[3] -0.3477 0.0440 0.2588 0.4733 0.8490
+ σ 16.7525 17.4685 17.8796 18.2703 19.1238
+
+See the wall duration in Turing's chain
: for me it took around 24 seconds.
Now let's us incorporate QR decomposition in the linear regression model. Here, I will use the "thin" instead of the "fat" QR, which scales the and matrices by a factor of where is the number of rows of . In practice it is better to implement the thin QR decomposition, which is to be preferred to the fat QR decomposition. It is numerically more stable. Mathematically, the thin QR decomposition is:
+ +Then we can recover the original with:
+ +In Turing, a model with QR decomposition would be the same linreg
but with a different X
matrix supplied, since it is a data transformation. First, we decompose your model data X
into Q
and R
:
Q, R = qr(X)
+Q_ast = Matrix(Q) * sqrt(size(X, 1) - 1)
+R_ast = R / sqrt(size(X, 1) - 1);
+Then, we instantiate a model with Q
instead of X
and sample as you would:
model_qr = linreg(Q_ast, y)
+chain_qr = sample(model_qr, NUTS(1_000, 0.65), MCMCThreads(), 1_000, 4)
Chains MCMC chain (1000×17×4 Array{Float64, 3}):
+
+Iterations = 1001:1:2000
+Number of chains = 4
+Samples per chain = 1000
+Wall duration = 2.65 seconds
+Compute duration = 7.82 seconds
+parameters = α, β[1], β[2], β[3], σ
+internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
+
+Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ α 32.9057 7.8274 0.2470 1006.3265 1242.3293 1.0026 128.7521
+ β[1] -49.9922 7.0245 0.2210 1009.4898 1401.5358 1.0022 129.1568
+ β[2] 22.0858 3.5735 0.1101 1054.2580 1418.8568 1.0028 134.8846
+ β[3] 0.2869 0.8775 0.0238 1370.8734 1800.6496 1.0010 175.3932
+ σ 17.8703 0.5859 0.0113 2699.9100 2464.9195 1.0019 345.4337
+
+Quantiles
+ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ Symbol Float64 Float64 Float64 Float64 Float64
+
+ α 17.6798 27.6922 32.7076 38.1740 47.9793
+ β[1] -63.6746 -54.6970 -50.1683 -45.2700 -36.2948
+ β[2] 15.1066 19.7019 22.1804 24.4485 29.1569
+ β[3] -1.3554 -0.2981 0.2697 0.8374 2.1505
+ σ 16.7293 17.4690 17.8683 18.2608 19.0084
+
+See the wall duration in Turing's chain_qr
: for me it took around 5 seconds. Much faster than the regular linreg
. Now we have to reconstruct our s:
betas = mapslices(
+ x -> R_ast^-1 * x, chain_qr[:, namesingroup(chain_qr, :β), :].value.data; dims=[2]
+)
+chain_beta = setrange(
+ Chains(betas, ["real_β[$i]" for i in 1:size(Q_ast, 2)]), 1_001:1:2_000
+)
+chain_qr_reconstructed = hcat(chain_beta, chain_qr)
Chains MCMC chain (1000×20×4 Array{Float64, 3}):
+
+Iterations = 1001:1:2000
+Number of chains = 4
+Samples per chain = 1000
+parameters = real_β[1], real_β[2], real_β[3], α, β[1], β[2], β[3], σ
+internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
+
+Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Missing
+
+ real_β[1] 6.2103 2.1904 0.0339 4190.5377 2075.4888 1.0015 missing
+ real_β[2] 0.5062 0.0616 0.0015 1657.4420 2135.3528 1.0021 missing
+ real_β[3] -0.0701 0.2143 0.0058 1370.8734 1800.6496 1.0010 missing
+ α 32.9057 7.8274 0.2470 1006.3265 1242.3293 1.0026 missing
+ β[1] -49.9922 7.0245 0.2210 1009.4898 1401.5358 1.0022 missing
+ β[2] 22.0858 3.5735 0.1101 1054.2580 1418.8568 1.0028 missing
+ β[3] 0.2869 0.8775 0.0238 1370.8734 1800.6496 1.0010 missing
+ σ 17.8703 0.5859 0.0113 2699.9100 2464.9195 1.0019 missing
+
+Quantiles
+ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ Symbol Float64 Float64 Float64 Float64 Float64
+
+ real_β[1] 1.8485 4.7441 6.2339 7.7050 10.3766
+ real_β[2] 0.3815 0.4656 0.5071 0.5480 0.6252
+ real_β[3] -0.5252 -0.2045 -0.0659 0.0728 0.3310
+ α 17.6798 27.6922 32.7076 38.1740 47.9793
+ β[1] -63.6746 -54.6970 -50.1683 -45.2700 -36.2948
+ β[2] 15.1066 19.7019 22.1804 24.4485 29.1569
+ β[3] -1.3554 -0.2981 0.2697 0.8374 2.1505
+ σ 16.7293 17.4690 17.8683 18.2608 19.0084
+
+Now let's us explore Non-Centered Parametrization (NCP). This is useful when the posterior's topology is very difficult to explore as has regions where HMC sampler has to change the step size and the factor. This is I've showed one of the most infamous case in 5. Markov Chain Monte Carlo (MCMC): Neal's Funnel (Neal, 2003):
+using CairoMakie
+using Distributions
+funnel_y = rand(Normal(0, 3), 10_000)
+funnel_x = rand(Normal(), 10_000) .* exp.(funnel_y / 2)
+
+f, ax, s = scatter(
+ funnel_x,
+ funnel_y;
+ color=(:steelblue, 0.3),
+ axis=(; xlabel=L"X", ylabel=L"Y", limits=(-100, 100, nothing, nothing)),
+)
+
Here we see that in upper part of the funnel HMC has to take few steps and be more liberal with the factor. But, the opposite is in the lower part of the funnel: way more steps and be more conservative with the factor.
+To see the devil's funnel (how it is known in some Bayesian circles) in action, let's code it in Turing and then sample:
+@model function funnel()
+ y ~ Normal(0, 3)
+ return x ~ Normal(0, exp(y / 2))
+end
+
+chain_funnel = sample(funnel(), NUTS(), MCMCThreads(), 1_000, 4)
Chains MCMC chain (1000×14×4 Array{Float64, 3}):
+
+Iterations = 501:1:1500
+Number of chains = 4
+Samples per chain = 1000
+Wall duration = 5.91 seconds
+Compute duration = 23.42 seconds
+parameters = y, x
+internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
+
+Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ y 0.3304 2.5005 0.4266 34.1757 64.0933 1.0981 1.4589
+ x -0.7113 8.8403 0.6150 467.3926 188.6939 1.0944 19.9527
+
+Quantiles
+ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ Symbol Float64 Float64 Float64 Float64 Float64
+
+ y -3.2368 -2.1271 0.0819 1.9243 5.8793
+ x -13.3185 -0.6603 -0.2764 0.4686 6.4055
+
+Wow, take a look at those rhat
values... That sucks: all are above 1.01
even with 4 parallel chains with 1,000 iterations!
How do we deal with that? We reparametrize! Note that we can add two normal distributions in the following manner:
+ +where the standard normal is the normal with mean and standard deviation . This is why is called Non-Centered Parametrization because we "decouple" the parameters and reconstruct them before.
+@model function ncp_funnel()
+ x̃ ~ Normal()
+ ỹ ~ Normal()
+ y = 3.0 * ỹ # implies y ~ Normal(0, 3)
+ return x = exp(y / 2) * x̃ # implies x ~ Normal(0, exp(y / 2))
+end
+
+chain_ncp_funnel = sample(ncp_funnel(), NUTS(), MCMCThreads(), 1_000, 4)
Chains MCMC chain (1000×14×4 Array{Float64, 3}):
+
+Iterations = 501:1:1500
+Number of chains = 4
+Samples per chain = 1000
+Wall duration = 5.38 seconds
+Compute duration = 21.32 seconds
+parameters = x̃, ỹ
+internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
+
+Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ x̃ -0.0006 1.0002 0.0161 3829.5014 2972.2855 0.9999 179.6454
+ ỹ -0.0444 1.0054 0.0160 3917.0636 2987.0567 0.9999 183.7530
+
+Quantiles
+ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ Symbol Float64 Float64 Float64 Float64 Float64
+
+ x̃ -1.8916 -0.6865 -0.0231 0.6704 2.0065
+ ỹ -2.0393 -0.7107 -0.0524 0.6362 1.9541
+
+Much better now: all rhat
are well below 1.01
(or below 0.99
).
How we would implement this a real-world model in Turing? Let's go back to the cheese
random-intercept model in 11. Multilevel Models (a.k.a. Hierarchical Models). Here was the approach that we took, also known as Centered Parametrization (CP):
@model function varying_intercept(
+ X, idx, y; n_gr=length(unique(idx)), predictors=size(X, 2)
+)
+ #priors
+ α ~ Normal(mean(y), 2.5 * std(y)) # population-level intercept
+ β ~ filldist(Normal(0, 2), predictors) # population-level coefficients
+ σ ~ Exponential(std(y)) # residual SD
+ #prior for variance of random intercepts
+ #usually requires thoughtful specification
+ τ ~ truncated(Cauchy(0, 2); lower=0) # group-level SDs intercepts
+ αⱼ ~ filldist(Normal(0, τ), n_gr) # CP group-level intercepts
+
+ #likelihood
+ ŷ = α .+ X * β .+ αⱼ[idx]
+ return y ~ MvNormal(ŷ, σ^2 * I)
+end;
+To perform a Non-Centered Parametrization (NCP) in this model we do as following:
+@model function varying_intercept_ncp(
+ X, idx, y; n_gr=length(unique(idx)), predictors=size(X, 2)
+)
+ #priors
+ α ~ Normal(mean(y), 2.5 * std(y)) # population-level intercept
+ β ~ filldist(Normal(0, 2), predictors) # population-level coefficients
+ σ ~ Exponential(std(y)) # residual SD
+
+ #prior for variance of random intercepts
+ #usually requires thoughtful specification
+ τ ~ truncated(Cauchy(0, 2); lower=0) # group-level SDs intercepts
+ zⱼ ~ filldist(Normal(0, 1), n_gr) # NCP group-level intercepts
+
+ #likelihood
+ ŷ = α .+ X * β .+ zⱼ[idx] .* τ
+ return y ~ MvNormal(ŷ, σ^2 * I)
+end;
+Here we are using a NCP with the zⱼ
s following a standard normal and we reconstruct the group-level intercepts by multiplying the zⱼ
s by τ
. Since the original αⱼ
s had a prior centered on 0 with standard deviation τ
, we only have to use the multiplication by τ
to get back the αⱼ
s.
Now let's see how NCP compares to the CP. First, let's redo our CP hierarchical model:
+url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/cheese.csv"
+cheese = CSV.read(HTTP.get(url).body, DataFrame)
+
+for c in unique(cheese[:, :cheese])
+ cheese[:, "cheese_$c"] = ifelse.(cheese[:, :cheese] .== c, 1, 0)
+end
+
+cheese[:, :background_int] = map(cheese[:, :background]) do b
+ if b == "rural"
+ 1
+ elseif b == "urban"
+ 2
+ else
+ missing
+ end
+end
+
+X = Matrix(select(cheese, Between(:cheese_A, :cheese_D)));
+y = cheese[:, :y];
+idx = cheese[:, :background_int];
+
+model_cp = varying_intercept(X, idx, y)
+chain_cp = sample(model_cp, NUTS(), MCMCThreads(), 1_000, 4)
Chains MCMC chain (1000×21×4 Array{Float64, 3}):
+
+Iterations = 501:1:1500
+Number of chains = 4
+Samples per chain = 1000
+Wall duration = 7.72 seconds
+Compute duration = 26.82 seconds
+parameters = α, β[1], β[2], β[3], β[4], σ, τ, αⱼ[1], αⱼ[2]
+internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
+
+Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ α 70.6388 5.4554 0.2162 915.6164 759.2940 1.0028 34.1457
+ β[1] 2.9639 1.3435 0.0265 2565.3867 2773.4856 1.0012 95.6698
+ β[2] -10.5915 1.3904 0.0277 2522.8976 2504.4545 1.0016 94.0853
+ β[3] 6.5485 1.3619 0.0264 2668.6532 2413.9865 0.9997 99.5209
+ β[4] 1.0905 1.3750 0.0285 2338.0486 2471.5094 1.0010 87.1918
+ σ 7.4087 0.4507 0.0088 2643.0429 2581.2624 1.0001 98.5658
+ τ 6.2641 5.7017 0.1952 1222.7006 1203.5556 1.0038 45.5976
+ αⱼ[1] -3.3228 5.3516 0.2167 865.3478 780.3551 1.0027 32.2710
+ αⱼ[2] 3.7386 5.3657 0.2173 894.2647 746.3891 1.0034 33.3494
+
+Quantiles
+ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ Symbol Float64 Float64 Float64 Float64 Float64
+
+ α 59.8857 68.3215 70.7533 73.2141 81.3633
+ β[1] 0.3781 2.0754 2.9757 3.8536 5.5841
+ β[2] -13.2884 -11.5392 -10.6060 -9.6587 -7.8122
+ β[3] 3.9443 5.6304 6.5351 7.4435 9.2425
+ β[4] -1.6999 0.1515 1.1072 2.0425 3.7349
+ σ 6.5891 7.0826 7.3872 7.7026 8.3437
+ τ 1.8181 3.2859 4.6436 7.1806 20.4141
+ αⱼ[1] -14.0825 -5.6883 -3.3988 -1.1424 7.5331
+ αⱼ[2] -6.5136 1.3252 3.5179 5.9215 14.7231
+
+Now let's do the NCP hierarchical model:
+model_ncp = varying_intercept_ncp(X, idx, y)
+chain_ncp = sample(model_ncp, NUTS(), MCMCThreads(), 1_000, 4)
Chains MCMC chain (1000×21×4 Array{Float64, 3}):
+
+Iterations = 501:1:1500
+Number of chains = 4
+Samples per chain = 1000
+Wall duration = 11.84 seconds
+Compute duration = 45.18 seconds
+parameters = α, β[1], β[2], β[3], β[4], σ, τ, zⱼ[1], zⱼ[2]
+internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
+
+Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ α 71.1250 3.5397 0.1218 878.2014 784.3499 1.0054 19.4365
+ β[1] 2.9093 1.2858 0.0297 1876.0233 2481.1601 1.0026 41.5206
+ β[2] -10.6841 1.3789 0.0608 571.3700 192.4652 1.0074 12.6457
+ β[3] 6.5318 1.3379 0.0326 1713.4983 1455.9135 1.0030 37.9235
+ β[4] 1.0746 1.3279 0.0316 1767.8948 2009.5112 1.0071 39.1274
+ σ 7.3765 0.4561 0.0158 777.6440 188.6146 1.0139 17.2110
+ τ 5.0157 2.6654 0.1354 604.8463 193.9385 1.0075 13.3866
+ zⱼ[1] -0.9156 0.7846 0.0225 1184.6917 1334.6682 1.0041 26.2199
+ zⱼ[2] 0.8326 0.8046 0.0223 1253.2053 947.4690 1.0023 27.7362
+
+Quantiles
+ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ Symbol Float64 Float64 Float64 Float64 Float64
+
+ α 63.6608 68.9974 71.0956 73.1433 78.5094
+ β[1] 0.3584 2.0549 2.9066 3.7865 5.4293
+ β[2] -13.5656 -11.5643 -10.6594 -9.7752 -7.9711
+ β[3] 3.9994 5.5582 6.4857 7.4393 9.0957
+ β[4] -1.4433 0.2116 1.0321 1.9658 3.6415
+ σ 6.5332 7.0626 7.3592 7.6827 8.3100
+ τ 1.8047 3.1023 4.2758 6.3272 11.7515
+ zⱼ[1] -2.5756 -1.4160 -0.8862 -0.3486 0.5239
+ zⱼ[2] -0.6131 0.2661 0.7795 1.3822 2.4805
+
+Notice that some models are better off with a standard Centered Parametrization (as is our cheese
case here). While others are better off with a Non-Centered Parametrization. But now you know how to apply both parametrizations in Turing. Before we conclude, we need to recover our original αⱼ
s. We can do this by multiplying zⱼ[idx] .* τ
:
τ = summarystats(chain_ncp)[:τ, :mean]
+αⱼ = mapslices(
+ x -> x * τ, chain_ncp[:, namesingroup(chain_ncp, :zⱼ), :].value.data; dims=[2]
+)
+chain_ncp_reconstructed = hcat(
+ MCMCChains.resetrange(chain_ncp), Chains(αⱼ, ["αⱼ[$i]" for i in 1:length(unique(idx))])
+)
Chains MCMC chain (1000×23×4 Array{Float64, 3}):
+
+Iterations = 1:1000
+Number of chains = 4
+Samples per chain = 1000
+Wall duration = 11.84 seconds
+Compute duration = 45.18 seconds
+parameters = α, β[1], β[2], β[3], β[4], σ, τ, zⱼ[1], zⱼ[2], αⱼ[1], αⱼ[2]
+internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
+
+Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ α 71.1250 3.5397 0.1218 878.2014 784.3499 1.0054 19.4365
+ β[1] 2.9093 1.2858 0.0297 1876.0233 2481.1601 1.0026 41.5206
+ β[2] -10.6841 1.3789 0.0608 571.3700 192.4652 1.0074 12.6457
+ β[3] 6.5318 1.3379 0.0326 1713.4983 1455.9135 1.0030 37.9235
+ β[4] 1.0746 1.3279 0.0316 1767.8948 2009.5112 1.0071 39.1274
+ σ 7.3765 0.4561 0.0158 777.6440 188.6146 1.0139 17.2110
+ τ 5.0157 2.6654 0.1354 604.8463 193.9385 1.0075 13.3866
+ zⱼ[1] -0.9156 0.7846 0.0225 1184.6917 1334.6682 1.0041 26.2199
+ zⱼ[2] 0.8326 0.8046 0.0223 1253.2053 947.4690 1.0023 27.7362
+ αⱼ[1] -4.5922 3.9351 0.1128 1184.6917 1334.6682 1.0041 26.2199
+ αⱼ[2] 4.1762 4.0358 0.1118 1253.2053 947.4690 1.0023 27.7362
+
+Quantiles
+ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
+ Symbol Float64 Float64 Float64 Float64 Float64
+
+ α 63.6608 68.9974 71.0956 73.1433 78.5094
+ β[1] 0.3584 2.0549 2.9066 3.7865 5.4293
+ β[2] -13.5656 -11.5643 -10.6594 -9.7752 -7.9711
+ β[3] 3.9994 5.5582 6.4857 7.4393 9.0957
+ β[4] -1.4433 0.2116 1.0321 1.9658 3.6415
+ σ 6.5332 7.0626 7.3592 7.6827 8.3100
+ τ 1.8047 3.1023 4.2758 6.3272 11.7515
+ zⱼ[1] -2.5756 -1.4160 -0.8862 -0.3486 0.5239
+ zⱼ[2] -0.6131 0.2661 0.7795 1.3822 2.4805
+ αⱼ[1] -12.9183 -7.1023 -4.4447 -1.7483 2.6277
+ αⱼ[2] -3.0749 1.3348 3.9098 6.9328 12.4414
+
+Neal, Radford M. (2003). Slice Sampling. The Annals of Statistics, 31(3), 705–741. Retrieved from https://www.jstor.org/stable/3448413
+ +Ok, now this is something that really makes me very excited with Julia's ecosystem. If you want to use an Ordinary Differential Equation solver in your Turing model, you don't need to code it from scratch. You've just borrow a pre-made one from DifferentialEquations.jl
. This is what makes Julia so great. We can use functions and types defined in other packages into another package and it will probably work either straight out of the bat or without much effort!
For this tutorial I'll be using Brazil's COVID data from the Media Consortium. For reproducibility, we'll restrict the data to the year of 2020:
using Downloads
+using DataFrames
+using CSV
+using Chain
+using Dates
+
+url = "https://data.brasil.io/dataset/covid19/caso_full.csv.gz"
+file = Downloads.download(url)
+df = DataFrame(CSV.File(file))
+br = @chain df begin
+ filter(
+ [:date, :city] =>
+ (date, city) ->
+ date < Dates.Date("2021-01-01") &&
+ date > Dates.Date("2020-04-01") &&
+ ismissing(city),
+ _,
+ )
+ groupby(:date)
+ combine(
+ [
+ :estimated_population_2019,
+ :last_available_confirmed_per_100k_inhabitants,
+ :last_available_deaths,
+ :new_confirmed,
+ :new_deaths,
+ ] .=>
+ sum .=> [
+ :estimated_population_2019,
+ :last_available_confirmed_per_100k_inhabitants,
+ :last_available_deaths,
+ :new_confirmed,
+ :new_deaths,
+ ],
+ )
+end;
Let's take a look in the first observations
first(br, 5)
5×6 DataFrame
+ Row │ date estimated_population_2019 last_available_confirmed_per_100k_inhabitants last_available_deaths new_confirmed new_deaths
+ │ Date Int64 Float64 Int64 Int64 Int64
+─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+ 1 │ 2020-04-02 210147125 79.6116 305 1167 61
+ 2 │ 2020-04-03 210147125 90.9596 365 1114 60
+ 3 │ 2020-04-04 210147125 103.622 445 1169 80
+ 4 │ 2020-04-05 210147125 115.594 496 1040 51
+ 5 │ 2020-04-06 210147125 125.766 569 840 73
Also the bottom rows
last(br, 5)
5×6 DataFrame
+ Row │ date estimated_population_2019 last_available_confirmed_per_100k_inhabitants last_available_deaths new_confirmed new_deaths
+ │ Date Int64 Float64 Int64 Int64 Int64
+─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+ 1 │ 2020-12-27 210147125 1.22953e5 191250 17614 337
+ 2 │ 2020-12-28 210147125 1.23554e5 191788 27437 538
+ 3 │ 2020-12-29 210147125 1.24291e5 192839 56371 1051
+ 4 │ 2020-12-30 210147125 1.25047e5 194056 55600 1217
+ 5 │ 2020-12-31 210147125 1.25724e5 195072 54469 1016
Here is a plot of the data:
using AlgebraOfGraphics
+using CairoMakie
+f = Figure()
+plt = data(br) * mapping(:date => L"t", :new_confirmed => "infected daily") * visual(Lines)
+draw!(f[1, 1], plt)
The Susceptible-Infected-Recovered (SIR) (Grinsztajn, Semenova, Margossian & Riou, 2021) model splits the population in three time-dependent compartments: the susceptible, the infected (and infectious), and the recovered (and not infectious) compartments. When a susceptible individual comes into contact with an infectious individual, the former can become infected for some time, and then recover and become immune. The dynamics can be summarized in a system ODEs:
where:
– the number of people susceptible to becoming infected (no immunity)
– the number of people currently infected (and infectious)
– the number of recovered people (we assume they remain immune indefinitely)
– the constant rate of infectious contact between people
– constant recovery rate of infected individuals
It's very easy:
Create a ODE function
Choose:
Initial Conditions –
Parameters –
Time Span –
Optional – Solver or leave blank for auto
PS: If you like SIR models checkout epirecipes/sir-julia
The following function provides the derivatives of the model, which it changes in-place. State variables and parameters are unpacked from u
and p
; this incurs a slight performance hit, but makes the equations much easier to read.
using DifferentialEquations
+
+function sir_ode!(du, u, p, t)
+ (S, I, R) = u
+ (β, γ) = p
+ N = S + I + R
+ infection = β * I * S / N
+ recovery = γ * I
+ @inbounds begin
+ du[1] = -infection # Susceptible
+ du[2] = infection - recovery # Infected
+ du[3] = recovery # Recovered
+ end
+ return nothing
+end;
This is what the infection would look with some fixed β
and γ
in a timespan of 100 days starting from day one with 1,167 infected (Brazil in April 2020):
i₀ = first(br[:, :new_confirmed])
+N = maximum(br[:, :estimated_population_2019])
+
+u = [N - i₀, i₀, 0.0]
+p = [0.5, 0.05]
+prob = ODEProblem(sir_ode!, u, (1.0, 100.0), p)
+sol_ode = solve(prob)
+f = Figure()
+plt =
+ data(stack(DataFrame(sol_ode), Not(:timestamp))) *
+ mapping(
+ :timestamp => L"t",
+ :value => L"N";
+ color=:variable => renamer(["value1" => "S", "value2" => "I", "value3" => "R"]),
+ ) *
+ visual(Lines; linewidth=3)
+draw!(f[1, 1], plt; axis=(; title="SIR Model for 100 days, β = $(p[1]), γ = $(p[2])"))
Please note that we are using the alternative negative binomial parameterization as specified in 8. Bayesian Regression with Count Data:
function NegativeBinomial2(μ, ϕ)
+ p = 1 / (1 + μ / ϕ)
+ r = ϕ
+
+ return NegativeBinomial(r, p)
+end
NegativeBinomial2 (generic function with 1 method)
+Now this is the fun part. It's easy: just stick it inside!
+using Turing
+using LazyArrays
+using Random: seed!
+seed!(123)
+
+@model function bayes_sir(infected, i₀, r₀, N)
+ #calculate number of timepoints
+ l = length(infected)
+
+ #priors
+ β ~ TruncatedNormal(2, 1, 1e-4, 10) # using 10 because numerical issues arose
+ γ ~ TruncatedNormal(0.4, 0.5, 1e-4, 10) # using 10 because numerical issues arose
+ ϕ⁻ ~ truncated(Exponential(5); lower=0, upper=1e5)
+ ϕ = 1.0 / ϕ⁻
+
+ #ODE Stuff
+ I = i₀
+ u0 = [N - I, I, r₀] # S,I,R
+ p = [β, γ]
+ tspan = (1.0, float(l))
+ prob = ODEProblem(sir_ode!, u0, tspan, p)
+ sol = solve(
+ prob,
+ Tsit5(); # similar to Dormand-Prince RK45 in Stan but 20% faster
+ saveat=1.0,
+ )
+ solᵢ = Array(sol)[2, :] # New Infected
+ solᵢ = max.(1e-4, solᵢ) # numerical issues arose
+
+ #likelihood
+ return infected ~ arraydist(LazyArray(@~ NegativeBinomial2.(solᵢ, ϕ)))
+end;
+Now run the model and inspect our parameters estimates. We will be using the default NUTS()
sampler with 1_000
samples on only one Markov chain:
infected = br[:, :new_confirmed]
+r₀ = first(br[:, :new_deaths])
+model_sir = bayes_sir(infected, i₀, r₀, N)
+chain_sir = sample(model_sir, NUTS(), 1_000)
+summarystats(chain_sir[[:β, :γ]])
Summary Statistics
+ parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
+ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
+
+ β 1.1199 0.0292 0.0013 477.2173 505.0551 0.9996 11.2615
+ γ 1.0869 0.0296 0.0014 477.1981 493.8323 0.9996 11.2610
+
+Hope you had learned some new bayesian computational skills and also took notice of the amazing potential of Julia's ecosystem of packages.
+Grinsztajn, L., Semenova, E., Margossian, C. C., & Riou, J. (2021). Bayesian workflow for disease transmission modeling in Stan. ArXiv:2006.02985 [q-Bio, Stat]. http://arxiv.org/abs/2006.02985
+ +