### A Pluto.jl notebook ###
# v0.19.46
using Markdown
using InteractiveUtils
# ╔═╡ ce716e92-0c3c-11ed-364f-2f8266fa10f2
begin
using PlutoUI
using StatsPlots
using Plots; default(fontfamily="Computer Modern", framestyle=:box) # LaTeX-style
using PlutoTeachingTools
using LinearAlgebra
using Distributions
using ForwardDiff
using MCMCChains
using DataFrames
using Random
using LaTeXStrings
using HypertextLiteral
using Logging; Logging.disable_logging(Logging.Info);
end;
# ╔═╡ 8daf0123-b22e-4cda-8eec-7cad3484c8e0
TableOfContents()
# ╔═╡ bcc4c430-92ee-4a4a-8744-0b0e2508211a
md"
[**↩ Home**](https://lf28.github.io/BayesianModelling/)
[**↪ Next Chapter**](./section4_turing.html)
"
# ╔═╡ 70f5222b-aadf-4dc0-bc03-b1d10e7db8b6
md"""
# Bayesian inference with MCMC
"""
# ╔═╡ 46af5470-afc8-47a8-a6c4-07de22105e91
md"""
In this chapter, we introduce a technique called Markov Chain Monte Carlo (MCMC), which is arguably the most important algorithm for Bayesian inference.
In a nutshell, MCMC aims at computing the posterior distribution ``p(\theta|\mathcal{D})`` efficiently. Without it, most Bayesian computations could not be finished in a realistic timeframe, so it is safe to say the algorithm is instrumental to the success of modern Bayesian methods.
In the following sections, we first explain why the algorithm is needed for Bayesian inference. Different forms of MCMC algorithms, such as Metropolis-Hastings, Gibbs sampling, and the Hamiltonian sampler, are introduced afterwards. The intuitions behind the algorithms are given higher priority than their theoretical underpinnings: from a Bayesian practitioner's perspective, it is arguably more important to know how to use the algorithms in real-world analysis. Therefore, the practical aspects of the technique, such as MCMC diagnostic tools, are also explained in detail.
"""
# ╔═╡ f9584f19-9020-481d-93b2-8d52e8c4bdfc
md"""
## Bayesian computation is hard
At first glance, it might be hard to see why Bayesian computation can be difficult. After all, **Bayes' theorem** provides us with a very straightforward mechanism to compute the posterior:
$$\text{posterior}=\frac{\text{prior} \times \text{likelihood}}{\text{evidence}}\;\;\text{or}\;\;p(\theta|\mathcal D) =\frac{p(\theta) p(\mathcal D|\theta)}{p(\mathcal D)},$$
where
$$p(\mathcal D) = \int p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta,$$ known as *evidence* or *marginal likelihood*, is a constant with respect to the model parameter ``\theta``.
The recipe is easy to write down, but the daunting part lies in computing the normalising constant ``p(\mathcal D)``: the integration is often high-dimensional and therefore usually *intractable*.
As a result, posterior probability calculations, such as the probability that ``\theta`` lies in some set ``A``,
$$\mathbb{P}(\theta \in A|\mathcal D) = \int_{\theta \in A} p(\theta |\mathcal D) \mathrm{d}\theta = \frac{ \int_{\theta \in A} p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)},$$ cannot be evaluated exactly. For example, in the coin-tossing example introduced last time, if we consider any coin with a bias of ``0.5 \pm 0.05`` fair, the posterior probability we want to know is
$$\mathbb{P}(0.45 \leq \theta \leq 0.55|\mathcal D).$$
We sidestepped the integration problem last time by discretising the parameter space $\Theta = [0,1]$ into a finite set of discrete choices. The method essentially replaces a difficult integration with a **brute-force enumeration** summation:
$$\mathbb{P}(0.45 \leq \theta \leq 0.55|\mathcal D) = \frac{\int_{0.45}^{0.55} p(\theta)p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)}\approx \frac{\sum_{\theta_0'\in \{.5\}} p(\theta=\theta_0')p(\mathcal D|\theta=\theta_0')}{\sum_{\theta_0\in \{0, 0.1, \ldots, 1.0\}} p(\theta=\theta_0)p(\mathcal D|\theta=\theta_0)}.$$
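In code, the enumeration might look like the following sketch (the concrete data, 7 heads out of 10 tosses, and the uniform ``\text{Beta}(1,1)`` prior are hypothetical stand-ins for the earlier example):
```julia
# brute-force grid approximation of P(0.45 ≤ θ ≤ 0.55 | D)
θ_grid = 0:0.1:1.0                          # discretised parameter space
unnorm = pdf.(Beta(1, 1), θ_grid) .* pdf.(Binomial.(10, θ_grid), 7)  # prior × likelihood
post_grid = unnorm ./ sum(unnorm)           # normalise by the brute-force sum
prob_fair = sum(post_grid[0.45 .≤ θ_grid .≤ 0.55])
```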
Unfortunately, this brute-force discretisation-based method is not scalable. When the parameter space's dimension gets larger, the algorithm becomes too slow to use. To see this, consider discretising a ``D``-dimensional parameter space, ``\Theta \in \mathbb{R}^D``: if each parameter is discretised with just 2 choices (a very crude discretisation), the total discretised space is of order ``2^D``, which grows exponentially. Such exponential growth soon becomes too much for any modern computer to handle.
What's worse, the difficulty does not end here. Even if we knew the normalising constant, *i.e.* ``p(\mathcal D)`` were known and the posterior could be evaluated exactly, we would still need to evaluate the numerator's integral, ``\int_{\theta \in A} p(\theta)p(\mathcal D|\theta) \mathrm{d}\theta``, which is again generally intractable.
To summarise, the difficulties of Bayesian computation are two-fold:
1. the posterior distribution is only evaluable up to some unknown normalising constant;
2. posterior summary involves integrations, which are intractable.
"""
# ╔═╡ 4e93aa8d-00d8-4677-8e06-0032a0634a5a
md"""
## How to estimate? -- Monte Carlo
"""
# ╔═╡ e0dcfd8c-2415-4c4f-b254-5f9f52cf8ebf
md"""
Recall that, to summarise a posterior, we need to calculate intractable integrations such as
$$\mathbb{P}(\theta \in A|\mathcal D) = \int_{\theta \in A} p(\theta |\mathcal D) \mathrm{d}\theta = \frac{ \int_{\theta \in A} p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)}.$$
More generally, we want to calculate the expectation of *any function* ``t(\cdot)`` of the random variable ``\theta`` under the posterior:
$$\mathbb{E}[t(\theta)|\mathcal D] = \int t(\theta)\, p(\theta|\mathcal D) \mathrm{d}\theta. \tag{1}$$
"""
# ╔═╡ cda9d867-a6d4-4194-8afd-dbc437637b23
md"""
Note that when ``t(\cdot)`` is a *counting function*, e.g.
$$t(\theta) = \mathbf 1(\theta \in A)=\begin{cases} 1 & \theta \in A \\ 0 & \theta \notin A,\end{cases}$$ we recover the first question as
$$\mathbb E[t(\theta)|\mathcal D] = \int \mathbf{1}(\theta\in A) p(\theta|\mathcal D) \mathrm{d}\theta = \int_{\theta \in A} 1\cdot p(\theta|\mathcal D) \mathrm{d}\theta = \mathbb P(\theta\in A|\mathcal D).$$ That's why the expectation problem is more "general".
Therefore, the problem we want to tackle is:
> **Problem 1**: How to estimate expectations of functions of ``\theta``, such as equation (1), under the posterior?
"""
# ╔═╡ 29c29fc1-d6c6-4a5e-879b-e24e675a335c
md"""
**Monte Carlo** methods are widely known for their capability to approximate intractable integrations. Suppose we can *sample* from the posterior distribution,
$$\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(R)} \sim p(\theta|\mathcal D),$$
if the sample size, ``R``, is large enough, due to the law of large numbers, we can approximate integration by frequency counting:
$${\mathbb{P}}(\theta \in A|\mathcal D) \approx \frac{\#\{\theta^{(r)} \in A\}}{R},$$ where ``\#\{\cdot\}`` counts how many samples fall in the set ``A``.
And general expectations of the form
$$\mathbb E[t(\theta)|\mathcal D] = \int t(\theta) p(\theta|\mathcal D) \mathrm{d}\theta = \frac{\int t(\theta) p(\theta) p(\mathcal D|\theta) \mathrm{d}\theta}{p(\mathcal D)}$$
can also be approximated by its Monte Carlo estimation
$${\mathbb{E}}[t(\theta)|\mathcal D] \approx \hat t =\frac{1}{R} \sum_{r} t(\theta^{(r)}),$$
which is the sample average evaluated at the drawn samples.
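As a concrete sketch, suppose (hypothetically) the coin example has a conjugate ``\text{Beta}(8, 4)`` posterior (a uniform prior updated with 7 heads out of 10 tosses); the Monte Carlo recipe then reads:
```julia
# Monte Carlo estimation with R = 2000 posterior samples
posterior = Beta(8, 4)                    # hypothetical posterior: Beta(1+7, 1+3)
θs = rand(posterior, 2000)
mc_prob = sum(0.45 .≤ θs .≤ 0.55) / 2000  # P(0.45 ≤ θ ≤ 0.55 | D) by frequency counting
mc_mean = mean(θs)                        # Monte Carlo estimate of E[θ | D]
```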
"""
# ╔═╡ ddf162c5-8425-439f-a7a5-b6735f6849d1
md"""
### Visual intuition
The following diagram illustrates the idea of Monte Carlo approximation. The true posterior distribution ``p(\theta|\mathcal D)`` is plotted on the left, and the histogram of ``R=2000`` random samples from the posterior is plotted on the right. The Monte Carlo method essentially uses the *histogram* on the right in place of the true posterior. For example, to calculate the mean or expectation of ``\theta``, i.e. ``t(\theta) = \theta``, the Monte Carlo estimator becomes
$$\mathbb E[\theta|\mathcal D] \approx \frac{1}{R} \sum_r \theta^{(r)},$$ the sample average of the samples (which is very close to the ground truth on the left).
"""
# ╔═╡ 27e25739-b2fd-4cee-8df9-089cfbce4321
begin
approx_samples = 2000
dist = MixtureModel([Normal(2, sqrt(2)), Normal(9, sqrt(19))], [0.3, 0.7])
Random.seed!(100)
x = rand(dist, approx_samples)
ids = (x .< 15) .& (x .> 0)
prob_mc=sum(ids)/approx_samples
end;
# ╔═╡ 306bb2c5-6c7e-40f7-b3af-738e6fb7613e
let
plt = Plots.plot(
dist;
xlabel=raw"$\theta$",
ylabel=raw"$p(\theta|\mathcal{D})$",
title="true posterior",
fill=true,
alpha=0.3,
xlims=(-10, 25),
label="",
components=false,
)
Plots.vline!(plt, [mean(dist)]; label="true mean", linewidth=3)
plt2 = Plots.plot(
dist;
xlabel=raw"$\theta$",
fill=false,
alpha=1,
linewidth =3,
xlims=(-10, 25),
label="",
components=false,
)
histogram!(plt2, x, bins=110, xlims=(-10, 25), norm=true, label="", color=1, xlabel=raw"$\theta$", title="Monte Carlo Approx.")
Plots.vline!(plt2, [mean(x)]; label="MC mean", linewidth=3, color=2)
Plots.plot(plt, plt2)
end
# ╔═╡ 80b7c755-fe82-43d9-a2f1-e37edaaab25a
md"""
Similarly, calculating a probability such as ``\mathbb P(0\leq \theta \leq 15|\mathcal D)`` reduces to frequency counting:
$$\hat{\mathbb{P}}(0\leq \theta \leq 15|\mathcal D) = \frac{\#\{0\leq \theta^{(r)}\leq 15\}}{2000} =0.905,$$ i.e. counting the proportion of samples that fall in the area of interest. The idea is illustrated in the following diagram.
"""
# ╔═╡ 62743b4f-acc5-4370-8586-627e45a5c9ed
let
plt2 = Plots.plot(
dist;
xlabel=raw"$\theta$",
fill=false,
alpha=1,
linewidth =3,
xlims=(-10, 25),
label="",
components=false,
size=(300,450)
)
histogram!(plt2, x, bins = -10:0.3:25 , xlims=(-10, 25), norm=false, label="", color=1, xlabel=raw"$\theta$", title="Monte Carlo est.")
Plots.plot!(plt2, 0:0.5:15,
dist;
xlabel=raw"$\theta$",
fill=true,
color = :orange,
alpha=0.5,
linewidth =3,
xlims=(-10, 25),
label=L"0 \leq θ \leq 15",
components=false,
)
histogram!(plt2, x[ids], bins = -10:0.3:25, xlims=(-10, 25), norm=false, label="", color=:orange, xlabel=raw"$\theta$")
end
# ╔═╡ 3df816ac-f2d0-46a7-afe0-d128a1f185b1
md"""
### Monte Carlo method is scalable
The most important property of the Monte Carlo method is its **scalability**, which makes it a practical solution to **Problem 1** even when ``\theta \in \mathbb{R}^D`` is high-dimensional.
!!! note "Monte Carlo method is scalable"
The accuracy of the Monte Carlo estimate does not depend on the dimensionality of the space sampled, ``D``. Roughly speaking, regardless of the dimensionality of ``\theta``, the accuracy (squared error from the mean) remains the same.
"""
# ╔═╡ 9fca6e17-c8ee-4739-9d2e-0aaf85032a6b
Foldable("Further details about the scalability*", md"For simplicity, we assume ``t`` is a scalar-valued function. Note all expectations here are w.r.t the posterior distribution from which the samples are drawn. Firstly, it can be shown that the Monte Carlo estimator is unbiased:
$$\mathbb E[\hat t] = \mathbb E\left [\frac{1}{R}\sum_r t(\theta^{(r)})\right ] =\frac{1}{R}\sum_r \mathbb E[t(\theta^{(r)})] = \mathbb E[t(\theta)].$$ It means, on average, the estimator converges to the true integration value.
To measure the estimator's accuracy, we only need to find the estimator's variance as it measures the average squared error between the estimator and true value:
$$\mathbb V[\hat t] = \mathbb E[(\hat t -\mathbb E[t])^2].$$
If we assume samples are independent draws from the distribution, the variance is then
$$\mathbb V[\hat t] =\mathbb V\left [\frac{1}{R}\sum_r t(\theta^{(r)})\right ]= \frac{1}{R^2} \sum_r \mathbb{V}[t(\theta^{(r)})]=\frac{R\cdot \mathbb{V}[t(\theta^{(r)})]}{R^2}=\frac{\sigma^2}{R},$$ where
$$\sigma^2 = \mathbb V[t] = \int p(\theta|\mathcal D) (t(\theta)-\mathbb E[t])^2\mathrm{d}\theta$$ is some positive constant that only depends on the function ``t``. Therefore, as the number of samples ``R`` increases, the variance of ``\hat t`` decreases linearly (the standard deviation, ``\sqrt{\mathbb V[\hat t]}``, unfortunately, shrinks only at a rate of ``1/\sqrt{R}``). Note the accuracy (variance) only depends on ``\sigma^2``, the variance of the particular statistic ``t``, rather than on ``D``, the dimensionality.")
# ╔═╡ 8892b42f-20c8-4dd3-be84-635d7b5f07fe
md"""
## How to sample? -- MCMC
"""
# ╔═╡ 5e4551b3-1cd9-40f1-a2b7-ece9fbc7c082
md"""
In the previous section, we have established that Monte Carlo estimation is a scalable method *if we can obtain samples from a posterior distribution*. That is a big "*if*" to assume. Without a practical sampling method, no Monte Carlo estimators can be calculated.
We should also note that for a general Bayesian inference problem the posterior can only be evaluated up to some unknown constant:
$$p(\theta|\mathcal D) \propto p(\theta) p(\mathcal D|\theta),$$
where the scalar ``1/p(\mathcal D)`` involves a nasty integration that we do not know how to compute.
The question we face now is a sample generation problem:
> **Problem 2**: how to generate samples ``\{\theta^{(r)}\}_{r=1}^R`` from an un-normalised distribution:
> $$p(\theta|\mathcal D) \propto p(\theta)p(\mathcal D|\theta)?$$
"""
# ╔═╡ a54dc923-ea9e-4016-8980-56f21c3d3ca6
md"""
*Note.* *If we apply the log to the posterior distribution, the log density becomes a sum of the log prior and the log-likelihood (up to an additive constant):*
$$\ln p(\theta|\mathcal D) = \ln p(\theta) + \ln p(\mathcal D|\theta) - \ln p(\mathcal D),$$ *which is faster and more numerically stable for computers to evaluate: additions are in general faster than multiplications/divisions in floating-point arithmetic, and working with logs avoids numerical underflow of very small densities.*
"""
# ╔═╡ 16e59a89-daa8-467b-82b7-e4058c99edb8
md"""
### Basic idea
A class of methods called **Markov Chain Monte Carlo** (MCMC) is a popular and successful candidate for generating samples from a non-standard distribution. An MCMC algorithm combines two concepts:
* **Markov Chain**: $p(\theta^{(r)}|\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(r-1)}) = p(\theta^{(r)}|\theta^{(r-1)})$
* the current state only depends on the previous state given the whole history
* samples are generated by some carefully crafted Markov Chains
* current sample depends on the previous sample
* **Monte Carlo**: estimation by the Markov chain samples as illustrated in the previous section
The idea is to produce posterior samples ``\{\theta^{(r)}\}_{r=1}^R`` **in sequence**, each one depending only on ``\theta^{(r-1)}`` and not on its more distant history of predecessors, *i.e.* a **Markov Chain** (which accounts for the first **MC** of the acronym **MCMC**). When the transition probability of the chain satisfies certain (quite general) conditions, Markov Chain theory says that the empirical distribution of the simulated samples approaches the desired target distribution as the chain is simulated long enough, i.e. when ``R`` is large.
Since the sequence only converges to the target distribution when ``R`` is large, we usually retain just the last chunk of the sequence as "good samples", or equivalently **discard** the initial samples as **burn-in**, since they are usually not representative of the distribution to be approximated. For example, if the Markov Chain has been simulated for 4,000 steps, we only keep the last 2,000 to form an empirical distribution of the target.
"""
# ╔═╡ bc071d70-6eae-4416-9fa6-af275bd74f0b
md"""
### Metropolis Hastings
"""
# ╔═╡ 60e9fc94-a3e6-4d0a-805c-6759ee94dcae
md"""
The *Metropolis-Hastings* (MH) algorithm is one of the MCMC methods (and arguably the most important one). MH constructs a Markov Chain in two simple steps:
* it *proposes* a candidate ``\theta_{\text{proposed}}`` based on the current state ``\theta_{\text{current}}`` according to a proposal distribution: ``q(\theta_{\text{proposed}}|\theta_{\text{current}})``
* it then either moves to ``\theta_{\text{proposed}}`` (*accepts* the proposal) or stays put (*rejects* it) based on the proposal's "quality", which involves the ratio``{}^\ast``:
$\frac{p(\theta_{\text{proposed}}|\mathcal D)}{p(\theta_{\text{current}}|\mathcal D)}$
* intuitively, if the proposal state is good, or the ratio is well above 1, we move to ``\theta_{\text{proposed}}``, otherwise stay put
* ``\ast`` *note the acceptance probability ratio also involves another ratio of the proposal distributions (check the algorithm below for details)*
A key observation here is that MH's operations do not involve the normalising constant:
$$\frac{p(\theta_{\text{proposed}}|\mathcal D)}{p(\theta_{\text{current}}|\mathcal D)} = \frac{\frac{p(\theta_{\text{proposed}})p(\mathcal D|\theta_{\text{proposed}})}{\cancel{p(\mathcal D)}}}{\frac{p(\theta_{\text{current}})p(\mathcal D|\theta_{\text{current}})}{\cancel{p(\mathcal D)}}} = \frac{p(\theta_{\text{proposed}})p(\mathcal D|\theta_{\text{proposed}})}{p(\theta_{\text{current}})p(\mathcal D|\theta_{\text{current}})} \triangleq \frac{p^\ast(\theta_{\text{proposed}})}{p^\ast(\theta_{\text{current}})}.$$
Therefore, we only need to evaluate the un-normalised posterior distribution ``p^\ast``.
The algorithm details are summarised below.
"""
# ╔═╡ 8ddc5751-be76-40dd-a029-cf5f87cdb09d
md"""
!!! infor "Metropolis-Hastings algorithm"
0. Initialise ``\theta^{(0)}`` arbitrarily
1. For ``r = 1,2,\ldots``:
* sample a candidate value from ``q``:
$$\theta' \sim q(\theta|\theta^{(r-1)})$$
* evaluate ``a``, where
$$a = \text{min}\left \{\frac{p^\ast(\theta')q(\theta^{(r-1)}|\theta')}{p^\ast(\theta^{(r-1)}) q(\theta'|\theta^{(r-1)})}, 1\right \}$$
* set
$$\theta^{(r)} = \begin{cases} \theta' & \text{with probability } a\\ \theta^{(r-1)} & \text{with probability } 1-a\end{cases}$$
"""
# ╔═╡ 918f4482-716d-42ff-a8d1-46168e9b920a
md"""*Remark*: *when the proposal distribution is symmetric, i.e.*
$$q(\theta^{(t)}|\theta')= q(\theta'|\theta^{(t)}),$$ *the acceptance probablity reduces to*
$$a = \text{min}\left \{\frac{p^\ast(\theta')}{p^\ast(\theta^{(t)}) }, 1\right \}$$ *and the algorithm is called **Metropolis** algorthm.* *A common symmetric proposal distribution is random-walk Gaussian, i.e.* ``q(\theta^{(t)}|\theta')= \mathcal N(\theta', \Sigma),`` *where ``\Sigma`` is some fixed variance matrix.*
"""
# ╔═╡ 00081011-582b-4cbe-aba8-c94c47d96c87
md"""
!!! danger "Question: sample a biased 🎲 by Metropolis-Hastings"
We want to obtain samples from a completely *biased* die that lands with threes (⚂) all the time. Suppose one uses the Metropolis-Hastings algorithm [^1] to generate samples from the biased die, with a fair die as the proposal distribution.
* Is the proposal symmetric?
* What would the MCMC chains look like?
* Will the chain ever converge to the target distribution? And do we need to discard any as burn-in?
"""
# ╔═╡ a072f6a9-472d-40f2-bd96-4a09292dade8
md"""
!!! hint "Answer"
Depending on the initial state, the chain can either be ``3,3,3,\ldots`` or
``x,x,\ldots,x,3,3,3,\ldots`` where ``x \in \{1,2,4,5,6\}``.
The proposal is symmetric. Since a fair die is used as the proposal, the proposal probability distribution is 1/6 for all 6 facets.
As a result, the acceptance probability ``a`` only depends on the ratio of the target distribution. There are two possible scenarios:
1. the chain starts at 3, then all following proposals other than 3 will be rejected (as ``a=\tfrac{0}{1}=0\%``). Therefore, only samples of 3 will be produced.
2. the chain is initialised with a state other than 3, e.g. ``x=1``; then all proposals other than 3 will be rejected [^2], so the initial samples will all be ones (``1,1,1,\ldots``) until a three is proposed, which is accepted with probability ``a=\min\{\tfrac{1}{0}, 1\}= 100\%``; from then on, only threes are produced (by the same argument as case 1).
Regardless of the starting state, the chain will eventually converge and produce the correct samples, i.e. ``3,3,3,3,\ldots``. For chains starting with a state other than 3, the initial chunk (all ones in the previous case) should be discarded as **burn-in**: the chain has not yet converged.
Luckily, the discarded initial chunk will be short. On average, we should expect the burn-in to last only around 6 iterations, as its length is a geometric random variable with ``p=1/6``.
"""
# ╔═╡ 716920e3-2c02-4da9-aa6b-8a51ba638dfa
md"""
#### Demonstration
"""
# ╔═╡ 24efefa9-c4e3-487f-a228-571fec271886
md"""
To demonstrate the idea, consider sampling from a bivariate Gaussian distribution:
$$\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \sim \mathcal N\left (\begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right ).$$
The Gaussian has zero means and unit variances; ``\rho`` measures the correlation between the two dimensions. Here we use ``\rho = 0.9``. The probability density can be written as
$$p^\ast(\boldsymbol \theta) \propto \text{exp}\left (-\frac{1}{2} \boldsymbol \theta^\top \mathbf \Sigma^{-1} \boldsymbol \theta\right ),$$ where
``\mathbf \Sigma = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix}.`` The target distribution's contour plot is shown below.
"""
# ╔═╡ f0f07e50-45ee-4907-8ca8-50d5aaeeafb4
begin
ρ = 0.9
μ = [0, 0]
Σ = [1.0 ρ; ρ 1]
target_p = (x) -> logpdf(MvNormal(μ, Σ), x)
plt_contour = plot(-5:0.1:5, -5:0.1:5, (x, y)-> exp(target_p([x,y])),legend = :none, st=:contour, linewidth=0.5, la=0.5, levels=5, xlabel=L"\theta_1", ylabel=L"\theta_2", title="Contour plot of a highly correlated bivariate Gaussian")
end
# ╔═╡ 44c167b9-375d-4051-a4c1-825e5ec9570c
md"""
To apply an MH algorithm, we adopt a simple uncorrelated bivariate Gaussian as the proposal distribution:
$$q(\theta'|\theta^{(t)}) = \mathcal N\left (\theta^{(t)}, \sigma^2_q \times \begin{bmatrix} 1 & 0 \\ 0 & 1\end{bmatrix}\right )$$
* the proposal distribution proposes a Gaussian random walk conditional on the current state ``\theta^{(t)}``.
* ``\sigma^2_q > 0`` is a tuning parameter to set.
* note that since the proposal distribution is symmetric, i.e.
$$q(\theta'|\theta^{(t)}) = q(\theta^{(t)}|\theta')$$
the acceptance probability simplifies to
$$a = \text{min}\left \{\frac{p^\ast(\theta')}{p^\ast(\theta^{(t)})}, 1\right \}$$
The Metropolis-Hastings algorithm with the mentioned proposal distribution can be implemented in Julia as follows:
```julia
# Metropolis-Hastings with an isotropic Gaussian random-walk proposal
#   `ℓπ`:  the target log probability density
#   `mc`:  number of Monte Carlo samples to draw
#   `dim`: the dimension of the target space
#   `Σ`:   the proposal's covariance matrix
#   `x0`:  the initial starting state
function metropolis_hastings(ℓπ, mc=500; dim=2, Σ = 10. * Matrix(I, dim, dim), x0=zeros(dim))
    samples = zeros(dim, mc)
    samples[:, 1] = x0
    L = cholesky(Σ).L
    ℓx0 = ℓπ(x0)
    accepted = 0
    for i in 2:mc
        # randomly draw from the proposal distribution;
        # faster than xstar = rand(MvNormal(x0, Σ))
        xstar = x0 + L * randn(dim)
        ℓxstar = ℓπ(xstar)
        # calculate the acceptance ratio `a`;
        # note ℓπ is in log scale, so apply exp to the difference
        a = exp(ℓxstar - ℓx0)
        # accept the proposal with probability min(a, 1)
        if rand() < a
            x0 = xstar
            ℓx0 = ℓxstar
            accepted += 1
        end
        # store the sample (the old state is kept upon rejection)
        @inbounds samples[:, i] = x0
    end
    accpt_rate = accepted / (mc-1)
    return samples, accpt_rate
end
```
To use the algorithm to draw samples from the target Gaussian distribution, we simply feed in the required inputs, e.g.
```julia
# set the target posterior distribution to sample
target_ℓπ = (x) -> logpdf(MvNormal(μ, Σ), x)
# set proposal distribution's variance
σ²q = 1.0
mh_samples, rate = metropolis_hastings(target_ℓπ, 2000; Σ = σ²q * Matrix(I, 2, 2), x0=[-2.5, 2.5])
```
"""
# ╔═╡ 5de392a4-63d7-4313-9cbd-8eaeb4b08eea
md"""
An animation is shown below to demonstrate the MH algorithm with a proposal variance of ``\sigma^2_q = 1.0``, where
* the ``{\color{red}\text{red dots}}`` are the rejected proposals
* the ``{\color{green}\text{green dots}}`` are the MCMC samples (either accepted new proposals or stay-put old samples)
"""
# ╔═╡ 5b3f3b8a-1cfa-4b32-9a20-1e1232549e78
md"""
It can be observed that
* the acceptance rate is about 0.32
* and the chain is **mixing well**, which means the high-density area of the target distribution has been explored well.
2000 samples drawn from the MH algorithm after 2000 burn-in are shown below.
"""
# ╔═╡ 922a89e6-a1a0-4f07-b815-552a9b2a4fbd
md"""
### Reduce dependence of the chain
"""
# ╔═╡ eba93b03-23d4-4ccd-88d2-dcea51bb20b9
md"""
Ideally, the traditional Monte Carlo method requires samples to be *independent*. MCMC samples, however, are simulated as Markov chains: i.e. the next sample depends on the current state, therefore MCMC samples are all dependent.
It is worth noting that Monte Carlo estimation is still valid even when the samples are dependent as long as the chain reaches equilibrium. However, an estimation's standard error, in general, deteriorates when the samples in use are more dependent.
There are two commonly used practices to reduce the temporal correlation of an MCMC chain: *thinning* and *parallel chains*. The two can be used together to further improve the sample quality.
"""
# ╔═╡ d3a88626-0d60-45ad-9725-9ed23853fc85
md"""
#### Thinning
"""
# ╔═╡ 76a0e827-927c-4a48-a4c9-98c401357211
md"""
Since the dependence between samples decreases as the lag between them increases, one can retain only the MCMC samples at some fixed lag: for example, after convergence, save every 10th sample, as sketched below.
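A minimal sketch, assuming `mh_samples` is the `dim × mc` sample matrix returned by the `metropolis_hastings` function above:
```julia
# discard the first 2000 samples as burn-in, then keep every 10th sample
thinned_samples = mh_samples[:, 2001:10:end]
```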
"""
# ╔═╡ f56bb3ef-657d-4eff-8778-11d550e6d803
md"""
#### Parallel chains
"""
# ╔═╡ ae766de8-2d2a-4dd3-8127-4ca69a6082f1
md"""
Each MCMC chain can be simulated independently of the others; to make full use of modern computers' concurrent processing capabilities, we can therefore run several chains in parallel, e.g. as sketched below.
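A minimal multi-threaded sketch, reusing the `metropolis_hastings` function and the target `target_ℓπ` from above (Julia must be started with multiple threads, e.g. `julia -t 4`, for actual parallelism):
```julia
# run 4 independent MH chains on separate threads, each from a random start
chains = Vector{Matrix{Float64}}(undef, 4)
Threads.@threads for c in 1:4
    chains[c], _ = metropolis_hastings(target_ℓπ, 4000; Σ = 1.0 * Matrix(I, 2, 2), x0 = 5 * randn(2))
end
```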
"""
# ╔═╡ 2628cd3b-d2dc-40b9-a0c2-1e6b79fb736b
md"""
*Why should we run parallel chains rather than a single long chain?* Note that, ideally, we want the samples to be *independent*. Samples from one single chain are temporally dependent. Multiple chains that explore the posterior landscape independently, when their samples are pooled together, produce a less dependent sample overall.
"""
# ╔═╡ 5a7a300a-e2e0-4c99-9f71-077b43602fdd
md"""
The following animation shows the evolution of four parallel Metropolis-Hastings chains. To make the visualisation cleaner, rejected proposals are not shown in the animation. The four chains start at four initial locations (i.e. the four corners). Note that all four chains quickly move to the high-density regions regardless of their starting locations, which is a key property of MCMC.
"""
# ╔═╡ 99153a03-8954-48c6-8396-1c2b669e4ea6
md"""
### Limitations of Metropolis-Hastings
"""
# ╔═╡ 0802dc90-8312-4509-82f0-6eca735e852b
md"""
The MH algorithm's performance depends on the proposal's quality, and setting a suitable proposal distribution for an MH algorithm is not easy, especially when the target distribution is high-dimensional. As a rule of thumb, we should aim for an acceptance rate between 0.2 and 0.6; too high or too low implies the chain is struggling.
**Acceptance rate too high**
This usually happens when ``\sigma^2_q`` is too small: the proposals all stay close to each other within one high-density area ``\Rightarrow`` most of the proposals are accepted, but as a result ``\Rightarrow`` other high-density areas are not explored sufficiently.
The following animation illustrates the idea: the proposal's variance is set as ``\sigma_q^2 = 0.005``; despite a high acceptance rate of 0.924, the chain only makes tiny moves and therefore does not explore the target distribution well enough.
"""
# ╔═╡ a00b9476-af30-4609-8b1e-4693246fdaef
md"""
**Acceptance rate too low**
This usually happens when ``\sigma^2_q`` is too large: the proposals jump further but are likely to land in less desirable locations ``\Rightarrow`` a lot of rejections ``\Rightarrow`` highly temporally correlated samples.
The chain shown below employs a more audacious proposal distribution with ``\sigma_q^2 = 10.0``. As a result, most of the proposals are rejected which leads to an inefficient sampler (most of the samples are the same).
"""
# ╔═╡ dd15f969-733e-491c-a1b7-e0fbf4532e64
md"""
To avoid the difficulty of setting good proposal distributions, more advanced variants of MH algorithms, such as the **Hamiltonian sampler** (HMC), **No-U-Turn sampler** (NUTS), have been proposed. Instead of randomly exploring the target distribution, HMC and its variants employ the gradient information of the target distribution to help explore the landscape.
"""
# ╔═╡ 85cad369-df16-4f06-96ae-de5f9c5bb6cd
md"""
## Digress: Julia's `MCMCChains.jl`
"""
# ╔═╡ 0cda2d90-a2fa-41ab-a655-c7e4550e4eb1
md"""
Julia has a wide range of packages to analyse MCMC chains. In particular, the `MCMCChains.jl` package provides tools to visualise and summarise MCMC chains.
We can construct `Chains` object using `MCMCChains.jl` by passing a matrix of raw chain values along with optional parameter names:
"""
# ╔═╡ bdf1a4ca-a9ef-449d-82a3-07dd9db1f1ba
md"""
The first 3 iterations of the raw samples:
```julia
mh_samples_parallel[1:3, :, :]
```
"""
# ╔═╡ 3d7c71ce-6532-4cd8-a4a3-5ef2ded70caa
md"""
Create a `Chains` object by passing the raw sample matrix and names of the parameters (optional)
```julia
# use MCMCChains.jl to create a Chains object
# the first 2000 samples are discarded as burn-in
# Note that to apply thinning=2, one can use ``mh_samples_parallel[2001:2:end, :, :]``
chain_mh = Chains(mh_samples_parallel[2001:end, :, :], [:θ₁, :θ₂])
```
"""
# ╔═╡ 358719cd-d903-45ab-bba6-7fe91697d1ee
md"""
### Visualisations
One can visualise `Chains` objects by
* `traceplot(chain_mh)`: plots the chain samples at each iteration of the Markov Chain, i.e. time series plot
* `density(chain_mh)`: plots kernel density estimations fit on the samples
* or `plot(chain_mh)`: plots both trace and density plots
"""
# ╔═╡ 2c18acb0-9bc5-4336-a10d-727538dbd3c8
md"""
For example, the following plots show the four parallel MH chains for the bi-variate Gaussian example
* the left two plots show the trace plots of four chains
* the density plots on the right show the kernel density estimates fitted to the four chains
It can be seen visually that all four chains converge to roughly the same equilibrium distributions.
"""
# ╔═╡ 55ed9e58-7feb-4dbc-b807-05796a02fc62
md"""
Other visualisation methods include
* `meanplot(c::Chains)`: running average of the chain
* `histogram(c::Chains)`: construct histogram of the chain
* `autocorplot(c::Chains)`: autocorrelations of the chain
* `corner(c::Chains)`: make a scatter plot of the chains; a tool to view correlations between dimensions of the samples
"""
# ╔═╡ 2c06c3f1-11e6-4191-b505-080342a9b787
md"For example, the following plot uses `corner()` to show a scatter plot of the chain."
# ╔═╡ 17a4d60b-7181-4268-866c-ddbde37dc349
md"""
### Summary statistics
"""
# ╔═╡ 45eb8c7b-5f17-43ac-b412-0f3ced44a018
md"""
`MCMCChains.jl` also provides easy-to-use interfaces to tabulate important statistics of MCMC chains instances
* `summarystats`: aggregates statistics such as the mean, standard deviation, and effective sample size, etc.
* `describe`: apart from above, it shows 2.5% to 97.5% percentile of the samples
```julia
# summary of the chain
summarystats(chain_mh)
describe(chain_mh)
```
"""
# ╔═╡ e0e4f50c-52c0-4261-abae-3cf0399e04e0
md"""
# MCMC diagnostics
"""
# ╔═╡ 81ce6e4c-ef48-4a07-9133-4414867c2b29
md"""
Markov chain theory only provides us with a *theoretical guarantee*:
> if one simulates an MCMC chain *long enough*, *eventually* the sample empirical distribution will *converge* to the target posterior distribution.
Unfortunately, this is only a *theoretical guarantee*. We do not have the time and resources to run a chain indefinitely. In practice, we simulate chains of fixed lengths and inspect the chains to check **convergence**.
### MCMC metrics
Luckily, there are multiple easy-to-compute metrics to diagnose a chain. Most of the techniques apply stationary time series models to assess the performance of a chain. The most commonly used metrics are the **$\hat R$ statistic** and the **effective sample size** (`ess`).
**R̂ statistic (`rhat`)**
R̂ is a metric that measures the stability of a chain. Upon convergence, any chunks of a chain, e.g. the first and last halves of the chain, or parallel chains, should be *similar to each other*. The ``\hat R`` statistic, which compares the between-chain variability with the within-chain variability, should be close to one if all chains have converged.
> As a rule of thumb, a valid converging chain's ``\hat R`` statistic should be less than 1.01.
"""
# ╔═╡ 57c26917-8300-45aa-82e7-4fd5d5925eba
md"""
**Effective sample size (`ess`)**
The traditional Monte Carlo method requires samples to be independent. However, MCMC samples are all dependent, as the chain is simulated in a Markovian fashion: the next sample depends on the previous state. It is worth noting that Monte Carlo estimation is still valid even when the samples are dependent, as long as the chain has mixed well. However, the standard error of the estimation, in general, deteriorates when the samples are more dependent.
**`ess`** roughly estimates how many equivalent *independent* samples are in a dependent chain. Since samples in an MCMC chain are correlated, they contain less information than truly independent samples. `ess` therefore discounts the sample size by some factor that depends on the temporal correlation between the iterations.
We can also measure the *efficiency* of an MCMC algorithm by calculating
$$\text{efficiency} = \frac{\text{ess}}{\text{iterations}},$$
which measures the information content of a chain. A highly efficient algorithm needs fewer iterations to achieve the same estimation accuracy.
> Larger effective sample size (ess) implies the MCMC algorithm is more efficient
"""
# ╔═╡ 97f81b83-50ff-47c0-8c2d-df95816b2ac3
md"""
**Examples** To demonstrate the ideas, we compare the `ess` and `R̂` of the three MH chains with different proposal variances: ``\sigma^2_q=1.0``, ``0.005``, and ``20.0``. Remember, ``\sigma^2_q=1.0`` performs the best; ``0.005`` and ``20.0`` are less optimal.
"""
# ╔═╡ d2ae1f4f-e324-42f6-8eaa-c8e32ec3fc39
md"""
**Summary statistics with MH chain of ``\sigma^2_q=1.0``**
"""
# ╔═╡ 5487dd11-7c8e-428f-a727-300780dd02a7
md"**Summary statistics with MH chain of ``\sigma^2_q=.005``**"
# ╔═╡ 3a92e430-8fba-4d69-aa25-13e08a485720
md"**Summary statistics with MH chain of ``\sigma^2_q=20.0``**"
# ╔═╡ e38c1d3d-fb36-4793-94fc-665dc0eacd99
md"""
It can be observed that
* `ess`: the `ess` values are around 360, 20, and 134, respectively, for the three algorithms, which implies efficiencies of 0.18, 0.01, and 0.067. The second algorithm, with its small proposal variance, achieves a high acceptance rate but is the least efficient.
* `rhat`: both the second and the third algorithms' `rhat` values are greater than 1.01, which indicates that they do not mix well.
"""
# ╔═╡ 2998ed6a-b505-4cfb-b5da-bdb3f887a646
md"""
### Visual inspection
"""
# ╔═╡ 95e71f9b-ed81-4dc1-adf9-bd4fe1fc8bbe
md"""
Simple trace plots reveal a lot of information about a chain. One can diagnose a chain by visual inspection.
**What do bad chains look like?** The following two trace-plots show the chain traces of the two less desired MH algorithms for the bivariate Gaussian example: one with ``\sigma^2_q=0.005`` and ``\sigma^2_q=20``.
Recall that when ``\sigma^2_q`` is too small, the chain proposes small changes at each iteration. As a result, the chain does not explore the high-density region sufficiently well. If one splits the chain into two halves, the two chunks (with green and red backgrounds) exhibit drastically different values.
"""
# ╔═╡ 82dc987c-4649-4e65-83e7-e337be0e99e8
md"""
On the other hand, when ``\sigma^2_q`` is too large, the chain jumps back and forth with large steps, resulting in most proposals being rejected and old samples being retained. The chain becomes very *sticky*, and it does not contain much information compared with independent samples.
"""
# ╔═╡ ac904acb-4438-45d9-8795-8ab724240da0
md"""
**What good chains should look like ?**
The figure below shows trace plots of four different chains with `ess=` 4.6, 49.7, 90.9, and 155. The top two are *bad* chains, while the bottom two are of relatively better quality. A good chain should show stable statistical properties across the iterations with a reasonable level of variance.
"""
# ╔═╡ 68c98e53-7ac3-4832-a7dd-97459a89d7cb
md"""
# Other MCMC samplers
"""
# ╔═╡ 5f1f12b2-5b63-416e-a459-9b5d0e37b0e8
md"""
Metropolis-Hastings is considered the mother of most modern MCMC algorithms; many other MCMC variants can be considered specific cases of the MH algorithm. We will quickly introduce two other popular variants, Gibbs sampling and the Hamiltonian sampler, focusing on the intuition behind each.
"""
# ╔═╡ b29597f1-3fd7-4b44-9097-7c1dc1b7629b
md"""
## Gibbs sampling
Gibbs sampling reduces a multivariate sampling problem to a series of univariate samplings. Assume the target distribution is bivariate with density ``p(\theta_1, \theta_2)``, and the chain is currently at state ``(\theta_1^{(r)}, \theta_2^{(r)})``; Gibbs sampling alternately samples from the full conditionals in two steps:
* sample ``\theta_1^{(r+1)} \sim p(\theta_1|\theta_2^{(r)})``
* sample ``\theta_2^{(r+1)} \sim p(\theta_2|\theta_1^{(r+1)})``
The new sample ``(\theta_1^{(r+1)}, \theta_2^{(r+1)})`` is retained as a new sample.
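A minimal sketch for the bivariate Gaussian example above: with zero means, unit variances, and correlation ``\rho``, the full conditionals are themselves Gaussian, ``\theta_1|\theta_2 \sim \mathcal N(\rho\theta_2, 1-\rho^2)`` and ``\theta_2|\theta_1 \sim \mathcal N(\rho\theta_1, 1-\rho^2)``:
```julia
# Gibbs sampling for the correlated bivariate Gaussian (ρ = 0.9)
function gibbs_bigaussian(mc=2000; ρ=0.9, x0=zeros(2))
    samples = zeros(2, mc)
    θ₁, θ₂ = x0
    σ = sqrt(1 - ρ^2)               # the conditionals' standard deviation
    for r in 1:mc
        θ₁ = ρ * θ₂ + σ * randn()   # sample θ₁ | θ₂
        θ₂ = ρ * θ₁ + σ * randn()   # sample θ₂ | θ₁
        samples[:, r] = [θ₁, θ₂]
    end
    return samples
end
```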
"""
# ╔═╡ 6c83a79a-8581-45cb-8d31-31185dc42a0f
md"""
### Visual intuition
The following diagram demonstrates how Gibbs sampling explores the bivariate Gaussian distribution. Note the zig-zag behaviour: at each step, Gibbs sampling only changes one dimension of the sample.
"""
# ╔═╡ 9a281767-3d88-4742-8efc-e4fe764c705a
md"""
**Multivariate Gibbs sampling** The general Gibbs sampling algorithm for a multivariate problem is listed below. The idea is the same: at each iteration, a series of small updates based on the full conditionals are made and chained together.
"""
# ╔═╡ 022b7068-de1b-4506-abf1-2289976f1597
md"""
!!! infor "Gibbs sampling"
0. Initialise ``\boldsymbol \theta^{(0)}=[\theta_1^{(0)},\theta_2^{(0)}, \ldots, \theta_D^{(0)} ]`` arbitrarily
1. For ``r = 1,2,\ldots``:
* sample dimension ``1``:
$$\theta_1^{(r)} \sim p(\theta_1|\theta_2^{(r-1)}, \ldots, \theta_D^{(r-1)})$$
* sample dimension ``2``:
$$\theta_2^{(r)} \sim p(\theta_2|\theta_1^{(r)}, \theta_3^{(r-1)}, \ldots, \theta_D^{(r-1)})$$
$$\vdots$$
* sample dimension ``D``:
$$\theta_D^{(r)} \sim p(\theta_D|\theta_1^{(r)}, \ldots, \theta_{D-1}^{(r)})$$
"""
# ╔═╡ 16be37d7-f6d5-469b-b0fd-20816b42d4e5
md"""
**Gibbs sampling is a Metropolis-Hastings**
One drawback of the MH sampler is that the user needs to specify a proposal distribution. Gibbs sampling relieves the user of this burden **by using the full conditionals of the target distribution as the proposal distribution**. Gibbs sampling, therefore, is a specific case of the MH algorithm. One can also show that when full conditionals are used, the acceptance probability is always 100%; therefore, there is no rejection step in Gibbs sampling.
"""
# ╔═╡ aafd802c-e529-4116-8d58-13f2ba1c4e49
Foldable("Further details on Gibbs sampling is a MH algorithm.", md"For simplicity, we only consider bivariate case. Assume the chain is currently at state ``\boldsymbol \theta^{(r)}=(\theta_1^{(r)}, \theta_2^{(r)})``, the proposal distribution proposes to move to a new state ``\boldsymbol \theta' = (\theta_1', \theta_2')`` with a proposal density
$$q(\boldsymbol \theta'|\boldsymbol \theta^{(r)}) = p(\theta_1'|\theta_2') \cdot \mathbf 1({\theta_2'= \theta_2^{(r)})},$$
where ``\mathbf 1(\cdot)`` returns 1 when the testing condition is true and 0 otherwise. The transition kernel basically states ``\theta_2^{(r)}`` is not changed.
The acceptance probability then is
$$a \triangleq \frac{p(\boldsymbol \theta') q(\boldsymbol \theta^{(r)}|\boldsymbol \theta')}{p(\boldsymbol \theta^{(r)})q(\boldsymbol \theta'|\boldsymbol \theta^{(r)})} = \frac{p(\theta_1', \theta_2') \times p(\theta_1^{(r)}|\theta_2')\cdot \mathbf 1({\theta_2'= \theta_2^{(r)})}}{p(\theta_1^{(r)}, \theta_2^{(r)})\times p(\theta_1'|\theta_2')\cdot \mathbf 1({\theta_2'= \theta_2^{(r)})}}.$$
When ``\theta_2' = \theta_2^{(r)}``,
$$a = \frac{p(\theta_1', \theta_2^{(r)}) \times p(\theta_1^{(r)}|\theta_2^{(r)})}{p(\theta_1^{(r)}, \theta_2^{(r)})\times p(\theta_1'|\theta_2^{(r)})}= \frac{p(\theta_1', \theta_2^{(r)}) \times \frac{p(\theta_1^{(r)}, \theta_2^{(r)})}{p(\theta_2^{(r)})}}{p(\theta_1^{(r)}, \theta_2^{(r)})\times \frac{p(\theta_1',\theta_2^{(r)})}{p(\theta_2^{(r)})}} = \frac{p(\theta_2^{(r)})}{p(\theta_2^{(r)})} =1.0,$$
when ``\theta_2'\neq \theta_2^{(r)}``, the ratio ``a=\tfrac{0}{0}`` is undefined, but this is a trivial boundary case that never happens, since the proposal never changes ``\theta_2``. Therefore, there is no need to test the proposal: the proposed state should always be accepted.
")
# ╔═╡ d6d332c5-769c-4fe1-8f84-fd2bbf19250a
md"""
We have already shown that a high acceptance rate does not necessarily imply high efficiency. The same applies to Gibbs sampling. The zig-zag exploration scheme of Gibbs sampling can be slow for highly correlated sample spaces. For example, in our bivariate Gaussian example, Gibbs sampling achieves around `ess = 129` (out of 2000 samples), i.e. an efficiency of around $(129/2000).
"""
# ╔═╡ 39edf0df-a9c5-4e76-af01-f730b4a619b9
md"**Summary statistics of Gibbs sampling**"
# ╔═╡ dd8095b0-8e42-4f6d-a545-8ac1db07ff79
md"""
## Hamiltonian sampler
"""
# ╔═╡ c387be0b-3683-48d6-9041-d8f987428499
md"""
We have discussed the limitation of the Metropolis-Hastings algorithm: the proposal distribution is hard to tune. Ideally, we want a proposal distribution with the following properties:
* it proposes the next state as far from the current state (and therefore as *independent* of it) as possible
* at the same time, the proposed state should be of *good* quality,
* ideally as good (of as high a probability density) as the current one, leading to an acceptance rate ``a`` close to one
Simple proposals, such as Gaussians, cannot achieve both of the desired properties. The problem lies in their random-walk exploration nature: the proposals blindly suggest next steps and ignore the local geographical information of the target distribution. *Gradients*, which always point in the steepest-ascent direction, are exactly the *geographic* information that can guide the proposal to reach a distant yet high-probability region.
"""
# ╔═╡ df08ac1d-bcd7-4cb7-bcd1-68f5de699aa0
md"""
### Visual intuition
"""
# ╔═╡ b15f101a-b593-4992-ae80-f32dc894c773
md"""
To understand the idea, let's consider a more complicated target distribution formed by superimposing three bivariate Gaussians. The three Gaussians have means ``[-4,0], [4,0],`` and ``[0,0]``, and covariance matrices ``\begin{bmatrix} 2 & 1.5 \\ 1.5& 2\end{bmatrix}``, ``\begin{bmatrix} 2 & -1.5 \\ -1.5 & 2\end{bmatrix}`` and ``\begin{bmatrix} 2 & 0 \\ 0 & 2\end{bmatrix}``, respectively.
The contour and surface of the target distribution are plotted below for reference.
"""
# ╔═╡ 2fc46172-cd45-48b2-bff2-9dd5a91e21d1
begin
d1 = MvNormal([-4, 0], [2 1.5; 1.5 2])
d2 = MvNormal([4, 0], [2 -1.5; -1.5 2])
d3 = MvNormal([0, 0], [2 0; 0. 2])
d = MixtureModel([d1, d2, d3])
end;
# ╔═╡ bcca401f-b64d-45f5-b8c0-766cc2a1a50e
begin
ℓ(x) = logpdf(d, x)
∇ℓ(x) = ForwardDiff.gradient(ℓ, x)
end;
# ╔═╡ d3c70aee-c9ce-44f7-9b02-ca674f8c5f01
begin
con_p = contour(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false, ratio=1, xlim =[-9,9], ylim=[-6,6])
surf_p = surface(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false)
Plots.plot(con_p, surf_p, layout=(1,2), size=(780,400))
end
# ╔═╡ d20cf19f-70fc-47dd-a031-22079bbd10b9
md"""
A vector-field view of the gradient of the target density is shown below. Note that at each location, the gradient (a blue vector) always points in the steepest-ascent direction. The gradient therefore provides key geographical information about the landscape. The Hamiltonian sampler proposes the next state by making use of this gradient information.
"""
# ╔═╡ 1fbb02a8-299a-406c-a337-a74c5ad444e1
begin
meshgrid(x, y) = (repeat(x, outer=length(y)), repeat(y, inner=length(x))) # helper function to create a quiver grid.
∇ℓ_(x, y) = ∇ℓ([x, y])/5
contour(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false, title="Contour plot and gradient plot of the target density")
xs,ys = meshgrid(range(-9, stop=9, length=15), range(-6,stop=6,length=11))
quiver!(xs, ys, quiver=∇ℓ_, c=:blue)
end
# ╔═╡ c9fa60c5-f216-4eb9-b052-b3472e551bdf
Foldable("Julia code for the target distribution and visualisation.", md"By using Julia's `Distributions.jl` and `ForwardDiff.jl`, we can formulate the density and its corresponding gradient function as following:
```julia
d1 = MvNormal([-4, 0], [2 1.5; 1.5 2])
d2 = MvNormal([4, 0], [2 -1.5; -1.5 2])
d3 = MvNormal([0, 0], [2 0; 0. 2])
# Superimpose the three Gaussians together to form a new density
d = MixtureModel([d1, d2, d3])
# log likelihood of the mixture distribution
ℓ(x) = logpdf(d, x)
# gradient of the log pdf
∇ℓ(x) = ForwardDiff.gradient(ℓ, x)
```
The contour and surface of the target distribution can be plotted easily via
```julia
contour(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false, ratio=1, xlim =[-9,9], ylim=[-6,6])
surface(-9:0.1:9, -6:0.1:6, (x,y) -> pdf(d, [x,y]), legend=false)
```
")
# ╔═╡ 89a6cb78-b3e2-4780-9961-679a73967f5e
md"""
#### A useful physics metaphor
A very useful metaphor for understanding the Hamiltonian sampler is to **imagine a hockey puck sliding, without friction, over the surface of the negative target distribution**. The surface is formed by negating the target density (i.e. flip the surface plot: intuitively, a mountain becomes a valley).
"""
# ╔═╡ bad85a4b-6a14-445a-8af6-ac5f5ebaf183
surface(-9:0.1:9, -6:0.1:6, (x,y) -> -pdf(d, [x,y]), legend=false, title="Negative of the target density")
# ╔═╡ ba2545d4-cc39-4069-b6a4-a60f15911cec
md"""
At each iteration, a random initial momentum (usually drawn from a Gaussian) is given to the puck; the puck then slides across the landscape to reach the next state.
A numerical algorithm called *Leapfrog* integration is usually used to simulate the movement of the puck. In particular, the trajectory is simulated by a sequence of ``T`` steps, each of length ``\epsilon``. The two tuning parameters are:
* ``\epsilon``: the length of each *Leapfrog* step
* ``T``: the total number of steps taken to form the whole trajectory
Therefore, the total length of the puck's movement is proportional to ``\epsilon \times T``.
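A minimal leapfrog sketch, reusing the log-density gradient `∇ℓ` defined above (`θ` is the puck's position, `ϕ` its momentum; this illustrates the integrator only, not a full Hamiltonian sampler):
```julia
# simulate the frictionless puck for T leapfrog steps of length ϵ
function leapfrog(θ, ϕ, ϵ, T)
    ϕ = ϕ + 0.5 * ϵ * ∇ℓ(θ)        # initial half-step momentum update
    for t in 1:T-1
        θ = θ + ϵ * ϕ              # full-step position update
        ϕ = ϕ + ϵ * ∇ℓ(θ)          # full-step momentum update
    end
    θ = θ + ϵ * ϕ                  # final position update
    ϕ = ϕ + 0.5 * ϵ * ∇ℓ(θ)        # final half-step momentum update
    return θ, ϕ
end
```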