### A Pluto.jl notebook ###
# v0.19.46
using Markdown
using InteractiveUtils
# ╔═╡ e2f564ae-b741-4532-be01-9255a235793e
using Turing
# ╔═╡ da388a86-19cb-11ed-0c64-95c301c27153
begin
using PlutoUI
using StatsPlots
using Plots; default(fontfamily="Computer Modern", framestyle=:box) # LaTeX-style
using PlutoTeachingTools
using CSV, DataFrames
using LinearAlgebra
# using Turing
using Random
using LaTeXStrings, Latexify
using Logging; Logging.disable_logging(Logging.Warn);
end;
# ╔═╡ c1026d6b-4e2e-4045-923e-eb5886b45604
TableOfContents()
# ╔═╡ 13cbb8af-9857-43c1-902c-52ccdbe02384
md"[**↩ Home**](https://lf28.github.io/BayesianModelling/)
[**↪ Next Chapter**](./section6_logistic_regression.html)
"
# ╔═╡ 4ed16d76-2815-4f8c-8ab0-3819a03a1acc
md"""
# Bayesian linear regressions
Supervised learning, in general, aims to predict a *dependent variable* ``Y`` based on *covariate features* ``X``. In other words, it tries to assess how ``X`` affects ``Y``. Depending on the value type of the labelled target ``Y``, the analysis can be further classified as *regression* or *classification*. When ``Y`` takes a spectrum of continuous values, such as real numbers, the learning task is known as regression. When ``Y`` takes discrete values, the task is commonly referred to as classification.
In this chapter, we consider the case of ``Y`` being real-valued, ``Y\in \mathbb{R}``, *i.e.* regression, and leave the discussion of other cases, such as ``Y\in \{\texttt{true}, \texttt{false}\}`` or ``Y\in \{0,1,2,\ldots\}``, to the following chapters. We will first briefly review the Frequentist's regression model and then move on to introduce the Bayesian treatment. From a modelling perspective, the Bayesian model is almost the same as the Frequentist's model except for the additional prior distributions. It turns out that the prior not only offers a way to incorporate the modeller's expert knowledge but also has a *regularisation* effect when estimating the model parameters.
Lastly, we are going to see how to carry out practical Bayesian linear regression analysis with `Turing`. Some extensions are also considered at the end.
"""
# ╔═╡ c8547607-92be-4700-9468-863798c2ddce
md"""
## Frequentist's linear regression
Firstly, we recap the Frequentist's regression model. A typical regression model assumes each observation ``y_n`` is generated by some deterministic transformation of the independent variable ``\mathbf{x}_n \in \mathbb{R}^D`` plus some observation noise ``\varepsilon_n``:
```math
y_n = \mu(\mathbf{x}_n) + \varepsilon_n
```
for ``n=1,\ldots, N``, where the observation noises are usually assumed to be white Gaussian noises:
$$\varepsilon_n \sim \mathcal N(0, \sigma^2).$$
The deterministic function ``\mu`` is usually assumed to be linear:
$$\mu(\mathbf x_n) = \beta_0 + \mathbf{x}_n^\top \boldsymbol{\beta}_1 ,$$
where ``\beta_0`` is called the intercept and ``\boldsymbol{\beta}_1 \in \mathbb{R}^D`` determines the linear relationship between ``\mathbf{x}`` and ``y``. Since ``\mu`` is a linear function, the model is called **linear regression**.
"""
# ╔═╡ 8f95b8c8-70a0-471a-be06-3b020db667b3
md"""
**Probabilistic reformulation.**
The Frequentist's model above can also be equivalently written as:
```math
p(y_n|\mathbf{x}_n, \boldsymbol{\beta}, \sigma^2) = \mathcal{N}(y_n; \beta_0+\mathbf{x}_n^\top \boldsymbol{\beta}_1, \sigma^2),
```
where ``\boldsymbol{\beta}^\top = [\beta_0, \boldsymbol{\beta}_1]`` denotes both the intercept and the regression coefficients. In other words, this is a **likelihood model** for the observed data ``\mathbf{y} = \{y_n\}`` for ``n=1,\ldots, N``: each ``y_n`` is Gaussian distributed with mean $\beta_0+\mathbf{x}_n^\top \boldsymbol{\beta}_1$ and variance $\sigma^2$. The model also assumes the independent variables ``\mathbf{x}_n`` are fixed (*i.e.* non-random).
By using matrix notation, the above model can be compactly written as:
```math
p(\mathbf{y}|\mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \mathcal{N}_N(\mathbf{y}; \beta_0 \mathbf{1}_N + \mathbf{X} \boldsymbol{\beta}_1, \sigma^2\mathbf{I}_N),
```
where ``\mathbf{y} = [y_1, y_2,\ldots,y_N]^\top``, ``\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]^\top``, ``\mathbf{1}_N=[1, \ldots, 1]^\top`` is an ``N\times 1`` column vector of ones, and ``\mathbf{I}_{N}`` is the ``N\times N`` identity matrix.
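As a quick illustrative sketch (with made-up toy inputs, not the dataset used later in this chapter), the log-likelihood under this model can be evaluated directly with `Distributions.jl`:
```julia
using Distributions, Random
Random.seed!(1)
# made-up toy data: N = 10 observations, D = 2 predictors
X = rand(10, 2)
β₀, β₁, σ² = 1.0, [3.0, 3.0], 0.5
y = β₀ .+ X * β₁ .+ sqrt(σ²) .* randn(10)
# each yₙ is Gaussian with mean β₀ + xₙᵀβ₁ and variance σ²
loglik = sum(logpdf.(Normal.(β₀ .+ X * β₁, sqrt(σ²)), y))
```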
The likelihood assumption is illustrated below for a simple linear regression model with a one-dimensional predictor, i.e. ``D=1``. Conditioning on where ``x_n`` is, ``y_n`` is Gaussian distributed with a mean determined by the regression line and an observation variance ``\sigma^2``.
"""
# ╔═╡ 3a46c193-5a25-423f-bcb5-038f3756d7ba
md"""
**OLS (MLE) Estimation*.** Frequentists estimate the model by using maximum likelihood estimation (MLE). The idea is to find point estimators ``\hat{\beta_0}, \hat{\boldsymbol{\beta}}_1, \hat{\sigma^2}`` such that the likelihood function is maximised.
```math
\hat{\beta_0}, \hat{\boldsymbol{\beta}}_1, \hat{\sigma^2} \leftarrow \arg\max p(\mathbf y|\mathbf X, \beta_0, \boldsymbol{\beta}_1, \sigma^2)
```
It can be shown that the MLE is equivalent to the more widely known ordinary least squares (OLS) estimator, which minimises the sum of squared errors:
```math
\hat{\beta_0}, \hat{\boldsymbol{\beta}}_1 \leftarrow \arg\min \sum_{n=1}^N (y_n - \mathbf{x}_n^\top \boldsymbol{\beta}_1 -\beta_0)^2.
```
By taking the derivative and setting it to zero, the closed-form OLS estimator is:
```math
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y},
```
where we have assumed ``\boldsymbol{\beta}^\top = [\beta_0, \boldsymbol{\beta}_1]``, and the design matrix ``\mathbf{X}`` is augmented with ``\mathbf{1}_N`` as the first column (*i.e.* dummy variable for intercept).
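The closed-form estimator is straightforward to compute directly; below is a minimal sketch on made-up toy data (the design matrix is augmented with a column of ones for the intercept):
```julia
using Random
Random.seed!(1)
# made-up toy data
x = rand(30)
y = 3 .+ 3 .* x .+ sqrt(0.5) .* randn(30)
X = [ones(length(x)) x]        # augmented design matrix (first column: intercept dummy)
β̂ = (X' * X) \ (X' * y)        # OLS/MLE estimate [β̂₀, β̂₁]
```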
"""
# ╔═╡ 1cae04d1-82ca-47c8-b1fa-7fa5a754d4b1
Foldable("Derivation details about the MLE.", md"
Due to the independence assumption,
```math
p(\mathbf{y}|\mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \prod_{n=1}^N p({y}_n|\mathbf{x}_n, \boldsymbol{\beta}, \sigma^2)= \prod_{n=1}^N \mathcal{N}(y_n; \mathbf{x}_n^\top \boldsymbol{\beta}, \sigma^2)
```
Since the logarithm function is a monotonically increasing function, the maximum is invariant under the transformation. We therefore can maximise the log-likelihood function instead. And the log-likelihood function is
$$\begin{align}\mathcal{L}(\beta_0, \boldsymbol{\beta}, \sigma^2) &= \ln \prod_{n} \mathcal{N}(y_n; \mathbf{x}_n^\top \boldsymbol{\beta}, \sigma^2)\\
&= \sum_{n=1}^N \ln \left\{ (2\pi)^{-\frac{1}{2}}({\sigma^2})^{-\frac{1}{2}} \exp\left ( -\frac{1}{2 \sigma^2} (y_n - \beta_0 - \mathbf{x}_n^\top \boldsymbol{\beta})^2\right )\right \} \\
&= -\frac{N}{2}\ln (2\pi) -\frac{N}{2}\ln{\sigma^2} -\frac{1}{2 \sigma^2} \sum_{n=1}^N (y_n - \beta_0 - \mathbf{x}_n^\top \boldsymbol{\beta})^2\end{align}$$
Maximising the above log-likelihood is the same as minimising its negation, which leads to the least squared error estimation:
```math
\begin{align}
\hat{\beta_0}, \hat{\boldsymbol{\beta}} &\leftarrow \arg\min -\mathcal{L}(\beta_0, \boldsymbol{\beta}, \sigma^2) \\
&= \arg\min \sum_{n=1}^N (y_n - \beta_0 - \mathbf{x}_n^\top \boldsymbol{\beta})^2,
\end{align}
```
where we have assumed ``\sigma^2`` is known.
")
# ╔═╡ effcd3d2-ba90-4ca8-a69c-f1ef1ad697ab
md"We use `GLM.jl` to fit the OLS estimation based on the simulated data. The estimated model and true model is shown below."
# ╔═╡ af404db3-7397-4fd7-a5f4-0c812bd90c4a
begin
Random.seed!(100)
β₀, β₁, σ² = 3, 3, 0.5
N = 50
X = rand(N)
μ = β₀ .+ β₁ * X
yy = μ + sqrt(σ²) * randn(N)
end;
# ╔═╡ c6938b7f-e6e5-4bea-a273-52ab3916d07c
md"""
**Example.** Consider a simple linear regression model with one predictor, i.e.
```math
y_n = \beta_0 + \beta_1 x_n + \varepsilon_n.
```
To better illustrate the idea, we have simulated a dataset with ``N``=$(N) observations and the regression is defined with the following parameters: ``\beta_0=3, \beta_1=3, \sigma^2=0.5``. The simulated dataset is plotted below.
"""
# ╔═╡ 3e98e9ff-b674-43f9-a3e0-1ca5d8614327
begin
using GLM
ols_simulated = lm([ones(N) X], yy)
end
# ╔═╡ f8040fa0-fdff-42da-8b9f-785627f51ee1
let
p_lr = plot(title="Frequentist's linear regression's probabilistic model", framestyle = :origin,legend=:bottomright)
β0, σ²0, N = [3, 3], 0.5, 250
Random.seed!(100)
X = rand(N)
μ = β0[1] .+ β0[2] * X
yy = μ + sqrt(σ²0) * randn(N)
plot!(X, yy, st=:scatter, label="")
plot!([0,1], x->β0[1]+β0[2]*x, c= 1, linewidth=5, label=L"\mu(x) = \beta_0+\beta_1x")
xis = [0, 0.25, 0.5, 0.75, 0.99, 0.4]
for i in 1:length(xis)
x = xis[i]
μi = dot(β0, [1, x])
xs_ = μi-3:0.01:μi+3
ys_ = pdf.(Normal(μi, sqrt(σ²0)), xs_)
ys_ = 0.1 *ys_ ./ maximum(ys_)
if i == length(xis)
plot!([x], [μi], st=:sticks, markerstrokewidth =1, markershape = :diamond, c=:black, label="", markersize=4)
plot!(ys_ .+x, xs_, c=:red, label=L"\mathcal{N}(\mu(x), \sigma^2)", linewidth=3)
else
plot!(ys_ .+x, xs_, c=:gray, label="", linewidth=1)
end
end
old_xticks = xticks(p_lr[1])
new_xticks = ([xis[end]], ["\$x\$"])
keep_indices = findall(x -> all(x .≠ new_xticks[1]), old_xticks[1])
merged_xticks = (old_xticks[1][keep_indices] ∪ new_xticks[1], old_xticks[2][keep_indices] ∪ new_xticks[2])
xticks!(merged_xticks)
p_lr
end
# ╔═╡ af2e55f3-08b8-48d4-ae95-e65168a23eeb
let
pred(x) = coef(ols_simulated)' * [1, x]
scatter(X, yy, xlabel=L"x", ylabel=L"y", label="")
plot!(0:0.1:1, pred, lw=2, label="OLS fit", legend=:topleft)
plot!(0:0.1:1, (x) -> β₀ + β₁*x, lw=2, label="True model", legend=:topleft)
end
# ╔═╡ 43191a7a-f1a2-41df-910d-cf85907e8f7a
md"""
## Bayesian linear regression model
"""
# ╔═╡ 98ef1cca-de03-44c2-bcf4-6e79da139e11
md"""
The Bayesian model reuses the frequentist's likelihood model for ``\mathbf{y}``:
```math
p(y_n|\mathbf{x}_n, \boldsymbol{\beta}, \sigma^2) = \mathcal{N}(y_n; \beta_0+\mathbf{x}_n^\top \boldsymbol{\beta}_1, \sigma^2),
```
where the model parameters are
* ``\beta_0\in \mathbb{R}`` -- intercept
* ``\boldsymbol{\beta}_1 \in \mathbb{R}^D`` -- regression coefficient vector
* ``\sigma^2\in \mathbb{R}^+`` -- Gaussian noise's variance
In addition, the Bayesian model also needs to impose a suitable prior for the unknowns
```math
p(\beta_0, \boldsymbol{\beta}_1, \sigma^2).
```
In most cases, we further assume the joint prior is independent, *i.e.*
```math
p(\beta_0, \boldsymbol{\beta}_1, \sigma^2)= p(\beta_0)p(\boldsymbol{\beta}_1)p( \sigma^2).
```
"""
# ╔═╡ a1421ccb-2d6e-4406-b770-ad7dff007c69
md"""
Based on their value ranges, a commonly used set of prior choices is summarised below.
**Prior for the intercept ``p(\beta_0)``**
Since ``\beta_0`` takes an unconstrained real value, a common choice is Gaussian
$$p(\beta_0) = \mathcal N(m_0^{\beta_0}, v_0^{\beta_0});$$
where the hyper-parameters ``m_0^{\beta_0}, v_0^{\beta_0}`` can be specified based on the data or independently.
*Data-dependent hyper-parameter:* e.g. ``\mathcal{N}(m_0^{\beta_0}={\bar{\mathbf{y}}}, v_0^{\beta_0}= 2.5^2 \sigma_{\mathbf{y}}^2 )``
* the corresponding prior on ``\beta_0`` centres around the sample average ``\bar{\mathbf{y}}`` with a standard deviation of 2.5 times the sample standard deviation of ``\mathbf{y}``
* covering a wide range of possible values, the prior is a weakly informative prior
*Data-independent hyper-parameter:* e.g. ``\mathcal{N}(m_0^{\beta_0}=0, v_0^{\beta_0}= 10^2)``
* the prior guess centres around ``0`` but with a great amount of uncertainty (very large variance)
* the zero prior mean here encourages the posterior to shrink towards 0 (but only very weakly).
"""
# ╔═╡ e1552414-e701-42b5-8eaf-21ae04a829a8
md"""
**Prior for the coefficient ``p(\boldsymbol{\beta}_1)``**
Since ``\boldsymbol{\beta}_1\in \mathbb{R}^D`` is an unconstrained real vector, a suitable prior choice is a D-dimensional multivariate Gaussian (or other similar distributions such as Student-``t``):
$$p(\boldsymbol{\beta}_1) = \mathcal N_{D}(\mathbf m_0^{\boldsymbol{\beta}_1}, \mathbf V_0^{\boldsymbol{\beta}_1}),$$
where the hyper-parameters are usually set data-independently:
* ``\mathbf{m}_0^{\boldsymbol{\beta}_1}=\mathbf{0}``, the prior encourages the posterior to shrink towards zero, which has a regularisation effect
* ``\mathbf{V}_0^{\boldsymbol{\beta}_1} = v_0 \mathbf{I}_D``, *i.e.* a diagonal matrix with a common variance ``v_0``;
* the common variance ``v_0`` is often set to a large number, *e.g.* ``v_0=10^2``, to impose a vague prior.
**Prior for the observation variance ``p(\sigma^2)``**
Since ``\sigma^2 >0`` is a positive real number, a good choice is a real-valued distribution truncated at zero, such as the ``\texttt{Half-Cauchy}`` (some ``\texttt{Half-Cauchy}`` distributions are plotted below):
$$p(\sigma^2) = \texttt{HalfCauchy}(s_0^{\sigma^2});$$
* Cauchy distributions have heavy tails, making them suitable for expressing uncertain beliefs about the parameter. The hyperparameter ``s_0^{\sigma^2}`` controls the tail: the larger the scale parameter ``s_0^{\sigma^2}``, the weaker the prior.
* ``s_0^{\sigma^2} > 2`` is usually good enough.
"""
# ╔═╡ 9387dcb4-3f4e-4ec7-8393-30a483e00c63
let
plot(-5:0.1:25, (x) -> pdf(truncated(Cauchy(0, 2), lower=0), x), lw=1.5, label=L"\texttt{HalfCauchy}(2.0)")
plot!(-5:0.1:25, (x) -> pdf(truncated(Cauchy(0, 4), lower=0), x), lw=1.5, label=L"\texttt{HalfCauchy}(4.0)")
plot!(-5:0.1:25, (x) -> pdf(truncated(Cauchy(0, 6), lower=0), x), lw=1.5, label=L"\texttt{HalfCauchy}(6.0)")
plot!(-5:0.1:25, (x) -> pdf(truncated(Cauchy(0, 10), lower=0), x), lw=1.5, label=L"\texttt{HalfCauchy}(10.0)")
end
# ╔═╡ d3f4ac7b-1482-4840-b24b-d08066d1d70c
md"""
To put everything together, a fully-specified (data independent) Bayesian model is summarised below.
!!! infor "Bayesian linear regression"
```math
\begin{align}
\text{Priors: }\;\;\;\;\;\;\beta_0 &\sim \mathcal{N}(m_0^{\beta_0}, v_0^{\beta_0})\\
\boldsymbol{\beta}_1 &\sim \mathcal{N}(\mathbf{m}_0^{\boldsymbol{\beta}_1}, \mathbf{V}_0^{\boldsymbol{\beta}_1 })\\
\sigma^2 &\sim \texttt{HalfCauchy}(s_0) \\
\text{Likelihood: }\;\;\text{for } n &= 1,2,\ldots, N:\\
\mu_n &=\beta_0 + \boldsymbol{\beta}_1^\top \mathbf{x}_n \\
y_n &\sim \mathcal{N}(\mu_n, \sigma^2).
\end{align}
```
"""
# ╔═╡ bab3a19c-deb0-4b1c-a8f9-f713d66d9199
md"""
### Connection to ridge regression
"""
# ╔═╡ b433c147-f547-4234-9817-2b29e7d57219
md"""
Setting the prior's mean to zero, *i.e.* ``\mathbf{m}_0^{\beta_1}= \mathbf{0}``, may seem an arbitrary choice. However, it can be justified from a regularisation perspective. For regression problems with many predictors, the regression model can be too flexible. In some extreme cases, the regression model can be under-determined, *i.e.* the number of predictors is greater than the number of observations. For problems like this, the ordinary OLS estimator is unstable or may not even exist.
By introducing a zero-mean Gaussian prior for ``\boldsymbol{\beta}``, the log posterior density becomes:
$$\begin{align}
\ln p(\boldsymbol{\beta}|\mathbf y, \mathbf{X}) &= \ln p(\boldsymbol{\beta}) +\ln p(\mathbf y|\boldsymbol{\beta}, \mathbf{X}) + C\\
&= -\frac{1}{2v_0} \|\boldsymbol{\beta}\|^2 - \frac{1}{2\sigma^2}\sum_{n=1}^N( y_n-\mathbf{x}_n^\top \boldsymbol{\beta})^2 + C
\end{align}$$
Minimising the negative log posterior (multiplied by the constant ``\sigma^2``) gives:
```math
\hat{\boldsymbol{\beta}}_{\text{ridge}} \leftarrow \arg\min_{\boldsymbol{\beta}}\frac{\lambda}{2} \|\boldsymbol{\beta}\|^2 + \frac{1}{2}\sum_{n=1}^N( y_n-\mathbf{x}_n^\top \boldsymbol{\beta})^2,
```
where ``\lambda \triangleq \frac{\sigma^2}{v_0}``; the resulting estimator is called the **ridge estimator**, and the corresponding regression model is the Frequentist's **ridge regression**. Note that the first term serves as a penalty that regularises the estimation of ``\boldsymbol{\beta}``: estimates with large magnitudes are discouraged.
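As an illustrative aside (not part of the original derivation), the ridge objective above also has a closed-form minimiser, ``\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\lambda\mathbf{I} + \mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}``, which can be sketched on made-up data as:
```julia
using LinearAlgebra, Random
Random.seed!(1)
# made-up data: 30 observations, 5 predictors
X = randn(30, 5)
y = X * ones(5) .+ 0.5 .* randn(30)
λ = 1.0                                  # penalty strength, λ = σ²/v₀
β_ridge = (X' * X + λ * I) \ (X' * y)    # minimiser of the penalised least-squares objective
```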
"""
# ╔═╡ 89414966-5447-4133-80cb-0933c8b0d5d0
Foldable("Derivation details.", md"
The prior is
$$p(\boldsymbol{\beta}) = \mathcal{N}(\mathbf{0}, v_0 \mathbf{I}) = \frac{1}{\sqrt{(2\pi)^d |v_0 \mathbf{I}|}} \text{exp}\left (-\frac{1}{2v_0 }\boldsymbol{\beta}^\top \boldsymbol{\beta} \right ).$$
Taking the log and adding the log-likelihood function, we have
```math
\begin{align}
\ln p(\boldsymbol{\beta}|\mathbf y, \mathbf{X}) &= \ln p(\boldsymbol{\beta}) +\ln p(\mathbf y|\boldsymbol{\beta}, \mathbf{X}) + C\\
&= -\frac{1}{2v_0}\boldsymbol{\beta}^\top \boldsymbol{\beta} - \frac{1}{2\sigma^2}(\mathbf y-\mathbf{X} \boldsymbol{\beta})^\top (\mathbf y-\mathbf{X} \boldsymbol{\beta}) + C\\
&=-\frac{1}{2v_0} \|\boldsymbol{\beta}\|^2 - \frac{1}{2\sigma^2}\sum_{n=1}^N( y_n-\mathbf{x}_n^\top \boldsymbol{\beta})^2 + C.
\end{align}
```
")
# ╔═╡ 824aa830-c958-4063-9ef7-39eeb743fc06
md"""
### Posterior distribution in analytical form*
"""
# ╔═╡ d825e29d-3e0b-4c25-852f-0c9a544aa916
md"""
After specifying the Bayesian model, what is left is to apply Bayes' rule to find the posterior distribution:
```math
p(\boldsymbol{\beta}, \sigma^2|\mathcal{D}) \propto p(\boldsymbol{\beta}, \sigma^2) p(\mathbf{y}|\boldsymbol{\beta}, \sigma^2,\mathbf{X}),
```
where ``\mathcal{D}`` denotes all the observed data *i.e.* ``\mathcal{D} = \{\mathbf{X}, \mathbf{y}\}``, and ``\boldsymbol{\beta}^\top = [\beta_0, \boldsymbol{\beta}_1]`` denotes the concatenated regression parameter vector. Due to the independence assumption, the joint prior for ``\boldsymbol{\beta}`` can be formed as a ``D+1`` dimensional Gaussian with mean ``\mathbf{m}_0^\top = [m_0^{\beta_0}, \mathbf{m}_0^{\beta_1}]`` and variance ``\mathbf{V}_0 = \text{diag}\{v_0^{\beta_0}, \mathbf{V}_0^{\boldsymbol{\beta}_1}\}.``
Unfortunately, the full joint posterior distribution has no closed-form solution, and an approximation method such as MCMC is required for a full-scale analysis. However, if the observation variance ``\sigma^2`` is assumed known, the *conditional posterior* ``p(\boldsymbol{\beta}|\mathcal{D}, \sigma^2)`` can be computed exactly. It can be shown that this posterior admits a multivariate Gaussian form:
```math
p(\boldsymbol{\beta}|\mathcal{D}, \sigma^2) =\mathcal{N}_{D+1}(\mathbf{m}_N, \mathbf{V}_N),
```
where
```math
\mathbf{m}_N = \mathbf{V}_N\left (\mathbf{V}_0^{-1}\mathbf{m}_0 +\frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{y}\right ) ,\;\;\mathbf{V}_N =\left (\mathbf{V}_0^{-1} + \frac{1}{\sigma^2}\mathbf{X}^\top \mathbf{X}\right )^{-1}.
```
This is one of the very few models whose posterior can be evaluated in closed form.
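The update equations are easy to implement directly; here is a minimal sketch using simulated toy data and an assumed known ``\sigma^2``:
```julia
using LinearAlgebra, Random
Random.seed!(1)
# simulated toy data for a simple linear regression
x = rand(50)
y = 3 .+ 3 .* x .+ sqrt(0.5) .* randn(50)
X = [ones(length(x)) x]                        # augmented design matrix
σ² = 0.5                                       # observation variance, assumed known
m₀, V₀ = zeros(2), 10^2 * Matrix(1.0I, 2, 2)   # prior mean and covariance
V_N = inv(inv(V₀) + (X' * X) / σ²)
m_N = V_N * (V₀ \ m₀ + (X' * y) / σ²)          # posterior mean, between m₀ and the MLE
```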
"""
# ╔═╡ f6cedb98-0d29-40f0-b9f3-430dd283fa36
md"""
**Demonstration:**
An animation is created below to demonstrate the posterior update equation. The toy dataset is reused here. Recall the true parameters are ``\beta_0=\beta_1=3``. A zero-mean vague Gaussian prior is placed on ``\boldsymbol{\beta}``:
```math
p\left(\begin{bmatrix}\beta_0\\ \beta_1\end{bmatrix}\right) = \mathcal{N}\left (\begin{bmatrix}0\\ 0\end{bmatrix}, \begin{bmatrix}10^2& 0 \\ 0 & 10^2\end{bmatrix}\right).
```
The posterior distribution is then updated sequentially with the first 10 observations. As can be observed, the prior distribution is initially circular and covers a wide range of possible values. As more data are observed, the posterior quickly concentrates around the true parameter values ``[3,3]^\top``. Also, note the shrinking posterior variance (or increasing estimation precision) as more data are observed.
"""
# ╔═╡ 71bb2994-7fd2-4e11-b2f1-d88b407f86c1
let
xs = range(-10, 10, 200)
ys = range(-10, 10, 200)
m₀, V₀ = zeros(2), 10^2 * Matrix(1.0I,2,2)
prior = heatmap(xs, ys, (x,y)-> pdf(MvNormal(m₀, V₀), [x,y]), levels=20, colorbar=false , fill=true, ratio=1, color= :jet1, xlim=[-10, 10], xlabel=L"\beta_0", ylabel=L"\beta_1")
# lik1 = heatmap(xs, ys, (x,y)-> pdf(Normal(x + y * X[1] , sqrt(σ²)), yy[1]), levels=10, colorbar=false, ratio=1, color= :jet1, xlim=[-15, 15], xlabel=L"\beta_0", ylabel=L"\beta_1")
function seq_update(x, y, m0, V0, σ²)
xx = [1 x]
mn = m0 + V0 * xx'* (dot(xx, V0, xx') + σ²)^(-1)*(y - dot(xx, m0) )
Vn = inv(1/σ²* xx'* xx + inv(V0))
return mn[:], Symmetric(Vn)
end
posts = [prior]
anim = @animate for i in 1:10
post = heatmap(xs, ys, (x,y)-> pdf(MvNormal(m₀, V₀), [x,y]), levels=20, colorbar=false , fill=true, ratio=1, color= :jet1, xlim=[-10, 10], xlabel=L"\beta_0", ylabel=L"\beta_1", title="Update with " *L"N=%$(i-1)"*" data")
m₀, V₀ = seq_update(X[i], yy[i], m₀, V₀, σ²)
push!(posts, post)
end
gif(anim, fps=1)
end
# ╔═╡ d02d5490-cf53-487d-a0c6-651725600f52
md"""
**Interpretation:**
The posterior's parameters provide some insight into what the Bayesian computation is doing. If we assume the matrix inverse ``(\mathbf{X}^\top\mathbf{X})^{-1}`` exists,
the posterior's mean can be rewritten as
```math
\mathbf{m}_N = \left (\mathbf{V}_0^{-1} + \frac{1}{\sigma^2}\mathbf{X}^\top \mathbf{X}\right )^{-1}\left (\mathbf{V}_0^{-1}\mathbf{m}_0 +\frac{1}{\sigma^2}\mathbf{X}^\top \mathbf{X}\hat{\boldsymbol{\beta}}\right ) .
```
Defining ``\tilde{\mathbf{V}}^{-1} = \frac{1}{\sigma^2}\mathbf{X}^\top \mathbf{X}``, the above formula becomes
```math
\mathbf{m}_N = \left (\mathbf{V}_0^{-1} + \tilde{\mathbf{V}}^{-1}\right )^{-1}\left (\mathbf{V}_0^{-1}\mathbf{m}_0 +\tilde{\mathbf{V}}^{-1}\hat{\boldsymbol{\beta}}\right ) .
```
Therefore, the posterior mean is a matrix-weighted average between the prior guess ``\mathbf{m}_0`` and the MLE estimator ``\hat{\boldsymbol{\beta}}``; and the weights are ``\mathbf{V}_0^{-1}``, the precision of the prior guess, and ``\tilde{\mathbf{V}}^{-1}`` the precision for the MLE. If the prior mean is zero, *i.e.* ``\mathbf{m}_0 =\mathbf{0}``, the posterior mean as an average will shrink towards ``\mathbf{0}``.
This can be better understood by considering a degenerate example. Assume ``D=1``, and that both the intercept and the observation noise's variance are known, say ``\beta_0=0, \sigma^2=1``. The posterior then reduces to
```math
\begin{align}
m_N &= \frac{v_0^{-1}}{v_0^{-1} + \tilde{v}^{-1}}m_0 + \frac{\tilde{v}^{-1} }{v_0^{-1} + \tilde{v}^{-1}}\hat{\beta}_1\\
v_N &= \frac{1}{ v_0^{-1} + \tilde{v}^{-1}}
\end{align}
```
where ``\tilde{v}^{-1} = \sum_n x_n^2`` by definition.
Note that if we assume ``m_0=0``, and the prior precision ``v_0^{-1}`` gets large, say ``v_0^{-1} \rightarrow \infty`` (in other words, we strongly believe the slope is zero), the posterior mean ``m_N`` will get closer to zero: ``m_N\rightarrow m_0=0``.
Also note that the posterior variance ``v_N`` is smaller than the prior variance ``v_0``, which makes sense since conditioning on data reduces the estimation uncertainty.
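A tiny numerical sketch of this precision-weighted average (with made-up numbers, assuming ``m_0=0`` and ``\sigma^2=1``):
```julia
using Random
Random.seed!(1)
x = rand(50)
v₀_inv = 1 / 10^2            # weak prior precision
ṽ_inv = sum(abs2, x)         # "data" precision: Σₙ xₙ² (since σ² = 1)
β̂₁ = 3.0                     # a hypothetical MLE of the slope
m_N = (v₀_inv * 0 + ṽ_inv * β̂₁) / (v₀_inv + ṽ_inv)   # lies between m₀ = 0 and β̂₁
v_N = 1 / (v₀_inv + ṽ_inv)                            # smaller than v₀
```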
"""
# ╔═╡ 59dd8a13-89c6-4ae9-8546-877bb7992570
md"""
## Implementation in `Turing`
Now we move on to show how to implement the modelling and inference in `Turing`. For simplicity, we first consider simple linear regression, *i.e.* regression with one predictor, on simulated data, and then move on to a multiple regression example using a real-world dataset.
### Simple linear regression
The corresponding Bayesian simple linear regression model is a specific case of the general model in which ``D=1``. The model can be specified as:
```math
\begin{align}
\text{Priors: }\;\;\;\;\;\;\beta_0 &\sim \mathcal{N}(0, v_0^{\beta_0})\\
\beta_1 &\sim \mathcal{N}(0, v_0^{\beta_1})\\
\sigma^2 &\sim \texttt{HalfCauchy}(s_0) \\
\text{Likelihood: }\;\;\text{for } n &= 1,2,\ldots, N:\\
\mu_n &=\beta_0 + \beta_1 x_n \\
y_n &\sim \mathcal{N}(\mu_n, \sigma^2).
\end{align}
```
Note that we have simply replaced the multivariate assumptions on ``\beta_1`` with their univariate equivalents.
"""
# ╔═╡ 632575ce-a1ce-4a36-95dc-010229367446
md"""
The model can be "translated" to `Turing` literally. Note that ``\texttt{HalfCauchy}`` distribution is a Cauchy distribution truncated from zero onwards, which can be implemented in `Julia` by:
```julia
truncated(Cauchy(0, s₀), lower=0) # HalfCauchy distribution with mean 0 and scale s₀
```
"""
# ╔═╡ c0f926f1-85e6-4d2c-8e9a-26cd099fd600
md"""
In terms of the prior parameters, we have set weak priors as default. For example, the ``\texttt{HalfCauchy}(s_0=5)`` prior for ``\sigma^2`` easily covers the true value, which is ``0.5``.
And the prior variances of the regression parameters are set to ``v_0^{\beta_0}= v_0^{\beta_1}=10^2``, leading to a very vague prior (and the true parameters are well covered within the prior's density area).
"""
# ╔═╡ e9bb7a3c-9143-48c5-b33f-e7d6b48cb224
@model function simple_bayesian_regression(Xs, ys; v₀ = 10^2, V₀ = 10^2, s₀ = 5)
# Priors
# Gaussian is parameterised with sd rather than variance
β0 ~ Normal(0, sqrt(v₀))
β ~ Normal(0, sqrt(V₀))
# Half-Cauchy prior for the observation variance
σ² ~ truncated(Cauchy(0, s₀), lower=0)
# calculate f(x) = β₀ + βx for all observations
# use .+ to broadcast the intercept to all
μs = β0 .+ β * Xs
# Likelihood
for i in eachindex(ys)
# Gaussian in `Distributions.jl` is parameterised by std σ rather than variance
ys[i] ~ Normal(μs[i], sqrt(σ²))
end
end
# ╔═╡ 1ef001cc-fe70-42e5-8b97-690bb725a734
md"""
Next, we use the above `Turing` model to carry out inference on the simulated dataset. A `Turing` model is first instantiated with the simulated data, and then an MCMC sampling algorithm is used to draw posterior samples.
"""
# ╔═╡ 4ae89384-017d-4937-bcc9-3d8c63edaeb5
begin
Random.seed!(100)
sim_data_model = simple_bayesian_regression(X, yy)
chain_sim_data = sample(sim_data_model, NUTS(), MCMCThreads(), 2000, 3; discard_initial=500)
end;
# ╔═╡ 1761f829-6c7d-4186-a66c-b347be7c9a15
md"""
Next, we inspect the chain and run some diagnostics to check whether it has converged.
"""
# ╔═╡ d57a7a26-a8a5-429d-b1b9-5e0499549544
summarystats(chain_sim_data)
# ╔═╡ ba598c45-a76e-41aa-bfd6-a64bb6cea875
md"""Based on the fact that `rhat < 1.01` and the `ess` count, the chain has converged well. We can also plot the chain to visually check the chain traces."""
# ╔═╡ 391a1dc8-673a-47e9-aea3-ad79f366460d
md"""
**Result analysis**
"""
# ╔═╡ d3d8fe25-e674-42de-ac49-b83aac402e2d
md"Recall the true parameters are ``\beta_0=3, \beta_1=3, \sigma^2=0.5``. And the inferred posteriors are summarised below.
"
# ╔═╡ 748c1ff2-2346-44f4-a799-75fb2002c6fc
describe(chain_sim_data)[1]
# ╔═╡ ae720bcd-1607-4155-9a69-bfb6944d5dde
b0, b1 = describe(chain_sim_data)[1][:, :mean][1:2];
# ╔═╡ 08f7be6d-fda9-4013-88f5-92f1b5d26336
md"""
The posterior means of the regression parameters are around $(round(b0; digits=2)) and $(round(b1; digits=2)), which are almost identical to the OLS estimates. This is expected, since very weakly-informative priors have been used.
!!! info "Bayesian inference under weak priors"
As a rule of thumb, when non-informative or weakly-informative priors are used, the Bayesian posterior mean should be similar to the Frequentist estimator.
"""
# ╔═╡ 40daa902-cb85-4cda-9b9f-7a32ee9cbf1c
describe(chain_sim_data)[2]
# ╔═╡ ab77aef9-920c-423a-9ce0-2b5adead1b7f
begin
chain_sim_data_ = replacenames(chain_sim_data, Dict("σ²" => L"\sigma^2"))
density(chain_sim_data_)
end
# ╔═╡ b1d7e28a-b802-4fa1-87ab-7df3369a468a
md"""
The above density plots show the posterior distributions. It can be observed that the true parameters all lie well within the credible ranges.
The following diagram shows the model ``\mu = \beta_0 + \beta_1 x`` estimated by the Bayesian method:
* The thick red line shows the posterior mean of the Bayesian model
* The lighter lines are posterior samples, which also indicate the uncertainty of the posterior distribution (the true model lies within the range of the predictions)
* The posterior mean is simply the average of all the posterior samples.
"""
# ╔═╡ 4293298e-52a5-40ca-970d-3f17c2c89adb
let
parms_turing = describe(chain_sim_data)[1][:, :mean]
β_samples = Array(chain_sim_data[[:β0, :β]])
pred(x) = parms_turing[1:2]' * [1, x]
plt = scatter(X, yy, xlabel=L"x", ylabel=L"y", label="")
plot!(0:0.1:1, pred, lw=2, label=L"\mathbb{E}[\mu|\mathcal{D}]", legend=:topleft)
plot!(0:0.1:1, (x) -> β₀ + β₁*x, lw=2, label="True model", legend=:topleft)
plot!(0:0.1:1, (x) -> β_samples[1, 1] + β_samples[1, 2] *x, lw=0.5, lc=:gray, label=L"\mu^{(r)} \sim p(\mu|\mathcal{D})", legend=:topleft)
for i in 2:100
plot!(0:0.1:1, (x) -> β_samples[i, 1] + β_samples[i, 2] *x, lw=0.15, lc=:gray, label="", legend=:topleft)
end
plt
end
# ╔═╡ 860c4cf4-59b2-4036-8a30-7fbf44b18648
md"""
#### Predictive checks
To make sure our Bayesian model assumptions make sense, we should always carry out predictive checks. Such checks are relatively straightforward with `Turing`. Recall the procedure is to first simulate multiple pseudo samples ``\{\mathbf{y}_{pred}^{(r)}\}`` based on the posterior predictive distribution
```math
\mathbf{y}_{pred} \sim p(\mathbf{y}|\mathcal{D}, \mathbf{X}),
```
and then compare the empirical distribution of the simulated data against the observed data. If the model assumptions are reasonable, we should expect the observed data to lie well within the range of the simulated data.
"""
# ╔═╡ 74a9a678-60b9-4e3f-97a2-56c8fdc7094f
md"""
To simulate the pseudo data, we first instantiate a dummy `Turing` model with the targets filled with `missing` values, and then use the `predict()` method to simulate the missing data.
"""
# ╔═╡ 435036f6-76fc-458b-b0eb-119de02eabb7
pred_y_matrix = let
# define the predictive model by passing `missing` targets y such that the values will be samples
pred_model = simple_bayesian_regression(X, Vector{Union{Missing, Float64}}(undef, length(yy)))
# simulate the predictive pseudo observations
predict(pred_model, chain_sim_data) |> Array
end;
# ╔═╡ 6eff54f2-d2e0-4c18-8eda-3be3124b16a0
md"""
**Kernel density estimation** check. One possible visual check is to use kernel density estimation (KDE), which is demonstrated below for our model. It can be seen that the simulated data agree with the observed data.
"""
# ╔═╡ f11a15c9-bb0b-43c1-83e6-dce55f2a772f
let
pred_check_kde=density(yy, lw=2, label="Observed KDE", xlabel=L"y", ylabel="density")
for i in 1:20
label_ = i ==1 ? "Simulated KDE" : ""
density!(pred_y_matrix[i,:], lw=0.2, lc=:grey, label=label_)
end
pred_check_kde
end
# ╔═╡ 03b571e6-9d35-4123-b7bb-5f3b31558c9e
md"""
**Summary statistics** check. Another common visual check is to plot summary statistics of the simulated data, such as the mean and standard deviation (std). Again, the simulated data are similar to the observed.
"""
# ╔═╡ 7cffa790-bf2d-473f-95cc-ed67802d7b4f
let
# plot one third of the samples for a cleaner visualisation
first_N = Int(floor(size(pred_y_matrix)[1]/3))
df_pred_stats = DataFrame(mean= mean(pred_y_matrix, dims=2)[1:first_N], logstd = log.(std(pred_y_matrix, dims=2)[1:first_N]))
@df df_pred_stats scatter(:mean, :logstd, label="Simulated", ms=3, malpha=0.3, xlabel=L"\texttt{mean}(\mathbf{y})", ylabel=L"\ln(\texttt{std}(\mathbf{y}))")
scatter!([mean(yy)], [log(std(yy))], label="Observed")
end
# ╔═╡ 21aaa2db-6df0-4573-8951-bdfd9b6be6f9
md"""
### A multiple linear regression example
Consider the *Advertising* dataset which is described in the book [Introduction to Statistical Learning](https://hastie.su.domains/ISLR2/ISLRv2_website.pdf). The dataset records how advertising on TV, radio, and newspaper affects sales of a product.
"""
# ╔═╡ b1401e7d-9207-4e31-b5ce-df82c8e9b069
begin
Advertising = DataFrame(CSV.File(download("https://www.statlearning.com/s/Advertising.csv")))
first(Advertising, 5)
end
# ╔═╡ 241faf68-49b2-404b-b5d5-99061d1dd2a7
md"""
First, we plot the data to get a feel for the dataset. Judging by the correlations, both *TV* and *radio* appear to be effective advertising channels, while the correlation between *newspaper* and *sales* seems rather weak.
$(begin
@df Advertising cornerplot([:sales :TV :radio :newspaper], compact = true)
end)
"""
# ╔═╡ 796ad911-9c95-454f-95f4-df5370b2496a
md"""
To understand the effects of the three different advertising methods, we can apply **multiple linear regression**. The multiple linear regression is formulated as:
```math
\texttt{sales} = \beta_0 + \beta_1 \times \texttt{TV} + \beta_2 \times \texttt{radio} + \beta_3 \times \texttt{newspaper} + \varepsilon
```
"""
# ╔═╡ ce97fbba-f5d0-42de-8b67-827c3478d8b0
md"A Frequentist's model is fitted by using `GLM` as a reference first."
# ╔═╡ f3b6f3ab-4304-4ef4-933c-728841c17998
ols_advertising = lm(@formula(sales ~ TV+radio+newspaper), Advertising)
# ╔═╡ ab51e2c3-dbd4-43e1-95b5-a875954ac532
md"According to the OLS fit, we can observe that the newspaper's effect is not statistically significant since its lower and upper 95% confidence interval encloses zero.
We now move on to fit a Bayesian model. A `Turing` multiple linear regression model is specified below. Note that the model is almost the same as the simple regression model. The only difference is we allow the number of predictors to vary.
"
# ╔═╡ e1db0ee1-1d91-49c4-be6c-1ea276750bfc
# Bayesian linear regression.
@model function general_linear_regression(X, ys; v₀ = 10^2, V₀ = 10^2, s₀ = 5)
# Set variance prior.
σ² ~ truncated(Cauchy(0, s₀), 0, Inf)
# Set intercept prior.
intercept ~ Normal(0, sqrt(v₀))
# Set the priors on our coefficients.
nfeatures = size(X, 2)
coefficients ~ MvNormal(nfeatures, sqrt(V₀))
# Calculate all the mu terms.
μs = intercept .+ X * coefficients
for i in eachindex(ys)
ys[i] ~ Normal(μs[i], sqrt(σ²))
end
end
# ╔═╡ 80727f81-2440-4da5-8a8b-38ebfdc4ddc9
md"We then fit the Bayesian model with the advertisement dataset:"
# ╔═╡ 929330bf-beb5-4e42-8668-2a10adf13972
chain_adv = let
xs = Advertising[:,2:4] |> Array # |> cast the dataframe to an array
advertising_bayes_model = general_linear_regression(xs, Advertising.sales)
Random.seed!(100)
chain_adv = sample(advertising_bayes_model, NUTS(), MCMCThreads(), 2000, 4)
replacenames(chain_adv, Dict(["coefficients[1]" => "TV", "coefficients[2]" => "radio", "coefficients[3]" => "newspaper"]))
end;
# ╔═╡ 9668ab51-f86d-4af5-bf1e-3bef7081a53f
summarystats(chain_adv)
# ╔═╡ ead10892-a94c-40a4-b8ed-efa96d4a32b8
describe(chain_adv)[2]
# ╔═╡ 6a330f81-f581-4ccd-8868-8a5b22afe9b8
md"As can be seen from the summary, the Bayesian results are very similar to the Frequentists again: the posterior means are close to the OLS estimators. By checking the posterior's 95% credible intervals, we can conclude again newspaper (with a 95% credible interval between -0.013 and 0.011) is not an effective method, which is in agreement with the frequentist's result. The trace and density plot of the posterior samples are listed below for reference."
# ╔═╡ b1f6c262-1a2d-4973-bd1a-ba363bcc5c41
plot(chain_adv)
# ╔═╡ 659a3760-0a18-4a95-8168-fc6ca237c4d5
md"""
## Extensions
For both examples we have seen so far, the Bayesian method and the frequentist's method return very similar results, so the reader may wonder why bother learning Bayesian inference at all. The benefit of the Bayesian approach lies in its flexibility: by modifying the model as they see fit, modellers can carry out statistical inference like statisticians. With the help of `Turing`, the modeller only needs to do the modelling and can leave the computation to the powerful MCMC inference engine.
### Heterogeneous observation σ
To demonstrate the flexibility of the Bayesian approach, we are going to make improvements to the Advertising data's regression model. Here, we consider the effect of *TV* on *sales* only.
"""
# ╔═╡ 9cb32dc0-8c0c-456e-bbcb-7ff6ae63da60
ols_tv = lm(@formula(sales ~ TV), Advertising);
# ╔═╡ 2e7f780e-8850-446e-97f8-51ec26f5e36a
let
plt=@df Advertising scatter(:TV, :sales, xlabel="TV", ylabel="Sales", label="")
xis = [25, 150, 280]
σ²0 = [5, 12, 30]
pred(x) = coef(ols_tv)' * [1, x]
for i in 1:length(xis)
x = xis[i]
μi = pred(x)
xs_ = μi-2*sqrt(σ²0[i]):0.01:μi+2*sqrt(σ²0[i])
ys_ = pdf.(Normal(μi, sqrt(σ²0[i])), xs_)
ys_ = 20 * ys_ ./ maximum(ys_)
plot!(ys_ .+x, xs_, c=2, label="", linewidth=2)
end
plt
end
# ╔═╡ 65d702d6-abec-43a1-89a8-95c30b3e748a
md"It can be observed that the observation noise's scale ``\sigma^2`` is not constant across the horizontal axis. With a larger investment on TV, the sales are increasing but also the variance of the sales (check the two Gaussians' scales at two ends of the axis). Therefore, the old model which assumes a constant observation variance scale is not a good fit for the data.
"
# ╔═╡ 611ad745-f92d-47ac-8d58-6f9baaf3077c
let
# error = Advertising.sales - [ones(length(Advertising.TV)) Advertising.TV] * coef(ols_tv)
# σ²_ml = sum(error.^2) / (length(Advertising.sales)-2)
pred(x) = coef(ols_tv)' * [1, x]
@df Advertising scatter(:TV, :sales, xlabel="TV", ylabel="Sales", label="")
plot!(0:1:300, pred, lw=2, lc=2, label="OLS", legend=:topleft)
test_data = DataFrame(TV= 0:300)
pred_ = predict(ols_tv, test_data, interval = :prediction, level = 0.9)
# plot!(0:1:300, (x) -> pred(x) + 2 * , lw=2, label="OLS", legend=:topleft)
# plot!(0:1:300, pred, lw=2, label="OLS", legend=:topleft)
plot!(0:300, pred_.prediction, linewidth = 0.1,
ribbon = (pred_.prediction .- pred_.lower, pred_.upper .- pred_.prediction), label=L"90\% "* " OLS prediction interval")
end
# ╔═╡ ddb0d87e-cf0a-4b09-9bb6-4ee267fcdb9d
md"""
The OLS-estimated model under the old constant-variance assumption is plotted above. It can be observed that the prediction interval is too wide at the lower end of the input scale but too narrow at the upper end.
"""
# ╔═╡ 1954544c-48b4-4997-871f-50c9bfa402b7
md"""
**A better "story".**
One possible approach to improve the model is to assume the observation scale ``\sigma`` itself is a function of the independent variable:
```math
\sigma(x) = r_0 + r_1 x,
```
If the slope ``r_1>0``, then the observation scale ``\sigma(x)`` will steadily increase over the input ``x``.
However, note that ``\sigma`` is a constrained parameter, which needs to be strictly positive. To keep ``\sigma`` positive for all inputs ``x``, we instead model an unconstrained transformation ``\rho`` of ``\sigma``.
To be more specific, we define the following deterministic transformations between ``\rho`` and ``\sigma``:
```math
\begin{align}
&\rho(x) = \gamma_0 + \gamma_1 x,\\
&\sigma = \ln(1+\exp(\rho)).
\end{align}
```
The unconstrained quantity ``\rho \in \mathbb{R}`` depends linearly on the input ``x``; the ``\texttt{softplus}`` transformation ``\ln(1+\exp(\cdot))`` is then applied to ``\rho`` so that the resulting scale is always positive. The ``\texttt{softplus}`` function is plotted below.
$(begin
plot(-10:.1:10, (x)-> log(1+exp(x)), label=L"\ln(1+\exp(x))", legend=:topleft, lw=2, size=(400,300))
end)
The ``\texttt{softplus}`` function takes any real value as input and outputs a positive number, which satisfies our requirement. Note that one could also simply use ``\exp(\cdot)`` instead, which is likewise an ``\mathbb{R} \rightarrow \mathbb{R}^+`` transformation.
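A one-line sketch of this transformation (the helper name `softplus` is ours, not defined elsewhere in this notebook):
```julia
softplus(ρ) = log(1 + exp(ρ))       # ℝ → ℝ⁺
softplus.([-5.0, 0.0, 5.0])         # ≈ [0.0067, 0.693, 5.0067]
```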
"""
# ╔═╡ 85ae4791-6262-4f07-9796-749982e00fec
md"""
The full Bayesian model is specified below:
```math
\begin{align}
\text{Priors: }\beta_0 &\sim \mathcal{N}(m_0^{\beta_0}, v_0^{\beta_0})\\
\gamma_0 &\sim \mathcal{N}(m_0^{\gamma_0}, v_0^{\gamma_0}) \\
\beta_1 &\sim \mathcal{N}(m_0^{\beta_1}, v_0^{\beta_1})\\
\gamma_1 &\sim \mathcal{N}(m_0^{\gamma_1}, v_0^{\gamma_1}) \\
\text{Likelihood: for } n &= 1,2,\ldots, N:\\
\mu_n &=\beta_0 + \beta_1 x_n \\
\rho_n &= \gamma_0 + \gamma_1 x_n\\
\sigma_n &= \log(1+\exp(\rho_n))\\
y_n &\sim \mathcal{N}(\mu_n, \sigma_n^2),
\end{align}
```
which can be translated to `Turing` as follows. For simplicity, we have assumed all priors have mean zero and variance ``10^2``.
"""
# ╔═╡ 08c52bf6-0920-43e7-92a9-275ed298c9ac
@model function hetero_var_model(X, y; v₀=100)
β₀ ~ Normal(0, sqrt(v₀))
β₁ ~ Normal(0, sqrt(v₀))
γ ~ MvNormal(zeros(2), sqrt(v₀))
μs = β₀ .+ β₁ .* X
ρs = γ[1] .+ γ[2] .* X
σs = log.(1 .+ exp.(ρs))
for i in eachindex(y)
y[i] ~ Normal(μs[i], σs[i])
end
return (;μs, ρs, σs)
end
# ╔═╡ 11021fb7-b072-46ac-8c23-f92825182c8c
begin
Random.seed!(100)
model2 = hetero_var_model(Advertising.TV, Advertising.sales; v₀=10)
chain2 = sample(model2, NUTS(), MCMCThreads(), 2000, 4)
end;
# ╔═╡ cece17c2-bacb-4b4a-897c-4116688812c6
describe(chain2)
# ╔═╡ 29175b82-f208-4fcf-9704-4a1996cc6e3c
begin
chain2_ = replacenames(chain2, Dict("β₀" => L"\beta_0", "β₁" => L"\beta_1"))
plot(chain2_)
end
# ╔═╡ 05b90ddb-8479-4fbf-a469-d5b0bf4a91c8
md"""
#### Use `generated_quantities()`
**Analyse internal hidden values.** ``\sigma`` and ``\mu`` are important for evaluating the model: they represent the observation noise scale and the regression line's expectation respectively. Recall that they depend on the independent variable *TV* and the unknown model parameters ``\gamma_0, \gamma_1, \beta_0, \beta_1`` via the deterministic functions
```math
\sigma(\texttt{TV}) = \ln(1+ \exp(\gamma_0 +\gamma_1 \texttt{TV})).
```
and
```math
\mu(\texttt{TV}) = \beta_0 +\beta_1 \texttt{TV}.
```
To gain deeper insights into the inference results, we can analyse the posterior distributions of ``\mu`` and ``\sigma`` directly: i.e.
```math
p(\sigma|\texttt{TV}, \mathcal{D}),\;\; p(\mu|\texttt{TV}, \mathcal{D}).
```
Based on the Monte Carlo principle, their posterior distributions can be approximated empirically from their posterior samples, which in turn can be computed from the MCMC samples ``\{\gamma_0^{(r)}, \gamma_1^{(r)}\}_{r=1}^R`` and ``\{\beta_0^{(r)},\beta_1^{(r)}\}_{r=1}^R`` respectively.
Take ``\sigma`` for example, its posterior can be approximated based on ``\{\gamma_0^{(r)}, \gamma_1^{(r)}\}_{r=1}^R`` by an empirical distribution:
```math
p(\sigma |\texttt{TV}, \mathcal{D}) \approx \frac{1}{R} \sum_{r=1}^R \delta_{\sigma^{(r)}_{\texttt{TV}}},
```
where
$$\sigma^{(r)}_{\texttt{TV}} =\ln(1+ \exp( \gamma_0^{(r)} +\gamma_1^{(r)} \texttt{TV}))$$ and ``\delta_{x}`` is a Dirac measure centred at ``x``. The posterior mean can then be approximated by the Monte Carlo average:
```math
\mathbb{E}[\sigma|\mathcal{D}, \texttt{TV}] \approx \frac{1}{R} \sum_{r=1}^R \sigma^{(r)}_{\texttt{TV}}.
```
To make the idea concrete, we manually implement and plot the posterior approximation below.
"""
# ╔═╡ 2d7f773e-9fa1-475d-9f74-c13908209aeb
begin
σ(x; γ) = log(1.0 + exp(γ[1] + γ[2] * x))
# obtain the posterior mean of the γ
γ_mean = describe(chain2)[1][:,:mean][3:4]
# obtain γ samples from the chain object
γ_samples = Array(group(chain2, :γ))
tv_input = 0:300
# calculate each σⁱ
σ_samples = map(γⁱ -> σ.(tv_input; γ = γⁱ), eachrow(γ_samples))
# E[σ|𝒟]: the Monte Carlo average
σ_mean = mean(σ_samples)
end;
# ╔═╡ 8a0692db-6b85-42f6-8893-4219d58b1032
let
# Plotting
plt = plot(tv_input, σ_mean, lw=3, xlabel="TV", ylabel=L"\sigma", label=L"\mathbb{E}[\sigma|\mathcal{D}]", legend=:topleft)
plot!(tv_input, σ_samples[1], lw=0.2, lc=:gray, xlabel="TV", ylabel=L"\sigma", label=L"\sigma^{(r)}\sim p(\sigma|\mathcal{D})", legend=:topleft)
for i in 2:25
plot!(tv_input, σ_samples[i], lw=0.2, lc=:gray, label="")
end
plt
end
# ╔═╡ 3c3f6378-ed20-4f40-a780-f9329ace34bc
md"""
It is not practical to implement the approximation for every single internal variable manually. Fortunately, `Turing` provides us with an easy-to-use method:
```
generated_quantities(model, chain)
```
The method takes a `Turing` model and a chain object as input arguments, and it returns samples of all the internal quantities returned at the end of the `Turing` model. For example, to obtain the posterior samples of the hidden variables ``\mu`` and ``\sigma``, simply issue the command:
"""