\documentclass[a4paper, 12pt]{article}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{bbold}
\usepackage{bm}
\usepackage{cite}
\usepackage{fullpage}
\usepackage{graphicx}
\usepackage[pagebackref=true,breaklinks=true,letterpaper=true,colorlinks,bookmarks=false]{hyperref}
\usepackage{inputenc}
\usepackage{mathtools}
\usepackage{natbib}
\usepackage{titling}
\usepackage{siunitx}
\usepackage{txfonts}
\usepackage{url}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\softmax}{softmax}
\newcommand{\expect}{\operatorname{E}\expectarg}
\DeclarePairedDelimiterX{\expectarg}[1]{[}{]}{%
\ifnum\currentgrouptype=16 \else\begingroup\fi
\activatebar#1
\ifnum\currentgrouptype=16 \else\endgroup\fi
}
\newcommand{\innermid}{\nonscript\;\delimsize\vert\nonscript\;}
\newcommand{\activatebar}{%
\begingroup\lccode`\~=`\|
\lowercase{\endgroup\let~}\innermid
\mathcode`|=\string"8000
}
\newcommand\phantomlabel[1]{\phantomsection\label{#1}}
\DeclarePairedDelimiter\abs{\lvert}{\rvert}%
\DeclarePairedDelimiter\norm{\lVert}{\rVert}%
\DeclarePairedDelimiterX{\infdivx}[2]{(}{)}{%
#1\;\delimsize\|\;#2%
}
\newcommand{\infdiv}{D_{KL}\infdivx}
\date{\today}
\title{Machine Learning Reading Notes}
\author{Brendan Duke}
\begin{document}
\maketitle
\part{Definitions}
% TODO(brendan):
\phantomlabel{backprop}
\textbf{Back propagation}
\phantomlabel{batchnorm}
\textbf{Batch Normalization}
\textbf{Deep Neural Networks (DNNs)} are engineered systems inspired by the
biological brain\citet{Goodfellow-et-al-2016-Book}.
\phantomlabel{fisher-info}
The \textbf{Fisher information matrix} is the expected value of the observed
information matrix, which is the gradient of the negative score function, or
equivalently the Hessian of the negative log-likelihood. I.e.,
\begin{align*}
F(\hat{\theta} | \theta^*) &= \expect{J(\hat{\theta} | D)} \\
&= \expect{-\nabla s(\hat{\theta})} \\
&= \expect{-\nabla^2 \log p(D \mid \hat{\theta})}
\end{align*}
where $s(\hat{\theta})$ is the score function.
\phantomlabel{kl}
\textbf{KL divergence} is given in Equation~\ref{kleqn}\citet[Chapter~3]{Goodfellow-et-al-2016-Book}.
\begin{equation}
\infdiv{P}{Q} = \varmathbb{E}_{x \sim P}\left[\log P(x) - \log Q(x)\right]
\label{kleqn}
\end{equation}
\textbf{Cross-entropy} is related to \hyperref[kl]{KL divergence} by
$H(P, Q) = H(P) + \infdiv{P}{Q}$, where $H(P)$, the \textbf{Shannon entropy}, is
$H(P) = -\varmathbb{E}_{x \sim P} \left[\log P(x)\right]$.
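As a quick numerical check that $H(P, Q) = H(P) + \infdiv{P}{Q}$ for discrete
distributions with full support, a minimal sketch (the distributions below are
hypothetical):
\begin{verbatim}
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = E_{x ~ P}[log P(x) - log Q(x)]; assumes p, q > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
# H(P, Q) = H(P) + D_KL(P || Q)
assert np.isclose(cross_entropy(p, q),
                  shannon_entropy(p) + kl_divergence(p, q))
\end{verbatim}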
\phantomlabel{LSTM}
% TODO(brendan): Diagrams and equations for LSTM. Definition for RNNs. Should
% be a distillation of Chapter 10 of Goodfellow book.
\textbf{LSTM} (Long Short Term
Memory) neural networks are a type of recurrent neural network whose
characteristic feature is the presence of a gated self-loop that allows
retention of its ``cell state'', the pre-non-linearity activations of
the previous time step~\citet[Chapter~10]{Goodfellow-et-al-2016-Book}.
Cell state is updated at each time step according to
Equation~\ref{lstm_cell_state_update}.
\begin{equation}
s_i^{(t)} = f_i^{(t)} s_i^{(t - 1)} + g_i^{(t)}
\sigma \left( b_i + \sum_j U_{i, j} x_j^{(t)} + \sum_j W_{i, j} h_j^{(t - 1)}\right)
\label{lstm_cell_state_update}
\end{equation}
The vectors $\boldsymbol{f^{(t)}}$ and $\boldsymbol{g^{(t)}}$ in
Equation~\ref{lstm_cell_state_update} also take inputs from $\boldsymbol{x^{(t)}}$
and $\boldsymbol{h^{(t - 1)}}$, with their own weight tensors and bias vectors
$\boldsymbol{U}^f$, $\boldsymbol{W}^f$ and $\boldsymbol{b}^f$, $\boldsymbol{U}^g$,
$\boldsymbol{W}^g$ and $\boldsymbol{b}^g$, respectively.
Similar gate functions exist to gate the inputs and outputs to the LSTM, as
well.
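A minimal NumPy sketch of the cell-state update in
Equation~\ref{lstm_cell_state_update}; the weights, biases and inputs are
hypothetical placeholders, and the input/output gating mentioned above is
analogous and omitted:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_state_update(s_prev, h_prev, x, params):
    """One step of the cell-state update: s_t = f * s_{t-1} + g * candidate.

    f (forget gate), g (external input gate) and the candidate input each
    have their own weights U, W and bias b, all taking x_t and h_{t-1}."""
    f = sigmoid(params["b_f"] + params["U_f"] @ x + params["W_f"] @ h_prev)
    g = sigmoid(params["b_g"] + params["U_g"] @ x + params["W_g"] @ h_prev)
    candidate = sigmoid(params["b"] + params["U"] @ x + params["W"] @ h_prev)
    return f * s_prev + g * candidate
\end{verbatim}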
\textbf{Mahalanobis Distance}
\phantomlabel{multilayer_perceptron}
\textbf{Multi-Layer Perceptrons (MLPs)} are mathematical functions mapping some
set of input values to some set of output
values\citet{Goodfellow-et-al-2016-Book}.
\textbf{Neighbourhood Components Analysis (NCA)} is a method of learning a
Mahalanobis distance metric, and can also be used in linear dimensionality
reduction\citet{NIPS2004_2566}.
The \textbf{PCKh} metric, used by the MPII Human Pose Dataset, defines a joint
estimate as matching the ground truth if the estimate lies within 50\% of the
head segment length\citet{andriluka-2d-2014-853}. The head segment length is
defined as the diagonal across the annotated head rectangle in the MPII data,
multiplied by a factor of 0.6. Details can be found by examining the MATLAB
\href{http://human-pose.mpi-inf.mpg.de/results/mpii_human_pose/evalMPII.zip}{evaluation script}
provided with the MPII dataset.
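A sketch of how PCKh@0.5 could be computed from the description above; the
variable names are illustrative and the MATLAB evaluation script remains
authoritative:
\begin{verbatim}
import numpy as np

def pckh(pred, gt, head_rect_diag, threshold=0.5):
    """Fraction of joints whose prediction lies within threshold times the
    head segment length of the ground truth.

    pred, gt: (num_joints, 2) arrays of pixel coordinates.
    head_rect_diag: diagonal of the annotated head rectangle."""
    head_size = 0.6 * head_rect_diag  # head segment length
    dists = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=1)
    return np.mean(dists <= threshold * head_size)
\end{verbatim}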
\phantomlabel{nonmax_supression}
\textbf{Non-maximum suppression} in object detection is a family of methods
used to prune an initial, typically redundant, set of candidate bounding boxes
down to a subset that corresponds to the actual objects in an
image~\citet{DBLP:conf/accv/RotheGG14}. In edge detection, non-maximum
suppression is used to suppress any pixels (i.e.\ exclude them from the set of
detected edges) that are not the maximum response in their neighbourhood.
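A minimal sketch of the standard greedy IoU-based variant for scored detection
boxes (not tied to any particular paper's implementation):
\begin{verbatim}
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) array of (x1, y1, x2, y2); returns the
    indices of the boxes that are kept."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[rest, 2] - boxes[rest, 0]) *
                 (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + areas - inter)
        # Suppress remaining boxes that overlap the kept box too much.
        order = rest[iou <= iou_threshold]
    return keep
\end{verbatim}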
The \textbf{softmax function} is a continuous differentiable version of the
argmax function, where the result is represented as a one-hot
vector\citet[Chapter~6]{Goodfellow-et-al-2016-Book}. Softmax is a way of
representing probability distributions over a discrete variable that can take
on $n$ possible values.
Formally, softmax is given by Equation~\ref{softmax_eqn}.
\begin{equation}
\textrm{softmax}{(\boldsymbol{z})}_i = \frac{e^{z_i}}{\sum_je^{z_j}}
\label{softmax_eqn}
\end{equation}
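In practice the maximum of $\boldsymbol{z}$ is subtracted before
exponentiating, which leaves the result unchanged (the shift cancels between
numerator and denominator) but avoids overflow; a minimal sketch:
\begin{verbatim}
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D array of logits."""
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)
\end{verbatim}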
\phantomlabel{rectified_linear_units}
\textbf{Rectified linear units (ReLUs)}\citet{icml2010_NairH10}
\textbf{Leaky rectified linear units (LReLUs)}\citet{maas_rectified_nonlinearities}
\begin{equation}
h^{(i)} = \max\left(w^{(i)T}x, 0.01w^{(i)T}x\right) =
\begin{cases}
w^{(i)T}x & \textrm{if } w^{(i)T}x > 0 \\
0.01w^{(i)T}x & \textrm{otherwise}
\end{cases}
\end{equation}
Leaky ReLUs have a non-zero gradient over their entire domain, and were
therefore proposed to mitigate the vanishing gradient problem.
\phantomlabel{dilated_convolutions}
\textbf{Dilated Convolutions}~\citet{DBLP:journals/corr/YuK15} can be defined as
follows.
Let $F: \mathbb{Z}^2 \rightarrow \mathbb{R}$ be a discrete
function, let $\Omega_r = {[-r, r]}^2 \cap \mathbb{Z}^2$ and let
$k: \Omega_r \rightarrow \mathbb{R}$ be a discrete filter of size
${(2r + 1)}^2$. Then the convolution operator is defined by
Equation~\ref{conv-op}.
\begin{equation}
(F * k)(\mathbf{p}) = \sum_{\mathbf{s} + \mathbf{t} = \mathbf{p}} F(\mathbf{s}) k(\mathbf{t})
\label{conv-op}
\end{equation}
Furthermore, the dilated convolution operator, denoted by $*_l$, is defined by
Equation~\ref{dilated-conv-op}.
\begin{equation}
(F *_l k)(\mathbf{p}) = \sum_{\mathbf{s} + l\mathbf{t} = \mathbf{p}} F(\mathbf{s}) k(\mathbf{t})
\label{dilated-conv-op}
\end{equation}
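A direct 1D analogue of Equation~\ref{dilated-conv-op} as a sketch; the 2D
case adds a second index, and values outside the signal are treated as zero:
\begin{verbatim}
import numpy as np

def dilated_conv1d(F, k, l=1):
    """(F *_l k)(p) = sum over {s : s + l*t = p} of F(s) k(t), where k is
    indexed by t in [-r, r] and F is zero-padded outside its support."""
    F = np.asarray(F, dtype=float)
    k = np.asarray(k, dtype=float)
    r = (len(k) - 1) // 2
    out = np.zeros_like(F)
    for p in range(len(F)):
        for t in range(-r, r + 1):
            s = p - l * t
            if 0 <= s < len(F):
                out[p] += F[s] * k[t + r]
    return out
\end{verbatim}
Setting \texttt{l = 1} recovers the ordinary convolution of
Equation~\ref{conv-op}.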
\part{Book Summaries}
\section{Machine Learning: A Probabilistic Perspective}
% TODO(brendan): citation
\subsection{Probability}
Bayes rule is:
\begin{equation}
p(Y = y | X = x) = \frac{p(Y = y, X = x)}{p(X = x)}
= \frac{p(Y = y) p(X = x | Y = y)}
{\sum_{y'} p(Y = y') p(X = x | Y = y')}
\end{equation}
\subsection{Generative Models for Discrete Data}
The \textbf{maximum a posteriori (MAP)} estimate is defined as
$\hat{y} = \textrm{argmax}_{c} p(y = c | \mathbf{x}, D)$.
A \textbf{prior} is a probability distribution assigned to each hypothesis $h
\in \mathcal{H}$ in the hypothesis space, based only on knowledge external to
the training data. E.g.\ in a hypothesis space of concepts over the numbers
under 100, a strong
prior may be assigned to ``odd numbers'' or ``even numbers'', while a weak
prior would be assigned to unintuitive concepts such as ``all powers of 2
except 32''.
The \textbf{likelihood} of a hypothesis given a set of training data is the
probability of randomly sampling exactly that set of training data given the
hypothesis, i.e.\ $p{(\mathcal{D} | h)} = {(1 / |h|)}^N$ for $N$ examples drawn
uniformly from the $|h|$ numbers consistent with $h$.
The \textbf{posterior} is:
\begin{equation}
p(h | \mathcal{D})
= \frac{p(h) p(\mathcal{D} | h)}
{\sum_{h' \in \mathcal{H}} p(h') p(\mathcal{D} | h')}
\end{equation}
where $p(\mathcal{D} | h)$ is ${(1 / |h|)}^N$ if the training data match the
hypothesis $h$, otherwise zero.
The MAP is
$\textrm{argmax}_h p(h) p(\mathcal{D} | h) = \textrm{argmax}_h \left[ \log{p(h)} + \log{p(\mathcal{D} | h)} \right]$.
In the limit of infinite training data, the log-likelihood term
$\log{p(\mathcal{D} | h)}$, which grows in magnitude with $N$, dominates the
constant log-prior term. Therefore the MAP estimate converges to the
\textbf{maximum likelihood estimator (MLE)}
$\textrm{argmax}_h \log{p(\mathcal{D} | h)}$ in the limit of infinite data,
justifying the MLE's use as an objective function.
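A concrete sketch of the prior, likelihood, posterior and MAP estimate for a
toy hypothesis space of number concepts (the hypotheses, prior and data below
are hypothetical):
\begin{verbatim}
hypotheses = {
    "even": set(range(2, 101, 2)),
    "odd": set(range(1, 101, 2)),
    "powers_of_2": {2, 4, 8, 16, 32, 64},
}
prior = {"even": 0.45, "odd": 0.45, "powers_of_2": 0.10}
data = [16, 8, 2, 64]

def likelihood(h, data):
    """p(D | h) = (1 / |h|)^N if every example lies in h, else zero."""
    extension = hypotheses[h]
    if not all(x in extension for x in data):
        return 0.0
    return (1.0 / len(extension)) ** len(data)

evidence = sum(prior[h] * likelihood(h, data) for h in hypotheses)
posterior = {h: prior[h] * likelihood(h, data) / evidence
             for h in hypotheses}
map_hypothesis = max(posterior, key=posterior.get)  # "powers_of_2"
\end{verbatim}
Despite its smaller prior, ``powers of 2'' wins because its likelihood
${(1 / 6)}^4$ dwarfs the ${(1 / 50)}^4$ of ``even numbers''.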
\section{Deep Learning~\citet{Goodfellow-et-al-2016-Book}}
\subsection{Machine Learning Basics}
Maximum Likelihood Estimation (MLE) is maximization of the log-likelihood
$\log p_{model}(y | \mathbf{x}; \mathbf{\theta})$, where $y$ is a ground truth
example, $\mathbf{x}$ is an input feature vector, and $\mathbf{\theta}$ are
model parameters.
Principal Component Analysis (PCA) involves projecting input feature vectors
$\mathbf{x}$ into a reduced-dimensionality space via multiplication by
$D \in \mathbb{R}^{n \times l}$. The PCA projection minimizes the $L_2$
reconstruction error $||\mathbf{x} - r(\mathbf{x})||_2$.
Stochastic Gradient Descent (SGD) is gradient descent over minibatches. For an
objective function $L = f(y; x, \theta)$ and a minibatch of size $m'$, at each
step $t$ we have
$\theta_t = \theta_{t - 1} - \frac{\epsilon}{m'} \sum_{i = 1}^{m'} \nabla_\theta f\left(y^{(i)}; x^{(i)}, \theta_{t - 1}\right)$.
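A minimal sketch of the corresponding update loop, with the per-example
gradient function and the data as hypothetical placeholders:
\begin{verbatim}
import numpy as np

def sgd(grad_fn, theta, data, lr=0.01, batch_size=32, num_steps=1000):
    """theta_t = theta_{t-1} - lr * (mean over minibatch of grad f)."""
    rng = np.random.default_rng(0)
    for _ in range(num_steps):
        batch = rng.choice(len(data), size=batch_size, replace=False)
        grad = np.mean([grad_fn(theta, data[i]) for i in batch], axis=0)
        theta = theta - lr * grad
    return theta
\end{verbatim}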
\part{Datasets}
\section{CIFAR-10~\citet{cifar10-website}}
\label{cifar10}
CIFAR-10~\citet{cifar10-website} consists of \num{60000} colour images of $32
\times 32$ resolution, has 10 classes and \num{6000} images per class.
\section{HMDB-51~\citet{Kuehne11}}
\label{hmdb51}
HMDB-51 contains 7000 clips extracted from movies and YouTube then manually
annotated with 51 class labels.
\section{Kinetics~\citet{kay2017kinetics}}
\label{kinetics}
The Kinetics Human Action Video Dataset~\citet{kay2017kinetics} has 400
human action classes, each with more than 400 examples. The classes are focused
on human actions, such as pouring or kissing, as opposed to activities (such as
tennis or baseball). The clips are 10s long. There are roughly \num{300000}
labelled clips in total in the dataset.
The Kinetics test set is 100 clips for each class.
\section{KITTI 2012~\citet{geiger-kitti-2012}}
\label{kitti}
The KITTI 2012 odometry benchmark consists of 22 stereo sequences, saved in
lossless PNG format. The data include colour frames, LIDAR data, camera
calibration and ground truth poses (except sequences 11 to 21, which are for
evaluation).
\section{THUMOS~2014~\citet{DBLP:journals/corr/IdreesZJGLSS16}}
\label{thumos}
THUMOS~2014~\citet{DBLP:journals/corr/IdreesZJGLSS16} consists of 101 action
classes. THUMOS action classes are from UCF-101 (Section~\ref{ucf101}). All videos are
from YouTube, and are labelled with action class and temporal span of the
action. For the 20 action classes with temporal annotations, THUMOS~2014
contains 2755 trimmed training videos and 1010 untrimmed validation videos
containing 3007 action instances. The test set contains 213 untrimmed videos,
with 3358 action instances, that are not entirely background.
The THUMOS~2015 dataset extends THUMOS~2014 with a total of 5613 positive and
background untrimmed videos.
\section{UCF-101\citet{DBLP:journals/corr/abs-1212-0402}}
\label{ucf101}
101 classes, 13k clips and 27 hours of video data in total.
\section{WMT~2014~\citet{wmt14-translation-website}}
\label{wmt2014}
WMT~2014 is a machine translation task with five language pairs:
French-English, Hindi-English, German-English, Czech-English and
Russian-English. A number of parallel and monolingual corpora are included as
training data.
The majority of the training data is taken
from~\href{http://www.statmt.org/europarl/}{Europarl}~v7, from which about
50~million words per language are used in WMT~2014. Additional training data is
taken from the~\href{http://www.casmacat.eu/corpus/news-commentary.html}{News
Commentary Parallel Corpus}, at about three million words per language.
Test data from previous years is provided, with on the order of~\num{3000}
sentences per test set.
\part{Paper Summaries}
\section{Embedded Software Stack}
\subsection{Learning to Optimize Tensor Programs~\cite{chen2018learning}}
This paper expands on the ``tensor program'' low level optimization component
of TVM, called AutoTVM. They trade off exploration/exploitation in an online
model-based operator search by using submodular optimization (local
optimization) on the subset~$S_e$ found by simulated annealing (global
optimization) of the total set of considered operator implementations,
represented with a polyhedral model. This ``exploration module'' builds a
dataset~$\mathcal{D}$ of schedules with ground truth measurements run in the
hardware environment.
The submodular optimization objective function used is,
\begin{equation}
L(S) = - \sum_{s \in S} \hat{f}(g(e, s)) + \alpha \sum_{j = 1}^m \abs{\cup_{s \in S} \{s_j\}}
\end{equation}
They trained the cost model (GBT, TreeRNN, or random search) alternating
with the exploration module, which generates the cost model's dataset. They
found that a rank objective improved over a regression objective for training
the cost model (could this perhaps be fixed with normalization? Regression
makes sense here).
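A schematic sketch of the alternating exploration/cost-model loop as described
above; this is my own paraphrase rather than the AutoTVM API, the
\texttt{cost\_model}, \texttt{mutate} and \texttt{measure} callables are
placeholders, and the diversity-aware submodular selection is omitted:
\begin{verbatim}
import math
import random

def explore_and_learn(cost_model, mutate, measure, init_schedules,
                      num_rounds=10, num_sa_steps=100, batch_size=8):
    """Alternate between (1) proposing schedules by simulated annealing
    guided by the learned cost model and (2) refitting the cost model on
    hardware measurements of the most promising proposals."""
    dataset = []                        # (schedule, measured cost) pairs
    schedules = list(init_schedules)
    for _ in range(num_rounds):
        candidates = []
        for s in schedules:
            cost = cost_model.predict(s)
            for step in range(num_sa_steps):
                s_new = mutate(s)
                cost_new = cost_model.predict(s_new)
                temperature = max(1.0 - step / num_sa_steps, 1e-6)
                accept = (cost_new < cost or
                          random.random() <
                          math.exp((cost - cost_new) / temperature))
                if accept:
                    s, cost = s_new, cost_new
            candidates.append((s, cost))
        # Measure the most promising candidates on the real hardware.
        candidates.sort(key=lambda sc: sc[1])
        batch = [s for s, _ in candidates[:batch_size]]
        dataset.extend((s, measure(s)) for s in batch)
        cost_model.fit(dataset)         # e.g., refit a GBT ranker
        schedules = batch               # keep exploring from the best found
    return min(dataset, key=lambda sm: sm[1])
\end{verbatim}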
\subsection{TVM: An Automated End-to-End Optimizing Compiler for Deep
Learning~\cite{chen2018tvm}}
They present TVM, an optimizing compiler for deep learning systems comprising
roughly 50k lines of core C++ code. TVM performs both high-level graph
optimizations, e.g., fusing conv-BN-ReLU into backend-supported fused
operators, as well as low-level kernel optimizations. The kernel optimizations
use GBTs (TreeRNNs were also tried) to produce operators that outperform even
heavily handtuned cuDNN kernels on, e.g., Titan X (as of cuBLAS v8, cuDNN v7).
\subsubsection{Future Work/Comments/Questions}
\begin{itemize}
\item Improve the ML cost model, since GBTs were used for their speed,
but it's possible that there could be advantages from more
powerful models.
\item TVM has a runtime for high-level graph operator fusion, is this
runtime also required for kernel optimizations? Presumably
kernels can be generated offline and linked into an ML
framework.
\end{itemize}
\section{Human Pose}
\subsection{DeepPose: Human Pose Estimation via Deep Neural
Networks\citet{DBLP:journals/corr/ToshevS13}}
This paper uses DNNs as a method for human pose estimation, based on the
success of~\citet{NIPS2013_5207} and~\citet{DBLP:journals/corr/GirshickDDM13} for
object detection using DNNs.
This is in contrast to the existing work in human pose estimation at the time,
which focused on explicitly designed pose models. Papers about these methods
can be found in the ``Related Work'' section of
\citet{DBLP:journals/corr/ToshevS13}.
The input to the 7-layered convolutional DNN (based on
AlexNet\citet{NIPS2012_4824}) is the full image.
\subsection{End-to-end people detection in crowded
scenes\citet{DBLP:journals/corr/StewartA15}}
This paper is focused on jointly creating a set of bounding-box predictions for
people in crowded scenes using GoogLeNet and a
\hyperref[LSTM]{recurrent LSTM layer} as a controller. Since bounding-box
predictions are generated jointly, common post-processing steps such as
\hyperref[nonmax_supression]{non-maximum suppression} are unnecessary. All
components of the system are trained end-to-end using back propagation.
\subsubsection{Motivation}
The end-to-end people detection method is contrasted with the object detection
methods of R-CNN in~\citet{DBLP:journals/corr/GirshickDDM13} and OverFeat in
\citet{DBLP:journals/corr/SermanetEZMFL13}.
\citet{DBLP:journals/corr/GirshickDDM13} and
\citet{DBLP:journals/corr/SermanetEZMFL13} rely on non-maximum suppression,
which does not use access to image information to infer bounding box positions
since non-maximum suppression acts only on bounding boxes. Also, in end-to-end
people detection, the decoding stage is learned using LSTMs, instead of using
specialized methods as in~\citet{VisualPhrases} and~\citet{TaAnSc_14:occluded}.
Early related work can be found in~\citet{Felzenszwalb:2010:ODD:1850486.1850574}
and~\citet{Leibe:2005:PDC:1068507.1069006}. Best performing object detectors at
the time were~\citet{DBLP:journals/corr/GirshickDDM13},
\citet{DBLP:journals/corr/SermanetEZMFL13}, \citet{Uijlings13},
\citet{DBLP:journals/corr/ZhangBS15} and~\citet{DBLP:journals/corr/SzegedyREA14}.
Sequence modeling is done using LSTMs as in
\citet{DBLP:journals/corr/SutskeverVL14} (used for machine translation) and
\citet{DBLP:journals/corr/KarpathyF14} (used for image captioning). The loss
function is similar to the loss function proposed in
\citet{Graves06connectionisttemporal} in that the loss function encourages the
model to make predictions in descending order of confidence.
\subsubsection{Data}
A new training set collected from public webcams, called ``Brainwash'', is
produced. Brainwash consists of 11917 images with 91146 labelled people. 1000
images are allocated for testing and validation, hence training, test and
validation sets contain 82906, 4922 and 3318 labels, respectively.
\subsubsection{Model}
A pre-trained GoogLeNet\citet{going-deeper-szegedy43022} is used to produce
encoded features as input to the LSTM\@. The GoogLeNet features are further
fine-tuned by the training process. Using GoogLeNet, a feature vector of
length 1024 is produced for each region over a $(15, 20)$ grid of regions that
covers the entire $(480, 640)$ input image. Each cell in the grid has a
receptive field of $(139, 139)$, and is trained to produce a set (with fixed
cardinality five) of distinct bounding boxes in the center $(64, 64)$ region.
$L_2$ regularization of weights in the network was removed entirely.
GoogLeNet activations are scaled down by a factor of 100 before being input to
the decoder, since decoder weights are initialized according to a uniform
distribution in $[-0.1, 0.1]$, while GoogLeNet activations are in $[-80, 80]$.
Regression predictions from GoogLeNet are scaled up by 100 before comparing
with ground truth locations (which are in $[-64, 64]$).
At each step, the LSTM for each grid cell, of which there are 300 in total,
produces a new bounding box and corresponding confidence that the bounding box
contains a person $\boldsymbol{b} = \{\boldsymbol{b}_{pos}, b_c\}$, where
$\boldsymbol{b}_{pos} = (b_x, b_y, b_w, b_h) \in \varmathbb{R}^4$ and
$b_c \in [0, 1]$. The prediction algorithm stops when the confidence drops
below a set threshold. The LSTM units have 250 memory states, no bias units,
and no output non-linearities. Each LSTM unit adds its output to the image
representation, and feeds the result into the next LSTM unit. Comparable
results are found by only presenting the image representation as input to the
first LSTM unit.
Dropout with probability 0.15 is used on the output of each LSTM\@.
\subsubsection{Inference}
The system is trained with learning rate 0.2, decreased by a factor of 0.8
every 100 000 iterations (with convergence occurring after 500 000 iterations),
and momentum 0.5. Gradient clipping is done at 2-norm of 0.1.
Images are jittered by up to 32 pixels in horizontal and vertical directions,
and scaled by a factor between 0.9 and 1.1.
At test time, per-region predictions are merged by adding a new region at each
iteration, and destroying any new bounding boxes that overlap previously
accepted bounding boxes, under the constraint that any given bounding box can
destroy at most one other bounding box. An ordering function
$\Delta': A \times C \rightarrow \varmathbb{N} \times \varmathbb{R}$ given by
$\Delta'(\boldsymbol{b}_i, \tilde{\boldsymbol{b}}_j) = (m_{ij}, d_{ij})$ where
$m_{ij}$ denotes intersection of boxes and $d_{ij}$ is $L_1$ displacement, is
minimized using the Hungarian algorithm in order to find a bipartite matching.
At each step, any new candidate that is not intersecting in the matching is
added to the set of accepted candidates.
\subsubsection{Criticism}
A new loss function that operates on sets of bounding-box predictions is
introduced. Denoting bounding boxes generated by the model as
$C = \{\tilde{\boldsymbol{b}}_i\}$, and ground truth bounding boxes by
$G = \{\boldsymbol{b}_i\}$, the loss function is given by
Equation~\ref{loss-eqn}.
\begin{equation}
L(G, C, f) = \alpha\sum_i^{|G|}
l_{pos}\left(\tilde{\boldsymbol{b}}_{pos}^i, \boldsymbol{b}_{pos}^{f(i)}\right) +
\sum_j^{|C|} l_c\left(\tilde{b}_c^j, y_j\right)
\label{loss-eqn}
\end{equation}
In Equation~\ref{loss-eqn}, $f(i)$ is an injective function $G \rightarrow C$
that assigns one ground truth to each index $i$ up to the number of ground
truths, $l_{pos}$ is the $L_1$ displacement between bounding boxes, and $l_c$
is a cross-entropy loss on a candidate's confidence that a bounding box exists,
where $y_j = \mathbb{1}\{f^{-1}(j) \neq \varnothing\}$. $\alpha$ is set to 0.03
from cross-validation.
In creating $f(i)$ in Equation~\ref{loss-eqn} to assign candidate predictions
to ground truths, the
$G \times C \rightarrow \varmathbb{R} \times \varmathbb{N} \times \varmathbb{N}$
function
$\Delta\left(\boldsymbol{b}^i, \tilde{\boldsymbol{b}}^j\right) = (o_{ij}, r_i, d_{ij})$
is used to lexicographically order pairs first by $o$, then $r$, then $d$,
where $o$ is one if there is sufficient overlap between candidate and ground
truth and zero otherwise, $r$ is the prediction's confidence, and $d$ is the
$L_1$ displacement between candidate and ground truth bounding boxes.
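As a sketch of the underlying assignment machinery (collapsing the
lexicographic ordering into a single scalar cost rather than reproducing the
paper's exact $\Delta$), SciPy's Hungarian-algorithm implementation,
\texttt{linear\_sum\_assignment}, computes the minimum-cost injective matching:
\begin{verbatim}
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(gt_boxes, cand_boxes, pairwise_cost):
    """Assign each ground-truth box to a distinct candidate by minimizing
    the summed pairwise cost; returns (gt_index, candidate_index) pairs."""
    cost = np.array([[pairwise_cost(g, c) for c in cand_boxes]
                     for g in gt_boxes])
    gt_idx, cand_idx = linear_sum_assignment(cost)
    return list(zip(gt_idx, cand_idx))
\end{verbatim}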
\subsubsection{Experiments}
With an AP (average precision) of 0.78 and EER (equal error rate) of 0.81, the
$f(i)$ produced by minimizing $\Delta$, using the
\href{https://en.wikipedia.org/wiki/Hungarian_algorithm}{Hungarian algorithm},
is found to improve on AP and EER compared with a fixed assignment of $f(i)$,
or selecting the first $k$ highest ranked ($L_\textrm{firstk}$). COUNT
(Absolute difference between number of predicted and ground truth detections)
for $f(i)$ with Hungarian was 0.76 compared with 0.74 for $L_\textrm{firstk}$.
As a baseline, Overfeat-GoogLeNet (bounding-box regression on each cell,
followed by non-maximum suppression, as in
\citet{DBLP:journals/corr/SermanetEZMFL13}) achieved 0.67, 0.71 and 1.05 AP, EER
and COUNT, respectively.
Training without finetuning GoogLeNet reduces AP by 0.29.
Removal of dropout from the output of each LSTM decreases AP by 0.011.
When using the original $2^{-4}$ $L_2$ weights regularization multiplier on
GoogLeNet only, the network was unable to train. An $L_2$ regularization
multiplier on GoogLeNet of $10^{-6}$ reduced AP by 0.03.
It is found that AP (on the validation set) increases from 0.82 to 0.85 when
using separate weights connecting each of the LSTM outputs to predicted
candidates.
\section{Regularization}
\subsection{Dropout: A Simple Way to Prevent Neural Networks from
Overfitting\citet{Srivastava:2014:DSW:2627435.2670313}}
\label{dropout}
\textbf{Dropout} is a technique used to overcome the problem of overfitting in
deep neural nets with large numbers of parameters. The idea is to train using
many ``thinned'' networks, chosen by randomly removing subsets of units and
their connections. The predictions from the thinned networks are approximately
averaged at test time by using a single, unthinned, network with reduced
weights.
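A sketch of the ``inverted dropout'' formulation, which rescales surviving
activations at training time so that the unthinned test-time network needs no
weight rescaling (a common implementation choice; the paper instead scales the
weights at test time):
\begin{verbatim}
import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=None):
    """Zero each unit with probability p_drop during training, and rescale
    the survivors so the expected activation is unchanged."""
    if not training or p_drop == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(np.shape(activations)) >= p_drop
    return activations * mask / (1.0 - p_drop)
\end{verbatim}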
\begin{itemize}
\item Existing regularization methods: stopping training as soon as
validation error stops improving, L1 and L2 regularization, and
weight sharing\citet{Nowlan:1992:SNN:148167.148169}.
\end{itemize}
\section{CNN Architecture}
\subsection{Deep Residual Learning for Image
Recognition\citet{DBLP:journals/corr/HeZRS15}}
A technique, training of residual functions, is presented for deep neural
network architecture design, which allows training of deeper networks with
improved accuracy compared to not training residual functions.
\citet{going-deeper-szegedy43022} and~\citet{DBLP:journals/corr/SimonyanZ14a} are
referred to as motivating ``very deep'' models.
% TODO(brendan): Read [1], [9] for vanishing gradient, R-CNN series for
% localization.
\subsection{MobileNetV2: Inverted Residuals and Linear
Bottlenecks~\cite{Sandler_2018_CVPR}}
They change residual blocks (with depthwise separable convolutions)
to~\emph{expand} up to a higher number of channels inside the block, where a
$3 \times 3$ depthwise (channel-wise) convolution is performed. The expansion,
and the contraction back down to the narrow ``bottleneck'' width, are done
using $1 \times 1$ convolutions, with~\emph{no nonlinearity} on the contracting
projection (hence linear bottlenecks).
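A PyTorch-style sketch of one such inverted residual block; this is an
illustrative paraphrase rather than the reference implementation, and stride~1
with equal input/output channels is assumed so that the skip connection
applies:
\begin{verbatim}
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expand -> depthwise 3x3 -> linear 1x1 projection, with a skip."""
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),  # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,      # depthwise 3x3
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),  # linear bottleneck
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)
\end{verbatim}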
\subsection{Spatial Transformer Networks~\citet{jaderberg-spatial-2015}}
Spatial Transformer Networks introduce a CNN submodule that sub-differentiably
parametrizes spatial transformations of a feature map.
The submodule takes an input~$U \in \mathbb{R}^{H \times W \times C}$ and
outputs~$\mathcal{T}_\theta(G)$, which is a warp of the output grid~$G$
parametrized by~$\theta$, and outputs~$V$. E.g., $\mathcal{T}_\theta$ could be
an affine transformation.
The technique used in the paper to make the operation sub-differentiable is to
use a kernel function of the source coordinates, i.e.,
\begin{equation*}
V_i^c = \sum_n^H \sum_m^W U^c_{nm} k(x_i^s - m\;;\; \Phi_x)
k(y_i^s - n\;;\; \Phi_y)
\end{equation*}
for all channels~$c$ and pixels~$i$. The kernel function could be, for example,
a Dirac delta function, a bilinear
function~$\max(0, 1 - |x_i^s - m|)\max(0, 1 - |y_i^s - n|)$, or a Gaussian
kernel, etc.
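A sketch of bilinear sampling of a single channel using the bilinear kernel
above; the continuous source coordinates would come from the grid
warp~$\mathcal{T}_\theta(G)$, and the names below are illustrative:
\begin{verbatim}
import numpy as np

def bilinear_sample(U, xs, ys):
    """Sample channel U of shape (H, W) at continuous source coordinates
    (xs, ys) with the kernel max(0, 1 - |x - m|) * max(0, 1 - |y - n|)."""
    U = np.asarray(U, dtype=float)
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    H, W = U.shape
    out = np.zeros_like(xs)
    for n in range(H):
        for m in range(W):
            wx = np.maximum(0.0, 1.0 - np.abs(xs - m))
            wy = np.maximum(0.0, 1.0 - np.abs(ys - n))
            out += U[n, m] * wx * wy
    return out
\end{verbatim}
The full double sum mirrors the equation; an efficient implementation would
visit only the four integer neighbours of each source coordinate, where the
kernel is non-zero.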
One disadvantage of this method is that it seems that either the true discrete
characteristic of the grid warping operation is given up (in the case of
Gaussian or bilinear kernels), or a biased estimator of the gradient
(straight-through estimator, in the case of the Dirac delta function) is used.
It would be more interesting to evaluate the source pixel-selection operation
using an expectation over all selections in general, then to evaluate different
gradient estimators for this selection.
\section{Colour Constancy}
\subsection{On Finding Gray Pixels~\cite{qian2019cvpr}}
They determine the spatial illumination map~$L_i(x, y)$ by finding the
top~$N\%$ of gray pixels (assumed based on neutral interface reflection to
have~$j \in \{s,b\}$ surface and body
reflection~$R_{j,R} = R_{j,G} = R_{j,B} = \bar{R_j}$) using a grayness
index~$GI(x, y) = \norm{[C\{\log I_R - \log\abs{I}\}, C\{\log I_B -
\log\abs{I}\}]}$ in regions of varying intensity of light, i.e.,
where~$C\{I_i\} > \epsilon, \forall i\in \{R,G,B\}$.
\section{Correspondence}
\subsection{Learning Correspondence from the Cycle-Consistency of
Time~\cite{wang2019learning}}
They present a self-supervised method for learning features trained for finding
correspondences, by following an image patch forwards and backwards in time
with a simple tracker~$\mathcal{T}$ and minimizing localization error, i.e.,
enforcing cycle-consistency.
They encode features of a patch in the initial image, then use~$\mathcal{T}$ to
follow the patch back-and-forth over cycles of different lengths.
The tracker takes normalized cross-correlation (attention) of the patch
features over the entire past/future frame as input.
They use three losses: the tracking cycle-consistency loss, a skip loss
(cycle-consistency over skipped frames), and a feature similarity loss.
\subsubsection{Comments/Questions/Future Work}
\begin{itemize}
\item Choosing what to track, e.g., not tracking background parts of
the image.
\item Improving robustness to occlusion and partial observability,
e.g., with a better strategy for finding cycles (at training
time?).
\item How should one track a patch with two objects that eventually
diverge?
\item More context for tracking (does this mean correspondences between
a sequence of frames?)
\item The feature similarity loss seems to be dominant (other two
losses are weighted by~\num{0.1}).
Replace this with mutual information maximization?
\end{itemize}
\section{Face}
\subsection{Face Alignment in Full Pose Range: A 3D Total
Solution~\cite{zhu2017face}}
\subsubsection{Content / Contributions}
\begin{itemize}
\item This paper provides a solution for ``face alignment'', which seems to have
been used to refer to registration of either 2D~\cite{kazemi2014one} or
3D models.
The paper addresses face alignment in large pose ranges (up to~\ang{90} of yaw).
\item Large pose range face alignment poses challenges:
\begin{itemize}
\item Existing pose algorithms assume that landmarks are visible.
\item Large appearance change for large poses.
\item It is difficult to annotate occluded points.
\end{itemize}
\item To solve the landmark visibility (self-occlusion) issue, this paper
proposes solving face alignment as a 3D problem, rather than 2D, by
fitting translation, scale, rotation, and expression and shape
parameters of a 3DMM\@.
\item The paper introduces Projected Normalized Coordinate Code (PNCC) and Pose
Adaptive Features (PAF) to deal with appearance changes.
The paper also uses OWPDC to prioritize 3DMM parameters.
\item The paper constructs a database of 2D face images and 3D face models, and
combines these to synthesize more than~\num{60000} profile-view images
to create artificial profile-view data with ground-truth occluded
landmark annotations.
\end{itemize}
\subsubsection{Motivation}
\begin{itemize}
\item At the time of writing, most face alignment models worked only for small
or medium angle rotations (yaw angle less than~\ang{45}).
\end{itemize}
\subsubsection{Method}
\begin{itemize}
\item Quaternion rotation representation.
\item Optimized Weighted Parameter Distance Cost (OWPDC).
The paper asserts that different 3DMM parameters should be given
different priorities during model fitting to minimize alignment error.
\end{itemize}
\subsubsection{Results}
\subsubsection{Background}
\begin{itemize}
\item \href{http://www.csc.kth.se/~vahidk/face_ert.html}{One
Millisecond Face Alignment with an Ensemble of Regression Trees}~\cite{kazemi2014one}
for a classical (gradient boosting) solution.
\item Active appearance models~\cite{Cootes98activeappearance}.
\item 3DMM, analysis-by-synthesis 3DMM fitting~\cite{egger20203dmm}.
\item Regression-based 3DMM fitting~\cite{jourabloo2017poseinvariantface}.
\item Cascaded regression~\cite{dollar2010cascadedpose}.
\end{itemize}
\subsubsection{Open Challenges}
\subsubsection{Prerequisites}
\begin{itemize}
\item Euler angles.
\item Quaternions.
\item Quadratic programming.
\end{itemize}
\subsubsection{Comments / Questions}
\begin{itemize}
\item Can a 3D model be unambiguously registered only with correspondences to
2D keypoints?
It seems single-view 3d reconstruction is ill-posed.
Why?
\end{itemize}
\subsection{Generating 3D faces using Convolutional Mesh
Autoencoders~\cite{ranjan2018generating}}
\subsubsection{Content / Contributions}
\begin{itemize}
\item Novel mesh sampling operations that preserve mesh topology at different
scales.
\item Convolve facial mesh with Chebyshev filters.
\item Compact model (CoMA).
\item Dataset of~\num{20466} meshes of~\num{12} subjects in~\num{12} facial
expressions.
\end{itemize}
\subsubsection{Motivation}
\begin{itemize}
\item Existing linear 3D face models don't capture nonlinear deformations.
\item Extending the success of hierarchical 2D CNN structure to 3D meshes.
3D convolution is a poor substitute due to its inefficiency, while
mesh convolutions can process 3D data efficiently at high resolution.
\end{itemize}
\subsubsection{Method}
\begin{itemize}
\item Can sample faces from a Gaussian distribution.
\end{itemize}
\subsubsection{Results}
\begin{itemize}
\item 50\% better performance than linear 3D face model, while CoMA has 75\%
fewer parameters.
\item Replacing FLAME's expression space with CoMA improves FLAME's
reconstruction accuracy.
\end{itemize}
\subsubsection{Background}
\subsubsection{Open Challenges}
\begin{itemize}
\item Scarcity of 3D face training data.
\end{itemize}
\subsubsection{Comments}
\section{GANs}
\subsection{Generative Adversarial Networks\citet{NIPS2014_5423}}
\label{gan}
A method of generating probability distributions is presented that can be
trained end-to-end when used with \hyperref[multilayer_perceptron]{MLPs}. In
the given method, a discriminator network $D$ is optimized to distinguish from
ground truth the samples from the generated distribution $G$. $G$ is optimized
to cause $D$ to become $\frac{1}{2}$ everywhere.
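Concretely, the two networks play the minimax game
\begin{equation*}
\min_G \max_D \;
\varmathbb{E}_{\boldsymbol{x} \sim p_{\textrm{data}}}\left[\log D(\boldsymbol{x})\right]
+ \varmathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}}\left[\log\left(1 - D\left(G(\boldsymbol{z})\right)\right)\right],
\end{equation*}
whose optimum is reached when the generated distribution matches the data
distribution, at which point the optimal discriminator outputs $\frac{1}{2}$
everywhere.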
Motivation points to~\citet{deepSpeechReviewSPM2012} and~\citet{NIPS2012_4824} as
successful uses of deep discriminative networks for classification based on
\hyperref[backprop]{back propagation}, \hyperref[dropout]{dropout} and
piecewise linear units (such as \hyperref[rectified_linear_units]{ReLUs}).
\subsection{A Style-Based Generator Architecture for Generative Adversarial
Networks~\cite{karras2019astylebased}}
\subsubsection{Content / Contributions}
\begin{itemize}
\item New StyleGAN generator architecture
\item FFHQ dataset of high-res faces
\item Perceptual path length and linear separability metrics for
evaluating GAN latent space interpolation.
\end{itemize}
\subsubsection{Motivation}
\begin{itemize}
\item Unsupervised separation of style (e.g., pose and identity) via
latent code from stochastic variation (e.g., freckles and hair)
via injected noise
\item Allow disentangling of the latent space by projecting the latent
code into an intermediate space.
Since the generator should find it easier to generate realistic
images from disentangled latent codes, there is pressure for the
learned mapping to produce a disentangled intermediate latent
space.
\end{itemize}
\subsubsection{Method}
\begin{itemize}
\item Embeds latent code into intermediate latent space using an
8-layer MLP.
An affine transformation projects the intermediate latent
vector to a set of per-feature map styles and biases.
Styles and biases scale the normalized output feature map from
each layer (i.e., control the mean and std of each independent
feature map).
\item Each style controls one convolution, since the input to the next
convolution layer depends only on the statistics determined by
the style.
The style is then overridden by the next layer's AdaIN layer.
\item Each generator layer's output sums with uncorrelated Gaussian
noise, scaled by a learned factor.
\item Mixing regularization: use different latent codes for different
layers.
Improves FID when using multiple latent codes at test time.
Desirable to use latent codes of a source to manipulate a
target (paper Figure 3 and Table 2).
\item Used truncation trick, performed in the intermediate instead of
input latent space.
\end{itemize}
\subsubsection{Results}
\subsubsection{Background}
\begin{itemize}
\item Arbitrary style transfer in real-time with adaptive instance normalization
\item Progressive growing of GANs for improved quality, stability, and
variation (same first author)
\end{itemize}
\subsubsection{Open Challenges}
\subsubsection{Prerequisites}
\begin{itemize}
\item Slerp spherical interpolation operator ``Sampling generative
networks: Notes on a few effective techniques''
\end{itemize}
\subsubsection{Comments / Questions}
\begin{itemize}
\item Is the input latent vector sampled from a uniform distribution?
\item The footnote about disentanglement studies using designed datasets
glosses over the issue that each combination of input latent
factors has to match its density in the training data.
\item Hand-wavy motivation for disentangled intermediate latent space
(easier for generator to generate from disentangled latent
variables).
\end{itemize}
\subsection{Few-Shot Adversarial Learning Of Realistic Neural Talking Head
Models~\cite{zakharov2019fewshot}}