\documentclass[a4paper, 12pt]{article}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{bbold}
\usepackage{bm}
\usepackage{cite}
\usepackage{fullpage}
\usepackage{graphicx}
\usepackage[pagebackref=true,breaklinks=true,letterpaper=true,colorlinks,bookmarks=false]{hyperref}
\usepackage{inputenc}
\usepackage{mathtools}
\usepackage{natbib}
\usepackage{titling}
\usepackage{siunitx}
\usepackage{txfonts}
\usepackage{url}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\softmax}{softmax}
\newcommand{\expect}{\operatorname{E}\expectarg}
\DeclarePairedDelimiterX{\expectarg}[1]{[}{]}{%
\ifnum\currentgrouptype=16 \else\begingroup\fi
\activatebar#1
\ifnum\currentgrouptype=16 \else\endgroup\fi
}
\newcommand{\innermid}{\nonscript\;\delimsize\vert\nonscript\;}
\newcommand{\activatebar}{%
\begingroup\lccode`\~=`\|
\lowercase{\endgroup\let~}\innermid
\mathcode`|=\string"8000
}
\newcommand\phantomlabel[1]{\phantomsection\label{#1}}
\DeclarePairedDelimiter\abs{\lvert}{\rvert}%
\DeclarePairedDelimiter\norm{\lVert}{\rVert}%
\DeclarePairedDelimiterX{\infdivx}[2]{(}{)}{%
#1\;\delimsize\|\;#2%
}
\newcommand{\infdiv}{D_{KL}\infdivx}
\date{\today}
\title{Machine Learning Reading Notes}
\author{Brendan Duke}
\begin{document}
\maketitle
\part{Definitions}
% TODO(brendan):
\phantomlabel{backprop}
\textbf{Back propagation}
\phantomlabel{batchnorm}
\textbf{Batch Normalization}
\textbf{Deep Neural Networks (DNNs)} are engineered systems inspired by the
biological brain\citet{Goodfellow-et-al-2016-Book}.
\phantomlabel{fisher-info}
The \textbf{Fisher information matrix} is the expected value of the observed
information matrix, which is the gradient of the negative score function, or
equivalently the Hessian of the negative log-likelihood. I.e.,
\begin{align*}
F(\hat{\theta} | \theta^*) &= \expect{J(\hat{\theta} | D)} \\
&= \expect{-\nabla s(\hat{\theta})} \\
&= \expect{-\nabla^2 \log p(D \mid \hat{\theta})}
\end{align*}
where $s(\hat{\theta})$ is the score function.
\phantomlabel{kl}
\textbf{KL divergence} is given in Equation~\ref{kleqn}\citet[Chapter~3]{Goodfellow-et-al-2016-Book}.
\begin{equation}
\infdiv{P}{Q} = \varmathbb{E}_{x \sim P}\left[\log P(x) - \log Q(x)\right]
\label{kleqn}
\end{equation}
\textbf{Cross-entropy} is related to \hyperref[kl]{KL divergence} by
$H(P, Q) = H(P) + \infdiv{P}{Q}$, where $H(P)$, the \textbf{Shannon entropy}, is
$H(P) = -\varmathbb{E}_{x \sim P} \left[\log P(x)\right]$.
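As a quick numerical check that $H(P, Q) = H(P) + \infdiv{P}{Q}$ for discrete
distributions with full support, a minimal sketch (the distributions below are
hypothetical):
\begin{verbatim}
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = E_{x ~ P}[log P(x) - log Q(x)]; assumes p, q > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
# H(P, Q) = H(P) + D_KL(P || Q)
assert np.isclose(cross_entropy(p, q),
                  shannon_entropy(p) + kl_divergence(p, q))
\end{verbatim}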
\phantomlabel{LSTM}
% TODO(brendan): Diagrams and equations for LSTM. Definition for RNNs. Should
% be a distillation of Chapter 10 of Goodfellow book.
\textbf{LSTM} (Long Short Term
Memory) neural networks are a type of recurrent neural network whose
characteristic feature is the presence of a gated self-loop that allows
retention of its ``cell state'', the pre-non-linearity activations of
the previous time step~\citet[Chapter~10]{Goodfellow-et-al-2016-Book}.
Cell state is updated at each time step according to
Equation~\ref{lstm_cell_state_update}.
\begin{equation}
s_i^{(t)} = f_i^{(t)} s_i^{(t - 1)} + g_i^{(t)}
\sigma \left( b_i + \sum_j U_{i, j} x_j^{(t)} + \sum_j W_{i, j} h_j^{(t - 1)}\right)
\label{lstm_cell_state_update}
\end{equation}
The vectors $\boldsymbol{f^{(t)}}$ and $\boldsymbol{g^{(t)}}$ in
Equation~\ref{lstm_cell_state_update} also take inputs from $\boldsymbol{x^{(t)}}$
and $\boldsymbol{h^{(t - 1)}}$, with their own weight tensors and bias vectors
$\boldsymbol{U}^f$, $\boldsymbol{W}^f$ and $\boldsymbol{b}^f$, $\boldsymbol{U}^g$,
$\boldsymbol{W}^g$ and $\boldsymbol{b}^g$, respectively.
Similar gate functions exist to gate the inputs and outputs to the LSTM, as
well.
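A minimal NumPy sketch of the cell-state update in
Equation~\ref{lstm_cell_state_update}; the weights, biases and inputs are
hypothetical placeholders, and the input/output gating mentioned above is
analogous and omitted:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_state_update(s_prev, h_prev, x, params):
    """One step of the cell-state update: s_t = f * s_{t-1} + g * candidate.

    f (forget gate), g (external input gate) and the candidate input each
    have their own weights U, W and bias b, all taking x_t and h_{t-1}."""
    f = sigmoid(params["b_f"] + params["U_f"] @ x + params["W_f"] @ h_prev)
    g = sigmoid(params["b_g"] + params["U_g"] @ x + params["W_g"] @ h_prev)
    candidate = sigmoid(params["b"] + params["U"] @ x + params["W"] @ h_prev)
    return f * s_prev + g * candidate
\end{verbatim}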
\textbf{Mahalanobis Distance}
\phantomlabel{multilayer_perceptron}
\textbf{Multi-Layer Perceptrons (MLPs)} are mathematical functions mapping some
set of input values to some set of output
values\citet{Goodfellow-et-al-2016-Book}.
\textbf{Neighbourhood Components Analysis (NCA)} is a method of learning a
Mahalanobis distance metric, and can also be used in linear dimensionality
reduction\citet{NIPS2004_2566}.
The \textbf{PCKh} metric, used by the MPII Human Pose Dataset, defines a joint
estimate as matching the ground truth if the estimate lies within 50\% of the
head segment length\citet{andriluka-2d-2014-853}. The head segment length is
defined as the diagonal across the annotated head rectangle in the MPII data,
multiplied by a factor of 0.6. Details can be found by examining the MATLAB
\href{http://human-pose.mpi-inf.mpg.de/results/mpii_human_pose/evalMPII.zip}{evaluation script}
provided with the MPII dataset.
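A sketch of how PCKh@0.5 could be computed from the description above; the
variable names are illustrative and the MATLAB evaluation script remains
authoritative:
\begin{verbatim}
import numpy as np

def pckh(pred, gt, head_rect_diag, threshold=0.5):
    """Fraction of joints whose prediction lies within threshold times the
    head segment length of the ground truth.

    pred, gt: (num_joints, 2) arrays of pixel coordinates.
    head_rect_diag: diagonal of the annotated head rectangle."""
    head_size = 0.6 * head_rect_diag  # head segment length
    dists = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=1)
    return np.mean(dists <= threshold * head_size)
\end{verbatim}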
\phantomlabel{nonmax_supression}
\textbf{Non-maximum suppression} in object detection is a family of methods
used to prune an initial, typically redundant, set of candidate bounding boxes
down to a subset that corresponds to the actual objects in an
image~\citet{DBLP:conf/accv/RotheGG14}. In edge detection, non-maximum
suppression is used to suppress any pixels (i.e.\ exclude them from the set of
detected edges) that are not the maximum response in their neighbourhood.
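A minimal sketch of the standard greedy IoU-based variant for scored detection
boxes (not tied to any particular paper's implementation):
\begin{verbatim}
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) array of (x1, y1, x2, y2); returns the
    indices of the boxes that are kept."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[rest, 2] - boxes[rest, 0]) *
                 (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + areas - inter)
        # Suppress remaining boxes that overlap the kept box too much.
        order = rest[iou <= iou_threshold]
    return keep
\end{verbatim}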
The \textbf{softmax function} is a continuous differentiable version of the
argmax function, where the result is represented as a one-hot
vector\citet[Chapter~6]{Goodfellow-et-al-2016-Book}. Softmax is a way of
representing probability distributions over a discrete variable that can take
on $n$ possible values.
Formally, softmax is given by Equation~\ref{softmax_eqn}.
\begin{equation}
\textrm{softmax}{(\boldsymbol{z})}_i = \frac{e^{z_i}}{\sum_je^{z_j}}
\label{softmax_eqn}
\end{equation}
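In practice the maximum of $\boldsymbol{z}$ is subtracted before
exponentiating, which leaves the result unchanged (the shift cancels between
numerator and denominator) but avoids overflow; a minimal sketch:
\begin{verbatim}
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D array of logits."""
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)
\end{verbatim}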
\phantomlabel{rectified_linear_units}
\textbf{Rectified linear units (ReLUs)}\citet{icml2010_NairH10}
\textbf{Leaky rectified linear units (LReLUs)}\citet{maas_rectified_nonlinearities}
\begin{equation}
h^{(i)} = \max\left(w^{(i)T}x, 0.01w^{(i)T}x\right) =
\begin{cases}
w^{(i)T}x & \textrm{if } w^{(i)T}x > 0 \\
0.01w^{(i)T}x & \textrm{otherwise}
\end{cases}
\end{equation}
Leaky ReLUs have a non-zero gradient over their entire domain, and were
therefore proposed to mitigate the vanishing gradient problem.
\phantomlabel{dilated_convolutions}
\textbf{Dilated Convolutions}~\citet{DBLP:journals/corr/YuK15} can be defined as
follows.
Let $F: \mathbb{Z}^2 \rightarrow \mathbb{R}$ be a discrete
function, let $\Omega_r = {[-r, r]}^2 \cap \mathbb{Z}^2$ and let
$k: \Omega_r \rightarrow \mathbb{R}$ be a discrete filter of size
${(2r + 1)}^2$. Then the convolution operator is defined by
Equation~\ref{conv-op}.
\begin{equation}
(F * k)(\mathbf{p}) = \sum_{\mathbf{s} + \mathbf{t} = \mathbf{p}} F(\mathbf{s}) k(\mathbf{t})
\label{conv-op}
\end{equation}
Furthermore, the dilated convolution operator, denoted by $*_l$, is defined by
Equation~\ref{dilated-conv-op}.
\begin{equation}
(F *_l k)(\mathbf{p}) = \sum_{\mathbf{s} + l\mathbf{t} = \mathbf{p}} F(\mathbf{s}) k(\mathbf{t})
\label{dilated-conv-op}
\end{equation}
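A direct 1D analogue of Equation~\ref{dilated-conv-op} as a sketch; the 2D
case adds a second index, and values outside the signal are treated as zero:
\begin{verbatim}
import numpy as np

def dilated_conv1d(F, k, l=1):
    """(F *_l k)(p) = sum over {s : s + l*t = p} of F(s) k(t), where k is
    indexed by t in [-r, r] and F is zero-padded outside its support."""
    F = np.asarray(F, dtype=float)
    k = np.asarray(k, dtype=float)
    r = (len(k) - 1) // 2
    out = np.zeros_like(F)
    for p in range(len(F)):
        for t in range(-r, r + 1):
            s = p - l * t
            if 0 <= s < len(F):
                out[p] += F[s] * k[t + r]
    return out
\end{verbatim}
Setting \texttt{l = 1} recovers the ordinary convolution of
Equation~\ref{conv-op}.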
\part{Book Summaries}
\section{Machine Learning: A Probabilistic Perspective}
% TODO(brendan): citation
\subsection{Probability}
Bayes rule is:
\begin{equation}
p(Y = y | X = x) = \frac{p(Y = y, X = x)}{p(X = x)}
= \frac{p(Y = y) p(X = x | Y = y)}
{\sum_{y'} p(Y = y') p(X = x | Y = y')}
\end{equation}
\subsection{Generative Models for Discrete Data}
The \textbf{maximum a posteriori (MAP)} estimate is defined as
$\hat{y} = \textrm{argmax}_{c} p(y = c | \mathbf{x}, D)$.
A \textbf{prior} is a probability distribution assigned to each hypothesis $h
\in \mathcal{H}$ in the hypothesis space, based only on knowledge external to
the training data. E.g.\ in a hypothesis space of concepts over the numbers
under 100, a strong
prior may be assigned to ``odd numbers'' or ``even numbers'', while a weak
prior would be assigned to unintuitive concepts such as ``all powers of 2
except 32''.
The \textbf{likelihood} of a hypothesis given a set of training data is the
probability of randomly sampling exactly that set of training data given the
hypothesis, i.e.\ $p{(\mathcal{D} | h)} = {(1 / |h|)}^N$ for $N$ examples drawn
uniformly from the $|h|$ numbers consistent with $h$.
The \textbf{posterior} is:
\begin{equation}
p(h | \mathcal{D})
= \frac{p(h) p(\mathcal{D} | h)}
{\sum_{h' \in \mathcal{H}} p(h') p(\mathcal{D} | h')}
\end{equation}
where $p(\mathcal{D} | h)$ is ${(1 / |h|)}^N$ if the training data match the
hypothesis $h$, otherwise zero.
The MAP is
$\textrm{argmax}_h p(h) p(\mathcal{D} | h) = \textrm{argmax}_h \left[ \log{p(h)} + \log{p(\mathcal{D} | h)} \right]$.
In the limit of infinite training data, the log-likelihood term
$\log{p(\mathcal{D} | h)}$, which grows in magnitude with $N$, dominates the
constant log-prior term. Therefore the MAP estimate converges to the
\textbf{maximum likelihood estimator (MLE)}
$\textrm{argmax}_h \log{p(\mathcal{D} | h)}$ in the limit of infinite data,
justifying the MLE's use as an objective function.
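A concrete sketch of the prior, likelihood, posterior and MAP estimate for a
toy hypothesis space of number concepts (the hypotheses, prior and data below
are hypothetical):
\begin{verbatim}
hypotheses = {
    "even": set(range(2, 101, 2)),
    "odd": set(range(1, 101, 2)),
    "powers_of_2": {2, 4, 8, 16, 32, 64},
}
prior = {"even": 0.45, "odd": 0.45, "powers_of_2": 0.10}
data = [16, 8, 2, 64]

def likelihood(h, data):
    """p(D | h) = (1 / |h|)^N if every example lies in h, else zero."""
    extension = hypotheses[h]
    if not all(x in extension for x in data):
        return 0.0
    return (1.0 / len(extension)) ** len(data)

evidence = sum(prior[h] * likelihood(h, data) for h in hypotheses)
posterior = {h: prior[h] * likelihood(h, data) / evidence
             for h in hypotheses}
map_hypothesis = max(posterior, key=posterior.get)  # "powers_of_2"
\end{verbatim}
Despite its smaller prior, ``powers of 2'' wins because its likelihood
${(1 / 6)}^4$ dwarfs the ${(1 / 50)}^4$ of ``even numbers''.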
\section{Deep Learning~\citet{Goodfellow-et-al-2016-Book}}
\subsection{Machine Learning Basics}
Maximum Likelihood Estimation (MLE) is maximization of the log-likelihood
$\log p_{model}(y | \mathbf{x}; \mathbf{\theta})$, where $y$ is a ground truth
example, $\mathbf{x}$ is an input feature vector, and $\mathbf{\theta}$ are
model parameters.
Principal Component Analysis (PCA) involves projecting input feature vectors
$\mathbf{x}$ into a reduced-dimensionality space via multiplication by
$D \in \mathbb{R}^{n \times l}$. The PCA projection minimizes the $L_2$
reconstruction error $||\mathbf{x} - r(\mathbf{x})||_2$.
Stochastic Gradient Descent (SGD) is gradient descent over minibatches. For an
objective function $L = f(y; x, \theta)$ and a minibatch of size $m'$, at each
step $t$ we have
$\theta_t = \theta_{t - 1} - \frac{\epsilon}{m'} \sum_{i = 1}^{m'} \nabla_\theta f\left(y^{(i)}; x^{(i)}, \theta_{t - 1}\right)$.
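A minimal sketch of the corresponding update loop, with the per-example
gradient function and the data as hypothetical placeholders:
\begin{verbatim}
import numpy as np

def sgd(grad_fn, theta, data, lr=0.01, batch_size=32, num_steps=1000):
    """theta_t = theta_{t-1} - lr * (mean over minibatch of grad f)."""
    rng = np.random.default_rng(0)
    for _ in range(num_steps):
        batch = rng.choice(len(data), size=batch_size, replace=False)
        grad = np.mean([grad_fn(theta, data[i]) for i in batch], axis=0)
        theta = theta - lr * grad
    return theta
\end{verbatim}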
\part{Datasets}
\section{CIFAR-10~\citet{cifar10-website}}
\label{cifar10}
CIFAR-10~\citet{cifar10-website} consists of \num{60000} colour images of $32
\times 32$ resolution, has 10 classes and \num{6000} images per class.
\section{HMDB-51~\citet{Kuehne11}}
\label{hmdb51}
HMDB-51 contains 7000 clips extracted from movies and YouTube then manually
annotated with 51 class labels.
\section{Kinetics~\citet{kay2017kinetics}}
\label{kinetics}
The Kinetics Human Action Video Dataset~\citet{kay2017kinetics} has 400
human action classes, each with more than 400 examples. The classes are focused
on human actions, such as pouring or kissing, as opposed to activities (such as
tennis or baseball). The clips are 10s long. There are roughly \num{300000}
labelled clips in total in the dataset.
The Kinetics test set is 100 clips for each class.
\section{KITTI 2012~\citet{geiger-kitti-2012}}
\label{kitti}
The KITTI 2012 odometry benchmark consists of 22 stereo sequences, saved in
lossless PNG format. The data include colour frames, LIDAR data, camera
calibration and ground truth poses (except sequences 11 to 21, which are for
evaluation).
\section{THUMOS~2014~\citet{DBLP:journals/corr/IdreesZJGLSS16}}
\label{thumos}
THUMOS~2014~\citet{DBLP:journals/corr/IdreesZJGLSS16} consists of 101 action
classes. THUMOS action classes are from UCF-101 (Section~\ref{ucf101}). All videos are
from YouTube, and are labelled with action class and temporal span of the
action. For the 20 action classes with temporal annotations, THUMOS~2014
contains 2755 trimmed training videos and 1010 untrimmed validation videos
containing 3007 action instances. The test set contains 213 untrimmed videos,
with 3358 action instances, that are not entirely background.
The THUMOS~2015 dataset extends THUMOS~2014 with a total of 5613 positive and
background untrimmed videos.
\section{UCF-101\citet{DBLP:journals/corr/abs-1212-0402}}
\label{ucf101}
101 classes, 13k clips and 27 hours of video data in total.
\section{WMT~2014~\citet{wmt14-translation-website}}
\label{wmt2014}
WMT~2014 is a machine translation task with five language pairs:
French-English, Hindi-English, German-English, Czech-English and
Russian-English. A number of parallel and monolingual corpora are included as
training data.
The majority of the training data is taken
from~\href{http://www.statmt.org/europarl/}{Europarl}~v7, from which about
50~million words per language are used in WMT~2014. Additional training data is
taken from the~\href{http://www.casmacat.eu/corpus/news-commentary.html}{News
Commentary Parallel Corpus}, at about three million words per language.
Test data from previous years is provided, with on the order of~\num{3000}
sentences per test set.
\part{Paper Summaries}
\section{Embedded Software Stack}
\subsection{Learning to Optimize Tensor Programs~\cite{chen2018learning}}
This paper expands on the ``tensor program'' low level optimization component
of TVM, called AutoTVM. They trade off exploration/exploitation in an online
model-based operator search by using submodular optimization (local
optimization) on the subset~$S_e$ found by simulated annealing (global
optimization) of the total set of considered operator implementations,
represented with a polyhedral model. This ``exploration module'' builds a
dataset~$\mathcal{D}$ of schedules with ground truth measurements run in the
hardware environment.
The submodular optimization objective function used is,
\begin{equation}
L(S) = - \sum_{s \in S} \hat{f}(g(e, s)) + \alpha \sum_{j = 1}^m \abs{\cup_{s \in S} \{s_j\}}
\end{equation}
They trained the cost model (GBT, TreeRNN, or random search) alternating
with the exploration module, which generates the cost model's dataset. They
found that a rank objective improved over a regression objective for training
the cost model (could this perhaps be fixed with normalization? Regression
makes sense here).
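A schematic sketch of the alternating exploration/cost-model loop as described
above; this is my own paraphrase rather than the AutoTVM API, the
\texttt{cost\_model}, \texttt{mutate} and \texttt{measure} callables are
placeholders, and the diversity-aware submodular selection is omitted:
\begin{verbatim}
import math
import random

def explore_and_learn(cost_model, mutate, measure, init_schedules,
                      num_rounds=10, num_sa_steps=100, batch_size=8):
    """Alternate between (1) proposing schedules by simulated annealing
    guided by the learned cost model and (2) refitting the cost model on
    hardware measurements of the most promising proposals."""
    dataset = []                        # (schedule, measured cost) pairs
    schedules = list(init_schedules)
    for _ in range(num_rounds):
        candidates = []
        for s in schedules:
            cost = cost_model.predict(s)
            for step in range(num_sa_steps):
                s_new = mutate(s)
                cost_new = cost_model.predict(s_new)
                temperature = max(1.0 - step / num_sa_steps, 1e-6)
                accept = (cost_new < cost or
                          random.random() <
                          math.exp((cost - cost_new) / temperature))
                if accept:
                    s, cost = s_new, cost_new
            candidates.append((s, cost))
        # Measure the most promising candidates on the real hardware.
        candidates.sort(key=lambda sc: sc[1])
        batch = [s for s, _ in candidates[:batch_size]]
        dataset.extend((s, measure(s)) for s in batch)
        cost_model.fit(dataset)         # e.g., refit a GBT ranker
        schedules = batch               # keep exploring from the best found
    return min(dataset, key=lambda sm: sm[1])
\end{verbatim}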
\subsection{TVM: An Automated End-to-End Optimizing Compiler for Deep
Learning~\cite{chen2018tvm}}
They present TVM, an optimizing compiler for deep learning systems comprising
roughly 50k lines of core C++ code. TVM performs both high-level graph
optimizations, e.g., fusing conv-BN-ReLU into backend-supported fused
operators, as well as low-level kernel optimizations. The kernel optimizations
use GBTs (TreeRNNs were also tried) to produce operators that outperform even
heavily handtuned cuDNN kernels on, e.g., Titan X (as of cuBLAS v8, cuDNN v7).
\subsubsection{Future Work/Comments/Questions}
\begin{itemize}
\item Improve the ML cost model, since GBTs were used for their speed,
but it's possible that there could be advantages from more
powerful models.
\item TVM has a runtime for high-level graph operator fusion, is this
runtime also required for kernel optimizations? Presumably
kernels can be generated offline and linked into an ML
framework.
\end{itemize}
\section{Human Pose}
\subsection{DeepPose: Human Pose Estimation via Deep Neural
Networks\citet{DBLP:journals/corr/ToshevS13}}
This paper uses DNNs as a method for human pose estimation, based on the
success of~\citet{NIPS2013_5207} and~\citet{DBLP:journals/corr/GirshickDDM13} for
object detection using DNNs.
This is in contrast to the existing work in human pose estimation at the time,
which focused on explicitly designed pose models. Papers about these methods
can be found in the ``Related Work'' section of
\citet{DBLP:journals/corr/ToshevS13}.
The input to the 7-layered convolutional DNN (based on
AlexNet\citet{NIPS2012_4824}) is the full image.
\subsection{End-to-end people detection in crowded
scenes\citet{DBLP:journals/corr/StewartA15}}
This paper is focused on jointly creating a set of bounding-box predictions for
people in crowded scenes using GoogLeNet and a
\hyperref[LSTM]{recurrent LSTM layer} as a controller. Since bounding-box
predictions are generated jointly, common post-processing steps such as
\hyperref[nonmax_supression]{non-maximum suppression} are unnecessary. All
components of the system are trained end-to-end using back propagation.
\subsubsection{Motivation}
The end-to-end people detection method is contrasted with the object detection
methods of R-CNN in~\citet{DBLP:journals/corr/GirshickDDM13} and OverFeat in
\citet{DBLP:journals/corr/SermanetEZMFL13}.
\citet{DBLP:journals/corr/GirshickDDM13} and
\citet{DBLP:journals/corr/SermanetEZMFL13} rely on non-maximum suppression,
which does not use access to image information to infer bounding box positions
since non-maximum suppression acts only on bounding boxes. Also, in end-to-end
people detection, the decoding stage is learned using LSTMs, instead of using
specialized methods as in~\citet{VisualPhrases} and~\citet{TaAnSc_14:occluded}.
Early related work can be found in~\citet{Felzenszwalb:2010:ODD:1850486.1850574}
and~\citet{Leibe:2005:PDC:1068507.1069006}. Best performing object detectors at
the time were~\citet{DBLP:journals/corr/GirshickDDM13},
\citet{DBLP:journals/corr/SermanetEZMFL13}, \citet{Uijlings13},
\citet{DBLP:journals/corr/ZhangBS15} and~\citet{DBLP:journals/corr/SzegedyREA14}.
Sequence modeling is done using LSTMs as in
\citet{DBLP:journals/corr/SutskeverVL14} (used for machine translation) and
\citet{DBLP:journals/corr/KarpathyF14} (used for image captioning). The loss
function is similar to the loss function proposed in
\citet{Graves06connectionisttemporal} in that the loss function encourages the
model to make predictions in descending order of confidence.
\subsubsection{Data}
A new training set collected from public webcams, called ``Brainwash'', is
produced. Brainwash consists of 11917 images with 91146 labelled people. 1000
images are allocated for testing and validation, hence training, test and
validation sets contain 82906, 4922 and 3318 labels, respectively.
\subsubsection{Model}
A pre-trained GoogLeNet\citet{going-deeper-szegedy43022} is used to produce
encoded features as input to the LSTM\@. The GoogLeNet features are further
fine-tuned by the training process. Using GoogLeNet, a feature vector of
length 1024 is produced for each region over a $(15, 20)$ grid of regions that
covers the entire $(480, 640)$ input image. Each cell in the grid has a
receptive field of $(139, 139)$, and is trained to produce a set (with fixed
cardinality five) of distinct bounding boxes in the center $(64, 64)$ region.
$L_2$ regularization of weights in the network was removed entirely.
GoogLeNet activations are scaled down by a factor of 100 before being input to
the decoder, since decoder weights are initialized according to a uniform
distribution in $[-0.1, 0.1]$, while GoogLeNet activations are in $[-80, 80]$.
Regression predictions from GoogLeNet are scaled up by 100 before comparing
with ground truth locations (which are in $[-64, 64]$).
At each step, the LSTM for each grid cell, of which there are 300 in total,
produces a new bounding box and corresponding confidence that the bounding box
contains a person $\boldsymbol{b} = \{\boldsymbol{b}_{pos}, b_c\}$, where
$\boldsymbol{b}_{pos} = (b_x, b_y, b_w, b_h) \in \varmathbb{R}^4$ and
$b_c \in [0, 1]$. The prediction algorithm stops when the confidence drops
below a set threshold. The LSTM units have 250 memory states, no bias units,
and no output non-linearities. Each LSTM unit adds its output to the image
representation, and feeds the result into the next LSTM unit. Comparable
results are found by only presenting the image representation as input to the
first LSTM unit.
Dropout with probability 0.15 is used on the output of each LSTM\@.
\subsubsection{Inference}
The system is trained with learning rate 0.2, decreased by a factor of 0.8
every 100 000 iterations (with convergence occurring after 500 000 iterations),
and momentum 0.5. Gradient clipping is done at 2-norm of 0.1.
Images are jittered by up to 32 pixels in horizontal and vertical directions,
and scaled by a factor between 0.9 and 1.1.
At test time, per-region predictions are merged by adding a new region at each
iteration, and destroying any new bounding boxes that overlap previously
accepted bounding boxes, under the constraint that any given bounding box can
destroy at most one other bounding box. An ordering function
$\Delta': A \times C \rightarrow \varmathbb{N} \times \varmathbb{R}$ given by
$\Delta'(\boldsymbol{b}_i, \tilde{\boldsymbol{b}}_j) = (m_{ij}, d_{ij})$ where
$m_{ij}$ denotes intersection of boxes and $d_{ij}$ is $L_1$ displacement, is
minimized using the Hungarian algorithm in order to find a bipartite matching.
At each step, any new candidate that is not intersecting in the matching is
added to the set of accepted candidates.
\subsubsection{Criticism}
A new loss function that operates on sets of bounding-box predictions is
introduced. Denoting bounding boxes generated by the model as
$C = \{\tilde{\boldsymbol{b}}_i\}$, and ground truth bounding boxes by
$G = \{\boldsymbol{b}_i\}$, the loss function is given by
Equation~\ref{loss-eqn}.
\begin{equation}
L(G, C, f) = \alpha\sum_i^{|G|}
l_{pos}\left(\tilde{\boldsymbol{b}}_{pos}^i, \boldsymbol{b}_{pos}^{f(i)}\right) +
\sum_j^{|C|} l_c\left(\tilde{b}_c^j, y_j\right)
\label{loss-eqn}
\end{equation}
In Equation~\ref{loss-eqn}, $f(i)$ is an injective function $G \rightarrow C$
that assigns one ground truth to each index $i$ up to the number of ground
truths, $l_{pos}$ is the $L_1$ displacement between bounding boxes, and $l_c$
is a cross-entropy loss on a candidate's confidence that a bounding box exists,
where $y_j = \mathbb{1}\{f^{-1}(j) \neq \varnothing\}$. $\alpha$ is set to 0.03
from cross-validation.
In creating $f(i)$ in Equation~\ref{loss-eqn} to assign candidate predictions
to ground truths, the
$G \times C \rightarrow \varmathbb{R} \times \varmathbb{N} \times \varmathbb{N}$
function
$\Delta\left(\boldsymbol{b}^i, \tilde{\boldsymbol{b}}^j\right) = (o_{ij}, r_i, d_{ij})$
is used to lexicographically order pairs first by $o$, then $r$, then $d$,
where $o$ is one if there is sufficient overlap between candidate and ground
truth and zero otherwise, $r$ is the prediction's confidence, and $d$ is the
$L_1$ displacement between candidate and ground truth bounding boxes.
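As a sketch of the underlying assignment machinery (collapsing the
lexicographic ordering into a single scalar cost rather than reproducing the
paper's exact $\Delta$), SciPy's Hungarian-algorithm implementation,
\texttt{linear\_sum\_assignment}, computes the minimum-cost injective matching:
\begin{verbatim}
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(gt_boxes, cand_boxes, pairwise_cost):
    """Assign each ground-truth box to a distinct candidate by minimizing
    the summed pairwise cost; returns (gt_index, candidate_index) pairs."""
    cost = np.array([[pairwise_cost(g, c) for c in cand_boxes]
                     for g in gt_boxes])
    gt_idx, cand_idx = linear_sum_assignment(cost)
    return list(zip(gt_idx, cand_idx))
\end{verbatim}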
\subsubsection{Experiments}
With an AP (average precision) of 0.78 and EER (equal error rate) of 0.81, the
$f(i)$ produced by minimizing $\Delta$, using the
\href{https://en.wikipedia.org/wiki/Hungarian_algorithm}{Hungarian algorithm},
is found to improve on AP and EER compared with a fixed assignment of $f(i)$,
or selecting the first $k$ highest ranked ($L_\textrm{firstk}$). COUNT
(Absolute difference between number of predicted and ground truth detections)
for $f(i)$ with Hungarian was 0.76 compared with 0.74 for $L_\textrm{firstk}$.
As a baseline, Overfeat-GoogLeNet (bounding-box regression on each cell,
followed by non-maximum suppression, as in
\citet{DBLP:journals/corr/SermanetEZMFL13}) achieved 0.67, 0.71 and 1.05 AP, EER
and COUNT, respectively.
Training without finetuning GoogLeNet reduces AP by 0.29.
Removal of dropout from the output of each LSTM decreases AP by 0.011.
When using the original $2^{-4}$ $L_2$ weights regularization multiplier on
GoogLeNet only, the network was unable to train. An $L_2$ regularization
multiplier on GoogLeNet of $10^{-6}$ reduced AP by 0.03.
It is found that AP (on the validation set) increases from 0.82 to 0.85 when
using separate weights connecting each of the LSTM outputs to predicted
candidates.
\section{Regularization}
\subsection{Dropout: A Simple Way to Prevent Neural Networks from
Overfitting\citet{Srivastava:2014:DSW:2627435.2670313}}
\label{dropout}
\textbf{Dropout} is a technique used to overcome the problem of overfitting in
deep neural nets with large numbers of parameters. The idea is to train using
many ``thinned'' networks, chosen by randomly removing subsets of units and
their connections. The predictions from the thinned networks are approximately
averaged at test time by using a single, unthinned, network with reduced
weights.
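A sketch of the ``inverted dropout'' formulation, which rescales surviving
activations at training time so that the unthinned test-time network needs no
weight rescaling (a common implementation choice; the paper instead scales the
weights at test time):
\begin{verbatim}
import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=None):
    """Zero each unit with probability p_drop during training, and rescale
    the survivors so the expected activation is unchanged."""
    if not training or p_drop == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(np.shape(activations)) >= p_drop
    return activations * mask / (1.0 - p_drop)
\end{verbatim}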
\begin{itemize}
\item Existing regularization methods: stopping training as soon as
validation error stops improving, L1 and L2 regularization, and
weight sharing\citet{Nowlan:1992:SNN:148167.148169}.
\end{itemize}
\section{CNN Architecture}
\subsection{Deep Residual Learning for Image
Recognition\citet{DBLP:journals/corr/HeZRS15}}
A technique, training of residual functions, is presented for deep neural
network architecture design, which allows training of deeper networks with
improved accuracy compared to not training residual functions.
\citet{going-deeper-szegedy43022} and~\citet{DBLP:journals/corr/SimonyanZ14a} are
referred to as motivating ``very deep'' models.
% TODO(brendan): Read [1], [9] for vanishing gradient, R-CNN series for
% localization.
\subsection{MobileNetV2: Inverted Residuals and Linear
Bottlenecks~\cite{Sandler_2018_CVPR}}
They change residual blocks (with depthwise separable convolutions)
to~\emph{expand} up to a higher number of channels inside the block, where a
$3 \times 3$ depthwise (channel-wise) convolution is performed. The expansion,
and the contraction back down to the narrow ``bottleneck'' width, are done
using $1 \times 1$ convolutions, with~\emph{no nonlinearity} on the contracting
projection (hence linear bottlenecks).
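A PyTorch-style sketch of one such inverted residual block; this is an
illustrative paraphrase rather than the reference implementation, and stride~1
with equal input/output channels is assumed so that the skip connection
applies:
\begin{verbatim}
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expand -> depthwise 3x3 -> linear 1x1 projection, with a skip."""
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),  # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,      # depthwise 3x3
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),  # linear bottleneck
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)
\end{verbatim}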
\subsection{Spatial Transformer Networks~\citet{jaderberg-spatial-2015}}
Spatial Transformer Networks introduce a CNN submodule that sub-differentiably
parametrizes spatial transformations of a feature map.
The submodule takes an input~$U \in \mathbb{R}^{H \times W \times C}$ and
outputs~$\mathcal{T}_\theta(G)$, which is a warp of the output grid~$G$
parametrized by~$\theta$, and outputs~$V$. E.g., $\mathcal{T}_\theta$ could be
an affine transformation.
The technique used in the paper to make the operation sub-differentiable is to
use a kernel function of the source coordinates, i.e.,
\begin{equation*}
V_i^c = \sum_n^H \sum_m^W U^c_{nm} k(x_i^s - m\;;\; \Phi_x)
k(y_i^s - n\;;\; \Phi_y)
\end{equation*}
for all channels~$c$ and pixels~$i$. The kernel function could be, for example,
a Dirac delta function, a bilinear
function~$\max(0, 1 - |x_i^s - m|)\max(0, 1 - |y_i^s - n|)$, or a Gaussian
kernel, etc.
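A sketch of bilinear sampling of a single channel using the bilinear kernel
above; the continuous source coordinates would come from the grid
warp~$\mathcal{T}_\theta(G)$, and the names below are illustrative:
\begin{verbatim}
import numpy as np

def bilinear_sample(U, xs, ys):
    """Sample channel U of shape (H, W) at continuous source coordinates
    (xs, ys) with the kernel max(0, 1 - |x - m|) * max(0, 1 - |y - n|)."""
    U = np.asarray(U, dtype=float)
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    H, W = U.shape
    out = np.zeros_like(xs)
    for n in range(H):
        for m in range(W):
            wx = np.maximum(0.0, 1.0 - np.abs(xs - m))
            wy = np.maximum(0.0, 1.0 - np.abs(ys - n))
            out += U[n, m] * wx * wy
    return out
\end{verbatim}
The full double sum mirrors the equation; an efficient implementation would
visit only the four integer neighbours of each source coordinate, where the
kernel is non-zero.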
One disadvantage of this method is that it seems that either the true discrete
characteristic of the grid warping operation is given up (in the case of
Gaussian or bilinear kernels), or a biased estimator of the gradient
(straight-through estimator, in the case of the Dirac delta function) is used.
It would be more interesting to evaluate the source pixel-selection operation
using an expectation over all selections in general, then to evaluate different
gradient estimators for this selection.
\section{Colour Constancy}
\subsection{On Finding Gray Pixels~\cite{qian2019cvpr}}
They determine the spatial illumination map~$L_i(x, y)$ by finding the
top~$N\%$ of gray pixels (assumed based on neutral interface reflection to
have~$j \in \{s,b\}$ surface and body
reflection~$R_{j,R} = R_{j,G} = R_{j,B} = \bar{R_j}$) using a grayness
index~$GI(x, y) = \norm{[C\{\log I_R - \log\abs{I}\}, C\{\log I_B -
\log\abs{I}\}]}$ in regions of varying intensity of light, i.e.,
where~$C\{I_i\} > \epsilon, \forall i\in \{R,G,B\}$.
\section{Correspondence}
\subsection{Learning Correspondence from the Cycle-Consistency of
Time~\cite{wang2019learning}}
They present a self-supervised method for learning features trained for finding
correspondences, by following an image patch forwards and backwards in time
with a simple tracker~$\mathcal{T}$ and minimizing localization error, i.e.,
enforcing cycle-consistency.
They encode features of a patch in the initial image, then use~$\mathcal{T}$ to
follow the patch back-and-forth over cycles of different lengths.
The tracker takes normalized cross-correlation (attention) of the patch
features over the entire past/future frame as input.
They use three losses: the tracking cycle-consistency loss, a skip loss
(cycle-consistency over skipped frames), and a feature similarity loss.
\subsubsection{Comments/Questions/Future Work}
\begin{itemize}
\item Choosing what to track, e.g., not tracking background parts of
the image.
\item Improving robustness to occlusion and partial observability,
e.g., with a better strategy for finding cycles (at training
time?).
\item How should one track a patch with two objects that eventually
diverge?
\item More context for tracking (does this mean correspondences between
a sequence of frames?)
\item The feature similarity loss seems to be dominant (other two
losses are weighted by~\num{0.1}).
Replace this with mutual information maximization?
\end{itemize}
\section{Face}
\subsection{Face Alignment in Full Pose Range: A 3D Total
Solution~\cite{zhu2017face}}
\subsubsection{Content / Contributions}
\begin{itemize}
\item This paper provides a solution for ``face alignment'', which seems to have
been used to refer to registration of either 2D~\cite{kazemi2014one} or
3D models.
The paper addresses face alignment in large pose ranges (up to~\ang{90} of yaw).
\item Large pose range face alignment poses challenges:
\begin{itemize}
\item Existing pose algorithms assume that landmarks are visible.
\item Large appearance change for large poses.
\item It is difficult to annotate occluded points.
\end{itemize}
\item To solve the landmark visibility (self-occlusion) issue, this paper
proposes solving face alignment as a 3D problem, rather than 2D, by
fitting translation, scale, rotation, and expression and shape
parameters of a 3DMM\@.
\item The paper introduces Projected Normalized Coordinate Code (PNCC) and Pose
Adaptive Features (PAF) to deal with appearance changes.
The paper also uses OWPDC to prioritize 3DMM parameters.
\item The paper constructs a database of 2D face images and 3D face models, and
combines these to synthesize more than~\num{60000} profile-view images
to create artificial profile-view data with ground-truth occluded
landmark annotations.
\end{itemize}
\subsubsection{Motivation}
\begin{itemize}
\item At the time of writing, most face alignment models worked only for small
or medium angle rotations (yaw angle less than~\ang{45}).
\end{itemize}
\subsubsection{Method}
\begin{itemize}
\item Quaternion rotation representation.
\item Optimized Weighted Parameter Distance Cost (OWPDC).
The paper asserts that different 3DMM parameters should be given
different priorities during model fitting to minimize alignment error.
\end{itemize}
\subsubsection{Results}
\subsubsection{Background}
\begin{itemize}
\item \href{http://www.csc.kth.se/~vahidk/face_ert.html}{One
Millisecond Face Alignment with an Ensemble of Regression Trees}~\cite{kazemi2014one}
for a classical (gradient boosting) solution.
\item Active appearance models~\cite{Cootes98activeappearance}.
\item 3DMM, analysis-by-synthesis 3DMM fitting~\cite{egger20203dmm}.
\item Regression-based 3DMM fitting~\cite{jourabloo2017poseinvariantface}.
\item Cascaded regression~\cite{dollar2010cascadedpose}.
\end{itemize}
\subsubsection{Open Challenges}
\subsubsection{Prerequisites}
\begin{itemize}
\item Euler angles.
\item Quaternions.
\item Quadratic programming.
\end{itemize}
\subsubsection{Comments / Questions}
\begin{itemize}
\item Can a 3D model be unambiguously registered only with correspondences to
2D keypoints?
It seems single-view 3d reconstruction is ill-posed.
Why?
\end{itemize}
\subsection{Generating 3D faces using Convolutional Mesh
Autoencoders~\cite{ranjan2018generating}}
\subsubsection{Content / Contributions}
\begin{itemize}
\item Novel mesh sampling operations that preserve mesh topology at different
scales.
\item Convolve facial mesh with Chebyshev filters.
\item Compact model (CoMA).
\item Dataset of~\num{20466} meshes of~\num{12} subjects in~\num{12} facial
expressions.
\end{itemize}
\subsubsection{Motivation}
\begin{itemize}
\item Existing linear 3D face models don't capture nonlinear deformations.
\item Extending the success of hierarchical 2D CNN structure to 3D meshes.
3D convolution is a poor substitute due to its inefficiency, while
mesh convolutions can process 3D data efficiently at high resolution.
\end{itemize}
\subsubsection{Method}
\begin{itemize}
\item Can sample faces from a Gaussian distribution.
\end{itemize}
\subsubsection{Results}
\begin{itemize}
\item 50\% better performance than linear 3D face model, while CoMA has 75\%
fewer parameters.
\item Replacing FLAME's expression space with CoMA improves FLAME's
reconstruction accuracy.
\end{itemize}
\subsubsection{Background}
\subsubsection{Open Challenges}
\begin{itemize}
\item Scarcity of 3D face training data.
\end{itemize}
\subsubsection{Comments}
\section{GANs}
\subsection{Generative Adversarial Networks\citet{NIPS2014_5423}}
\label{gan}
A method of generating probability distributions is presented that can be
trained end-to-end when used with \hyperref[multilayer_perceptron]{MLPs}. In
the given method, a discriminator network $D$ is optimized to distinguish from
ground truth the samples from the generated distribution $G$. $G$ is optimized
to cause $D$ to become $\frac{1}{2}$ everywhere.
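Concretely, the two networks play the minimax game
\begin{equation*}
\min_G \max_D \;
\varmathbb{E}_{\boldsymbol{x} \sim p_{\textrm{data}}}\left[\log D(\boldsymbol{x})\right]
+ \varmathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}}\left[\log\left(1 - D\left(G(\boldsymbol{z})\right)\right)\right],
\end{equation*}
whose optimum is reached when the generated distribution matches the data
distribution, at which point the optimal discriminator outputs $\frac{1}{2}$
everywhere.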
Motivation points to~\citet{deepSpeechReviewSPM2012} and~\citet{NIPS2012_4824} as
successful uses of deep discriminative networks for classification based on
\hyperref[backprop]{back propagation}, \hyperref[dropout]{dropout} and
piecewise linear units (such as \hyperref[rectified_linear_units]{ReLUs}).
\subsection{A Style-Based Generator Architecture for Generative Adversarial
Networks~\cite{karras2019astylebased}}
\subsubsection{Content / Contributions}
\begin{itemize}
\item New StyleGAN generator architecture
\item FFHQ dataset of high-res faces
\item Perceptual path length and linear separability metrics for
evaluating GAN latent space interpolation.
\end{itemize}
\subsubsection{Motivation}
\begin{itemize}
\item Unsupervised separation of style (e.g., pose and identity) via
latent code from stochastic variation (e.g., freckles and hair)
via injected noise
\item Allow disentangling of the latent space by projecting the latent
code into an intermediate space.
Since the generator should find it easier to generate realistic
images from disentangled latent codes, there is pressure for the
learned mapping to produce a disentangled intermediate latent
space.
\end{itemize}
\subsubsection{Method}
\begin{itemize}
\item Embeds latent code into intermediate latent space using an
8-layer MLP.
An affine transformation projects the intermediate latent
vector to a set of per-feature map styles and biases.
Styles and biases scale the normalized output feature map from
each layer (i.e., control the mean and std of each independent
feature map).
\item Each style controls one convolution, since the input to the next
convolution layer depends only on the statistics determined by
the style.
The style is then overridden by the next layer's AdaIN layer.
\item Each generator layer's output sums with uncorrelated Gaussian
noise, scaled by a learned factor.
\item Mixing regularization: use different latent codes for different
layers.
Improves FID when using multiple latent codes at test time.
Desirable to use latent codes of a source to manipulate a
target (paper Figure 3 and Table 2).
\item Used truncation trick, performed in the intermediate instead of
input latent space.
\end{itemize}
\subsubsection{Results}
\subsubsection{Background}
\begin{itemize}
\item Arbitrary style transfer in real-time with adaptive instance normalization
\item Progressive growing of GANs for improved quality, stability, and
variation (same first author)
\end{itemize}
\subsubsection{Open Challenges}
\subsubsection{Prerequisites}
\begin{itemize}
\item Slerp spherical interpolation operator ``Sampling generative
networks: Notes on a few effective techniques''
\end{itemize}
\subsubsection{Comments / Questions}
\begin{itemize}
\item Is the input latent vector sampled from a uniform distribution?
\item The footnote about disentanglement studies using designed datasets
glosses over the issue that each combination of input latent
factors has to match its density in the training data.
\item Hand-wavy motivation for disentangled intermediate latent space
(easier for generator to generate from disentangled latent
variables).
\end{itemize}
\subsection{Few-Shot Adversarial Learning Of Realistic Neural Talking Head
Models~\cite{zakharov2019fewshot}}