%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%% ICML 2009 EXAMPLE LATEX SUBMISSION FILE %%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Deep Network Guided Proof Search
% Use the following line _only_ if you're still using LaTeX 2.09.
%\documentstyle[icml2009,epsf,mlapa]{article}
% If you rely on Latex2e packages, like most modern people, use this:
\documentclass{article}
% Employ the following version of the ``usepackage'' statement for
% submitting the draft version of the paper for review. This will set
% the note in the first column to ``Under review. Do not distribute.''
\usepackage{iclr2017_arXiv,times}
\usepackage{url}
%{\"a} {\^e} {\`i} {\.I} {\o} {\'u} {\aa} {\c c} {\u g} {\l} {\~n} {\H o} {\v r} {\ss}
% & % $ # _ { } ~ ^ \
% \& \% \$ \# \_ \{ \}
% \textasciitilde
% \textasciicircum
% \textbackslash
\usepackage{tabularx}
\usepackage{multirow}
\usepackage{graphicx} % Allows including images
\usepackage{bm}
\usepackage{amsmath}
\DeclareMathOperator*{\argmin}{arg\,min} % the * allows subscripts below the text (same behaviour as \lim_{n \to \infty})
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator{\net}{net}
\usepackage{xcolor}
\usepackage{hyperref}
\hypersetup{
colorlinks,
citecolor=green,
filecolor=black,
linkcolor=red,
urlcolor=blue
}
\usepackage{amssymb} % for checkmark
\usepackage{rotating} % for vertical text
\usepackage{lipsum}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%% Alex's Commands and packages %%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{amsthm}
\usepackage{thmtools}
\usepackage[linesnumbered,ruled,vlined]{algorithm2e} %For pseudocode
\usepackage{natbib}
\setcitestyle{square}
%\setcitestyle{authoryear,open={[},close={]}} %Citation-related commands
%For plots
\usepackage{pgfplots}
\pgfplotsset{compat = newest}
% Define the theorem styles (minimal declarations so the environments below compile)
\declaretheoremstyle{mytheoremstyle}
\declaretheoremstyle[bodyfont=\normalfont]{corlstyle}
% Define the theorem environment
\declaretheorem[style=mytheoremstyle,name=Theorem,numberwithin=section]{theorem}
% Define the corollary environment numbered like the theorem
\declaretheorem[style=corlstyle,name=Corollary,numberlike=theorem]{corollary}
\declaretheorem[style=corlstyle,name=Remark,numberlike=theorem]{remark}
\declaretheorem[style=corlstyle,name=Example,numberlike=theorem]{example}
% Define the lemma environment numbered like the theorem
\declaretheorem[style=mytheoremstyle,name=Lemma,numberlike=theorem]{lemma}
% Define the proposition environment numbered like the theorem
\declaretheorem[style=mytheoremstyle,name=Proposition,numberlike=theorem]{proposition}
% Define the definition environment numbered like the theorem
\declaretheorem[style=mytheoremstyle,name=Definition,numberlike=theorem]{definition}
\usepackage{graphicx} % allow embedded images
\setkeys{Gin}{width=\linewidth,totalheight=\textheight,keepaspectratio}
\graphicspath{{graphics/}} % set of paths to search for images
\usepackage{listings}
\usepackage{wrapfig}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% End Alex's Commands and packages %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\pagestyle{headings}
\title{An Introduction to Supervised Machine Learning}
% The \author macro works with any number of authors. There are two
% commands used to separate the names and addresses of multiple
% authors: \And and \AND.
%
% Using \And between authors leaves it to LaTeX to determine where to
% break the lines. Using \AND forces a line break at that point. So,
% if LaTeX puts 3 of 4 authors names on the first line, and the last
% on the second line, try using \AND instead of \And before the third
% author name.
\author{
Alexandre St-Aubin \\
\hspace{0.5mm} McGill University \\
\hspace{0.5mm} [email protected]\\
%\texttt{[email protected]} \\
}
\begin{document}
\maketitle
\begin{abstract}
\noindent
This paper serves as an introductory guide to supervised learning within the field of machine learning (ML), aimed at readers with a foundational understanding of mathematics, primarily calculus and statistics. The focus is on neural networks (NN), with an in-depth exploration of their key components and learning methods. We begin with an overview of NNs, detailing the architecture and function of single-layer perceptrons, neurons, and feed-forward neural networks. The discussion extends to the structural elements of layers and the pivotal role of activation functions. We then explore the learning mechanisms of NNs through backpropagation and gradient descent optimization techniques. Different loss functions, critical for evaluating model performance, are also explored. Finally, we address the challenges of overfitting and underfitting and present strategies to achieve an optimal fit.
\vspace{5mm}
\textbf{Keywords:} Machine Learning, Supervised Learning, Neural Networks, Multiple Layer Perceptron, Activation Function, Backpropagation, Loss function, Gradient Descent, Overfitting, Underfitting.
\end{abstract}
\vfill
\clearpage
\newpage
\tableofcontents
\newpage
\section{Introduction}
Machine learning (ML) has surged in popularity recently due to its ability to tackle complex problems by extracting knowledge from data and learning from it autonomously. In 2022, the field of Generative Artificial Intelligence, a branch of ML, experienced a major breakthrough when OpenAI launched ChatGPT, a large language model that quickly became widely adopted. Shortly after, OpenAI introduced DALL-E 3, a model designed to generate images from text captions. More recently, there has been a trend in AI towards releasing task-specific models, such as math equation solvers, paraphrasers, schedulers, and various others.
Unlike traditional approaches, ML focuses not only on understanding data but also on inferring from it, allowing models to generalize knowledge and make predictions on unseen data. This emphasis on inference enables machine learning to go beyond merely understanding information, allowing ML to foster innovation and advancement in many different fields. Essentially, the power to infer enhances the practical applications and impact of ML, leading to significant developments across diverse areas.
There are several types of ML algorithms. The main categories are divided into \textit{Supervised learning, Unsupervised learning, Semi-supervised learning and Reinforcement learning.} \autoref{fig:ML} depicts the main classes of ML along with some popular models for each. It is important to note that since ML is a constantly evolving field, its organization into subfields can be somewhat chaotic. As a result, some areas, such as Imitation Learning, are not included in the figure, and even the latest surveys on machine learning methods often lack thoroughness. Additionally, new techniques are introduced every year, reshaping how we approach ML. For example, the introduction of transformers with the now famous paper "Attention is All You Need" \citep{vaswani2023attention} revolutionized the field, and more recently, models like Mamba \citep{gu2023mamba} are being developed that may eventually replace transformers. While these topics are beyond the scope of this paper, they may be of interest to the reader.
\vspace{4mm}
\noindent \textsc{Supervised learning} allows the machine to learn through examples. The machine learning algorithm is tasked with developing the strategy for achieving the specified outputs given some input. To do so, a known dataset is supplied that contains a set of inputs and associated target outputs in the form $(x, t)$, where $x$ is the input to the model, and $t$ the target output, i.e. the \textit{label}. The algorithm finds patterns in the data and adjusts its parameters so that it predicts the outputs of future inputs more accurately. This kind of machine learning algorithm requires a developer to accurately label the training data.
Broadly speaking, the two main problems addressed by Supervised learning are \textbf{Classification} and \textbf{Regression}. The former consists of classifying objects into their appropriate classes. A well-known example operates on the MNIST data set, a collection of images of handwritten digits from 0 to 9. A classification ML algorithm will analyse each point $(x,t)$ of the data set during the \textit{training phase}, where $x$ is the image, and $t$ is the digit it represents. Then, given an unseen image as input, the algorithm should predict which digit it represents.
\textit{Regression} aims to predict a continuous output variable, rather than a discrete one (as in \textit{Classification}). An example would be predicting the cost of a house given a set of training data consisting of house characteristics and their corresponding prices. We note, however, that supervised learning can also encompass more complex tasks, such as image segmentation and object detection (e.g., \href{https://github.com/ultralytics/ultralytics}{YOLO} with bounding boxes), where models learn more than just labels from the data.
\vspace{4mm}
\noindent\textsc{Unsupervised learning} algorithms discern patterns within data without predefined labels. The algorithm identifies correlations and structures within datasets. Key steps in this process include clustering, which groups similar data sets, and dimension reduction, which simplifies variables to extract essential information.
\vspace{4mm}
\noindent\textsc{Semi-Supervised learning} acts on a data set that contains both labelled and unlabelled data. Typically, labelled data is rare, while unlabelled data makes up most of the set. The goal of semi-supervised learning algorithms is to classify the unlabelled data by inferring from the labelled data.
\vspace{4mm}
\noindent\textsc{Reinforcement learning} involves a structured approach where algorithms are provided with actions, parameters, and goals. These algorithms explore various options, learn from trial and error, and adapt their strategies based on past experiences to optimize outcomes \citep{GuideML}.
In this paper, we aim to provide a concise, yet comprehensive introduction to \textit{Supervised learning}, focussing on the \textit{Neural Network} approach. We intend for this report to serve as a reference to key concepts, offering quick access to definitions and reliable references as needed. Most of the basic concepts we discuss, such as Neurons, Backpropagation, and Overfitting, are relevant across all areas of machine learning.
We note that throughout, some terms will be introduced before they are defined, or may only be clarified by illustrative examples. We chose this approach because of the broad nature of machine learning: some terms admit precise definitions, while other significant terms are used more loosely, as in common language. The figures and examples used were taken from trusted sources in the field, and citations are included.
Before starting, we want to highlight that ML, while closely related to statistics, is distinct from it. Both fields share common methods and terminologies, but their goals differ. Statistics primarily focuses on analyzing and interpreting existing data, whereas ML aims to make predictions and infer from new data. Terms like \textbf{Bias} and \textbf{Variance} can have different implications in ML. In ML, bias refers to the error introduced by approximating a real-world problem with a simplified model, while variance measures how much the model's predictions change with different training data. Understanding and managing the trade-off between bias and variance is crucial for developing robust ML models, as we'll learn in \autoref{sec:fitting}.
\begin{figure}
\includegraphics{MachineLearning}
\caption{Classification of Machine Learning Algorithms from \cite{SurveyML}}
\label{fig:ML}
\end{figure}
\section{Neural Networks}
Neural networks, a subset of machine learning, form the core of supervised learning and are inspired by the human brain. They are made of input, hidden, and output layers, and consist of interconnected \textit{neurons} with associated \textit{weights and biases}. Activated nodes transmit data to the next layer through outgoing edges.
\subsection{Single Layer Perceptron} \label{sub:Single Layer Perceptron}
The simplest neural network is referred to as the \textbf{Single Layer Perceptron} (see \autoref{fig:SLP}). This neural network contains
a single input layer and an output node \citep{inbook:Aggarwal-1.2}.
Consider a situation where each \textit{training instance} is of the form $(\overline{X}, y)$, where each $\overline{X} = [x_1, \ldots, x_d]$ contains $d$ feature variables, and $y \in \{-1, +1\}$ is the observed value of the binary class variable. By ``observed value'' we refer to the fact that it is given to us as a part of the training data, and our goal is to predict the class variable for cases in which it is not observed.
\begin{wrapfigure}{R}{0.4\textwidth} %this figure will be at the right
\centering
\includegraphics[width=0.4\textwidth]{SingleLayerP}
\caption{A single layer perceptron, from \cite{inbook:Aggarwal-1.2}.}
\label{fig:SLP}
\end{wrapfigure}
The input layer of the {perceptron} contains $d$ nodes that transmit the $d$ features $\overline{X} = [x_1, \ldots, x_d]$ with \textbf{edges} of \textbf{weight} $\overline{W} = [w_1, \ldots, w_d]$ to an output node. The input \textbf{layer} does not perform any computation. The linear function $\overline{W} \cdot \overline{X} =\sum^{d}_{i=1} w_ix_i$ is computed at the output node. Subsequently, the sign of this real value is used in order to predict the dependent variable of $\overline{X}$. Therefore, the prediction $\hat y$ is computed as follows:
$$\hat y = \text{sign}[\overline{X} \cdot \overline{W}] =\text{sign}\left[\sum^{d}_{i=1} w_ix_i\right] $$
Here, the sign function serves the role of an activation function (\autoref{sub:ActivationFunction}). The goal of the \textbf{perceptron} algorithm with respect to all training instances in a data set $\mathcal{D} = \{(\overline{X_1}, y_1 ), \cdots,(\overline{X_n}, y_n )\}$ is to minimize the loss function (\autoref{sub:Loss Function}). We shall learn more about loss functions in a following section, but an example would be to minimize the least-squares function:
\begin{equation*}
\begin{split}
\sum^{n}_{i=1} (\hat y_i - y_i)^2 =\sum^{n}_{i=1} \left(\text{sign}[\overline{X_i} \cdot \overline{W}] -y_i \right)^2
\end{split}
\end{equation*}
To optimize the loss function, the perceptron will use a \textit{gradient descent} method, which we will see later.
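
To make the perceptron concrete, the following is a minimal sketch (our own, not taken from the references) of the prediction rule above, written in the same Python style as the later snippets. The toy data, learning rate, and the use of the classical perceptron update on misclassified points (rather than an explicit least-squares gradient) are illustrative assumptions.
\begin{verbatim}
import numpy as np

def predict(X, W):
    # sign of the linear function W . X for each training instance (row of X)
    return np.sign(X @ W)

def train_perceptron(X, y, lr=0.1, epochs=100):
    # X: (n, d) feature matrix, y: (n,) labels in {-1, +1}
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            y_hat = np.sign(X[i] @ W)
            if y_hat != y[i]:
                # nudge the weights toward correctly classifying X[i]
                W += lr * y[i] * X[i]
    return W

# toy, linearly separable data (illustrative)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
W = train_perceptron(X, y)
print(predict(X, W))  # should reproduce y on this separable data
\end{verbatim}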
\subsection{Neurons}%
\label{sub:Neurons}
The neuron is the smallest computational unit of a neural network. The single layer perceptron described in the previous section is exactly a neuron, see \autoref{fig:Neuron} for a picture.
\begin{definition}[Neurons]
Neurons are made up of 4 elements:
\begin{enumerate}
\item The inputs, seen as \textit{nodes}, consist of either the input layer (\autoref{sub:Layers}), or the outputs of the previous layer of the neural network.
\item The weights (\autoref{def:weight}) and biases (\autoref{def:bias}), seen as \textit{edges}, act on the outputs of previous neurons before they reach the input function of the neuron.
\item The input function computes the sum of the weighted inputs.
\item The activation function (\autoref{sub:ActivationFunction}) acts on the input function and outputs the result to the following layers.
\end{enumerate}
A simple mathematical model for a neuron's output activation is \citep{book:AIModernApp}
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{Neuron}
\caption{A neuron and its inputs and outputs.}
\label{fig:Neuron}
\end{figure}
$$o_j = f \left( \sum^{n}_{i=0}w_{i,j} o_i \right)$$
where $f$ is the activation function, $o_i$ is the output activation of neuron $i$ and $w_{i,j}$ is the weight on the directed edge from neuron $i$ to the neuron $j$.
\end{definition}
\begin{definition}[Weights]\label{def:weight}
A weight $w_{i,j} \in \mathbb{R}$ will multiply the output of neuron $i$ before giving it as input to neuron $j$. Weights are the moving parts of the neural network, as they get changed during each iteration of the learning step in order to optimize the output. They can be initialized to any random value, and after running the network on a pair $(x,t)$ from the \textit{training data}, the weight update rule (\autoref{sub:Weight Update Rule}) will alter them to minimize the error between the output $\hat y$ and the target $t.$
\end{definition}
\begin{definition}[Bias]\label{def:bias}
Biases are constants: they provide an additional input to the next layer that always has the value 1. Bias units are not influenced by the previous layer; they guarantee that even when all the inputs are zero, there is still an input to the neuron \citep{WeightsBias}.
\end{definition}
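
As a small illustration of the definitions above, the following sketch (our own, not from the cited sources) computes the output activation $o_j$ of a single neuron from the outputs of the previous layer, its incoming weights, and a bias unit; the sigmoid activation and the numerical values are arbitrary choices.
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(prev_outputs, weights, bias_weight):
    # input function: weighted sum of the previous layer's outputs,
    # plus the bias unit (which always contributes an input of 1)
    net = bias_weight * 1.0 + np.dot(weights, prev_outputs)
    # activation function applied to the input function
    return sigmoid(net)

o_prev = np.array([0.5, -1.2, 0.3])   # outputs of the previous layer (illustrative)
w = np.array([0.4, 0.1, -0.7])        # weights on the incoming edges
print(neuron_output(o_prev, w, bias_weight=0.2))
\end{verbatim}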
When talking about neural networks and ML models in general, the term ``\textbf{parameters}'' often comes up. For example, OpenAI loves to brag about how many parameters their latest language models have (\textbf{GPT-4} is reported to have about $1.76$ trillion). Parameters are the elements of a neural network that the model learns from the training data. These parameters are exactly the weights and biases defined above. See \autoref{fig:parameters} for a comparison of popular Large Language Models by their number of parameters.
\subsection{Feed-Forward Neural Networks}%
\label{sub:Feed-Forward Neural Networks}
Having defined neurons, the building blocks of neural networks, we may now give a rigorous definition of the latter. A \textbf{feed-forward neural network} (FFN) has connections only in one direction—that is, it forms a directed acyclic graph. Every node receives input from “upstream” nodes and delivers output to “downstream” nodes; there are no loops \citep{book:AIModernApp}.
The overall network is a combination of function composition and matrix multiplication:
$$g(x) = f^L (W^Lf^{L-1}(W^{L-1} \hdots f^1(W^1x)\hdots)), $$
where $L$ is the number of layers (\autoref{sub:Layers}), $W$ is the weight matrix, and $f$ is the activation function (\autoref{sub:ActivationFunction}).
\begin{definition}[Feed-forward Neural Network]
A neural network is a \textit{computational directed acyclic graph,} in which the unit of computation is the neuron. Each directed edge in the graph represents a function passing the weighted output of a node in a layer to a node in the next layer.
\end{definition}
Artificial neural networks are often referred to as \textbf{multi-layer perceptrons} (MLPs) for one simple reason: a neural network can be viewed as a composition of multiple perceptrons. In this context, a perceptron is a synonym for a neuron, the smallest computational unit of a neural network: it computes, for example, a linear combination of its inputs followed by the application of an activation function, which can be the sigmoid, tanh, ReLU, identity, or any other function that is differentiable if the network is to be trained with gradient descent.
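
The composition $g(x) = f^L (W^Lf^{L-1}(\cdots f^1(W^1x)\cdots))$ translates directly into a loop over layers. Below is a minimal sketch of this idea; the layer sizes, random weights, and the ReLU/identity activations are illustrative assumptions, and bias terms are omitted for brevity.
\begin{verbatim}
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def identity(x):
    return x

def feed_forward(x, weight_matrices, activations):
    # apply g(x) = f^L(W^L f^{L-1}(... f^1(W^1 x) ...)) layer by layer
    a = x
    for W, f in zip(weight_matrices, activations):
        a = f(W @ a)
    return a

rng = np.random.default_rng(0)
# a 3 -> 4 -> 2 network with random weights (illustrative)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
fs = [relu, identity]
print(feed_forward(np.array([1.0, -0.5, 2.0]), Ws, fs))
\end{verbatim}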
\begin{figure}
\centering
\includegraphics[width=1\textwidth]{parameters}
\caption{A graph comparing the number of parameters in popular Large Language Models (LLMs), data from \href{https://en.wikipedia.org/wiki/Large_language_model}{Wikipedia}.}
\label{fig:parameters}
\end{figure}
\subsection{Layers}
\label{sub:Layers}
FFNs are usually arranged in layers, such that each unit receives input
only from units in the immediately preceding layer. The layers are divided into three groups: the \textit{input, hidden,} and \textit{output} layers. \textit{Layer} is a general term for a grouping of neurons that act together in the neural network, as seen in \autoref{fig:NN}.
\begin{wrapfigure}{R}{0.4\textwidth} %this figure will be at the right
\centering
\includegraphics[width=0.4\textwidth]{NeuralNetwork}
\caption{A graph representation of a Neural Network, from \cite{inbook:Aggarwal-1.2}.}
\label{fig:NN}
\end{wrapfigure}
The first layer in a neural network is the \textbf{input layer}; it receives the initial data for the network from the outside world and serves as the ``entry point'' of the neural network. Then come the \textbf{hidden layer(s)}, which are where the magic happens in neural networks. Each layer tries to learn different aspects of the data by minimizing the \textbf{cost function}. The most intuitive way to understand these layers is in the context of image recognition, such as a face: the first layer may learn edge detection, the second may detect eyes, the third a nose, etc. \citep{layers}. The final layer in the neural network is the \textbf{output layer}. This layer is responsible for holding the final result or output of the problem. Input, such as raw images, is fed into the input layer, and the output layer produces the corresponding result.
\subsection{Activation Function}%
\label{sub:ActivationFunction}
An activation function in neural networks is a function applied to the weighted sum computed at each neuron in a layer. It introduces non-linearity to the network, allowing it to learn complex patterns and relationships in the data.
The activation function determines whether a neuron should be activated or not based on the weighted sum of its inputs. In other words, it defines the output of a neuron given a set of inputs. Without activation functions, neural networks would be limited to linear transformations, and they wouldn't be able to capture the non-linearities present in many real-world datasets.
\vspace{5mm}
\noindent \textbf{Desired Characteristics of Activation Functions} \citep{jagtap2022important}
\noindent There is no universal rule for choosing the best activation function, but there are some characteristics to look for, namely
\begin{enumerate}
\item \textbf{Nonlinearity} is one of the most essential characteristics
of an activation function. The non-linearity significantly improves the network's ability to learn and model complex, non-linear relationships in the data. If only linear activation functions were used, the entire network would be equivalent to a single linear transformation of the input, regardless of the number of layers. This would severely limit the network's ability to solve complex problems.
\item The activation function must
be \textbf{computationally cheap} in order to reduce training costs.
\item It must be \textbf{bounded}, as gradient-based training approaches are more stable when the range of the activation function is finite.
\item The most desirable quality for using
gradient-based optimization approaches is \textbf{continuously
differentiable} activation functions. This ensures that the
back-propagation algorithm works properly.
\end{enumerate}
\begin{remark}[The vanishing and exploding gradient problems] The inputs and outputs of certain activation functions, like the logistic function (Sigmoid), can vary greatly. These functions compress a large input space into a small output range, $[0,1]$. This may cause the back-propagation algorithm to have almost no gradients to propagate backward through the network, and any remaining gradients diminish as they move down through the layers. This leaves the initial hidden layers with little to no gradient information. Using non-saturating activation functions, such as ReLU, is one way to solve this problem.
\end{remark}
We present some commonly used activation functions discussed in \cite{jagtap2022important}.
\begin{enumerate}
\item \textit{Sigmoid Function.} Its range is $[0,1]$, and is defined as,
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Advantage: boundedness. \\ Disadvantages:
the vanishing gradient problem, the output not being zero-centered, and the saturation for large input values.
\item \textit{Hyperbolic Tangent Function.} It is mostly used for
regression problems, has a range of $[-1,1]$, and is defined as,
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Advantage: zero-centered structure. \\
Disadvantage: the vanishing gradient problem, i.e. once saturated, it is really challenging
for the learning algorithm to adapt the parameters and learn
faster.
\item \textit{ReLU Function.} ReLU was introduced primarily to overcome the vanishing gradient problem. ReLU is the most common activation function used for \textit{classification problems}. Its range is $[0, \infty)$, and it is defined as
$$\text{ReLU}(x) = \max(0, x)$$
Advantages: Apart from overcoming the vanishing gradient problem, the
implementation of ReLU is very easy and thus cheaper, unlike
tanh and sigmoid, where an exponential function is needed.
\\
Disadvantages: It has a saturation region, which can prevent the
learning of the network. In particular, ReLU always discards
the negative values, which makes the affected neurons stop responding to
the gradient-based optimizer. This problem is known as the \textit{dead
or dying ReLU problem}: the affected neurons stop
outputting anything other than zero.
\item \textit{Softplus Function.} It approximates the ReLU activation function in a smooth way, with a range of $(0, \infty)$, and it is defined as
$$\text{Softplus}(x) = \ln (1+ e^x) $$
\item \textit{Softmax.} It is a generalization of the logistic
function to high dimensions. It maps a vector of real values to a
probability distribution. The standard softmax function Softmax$: \mathbb{R}^k \to (0,1)^k$ is defined as
$$\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum^{k}_{j=1} e^{x_j}} \quad \text{for } i= 1,...,k $$
In other words, it applies
the standard exponential function to each element $x_i$
of the input vector $x$ and normalizes these values by
dividing them by the sum of all these exponentials, which ensures that the sum of the components of the output vector is 1.
\end{enumerate}
\autoref{fig:actfun} summarizes the information described above.
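
For reference, the activation functions listed above take only a few lines of code. The following sketch is our own; the numerically stable softmax (subtracting the maximum before exponentiating, which leaves the result unchanged) is a standard implementation trick rather than part of the definitions above.
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    return np.log1p(np.exp(x))

def softmax(x):
    # subtract the max for numerical stability; the output is unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), relu(x), softmax(x))
\end{verbatim}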
\begin{figure}[]
\includegraphics{ActFunTable}
\caption{Common activation functions, their derivatives, range, and order of continuity, from \cite{jagtap2022important}.}
\label{fig:actfun}
\end{figure}
\section{Learning}
\label{sec:Learning}
At the core of machine learning lies the art of learning itself. But how can a computer program "learn"? The answer lies in viewing learning from data as an optimization problem. Essentially, the model's goal is to minimize the error between the predicted output and the actual expected output.
In practice, given many data points, the model learns by adjusting its parameters iteratively. Each iteration involves tweaking the model's parameters to reduce the error between its predictions and the true values. This process continues until the model achieves the desired level of accuracy.
In this section, we delve into the essential components that drive this process forward: gradient descent, backpropagation, and loss functions.
\textbf{Gradient descent} is the main concept that enables learning. Its goal is to navigate down the gradient of the \textbf{loss function}, which quantifies the disparity between predicted and actual outputs, guiding the learning trajectory towards convergence. This optimization technique iteratively adjusts the model’s parameters to minimize the error.
To achieve this, gradient descent relies on the computation of gradients, which are crucial for understanding the direction and rate at which the model parameters should be updated. This is where \textbf{backpropagation} comes into play. Backpropagation serves as the engine of learning in neural networks, propagating the error from the output layer back through the network to the first hidden layer. This process enables the calculation of the gradient of the loss function with respect to the weights, allowing the model to update its parameters in a way that reduces the overall error. By continually applying these updates, the model gradually improves its performance.
\subsection{BackPropagation}%
\label{sub:BackPropagation}
\textbf{Backpropagation} is an algorithm for automatic differentiation that calculates the gradients of parameters in neural networks. The gradient estimate is used by optimization algorithms such as \textit{Stochastic Gradient Descent} (\autoref{subsub:Stochastic Gradient Descent}) to compute the network weight updates. So, when we say a neural network is \textit{learning}, we mean that backpropagation computes the gradients, gradient descent uses them to minimize the loss function, and the weights are changed using a \textbf{weight update rule} (\autoref{sub:Weight Update Rule}). In short, backpropagation is a way of computing the partial derivatives of a loss function with respect to the weights of a network; we use these derivatives in gradient descent.
\begin{wrapfigure}{R}{0.4\textwidth} %this figure will be at the right
\centering
\includegraphics[width=0.4\textwidth]{simpleNet}
\caption{A simple network with 2 inputs, one hidden layer, and two outputs.}
\label{fig:simpleNet}
\end{wrapfigure}
Let us first begin by recalling that a neural network evaluates compositions of functions computed at individual nodes. Think of a neural network as a function $h_w(x)$ of the input, parametrized by the weights.
\begin{example}
Consider the network in \autoref{fig:simpleNet}; let $\{x_1,x_2\}$ be the input vector, $f$ the activation function, and $a_i$ the activated output at node $i$. Then, the output at node 3 is given by
$$a_3 = f(w_{0, 3} + x_1 w_{1,3} + x_2 w_{2,3}), $$
where $w_{0,3}$ is the bias weight (\autoref{def:bias}) at node 3. Similarly, the output at node 5 is
\begin{equation*}
\begin{split}
a_5 &= f( w_{0,5} + a_3 w_{3,5} + a_4 w_{4,5} )\\
&= f[w_{0,5} + f(w_{0, 3} + x_1 w_{1,3} + x_2 w_{2,3})w_{3,5} +f(w_{0, 4} + x_1 w_{1,4} + x_2 w_{2,4})w_{4,5} ]
\end{split}
\end{equation*}
And even with such a small network, we can already see how awkward it would be to compute the derivative of $a_5$ with respect to $w$.
\end{example}
An even bigger problem than in the above example arises when we think of how we would compute the loss function in hidden layers. Whereas the error $y -h_w$ at the output layer is clear, the error at the hidden layers seems mysterious because the training data do not say what value the hidden nodes should have. Note that here, $y$ is the target output, and $h_w$ the value computed by the network.
\begin{wrapfigure}{r}{0.4\textwidth} %this figure will be at the right
\centering
\includegraphics[width=0.4\textwidth]{pre-post-neuron}
\caption{Pre and Post activation values of a neuron.}
\label{fig:ppNeuron}
\end{wrapfigure}
Therefore, we need some kind of iterative approach to compute the derivatives, and a way to \textit{back-propagate} the error from the output layer to the hidden layers. The resulting iterative approach uses \textit{dynamic programming}, and the \textbf{weight update rule} is the \textit{chain rule} of differential calculus \citep{inbook:Aggarwal-3.2}.
\begin{theorem}[Multivariate Chain Rule] \label{chain}
Let $z = f(y_1, y_2, \ldots, y_m)$ be a differentiable function of $m$ variables, where each $y_j = g_j(x_1, x_2, \ldots, x_n)$ is a differentiable function of $n$ variables. Then, for each $i = 1, 2, \ldots, n$, the partial derivative of $z$ with respect to $x_i$ is given by
$$
\frac{\partial z}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}.$$
\end{theorem}
We note that backpropagation can be implemented by using either the \textit{pre-activation}, or \textit{post-activation} values at each neuron. Here, we'll focus on the method that acts on the pre-activation values, illustrated in \autoref{fig:ppNeuron}. For the sake of simplicity, we view the neural network as a \textit{Directed Acyclic Graph} $G$, where
\begin{enumerate}
\item Each node represents a neuron, and is denoted by a number $j$.
\item The weight on the edge from neuron $i$ to $j$ is denoted $w_{i,j}$.
\end{enumerate}
The algorithm can be divided into two phases: \textit{forward} and \textit{backward}.
\vspace{4mm}
\noindent\textsc{Forward Phase.} In the forward phase, the value of each hidden layer is computed from the current weight values for a specific input vector. These computations naturally cascade forward across the layers. The aim of the forward phase is to compute every intermediate hidden and output variable for a given input; the backward phase will use these values. The loss function $L$ with respect to this output, as well as the value of the output $o$, are calculated at the completion of the computation.
\vspace{4mm}
\noindent\textsc{Backward Phase.} In this phase, the gradient of the loss function with respect to the different weights is calculated. First, the derivative of the loss with respect to the output is computed; this initializes the gradient computation. The multivariate chain rule is then used to propagate the derivatives \textit{backwards} in the neural network. Since we are focussing on the pre-activation approach, the gradients are computed with respect to the pre-activation values of the hidden variables, which are then propagated backwards.
More precisely, the process can be described as follows \citep{Dreyfus1990ArtificialNN}. Let $(x,t)$ be the input to our algorithm, where $t$ is the target output for the input vector $x$, and let $\hat y$ denote the output predicted by the network. Define the value
$$L = \operatorname{Loss}(\hat y,t),$$
where Loss could be any \textit{loss function} (see \autoref{sub:Loss Functions}), such as the MSE. At each neuron $j$, let its post-activation output be
$$o_j = \Phi (\net_j) = \Phi \left( \sum^{n}_{i=0}w_{i,j} o_i \right) , $$
where $\Phi$ is any \textit{activation function}, $w_{i,j}$ is the weight on the edge between neurons $i$ and $j$, and $o_i$ is the output from neuron $i.$ Then, the pre-activation value is easily seen to be $\net_j$.
\begin{enumerate}
\item The \textbf{forward pass} propagates $x$ through the neural network, computing the values of all hidden neurons to reach the output $\hat y$ of the neural network, which corresponds to the predicted output. Then, $L = \operatorname{Loss}(\hat y, t)$ is computed.
\item The derivative $\frac{\partial {L}}{\partial {\hat y}}$ at the output can be computed directly. To compute the derivative of $L$ with respect to $o_j$ for an arbitrary neuron $j$, we proceed recursively.
\item Consider $L$ as a function of all neurons receiving input from $j$, and denote this set by $I_j =\{ i_1^{(j)}, \ldots, i_n^{(j)}\}$. Writing
$$\frac{\partial {L}}{\partial {o_j}} = \frac{\partial {L(\net_{i_1^{(j)}}, \ldots, \net_{i_n^{(j)}})}}{\partial {o_j}}, $$
the chain rule yields the following recurrence relation for the derivative of $L$ with respect to $o_j$,
\begin{equation}\label{deriv_o_j}
\begin{split}
\frac{\partial {L}}{\partial {o_j}} = \sum_{i \in I_j} \left( \frac{\partial {L}}{\partial {\net_i}} \frac{\partial {\net_i}}{\partial {o_j}} \right) =& \sum_{i \in I_j} \left( \frac{\partial {L}}{\partial {o_i}} \frac{\partial {o_i}}{\partial {\net_i}} \frac{\partial {\net_i }}{\partial {o_j}} \right) \\ =& \sum_{i \in I_j} \left( \frac{\partial {L}}{\partial {o_i}} \frac{\partial {o_i}}{\partial {\net_i}} w_{j,i}\right)
\end{split}
\end{equation}
We can see that the derivative with respect to $o_j$ can be computed once those of the neurons in the next layer are already known, per the \textit{recursive} behaviour of this algorithm.
\item We now have all the necessary tools to compute the partial derivative of $L$ with respect to the weight $w_{i,j}$. Again, by applying the chain rule, we get
\begin{equation}\label{deriv_w}
\begin{split}
{\frac {\partial L}{\partial w_{i,j}}}={\frac {\partial L}{\partial o_{j}}}{\frac {\partial o_{j}}{\partial w_{i,j}}}=&{\frac {\partial L}{\partial o_{j}}}{\frac {\partial o_{j}}{\partial {\text{net}}_{j}}}{\frac {\partial {\text{net}}_{j}}{\partial w_{i,j}}}\\
=& {\frac {\partial L}{\partial o_{j}}}{\frac {\partial o_{j}}{\partial {\text{net}}_{j}}} \left(\frac{\partial {}}{\partial {w_{i,j}}} {\sum^{n}_{k=0} w_{k,j}o_k}\right) \\
=&{\frac {\partial L}{\partial o_{j}}}{\frac {\partial o_{j}}{\partial {\text{net}}_{j}}} \left(\frac{\partial {w_{i,j} o_i}}{\partial {w_{i,j}}} \right)={\frac {\partial L}{\partial o_{j}}}{\frac {\partial o_{j}}{\partial {\text{net}}_{j}}} o_i
\end{split}
\end{equation}
and so, we may consider the following recursively defined function to simplify notation,
\begin{equation*}
\begin{split}
\delta_j = {\frac {\partial L}{\partial o_{j}}}{\frac {\partial o_{j}}{\partial {\text{net}}_{j}}}=
\begin{cases}
\frac{\partial {L(t, o_j)}}{\partial {o_j}} \frac{d {\Phi (\net_j)}}{d {\net_j}}, &\text{ if } j \text{ is an output neuron, }\\
\left( \sum^{}_{i\in I_j} w_{j,i} \delta_i\right) \frac{d {\Phi (\net_j)}}{d {\net_j}}, &\text{ otherwise.}
\end{cases}
\end{split}
\end{equation*}
to conclude that the partial derivative of $L$ with respect to the weight $w_{i,j}$ is given by
\begin{equation*}
\begin{split}
\frac{\partial {L}}{\partial {w_{i,j}}} = o_i \delta_j
\end{split}
\end{equation*}
\end{enumerate}
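
To make the two phases concrete, here is a minimal sketch of backpropagation for a fully-connected network with a single hidden layer, written in the spirit of the formulas above. The sigmoid activation, the mean squared error loss, and the layer sizes are illustrative assumptions; a full implementation would handle arbitrary depth and batching.
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_single_example(x, t, W1, b1, W2, b2):
    # forward phase: compute and store pre-activations (net) and outputs (o)
    net1 = W1 @ x + b1
    o1 = sigmoid(net1)
    net2 = W2 @ o1 + b2
    y_hat = sigmoid(net2)
    loss = 0.5 * np.sum((y_hat - t) ** 2)

    # backward phase: delta_j = dL/do_j * do_j/dnet_j, propagated backwards
    delta2 = (y_hat - t) * y_hat * (1.0 - y_hat)   # output layer
    delta1 = (W2.T @ delta2) * o1 * (1.0 - o1)     # hidden layer (recurrence)

    # dL/dw_{i,j} = o_i * delta_j
    dW2 = np.outer(delta2, o1)
    dW1 = np.outer(delta1, x)
    return loss, dW1, delta1, dW2, delta2   # the deltas are the bias gradients

# illustrative shapes: 2 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), np.zeros(3)
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)
print(backprop_single_example(np.array([0.5, -1.0]), np.array([1.0]),
                              W1, b1, W2, b2)[0])
\end{verbatim}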
\subsection{Weight Update Rule}%
\label{sub:Weight Update Rule}
After having found the derivatives of the loss function with respect to the weights, one needs a way to update the weights. The goal is to modify the weights of the neural network so that, given a training pair $(x,t)$, where $x$ is the input vector and $t$ the target output, the network's output points more toward $t$. Then, in the future, it will have a better chance of classifying $x$ correctly \citep{Hagan_Martin}. The naive way would be to set the weights so that they point directly to $t$. Unfortunately, this leads to overfitting, which we address in \autoref{sec:fitting}.
To update the weight $w_{i,j}$ using backpropagation, we first choose a \textbf{learning rate} $\mu> 0$. We may then choose to update $w_{i,j}$ by adding $\Delta w_{i,j}$ to it, where
\begin{equation*}
\begin{split}
\Delta w_{i,j} = -\mu \frac{\partial {L}}{\partial {w_{i,j}}} = - \mu o_i \delta_j
\end{split}
\end{equation*}
Then, the weight update rule would be defined as,
\begin{equation*}
\begin{split}
w_{i,j}^{(\text{updated})} = w_{i,j}^{(\text{old})} + \Delta w_{i,j}
\end{split}
\end{equation*}
We claim that this update rule decreases the loss $L$, and refer the reader to section $4.15$ of \cite{Hagan_Martin} for a proof, which is quite lengthy.
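
In code, the update is a single line once the gradients are known. The small sketch below (with illustrative weights, gradients, and learning rate) applies the rule $\Delta w_{i,j} = -\mu\, o_i \delta_j$ to a whole weight matrix at once.
\begin{verbatim}
import numpy as np

mu = 0.1                                      # learning rate (illustrative value)
W = np.array([[0.4, -0.2], [0.1, 0.3]])       # current weights (illustrative)
dW = np.array([[0.05, 0.0], [-0.02, 0.01]])   # dL/dw_{i,j} = o_i * delta_j from backprop

W_updated = W - mu * dW   # equivalently, W + Delta_W with Delta_W = -mu * dW
print(W_updated)
\end{verbatim}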
\subsection{Gradient Descent}%
\label{sub:Gradient Descent}
Gradient descent is a method for minimizing an \textbf{objective function} \( J(\theta) \) parameterized by a model's parameters \( \theta \in \mathbb{R}^d \), namely, the weights and biases. This is achieved by updating the parameters in the opposite direction of the gradient of the objective function \( \nabla_{\theta} J(\theta) \) with respect to the parameters. The learning rate \( \eta \) (denoted $\mu$ in \autoref{sub:Weight Update Rule}) determines the size of the steps taken towards a (potentially local) minimum. Essentially, we move in the direction of the slope of the surface defined by the objective function, descending until we reach a valley \citep{ruder2017overview}.
In the context of neural networks, the objective function would be the \textit{loss function}, and in order to find the gradient, one would use \textit{backpropagation}. It is common for developers to say they trained their model using back-propagation, but technically, this is incorrect. Back-propagation is not an optimization algorithm and cannot be used to train a model by itself. The term back-propagation is often misunderstood as the entire learning algorithm for multi-layer neural networks. In reality, back-propagation only refers to the gradient computation method, while another algorithm, like SGD, performs the learning using these gradients \citep{Goodfellow-et-al-2016}. Before exploring the different gradient descent algorithms, we define some terminology from \cite{batchvsEpoch}.
\begin{definition}[Sample]
A sample is a single row of data containing inputs for the algorithm and an output for error calculation. A training dataset consists of many samples, also known as instances, observations, input vectors, or feature vectors.
\end{definition}
The term \textit{feature vectors} may be more appropriate for describing a row of data, as \textit{features} refer to specific, measurable attributes or characteristics of a given phenomenon. Selecting informative features is crucial for the success of algorithms in tasks such as pattern recognition, classification, and regression. Although features are typically numerical, they can also be structural, such as strings and graphs. The concept of a feature is analogous to that of an explanatory variable in statistical methods like linear regression. For simplicity, we will use the term \textit{sample} for the remainder of the paper.
\begin{definition}[Batch]
The batch size is a hyperparameter defining the number of samples processed before updating the model parameters. A batch iterates over samples, makes predictions, calculates errors, and updates the model.
\end{definition}
Common mini-batch sizes are 32, 64, and 128. If the dataset size is not divisible by the batch size, the final batch will have fewer samples.
\begin{definition}[Epoch]
An epoch is a hyperparameter defining the number of times the entire training dataset is processed. Each epoch ensures all samples have an opportunity to update the model parameters. An epoch consists of one or more batches.
\end{definition}
The number of epochs is typically large, often in the hundreds or thousands, to minimize the model error sufficiently. Running too many epochs may lead to overfitting, as we'll see in \autoref{sub:Overfitting}.
\subsubsection{Vanilla Gradient Descent}
\label{subsub:Vanilla Gradient Descent}
Vanilla gradient descent computes the gradient of the loss function with respect to the parameters (i.e. the weights and biases) for the whole training set. In other words, the \textit{batch size} is the size of the entire training set. So for each parameter $\theta$, the update would be,
$$ \theta =\theta - \eta \cdot \nabla_{\theta}J(\theta) $$
Since calculating the gradients for the entire dataset to perform a single update can be very slow, vanilla gradient descent is impractical for large data sets. The code for vanilla gradient descent might look like this \citep{ruder2017overview}:
\begin{verbatim}
for i in range(nb_epochs):
params_grad = evaluate_gradient(loss_function, data, params)
params = params - learning_rate * params_grad
\end{verbatim}
For a pre-defined number of \textit{epochs}, we first compute the gradient vector \texttt{params\_grad} of the loss function for the entire dataset with respect to our parameter vector \texttt{params} using, say, \textit{backpropagation}.
We then update our parameters in the direction of the gradients, with the learning rate determining the size of each update. Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces \citep{ruder2017overview}.
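
The snippet above is pseudocode from \cite{ruder2017overview}. As a self-contained, runnable counterpart, the sketch below applies the same loop to a toy least-squares problem, where \texttt{evaluate\_gradient} is the closed-form gradient of the quadratic loss over the whole dataset; the data, learning rate, and number of epochs are illustrative assumptions.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
# toy linear-regression data: y = X @ true_params + noise (illustrative)
X = rng.standard_normal((100, 3))
true_params = np.array([2.0, -1.0, 0.5])
y = X @ true_params + 0.01 * rng.standard_normal(100)

def evaluate_gradient(params, X, y):
    # gradient of the mean squared error loss over the *entire* dataset
    return 2.0 / len(y) * X.T @ (X @ params - y)

params = np.zeros(3)
learning_rate = 0.1
nb_epochs = 200
for i in range(nb_epochs):
    params_grad = evaluate_gradient(params, X, y)
    params = params - learning_rate * params_grad

print(params)  # should be close to true_params
\end{verbatim}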
\subsubsection{Stochastic Gradient Descent}%
\label{subsub:Stochastic Gradient Descent}
The volume of data has grown so much in recent years that the processing power available is insufficient to train neural networks by performing gradient descent on the entire training set at each epoch. As a result, neural networks employ \textbf{stochastic gradient descent} (SGD), which reduces machine computation time.
SGD is a variant of gradient descent that makes each update far less costly. It alters the batch gradient descent algorithm by computing the gradient for just one training sample in each iteration before updating the parameters; equivalently, the batch size for SGD is equal to 1. We define an example as a pair $(x^{(i)}, y^{(i)})$ where $x^{(i)}$ is the training input and $y^{(i)}$ the label. The update is then \citep{ruder2017overview},
$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta :(x^{(i)}, y^{(i)}) ) $$
The steps for performing SGD are as follows:
\begin{verbatim}
for i in range(nb_epochs):
np.random.shuffle(data)
for example in data:
params_grad = evaluate_gradient(loss_function, example, params)
params = params - learning_rate * params_grad
\end{verbatim}
\begin{wrapfigure}{R}{0.4\textwidth} %this figure will be at the right
\centering
\includegraphics[width=0.4\textwidth]{fluctuation-SGD}
\caption{SGD fluctuation, from \cite{ruder2017overview}}
\label{fig:fluctuation-SGD}
\end{wrapfigure}
Notice that the data is shuffled before each iteration. This is done to prevent bias that would occur if the data were trained in a specific order. A downside of SGD is that it doesn't converge as surely as vanilla gradient descent to the minimum of the ``basin'' where the parameters were initialized. Indeed, since SGD updates the parameters after every gradient computation on a single sample, it fluctuates a lot, as seen in \autoref{fig:fluctuation-SGD}. This may cause it to jump to a new local minimum. Research indicates that by gradually decreasing the learning rate with a \textbf{learning rate schedule}, SGD exhibits similar convergence behavior to vanilla gradient descent, nearly always converging to a local minimum for non-convex functions and to the global minimum for convex functions \citep{ruder2017overview}.
\subsubsection{Mini-Batch Gradient Descent}%
\label{sub:Mini-Batch Gradient Descent}
Finally, mini-batch gradient descent is the middle ground between SGD and vanilla gradient descent. Instead of computing the gradient on one sample, or on the entire data set, we compute it on a fixed number $n$ of samples,
$$\theta = \theta - \eta \cdot \nabla_\theta J(\theta: (x^{(i:i+n)} ,y^{(i:i+n)}))$$
This approach reduces the variance of parameter updates, leading to more stable convergence, and leverages highly optimized matrix operations in modern deep learning libraries, making gradient computation with mini-batches very efficient. Common mini-batch sizes range from 50 to 256, though they can vary depending on the application. Mini-batch gradient descent is typically the preferred algorithm for training neural networks, and the term SGD is often used even when mini-batches are employed \citep{ruder2017overview}. The code for mini-batch gradient descent with a batch size of 50 would be,
\begin{verbatim}
for i in range(nb_epochs):
np.random.shuffle(data)
for batch in get_batches(data, batch_size=50):
params_grad = evaluate_gradient(loss_function, batch, params)
params = params - learning_rate * params_grad
\end{verbatim}
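
A runnable counterpart of the pseudocode above, again on a toy least-squares problem, is sketched below; the helper \texttt{get\_batches}, the data, the learning rate, and the batch size are illustrative assumptions, and setting \texttt{batch\_size=1} recovers SGD.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
true_params = np.array([2.0, -1.0, 0.5])
y = X @ true_params + 0.01 * rng.standard_normal(500)

def evaluate_gradient(params, X_batch, y_batch):
    # mean squared error gradient on the current mini-batch
    return 2.0 / len(y_batch) * X_batch.T @ (X_batch @ params - y_batch)

def get_batches(X, y, batch_size):
    idx = rng.permutation(len(y))          # shuffle before each epoch
    for start in range(0, len(y), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

params = np.zeros(3)
learning_rate = 0.05
for i in range(50):                        # epochs
    for X_batch, y_batch in get_batches(X, y, batch_size=50):
        params_grad = evaluate_gradient(params, X_batch, y_batch)
        params = params - learning_rate * params_grad

print(params)  # should be close to true_params
\end{verbatim}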
Now, there exist many more gradient descent algorithms, such as \textit{Nesterov Accelerated Gradient, Adagrad, Adadelta, Adam,} and many more. For more information on the subject, we refer the reader to \cite{ruder2017overview}.
\subsection{Loss Functions}%
\label{sub:Loss Functions}
The attentive reader will have noticed that supervised machine learning revolves around minimizing a loss function. This also applies to a wide range of other machine learning strategies. Thus, defining a good loss function is essential. We offer an overview of the most widely utilized loss functions for a variety of uses, starting with \textbf{Regression Losses}.
\subsubsection{Regression Loss Functions}%
\label{sub:Loss Function}
Regression losses are loss functions used to solve the regression problems of \textit{supervised ML}. Recall that regression models predict a continuous output variable. The regression losses are all based on residuals, namely the difference between the predicted and expected outputs. In the following, let $x_i$ be the $i^{th}$ element of input $x$, $f(x_i)$ the $i^{th}$ element of the predicted output, and $y_i$ the $i^{th}$ element of the expected output. The following are commonly used regression losses (a short code sketch of several of them follows the list),
\begin{enumerate}
\item \textit{Mean Bias Error Loss [Continuous, Differentiable]}. The Mean Bias Error loss (MBE) is the most basic loss function; it is given by,
\begin{equation}
\begin{split}
\mathcal{L}_{MBE} (y_i, f(x_i))= \frac{1}{n} \sum^{n}_{i=1} [y_i-f(x_i)]
\end{split}
\end{equation}
It measures the average bias in the prediction, but because positive errors have the ability to cancel out negative ones and create an incorrect parameter estimate, it is rarely used as the loss function to train regression models. However, it serves as the foundation for the ensuing loss functions and is frequently employed to assess the models' performances \citep{ciampiconi2023survey}.
\item \textit{Mean Absolute Error Loss (L1) [Lipschitz-Continuous, Convex].} The Mean Absolute Error loss is one of the most fundamental loss functions for regression; it measures the average of the absolute bias in the prediction. The absolute value overcomes the problem of the MBE by ensuring that positive errors do not cancel the negative ones. It is defined as,
\begin{equation}
\begin{split}
\mathcal{L}_{MAE}(y_i, f(x_i)) = \frac{1}{n} \sum^{n}_{i=1} |y_i-f(x_i)|
\end{split}
\end{equation}
Notice that the contribution of the errors follows a linear behaviour, implying that many small errors have as much impact as a big one. This implies that the gradient magnitude is not dependent on the error size, thus leading to convergence problems when the error is small. A model trained to minimize the MAE performs well when the target data conditioned on the input is symmetric \citep{ciampiconi2023survey}.
\item \textit{Mean Squared Error Loss (L2) [Continuous, Differentiable, Convex].} The Mean Squared Error loss is a well-known and simple loss function for regression. It is given by
\begin{equation}
\begin{split}
\mathcal{L}_{MSE}(y_i, f(x_i)) = \frac{1}{n} \sum^{n}_{i=1} (y_i-f(x_i))^2
\end{split}
\end{equation}
The squared term makes all the biases positive and magnifies the contribution made by outliers, making it more suitable for problems where noise in the observations follows a normal distribution. The sensitivity to the outliers is the primary disadvantage \citep{ciampiconi2023survey}.
\item \textit{Root Mean Squared Error Loss [Continuous, Differentiable, Convex].} The Root Mean Squared Error loss is, apart from the square root term, identical to MSE. Its main benefit is that the loss has the same units and scale as the relevant variable. The minimization procedure converges to the same optimal value as MSE. However, the RMSE may take different gradient steps depending on the optimization method employed \citep{ciampiconi2023survey}. It is defined as,
\begin{equation}
\begin{split}
\mathcal{L}_{RMSE}(y_i, f(x_i)) = \sqrt{\frac{1}{n} \sum^{n}_{i=1} (y_i-f(x_i))^2}
\end{split}
\end{equation}
\item \textit{Huber Loss [Lipschitz-Continuous, Differentiable, Strictly Convex].} Huber loss is a mix of MAE and MSE. When the residuals are sufficiently small, it switches from MAE to MSE. It is parameterized by $\delta$, which indicates the point at which the MAE and MSE behaviours transition. This enables it to combine the benefits of the MAE and the MSE. When there is a significant discrepancy between the model's prediction and the target, the Huber loss is less susceptible to outliers since the errors contribute linearly. On the other hand, if the error is small, it follows the MSE, which accelerates convergence and makes it differentiable at $0$. A crucial decision is which $\delta$ to use, which may be modified often throughout the training process depending on what constitutes an outlier \citep{ciampiconi2023survey}. The Huber loss is given by,
\begin{equation}
\begin{split}
\mathcal{L}_{Huber}(y_i, f(x_i)) = \sum^{n}_{i=1} \alpha_i(y_i, f(x_i)),
\end{split}
\end{equation}
where,
\begin{equation*}
\begin{split}
\alpha_i(y_i, f(x_i)) =\begin{cases}
\frac{1}{2} (y_i-f(x_i))^2, &\text{ if } |y_i -f(x_i) | \leq \delta \\
\delta\left( |y_i -f(x_i)| -\frac{1}{2}\delta \right), &\text{ otherwise.}
\end{cases}
\end{split}
\end{equation*}
\item \textit{Log-cosh Loss [Continuous, Differentiable].} The Log-cosh loss is given by the logarithm of the \textit{hyperbolic cosine} of the residuals between the actual value $y_i$ and the predicted value $f(x_i)$. It is more computationally demanding than \textit{Huber loss}, but it provides all the same benefits without the need to establish a hyperparameter. The log-cosh loss has the advantage of being twice differentiable, making it appropriate for algorithms needing to compute the hessian. It is also considered a \textit{robust estimator}, meaning that it is tolerant to outliers in the data set \citep{ciampiconi2023survey}. It is defined as,
\begin{equation}
\begin{split}
\mathcal{L}_{Logcosh}(y_i, f(x_i)) = \frac{1}{n} \sum^{n}_{i=1} \log[\cosh(y_i-f(x_i))]
\end{split}
\end{equation}
\item \textit{Root Mean Squared Logarithmic Error Loss [Continuous, Differentiable, Convex].} The Root Mean Squared Logarithmic Error (RMSLE) loss is given by,
\begin{equation}
\begin{split}
\mathcal{L}_{RMSLE}(y_i, f(x_i)) = \sqrt{\frac{1}{n} \sum^{n}_{i=1} [\log (y_i + 1) - \log(f(x_i)+1)]^2 }
\end{split}
\end{equation}
When it comes to RMSE, the only distinction is that the logarithm is applied to both the observed and the predicted values. The logarithm's plus-one term permits zero values for $f(x_i)$. The RMSLE is more resilient to outliers because of the logarithm, and it measures the relative error between the expected and actual values. In particular, the RMSLE's magnitude does not increase in proportion to the error's magnitude. Rather, when both the predicted and actual values are high, data points with large residuals are not penalized as much. Because of this, the RMSLE is a viable option for problems where the targets have an exponential relationship or when penalizing underestimates more heavily than overestimates is desirable. This loss, however, is inappropriate for problems where negative values are permitted \citep{ciampiconi2023survey}.
\end{enumerate}
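As a concrete illustration, the following is a minimal NumPy sketch of the Huber, Log-cosh, and RMSLE losses defined above. The array values are purely illustrative, and the code mirrors the formulas rather than any particular library's implementation.
\begin{verbatim}
import numpy as np

def huber(y, y_hat, delta=1.0):
    # Quadratic for small residuals, linear for large ones.
    r = np.abs(y - y_hat)
    return np.sum(np.where(r <= delta,
                           0.5 * r**2,
                           delta * (r - 0.5 * delta)))

def log_cosh(y, y_hat):
    # Twice-differentiable, robust alternative to the Huber loss.
    return np.mean(np.log(np.cosh(y - y_hat)))

def rmsle(y, y_hat):
    # The logarithms require non-negative targets and predictions.
    return np.sqrt(np.mean((np.log1p(y) - np.log1p(y_hat))**2))

y     = np.array([1.0, 2.0, 3.0, 100.0])   # one large target
y_hat = np.array([1.1, 1.9, 3.2,  80.0])   # one large residual
print(huber(y, y_hat), log_cosh(y, y_hat), rmsle(y, y_hat))
\end{verbatim}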
\subsubsection{Classification Loss Functions}%
\label{sub:Classification Loss Functions}
The second subset of supervised ML, \textit{Classification}, also has its own set of loss functions. The following are the most commonly used \textit{margin-based} classification losses.
\begin{enumerate}
\item \textit{Zero-one Loss.} The Zero-one loss is the most basic classification loss function. It is defined as,
\begin{equation}
\begin{split}
\mathcal{L}_{Zero-One}(y, f(x)) =
\begin{cases}
1, &\text{ if } f(x) \cdot y < 0 \\
0, &\text{ otherwise.}
\end{cases}
\end{split}
\end{equation}
In practice, the Zero-one loss cannot be minimized directly, since it is neither convex nor differentiable.
\item \textit{Hinge Loss [Lipschitz-continuous, Convex].} The Hinge loss is among the most famous loss functions for classification. It is given by,
\begin{equation}
\begin{split}
\mathcal{L}_{Hinge}(y, f(x)) = \max(0, 1-f(x)\cdot y)
\end{split}
\end{equation}
The two main drawbacks of the Hinge loss are that it is sensitive to outliers, and that its derivative is discontinuous at $f(x)\cdot y = 1$. The latter makes it harder to optimize.
\item \textit{Quadratically Smoothed Hinge Loss [Lipschitz-continuous, Convex, Differentiable].} The Quadratically Smoothed Hinge loss is a smoothed version of the Hinge loss, making it easier to optimize.
\begin{equation}
\begin{split}
\mathcal{L}_{QSmoothedHinge}(y, f(x)) =
\begin{cases}
\frac{1}{2\gamma} \max(0, 1-f(x)\cdot y)^2 , &\text{ if } f(x)\cdot y \geq 1 -\gamma\\
1 -\frac{\gamma}{2} -f(x)\cdot y, &\text{ otherwise.}
\end{cases}
\end{split}
\end{equation}
The hyperparameter $\gamma$ gives the degree of smoothing. As $\gamma \to 0,$ $\mathcal{L}_{QSmoothedHinge}(y, f(x)) \to \mathcal{L}_{Hinge}(y, f(x)) $.
\item \textit{Modified Huber Loss [Lipschitz-continuous, Differentiable, Strictly convex].} The Modified Huber loss is a version of the Huber loss adapted to classification. It is a special case of the Quadratically Smoothed Hinge loss with $\gamma =2$. We define it as,
\begin{equation}
\begin{split}
\mathcal{L}_{ModHuber}(y, f(x)) =
\begin{cases}
\frac{1}{4} \max (0, 1-f(x) \cdot y)^2 , &\text{ if } f(x)\cdot y \geq -1 \\
-(f(x) \cdot y), &\text{ otherwise.}
\end{cases}
\end{split}
\end{equation}
\item \textit{Ramp Loss [Continuous].} The Ramp loss caps the Hinge loss at $1$, which makes it more robust to outliers; the price is that it is no longer convex. It is given by (see the numerical sketch following this list),
\begin{equation}
\begin{split}
\mathcal{L}_{Ramp}(y, f(x)) =
\begin{cases}
\mathcal{L}_{Hinge}(y, f(x)), &\text{ if } f(x)\cdot y \geq 0 \\
1, &\text{ otherwise.}
\end{cases}
\end{split}
\end{equation}
\end{enumerate}
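The following minimal NumPy sketch evaluates the Hinge, Quadratically Smoothed Hinge, and Ramp losses as functions of the margin $f(x)\cdot y$, with labels in $\{-1,+1\}$. The margin values are illustrative only, and the code mirrors the formulas above rather than a specific library routine.
\begin{verbatim}
import numpy as np

def hinge(margin):
    # margin = y * f(x), with labels y in {-1, +1}
    return np.maximum(0.0, 1.0 - margin)

def smoothed_hinge(margin, gamma=1.0):
    # Quadratic near the hinge point, linear for strongly
    # misclassified points; recovers the hinge as gamma -> 0.
    return np.where(margin >= 1.0 - gamma,
                    np.maximum(0.0, 1.0 - margin)**2 / (2.0 * gamma),
                    1.0 - gamma / 2.0 - margin)

def ramp(margin):
    # Hinge loss capped at 1: robust to badly misclassified outliers.
    return np.minimum(1.0, hinge(margin))

margins = np.array([-2.0, -0.5, 0.5, 2.0])
print(hinge(margins), smoothed_hinge(margins), ramp(margins))
\end{verbatim}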
Other loss functions based on \textit{Information Theory}, such as the \textit{Cross-Entropy loss} and the \textit{Kullback-Leibler divergence}, are widely used in classification problems, and we refer the reader to \citep{ciampiconi2023survey} for more information.
\section{Fitting the model}\label{sec:fitting}
Amidst the pursuit of learning, lurk the pitfalls of overfitting and underfitting. Overfitting occurs when a model becomes overly tailored to the training data, while underfitting results in oversimplified representations. Finding the right balance between generality and complexity is essential, achieved through techniques like regularization and cross-validation.
The main difficulty with machine learning is that we must be able to learn from new, unknown inputs as well as from those our model was trained on. The ability of a model to perform well on unseen data is called \textbf{Generalization}. When training a model, we want the error between the expected and predicted outputs to be as low as possible on the training data. This is referred to as the \textit{training error}. We could stop here, and we would have a pure \textit{optimization} problem. But this is where machine learning differs: we also require that our model predict new, unseen data with a low error. We call this error the \textbf{test error}.
When determining the quality of a ML algorithm, we look mainly at the two following attributes,
\begin{enumerate}
\item How low is the training error?
\item How small is the gap between the test and training errors?
\end{enumerate}
These correspond, respectively, to the two main challenges in Machine Learning: \textbf{Underfitting} and \textbf{Overfitting} \citep{Goodfellow-et-al-2016}.
\begin{figure}
\includegraphics{fitting}
\caption{Diagrams representing underfitting, appropriate fitting, and overfitting, from \citep{Goodfellow-et-al-2016}}
\end{figure}
\subsection{Overfitting}%
\label{sub:Overfitting}
\textbf{Overfitting} refers to the process of producing an analysis that matches a given set of data too closely, which can lead to problems when trying to fit new data or make reliable predictions about future events. That is, a mathematical model starts overfitting when it begins to memorize the data, rather than learn from it.
In the context of machine learning, if we choose too large a network, it will behave like a lookup table on the data it was trained with, but it won't generalize well to new data. Put another way, when a neural network has an excessive number of parameters, it can overfit, just like any other statistical model \citep{book:AIModernApp}.
As was hinted above, the degree of overfitting is influenced by the amount of data supplied, as well as the model's \textit{complexity}. The complexity of the model is determined by the number of underlying parameters that a neural network has. Additional degrees of freedom are the consequence of having more parameters, which can be utilized to explain certain training data points without making a strong generalization to new ones \citep{inbook:Aggarwal-4.1}.
\begin{wrapfigure}{R}{0.4\textwidth} %this figure will be at the right
\centering
\includegraphics[width=0.4\textwidth]{Overfitting}
\caption{A diagram showing overfitting (green line) of data from \citep{overfitting-img}}
\label{fig:Overfitting}
\end{wrapfigure}
\begin{example}
Suppose we have 5 training pairs $(x, t)$ available, with distinct inputs. Then there exists a degree-4 polynomial that fits the 5 training points exactly, with zero error. This does not mean, however, that the polynomial will approximate unseen data with zero error.
\end{example}
As seen in \autoref{fig:Overfitting}, an overfit model will have low \textit{bias}, but high \textit{variance} on the training data. However, as explained in \cite{burnham2002model}, an overfit model will
\begin{quote}"have estimated (and actual) sampling variances that are needlessly large (the precision of the estimators is poor, relative to what could have been accomplished with a more parsimonious model) [...]. A best approximating model is achieved by properly balancing the errors of underfitting and overfitting." \end{quote}
\begin{remark}
In machine learning models, overfitting is more likely when learning is performed for too long, or when the training data set is small.
\end{remark}
\subsection{Underfitting}%
\label{sub:Underfitting}
Underfitting is essentially the inverse of overfitting. It occurs when a model is too basic and lacks the complexity, i.e. the number of parameters, needed to capture the patterns in the data. An underfit model has high bias and low variance on the training data, the exact opposite of an overfit model. In other words, underfitting is failing to learn enough from the training set. For instance, underfitting occurs when fitting a linear model to non-linear data: such a model's predictive ability is poor on both the training and the testing data.
Underfitting is not as widely discussed as overfitting, since it can be detected easily by evaluating how well the model is performing on the training data.
\subsection{Mastering the Fit}%
\label{sub:Mastering the Fit}
Now that we've shown how badly a machine learning algorithm can behave, we'll show how to fix it. By changing a model's capacity, we may alter how likely it is to overfit or underfit.
\begin{definition}[Capacity]
We define the capacity of a model as its ability to fit a wide variety of
functions. Models with low capacity may struggle to fit the training set, while models
with high capacity can overfit by memorizing properties of the training set that do
not serve them well on the test set. The capacity of a model is a way to measure its complexity.
\end{definition}
Before altering the capacity of a model, one must evaluate whether a model is overfitting or underfitting. This can be done with the help of cross-validation, which we'll discuss in the next chapter.
The \textbf{bias-variance tradeoff} is a fundamental concept in machine learning that deals with the tradeoff between a model's bias and its variance. \textit{Bias} refers to the error introduced by approximating a real-world problem with a simplified model, often resulting in \textit{underfitting}. \textit{Variance}, on the other hand, reflects the model's sensitivity to small fluctuations in the training data, potentially leading to \textit{overfitting}. Balancing bias and variance is crucial for achieving optimal predictive performance: reducing bias typically increases variance and vice versa. Finding the right balance involves selecting an appropriate model complexity and regularization techniques to minimize both sources of error. \autoref{fig:capacity} demonstrates the optimal point of capacity.
As long as the training error remains low, one way to improve generalization without changing the capacity is to increase the amount of training data. If this is not possible, there are a number of components we can adjust in order to control the capacity of a model, such as \citep{capacity-Brownlee},
\begin{wrapfigure}{R}{0.4\textwidth} %this figure will be at the right
\centering
\includegraphics[width=0.4\textwidth]{capacity}
\caption{A diagram showing the optimal capacity of a model and its relation with the \textit{bias-variance trade-off}, from \citep{capacity-Kowalik}.}
\label{fig:capacity}
\end{wrapfigure}
\begin{enumerate}
\item The number of nodes per layer (\textbf{width})
\item The number of layers (\textbf{depth})
\end{enumerate}
It makes sense that increasing these would increase the capacity, as they clearly increase the complexity of the model, which we know is closely related to its capacity. It should be noted, however, that increasing the number of nodes and layers in a model also increases its running time and memory usage.
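As a rough, hypothetical illustration of how width and depth drive complexity, the short sketch below counts the parameters of a fully connected network; the layer sizes are arbitrary examples.
\begin{verbatim}
def mlp_parameter_count(input_dim, hidden_width, depth, output_dim):
    # Number of weights and biases in a fully connected network with
    # `depth` hidden layers of `hidden_width` units each: a rough
    # proxy for the capacity discussed above.
    sizes = [input_dim] + [hidden_width] * depth + [output_dim]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

# Widening or deepening the network increases the parameter count,
# and with it the capacity (and the compute and memory cost).
print(mlp_parameter_count(784, 64, depth=1, output_dim=10))   # 50,890
print(mlp_parameter_count(784, 256, depth=3, output_dim=10))  # 335,114
\end{verbatim}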
Other ways to tune the model optimally are part of a subset of ML called \textit{Regularization}, that we'll cover in the following sections.
\subsection{Cross-Validation}%
\label{sub:Cross-Validation}
Cross-validation is a technique that estimates how well the ML model performs on unseen data. This helps identify issues such as \textit{overfitting}, discussed above, or \textit{selection bias}, which results from selecting training data in a non-random way that does not properly represent the population. Cross-validation provides insight into how well the model will generalize to a different dataset.
The most common technique used is \textbf{k-fold cross-validation}, which allows one to repeatedly train and test the model $k$ times on various randomly chosen subsets of the training data. The average test error across $k$ trials can then be used to estimate the test error. In trial $i$, the $i^{th}$ subset of the data is used as the test set, and the remaining data is utilized as the training set \citep{Goodfellow-et-al-2016}. See \autoref{algo1} for the pseudo-code for the k-fold cross validation algorithm.
\begin{algorithm}[h!]
\SetKwProg{Fn}{Function }{\string:}{}
\SetKwFunction{KFoldXV}{KFoldXV}
\caption{$k$-fold cross-validation}\label{k-CrossVal}
\label{algo1}
\Fn{\KFoldXV{$\mathbb{D}, A, L, k$}}{
\textbf {Require: } $\mathbb{D}$ \text {, the given dataset, with elements } $ \boldsymbol{z}^{(i)}$\\
\textbf {Require: } $A$, the learning algorithm, seen as a function that takes a dataset as input and outputs a learned function \\
\textbf {Require:} $L$, the loss function, seen as a function from a learned function $f$ and an example $\boldsymbol{z}^{(i)} \in \mathbb{D}$ to a scalar $\in \mathbb{R}$\\
\textbf {Require: } $k$ \text {, the number of folds }\\
\BlankLine
Split $\mathbb{D}$ into $k$ mutually exclusive subsets $\mathbb{D}_i$, whose union is $\mathbb{D}$\\
\For{$i$ from 1 to $k$ }{
$f_i = A(\mathbb{D} \setminus \mathbb{D}_i)$;\\
\For{$z^{(j)}$ in $\mathbb{D}_i$ }{
$e_j = L(f_i,z^{(j)} )$;
}
}
\Return $e$;
}
\end{algorithm}
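A minimal Python sketch of the same procedure is given below. The helpers \texttt{fit} and \texttt{loss}, and the least-squares example, are hypothetical stand-ins for the algorithm's inputs $A$ and $L$.
\begin{verbatim}
import numpy as np

def k_fold_cv(X, y, fit, loss, k=5, seed=0):
    # `fit` plays the role of the learning algorithm A: it trains on a
    # subset of the data and returns a predictor. `loss` plays the role
    # of L: it scores a single (prediction, target) pair.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        f = fit(X[train], y[train])              # train on D \ D_i
        errors.extend(loss(f(X[t]), y[t]) for t in test)
    return np.mean(errors)

def fit(X_train, y_train):
    # Ordinary least squares as a stand-in learning algorithm A.
    w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
    return lambda x: x @ w

def loss(pred, target):
    # Squared error as a stand-in loss L.
    return (pred - target) ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=40)
print(k_fold_cv(X, y, fit, loss))
\end{verbatim}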
\subsection{Regularization Techniques}%
\label{sub:Regularization Techniques}
Regularization refers to any change we make to a learning algorithm with the goal of lowering its generalization error but not its training error. It is a key technique for addressing \textit{overfitting} (\autoref{sub:Overfitting}) and is one of the main concerns in the field of ML, rivaled in its importance only by optimization \citep{Goodfellow-et-al-2016}. Finding effective regularization methods is a hot research area in the field.
A significant portion of machine learning involves creating various models and algorithms tailored to suit them. Techniques like cross-validation help us empirically determine the most suitable approach for a specific problem. Yet there isn't one ``best'' model that is universally applicable; this observation is often referred to as the ``no free lunch theorem''. The rationale behind this theorem is that assumptions effective in one domain may not perform well in another \citep{MurphyML}.
In light of this theorem, it becomes imperative to develop diverse models to accommodate the wide array of real-world data. Additionally, for each model, there exists a multitude of algorithms offering different tradeoffs between speed, accuracy, and complexity \citep{MurphyML}.
\subsubsection{L2 Parameter Norm Penalty (Weight Decay)} \label{sub:L1 and L2 Regularization}
Real-world data possesses intricate attributes, requiring equally intricate models to address them. While reducing the number of parameters is one approach to mitigating model complexity, it is a rather restrictive tactic: the more parameters our model has, the more interconnections there are within our neural network, and hence the greater the non-linearities that are crucial for representing complex data.
Yet, we must exercise caution to prevent these interconnections from spiraling out of control. Thus, the notion of penalizing complexity emerges as a solution. By employing weight decay, we retain our many parameters while imposing constraints to prevent excessive model complexity.
A method to impose a penalty on complexity involves adding all parameters (weights) into our loss function. However, a straightforward addition isn't viable due to the mix of positive and negative parameters, so we instead add the squares of these parameters to the loss function. Although effective, this approach might inflate the loss to the point where the optimal solution would be to set all parameters to zero. To avert this scenario, we scale down the sum of squares by a small factor, termed the \textit{weight decay} coefficient \citep{Vasani_2019}. Weight decay is a widely adopted technique in training many cutting-edge deep networks, including prominent models like \textbf{GPT-3} \citep{andriushchenko2023need}.
We provide the definition from \cite{andriushchenko2023need}. Let $(x_i,y_i)_{i = 1}^{n}$ be the training inputs and labels, where $x_i \in \mathcal{D}$, $y_i \in \mathbb{R}^{c}$, and $c$ is the number of classes. Let $h: \mathbb{R}^{p} \times \mathcal{D} \to \mathbb{R}^{c}$ be the hypothesis class of the neural network: for any parameter vector $w \in \mathbb{R}^{p}$, the function $h(w, \cdot): \mathcal{D} \to \mathbb{R}^{c}$ represents the network's predictions. The training loss $\mathcal{L}$ and the $\ell_2$-regularized training loss $\mathcal{L}_{\lambda}$ are given as:
\begin{align}
\label{eq:general_loss}
\mathcal{L}(w):= \frac{1}{n} \sum_{i=1}^{n} \ell \left(y_i, h({w}, x_i)\right), \quad \mathcal{L}_{\lambda}(w) := \mathcal{L}(w) + \frac{\lambda}{2} \|{w}\|_2^2 \, ,
\end{align}
where $\ell(\cdot,\cdot):\mathbb{R}^c \times \mathbb{R}^c \to \mathbb{R}$ denotes the chosen loss function, such as the cross-entropy loss. One then simply runs gradient descent on $\mathcal{L}_{\lambda}$.
We note that choosing an appropriate $\lambda$ is a difficult problem. No exact method exists as of today. Reasonable values, however, lie between $0$ and $0.1$ \citep{Kuhn_13}.
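As a minimal, hypothetical illustration (not the setup of \cite{andriushchenko2023need}), the sketch below applies weight decay to a linear model trained by full-batch gradient descent; \texttt{lam} plays the role of $\lambda$ above and the synthetic data is arbitrary.
\begin{verbatim}
import numpy as np

# Ridge-style weight decay on a linear model, trained with
# full-batch gradient descent on the regularized loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

w, lam, lr = np.zeros(5), 0.01, 0.1
for _ in range(500):
    grad_data = X.T @ (X @ w - y) / len(y)   # gradient of the MSE term
    grad_reg  = lam * w                      # gradient of (lam/2)*||w||^2
    w -= lr * (grad_data + grad_reg)
print(w)
\end{verbatim}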
\subsubsection{Data Augmentation}%
\label{sub:Data Augmentation}
While weight decay is a regularization method that alters the model, we now turn to a technique that alters the data instead, to achieve a similar effect. Optimizing a machine learning model's ability to generalize is often best achieved by training it on a larger dataset. However, practical constraints limit the amount of available data. To address this limitation, one strategy is to augment the training set with ``fake'' data. This approach is straightforward for classification tasks: by generating new $(x, y)$ pairs through transformations of existing inputs in the training set, we can effectively expand the dataset.
Dataset augmentation has demonstrated particular effectiveness in certain classification tasks, notably object recognition. Images, being high-dimensional, encompass a wide array of variation factors, many of which can be simulated easily. Operations such as translating, rotating, or scaling images have proven beneficial, even when the model is designed with partial translation invariance through convolution and pooling techniques.
It is important to be careful when applying transformations to the data. For instance, in optical character recognition tasks, correctly identifying distinctions like `b' versus `d' and `6' versus `9' is essential, so transformations like horizontal flips are clearly inappropriate for augmenting datasets in these cases \citep{Goodfellow-et-al-2016}.
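A minimal sketch of the idea is given below, using a toy batch of synthetic ``images'' and two transformations (a horizontal flip and a small translation) that are assumed to be label-preserving for the hypothetical task at hand.
\begin{verbatim}
import numpy as np

def augment(images, labels, rng):
    # Hypothetical augmentation step: horizontal flips and small
    # translations. Whether a flip is label-preserving depends on
    # the task (it would not be for character recognition).
    flipped = images[:, :, ::-1]                 # mirror left-right
    shifted = np.roll(images, shift=2, axis=2)   # shift 2 pixels right
    new_images = np.concatenate([images, flipped, shifted])
    new_labels = np.concatenate([labels, labels, labels])
    perm = rng.permutation(len(new_images))
    return new_images[perm], new_labels[perm]

rng = np.random.default_rng(0)
images = rng.random(size=(10, 28, 28))   # toy batch of 10 "images"
labels = rng.integers(0, 2, size=10)
aug_x, aug_y = augment(images, labels, rng)
print(images.shape, "->", aug_x.shape)
\end{verbatim}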
\subsubsection{Early Stopping}%
\label{sub:Early Stopping}
When training large models with great complexity, it's common to observe a phenomenon where \textit{training error} steadily decreases over time, while \textit{validation set} error starts to rise again, as illustrated in \autoref{fig:augmentData}. Consequently, it's possible to achieve a model with improved validation set error (and ideally, better test set error) by reverting to the parameter setting at the point when validation set error was lowest. To implement this, we save a copy of the model parameters each time validation set error improves. Upon termination of the training algorithm, we return these parameters instead of the latest ones. The algorithm concludes when no parameters show improvement over the best recorded validation error for a specified number of iterations \citep{Goodfellow-et-al-2016}.
\begin{figure}
\includegraphics{augmentData}
\caption{Graph from \cite{Goodfellow-et-al-2016}. The curves depict the evolution of the negative log-likelihood loss over the number of training iterations or epochs. Initially, the training objective steadily decreases over time as the model learns from the data, yet the validation set loss eventually starts to rise again after reaching a minimum. This pattern indicates that despite improvements in the training loss, the model's performance on unseen data begins to degrade over time, highlighting the phenomenon of overfitting.}
\label{fig:augmentData}
\end{figure}
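The procedure can be summarized in a short, hypothetical training loop; \texttt{train\_step} and \texttt{val\_error} are stand-ins for one epoch of training and the validation-set evaluation, and the toy usage below fakes a validation curve that rises again after its minimum.
\begin{verbatim}
import copy

def train_with_early_stopping(model, train_step, val_error, patience=10):
    # Stop once the validation error has not improved for `patience`
    # consecutive epochs, and return the best parameters seen so far.
    best_err = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_best = 0
    while epochs_since_best < patience:
        train_step(model)
        err = val_error(model)
        if err < best_err:
            best_err, best_model = err, copy.deepcopy(model)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
    return best_model, best_err

# Toy usage: the fake validation curve reaches its minimum at 1.5
# and then rises again, so training stops and 1.5 is returned.
model = {"w": 0.0}
history = iter([5.0, 3.0, 2.0, 1.5, 1.7, 1.9, 2.2, 2.5, 3.0])
best, err = train_with_early_stopping(
    model,
    train_step=lambda m: m.update(w=m["w"] + 0.5),
    val_error=lambda m: next(history),
    patience=3)
print(err)   # 1.5
\end{verbatim}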
\subsubsection{Dropout}%
\label{sub:Dropout}
Dropout is yet another technique to address the regularization problem. The core concept is to randomly deactivate units, along with their connections, within the neural network during training. This prevents units from becoming overly reliant on each other. Throughout training, dropout samples from an array of "thinned" networks. At the testing phase, approximating the ensemble effect of these thinned networks can be achieved by utilizing a single unthinned network with reduced weights. This approach significantly mitigates overfitting and surpasses other regularization techniques. Dropout has demonstrated its efficacy in enhancing the performance of neural networks across various supervised learning tasks, including vision, speech recognition, document classification, and computational biology. Notably, it has achieved state-of-the-art results on numerous benchmark data sets \citep{JMLR:v15:srivastava14a}.
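A minimal sketch of (inverted) dropout applied to one layer's activations is shown below, assuming a drop probability \texttt{p\_drop}; it mirrors the idea rather than any particular library's implementation.
\begin{verbatim}
import numpy as np

def dropout(activations, p_drop, rng, training=True):
    # Inverted dropout: units are zeroed with probability p_drop during
    # training and the survivors are rescaled, so no change is needed
    # at test time.
    if not training:
        return activations
    keep = rng.random(activations.shape) >= p_drop
    return activations * keep / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones((2, 8))                 # toy hidden-layer activations
print(dropout(h, p_drop=0.5, rng=rng))
\end{verbatim}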
\section{Conclusion}%
\label{sec:Conclusion}
As we've journeyed through the amazing realm of supervised machine learning, we've seen how powerful it can be. Through simple mathematics and precise engineering of neural networks, supervised learning algorithms can extract information from labelled data to make inferences about unseen data, behaving like an intelligent being. This is where machine learning diverges from statistics, which is often focused on making inference from observed data, while ML is purposefully structured for making inference on unseen data. Throughout, we covered the foundational concepts of neural networks and their construction, progressing to the more advanced techniques of backpropagation and gradient descent. We provided brief surveys of activation and loss functions for the reader's reference. We also addressed key challenges in machine learning, such as overfitting and underfitting, along with techniques to mitigate them.
Supervised learning has a significant impact on today's world. It drives many different applications, including natural language processing, autonomous cars, and medical diagnostics. The ability to learn from labelled data has enabled the creation of systems that can recognize speech, translate languages, detect fraud, and recommend products with remarkable accuracy.
It becomes evident, however, that the reliance of supervised learning on carefully labelled datasets poses a significant limitation. The need for abundant, high-quality data, paired with corresponding inputs and desired outputs, can be restrictive and often impractical in real-world scenarios. This limitation becomes particularly apparent when faced with complex tasks where defining the desired outputs may be challenging or even impossible.
This is the basis of the power of reinforcement learning (RL), an approach that learns from interacting directly with the environment, surpassing the limitations of supervised learning. In RL, agents move through a set of options, learning the best methods by making mistakes and getting feedback in the form of rewards. Due to its self-learning mechanism, RL can now address issues for which explicit supervision is impractical, providing access to previously uncharted territory for conventional machine learning techniques. The field of RL could be of great interest to the reader, and we recommend the textbook by \cite{Sutton1998} for further readings.
In this paper, we focused on the scientific view of ML, but a great deal can also be said about its philosophical impacts. The concept of supervised learning represents a change in how we see intelligence and knowledge. Some once believed that intelligence was a quality exclusive to humans, one that was demonstrated by one's capacity for understanding, reasoning, and experience-based learning. Supervised learning could be said to challenge this notion by demonstrating that machines can also exhibit these capabilities. This highlights a new paradigm where knowledge emerges from the interaction between algorithms and data. However, it could also be argued that social networks, economies, and evolutionary systems, all behave in a manner which could be deemed intelligent. This raises the question of who defines \textit{intelligence}. Many systems, like AI, are equilibria of optimization problems, and some argue that everything, even life itself, can be viewed as an optimization problem. Given the complexity of life, we claim that it is actually an \textit{NP-hard} optimization problem and leave the proof to the reader.
\section*{Acknowledgments}
I want to express my gratitude to the \textit{reader} who made it this far, possibly being the only one to do so. I also thank the \href{https://www.math.mcgill.ca/gsams/drp/}{McGill Directed Reading Program} for the opportunity, support, and resources. Most importantly, I thank Yanees Dobberstein for his criticism and many insights.
{\footnotesize
\bibliography{ML_Report_DRP2024}
}
%\bibliographystyle{apa}
\bibliographystyle{plainnat}
%\bibliography{DeepRL}
%\bibliographystyle{iclr2017_conference}
%\bibliographystyle{apa}
\end{document}
%ArXiv e-prints