Minor lecture updates #44

Merged · 1 commit · Dec 4, 2023
8 changes: 5 additions & 3 deletions docs/lectures/feature_extraction.md
@@ -29,7 +29,7 @@ In a one-hot encoded vector, each word in the vocabulary $V$ is assigned a uniqu

The dimension of a feature vector $x$ is equal to the size of the vocabulary $|V|$:

$$ dim(x) = |V| $$.
$$ dim(x) = |V| $$

Here is a coding example:

@@ -63,7 +63,7 @@ array([2, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0])

!!! info

In Bow, the values in the vector can be integers representing word counts or real numbers representing [TF-IDF weights](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
In BoW, the values in the vector can be integers representing word counts or real numbers representing [TF-IDF weights](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

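As a quick illustration of both options, here is a minimal scikit-learn sketch; the toy corpus and the use of `CountVectorizer` and `TfidfVectorizer` are illustrative assumptions, not the lecture's own example:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, for illustration only.
corpus = [
    "I am happy because I am learning NLP",
    "I am sad, I am not learning NLP",
]

# Integer word counts (classic BoW).
count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())
print(counts.toarray())

# Real-valued TF-IDF weights over the same vocabulary.
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(corpus)
print(weights.toarray().round(2))
```

Each row corresponds to one document; the first matrix contains integer counts, while the second contains real-valued weights.
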
!!! warning "Dimensionality"

@@ -138,7 +138,9 @@ We can observe that some words take clear sides, like `happy` and `sad`, while o
- Numpy array
- ...

Based on this table, we can build the feature vector for a document $i$ as follows:
Based on this table, we can build the feature vector for a document $i$ by summing up the positive and negative frequencies of each word in the document.

Considering the **bias unit** as the first feature, the feature vector $x_i$ looks like this:

<!-- prettier-ignore-start -->
$$ x_i = [1, \sum_{j=1}^{m} n_{pos}(w_j), \sum_{j=1}^{m} n_{neg}(w_j)] $$
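
A minimal sketch of how such a feature vector could be computed, assuming a hypothetical `freqs` dictionary that maps `(word, class)` pairs to their counts in the training corpus (all names and numbers below are illustrative):

```python
import numpy as np

# Hypothetical frequency table: (word, class) -> count in the training corpus,
# where class 1 stands for positive and class 0 for negative.
freqs = {
    ("happy", 1): 12, ("happy", 0): 1,
    ("sad", 1): 0,    ("sad", 0): 9,
    ("movie", 1): 5,  ("movie", 0): 6,
}


def extract_features(tokens, freqs):
    """Build x_i = [bias, sum of positive counts, sum of negative counts]."""
    pos_sum = sum(freqs.get((word, 1), 0) for word in tokens)
    neg_sum = sum(freqs.get((word, 0), 0) for word in tokens)
    return np.array([1.0, pos_sum, neg_sum])  # bias unit as the first feature


print(extract_features(["sad", "movie"], freqs))  # -> [ 1.  5. 15.]
```
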
40 changes: 23 additions & 17 deletions docs/lectures/logistic_regression.md
@@ -6,13 +6,13 @@ In **classification**, the goal is to predict a discrete class label, such as "s

!!! info

In the assignment, we will conduct a **sentiment analysis on tweets using logistic regression**, and try to predict whether a tweet has an overall positive or negative meaning.
In the assignment, we will conduct a **sentiment analysis on tweets using logistic regression**, and try to predict whether a tweet has an overall **positive or negative meaning**.

## Supervised Learning

In **supervised learning**, the goal is to learn a function that maps an input to an output based on example input-output pairs.
In **supervised learning**, the goal is to **learn a function**, i.e. the parameters of a function, that **maps an input to an output** based on example input-output pairs.

When working with text data, we need to make sure to apply the required preprocessing steps in order to extract the required features from the text data.
When working with **text data**, we need to make sure to apply the required **preprocessing** steps in order to extract the required features from the text data.

![Supervised learning overview](../img/supervised-learning-overview.drawio.svg)

@@ -56,17 +56,19 @@ On the other hand, as $z$ approaches $\infty$, the denominator of the sigmoid fu

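A quick numeric check of this limit behaviour (a small sketch, not part of the assignment code):

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, round(float(sigmoid(z)), 4))
# The output approaches 0 for large negative z and 1 for large positive z,
# with sigmoid(0) = 0.5 in between.
```
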
## Training

In logistic regression, the goal is typically binary classification, where the algorithm learns to classify input data into one of two classes.
During the training phase, the algorithm adjusts its parameters (weights and bias) based on the input data in order to minimize the difference between its predictions and the actual labels in the training dataset.
In logistic regression, the goal is typically **binary classification**, where the algorithm learns to classify input data into one of two classes.
During the **training phase**, the algorithm **adjusts its parameters** (weights and bias) based on the input data in order to **minimize the difference between its predictions and the actual labels** in the training dataset.

The process involves using an optimization algorithm (usually gradient descent) to find the optimal values for the parameters that minimize a cost function.
The cost function measures the difference between the predicted outputs and the true labels.
The process involves using an **optimization algorithm** (usually gradient descent) to find the optimal values for the parameters that **minimize a cost function**.
The cost function measures the **difference** between the predicted outputs and the true labels.
The training process continues iteratively until the algorithm converges to a set of parameters that yield satisfactory predictions on the training data.

### Gradient Descent

Training in logistic regression is based on the **gradient descent** algorithm. Here is a brief overview of the steps involved:

![Logistic regression gradient descent steps](../img/logistic-regression-gradient-descent.drawio.svg)

1. **Initialize Parameters:** Set the initial values for the weights $\theta$, considering the **bias term**.
This is the starting point for the optimization process.

@@ -90,9 +92,8 @@
6. **Repeat:** Iterate steps 2-5 until the convergence criteria are met, such as reaching a maximum number of iterations or achieving a sufficiently small change in the cost function.

These steps represent the core of the gradient descent algorithm in the context of logistic regression.
The goal is to iteratively update the parameters in the direction that minimizes the cost function, eventually reaching a set of parameters that optimally fit the training data.

![Logistic regression gradient descent steps](../img/logistic-regression-gradient-descent.drawio.svg)
The goal is to **iteratively update the parameters** in the direction that minimizes the cost function, eventually reaching a set of **parameters that optimally fit the training data**.

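The following NumPy sketch puts these six steps together. It is a minimal illustration under assumed conventions, i.e. a feature matrix `X` whose first column is the bias unit and labels `y` in {0, 1}, and is not the assignment's reference implementation:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Minimal batch gradient descent for logistic regression.

    X: (m, n) feature matrix whose first column is the bias unit (all ones).
    y: (m, 1) label vector with entries in {0, 1}.
    """
    m = X.shape[0]
    theta = np.zeros((X.shape[1], 1))      # 1. initialize parameters
    for _ in range(num_iters):             # 6. repeat for a fixed number of iterations
        h = sigmoid(X @ theta)             # 2. sigmoid predictions for all samples
        cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))  # 3. cost
        grad = (X.T @ (h - y)) / m         # 4. gradient of the cost
        theta = theta - alpha * grad       # 5. update the weights
    return theta, cost


# Toy usage with hypothetical [bias, positive, negative] features:
X = np.array([[1.0, 3.0, 1.0], [1.0, 1.0, 4.0], [1.0, 4.0, 2.0], [1.0, 2.0, 5.0]])
y = np.array([[1.0], [0.0], [1.0], [0.0]])
theta, final_cost = gradient_descent(X, y)
print(theta.ravel(), final_cost)
```
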
!!! info "Bias Term"

@@ -101,7 +102,7 @@ The goal is to iteratively update the parameters in the direction that minimizes

!!! info "Learning Rate"

The **learning rate $\alpha$** is a hyperparameter that controls the step size at each iteration while moving toward a minimum of the cost function.
The **learning rate $\alpha$** is a hyperparameter that controls the **step size at each iteration** while moving toward a minimum of the cost function.
It is a crucial parameter in optimization algorithms that are used to train machine learning models.

If the learning rate is too small, the algorithm may take a long time to converge or may get stuck in a local minimum.
@@ -113,7 +114,7 @@ The goal is to iteratively update the parameters in the direction that minimizes

### Cost Function

The cost function measures the difference between the predicted labels and the actual class labels (aka cost).
The cost function measures the **difference** between the **predicted labels** and the **actual class labels** (aka cost).

The cost function $J(\theta)$ in logistic regression is given as

@@ -133,6 +134,11 @@ When performing gradient descent, the cost should **decrease with every iteratio

![Gradient descent cost vs. number of iterations](../img/gradient-descent-cost-vs-iteration.png)

!!! info

For our purposes, we will not go through the derivation of the cost function.
However, if you are interested, you can read more about it [here](https://ml-explained.com/blog/logistic-regression-explained).

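As a small sketch, assuming the standard binary cross-entropy form of $J(\theta)$, we can check that better predictions indeed yield a lower cost:

```python
import numpy as np


def compute_cost(h, y):
    """Binary cross-entropy: J = -(1/m) * sum(y * log(h) + (1 - y) * log(1 - h))."""
    m = y.shape[0]
    return float(-np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m)


y = np.array([1.0, 0.0, 1.0])
print(compute_cost(np.array([0.6, 0.4, 0.7]), y))   # ~0.46 (mediocre predictions)
print(compute_cost(np.array([0.9, 0.1, 0.95]), y))  # ~0.09 (better predictions, lower cost)
```
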
### Compute Gradient and Update Weights

Computing the gradient and updating the weights can happen in one step.
@@ -155,7 +161,7 @@ where
- $\mathbf{h}$ is the vector of outputs of the sigmoid function for all samples.
- $\mathbf{y}$ is the vector of training labels.

So the term $\mathbf{h - y}$ is essentially the vector of errors, representing the difference between the predicted values and the actual values.
So the term $\mathbf{h - y}$ is essentially the **vector of errors**, representing the difference between the predicted values and the actual values.
For the equation to work, we need to transpose the matrix of input features $\mathbf{X}$.

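A compact sketch of this combined step; the shapes and names are assumptions consistent with the loop sketched above:

```python
import numpy as np


def update_step(X, y, theta, alpha):
    """One combined gradient computation and weight update.

    X: (m, n) feature matrix, y: (m, 1) labels, theta: (n, 1) weights.
    """
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # sigmoid outputs for all samples
    error = h - y                            # the vector of errors (h - y)
    gradient = (X.T @ error) / m             # X is transposed to match dimensions
    return theta - alpha * gradient          # gradient descent update of theta
```
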
## Testing
@@ -199,8 +205,8 @@ The figure in the [supervised learning](#supervised-learning) section also indic

## Key Takeaways

- Logistic regression is a supervised learning algorithm that can be used for classification.
- The prediction function in logistic regression is a sigmoid function that outputs a probability value between 0 and 1.
- The cost function measures the difference between the predicted labels and the actual class labels.
- The goal is to minimize the cost function.
- Training in logistic regression is based on the gradient descent algorithm.
- Logistic regression is a **supervised learning** algorithm that can be used for classification.
- The **prediction function** in logistic regression is a **sigmoid function** that outputs a probability value between 0 and 1.
- The **cost function** measures the difference between the predicted labels and the actual class labels.
- The goal is to **minimize** the cost function.
- Training in logistic regression is based on the **gradient descent** algorithm.
12 changes: 6 additions & 6 deletions docs/lectures/preprocessing.md
@@ -363,9 +363,9 @@ The choice between these techniques depends on the specific NLP task and the des

## Key Takeaways

- The NLP pipeline is a systematic approach to solving NLP problems by breaking them down into distinct steps.
- Many times, the success of an NLP project is determined already before the actual modeling step. Preprocessing and data acquisition play an important role, and in practice, much effort is spent on these steps.
- Text cleaning and normalization are essential steps in the NLP pipeline that help standardize the text and make it ready for further analysis.
- Text cleaning involves removing any elements from the text that are considered irrelevant, noisy, or potentially problematic for downstream NLP tasks.
- Text normalization involves transforming the text to a standard or canonical form, making it consistent and easier to work with.
- It depends on the specific NLP task which text cleaning and normalization techniques are appropriate.
- The NLP **pipeline** is a systematic approach to solving NLP problems by breaking them down into **distinct steps**.
- Many times, the success of an NLP project is already determined before the actual modeling step. **Preprocessing** and data acquisition play an important role, and **in practice, much effort** is spent on these steps.
- Text cleaning and normalization are essential steps in the NLP pipeline that help **standardize the text** and make it ready for further analysis.
- **Text cleaning** involves removing any elements from the text that are considered irrelevant, noisy, or potentially problematic for downstream NLP tasks.
- **Text normalization** involves transforming the text to a **standard** or canonical **form**, making it consistent and easier to work with.
- It **depends** on the specific NLP task which text cleaning and normalization techniques are appropriate. We need clear **requirements** to decide which techniques to apply; the sketch below shows one possible combination.
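
The regular expressions and the choice of NLTK stopword removal and Porter stemming in this sketch are illustrative assumptions rather than a prescribed pipeline:

```python
import re
import string

from nltk.corpus import stopwords        # requires the NLTK "stopwords" data
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # requires the NLTK "punkt" tokenizer data


def preprocess(text):
    # Cleaning: remove URLs and user mentions, keep hashtag words but drop the '#'.
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = text.replace("#", "")
    # Normalization: lowercase, tokenize, drop stopwords and punctuation, stem.
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in stop_words and t not in string.punctuation]


print(preprocess("Learning #NLP is great! https://example.com"))  # -> ['learn', 'nlp', 'great']
```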