diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 7e8ef31a..70726084 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -24,5 +24,5 @@ jobs:
github_token: ${{ secrets.GITHUB_TOKEN }}
enable_jekyll: true
allow_empty_commit: true
- publish_dir: .
+ publish_dir: './webified/'
exclude_assets: '.github'
diff --git a/notebooks/20_markov_decision_processes_part_2/images/MDP_vs_RL2.jpg b/notebooks/20_markov_decision_processes_part_2/images/MDP_vs_RL2.jpg
new file mode 100644
index 00000000..b6683545
Binary files /dev/null and b/notebooks/20_markov_decision_processes_part_2/images/MDP_vs_RL2.jpg differ
diff --git a/notebooks/20_markov_decision_processes_part_2/images/fixed_policy2.png b/notebooks/20_markov_decision_processes_part_2/images/fixed_policy2.png
new file mode 100644
index 00000000..9ef924c7
Binary files /dev/null and b/notebooks/20_markov_decision_processes_part_2/images/fixed_policy2.png differ
diff --git a/notebooks/20_markov_decision_processes_part_2/images/policy_evaluation.png b/notebooks/20_markov_decision_processes_part_2/images/policy_evaluation.png
new file mode 100644
index 00000000..0ce54a0b
Binary files /dev/null and b/notebooks/20_markov_decision_processes_part_2/images/policy_evaluation.png differ
diff --git a/notebooks/20_markov_decision_processes_part_2/images/policy_extraction.png b/notebooks/20_markov_decision_processes_part_2/images/policy_extraction.png
new file mode 100644
index 00000000..44a8d5ea
Binary files /dev/null and b/notebooks/20_markov_decision_processes_part_2/images/policy_extraction.png differ
diff --git a/notebooks/20_markov_decision_processes_part_2/images/policy_iteration.png b/notebooks/20_markov_decision_processes_part_2/images/policy_iteration.png
new file mode 100644
index 00000000..b26f35b5
Binary files /dev/null and b/notebooks/20_markov_decision_processes_part_2/images/policy_iteration.png differ
diff --git a/notebooks/20_markov_decision_processes_part_2/images/value_iteration.png b/notebooks/20_markov_decision_processes_part_2/images/value_iteration.png
new file mode 100644
index 00000000..c3064e8c
Binary files /dev/null and b/notebooks/20_markov_decision_processes_part_2/images/value_iteration.png differ
diff --git a/notebooks/20_markov_decision_processes_part_2/images/value_iteration_numbers.png b/notebooks/20_markov_decision_processes_part_2/images/value_iteration_numbers.png
new file mode 100644
index 00000000..78a6fddf
Binary files /dev/null and b/notebooks/20_markov_decision_processes_part_2/images/value_iteration_numbers.png differ
diff --git a/notebooks/20_markov_decision_processes_part_2/index.md b/notebooks/20_markov_decision_processes_part_2/index.md
new file mode 100644
index 00000000..5ce1c986
--- /dev/null
+++ b/notebooks/20_markov_decision_processes_part_2/index.md
@@ -0,0 +1,670 @@
+
+ Markov Decision Processes - Part 2 (From Value Iteration Till The End)
+
+
+ Table of Contents
+---
+
+- [Introduction](#introduction)
+- [Value Iteration](#value_iteration)
+ - [Convergence of Value Iteration](#vi_convergence)
+ - [Contraction](#vi_cont)
+ - [Error Bound](#vi_error)
+ - [Value Iteration Pseudocode](#vi_code)
+ - [Time Complexity](#time_vi)
+- [Policy Iteration](#pi)
+ - [The Idea of Policy Iteration](#pi_id)
+ - [Policy Evaluation](#pe)
+ - [Fixed Policy](#fp)
+ - [Policy Extraction (Improvement)](#pex)
+ - [Computing Actions from Values](#pi_cafv)
+ - [Computing Actions from Q-Values](#pi_cafq)
+ - [Policy Iteration Summary](#pis)
+ - [Policy Iteration Pseudocode](#pi_code)
+- [Conclusion](#conclusion)
+- [References](#references)
+
+
+
+ Introduction
+---
+
+
+
+In the previous lecture, we became familiar with Markov Decision Processes (MDPs). An MDP is a mathematical framework for modeling decision-making problems in which an agent can take actions, but the outcomes are stochastic and not entirely under its control. The goal is to decide on the best action to take given the current state. It is helpful to understand MDPs before studying Reinforcement Learning.
+
+In this lecture, we will talk about two methods for solving MDPs: value iteration and policy iteration. Stay with us.
+
+
+
+
+ Value Iteration
+---
+
+![Value iteration](images/value_iteration.png)
+
+The Bellman equation is the basis of the **value iteration** algorithm for solving MDPs. We would like to solve these simultaneous equations to find the utilities. One thing to try is an iterative approach: we start with arbitrary initial values for the utilities, calculate the right-hand side of the equation, and plug it into the left-hand side, thereby updating the utility of each state from the utilities of its neighbors. We repeat this until we reach an equilibrium.
+
+Let $U_{i}$ be the utility function at the $i$th iteration, so that $U_{i}(s)$ gives the utility value of state $s$. The iteration step, called a **Bellman update**, looks like this:
+
+$$
+U_{i+1}(s) = \max_{a \in A(s)} \sum_{s^\prime}^{} P(s^{\prime}|s,a)[R(s,a,s^{\prime})+\gamma U_{i}(s^{\prime})]
+$$
+where the update is assumed to be applied simultaneously to all the states at each iteration. If we apply the Bellman update infinitely often, we are guaranteed to reach an equilibrium. In fact, the resulting utilities are the unique solutions of the Bellman equations, and the corresponding policy is optimal.
+
+
+
+ Convergence of Value Iteration
+
+
+
+
+Suppose we view the Bellman update as an operator $B:\mathbb{R}^{\lvert S\rvert}\rightarrow \mathbb{R}^{\lvert S\rvert}$ that maps utility functions to utility functions. Then the Bellman equation becomes $U^{\star} = B U^{\star}$, where $U^{\star}$ is the optimal utility function, and the Bellman update can be written as $U_{i+1} = B U_{i}$.
+
+
+
+> Contraction
+
+The basic concept used in showing that value iteration converges is the notion of a **contraction**. An important fact about the Bellman operator is that it is a contraction. An operator $f:\mathbb{R}^{n}\rightarrow \mathbb{R}^{n}$ is said to be an $\alpha$-contraction if $\alpha \in (0,1)$ and
+
+$$
+\forall x,y \in \mathbb{R}^{n}, \quad \Vert f(x)-f(y)\Vert \leq \alpha \Vert x-y \Vert
+$$
+The **Banach fixed-point** theorem guarantees that if $f$ is a contraction, then $f$ has a unique
+fixed point ($x$ is a fixed point of $f$ if $f(x) = x$) $x^\star \in \mathbb{R}^{n}$ satisfying
+
+$$
+x^\star = f(x^\star) = \lim_{k \to \infty} f^{k}(x) \quad \forall x \in \mathbb{R}^{n}
+$$
+where $f^{k+1}(x) = f(f^{k}(x))$ for $k > 0$ and $f^{1}(x) = f(x)$.
+
+Hence, two important properties of contractions are
+- A contraction has only one fixed point; if there were two fixed points, they would not get closer together when the function was applied, so it would not be a contraction.
+- When the operator is applied to any argument, the output must get closer to the fixed point (because the fixed point does not move), so repeated application of a contraction always reaches the fixed point in the limit.
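+
+As a quick illustration (a minimal sketch, not from the lecture), consider the map $f(x) = 0.5x + 1$, which is a $0.5$-contraction on $\mathbb{R}$ with unique fixed point $x^{\star} = 2$; repeated application from any starting point converges to that fixed point:
+
+```python
+# f(x) = 0.5*x + 1 is a 0.5-contraction; its unique fixed point is x* = 2.
+def f(x):
+    return 0.5 * x + 1
+
+x = 40.0                        # arbitrary starting point
+for k in range(1, 11):
+    x = f(x)                    # each application halves the distance to x* = 2
+    print(k, x, abs(x - 2.0))   # the distance shrinks by the factor alpha = 0.5
+```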
+
+We will use the **max norm** to measure distances between utility functions:
+
+$$
+\Vert U\Vert = \max_{s} \lvert U(s)\lvert
+$$
+Now, we want to prove that the Bellman operator is a $\gamma$-contraction in the max norm. To prove this, we use the
+following Lemma:
+
+$$
+\lvert \max_{x} f(x) - \max_{x} g(x) \lvert \leq \max_{x} \lvert f(x) - g(x) \lvert
+$$
+To see this, suppose $\max_{x} f(x) > \max_{x} g(x)$ (the other case is symmetric) and let $b = \underset{x}{\operatorname{argmax}} f(x)$. Then
+
+$$
+\begin{aligned}
+\lvert \max_{x} f(x) - \max_{x} g(x) \lvert & = f(b) - \max_{x} g(x) \\\\
+& \leq f(b) - g(b) \\\\
+& \leq \max_x (f(x) - g(x)) \leq \max_{x} \lvert f(x) - g(x) \lvert\\\\
+\Rightarrow \lvert \max_{x} f(x) - \max_{x} g(x) \lvert & \leq \max_{x} \lvert f(x) - g(x) \lvert
+\end{aligned}
+$$
+
+Let $U_i$ and $U^{\prime}_{i}$ be any two utility functions. Then we have
+
+$$
+\begin{aligned}
+\Vert B U_i - B U_{i}^{\prime}\Vert & = \max_{s} \lvert B U_{i}(s) - B U_{i}^{\prime}(s) \lvert \\\\
+& = \max_{s} \Big \lvert \max_{a \in A(s)} \sum_{s^\prime}^{} P(s^{\prime}|s,a)[R(s,a,s^{\prime})+\gamma U_{i}(s^{\prime})] - \max_{a \in A(s)} \sum_{s^\prime}^{} P(s^{\prime}|s,a)[R(s,a,s^{\prime})+\gamma U_{i}^{\prime}(s^{\prime})] \Big \lvert \\\\
+& \leq \max_{s} \max_{a \in A(s)} \Big \lvert \sum_{s^\prime}^{} P(s^{\prime}|s,a)[R(s,a,s^{\prime})+\gamma U_{i}(s^{\prime})] - \sum_{s^\prime}^{} P(s^{\prime}|s,a)[R(s,a,s^{\prime})+\gamma U_{i}^{\prime}(s^{\prime})] \Big \lvert & \text{(Lemma)}\\\\
+& = \max_{s} \max_{a \in A(s)} \Big \lvert \gamma \sum_{s^\prime}^{} P(s^{\prime}|s,a)[U_{i}(s^{\prime}) - U_{i}^{\prime}(s^{\prime})] \Big \lvert \\\\
+& \leq \max_{s} \max_{a \in A(s)} \Big \lvert \gamma \Big (\sum_{s^\prime}^{} P(s^{\prime}|s,a)\Big ) \max_{s^\prime} \Big ( U_{i}(s^{\prime}) - U_{i}^{\prime}(s^{\prime}) \Big )\Big \lvert \\\\
+& = \Big \lvert \gamma \max_{s^\prime} \Big ( U_{i}(s^{\prime}) - U_{i}^{\prime}(s^{\prime}) \Big )\Big \lvert \\\\
+& \leq \gamma \max_{s^\prime} \Big \lvert U_{i}(s^{\prime}) - U_{i}^{\prime}(s^{\prime}) \Big \lvert \\\\
+\Rightarrow \Vert B U_i - B U_{i}^{\prime}\Vert & \leq \gamma \Big \Vert U_{i} - U_{i}^{\prime}\Big \Vert
+\end{aligned}
+$$
+
+That is, the Bellman operator is a $\gamma$-contraction on the space of utility functions. The fixed point of the Bellman operator is $U^{\star}$. Hence, from the properties of contractions in general, it follows that value iteration always converges to $U^{\star}$ whenever $\gamma < 1$.
+
+
+
+> Error Bound
+
+We cannot run infinitely many iterations to converge to $U^{\star}$. If we view $\Vert U_{i+1} - U^{\star} \Vert$ as the error, we want to relate a bound on the error to a bound on $\Vert U_{i+1} - U_{i} \Vert$. If $\Vert U_{i+1} - U_{i} \Vert \leq \delta$, we have
+
+$$
+\begin{aligned}
+\Vert U_{i+1} - U^{\star} \Vert & = \max_{s} \lvert U_{i+1}(s) - U^{\star}(s) \lvert \\\\
+& = \max_{s} \Big \lvert (U_{i+1}(s) - U_{i+2}(s)) + (U_{i+2}(s) - U_{i+3}(s)) + ...\Big \lvert \\\\
+& \leq \max_{s} \Big ( \lvert U_{i+1}(s) - U_{i+2}(s)\lvert + \lvert U_{i+2}(s) - U_{i+3}(s)\lvert + ... \Big ) \\\\
+& \leq \max_{s} \lvert U_{i+1}(s) - U_{i+2}(s)\lvert + \max_{s} \lvert U_{i+2}(s) - U_{i+3}(s)\lvert + ... \\\\
+& = \Vert B U_{i} - B U_{i+1}\Vert + \Vert B U_{i+1} - B U_{i+2}\Vert + ... \\\\
+& \leq \gamma \Vert U_{i} - U_{i+1}\Vert + \gamma^{2} \Vert U_{i} - U_{i+1}\Vert + ... \\\\
+& = \frac{\gamma}{1 - \gamma} \Vert U_{i} - U_{i+1}\Vert & \text{for } \gamma < 1\\\\
+& \leq \frac{\gamma}{1 - \gamma} \delta
+\end{aligned}
+$$
+Thus, if $\delta \leq \frac{1 - \gamma}{\gamma} \epsilon$, then $\Vert U_{i+1} - U^* \Vert \leq \epsilon$.
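+
+For instance (a worked example with numbers of our choosing), with $\gamma = 0.9$ and a target error of $\epsilon = 0.01$, the stopping rule becomes
+
+$$
+\delta \leq \frac{1 - 0.9}{0.9} \times 0.01 \approx 0.0011,
+$$
+
+so we can stop as soon as successive utility functions differ by at most about $0.0011$ in the max norm.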
+
+
+
+ Value Iteration Pseudocode
+
+
+
+**function** Value-Iteration(MDP,$\epsilon$) returns a utility function
+ **inputs**: MDP, an MDP with states $S$ , actions $A(s)$, transition model $P(s^{\prime}|s,a)$,
+ rewards $R(s,a,s^{\prime})$, discount $\gamma$
+ $\epsilon$, the maximum error allowed in the utility of any state
+ **local variables**: $U$, $U^{\prime}$, utility functions for states in S, initially zero for all states
+                          $\delta$, the maximum change in the utility of any state in an iteration
+ **repeat**
+ $U \leftarrow U^{\prime}$;$\delta \leftarrow 0$
+ **for each** state s **in** S **do**
+ $U^{\prime}(s) \leftarrow \max_{a \in A(s)} \sum_{s^\prime}^{} P(s^{\prime}|s,a)[R(s,a,s^{\prime})+\gamma U(s^{\prime})]$
+ **if** $\lvert U^{\prime}(s) - U(s)\lvert > \delta$ **then** $\delta \leftarrow \lvert U^{\prime}(s) - U(s)\lvert$
+ **until** $\delta \leq \frac{1 - \gamma}{\gamma} \epsilon$
+ **return** $U$
+
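+The pseudocode above translates almost line for line into Python. The following is a minimal sketch, not the official course code; the dictionary-based MDP representation and the tiny two-state example are our own assumptions:
+
+```python
+def value_iteration(states, actions, P, R, gamma, epsilon):
+    """Value iteration as in the pseudocode above.
+
+    P[s][a] is a list of (next_state, probability) pairs and
+    R(s, a, s2) returns the reward for that transition."""
+    U1 = {s: 0.0 for s in states}            # U', initially zero for all states
+    while True:
+        U = dict(U1)                         # U <- U'
+        delta = 0.0
+        for s in states:
+            U1[s] = max(                     # Bellman update
+                sum(p * (R(s, a, s2) + gamma * U[s2]) for s2, p in P[s][a])
+                for a in actions(s)
+            )
+            delta = max(delta, abs(U1[s] - U[s]))
+        if delta <= epsilon * (1 - gamma) / gamma:
+            return U
+
+# A tiny two-state example (hypothetical numbers): from 'A', action 'go'
+# reaches 'B' with probability 0.8 and stays in 'A' with probability 0.2.
+states = ['A', 'B']
+actions = lambda s: ['go', 'stay']
+P = {
+    'A': {'go': [('B', 0.8), ('A', 0.2)], 'stay': [('A', 1.0)]},
+    'B': {'go': [('A', 1.0)], 'stay': [('B', 1.0)]},
+}
+R = lambda s, a, s2: 1.0 if s2 == 'B' else 0.0   # reward for landing in B
+print(value_iteration(states, actions, P, R, gamma=0.9, epsilon=1e-4))
+```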
+
+
+> Time Complexity
+
+We want to calculate the number of iterations required to reach an error of at most $\epsilon$. Suppose that the rewards are bounded by $\pm R_{max}$. If $\gamma < 1$, we have
+
+$$
+\begin{aligned}
+U(s_{0},a_{0},s_{1},\dots) & = \sum_{t=0}^{\infty} \gamma^{t} R(s_{t},a_{t},s_{t+1}) \\\\
+& \leq \sum_{t=0}^{\infty} \gamma^{t} R_{max} \\\\
+U(s) & \leq \frac{R_{max}}{1-\gamma}
+\end{aligned}
+$$
+
+If we run for $N$ iterations, we have
+$$
+\begin{aligned}
+\Vert U_{N} - U^* \Vert & = \Vert B U_{N-1} - B U^{\star} \Vert \\\\
+& \leq \gamma \Vert U_{N-1} - U^{\star} \Vert \\\\
+& = \gamma \Vert B U_{N-2} - B U^{\star} \Vert \\\\
+& \leq \gamma^{2} \Vert U_{N-2} - U^{\star} \Vert \\\\
+& \leq \gamma^{N} \Vert U_{0} - U^{\star} \Vert \\\\
+& = \gamma^{N} \Vert U^{\star} \Vert & \forall s \text{ } U_{0}(s) = 0 \\\\
+& \leq \gamma^{N} \frac{R_{max}}{1-\gamma}
+\end{aligned}
+$$
+
+If $\gamma^{N} \frac{R_{max}}{1-\gamma} \leq \epsilon$, then $\Vert U_{N} - U^{\star} \Vert \leq \epsilon$. Thus,
+$$
+N = \Big \lceil \frac{\log{(\frac{R_{max}}{\epsilon (1-\gamma)})}}{\log{(\frac{1}{\gamma})}} \Big \rceil
+$$
+
+$N$ does not depend much on the ratio $\frac{\epsilon}{R_{max}}$ because value iteration converges exponentially fast, but $N$ grows rapidly as $\gamma$ gets close to 1. We can get fast convergence if we make $\gamma$ small.
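+
+As a concrete check (with numbers of our choosing), take $R_{max} = 1$ and $\epsilon = 0.001$: for $\gamma = 0.9$ the bound gives $N = \lceil \log(10{,}000)/\log(1/0.9) \rceil = 88$ iterations, while for $\gamma = 0.99$ it jumps to $N = \lceil \log(100{,}000)/\log(1/0.99) \rceil = 1146$.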
+
+![Number of iterations required as a function of the discount factor](images/value_iteration_numbers.png)
+
+The figure shows how $N$ varies with $\gamma$ for different values of the ratio $c = \frac{2\epsilon}{R_{max}}$.
+
+The time complexity of each iteration of value iteration is $O(\lvert S\rvert^{2} \lvert A\rvert)$. Thus, the overall time complexity of value iteration is $O \Big (\lvert S\rvert^{2} \lvert A\rvert \Big \lceil \frac{\log{(\frac{R_{max}}{\epsilon (1-\gamma)})}}{\log{(\frac{1}{\gamma})}} \Big \rceil \Big )$.
+
+
+
+Policy Iteration
+---
+
+
+In the previous section, we observed that it is possible to obtain an optimal policy even when the utility function estimate is inaccurate.
+If one action is clearly better than all others, then the exact magnitudes of the utilities of the states involved do not need to be precise.
+This insight suggests an alternative way to find optimal policies: policy iteration.
+
+
+The Idea of Policy Iteration
+
+
+Once a policy $\pi_0$ (which may be initialized randomly) has been improved using $U^{\pi_0}$ to yield a better policy $\pi_1$, we can then compute $U^{\pi_1}$ and improve it again to yield an even better $\pi_2$. We can thus obtain a sequence of monotonically improving policies and value functions.
+
+Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.
+
+
+
+Policy Evaluation
+
+![Policy evaluation](images/policy_evaluation.png)
+
+> Fixed Policy
+
+Back in the expectimax tree, there were several possible actions at node $s$, so we had different choices; but if the policy is fixed, there is only one action $\pi(s)$ available at node $s$. This causes an important change in the Bellman equation: there is no longer a need to take a maximum over actions.
+$$
+U_{k+1}^{\pi}(s) \leftarrow \Sigma_{s'} {P(s^{\prime}|s,\pi(s))} \Big [R(s,\pi(s),s') + {\gamma}U^{\pi}_{k}(s') \Big]
+$$
+
+![Fixed policy](images/fixed_policy2.png)
+
+**Idea:** Calculate values for the fixed policy:
+
+$U_0^\pi=0$
+
+$U_{k+1}^{\pi}(s) \leftarrow \Sigma_{s'} {P(s^{\prime}|s,\pi(s))} \Big [R(s,\pi(s),s') + {\gamma}U^{\pi}_{k}(s') \Big]$
+
+**Efficiency:** for each state we evaluate the sum above over all successor states, so each iteration takes $O(\lvert S\rvert^{2})$ time.
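+
+In code, the only change from value iteration is that the `max` over actions disappears. A minimal sketch, reusing the assumed dictionary-based MDP format from the value iteration example (`pi` is a dict mapping each state to its fixed action):
+
+```python
+def policy_evaluation(pi, U, states, P, R, gamma, epsilon=1e-4):
+    """Iteratively evaluate the fixed policy pi, starting from the utilities U."""
+    U1 = dict(U)
+    while True:
+        U = dict(U1)                         # U^pi_k
+        delta = 0.0
+        for s in states:
+            a = pi[s]                        # the single action dictated by the policy
+            U1[s] = sum(p * (R(s, a, s2) + gamma * U[s2]) for s2, p in P[s][a])
+            delta = max(delta, abs(U1[s] - U[s]))
+        if delta <= epsilon * (1 - gamma) / gamma:
+            return U1
+```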
+
+
+
+Policy Extraction (Improvement)
+
+![Policy extraction](images/policy_extraction.png)
+
+
+> Computing Actions from Values
+
+Assume we have the optimal values $U^{\star}(s)$.
+
+We need to do a mini-expectimax (one step):
+
+$$\pi^{\star}(s)=\underset{a}{\operatorname{argmax}} \Sigma_{s'} {P(s^{\prime}|s,a)} \Big [R(s,a,s') + {\gamma}U^{\star}(s') \Big]$$
+
+This is called policy extraction, since it gets the policy implied by the values.
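+
+In code, this is a single sweep over the states (a sketch, reusing the assumed MDP format from the earlier examples):
+
+```python
+def extract_policy(U, states, actions, P, R, gamma):
+    """One-step look-ahead ("mini-expectimax") that reads a policy off the utilities U."""
+    return {
+        s: max(actions(s),
+               key=lambda a: sum(p * (R(s, a, s2) + gamma * U[s2]) for s2, p in P[s][a]))
+        for s in states
+    }
+```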
+
+
+
+
+> Computing Actions from Q-Values
+
+Assume we have the optimal Q-values.
+
+This is a kind of one-step expectimax: the value $U^\pi_k(s)$ of following the current policy is at most the value obtained from the Bellman equation with the maximizing action, so we can conclude that the new policy is at least as good as the previous one.
+
+Choosing the action is then completely trivial:
+
+$$\pi^{\star}(s)=\underset{a}{\operatorname{argmax}} Q^{\star}(s,a)$$
+
+
+**Comparison:**
+Actions are easier to select from Q-values than from values.
+
+**Efficiency:**
+for each state we evaluate the sum over successor states for every action and take the max, so one step of policy extraction takes $O(\lvert S\rvert^{2} \lvert A\rvert)$ time.
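+
+For example, if the Q-values are stored in a nested dictionary `Q[s][a]` (an assumed format, not from the lecture), the extraction step is a one-liner:
+
+```python
+def extract_policy_from_q(Q):
+    # pick, for every state, the action with the largest Q-value
+    return {s: max(Q[s], key=Q[s].get) for s in Q}
+```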
+
+
+
+Policy Iteration Summary
+
+![Policy iteration](images/policy_iteration.png)
+
+The policy iteration algorithm alternates the following two steps, beginning from some initial policy $\pi_0$:
+
+- **Policy evaluation**
+
+ Calculate utilities for some fixed policy (not optimal
+ utilities!) until convergence.
+
+ For fixed current policy $\pi$, find values with policy evaluation:
+
+$$U_{k+1}^{\pi_{i}}(s) \leftarrow \Sigma_{s'} {P(s^{\prime}|s,\pi_{i}(s))} \Big [R(s,\pi_{i}(s),s') + {\gamma}U^{\pi_{i}}_{k}(s') \Big]$$
+
+
+- **Policy improvement**
+
+ Update policy using one-step look-ahead with resulting
+ converged (but not optimal!) utilities as future values.
+  This is a one-step update.
+
+ For fixed values, get a better policy using policy extraction:
+
+$$\pi_{i+1}(s)= \underset{a}{\operatorname{argmax}} \Sigma_{s'} {P(s^{\prime}|s,a)} \Big [R(s,a,s') + {\gamma}U^{\pi_{i}}(s') \Big]$$
+
+- **Repeat steps until policy converges**
+
+ The algorithm terminates when the policy improvement step yields no change in the utilities.
+
+ At this point, we know that the utility function ${U_i}$ is a fixed point of the Bellman update, so it is a solution to the Bellman equations, and ${\pi_i}$ must be an optimal policy.
+
+ Because there are only finitely many policies for a finite state space, and each iteration can be shown to yield a better policy, policy iteration must terminate.
+
+
+**Termination condition:**
+
+The algorithm stops when the utilities stop changing. At that point, the utility function ${U_i}$ is a fixed point of the Bellman update, so it is a solution to the Bellman equations, and ${\pi_i}$ must be an optimal policy.
+
+
+
+
+ Policy Iteration Pseudocode
+
+
+
+**function** Q-Value(MDP, $s, a, U$) **returns** a utility value
+    **return** $\Sigma_{s'} {P(s^{\prime}|s,a)} \Big [R(s,a,s') + {\gamma}U(s') \Big]$
+
+
+
+**function** Policy-Evaluation($\pi, U$, MDP) **returns** a utility function
+    **inputs**: MDP, an MDP with states $S$ , actions $A(s)$, transition model $P(s^{\prime}|s,a)$,
+                        rewards $R(s,a,s^{\prime})$, discount $\gamma$
+                    $\pi$, the policy to be evaluated
+                    $U$, $U^{\prime}$, utility functions for states in S under policy $\pi$
+    **local variables**: $\delta$, the maximum change in the utility of any state
+    **repeat**
+        $U \leftarrow U^{\prime}$; $\delta \leftarrow 0$
+        **for each** state s **in** S **do**
+            $U^{\prime}(s) \leftarrow \sum_{s^\prime}^{} P(s^{\prime}|s,\pi(s))[R(s,\pi(s),s^{\prime})+\gamma U(s^{\prime})]$
+            **if** $\lvert U^{\prime}(s) - U(s)\lvert > \delta$ **then** $\delta \leftarrow \lvert U^{\prime}(s) - U(s)\lvert$
+    **until** $\delta \leq \frac{1 - \gamma}{\gamma} \epsilon$
+    **return** $U$
+
+
+
+**function** Policy-Iteration(MDP) **returns** a policy
+    **inputs**: MDP, an MDP with states $S$ , actions $A(s)$, transition model $P(s^{\prime}|s,a)$,
+                        rewards $R(s,a,s^{\prime})$, discount $\gamma$
+    **local variables**: $U$, a vector of utilities for states in $S$, initially zero
+                    $\pi$, a policy vector indexed by state, initially random
+    **repeat**
+        $U \leftarrow $ Policy-Evaluation($\pi$, $U$, MDP)
+        $unchanged? \leftarrow$ true
+        **for each** state s **in** S **do**
+            $a^{\star} \leftarrow \underset{a \in A(s)}{\operatorname{argmax}}$ Q-Value(MDP, $s, a, U$)
+            **if** Q-Value(MDP, $s, a^{\star}, U$) $>$ Q-Value(MDP, $s, \pi[s], U$) **then** **do**
+                $\pi[s] \leftarrow a^{\star}$
+                $unchanged? \leftarrow$ false
+    **until** $unchanged?$
+    **return** $\pi$
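+
+A minimal sketch of the same loop in Python, reusing the `policy_evaluation` helper and the assumed MDP format from the earlier examples (the random initial policy mirrors the pseudocode):
+
+```python
+import random
+
+def policy_iteration(states, actions, P, R, gamma, epsilon=1e-4):
+    """Alternate policy evaluation and policy improvement until the policy is stable."""
+    def q_value(s, a, U):
+        return sum(p * (R(s, a, s2) + gamma * U[s2]) for s2, p in P[s][a])
+
+    U = {s: 0.0 for s in states}
+    pi = {s: random.choice(actions(s)) for s in states}     # initially random policy
+    while True:
+        U = policy_evaluation(pi, U, states, P, R, gamma, epsilon)   # evaluation
+        unchanged = True
+        for s in states:                                             # improvement
+            best = max(actions(s), key=lambda a: q_value(s, a, U))
+            if q_value(s, best, U) > q_value(s, pi[s], U):
+                pi[s] = best
+                unchanged = False
+        if unchanged:
+            return pi
+```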
+
+
+**Optimality:**
+Policy iteration converges to an optimal policy.
+
+**Convergence:**
+It can converge much faster than value iteration under some conditions.
+
+**Efficiency:**
+Each iteration costs $O(\lvert S\rvert^{2})+O(\lvert S\rvert^{2} \lvert A\rvert)=O(\lvert S\rvert^{2} \lvert A\rvert)$, the costs of policy evaluation and policy improvement respectively.
+
+
+
+ Conclusion
+---
+
+
+In this section, we compare and summarize policy iteration and value iteration. Both are dynamic programming algorithms, both are guaranteed to converge, and, as we saw, both rely on the Bellman equations.
+
+In policy iteration, we start with a random policy and then repeatedly update the utilities under that fixed policy. This part is called policy evaluation and takes $O(|S|^{2})$ time per iteration. Next comes the policy improvement phase, which takes $O(|S|^{2}|A|)$ time: we find a better policy using a one-step look-ahead. If the policy does not change, we have reached the optimal answer.
+
+In value iteration, we start with a random value function. In each step, we improve the values and, implicitly, the policy together. We never track the policy explicitly, but taking the maximum over actions improves it as well.
+
+The value iteration algorithm is more straightforward to understand, since we only have to do one kind of update in each step. In practice, however, policy iteration often converges in fewer iterations and is much faster, in large part because policy evaluation does not need the maximization over actions. In theory, policy iteration may take as many iterations as value iteration in the worst case; the two differ only in whether we plug in a fixed policy or take a max over actions.
+
+Each of them has its pros and cons, and we can choose between them depending on the situation, but policy iteration is more commonly used.
+
+In the next part, we will turn to reinforcement learning. The most significant change in RL is that we do not know $R(s,a,s')$ and $P(s'|s,a)$, and we have to take actions in the environment to estimate them.
+
+
+![MDP vs. RL](images/MDP_vs_RL2.jpg)
+
+
+ References
+---
+
+
++ AI course taught by Dr. Rohban at Sharif University of Technology, Fall 2021
++ Russell, S. J., Norvig, P., & Davis, E. Artificial Intelligence: A Modern Approach (4th ed.). Pearson.
++ [Geeks For Geeks](https://www.geeksforgeeks.org)
++ [Towards Data Science](https://towardsdatascience.com)
++ [ai.stanford.edu](https://ai.stanford.edu/~gwthomas/notes/mdps.pdf)
++ [ai.berkeley.edu](http://ai.berkeley.edu)
diff --git a/notebooks/20_markov_decision_processes_part_2/metadata.yml b/notebooks/20_markov_decision_processes_part_2/metadata.yml
new file mode 100644
index 00000000..89372b46
--- /dev/null
+++ b/notebooks/20_markov_decision_processes_part_2/metadata.yml
@@ -0,0 +1,34 @@
+title: Markov Decision Processes - Part 2
+
+header:
+ title: Markov Decision Processes - Part 2
+ description: An Introduction to Markov Decision Processes (From Value Iteration Till The End)
+
+authors:
+ label:
+ position: top
+ kind: people
+ content:
+ - name: Mohammad Mahdi Abootorabi
+ role: Author
+ contact:
+ - link: https://github.com/aboots
+ icon: fab fa-github
+ - link: mailto:mahdi.abootorabi2@gmail.com
+ icon: fas fa-envelope
+
+ - name: Yalda Shabanzadeh
+ role: Author
+ contact:
+ - link: https://github.com/yaldashbz
+ icon: fab fa-github
+ - link: mailto:yaldashabanzadeh@gmail.com
+ icon: fas fa-envelope
+
+ - name: Amirreza Soleimanbeigi
+ role: Author
+ contact:
+ - link: https://github.com/invisible0831
+ icon: fab fa-github
+ - link: mailto:amirsoli80@gmail.com
+ icon: fas fa-envelope
diff --git a/notebooks/index.yml b/notebooks/index.yml
index 1b2133c0..fc771195 100644
--- a/notebooks/index.yml
+++ b/notebooks/index.yml
@@ -52,6 +52,8 @@ notebooks:
kind: S2021, LN, Notebook
- notebook: notebooks/16_deep_neural_networks/
kind: S2021, LN, Notebook
+ - md: notebooks/20_markov_decision_processes_part_2
+ kind: F2021, LN
#- notebook: notebooks/17_markov_decision_processes/
- notebook: notebooks/18_reinforcement_learning/
kind: S2021, LN, Notebook
\ No newline at end of file