- A variable whose values depend on the outcomes of a random event
- Uppercase letter X for random variable.
- Lowercase letter x for an observed value.
Random Variable | Possible Values | Random Events | Probabilities |
---|---|---|---|
X | 0 | coin head | P(X = 0) = 0.5 |
X | 1 | coin tail | P(X = 1) = 0.5 |
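A quick simulation sketch (assuming Python with NumPy) of this coin-flip random variable; the empirical frequencies should approach the probabilities in the table:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the random variable X many times: 0 = coin head, 1 = coin tail, each with P = 0.5.
x = rng.integers(low=0, high=2, size=100_000)

print("P(X = 0) ≈", np.mean(x == 0))  # close to 0.5
print("P(X = 1) ≈", np.mean(x == 1))  # close to 0.5
```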
- The probability density function (PDF) gives the relative likelihood that the random variable takes on a given value.
- e.g., the Gaussian distribution
- Random variable X is in the domain χ
- For continuous distribution,
$$\Large \int_{\chi} p(x)\,dx = 1$$
- For discrete distribution,
$$\Large \sum_{x \in \chi} p(x) = 1$$
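A small numerical check of both normalization conditions (assuming Python with SciPy); the Gaussian PDF and the coin-flip pmf serve as examples:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Continuous case: integrate the standard Gaussian PDF over its whole domain.
integral, _ = quad(norm.pdf, -np.inf, np.inf)
print("∫ p(x) dx =", integral)  # ≈ 1.0

# Discrete case: the coin-flip pmf sums to 1.
pmf = {0: 0.5, 1: 0.5}
print("Σ p(x) =", sum(pmf.values()))  # 1.0
```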
- Random variable X is in the domain χ
- For continuous distribution, the expectation of f(X) is:
$$\Large E[f(X)] = \int_{\chi} p(x)f(x)\,dx$$
- For discrete distribution, the expectation of f(X) is:
$$\Large E[f(X)] = \sum_{x \in \chi} p(x)f(x)$$
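A sketch of both expectation formulas (assuming Python with NumPy), with a hypothetical f(x) = x²: the discrete case uses the coin-flip pmf, and the continuous case is approximated by Monte Carlo sampling from a standard Gaussian:

```python
import numpy as np

f = lambda x: x ** 2  # a hypothetical function of the random variable

# Discrete: E[f(X)] = Σ p(x)·f(x), with the coin-flip pmf.
pmf = {0: 0.5, 1: 0.5}
print("E[f(X)] (discrete) =", sum(p * f(x) for x, p in pmf.items()))  # 0.5

# Continuous: E[f(X)] = ∫ p(x)·f(x) dx, estimated by averaging f over samples of X.
rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)
print("E[f(X)] (continuous, Monte Carlo) ≈", np.mean(f(samples)))  # ≈ 1.0
```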
- It is a subset of a larger population that is selected in a way that every member of the population has an equal chance of being chosen.
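For illustration (Python standard library), a simple random sample in which every member of the population is equally likely to be chosen:

```python
import random

random.seed(0)
population = list(range(100))             # the larger population
sample = random.sample(population, k=10)  # each member has an equal chance of being picked
print(sample)
```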
- State s
- Action a
- Policy π
$π(a | s) = P(A = a | S = s)$
- Reward r
- State transition (see the sketch after this list)
$p(s' | s, a) = P(S' = s' | S = s, A = a)$
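A minimal sketch of these objects for a hypothetical two-state, two-action MDP (Python); the state names and probabilities are made up for illustration:

```python
# policy[s][a] = π(a | s); transition[(s, a)][s'] = p(s' | s, a).
policy = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.9, "right": 0.1},
}
transition = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s1": 1.0},
    ("s1", "right"): {"s0": 0.5, "s1": 0.5},
}

# Each conditional distribution sums to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in policy.values())
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in transition.values())
```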
- Definition: Return (a.k.a. cumulative future reward)
$$\Large U_t = R_t + R_{t+1} + R_{t+2} + R_{t+3} + \cdots$$
- Definition: Discounted return (a.k.a. cumulative discounted future reward)
γ: discount rate (a tunable hyper-parameter)
$$\Large U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$$
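A short sketch (Python) that computes both returns from a finite list of observed rewards; the reward values are hypothetical:

```python
def discounted_return(rewards, gamma=1.0):
    """U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + ... for a finite reward sequence."""
    u = 0.0
    for k, r in enumerate(rewards):
        u += (gamma ** k) * r
    return u

rewards = [1.0, 0.0, 2.0, 3.0]
print(discounted_return(rewards, gamma=1.0))  # plain return: 6.0
print(discounted_return(rewards, gamma=0.9))  # discounted return: 1 + 0 + 0.9²·2 + 0.9³·3 ≈ 4.807
```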
- Two sources of randomness:
- Action can be random:
$P[A = a | S = s] = π(a | s)$
- New state can be random:
$P[S' = s' | S = s, A = a] = p(s' | s, a)$
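A sketch (Python with NumPy) that draws both sources of randomness for one step, using hypothetical distributions for a state s0:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical distributions for state s0.
policy = {"left": 0.3, "right": 0.7}            # π(a | s0)
transition = {"left":  {"s0": 1.0, "s1": 0.0},  # p(s' | s0, a)
              "right": {"s0": 0.2, "s1": 0.8}}

# Source 1: the action is random, A ~ π(· | s).
a = rng.choice(list(policy), p=list(policy.values()))

# Source 2: the new state is random, S' ~ p(· | s, a).
next_dist = transition[a]
s_next = rng.choice(list(next_dist), p=list(next_dist.values()))

print("sampled action:", a, "| sampled next state:", s_next)
```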
- Definition: Action-value function for policy π.
$$\Large Q_\pi(s_t, a_t) = E[U_t \mid S_t = s_t, A_t = a_t]$$
- Definition: Optimal action-value function
$$\Large Q^*(s_t, a_t) = \max_{\pi} Q_\pi(s_t, a_t)$$
- Definition: State-value function
- Actions are discrete:
$$\Large V_\pi(s_t) = E_A[Q_\pi(s_t, A)] = \sum_a \pi(a \mid s_t) \cdot Q_\pi(s_t, a)$$
- Actions are continuous:
$$\Large V_\pi(s_t) = E_A[Q_\pi(s_t, A)] = \int \pi(a \mid s_t) \cdot Q_\pi(s_t, a)\, da$$
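For the discrete case, a small worked sketch (Python) evaluating the weighted sum from hypothetical policy and action-value tables:

```python
# Hypothetical values for one state s_t.
pi = {"left": 0.3, "right": 0.7}   # π(a | s_t)
q = {"left": 1.0, "right": 3.0}    # Q_π(s_t, a)

# V_π(s_t) = Σ_a π(a | s_t) · Q_π(s_t, a)
v = sum(pi[a] * q[a] for a in pi)
print(v)  # 0.3·1.0 + 0.7·3.0 = 2.4
```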
- Observe state $s_t$, take action $a_t$; the environment returns the next state $s_{t+1}$ and reward $r_t$
- The agent can be controlled by either $π(a | s)$ or $Q^*(s, a)$
- Use neural network Q(s, a; w) to approximate Q*(s, a)
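A minimal sketch of such a network (assuming PyTorch; the layer sizes are arbitrary): it maps a state vector to one approximate Q(s, a; w) value per discrete action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Takes a state vector and outputs Q(s, a; w) for every discrete action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)                    # a dummy observed state s_t
q_values = q_net(state)                      # Q(s_t, a; w) for every action a
action = int(q_values.argmax(dim=1).item())  # greedy action w.r.t. Q
```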
- Make a prediction: q = Q(w)
- Finish the trip and get the target y
- Loss: $\Large L = \frac{1}{2}(q - y)^2$
- Equivalently: $\Large L = \frac{1}{2}(Q(w) - y)^2$
- Gradient:
$\Large \frac{\partial L}{\partial w} = \frac{\partial q}{\partial w} \cdot \frac{\partial L}{\partial q} = (q - y) \cdot \frac{\partial Q(w)}{\partial w}$
- Gradient descent:
$\Large w_{t+1} = w_t - \alpha \cdot \frac{\partial L}{\partial w}\Big\vert_{w=w_t}$
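A scalar sketch (Python) of one gradient-descent step on this loss, treating Q(w) as a toy one-parameter model; all numbers are hypothetical:

```python
# Toy model: Q(w) = w · x for a fixed feature x, so ∂Q/∂w = x.
x, w, alpha = 2.0, 0.5, 0.1
y = 3.0                      # target obtained after finishing the trip

q = w * x                    # prediction q = Q(w)
loss = 0.5 * (q - y) ** 2    # L = ½ (q - y)²
grad = (q - y) * x           # ∂L/∂w = (q - y) · ∂Q/∂w
w = w - alpha * grad         # gradient descent: w_{t+1} = w_t - α · ∂L/∂w

print(f"loss={loss:.3f}, new w={w:.3f}")  # loss=2.000, new w=0.900
```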
- Equation (trip-time example):
$\Large T_{A\rightarrow C} \approx T_{A\rightarrow B} + T_{B\rightarrow C}$
- In deep reinforcement learning:
$\Large Q(s_t, a_t;w)\approx r_t + \gamma \cdot Q(s_{t+1}, a_{t+1};w)$
- DQN's output, $Q(s_t, a_t; w)$, is an estimate of $E[U_t]$
- DQN's output, $Q(s_{t+1}, a_{t+1}; w)$, is an estimate of $E[U_{t+1}]$
- Thus,
$$\Large Q(s_t, a_t; w) \approx E[R_t + \gamma \cdot Q(s_{t+1}, A_{t+1}; w)]$$
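A small numerical sketch (Python) of the resulting TD target and TD error; the reward and Q values are hypothetical:

```python
gamma = 0.9
r_t = 1.0       # observed reward
q_now = 10.0    # Q(s_t, a_t; w), the network's current estimate of E[U_t]
q_next = 12.0   # Q(s_{t+1}, a_{t+1}; w), estimate of E[U_{t+1}]

td_target = r_t + gamma * q_next  # r_t + γ · Q(s_{t+1}, a_{t+1}; w)
td_error = q_now - td_target      # the mismatch that drives the gradient update of w

print(f"TD target = {td_target:.1f}, TD error = {td_error:.1f}")  # 11.8, -1.8
```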