Implementation of DDPG with NumPy only (without TensorFlow)

kcg2015/DDPG_numpy_only

Overview

In this project, we implement Deep Deterministic Policy Gradient (DDPG) from scratch using NumPy only, without a deep learning framework such as TensorFlow.

DDPG

Key steps in DDPG

The key steps of the algorithm and their corresponding code snippets are listed below. The DDPG algorithm is implemented in ddpg_numpy.py.

  1. Select action a_t according to the current policy and exploration noise:

a_t = actor.predict(np.reshape(s_t,(1,3)), ACTION_BOUND, target=False)+1./(1.+i+j)
  2. Execute action a_t, then observe reward r_t and new state s_{t+1}:

s_t_1, r_t, done, info = env.step(a_t[0])
  3. Create a replay buffer and sample a minibatch from it
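The replay buffer step has no snippet here, so the following is a minimal sketch of one way it could look. The class name `ReplayBuffer`, its methods, and the helper `unpack` are hypothetical and are not taken from ddpg_numpy.py; only the variable names `states_t`, `actions`, `rewards`, `states_t_1` and `dones` mirror the snippets in this README.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Hypothetical fixed-size FIFO store of (s_t, a_t, r_t, s_t_1, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full
        self.buffer = deque(maxlen=capacity)

    def add(self, s_t, a_t, r_t, s_t_1, done):
        self.buffer.append((s_t, a_t, r_t, s_t_1, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, capped at the current size
        n = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), n)

    def __len__(self):
        return len(self.buffer)


def unpack(batch):
    """Stack a list of transitions into one array per field."""
    states_t = np.array([e[0] for e in batch])
    actions = np.array([e[1] for e in batch])
    rewards = np.array([e[2] for e in batch])
    states_t_1 = np.array([e[3] for e in batch])
    dones = np.array([e[4] for e in batch])
    return states_t, actions, rewards, states_t_1, dones
```

Sampling uniformly from past transitions breaks the temporal correlation between consecutive steps, which is the reason DDPG trains from a buffer rather than from the latest transition alone.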

  4. Set y_i according to the following equation:

$y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$

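# Targets are computed with the target actor and the target critic
y = np.zeros((len(batch), action_dim))
a_tgt = actor.predict(states_t_1, ACTION_BOUND, target=True)
Q_tgt = critic.predict(states_t_1, a_tgt, target=True)
# Index k (rather than i) so that the i used in the exploration term is not overwritten
for k in range(len(batch)):
    if dones[k]:
        # Terminal transition: no bootstrapping from the next state
        y[k] = rewards[k]
    else:
        y[k] = rewards[k] + GAMMA * Q_tgt[k]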
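The loop above can also be written without an explicit Python loop. The following is a sketch only: the function name `td_targets` is hypothetical, and it assumes that the target critic returns one Q value per transition with shape (batch, 1).

```python
import numpy as np


def td_targets(rewards, dones, Q_tgt, gamma):
    """Return r_i for terminal transitions, else r_i + gamma * Q_tgt_i."""
    rewards = np.asarray(rewards, dtype=float).reshape(-1, 1)
    # 1.0 where the episode continues, 0.0 where it has ended
    not_done = 1.0 - np.asarray(dones, dtype=float).reshape(-1, 1)
    Q_tgt = np.asarray(Q_tgt, dtype=float).reshape(-1, 1)
    return rewards + gamma * not_done * Q_tgt


# Small made-up batch for illustration: one running and one terminal transition
y = td_targets(rewards=[1.0, 2.0], dones=[False, True],
               Q_tgt=[[10.0], [20.0]], gamma=0.5)
# y is [[6.], [2.]]: the terminal transition keeps only its reward
```

Multiplying by `not_done` masks out the bootstrap term for terminal transitions, which matches the `if dones[...]` branch of the loop.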
  5. Update the critic network by minimizing the loss function

$L = \frac{1}{N}\sum_i (y_i - Q(s_i, a_i|\theta^Q))^2$

loss += critic.train(states_t, actions, y)
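  6. Update the actor policy using the sampled policy gradient:

$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s,a|\theta^Q)|_{s=s_i,a=\mu(s_i)} \nabla_{\theta^\mu}\mu(s|\theta^\mu)|_{s=s_i}$

which needs the input of

$\nabla_a Q(s,a|\theta^Q)|_{s=s_i,a=\mu(s_i)}$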

dQ_da = critic.evaluate_action_gradient(states_t,a_for_dQ_da)

which in turn relies on $a=\mu(s_i)$:

a_for_dQ_da=actor.predict(states_t, ACTION_BOUND, target=False)

Finally, the following code implements the actor policy update:

actor.train(states_t, dQ_da, ACTION_BOUND)
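To illustrate how a call such as `actor.train(states_t, dQ_da, ACTION_BOUND)` can turn the action gradient into a parameter update, here is a minimal sketch with a one-layer tanh policy. The class `TinyActor`, its layer layout, and the toy critic Q(s, a) = -(a - 1)^2 are assumptions made for this example; the network in actor_net.py may be structured differently.

```python
import numpy as np


class TinyActor:
    """Hypothetical one-layer policy: a = bound * tanh(s @ W + b)."""

    def __init__(self, state_dim, action_dim, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.1 * rng.standard_normal((state_dim, action_dim))
        self.b = np.zeros(action_dim)
        self.lr = lr

    def predict(self, states, bound):
        return bound * np.tanh(states @ self.W + self.b)

    def train(self, states, dQ_da, bound):
        # Chain rule: dJ/dtheta = mean_i dQ/da_i * da_i/dtheta
        t = np.tanh(states @ self.W + self.b)
        dz = dQ_da * bound * (1.0 - t ** 2)   # gradient at the pre-activation
        n = states.shape[0]
        # Gradient ascent, because the actor maximizes Q
        self.W += self.lr * states.T @ dz / n
        self.b += self.lr * dz.mean(axis=0)


# Toy check with a made-up critic Q(s, a) = -(a - 1)^2, so dQ/da = -2 (a - 1)
states = np.random.default_rng(1).standard_normal((32, 3))
actor = TinyActor(state_dim=3, action_dim=1)
for _ in range(500):
    a = actor.predict(states, bound=2.0)
    actor.train(states, -2.0 * (a - 1.0), bound=2.0)
```

With this toy critic the actions should drift toward 1, the maximizer of Q, which shows that the update follows the direction of dQ/da.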

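  7. Update the target networks:

$\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}$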

actor.train_target(TAU)
critic.train_target(TAU)
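A soft target update of this kind can be sketched as below. The function `soft_update` and the dictionary-of-arrays parameter layout are hypothetical; the actual `train_target` methods may store their weights differently.

```python
import numpy as np


def soft_update(target_params, online_params, tau):
    """In-place update: theta_target <- tau * theta + (1 - tau) * theta_target."""
    for name, theta in online_params.items():
        target_params[name] *= (1.0 - tau)
        target_params[name] += tau * theta


# Made-up parameters for illustration
online = {"W1": np.ones((2, 2)), "b1": np.zeros(2)}
target = {"W1": np.zeros((2, 2)), "b1": np.zeros(2)}
soft_update(target, online, tau=0.25)
# Every entry of target["W1"] is now 0.25
```

A small tau makes the target networks trail the online networks slowly, which keeps the regression targets y_i from shifting abruptly between updates.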

Actor (policy) Network

The actor network is implemented in actor_net.py.

Critic (value) Network

The critic network is implemented in critic_net.py. We follow the implementation described in the DDPG paper. The following sketch shows the architecture of the critic network.

[Figure: critic network architecture]
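As a rough illustration of a critic in the style of the DDPG paper, where the action only enters at the second hidden layer, the sketch below shows a forward pass and the analytic action gradient. The class `TinyCritic`, the layer sizes, and the omission of target weights are assumptions for this example and are not taken from critic_net.py; only the method names `predict` and `evaluate_action_gradient` follow the calls shown earlier in this README.

```python
import numpy as np


class TinyCritic:
    """Hypothetical Q(s, a) where the action joins at the second hidden layer."""

    def __init__(self, state_dim, action_dim, h1=16, h2=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.5 * rng.standard_normal((state_dim, h1))
        self.b1 = np.zeros(h1)
        self.W2 = 0.5 * rng.standard_normal((h1, h2))
        self.Wa = 0.5 * rng.standard_normal((action_dim, h2))
        self.b2 = np.zeros(h2)
        self.W3 = 0.5 * rng.standard_normal((h2, 1))
        self.b3 = np.zeros(1)

    def _pre_activation(self, states, actions):
        h1 = np.maximum(0.0, states @ self.W1 + self.b1)   # state-only layer
        return h1 @ self.W2 + actions @ self.Wa + self.b2   # action enters here

    def predict(self, states, actions):
        h2 = np.maximum(0.0, self._pre_activation(states, actions))
        return h2 @ self.W3 + self.b3                       # shape (batch, 1)

    def evaluate_action_gradient(self, states, actions):
        z2 = self._pre_activation(states, actions)
        # dQ/dz2 is W3^T masked by the ReLU, and dz2/da is Wa
        dQ_dz2 = (z2 > 0.0) * self.W3.T                     # (batch, h2)
        return dQ_dz2 @ self.Wa.T                           # (batch, action_dim)


# Made-up batch of 4 states (dimension 3) and actions (dimension 1)
rng = np.random.default_rng(1)
s = rng.standard_normal((4, 3))
a = rng.standard_normal((4, 1))
critic = TinyCritic(state_dim=3, action_dim=1)
dQ_da = critic.evaluate_action_gradient(s, a)
```

Because the action feeds into a single linear term, the action gradient only has to be backpropagated through the last two layers, which keeps `evaluate_action_gradient` cheap.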

Results

[Figure: results]

Acknowledgement

While writing this code, I was informed and inspired by the coding practices, style, and techniques in the following resources (https://github.com/yanpanlau/DDPG-Keras-Torcs, http://cs231n.github.io/assignments2017/assignment2/).