What training techniques were used to fit the DDPG model present in the repository? Mainly, I didn't understand how the actor and critic networks were used and combined during training, and why two instances were created for each of them. Thanks in advance!
Replies: 1 comment
The question is a bit too general; you should really consult a DDPG tutorial first. In short, you have two networks: the "actor", which implements a policy returning a probability distribution over actions parametrized by the current state, and the "critic", which evaluates the actor's choice, in the case of DDPG via a Q-value function. The critic is trained with the usual Bellman equation (see the slides); the actor is trained to maximize the cumulative expected reward. The expected reward under the actor policy is computed by composing the critic with the actor, so the training objective of the actor is to maximize the output of the composition of the two networks (freezing the critic's weights).
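For concreteness, here is a minimal sketch of that update step in PyTorch. The `actor`, `critic`, optimizer, and batch objects are hypothetical and the hyperparameters illustrative; none of it is taken from the repository. It also illustrates the usual reason for having two instances of each network: in standard DDPG, each online network is paired with a slowly updated target copy that is used to compute the Bellman target.

```python
import torch
import torch.nn as nn

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    # batch sampled from a replay buffer (all tensors)
    state, action, reward, next_state, done = batch

    # --- Critic update: regress Q(s, a) toward the Bellman target ---
    with torch.no_grad():
        next_action = actor_target(next_state)
        target_q = reward + gamma * (1.0 - done) * critic_target(next_state, next_action)
    critic_loss = nn.functional.mse_loss(critic(state, action), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # --- Actor update: maximize Q(s, actor(s)), i.e. the composition of the
    # two networks. Only actor_opt steps here, so the critic's weights stay
    # frozen even though gradients flow through the critic into the actor. ---
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # --- Soft-update the target copies (the "second instance" of each network) ---
    with torch.no_grad():
        for net, target in ((actor, actor_target), (critic, critic_target)):
            for p, tp in zip(net.parameters(), target.parameters()):
                tp.data.mul_(1.0 - tau).add_(tau * p.data)
```

This is only a sketch of the generic DDPG recipe, not of the repository's exact code, but it shows the two pieces the reply describes: the critic minimizes a Bellman (TD) error, and the actor minimizes the negative critic output, which is the same as maximizing the expected return estimated by the critic.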