
Training techniques for DDPG model. #4

Answered by asperti
sfiorilla asked this question in Q&A

The question is a bit too general; you should really consult a DDPG tutorial first. Anyway, you have two networks: the "actor", which implements the policy by returning an action given the current state (in DDPG the policy is deterministic), and the "critic", which evaluates the choice of the actor, in the case of DDPG via a Q-value function. The critic is trained with the usual Bellman equation (see the slides); the actor is trained to maximize the cumulative expected reward. The expected reward under the actor policy is computed by composing the critic with the actor, so the training objective of the actor is to maximize the output of the composition of the two networks (freezing the critic's weights while updating the actor).
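
To make this concrete, below is a minimal sketch of one DDPG update step, assuming a TensorFlow/Keras setup. The network sizes, `state_dim`/`action_dim`, and optimizer settings are illustrative placeholders, not taken from the course code.

```python
import tensorflow as tf
from tensorflow.keras import layers

state_dim, action_dim = 3, 1   # assumed dimensions, e.g. a Pendulum-like task
gamma = 0.99                   # discount factor (assumed)

def make_actor():
    s = layers.Input(shape=(state_dim,))
    h = layers.Dense(64, activation="relu")(s)
    a = layers.Dense(action_dim, activation="tanh")(h)   # deterministic action mu(s)
    return tf.keras.Model(s, a)

def make_critic():
    s = layers.Input(shape=(state_dim,))
    a = layers.Input(shape=(action_dim,))
    h = layers.Dense(64, activation="relu")(layers.Concatenate()([s, a]))
    q = layers.Dense(1)(h)                                # Q(s, a)
    return tf.keras.Model([s, a], q)

actor, critic = make_actor(), make_critic()
target_actor, target_critic = make_actor(), make_critic()
target_actor.set_weights(actor.get_weights())
target_critic.set_weights(critic.get_weights())
actor_optimizer = tf.keras.optimizers.Adam(1e-3)
critic_optimizer = tf.keras.optimizers.Adam(1e-3)

def train_step(states, actions, rewards, next_states, dones):
    # Critic: regress Q(s, a) onto the Bellman target r + gamma * Q'(s', mu'(s'))
    next_q = target_critic([next_states, target_actor(next_states)])
    targets = rewards + gamma * (1.0 - dones) * next_q
    with tf.GradientTape() as tape:
        critic_loss = tf.reduce_mean(tf.square(targets - critic([states, actions])))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_optimizer.apply_gradients(zip(grads, critic.trainable_variables))

    # Actor: maximize Q(s, actor(s)), i.e. minimize its negation.
    # Gradients are taken only w.r.t. the actor's variables, so the critic
    # is effectively frozen during this step.
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([states, actor(states)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_optimizer.apply_gradients(zip(grads, actor.trainable_variables))
```

Note how the actor loss is just the negated critic output evaluated at the actor's own action: that is the "composition of the two networks" mentioned above, with only the actor's weights being updated.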
