[RLlib] A2C + A3C move to algorithms folder and re-name into A2C/A3C (from ...Trainer). (ray-project#25314)
sven1977 authored Jun 1, 2022
1 parent 288a81b commit 18c03f8
Showing 30 changed files with 181 additions and 112 deletions.
2 changes: 1 addition & 1 deletion doc/source/ray-core/examples/plot_example-a3c.rst
@@ -14,7 +14,7 @@ View the `code for this example`_.

.. _`A3C`: https://arxiv.org/abs/1602.01783
.. _`Universe Starter Agent`: https://github.com/openai/universe-starter-agent
.. _`code for this example`: https://github.com/ray-project/ray/tree/master/rllib/agents/a3c
.. _`code for this example`: https://github.com/ray-project/ray/tree/master/rllib/algorithms/a3c


To run the application, first install **ray** and then some dependencies:
2 changes: 1 addition & 1 deletion doc/source/rllib/core-concepts.rst
@@ -292,7 +292,7 @@ Examples
# type: LocalIterator[ResultDict]
return StandardMetricsReporting(train_op, workers, config)
See also the `actual A3C implementation <https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a3c.py>`__.
See also the `actual A3C implementation <https://github.com/ray-project/ray/blob/master/rllib/algorithms/a3c/a3c.py>`__.

.. dropdown:: **Example: Replay**

49 changes: 37 additions & 12 deletions doc/source/rllib/rllib-algorithms.rst
@@ -15,7 +15,8 @@ Available Algorithms - Overview
============================== ========== ============================= ================== =========== ============================================================= ===============
Algorithm Frameworks Discrete Actions Continuous Actions Multi-Agent Model Support Multi-GPU
============================== ========== ============================= ================== =========== ============================================================= ===============
`A2C, A3C`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_ A2C: tf + torch
`A2C`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_ A2C: tf + torch
`A3C`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_ No
`ARS`_ tf + torch **Yes** **Yes** No No
`Bandits`_ (`TS`_ & `LinUCB`_) torch **Yes** `+parametric`_ No **Yes** No
`BC`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_ torch
@@ -60,7 +61,6 @@ Algorithm Frameworks Discrete Actions Continuous A
`Curiosity`_ tf + torch **Yes** `+parametric`_ No **Yes** `+RNN`_
================================ ========== ======================= ================== =========== =====================

.. _`A2C, A3C`: rllib-algorithms.html#a3c
.. _`APEX-DQN`: rllib-algorithms.html#apex
.. _`APEX-DDPG`: rllib-algorithms.html#apex
.. _`+autoreg`: rllib-models.html#autoregressive-action-distributions
@@ -228,43 +228,68 @@ Tuned examples: `CartPole-v0 <https://github.com/ray-project/ray/blob/master/rll
Gradient-based
~~~~~~~~~~~~~~

.. _a3c:
.. _a2c:

Advantage Actor-Critic (A2C, A3C)
---------------------------------
Advantage Actor-Critic (A2C)
----------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1602.01783>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a3c.py>`__
RLlib implements both A2C and A3C. These algorithms scale to 16-32+ worker processes depending on the environment.

A2C also supports microbatching (i.e., gradient accumulation), which can be enabled by setting the ``microbatch_size`` config. Microbatching allows for training with a ``train_batch_size`` much larger than GPU memory.
`[paper] <https://arxiv.org/abs/1602.01783>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/a2c/a2c.py>`__
A2C scales to 16-32+ worker processes depending on the environment and supports microbatching
(i.e., gradient accumulation), which can be enabled by setting the ``microbatch_size`` config.
Microbatching allows training with a ``train_batch_size`` that is much larger than what fits in GPU memory at once.
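
As a hedged illustration (not part of this change set), here is a minimal way to enable microbatching through the plain config dict; the environment name and worker count are arbitrary, and ``microbatch_size`` is assumed to evenly divide ``train_batch_size``:

.. code-block:: python

    # Editorial sketch: train A2C with gradient accumulation over microbatches.
    import ray
    from ray.rllib.algorithms.a2c import A2C

    ray.init()

    trainer = A2C(
        env="CartPole-v0",
        config={
            "framework": "torch",
            "num_workers": 2,
            # One update uses 200 timesteps, accumulated as 10 gradient
            # passes over microbatches of 20 timesteps each.
            "train_batch_size": 200,
            "microbatch_size": 20,
        },
    )

    for _ in range(3):
        print(trainer.train()["episode_reward_mean"])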

.. figure:: images/a2c-arch.svg

A2C architecture

Tuned examples: `PongDeterministic-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/a3c/pong-a3c.yaml>`__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/a3c/atari-a2c.yaml>`__
Tuned examples: `Atari environments <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/a2c/atari-a2c.yaml>`__

.. tip::
Consider using `IMPALA <#importance-weighted-actor-learner-architecture-impala>`__ for faster training with similar timestep efficiency.

**Atari results @10M steps**: `more details <https://github.com/ray-project/rl-experiments>`__

============= ======================== ==============================
Atari env RLlib A2C 5-workers Mnih et al A3C 16-workers
Atari env RLlib A2C 5-workers Mnih et al A3C 16-workers
============= ======================== ==============================
BeamRider 1401 ~3000
Breakout 374 ~150
Qbert 3620 ~1000
SpaceInvaders 692 ~600
============= ======================== ==============================

**A2C-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. literalinclude:: ../../../rllib/algorithms/a2c/a2c.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__


.. _a3c:

Asynchronous Advantage Actor-Critic (A3C)
-----------------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1602.01783>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/a3c/a3c.py>`__
A3C is the asynchronous version of A2C: each worker computes gradients locally right after its trajectory rollout
and ships them to a central learner, which accumulates them on the central model. After each central model update,
the new parameters are broadcast back to all workers.
Like A2C, A3C scales to 16-32+ worker processes depending on the environment.
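
As a hedged, minimal usage sketch (not part of this change set; the builder-style ``A3CConfig`` API is assumed to be exported from the new ``rllib/algorithms/a3c`` package):

.. code-block:: python

    # Editorial sketch: build and train A3C from its new module location.
    import ray
    import ray.rllib.algorithms.a3c as a3c

    ray.init()

    config = (
        a3c.A3CConfig()
        .framework("torch")
        .rollouts(num_rollout_workers=4)  # async workers, each computing its own gradients
    )
    trainer = config.build(env="CartPole-v0")

    for _ in range(3):
        print(trainer.train()["episode_reward_mean"])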

Tuned examples: `PongDeterministic-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/a3c/pong-a3c.yaml>`__

.. tip::
Consider using `IMPALA <#importance-weighted-actor-learner-architecture-impala>`__ for faster training with similar timestep efficiency.

**A3C-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. literalinclude:: ../../../rllib/agents/a3c/a3c.py
.. literalinclude:: ../../../rllib/algorithms/a3c/a3c.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__


.. _ddpg:

Deep Deterministic Policy Gradients (DDPG, TD3)
6 changes: 3 additions & 3 deletions doc/source/rllib/rllib-concepts.rst
@@ -482,13 +482,13 @@ Defining a policy in PyTorch is quite similar to that for TensorFlow (and the pr
name="MyTorchPolicy",
loss_fn=policy_gradient_loss)
Now, building on the TF examples above, let's look at how the `A3C torch policy <https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a3c_torch_policy.py>`__ is defined:
Now, building on the TF examples above, let's look at how the `A3C torch policy <https://github.com/ray-project/ray/blob/master/rllib/algorithms/a3c/a3c_torch_policy.py>`__ is defined:

.. code-block:: python
A3CTorchPolicy = build_torch_policy(
name="A3CTorchPolicy",
get_default_config=lambda: ray.rllib.agents.a3c.a3c.DEFAULT_CONFIG,
get_default_config=lambda: ray.rllib.algorithms.a3c.a3c.DEFAULT_CONFIG,
loss_fn=actor_critic_loss,
stats_fn=loss_and_entropy_stats,
postprocess_fn=add_advantages,
@@ -551,7 +551,7 @@ Now, building on the TF examples above, let's look at how the `A3C torch policy
_, _, vf, _ = self.model({"obs": obs}, [])
return vf.detach().cpu().numpy().squeeze()
You can find the full policy definition in `a3c_torch_policy.py <https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a3c_torch_policy.py>`__.
You can find the full policy definition in `a3c_torch_policy.py <https://github.com/ray-project/ray/blob/master/rllib/algorithms/a3c/a3c_torch_policy.py>`__.

In summary, the main difference between the PyTorch and TensorFlow policy builder functions is that the TF loss and stats functions are built symbolically when the policy is initialized, whereas for PyTorch (or TensorFlow Eager) these functions are called imperatively each time they are used.
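
To make the imperative/symbolic distinction concrete, here is an editorial sketch of a minimal torch loss in the ``loss_fn(policy, model, dist_class, train_batch)`` signature used above (the real A3C loss additionally includes value-function and entropy terms):

.. code-block:: python

    # Editorial sketch: a torch loss is a plain Python function that RLlib
    # calls eagerly on every optimization step -- there is no graph-building phase.
    def simple_actor_loss(policy, model, dist_class, train_batch):
        logits, _ = model(train_batch)  # eager forward pass
        action_dist = dist_class(logits, model)
        log_probs = action_dist.logp(train_batch["actions"])
        # "advantages" is filled in by the postprocessing step (add_advantages).
        return -(log_probs * train_batch["advantages"]).mean()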

24 changes: 14 additions & 10 deletions rllib/BUILD
@@ -72,15 +72,15 @@ load("//bazel:python.bzl", "py_test_module_list")
# inside rllib/tuned_examples/[algo-name] for actual learning success.
# --------------------------------------------------------------------

# A2C/A3C
# A2C
# py_test(
# name = "learning_tests_cartpole_a2c",
# main = "tests/run_regression_tests.py",
# tags = ["team:ml", "learning_tests", "learning_tests_cartpole", "learning_tests_discrete"],
# size = "large",
# srcs = ["tests/run_regression_tests.py"],
# data = ["tuned_examples/a3c/cartpole-a2c.yaml"],
# args = ["--yaml-dir=tuned_examples/a3c"]
# data = ["tuned_examples/a2c/cartpole-a2c.yaml"],
# args = ["--yaml-dir=tuned_examples/a2c"]
# )

py_test(
@@ -89,8 +89,8 @@ py_test(
tags = ["team:ml", "learning_tests", "learning_tests_cartpole", "learning_tests_discrete"],
size = "large",
srcs = ["tests/run_regression_tests.py"],
data = ["tuned_examples/a3c/cartpole-a2c-microbatch.yaml"],
args = ["--yaml-dir=tuned_examples/a3c"]
data = ["tuned_examples/a2c/cartpole-a2c-microbatch.yaml"],
args = ["--yaml-dir=tuned_examples/a2c"]
)

py_test(
@@ -99,10 +99,12 @@ py_test(
tags = ["team:ml", "learning_tests", "learning_tests_cartpole", "learning_tests_discrete", "fake_gpus"],
size = "large",
srcs = ["tests/run_regression_tests.py"],
data = ["tuned_examples/a3c/cartpole-a2c-fake-gpus.yaml"],
args = ["--yaml-dir=tuned_examples/a3c"]
data = ["tuned_examples/a2c/cartpole-a2c-fake-gpus.yaml"],
args = ["--yaml-dir=tuned_examples/a2c"]
)

# A3C

# py_test(
# name = "learning_tests_cartpole_a3c",
# main = "tests/run_regression_tests.py",
@@ -669,19 +671,21 @@ py_test(
)

# Specific Trainers (Algorithms)
# A2/3CTrainer

# A2C
py_test(
name = "test_a2c",
tags = ["team:ml", "trainers_dir"],
size = "large",
srcs = ["agents/a3c/tests/test_a2c.py"]
srcs = ["algorithms/a2c/tests/test_a2c.py"]
)

# A3C
py_test(
name = "test_a3c",
tags = ["team:ml", "trainers_dir"],
size = "large",
srcs = ["agents/a3c/tests/test_a3c.py"]
srcs = ["algorithms/a3c/tests/test_a3c.py"]
)

# AlphaStar
22 changes: 0 additions & 22 deletions rllib/agents/a3c/README.md

This file was deleted.

23 changes: 20 additions & 3 deletions rllib/agents/a3c/__init__.py
@@ -1,4 +1,21 @@
from ray.rllib.agents.a3c.a3c import A3CConfig, A3CTrainer, DEFAULT_CONFIG
from ray.rllib.agents.a3c.a2c import A2CConfig, A2CTrainer
from ray.rllib.algorithms.a2c.a2c import (
A2CConfig,
A2C as A2CTrainer,
A2C_DEFAULT_CONFIG,
)
from ray.rllib.algorithms.a3c.a3c import A3CConfig, A3C as A3CTrainer, DEFAULT_CONFIG
from ray.rllib.utils.deprecation import deprecation_warning

__all__ = ["A2CConfig", "A2CTrainer", "A3CConfig", "A3CTrainer", "DEFAULT_CONFIG"]

__all__ = [
"A2CConfig",
"A2C_DEFAULT_CONFIG", # deprecated
"A2CTrainer",
"A3CConfig",
"A3CTrainer",
"DEFAULT_CONFIG", # A3C default config (deprecated)
]

deprecation_warning(
"ray.rllib.agents.a3c", "ray.rllib.algorithms.[a3c|a2c]", error=False
)
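
As a hedged illustration of what this backward-compatibility shim provides (not part of the diff): the old import path still resolves, but emits the deprecation warning and merely aliases the relocated classes.

```python
# Editorial sketch: old-style imports keep working, but warn ...
from ray.rllib.agents.a3c import A2CTrainer, A3CTrainer  # triggers deprecation_warning

# ... and simply alias the classes at their new locations.
from ray.rllib.algorithms.a2c import A2C
from ray.rllib.algorithms.a3c import A3C

assert A2CTrainer is A2C
assert A3CTrainer is A3C
```
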
6 changes: 3 additions & 3 deletions rllib/agents/impala/impala.py
@@ -464,7 +464,7 @@ def get_default_policy_class(

return VTraceTorchPolicy
else:
from ray.rllib.agents.a3c.a3c_torch_policy import A3CTorchPolicy
from ray.rllib.algorithms.a3c.a3c_torch_policy import A3CTorchPolicy

return A3CTorchPolicy
elif config["framework"] == "tf":
@@ -475,7 +475,7 @@ def get_default_policy_class(

return VTraceStaticGraphTFPolicy
else:
from ray.rllib.agents.a3c.a3c_tf_policy import A3CTFPolicy
from ray.rllib.algorithms.a3c.a3c_tf_policy import A3CTFPolicy

return A3CTFPolicy
else:
@@ -484,7 +484,7 @@ def get_default_policy_class(

return VTraceEagerTFPolicy
else:
from ray.rllib.agents.a3c.a3c_tf_policy import A3CTFPolicy
from ray.rllib.algorithms.a3c.a3c_tf_policy import A3CTFPolicy

return A3CTFPolicy
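
For context, a hedged sketch of when these fallback branches are taken (not part of the diff; the elided condition above is assumed to be IMPALA's standard `vtrace` config flag):

```python
# Editorial sketch: with V-trace disabled, IMPALA falls back to the A3C
# policies, which now live under ray.rllib.algorithms.a3c.
import ray.rllib.agents.impala as impala  # IMPALA itself still lives under agents/ here

trainer = impala.ImpalaTrainer(
    env="CartPole-v0",
    config={
        "framework": "torch",
        "vtrace": False,  # assumed flag -> A3CTorchPolicy branch above
        "num_workers": 2,
    },
)
print(type(trainer.get_policy()).__name__)  # expected: "A3CTorchPolicy"
```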

8 changes: 4 additions & 4 deletions rllib/agents/registry.py
@@ -6,15 +6,15 @@


def _import_a2c():
from ray.rllib.agents import a3c
import ray.rllib.algorithms.a2c as a2c

return a3c.A2CTrainer, a3c.a2c.A2C_DEFAULT_CONFIG
return a2c.A2C, a2c.A2C_DEFAULT_CONFIG


def _import_a3c():
from ray.rllib.agents import a3c
import ray.rllib.algorithms.a3c as a3c

return a3c.A3CTrainer, a3c.DEFAULT_CONFIG
return a3c.A3C, a3c.DEFAULT_CONFIG
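
A hedged sketch of how these lazy importers are typically consumed (the `get_trainer_class` helper is assumed to live in this same registry module and is not shown in the diff):

```python
# Editorial sketch: resolve an algorithm by name through the registry.
from ray.rllib.agents.registry import get_trainer_class

trainer_cls, default_config = get_trainer_class("A2C", return_config=True)
print(trainer_cls.__name__)               # -> "A2C" after this rename
print(default_config["train_batch_size"])
```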


def _import_alpha_star():
2 changes: 1 addition & 1 deletion rllib/agents/tests/test_trainer.py
@@ -8,7 +8,7 @@
import unittest

import ray
import ray.rllib.agents.a3c as a3c
import ray.rllib.algorithms.a3c as a3c
import ray.rllib.algorithms.dqn as dqn
from ray.rllib.algorithms.marwil import BCConfig, BCTrainer
import ray.rllib.algorithms.pg as pg
19 changes: 19 additions & 0 deletions rllib/algorithms/a2c/README.md
@@ -0,0 +1,19 @@
# Advantage Actor-Critic (A2C)

## Overview

[Advantage Actor-Critic](https://arxiv.org/pdf/1602.01783.pdf) proposes two distributed model-free on-policy RL algorithms, A3C and A2C.
Both are distributed versions of the vanilla Policy Gradient (PG) algorithm and differ only in their execution pattern.
The paper suggests accelerating training by scaling up data collection, i.e., by introducing worker nodes
that carry copies of the central node's policy network and collect data from the environment in parallel.
Each worker uses its data to compute gradients; the central node applies each of these gradients and then sends updated weights back to the workers.

In A2C, the worker nodes collect data synchronously. The collected rollouts form one large batch,
from which the central node (the central policy) computes gradient updates.
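
The following editorial sketch (not RLlib code; all names are illustrative stand-ins) outlines one synchronous A2C iteration as described above:

```python
# Editorial sketch of the synchronous A2C pattern. Each worker's sample() is
# assumed to return a list of transitions collected with the current weights.
def a2c_training_iteration(central_policy, workers):
    # 1) All workers roll out in parallel; the update waits for every worker.
    rollouts = [worker.sample() for worker in workers]

    # 2) The rollouts form one large batch, from which the central policy
    #    computes and applies a single gradient update.
    train_batch = [step for rollout in rollouts for step in rollout]
    grads = central_policy.compute_gradients(train_batch)
    central_policy.apply_gradients(grads)

    # 3) The updated weights are broadcast back to all workers.
    weights = central_policy.get_weights()
    for worker in workers:
        worker.set_weights(weights)
```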


## Documentation & Implementation of A2C:

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#a2c)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/algorithms/a2c/a2c.py)**
3 changes: 3 additions & 0 deletions rllib/algorithms/a2c/__init__.py
@@ -0,0 +1,3 @@
from ray.rllib.algorithms.a2c.a2c import A2CConfig, A2C, A2C_DEFAULT_CONFIG

__all__ = ["A2CConfig", "A2C", "A2C_DEFAULT_CONFIG"]