[RLlib] A2C + A3C move to algorithms folder and re-name into A2C/A3C (from ...Trainer). (ray-project#25314)
sven1977 authored Jun 1, 2022
1 parent 288a81b commit 18c03f8
Showing 30 changed files with 181 additions and 112 deletions.
2 changes: 1 addition & 1 deletion doc/source/ray-core/examples/plot_example-a3c.rst
@@ -14,7 +14,7 @@ View the `code for this example`_.

.. _`A3C`: https://arxiv.org/abs/1602.01783
.. _`Universe Starter Agent`: https://github.com/openai/universe-starter-agent
.. _`code for this example`: https://github.com/ray-project/ray/tree/master/rllib/agents/a3c
.. _`code for this example`: https://github.com/ray-project/ray/tree/master/rllib/algorithms/a3c


To run the application, first install **ray** and then some dependencies:
2 changes: 1 addition & 1 deletion doc/source/rllib/core-concepts.rst
@@ -292,7 +292,7 @@ Examples
# type: LocalIterator[ResultDict]
return StandardMetricsReporting(train_op, workers, config)
See also the `actual A3C implementation <https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a3c.py>`__.
See also the `actual A3C implementation <https://github.com/ray-project/ray/blob/master/rllib/algorithms/a3c/a3c.py>`__.

.. dropdown:: **Example: Replay**

49 changes: 37 additions & 12 deletions doc/source/rllib/rllib-algorithms.rst
@@ -15,7 +15,8 @@ Available Algorithms - Overview
============================== ========== ============================= ================== =========== ============================================================= ===============
Algorithm Frameworks Discrete Actions Continuous Actions Multi-Agent Model Support Multi-GPU
============================== ========== ============================= ================== =========== ============================================================= ===============
`A2C, A3C`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_ A2C: tf + torch
`A2C`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_ A2C: tf + torch
`A3C`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_, `+LSTM auto-wrapping`_, `+Attention`_, `+autoreg`_ No
`ARS`_ tf + torch **Yes** **Yes** No No
`Bandits`_ (`TS`_ & `LinUCB`_) torch **Yes** `+parametric`_ No **Yes** No
`BC`_ tf + torch **Yes** `+parametric`_ **Yes** **Yes** `+RNN`_ torch
@@ -60,7 +61,6 @@ Algorithm Frameworks Discrete Actions Continuous A
`Curiosity`_ tf + torch **Yes** `+parametric`_ No **Yes** `+RNN`_
================================ ========== ======================= ================== =========== =====================

.. _`A2C, A3C`: rllib-algorithms.html#a3c
.. _`APEX-DQN`: rllib-algorithms.html#apex
.. _`APEX-DDPG`: rllib-algorithms.html#apex
.. _`+autoreg`: rllib-models.html#autoregressive-action-distributions
@@ -228,43 +228,68 @@ Tuned examples: `CartPole-v0 <https://github.com/ray-project/ray/blob/master/rll
Gradient-based
~~~~~~~~~~~~~~

.. _a3c:
.. _a2c:

Advantage Actor-Critic (A2C, A3C)
---------------------------------
Advantage Actor-Critic (A2C)
----------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1602.01783>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a3c.py>`__
RLlib implements both A2C and A3C. These algorithms scale to 16-32+ worker processes depending on the environment.

A2C also supports microbatching (i.e., gradient accumulation), which can be enabled by setting the ``microbatch_size`` config. Microbatching allows for training with a ``train_batch_size`` much larger than GPU memory.
`[paper] <https://arxiv.org/abs/1602.01783>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/a2c/a2c.py>`__
A2C scales to 16-32+ worker processes depending on the environment and supports microbatching
(i.e., gradient accumulation), which can be enabled by setting the ``microbatch_size`` config.
Microbatching allows training with a ``train_batch_size`` that is much larger than what fits in GPU memory at once.
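
As a hedged illustration (not part of this change set), here is a minimal way to enable microbatching through the plain config dict; the environment name and worker count are arbitrary, and ``microbatch_size`` is assumed to evenly divide ``train_batch_size``:

.. code-block:: python

    # Editorial sketch: train A2C with gradient accumulation over microbatches.
    import ray
    from ray.rllib.algorithms.a2c import A2C

    ray.init()

    trainer = A2C(
        env="CartPole-v0",
        config={
            "framework": "torch",
            "num_workers": 2,
            # One update uses 200 timesteps, accumulated as 10 gradient
            # passes over microbatches of 20 timesteps each.
            "train_batch_size": 200,
            "microbatch_size": 20,
        },
    )

    for _ in range(3):
        print(trainer.train()["episode_reward_mean"])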

.. figure:: images/a2c-arch.svg

A2C architecture

Tuned examples: `PongDeterministic-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/a3c/pong-a3c.yaml>`__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/a3c/atari-a2c.yaml>`__
Tuned examples: `Atari environments <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/a2c/atari-a2c.yaml>`__

.. tip::
Consider using `IMPALA <#importance-weighted-actor-learner-architecture-impala>`__ for faster training with similar timestep efficiency.

**Atari results @10M steps**: `more details <https://github.com/ray-project/rl-experiments>`__

============= ======================== ==============================
Atari env RLlib A2C 5-workers Mnih et al A3C 16-workers
Atari env RLlib A2C 5-workers Mnih et al A3C 16-workers
============= ======================== ==============================
BeamRider 1401 ~3000
Breakout 374 ~150
Qbert 3620 ~1000
SpaceInvaders 692 ~600
============= ======================== ==============================

**A2C-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. literalinclude:: ../../../rllib/algorithms/a2c/a2c.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__


.. _a3c:

Asynchronous Advantage Actor-Critic (A3C)
-----------------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1602.01783>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/a3c/a3c.py>`__
A3C is the asynchronous version of A2C: each worker computes gradients locally right after its trajectory rollout
and ships them to a central learner, which accumulates them on the central model. After each central model update,
the new parameters are broadcast back to all workers.
Like A2C, A3C scales to 16-32+ worker processes depending on the environment.
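
As a hedged, minimal usage sketch (not part of this change set; the builder-style ``A3CConfig`` API is assumed to be exported from the new ``rllib/algorithms/a3c`` package):

.. code-block:: python

    # Editorial sketch: build and train A3C from its new module location.
    import ray
    import ray.rllib.algorithms.a3c as a3c

    ray.init()

    config = (
        a3c.A3CConfig()
        .framework("torch")
        .rollouts(num_rollout_workers=4)  # async workers, each computing its own gradients
    )
    trainer = config.build(env="CartPole-v0")

    for _ in range(3):
        print(trainer.train()["episode_reward_mean"])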

Tuned examples: `PongDeterministic-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/a3c/pong-a3c.yaml>`__

.. tip::
Consider using `IMPALA <#importance-weighted-actor-learner-architecture-impala>`__ for faster training with similar timestep efficiency.

**A3C-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

.. literalinclude:: ../../../rllib/agents/a3c/a3c.py
.. literalinclude:: ../../../rllib/algorithms/a3c/a3c.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__


.. _ddpg:

Deep Deterministic Policy Gradients (DDPG, TD3)
6 changes: 3 additions & 3 deletions doc/source/rllib/rllib-concepts.rst
@@ -482,13 +482,13 @@ Defining a policy in PyTorch is quite similar to that for TensorFlow (and the pr
name="MyTorchPolicy",
loss_fn=policy_gradient_loss)
Now, building on the TF examples above, let's look at how the `A3C torch policy <https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a3c_torch_policy.py>`__ is defined:
Now, building on the TF examples above, let's look at how the `A3C torch policy <https://github.com/ray-project/ray/blob/master/rllib/algorithms/a3c/a3c_torch_policy.py>`__ is defined:

.. code-block:: python
A3CTorchPolicy = build_torch_policy(
name="A3CTorchPolicy",
get_default_config=lambda: ray.rllib.agents.a3c.a3c.DEFAULT_CONFIG,
get_default_config=lambda: ray.rllib.algorithms.a3c.a3c.DEFAULT_CONFIG,
loss_fn=actor_critic_loss,
stats_fn=loss_and_entropy_stats,
postprocess_fn=add_advantages,
@@ -551,7 +551,7 @@ Now, building on the TF examples above, let's look at how the `A3C torch policy
_, _, vf, _ = self.model({"obs": obs}, [])
return vf.detach().cpu().numpy().squeeze()
You can find the full policy definition in `a3c_torch_policy.py <https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a3c_torch_policy.py>`__.
You can find the full policy definition in `a3c_torch_policy.py <https://github.com/ray-project/ray/blob/master/rllib/algorithms/a3c/a3c_torch_policy.py>`__.

In summary, the main difference between the PyTorch and TensorFlow policy builder functions is that the TF loss and stats functions are built symbolically when the policy is initialized, whereas for PyTorch (or TensorFlow Eager) these functions are called imperatively each time they are used.
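
To make the imperative/symbolic distinction concrete, here is an editorial sketch of a minimal torch loss in the ``loss_fn(policy, model, dist_class, train_batch)`` signature used above (the real A3C loss additionally includes value-function and entropy terms):

.. code-block:: python

    # Editorial sketch: a torch loss is a plain Python function that RLlib
    # calls eagerly on every optimization step -- there is no graph-building phase.
    def simple_actor_loss(policy, model, dist_class, train_batch):
        logits, _ = model(train_batch)  # eager forward pass
        action_dist = dist_class(logits, model)
        log_probs = action_dist.logp(train_batch["actions"])
        # "advantages" is filled in by the postprocessing step (add_advantages).
        return -(log_probs * train_batch["advantages"]).mean()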

24 changes: 14 additions & 10 deletions rllib/BUILD
@@ -72,15 +72,15 @@ load("//bazel:python.bzl", "py_test_module_list")
# inside rllib/tuned_examples/[algo-name] for actual learning success.
# --------------------------------------------------------------------

# A2C/A3C
# A2C
# py_test(
# name = "learning_tests_cartpole_a2c",
# main = "tests/run_regression_tests.py",
# tags = ["team:ml", "learning_tests", "learning_tests_cartpole", "learning_tests_discrete"],
# size = "large",
# srcs = ["tests/run_regression_tests.py"],
# data = ["tuned_examples/a3c/cartpole-a2c.yaml"],
# args = ["--yaml-dir=tuned_examples/a3c"]
# data = ["tuned_examples/a2c/cartpole-a2c.yaml"],
# args = ["--yaml-dir=tuned_examples/a2c"]
# )

py_test(
@@ -89,8 +89,8 @@ py_test(
tags = ["team:ml", "learning_tests", "learning_tests_cartpole", "learning_tests_discrete"],
size = "large",
srcs = ["tests/run_regression_tests.py"],
data = ["tuned_examples/a3c/cartpole-a2c-microbatch.yaml"],
args = ["--yaml-dir=tuned_examples/a3c"]
data = ["tuned_examples/a2c/cartpole-a2c-microbatch.yaml"],
args = ["--yaml-dir=tuned_examples/a2c"]
)

py_test(
@@ -99,10 +99,12 @@ py_test(
tags = ["team:ml", "learning_tests", "learning_tests_cartpole", "learning_tests_discrete", "fake_gpus"],
size = "large",
srcs = ["tests/run_regression_tests.py"],
data = ["tuned_examples/a3c/cartpole-a2c-fake-gpus.yaml"],
args = ["--yaml-dir=tuned_examples/a3c"]
data = ["tuned_examples/a2c/cartpole-a2c-fake-gpus.yaml"],
args = ["--yaml-dir=tuned_examples/a2c"]
)

# A3C

# py_test(
# name = "learning_tests_cartpole_a3c",
# main = "tests/run_regression_tests.py",
@@ -669,19 +671,21 @@ py_test(
)

# Specific Trainers (Algorithms)
# A2/3CTrainer

# A2C
py_test(
name = "test_a2c",
tags = ["team:ml", "trainers_dir"],
size = "large",
srcs = ["agents/a3c/tests/test_a2c.py"]
srcs = ["algorithms/a2c/tests/test_a2c.py"]
)

# A3C
py_test(
name = "test_a3c",
tags = ["team:ml", "trainers_dir"],
size = "large",
srcs = ["agents/a3c/tests/test_a3c.py"]
srcs = ["algorithms/a3c/tests/test_a3c.py"]
)

# AlphaStar
22 changes: 0 additions & 22 deletions rllib/agents/a3c/README.md

This file was deleted.

23 changes: 20 additions & 3 deletions rllib/agents/a3c/__init__.py
@@ -1,4 +1,21 @@
from ray.rllib.agents.a3c.a3c import A3CConfig, A3CTrainer, DEFAULT_CONFIG
from ray.rllib.agents.a3c.a2c import A2CConfig, A2CTrainer
from ray.rllib.algorithms.a2c.a2c import (
A2CConfig,
A2C as A2CTrainer,
A2C_DEFAULT_CONFIG,
)
from ray.rllib.algorithms.a3c.a3c import A3CConfig, A3C as A3CTrainer, DEFAULT_CONFIG
from ray.rllib.utils.deprecation import deprecation_warning

__all__ = ["A2CConfig", "A2CTrainer", "A3CConfig", "A3CTrainer", "DEFAULT_CONFIG"]

__all__ = [
"A2CConfig",
"A2C_DEFAULT_CONFIG", # deprecated
"A2CTrainer",
"A3CConfig",
"A3CTrainer",
"DEFAULT_CONFIG", # A3C default config (deprecated)
]

deprecation_warning(
"ray.rllib.agents.a3c", "ray.rllib.algorithms.[a3c|a2c]", error=False
)
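
As a hedged illustration of what this backward-compatibility shim provides (not part of the diff): the old import path still resolves, but emits the deprecation warning and merely aliases the relocated classes.

```python
# Editorial sketch: old-style imports keep working, but warn ...
from ray.rllib.agents.a3c import A2CTrainer, A3CTrainer  # triggers deprecation_warning

# ... and simply alias the classes at their new locations.
from ray.rllib.algorithms.a2c import A2C
from ray.rllib.algorithms.a3c import A3C

assert A2CTrainer is A2C
assert A3CTrainer is A3C
```
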
6 changes: 3 additions & 3 deletions rllib/agents/impala/impala.py
@@ -464,7 +464,7 @@ def get_default_policy_class(

return VTraceTorchPolicy
else:
from ray.rllib.agents.a3c.a3c_torch_policy import A3CTorchPolicy
from ray.rllib.algorithms.a3c.a3c_torch_policy import A3CTorchPolicy

return A3CTorchPolicy
elif config["framework"] == "tf":
@@ -475,7 +475,7 @@ def get_default_policy_class(

return VTraceStaticGraphTFPolicy
else:
from ray.rllib.agents.a3c.a3c_tf_policy import A3CTFPolicy
from ray.rllib.algorithms.a3c.a3c_tf_policy import A3CTFPolicy

return A3CTFPolicy
else:
@@ -484,7 +484,7 @@ def get_default_policy_class(

return VTraceEagerTFPolicy
else:
from ray.rllib.agents.a3c.a3c_tf_policy import A3CTFPolicy
from ray.rllib.algorithms.a3c.a3c_tf_policy import A3CTFPolicy

return A3CTFPolicy
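
For context, a hedged sketch of when these fallback branches are taken (not part of the diff; the elided condition above is assumed to be IMPALA's standard `vtrace` config flag):

```python
# Editorial sketch: with V-trace disabled, IMPALA falls back to the A3C
# policies, which now live under ray.rllib.algorithms.a3c.
import ray.rllib.agents.impala as impala  # IMPALA itself still lives under agents/ here

trainer = impala.ImpalaTrainer(
    env="CartPole-v0",
    config={
        "framework": "torch",
        "vtrace": False,  # assumed flag -> A3CTorchPolicy branch above
        "num_workers": 2,
    },
)
print(type(trainer.get_policy()).__name__)  # expected: "A3CTorchPolicy"
```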

8 changes: 4 additions & 4 deletions rllib/agents/registry.py
@@ -6,15 +6,15 @@


def _import_a2c():
from ray.rllib.agents import a3c
import ray.rllib.algorithms.a2c as a2c

return a3c.A2CTrainer, a3c.a2c.A2C_DEFAULT_CONFIG
return a2c.A2C, a2c.A2C_DEFAULT_CONFIG


def _import_a3c():
from ray.rllib.agents import a3c
import ray.rllib.algorithms.a3c as a3c

return a3c.A3CTrainer, a3c.DEFAULT_CONFIG
return a3c.A3C, a3c.DEFAULT_CONFIG
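
A hedged sketch of how these lazy importers are typically consumed (the `get_trainer_class` helper is assumed to live in this same registry module and is not shown in the diff):

```python
# Editorial sketch: resolve an algorithm by name through the registry.
from ray.rllib.agents.registry import get_trainer_class

trainer_cls, default_config = get_trainer_class("A2C", return_config=True)
print(trainer_cls.__name__)               # -> "A2C" after this rename
print(default_config["train_batch_size"])
```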


def _import_alpha_star():
2 changes: 1 addition & 1 deletion rllib/agents/tests/test_trainer.py
@@ -8,7 +8,7 @@
import unittest

import ray
import ray.rllib.agents.a3c as a3c
import ray.rllib.algorithms.a3c as a3c
import ray.rllib.algorithms.dqn as dqn
from ray.rllib.algorithms.marwil import BCConfig, BCTrainer
import ray.rllib.algorithms.pg as pg
19 changes: 19 additions & 0 deletions rllib/algorithms/a2c/README.md
@@ -0,0 +1,19 @@
# Advantage Actor-Critic (A2C)

## Overview

[Advantage Actor-Critic](https://arxiv.org/pdf/1602.01783.pdf) proposes two distributed model-free on-policy RL algorithms, A3C and A2C.
Both are distributed versions of the vanilla Policy Gradient (PG) algorithm and differ only in their execution pattern.
The paper suggests accelerating training by scaling up data collection, i.e., by introducing worker nodes
that carry copies of the central node's policy network and collect data from the environment in parallel.
Each worker uses its data to compute gradients; the central node applies each of these gradients and then sends updated weights back to the workers.

In A2C, the worker nodes collect data synchronously. The collected rollouts form one large batch,
from which the central node (the central policy) computes gradient updates.
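
The following editorial sketch (not RLlib code; all names are illustrative stand-ins) outlines one synchronous A2C iteration as described above:

```python
# Editorial sketch of the synchronous A2C pattern. Each worker's sample() is
# assumed to return a list of transitions collected with the current weights.
def a2c_training_iteration(central_policy, workers):
    # 1) All workers roll out in parallel; the update waits for every worker.
    rollouts = [worker.sample() for worker in workers]

    # 2) The rollouts form one large batch, from which the central policy
    #    computes and applies a single gradient update.
    train_batch = [step for rollout in rollouts for step in rollout]
    grads = central_policy.compute_gradients(train_batch)
    central_policy.apply_gradients(grads)

    # 3) The updated weights are broadcast back to all workers.
    weights = central_policy.get_weights()
    for worker in workers:
        worker.set_weights(weights)
```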


## Documentation & Implementation of A2C:

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#a2c)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/algorithms/a2c/a2c.py)**
3 changes: 3 additions & 0 deletions rllib/algorithms/a2c/__init__.py
@@ -0,0 +1,3 @@
from ray.rllib.algorithms.a2c.a2c import A2CConfig, A2C, A2C_DEFAULT_CONFIG

__all__ = ["A2CConfig", "A2C", "A2C_DEFAULT_CONFIG"]