Major modeling refactoring #165

Merged
33 commits merged on Apr 30, 2024

Commits
81074f0
[Feat] add entropy calculation
fedebotu Apr 23, 2024
bb64ef1
[Feat] action logprob evaluation
fedebotu Apr 23, 2024
44c4901
[Minor] remove unused_kwarg for clarity
fedebotu Apr 23, 2024
fbd4941
[Rename] embedding_dim -> embed_dim (PyTorch naming convention)
fedebotu Apr 23, 2024
6e07985
[Move] move common one level up
fedebotu Apr 23, 2024
f30c32d
[Refactor] classify NCO as constructive (AR,NAR), improvement, search
fedebotu Apr 23, 2024
3a16b7c
[Refactor] follow major refactoring
fedebotu Apr 23, 2024
3ec285e
[Refactor] cleaner implementation; eval via policy itself
fedebotu Apr 23, 2024
796d54a
[Refactor] make env_name an optional kwarg
fedebotu Apr 23, 2024
faab06e
[Tests] adapt to refactoring
fedebotu Apr 23, 2024
5d04dfa
[Refactor] new structure; env_name as optional; embed_dim standardiza…
fedebotu Apr 23, 2024
4e6351c
[Tests] minor fix
fedebotu Apr 23, 2024
10cc4ee
Fixing best solution gathering for POMO
ahottung Apr 24, 2024
81a3bf9
Fixing bug introduced in last commit
ahottung Apr 25, 2024
7034172
Merge remote-tracking branch 'origin/main' into refactor-base
fedebotu Apr 27, 2024
3644acb
[BugFix] default POMO parameters
fedebotu Apr 27, 2024
cd62442
[Rename] Search -> Transductive
fedebotu Apr 27, 2024
4180997
[Feat] add NARGNN (as in DeepACO) as a separate policy and encoder
fedebotu Apr 27, 2024
e783679
[Refactor] abstract classes with abc.ABCMeta
fedebotu Apr 27, 2024
5a4740f
[Refactor] abstract classes with abc.ABCMeta
fedebotu Apr 27, 2024
3adbef4
[Feat] modular Critic network
fedebotu Apr 28, 2024
db06207
[Rename] PPOModel -> AMPPO
fedebotu Apr 28, 2024
9ef3254
[Refactor] separate A2C from classic REINFORCE #93
fedebotu Apr 28, 2024
ca44680
Merge remote-tracking branch 'origin/main' into refactor-base
fedebotu Apr 28, 2024
2c91457
[Minor] force env_name as str for clarity
fedebotu Apr 28, 2024
6da8691
[Tests] avoid testing render
fedebotu Apr 28, 2024
04ed94a
[Doc] add docstrings
fedebotu Apr 28, 2024
b7fe9b3
[BugFix] env_name not passed to base class
fedebotu Apr 28, 2024
3558d57
[Doc] update to latest version
fedebotu Apr 28, 2024
c3089fb
[Minor] woopsie, remove added exampels
fedebotu Apr 28, 2024
c1e19e8
[Minor] fix NAR; raise log error if any param is found in decoder
fedebotu Apr 28, 2024
90956af
[Doc] fix docstrings
fedebotu Apr 28, 2024
cfaf43d
[Doc] documentation update and improvements
fedebotu Apr 28, 2024
2 changes: 1 addition & 1 deletion configs/model/am-ppo.yaml
@@ -1,4 +1,4 @@
-_target_: rl4co.models.PPOModel
+_target_: rl4co.models.AMPPO

metrics:
train: ["loss", "reward", "surrogate_loss", "value_loss", "entropy_bonus"]
9 changes: 9 additions & 0 deletions docs/_content/api/algos/a2c.md
@@ -0,0 +1,9 @@
# A2C

## A2C (Advantage Actor Critic)

```{eval-rst}
.. automodule:: rl4co.models.rl.a2c.a2c
:members:
:undoc-members:
```
2 changes: 1 addition & 1 deletion docs/_content/api/algos/base.md
@@ -6,4 +6,4 @@
.. automodule:: rl4co.models.rl.common.base
:members:
:undoc-members:
```
```
4 changes: 4 additions & 0 deletions docs/_content/api/algos/ppo.md
@@ -1,5 +1,9 @@
# PPO


## PPO (Proximal Policy Optimization)


```{eval-rst}
.. automodule:: rl4co.models.rl.ppo.ppo
:members:
2 changes: 0 additions & 2 deletions docs/_content/api/algos/reinforce.md
@@ -8,8 +8,6 @@
:undoc-members:
```

---

## Baselines

```{eval-rst}
7 changes: 0 additions & 7 deletions docs/_content/api/algos/search.md

This file was deleted.

7 changes: 7 additions & 0 deletions docs/_content/api/decoding.md
@@ -0,0 +1,7 @@
# Decoding Strategies

```{eval-rst}
.. automodule:: rl4co.utils.decoding
:members:
:undoc-members:
```
64 changes: 0 additions & 64 deletions docs/_content/api/models/base.md

This file was deleted.

53 changes: 53 additions & 0 deletions docs/_content/api/models/common/__init__.md
@@ -0,0 +1,53 @@
# NCO Methods Overview


We categorize NCO approaches (which are in fact not necessarily trained with RL!) into the following: 1) constructive, 2) improvement, 3) transductive.


```{eval-rst}
.. tip::
Note that in RL4CO we distinguish the RL algorithms and the actors via the following naming:

* **Model:** Refers to the reinforcement learning algorithm encapsulated within a `LightningModule`. This module is responsible for training the policy.
* **Policy:** Implemented as a `nn.Module`, this neural network (often referred to as the *actor*) takes an instance and outputs a sequence of actions, :math:`\pi = \pi_0, \pi_1, \dots, \pi_N`, which constitutes the solution.

Here, :math:`\pi_i` represents the action taken at step :math:`i`, forming a sequence that leads to the optimal or near-optimal solution for the given instance.
```
Review comment (Member), on lines +7 to +15:

We could mention here or somewhere else that abstract classes under rl4co/models/common are not expected to be directly initialized. For example, if you want to use an autoregressive policy, you may want to init an AM model instead of the AutoregressivePolicy(), same as NAR, improvement, and transductive classes.

The following table contains the categorization that we follow in RL4CO:


```{eval-rst}
.. list-table:: Overview of RL Models and Policies
:widths: 5 5 5 5 25
:header-rows: 1
:stub-columns: 1

* - Category
- Model or Policy?
- Input
- Output
- Description
* - Constructive
- Policy
- Instance
- Solution
- Policies trained to generate solutions from scratch. Can be categorized into AutoRegressive (AR) and Non-Autoregressive (NAR).
* - Improvement
- Policy
- Instance, Current Solution
- Improved Solution
- Policies trained to improve existing solutions iteratively, akin to local search algorithms. They focus on refining *existing* solutions rather than generating them from scratch.
* - Transductive
- Model
- Instance, (Policy)
- Solution, (Updated Policy)
- Updates policy parameters during online testing to improve solutions of a specific instance.
```
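Concretely, and echoing the review note above that the abstract classes under `rl4co/models/common` are not meant to be instantiated directly, here is a minimal sketch of the Model/Policy split (assuming rl4co's `TSPEnv`, `AttentionModelPolicy`, and `AttentionModel`; exact signatures may differ across versions):

```python
from rl4co.envs import TSPEnv
from rl4co.models import AttentionModel, AttentionModelPolicy

env = TSPEnv()  # problem definition: instance generation, transitions, reward

# Policy: an nn.Module mapping an instance to a sequence of actions (the solution)
policy = AttentionModelPolicy(env_name=env.name, embed_dim=128)

# Model: a LightningModule (REINFORCE-based AM here) that trains the policy
model = AttentionModel(env, policy=policy, baseline="rollout")
```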






71 changes: 71 additions & 0 deletions docs/_content/api/models/common/constructive.md
@@ -0,0 +1,71 @@
## Constructive Policies

Constructive NCO methods pre-train a policy to amortize inference: "constructive" means that the model builds a solution from scratch. We can further split constructive NCO into two sub-categories depending on the role of the encoder and decoder:

#### Autoregressive (AR)
Autoregressive approaches **use a learned decoder** that outputs log probabilities for the next action given the current partial solution. These approaches generate a solution step by step, much like LLMs generate text, and typically follow an encoder-decoder structure; some models skip a separate encoder and simply re-encode the state at each step.
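For intuition, a minimal sketch of such a step-by-step rollout (illustrative only: `encoder`, `decoder`, and the tensordict keys are placeholders rather than the exact rl4co API):

```python
import torch

def autoregressive_rollout(encoder, decoder, env, td):
    """Encode the instance once, then decode actions step by step until done."""
    hidden = encoder(td)  # one-shot encoding of the instance
    actions, logprobs = [], []
    while not td["done"].all():
        logits = decoder(td, hidden)                     # scores over feasible next actions
        logp = torch.log_softmax(logits, dim=-1)
        action = logp.exp().multinomial(1).squeeze(-1)   # sample the next action
        td.set("action", action)
        td = env.step(td)["next"]                        # environment transition
        actions.append(action)
        logprobs.append(logp.gather(-1, action.unsqueeze(-1)).squeeze(-1))
    return torch.stack(actions, -1), torch.stack(logprobs, -1)
```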

#### NonAutoregressive (NAR)
In NAR approaches, by contrast, **only the encoder is learnable**: it encodes the instance in one shot into, for example, a heatmap, which can then be decoded by treating it as a probability distribution or by running a search method on top.
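For example, a hedged sketch of greedy tour decoding from a TSP heatmap (plain PyTorch, not the rl4co decoder):

```python
import torch

def greedy_decode_from_heatmap(heatmap: torch.Tensor) -> torch.Tensor:
    """heatmap: [num_nodes, num_nodes] edge scores; returns a tour as node indices."""
    num_nodes = heatmap.size(0)
    visited = torch.zeros(num_nodes, dtype=torch.bool)
    tour = [0]
    visited[0] = True
    for _ in range(num_nodes - 1):
        scores = heatmap[tour[-1]].masked_fill(visited, float("-inf"))
        nxt = int(scores.argmax())  # follow the highest-scoring unvisited edge
        tour.append(nxt)
        visited[nxt] = True
    return torch.tensor(tour)
```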

Here is the general structure of a constructive policy with an encoder-decoder architecture:

<img class="full-img" alt="policy" src="https://user-images.githubusercontent.com/48984123/281976545-ca88f159-d0b3-459e-8fd9-89799be9d1b0.png">


where _embeddings_ transfer information from feature space to embedding space.

---



### Constructive Policy Base Classes

```{eval-rst}
.. automodule:: rl4co.models.common.constructive.base
:members:
:undoc-members:
```



### Autoregressive Policies Base Classes

```{eval-rst}
.. automodule:: rl4co.models.common.constructive.autoregressive.encoder
:members:
:undoc-members:
```

```{eval-rst}
.. automodule:: rl4co.models.common.constructive.autoregressive.decoder
:members:
:undoc-members:
```

```{eval-rst}
.. automodule:: rl4co.models.common.constructive.autoregressive.policy
:members:
:undoc-members:
```

### Nonautoregressive Policies Base Classes


```{eval-rst}
.. automodule:: rl4co.models.common.constructive.nonautoregressive.encoder
:members:
:undoc-members:
```

```{eval-rst}
.. automodule:: rl4co.models.common.constructive.nonautoregressive.decoder
:members:
:undoc-members:
```

```{eval-rst}
.. automodule:: rl4co.models.common.constructive.nonautoregressive.policy
:members:
:undoc-members:
```
3 changes: 3 additions & 0 deletions docs/_content/api/models/common/improvement.md
@@ -0,0 +1,3 @@
## Improvement Policies

These methods differ from constructive NCO in that they improve existing solutions over time, similarly to local search algorithms, rather than generating solutions from scratch. This also differs from decoding strategies applied on top of constructive methods, since improvement policies are explicitly trained to perform improvement operations.
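A minimal sketch of such an improvement loop (pseudocode-style; the policy/environment interfaces and the `cost` key are placeholders, not the exact rl4co API):

```python
import torch

def improvement_rollout(policy, env, td, n_steps=100):
    """Iteratively apply policy-proposed local moves (e.g., 2-opt pairs) to the current solution."""
    best_cost = td["cost"].clone()            # cost of the initial solution
    for _ in range(n_steps):
        action = policy(td)                   # e.g., the node pair of an exchange move
        td.set("action", action)
        td = env.step(td)["next"]             # apply the move and update the cost
        best_cost = torch.minimum(best_cost, td["cost"])
    return td, best_cost
```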
19 changes: 19 additions & 0 deletions docs/_content/api/models/common/transductive.md
@@ -0,0 +1,19 @@
# Transductive Models


Transductive models are learning algorithms that optimize on a specific instance. They improve solutions by updating the policy parameters $\theta$ **at test time**, i.e., optimization (backprop) is run during inference. Transductive learning can be performed with different policies: for example, EAS updates (a part of) the parameters of an AR policy to obtain better solutions on the instance at hand, and other test-time optimization schemes are possible as well.


```{eval-rst}
.. tip::
You may refer to the definition of `inductive vs transductive RL <https://en.wikipedia.org/wiki/Transduction_(machine_learning)>`_. In inductive RL, we train to generalize to new instances. In transductive RL we train (or finetune) to solve only specific ones.
```
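A hedged sketch of EAS-style test-time adaptation on a single instance (`policy.decoder`, the output keys, and the call signature are assumptions for illustration, not the exact rl4co interface):

```python
import torch

def test_time_adapt(policy, env, td, n_iters=50, lr=1e-3):
    """Fine-tune a subset of policy parameters on one instance at test time."""
    params = list(policy.decoder.parameters())               # adapt only part of the policy
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(n_iters):
        out = policy(td.clone(), env, decode_type="sampling")    # sample solutions for this instance
        loss = -(out["reward"] * out["log_likelihood"]).mean()   # REINFORCE-style objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```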


## Base Transductive Model

```{eval-rst}
.. automodule:: rl4co.models.common.transductive.base
:members:
:undoc-members:
```
26 changes: 9 additions & 17 deletions docs/_content/api/models/nn.md
@@ -1,4 +1,12 @@
-# `nn` Modules
+# Neural Network Modules

## Critic Network

```{eval-rst}
.. automodule:: rl4co.models.rl.common.critic
:members:
:undoc-members:
```

## Graph Neural Networks

@@ -35,14 +43,6 @@
```


## rl4co.models.nn.flash_attention

```{eval-rst}
.. automodule:: rl4co.models.nn.flash_attention
:members:
:undoc-members:
```

## rl4co.models.nn.mlp

```{eval-rst}
@@ -57,12 +57,4 @@
.. automodule:: rl4co.models.nn.ops
:members:
:undoc-members:
```

## rl4co.models.nn.utils

```{eval-rst}
.. automodule:: rl4co.models.nn.utils
:members:
:undoc-members:
```
37 changes: 0 additions & 37 deletions docs/_content/api/models/rl.md

This file was deleted.
