Major modeling refactoring #165

Merged
33 commits merged on Apr 30, 2024

Commits
81074f0
[Feat] add entropy calculation
fedebotu Apr 23, 2024
bb64ef1
[Feat] action logprob evaluation
fedebotu Apr 23, 2024
44c4901
[Minor] remove unused_kwarg for clarity
fedebotu Apr 23, 2024
fbd4941
[Rename] embedding_dim -> embed_dim (PyTorch naming convention)
fedebotu Apr 23, 2024
6e07985
[Move] move common one level up
fedebotu Apr 23, 2024
f30c32d
[Refactor] classify NCO as constructive (AR,NAR), improvement, search
fedebotu Apr 23, 2024
3a16b7c
[Refactor] follow major refactoring
fedebotu Apr 23, 2024
3ec285e
[Refactor] cleaner implementation; eval via policy itself
fedebotu Apr 23, 2024
796d54a
[Refactor] make env_name an optional kwarg
fedebotu Apr 23, 2024
faab06e
[Tests] adapt to refactoring
fedebotu Apr 23, 2024
5d04dfa
[Refactor] new structure; env_name as optional; embed_dim standardiza…
fedebotu Apr 23, 2024
4e6351c
[Tests] minor fix
fedebotu Apr 23, 2024
10cc4ee
Fixing best solution gathering for POMO
ahottung Apr 24, 2024
81a3bf9
Fixing bug introduced in last commit
ahottung Apr 25, 2024
7034172
Merge remote-tracking branch 'origin/main' into refactor-base
fedebotu Apr 27, 2024
3644acb
[BugFix] default POMO parameters
fedebotu Apr 27, 2024
cd62442
[Rename] Search -> Transductive
fedebotu Apr 27, 2024
4180997
[Feat] add NARGNN (as in DeepACO) as a separate policy and encoder
fedebotu Apr 27, 2024
e783679
[Refactor] abstract classes with abc.ABCMeta
fedebotu Apr 27, 2024
5a4740f
[Refactor] abstract classes with abc.ABCMeta
fedebotu Apr 27, 2024
3adbef4
[Feat] modular Critic network
fedebotu Apr 28, 2024
db06207
[Rename] PPOModel -> AMPPO
fedebotu Apr 28, 2024
9ef3254
[Refactor] separate A2C from classic REINFORCE #93
fedebotu Apr 28, 2024
ca44680
Merge remote-tracking branch 'origin/main' into refactor-base
fedebotu Apr 28, 2024
2c91457
[Minor] force env_name as str for clarity
fedebotu Apr 28, 2024
6da8691
[Tests] avoid testing render
fedebotu Apr 28, 2024
04ed94a
[Doc] add docstrings
fedebotu Apr 28, 2024
b7fe9b3
[BugFix] env_name not passed to base class
fedebotu Apr 28, 2024
3558d57
[Doc] update to latest version
fedebotu Apr 28, 2024
c3089fb
[Minor] woopsie, remove added exampels
fedebotu Apr 28, 2024
c1e19e8
[Minor] fix NAR; raise log error if any param is found in decoder
fedebotu Apr 28, 2024
90956af
[Doc] fix docstrings
fedebotu Apr 28, 2024
cfaf43d
[Doc] documentation update and improvements
fedebotu Apr 28, 2024
2 changes: 1 addition & 1 deletion configs/model/am-ppo.yaml
@@ -1,4 +1,4 @@
-_target_: rl4co.models.PPOModel
+_target_: rl4co.models.AMPPO

metrics:
train: ["loss", "reward", "surrogate_loss", "value_loss", "entropy_bonus"]
9 changes: 9 additions & 0 deletions docs/_content/api/algos/a2c.md
@@ -0,0 +1,9 @@
# A2C

## A2C (Advantage Actor Critic)

```{eval-rst}
.. automodule:: rl4co.models.rl.a2c.a2c
:members:
:undoc-members:
```
2 changes: 1 addition & 1 deletion docs/_content/api/algos/base.md
@@ -6,4 +6,4 @@
.. automodule:: rl4co.models.rl.common.base
:members:
:undoc-members:
```
```
4 changes: 4 additions & 0 deletions docs/_content/api/algos/ppo.md
@@ -1,5 +1,9 @@
# PPO


## PPO (Proximal Policy Optimization)


```{eval-rst}
.. automodule:: rl4co.models.rl.ppo.ppo
:members:
2 changes: 0 additions & 2 deletions docs/_content/api/algos/reinforce.md
@@ -8,8 +8,6 @@
:undoc-members:
```

---

## Baselines

```{eval-rst}
7 changes: 0 additions & 7 deletions docs/_content/api/algos/search.md

This file was deleted.

7 changes: 7 additions & 0 deletions docs/_content/api/decoding.md
@@ -0,0 +1,7 @@
# Decoding Strategies

```{eval-rst}
.. automodule:: rl4co.utils.decoding
:members:
:undoc-members:
```
64 changes: 0 additions & 64 deletions docs/_content/api/models/base.md

This file was deleted.

53 changes: 53 additions & 0 deletions docs/_content/api/models/common/__init__.md
@@ -0,0 +1,53 @@
# NCO Methods Overview


We categorize NCO approaches (which are in fact not necessarily trained with RL!) into the following: 1) constructive, 2) improvement, 3) transductive.


```{eval-rst}
.. tip::
Note that in RL4CO we distinguish the RL algorithms and the actors via the following naming:

* **Model:** Refers to the reinforcement learning algorithm encapsulated within a `LightningModule`. This module is responsible for training the policy.
* **Policy:** Implemented as a `nn.Module`, this neural network (often referred to as the *actor*) takes an instance and outputs a sequence of actions, :math:`\pi = \pi_0, \pi_1, \dots, \pi_N`, which constitutes the solution.

Here, :math:`\pi_i` represents the action taken at step :math:`i`, forming a sequence that leads to the optimal or near-optimal solution for the given instance.
```
Review comment (Member), on lines +7 to +15:

We could mention here or somewhere else that abstract classes under rl4co/models/common are not expected to be directly initialized. For example, if you want to use an autoregressive policy, you may want to init an AM model instead of the AutoregressivePolicy(), same as NAR, improvement, and transductive classes.

The following table contains the categorization that we follow in RL4CO:


```{eval-rst}
.. list-table:: Overview of RL Models and Policies
:widths: 5 5 5 5 25
:header-rows: 1
:stub-columns: 1

* - Category
- Model or Policy?
- Input
- Output
- Description
* - Constructive
- Policy
- Instance
- Solution
- Policies trained to generate solutions from scratch. Can be categorized into AutoRegressive (AR) and Non-Autoregressive (NAR).
* - Improvement
- Policy
- Instance, Current Solution
- Improved Solution
- Policies trained to improve existing solutions iteratively, akin to local search algorithms. They focus on refining *existing* solutions rather than generating them from scratch.
* - Transductive
- Model
- Instance, (Policy)
- Solution, (Updated Policy)
- Updates policy parameters during online testing to improve solutions of a specific instance.
```
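Concretely, and echoing the review note above that the abstract classes under `rl4co/models/common` are not meant to be instantiated directly, here is a minimal sketch of the Model/Policy split (assuming rl4co's `TSPEnv`, `AttentionModelPolicy`, and `AttentionModel`; exact signatures may differ across versions):

```python
from rl4co.envs import TSPEnv
from rl4co.models import AttentionModel, AttentionModelPolicy

env = TSPEnv()  # problem definition: instance generation, transitions, reward

# Policy: an nn.Module mapping an instance to a sequence of actions (the solution)
policy = AttentionModelPolicy(env_name=env.name, embed_dim=128)

# Model: a LightningModule (REINFORCE-based AM here) that trains the policy
model = AttentionModel(env, policy=policy, baseline="rollout")
```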






71 changes: 71 additions & 0 deletions docs/_content/api/models/common/constructive.md
@@ -0,0 +1,71 @@
## Constructive Policies

Constructive NCO methods pre-train a policy to amortize inference: "constructive" means that the model builds a solution from scratch. We can further split constructive NCO into two sub-categories depending on the role of the encoder and decoder:

#### Autoregressive (AR)
Autoregressive approaches **use a learned decoder** that outputs log probabilities for the next action given the current partial solution. These approaches generate a solution step by step, much like LLMs generate text, and typically follow an encoder-decoder structure; some models skip a separate encoder and simply re-encode the state at each step.
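For intuition, a minimal sketch of such a step-by-step rollout (illustrative only: `encoder`, `decoder`, and the tensordict keys are placeholders rather than the exact rl4co API):

```python
import torch

def autoregressive_rollout(encoder, decoder, env, td):
    """Encode the instance once, then decode actions step by step until done."""
    hidden = encoder(td)  # one-shot encoding of the instance
    actions, logprobs = [], []
    while not td["done"].all():
        logits = decoder(td, hidden)                     # scores over feasible next actions
        logp = torch.log_softmax(logits, dim=-1)
        action = logp.exp().multinomial(1).squeeze(-1)   # sample the next action
        td.set("action", action)
        td = env.step(td)["next"]                        # environment transition
        actions.append(action)
        logprobs.append(logp.gather(-1, action.unsqueeze(-1)).squeeze(-1))
    return torch.stack(actions, -1), torch.stack(logprobs, -1)
```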

#### NonAutoregressive (NAR)
In NAR approaches, by contrast, **only the encoder is learnable**: it encodes the instance in one shot into, for example, a heatmap, which can then be decoded by treating it as a probability distribution or by running a search method on top.
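For example, a hedged sketch of greedy tour decoding from a TSP heatmap (plain PyTorch, not the rl4co decoder):

```python
import torch

def greedy_decode_from_heatmap(heatmap: torch.Tensor) -> torch.Tensor:
    """heatmap: [num_nodes, num_nodes] edge scores; returns a tour as node indices."""
    num_nodes = heatmap.size(0)
    visited = torch.zeros(num_nodes, dtype=torch.bool)
    tour = [0]
    visited[0] = True
    for _ in range(num_nodes - 1):
        scores = heatmap[tour[-1]].masked_fill(visited, float("-inf"))
        nxt = int(scores.argmax())  # follow the highest-scoring unvisited edge
        tour.append(nxt)
        visited[nxt] = True
    return torch.tensor(tour)
```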

Here is the general structure of a constructive policy with an encoder-decoder architecture:

<img class="full-img" alt="policy" src="https://user-images.githubusercontent.com/48984123/281976545-ca88f159-d0b3-459e-8fd9-89799be9d1b0.png">


where _embeddings_ transfer information from feature space to embedding space.

---



### Constructive Policy Base Classes

```{eval-rst}
.. automodule:: rl4co.models.common.constructive.base
:members:
:undoc-members:
```



### Autoregressive Policies Base Classes

```{eval-rst}
.. automodule:: rl4co.models.common.constructive.autoregressive.encoder
:members:
:undoc-members:
```

```{eval-rst}
.. automodule:: rl4co.models.common.constructive.autoregressive.decoder
:members:
:undoc-members:
```

```{eval-rst}
.. automodule:: rl4co.models.common.constructive.autoregressive.policy
:members:
:undoc-members:
```

### Nonautoregressive Policies Base Classes


```{eval-rst}
.. automodule:: rl4co.models.common.constructive.nonautoregressive.encoder
:members:
:undoc-members:
```

```{eval-rst}
.. automodule:: rl4co.models.common.constructive.nonautoregressive.decoder
:members:
:undoc-members:
```

```{eval-rst}
.. automodule:: rl4co.models.common.constructive.nonautoregressive.policy
:members:
:undoc-members:
```
3 changes: 3 additions & 0 deletions docs/_content/api/models/common/improvement.md
@@ -0,0 +1,3 @@
## Improvement Policies

These methods differ from constructive NCO in that they improve existing solutions over time, similarly to local search algorithms, rather than generating solutions from scratch. This also differs from decoding strategies applied on top of constructive methods, since improvement policies are explicitly trained to perform improvement operations.
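A minimal sketch of such an improvement loop (pseudocode-style; the policy/environment interfaces and the `cost` key are placeholders, not the exact rl4co API):

```python
import torch

def improvement_rollout(policy, env, td, n_steps=100):
    """Iteratively apply policy-proposed local moves (e.g., 2-opt pairs) to the current solution."""
    best_cost = td["cost"].clone()            # cost of the initial solution
    for _ in range(n_steps):
        action = policy(td)                   # e.g., the node pair of an exchange move
        td.set("action", action)
        td = env.step(td)["next"]             # apply the move and update the cost
        best_cost = torch.minimum(best_cost, td["cost"])
    return td, best_cost
```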
19 changes: 19 additions & 0 deletions docs/_content/api/models/common/transductive.md
@@ -0,0 +1,19 @@
# Transductive Models


Transductive models are learning algorithms that optimize on a specific instance. They improve solutions by updating the policy parameters $\theta$ **at test time**, i.e., optimization (backprop) is run during inference. Transductive learning can be performed with different policies: for example, EAS updates (a part of) the parameters of an AR policy to obtain better solutions on the instance at hand, and other test-time optimization schemes are possible as well.


```{eval-rst}
.. tip::
You may refer to the definition of `inductive vs transductive RL <https://en.wikipedia.org/wiki/Transduction_(machine_learning)>`_. In inductive RL, we train to generalize to new instances. In transductive RL we train (or finetune) to solve only specific ones.
```
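A hedged sketch of EAS-style test-time adaptation on a single instance (`policy.decoder`, the output keys, and the call signature are assumptions for illustration, not the exact rl4co interface):

```python
import torch

def test_time_adapt(policy, env, td, n_iters=50, lr=1e-3):
    """Fine-tune a subset of policy parameters on one instance at test time."""
    params = list(policy.decoder.parameters())               # adapt only part of the policy
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(n_iters):
        out = policy(td.clone(), env, decode_type="sampling")    # sample solutions for this instance
        loss = -(out["reward"] * out["log_likelihood"]).mean()   # REINFORCE-style objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```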


## Base Transductive Model

```{eval-rst}
.. automodule:: rl4co.models.common.transductive.base
:members:
:undoc-members:
```
26 changes: 9 additions & 17 deletions docs/_content/api/models/nn.md
@@ -1,4 +1,12 @@
-# `nn` Modules
+# Neural Network Modules

## Critic Network

```{eval-rst}
.. automodule:: rl4co.models.rl.common.critic
:members:
:undoc-members:
```

## Graph Neural Networks

@@ -35,14 +43,6 @@
```


## rl4co.models.nn.flash_attention

```{eval-rst}
.. automodule:: rl4co.models.nn.flash_attention
:members:
:undoc-members:
```

## rl4co.models.nn.mlp

```{eval-rst}
@@ -57,12 +57,4 @@
.. automodule:: rl4co.models.nn.ops
:members:
:undoc-members:
```

## rl4co.models.nn.utils

```{eval-rst}
.. automodule:: rl4co.models.nn.utils
:members:
:undoc-members:
```
37 changes: 0 additions & 37 deletions docs/_content/api/models/rl.md

This file was deleted.
