Merge pull request #18 from hakuhodo-technologies/d3rlpy
Upgrade to scope-rl==0.2.1
aiueola authored Jul 30, 2023
2 parents 138db9e + 31dc06b commit 0493d14
Showing 52 changed files with 13,006 additions and 14,227 deletions.
4 changes: 2 additions & 2 deletions .readthedocs.yaml
@@ -9,15 +9,15 @@ version: 2
build:
os: ubuntu-22.04
tools:
python: "3.9" # "3.10", "3.11" fails with d3rlpy==1.1.1
python: "3.11"
# You can also specify other tool versions:
# nodejs: "19"
# rust: "1.64"
# golang: "1.19"
jobs:
post_install:
- pip install Cython numpy
- pip install d3rlpy==1.1.1
- pip install d3rlpy>=2.0.2
- pip install scipy>=1.10.1
- pip install numpy>=1.22.4
- pip install pandas>=1.5.3
8 changes: 5 additions & 3 deletions FrequentlyAskedQuestions.md
@@ -26,9 +26,11 @@ env = NewGymAPIWrapper(env)

Q. xxx environment does not work on d3rlpy, which is used for model training. How should we fix it? (d3rlpy and SCOPE-RL are compatible with different versions of OpenAI Gym.)

A. While SCOPE-RL is compatible with the latest API of OpenAI Gym, d3rlpy is not. Therefore, please use `OldGymAPIWrapper` provided in `scope_rl/utils.py` to enable the use of d3rlpy.
A. Both `scope-rl>=0.2.1` and `d3rlpy>=2.0.2` are compatible with `gym>=0.26.0` and `gymnasium` environments. The source is available in the `main` branch.

If you want to use the older interface of `d3rlpy`, make sure to use `scope-rl==0.1.3` and `d3rlpy==1.1.1`. Then, please use `OldGymAPIWrapper` provided in `scope_rl/utils.py` to enable the use of d3rlpy. The source is available in the `depreciated` branch.
```Python
from scope_rl.utils import OldGymAPIWrapper
env = gym.make("xxx_v0") # compatible with gym>=0.26.2 and SCOPE-RL
env_ = OldGymAPIWrapper(env) # compatible with gym<0.26.2 and d3rlpy
env = gym.make("xxx_v0") # compatible with gym>=0.26.2 and scope-rl==0.1.3
env_ = OldGymAPIWrapper(env) # compatible with gym<0.26.2 and d3rlpy==1.1.1
```
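
With `scope-rl>=0.2.1` and `d3rlpy>=2.0.2`, no wrapper is needed; a minimal sketch of the current setup (the environment id is a placeholder, as above):
```Python
import gym  # or: import gymnasium as gym

env = gym.make("xxx_v0")  # compatible with scope-rl>=0.2.1 and d3rlpy>=2.0.2 as is; no OldGymAPIWrapper needed
```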
47 changes: 25 additions & 22 deletions README.md
@@ -2,6 +2,15 @@

<div align="center"><img src="https://raw.githubusercontent.com/hakuhodo-technologies/scope-rl/main/images/logo.png" width="100%"/></div>

[![pypi](https://img.shields.io/pypi/v/scope-rl.svg)](https://pypi.python.org/pypi/scope-rl)
[![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11-blue)](https://www.python.org)
[![Downloads](https://pepy.tech/badge/scope-rl)](https://pepy.tech/project/scope-rl)
[![GitHub commit activity](https://img.shields.io/github/commit-activity/m/hakuhodo-technologies/scope-rl)](https://github.com/hakuhodo-technologies/scope-rl/graphs/contributors)
[![GitHub last commit](https://img.shields.io/github/last-commit/hakuhodo-technologies/scope-rl)](https://github.com/hakuhodo-technologies/scope-rl/graphs/commit-activity)
[![Documentation Status](https://readthedocs.org/projects/scope-rl/badge/?version=latest)](https://scope-rl.readthedocs.io/en/latest/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![arXiv](https://img.shields.io/badge/arXiv-23xx.xxxxx-b31b1b.svg)](https://arxiv.org/abs/23xx.xxxxx)

<details>
<summary><strong>Table of Contents </strong>(click to expand)</summary>

@@ -152,30 +161,28 @@ Let's start by generating some synthetic logged data useful for performing offli
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead
# import d3rlpy algorithms
from d3rlpy.algos import DoubleDQN
from d3rlpy.online.buffers import ReplayBuffer
from d3rlpy.online.explorers import ConstantEpsilonGreedy
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy
# import rtbgym and gym
import rtbgym
import gym
import torch
# random state
random_state = 12345
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# (0) Setup environment
env = gym.make("RTBEnv-discrete-v0")

# for api compatibility to d3rlpy
from scope_rl.utils import OldGymAPIWrapper
env_ = OldGymAPIWrapper(env)

# (1) Learn a baseline policy in an online environment (using d3rlpy)
# initialize the algorithm
ddqn = DoubleDQN()
ddqn = DoubleDQNConfig().create(device=device)
# train an online policy
# this takes about 5min to compute
ddqn.fit_online(
env_,
buffer=ReplayBuffer(maxlen=10000, env=env_),
env,
buffer=create_fifo_replay_buffer(limit=10000, env=env),
explorer=ConstantEpsilonGreedy(epsilon=0.3),
n_steps=100000,
n_steps_per_epoch=1000,
@@ -194,15 +201,15 @@ behavior_policy = EpsilonGreedyHead(
# initialize the dataset class
dataset = SyntheticDataset(
env=env,
maximum_episode_steps=env.step_per_episode,
max_episode_steps=env.step_per_episode,
)
# the behavior policy collects some logged data
train_logged_dataset = dataset.obtain_trajectories(
train_logged_dataset = dataset.obtain_episodes(
behavior_policies=behavior_policy,
n_trajectories=10000,
random_state=random_state,
)
test_logged_dataset = dataset.obtain_trajectories(
test_logged_dataset = dataset.obtain_episodes(
behavior_policies=behavior_policy,
n_trajectories=10000,
random_state=random_state + 1,
@@ -217,7 +224,7 @@ We are now ready to learn a new policy (evaluation policy) from the logged data

# import d3rlpy algorithms
from d3rlpy.dataset import MDPDataset
from d3rlpy.algos import DiscreteCQL
from d3rlpy.algos import DiscreteCQLConfig

# (3) Learning a new policy from offline logged data (using d3rlpy)
# convert the logged dataset into d3rlpy's dataset format
@@ -226,16 +233,13 @@ offlinerl_dataset = MDPDataset(
actions=train_logged_dataset["action"],
rewards=train_logged_dataset["reward"],
terminals=train_logged_dataset["done"],
episode_terminals=train_logged_dataset["done"],
discrete_action=True,
)
# initialize the algorithm
cql = DiscreteCQL()
cql = DiscreteCQLConfig().create(device=device)
# train an offline policy
cql.fit(
offlinerl_dataset,
n_steps=10000,
scorers={},
)
```
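
The collapsed portion of the quickstart wraps the trained policies into the evaluation policy heads (`cql_`, `ddqn_`, `random_`) used below. A hedged sketch of that wrapping, assuming `EpsilonGreedyHead` accepts the fitted algorithm plus `n_actions`, `epsilon`, `name`, and `random_state` (argument names may differ from the actual API):

```Python
# a sketch: wrap the trained policy as a (near-)greedy evaluation policy
cql_ = EpsilonGreedyHead(
    cql,                           # the offline policy trained above
    n_actions=env.action_space.n,  # assumed argument name
    epsilon=0.0,                   # greedy head for evaluation
    name="cql",
    random_state=random_state,
)
```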

@@ -281,7 +285,6 @@ evaluation_policies = [cql_, ddqn_, random_]
# create input for the OPE class
prep = CreateOPEInput(
env=env,
logged_dataset=test_logged_dataset,
)
input_dict = prep.obtain_whole_inputs(
logged_dataset=test_logged_dataset,
@@ -357,7 +360,7 @@ For more extensive examples, please refer to [quickstart/rtb/rtb_synthetic_discr

### Off-Policy Selection and Evaluation of OPE/OPS

We can also select the best-performing policy among a set of candidate policies based on the OPE results using the OPS class. It is also possible to evaluate the reliability of OPE/OPS using various metrics such as mean squaredberror, rank correlation, regret, and type I and type II error rates.
We can also select the best-performing policy among a set of candidate policies based on the OPE results using the OPS class. It is also possible to evaluate the reliability of OPE/OPS using various metrics such as mean squared error, rank correlation, regret, and type I and type II error rates.

```Python
# perform off-policy selection based on the OPE results
@@ -379,7 +382,7 @@ ranking_dict_ = ops.select_by_policy_value_via_cumulative_distribution_ope(input
ops.visualize_topk_policy_value_selected_by_standard_ope(
input_dict=input_dict,
compared_estimators=["dm", "tis", "pdis", "dr"],
safety_criteria=1.0,
relative_safety_criteria=1.0,
)
```
<div align="center"><img src="https://raw.githubusercontent.com/hakuhodo-technologies/scope-rl/main/images/ops_topk_lower_quartile.png" width="100%"/></div>
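
The reliability metrics mentioned above (mean squared error, rank correlation, regret, and type I/II error rates) can also be returned together with the estimated ranking. A hedged sketch, assuming a `select_by_policy_value` method analogous to the variants shown here, with `return_metrics` and `return_by_dataframe` flags:

```Python
# a sketch: retrieve the OPS ranking together with validation metrics
# (the return_metrics flag is an assumption; the exact signature may differ)
ranking_df, metric_df = ops.select_by_policy_value(
    input_dict,
    return_metrics=True,
    return_by_dataframe=True,
)
```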
@@ -399,7 +402,7 @@ ranking_df, metric_df = ops.select_by_lower_quartile(
return_by_dataframe=True,
)
# visualize the OPS results with the ground-truth metrics
ops.visualize_cvar_for_validation(
ops.visualize_conditional_value_at_risk_for_validation(
input_dict,
alpha=0.3,
share_axes=True,
2 changes: 1 addition & 1 deletion docs/_templates/autosummary/module_head.rst
@@ -1,7 +1,7 @@
{{ fullname | escape | underline}}

.. automodule:: {{ fullname }}
:exclude-members: build_with_dataset,build_with_env,copy_policy_from,copy_q_function_from,fitter,generate_new_data,create_impl,get_action_type,get_params,load_model,save_model,from_json,save_params,save_policy,set_active_logger,set_grad_step,set_params,impl,grad_step,n_frames,action_size,batch_size,gamma,n_steps,reward_scaler,scaler,fit,fit_online,update,collect,action_logger,action_scalar,observation_space,predict,predict_value,fit_batch_online,sample_action
:exclude-members: build_with_dataset,build_with_env,copy_policy_from,copy_policy_optim_from,copy_q_function_from,copy_q_function_optim_from,fitter,update,inner_update,create_impl,inner_create_impl,get_action_type,load_model,save_model,from_json,save,save_policy,set_grad_step,reset_optimizer_states,impl,grad_step,action_size,batch_size,gamma,config,reward_scaler,observation_scaler,action_scaler,fit,fit_online,observation_shape,predict,predict_value,sample_action

{% block functions %}
{% if functions %}
6 changes: 3 additions & 3 deletions docs/documentation/examples/assessments.rst
@@ -8,7 +8,7 @@ Here, we show example codes for assessing OPE/OPS results.
For preparation, please also refer to the following pages:

* :doc:`What are Off-Policy Evaluation and Selection? </documentation/ope_ops>`
* :ref:`Supported Evaluation Protococols for OPE/OPS <implementation_eval_ope_ops>`
* :ref:`Supported Evaluation Protocols for OPE/OPS <implementation_eval_ope_ops>`
* :doc:`/documentation/sharpe_ratio`
* :doc:`Supported Implementations for data collection and Offline RL </documentation/learning_implementation>`
* :doc:`Example codes for basic OPE </documentation/examples/basic_ope>`
@@ -22,14 +22,14 @@ Here, we assume that an RL environment, a behavior policy, and evaluation polici

* ``behavior_policy``: an instance of :class:`BaseHead`
* ``evaluation_policies``: a list of instance(s) of :class:`BaseHead`
* ``env``: a gym environment (unecessary when using real-world datasets)
* ``env``: a gym environment (unnecessary when using real-world datasets)

Additionally, we assume that the logged datasets, inputs, and either ope or cd_ope instances are ready to use.
For initializing the ope and cd_ope instances, please refer to :doc:`this page </documentation/examples/basic_ope>`
and :doc:`this page </documentation/examples/cumulative_dist_ope>` as references, respectively.

* ``logged_dataset``: a dictionary containing the logged dataset
* ``input_dict``: a dictionaty containing inputs for OPE
* ``input_dict``: a dictionary containing inputs for OPE
* ``ope``: an instance of :class:`OffPolicyEvaluation`
* ``cd_ope``: an instance of :class:`CumulativeDistributionOPE`

23 changes: 11 additions & 12 deletions docs/documentation/examples/basic_ope.rst
@@ -17,7 +17,7 @@ Here, we assume that an RL environment, a behavior policy, and evaluation polici

* ``behavior_policy``: an instance of :class:`BaseHead`
* ``evaluation_policies``: a list of instance(s) of :class:`BaseHead`
* ``env``: a gym environment (unecessary when using real-world datasets)
* ``env``: a gym environment (unnecessary when using real-world datasets)

Then, we use the behavior policy to collect logged dataset as follows.

@@ -30,7 +30,7 @@ Then, we use the behavior policy to collect logged dataset as follows.
env=env,
max_episode_steps=env.step_per_episode,
)
# obtain logged dataset
# obtain a logged dataset
logged_dataset = dataset.obtain_episodes(
behavior_policies=behavior_policy,
n_trajectories=10000,
@@ -48,7 +48,7 @@ The next step is to create the inputs for OPE estimators. This procedure slightl
OPE with importance sampling-based estimators
----------
When using the importance sampling-based estimators including TIS, PDIS, SNTIS, and SNPDIS,
and hybrid estimators including DR and SNDR, make sure that "pscore" is recorded in the logged dataset.
and hybrid estimators including DR and SNDR, make sure that "pscore" (i.e., action choice probability of the behavior policy) is recorded in the logged dataset.
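
Before running these estimators, the recorded propensities can be checked with a quick sanity test (a minimal sketch; the key name follows the description above):

.. code-block:: python

    # the behavior policy's action choice probabilities must be stored
    # under the "pscore" key of the logged dataset
    assert "pscore" in logged_dataset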

Then, when using only importance sampling-based estimators, the minimal sufficient codes are the following:

@@ -58,7 +58,7 @@ Then, when using only importance sampling-based estimators, the minimal sufficie
# initialize class to create inputs
prep = CreateOPEInput(
env=env, # unecessary when using real-world dataset
env=env, # unnecessary when using real-world dataset
)
# create inputs
input_dict = prep.obtain_whole_inputs(
@@ -82,7 +82,6 @@ When using the model-based estimator (DM) or hybrid methods, we need to addition
"encoder_factory": VectorEncoderFactory(hidden_units=[30, 30]),
"q_func_factory": MeanQFunctionFactory(),
"learning_rate": 1e-4,
"use_gpu": torch.cuda.is_available(),
},
},
)
@@ -148,11 +147,11 @@ We can also apply scaling to either state observation or (continuous) action as

.. code-block:: python
from scope_rl.utils import MinMaxScaler
from d3rlpy.preprocessing import MinMaxObservationScaler, MinMaxActionScaler
prep = CreateOPEInput(
env=env,
state_scaler=MinMaxScaler( #
state_scaler=MinMaxObservationScaler( #
minimum=logged_dataset["state"].min(axis=0),
maximum=logged_dataset["state"].max(axis=0),
),
@@ -220,7 +219,7 @@ Note that, the following provides the complete list of estimators that are curre
* :doc:`Supported OPE estimators </documentation/evaluation_implementation>` summarizes the key properties of each estimator.


We can easily conduct OPE and obtain and the results as follows.
We can easily conduct OPE and obtain the results as follows.

.. code-block:: python
@@ -290,7 +289,7 @@ Users can also specify the compared OPE estimators as follows.
random_state=random_state,
)
When ``legend`` is unecessary, just disable this option.
When ``legend`` is unnecessary, just disable this option.

.. code-block:: python
@@ -300,7 +299,7 @@ When ``legend`` is unecessary, just disable this option.
random_state=random_state,
)
To save figure, specify the directory to save it.
To save the figure, specify the directory to save it.

.. code-block:: python
@@ -313,7 +312,7 @@ To save figure, specify the directory to save it.
Choosing the "Spectrum" of OPE for marginal estimators
----------
The implemented OPE estimators can interpolates among naive importance sampling and
The implemented OPE estimators can interpolate among naive importance sampling and
marginal importance sampling by specifying the steps to use per-decision importance weight
(See :ref:`Supported OPE estimators (SOPE) <implementation_sope>` for the details).
This is done by specifying ``n_step_pdis`` when initializing the class.
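
A hedged sketch of such an initialization, assuming the :class:`OffPolicyEvaluation` class and import path from the quickstart (the estimator list is a placeholder and the exact argument set may differ):

.. code-block:: python

    from scope_rl.ope import OffPolicyEvaluation as OPE

    # n_step_pdis controls how many steps use the per-decision importance
    # weight; the other steps rely on marginal importance weights
    ope = OPE(
        logged_dataset=logged_dataset,
        ope_estimators=ope_estimators,  # e.g., state(-action) marginal estimators
        n_step_pdis=5,
    )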
@@ -329,7 +328,7 @@ This is done by specifying ``n_step_pdis`` when initializing the class.
Choosing a kernel for continuous-action OPE
----------
In continuous-action OPE, the choices of kernel and the bandwith hyperparameter can affect the bias-variance tradeoff and the estimation accuracy.
In continuous-action OPE, the choices of the kernel and the bandwidth hyperparameter can affect the bias-variance tradeoff and the estimation accuracy.
To control the hyperparameter, please use the following arguments.

.. code-block:: python
4 changes: 2 additions & 2 deletions docs/documentation/examples/cumulative_dist_ope.rst
@@ -19,7 +19,7 @@ Here, we assume that an RL environment, a behavior policy, and evaluation polici

* ``behavior_policy``: an instance of :class:`BaseHead`
* ``evaluation_policies``: a list of instance(s) of :class:`BaseHead`
* ``env``: a gym environment (unecessary when using real-world datasets)
* ``env``: a gym environment (unnecessary when using real-world datasets)

Then, we use the behavior policy to collect logged dataset as follows.

@@ -60,7 +60,7 @@ Then, when using only importance sampling-based estimators, the minimal sufficie
# initialize class to create inputs
prep = CreateOPEInput(
env=env, # unecessary when using real-world dataset
env=env, # unnecessary when using real-world dataset
env=env, # unnecessary when using real-world dataset
)
# create inputs (e.g., calculating )
input_dict = prep.obtain_whole_inputs(
2 changes: 1 addition & 1 deletion docs/documentation/examples/custom_estimators.rst
@@ -327,7 +327,7 @@ Note that, the available inputs are the same with basic OPE.

.. seealso::

Finally, contribution to SCOPE-RL with a new OPE estimator is more than welcome! Please read `the guidelines for contribution (CONTRIBUTING.md) <https://github.com/hakuhodo-technologies/scope-rl/blob/main/CONTRIBUTING.md>`_.
Finally, contributions to SCOPE-RL with a new OPE estimator are more than welcome! Please read `the guidelines for contribution (CONTRIBUTING.md) <https://github.com/hakuhodo-technologies/scope-rl/blob/main/CONTRIBUTING.md>`_.

.. raw:: html

14 changes: 7 additions & 7 deletions docs/documentation/examples/multiple.rst
@@ -19,7 +19,7 @@ Here, we assume that an RL environment, behavior policies, and evaluation polici

* ``behavior_policy``: an instance of :class:`BaseHead` or a list of instance(s) of :class:`BaseHead`
* ``evaluation_policies``: a list of instance(s) of :class:`BaseHead`
* ``env``: a gym environment (unecessary when using real-world datasets)
* ``env``: a gym environment (unnecessary when using real-world datasets)

Then, we can collect multiple logged datasets with a single behavior policy as follows.

@@ -109,7 +109,7 @@ We first show the case of creating whole logged datasets stored in ``multiple_lo
# initialize class to create inputs
prep = CreateOPEInput(
env=env, # unecessary when using real-world dataset
env=env, # unnecessary when using real-world dataset
)
# create inputs (e.g., calculating )
multiple_input_dict = prep.obtain_whole_inputs(
@@ -120,7 +120,7 @@ We first show the case of creating whole logged datasets stored in ``multiple_lo
)
The above code returns ``multiple_input_dict`` as an instance of :class:`MultipleInputDict`.
Each input dictionary is accessble via the following code.
Each input dictionary is accessible via the following code.

.. code-block:: python
@@ -159,7 +159,7 @@ by specifying the behavior policy and the dataset id as follows.
Off-Policy Evaluation
~~~~~~~~~~
SCOPE-RL enables OPE with multiple logged datasets and multiple input dicts without additional efforts.
SCOPE-RL enables OPE with multiple logged datasets and multiple input dicts without additional effort.
Specifically, we can estimate the policy value via basic OPE as follows.

.. code-block:: python
@@ -531,7 +531,7 @@ Similar codes also work for the following functions.

Validating True and Estimated Policy Performance
~~~~~~~~~~
Finally, we also provide funnctions to compare the true and estimated policy performance.
Finally, we also provide functions to compare the true and estimated policy performance.

.. code-block:: python
@@ -545,7 +545,7 @@ Finally, we also provide funnctions to compare the true and estimated policy per
:img-top: ../../_static/images/multiple_validation_policy_value.png
:text-align: center

When using a single behavior policy, specify behavipr policy name.
When using a single behavior policy, specify the behavior policy name.

.. code-block:: python
@@ -556,7 +556,7 @@ When using a single behavior policy, specify behavipr policy name.
share_axes=True,
)
When using a single logged dataset, specify both behavior policy name and dataset id.
When using a single logged dataset, specify both the behavior policy name and dataset id.

.. code-block:: python