Merge pull request #18 from hakuhodo-technologies/d3rlpy
Upgrade to scope-rl==0.2.1
aiueola authored Jul 30, 2023
2 parents 138db9e + 31dc06b commit 0493d14
Showing 52 changed files with 13,006 additions and 14,227 deletions.
4 changes: 2 additions & 2 deletions .readthedocs.yaml
@@ -9,15 +9,15 @@ version: 2
build:
os: ubuntu-22.04
tools:
python: "3.9" # "3.10", "3.11" fails with d3rlpy==1.1.1
python: "3.11"
# You can also specify other tool versions:
# nodejs: "19"
# rust: "1.64"
# golang: "1.19"
jobs:
post_install:
- pip install Cython numpy
- pip install d3rlpy==1.1.1
- pip install d3rlpy>=2.0.2
- pip install scipy>=1.10.1
- pip install numpy>=1.22.4
- pip install pandas>=1.5.3
8 changes: 5 additions & 3 deletions FrequentlyAskedQuestions.md
@@ -26,9 +26,11 @@ env = NewGymAPIWrapper(env)

Q. xxx environment does not work on d3rlpy, which is used for model training. How should we fix it? (d3rlpy and SCOPE-RL are compatible with different versions of OpenAI Gym.)

A. While SCOPE-RL is compatible with the latest API of OpenAI Gym, d3rlpy is not. Therefore, please use `OldGymAPIWrapper` provided in `scope_rl/utils.py` to enable the use of d3rlpy.
A. Both `scope-rl>=0.2.1` and `d3rlpy>=2.0.2` are compatible with `gym>=0.26.0` and `gymnasium` environments. The source is available in the `main` branch.

If you want to use the older interface of `d3rlpy`, make sure to use `scope-rl==0.1.3` and `d3rlpy==1.1.1`. Then, please use `OldGymAPIWrapper` provided in `scope_rl/utils.py` to enable the use of d3rlpy. The source is available in the `depreciated` branch.
```Python
from scope_rl.utils import OldGymAPIWrapper
env = gym.make("xxx_v0") # compatible with gym>=0.26.2 and SCOPE-RL
env_ = OldGymAPIWrapper(env) # compatible with gym<0.26.2 and d3rlpy
env = gym.make("xxx_v0") # compatible with gym>=0.26.2 and scope-rl==0.1.3
env_ = OldGymAPIWrapper(env) # compatible with gym<0.26.2 and d3rlpy==1.1.1
```
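
With `scope-rl>=0.2.1` and `d3rlpy>=2.0.2`, no wrapper is needed; a minimal sketch of the current setup (the environment id is a placeholder, as above):
```Python
import gym  # or: import gymnasium as gym

env = gym.make("xxx_v0")  # compatible with scope-rl>=0.2.1 and d3rlpy>=2.0.2 as is; no OldGymAPIWrapper needed
```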
47 changes: 25 additions & 22 deletions README.md
@@ -2,6 +2,15 @@

<div align="center"><img src="https://raw.githubusercontent.com/hakuhodo-technologies/scope-rl/main/images/logo.png" width="100%"/></div>

[![pypi](https://img.shields.io/pypi/v/scope-rl.svg)](https://pypi.python.org/pypi/scope-rl)
[![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11-blue)](https://www.python.org)
[![Downloads](https://pepy.tech/badge/scope-rl)](https://pepy.tech/project/scope-rl)
[![GitHub commit activity](https://img.shields.io/github/commit-activity/m/hakuhodo-technologies/scope-rl)](https://github.com/hakuhodo-technologies/scope-rl/graphs/contributors)
[![GitHub last commit](https://img.shields.io/github/last-commit/hakuhodo-technologies/scope-rl)](https://github.com/hakuhodo-technologies/scope-rl/graphs/commit-activity)
[![Documentation Status](https://readthedocs.org/projects/scope-rl/badge/?version=latest)](https://scope-rl.readthedocs.io/en/latest/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![arXiv](https://img.shields.io/badge/arXiv-23xx.xxxxx-b31b1b.svg)](https://arxiv.org/abs/23xx.xxxxx)

<details>
<summary><strong>Table of Contents </strong>(click to expand)</summary>

@@ -152,30 +161,28 @@ Let's start by generating some synthetic logged data useful for performing offli
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead
# import d3rlpy algorithms
from d3rlpy.algos import DoubleDQN
from d3rlpy.online.buffers import ReplayBuffer
from d3rlpy.online.explorers import ConstantEpsilonGreedy
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy
# import rtbgym and gym
import rtbgym
import gym
import torch
# random state
random_state = 12345
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# (0) Setup environment
env = gym.make("RTBEnv-discrete-v0")

# for api compatibility to d3rlpy
from scope_rl.utils import OldGymAPIWrapper
env_ = OldGymAPIWrapper(env)

# (1) Learn a baseline policy in an online environment (using d3rlpy)
# initialize the algorithm
ddqn = DoubleDQN()
ddqn = DoubleDQNConfig().create(device=device)
# train an online policy
# this takes about 5min to compute
ddqn.fit_online(
env_,
buffer=ReplayBuffer(maxlen=10000, env=env_),
env,
buffer=create_fifo_replay_buffer(limit=10000, env=env),
explorer=ConstantEpsilonGreedy(epsilon=0.3),
n_steps=100000,
n_steps_per_epoch=1000,
@@ -194,15 +201,15 @@ behavior_policy = EpsilonGreedyHead(
# initialize the dataset class
dataset = SyntheticDataset(
env=env,
maximum_episode_steps=env.step_per_episode,
max_episode_steps=env.step_per_episode,
)
# the behavior policy collects some logged data
train_logged_dataset = dataset.obtain_trajectories(
train_logged_dataset = dataset.obtain_episodes(
behavior_policies=behavior_policy,
n_trajectories=10000,
random_state=random_state,
)
test_logged_dataset = dataset.obtain_trajectories(
test_logged_dataset = dataset.obtain_episodes(
behavior_policies=behavior_policy,
n_trajectories=10000,
random_state=random_state + 1,
@@ -217,7 +224,7 @@ We are now ready to learn a new policy (evaluation policy) from the logged data

# import d3rlpy algorithms
from d3rlpy.dataset import MDPDataset
from d3rlpy.algos import DiscreteCQL
from d3rlpy.algos import DiscreteCQLConfig

# (3) Learning a new policy from offline logged data (using d3rlpy)
# convert the logged dataset into d3rlpy's dataset format
@@ -226,16 +233,13 @@ offlinerl_dataset = MDPDataset(
actions=train_logged_dataset["action"],
rewards=train_logged_dataset["reward"],
terminals=train_logged_dataset["done"],
episode_terminals=train_logged_dataset["done"],
discrete_action=True,
)
# initialize the algorithm
cql = DiscreteCQL()
cql = DiscreteCQLConfig().create(device=device)
# train an offline policy
cql.fit(
offlinerl_dataset,
n_steps=10000,
scorers={},
)
```
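
The collapsed portion of the quickstart wraps the trained policies into the evaluation policy heads (`cql_`, `ddqn_`, `random_`) used below. A hedged sketch of that wrapping, assuming `EpsilonGreedyHead` accepts the fitted algorithm plus `n_actions`, `epsilon`, `name`, and `random_state` (argument names may differ from the actual API):

```Python
# a sketch: wrap the trained policy as a (near-)greedy evaluation policy
cql_ = EpsilonGreedyHead(
    cql,                           # the offline policy trained above
    n_actions=env.action_space.n,  # assumed argument name
    epsilon=0.0,                   # greedy head for evaluation
    name="cql",
    random_state=random_state,
)
```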

@@ -281,7 +285,6 @@ evaluation_policies = [cql_, ddqn_, random_]
# create input for the OPE class
prep = CreateOPEInput(
env=env,
logged_dataset=test_logged_dataset,
)
input_dict = prep.obtain_whole_inputs(
logged_dataset=test_logged_dataset,
@@ -357,7 +360,7 @@ For more extensive examples, please refer to [quickstart/rtb/rtb_synthetic_discr

### Off-Policy Selection and Evaluation of OPE/OPS

We can also select the best-performing policy among a set of candidate policies based on the OPE results using the OPS class. It is also possible to evaluate the reliability of OPE/OPS using various metrics such as mean squaredberror, rank correlation, regret, and type I and type II error rates.
We can also select the best-performing policy among a set of candidate policies based on the OPE results using the OPS class. It is also possible to evaluate the reliability of OPE/OPS using various metrics such as mean squared error, rank correlation, regret, and type I and type II error rates.

```Python
# perform off-policy selection based on the OPE results
@@ -379,7 +382,7 @@ ranking_dict_ = ops.select_by_policy_value_via_cumulative_distribution_ope(input
ops.visualize_topk_policy_value_selected_by_standard_ope(
input_dict=input_dict,
compared_estimators=["dm", "tis", "pdis", "dr"],
safety_criteria=1.0,
relative_safety_criteria=1.0,
)
```
<div align="center"><img src="https://raw.githubusercontent.com/hakuhodo-technologies/scope-rl/main/images/ops_topk_lower_quartile.png" width="100%"/></div>
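
The reliability metrics mentioned above (mean squared error, rank correlation, regret, and type I/II error rates) can also be returned together with the estimated ranking. A hedged sketch, assuming a `select_by_policy_value` method analogous to the variants shown here, with `return_metrics` and `return_by_dataframe` flags:

```Python
# a sketch: retrieve the OPS ranking together with validation metrics
# (the return_metrics flag is an assumption; the exact signature may differ)
ranking_df, metric_df = ops.select_by_policy_value(
    input_dict,
    return_metrics=True,
    return_by_dataframe=True,
)
```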
@@ -399,7 +402,7 @@ ranking_df, metric_df = ops.select_by_lower_quartile(
return_by_dataframe=True,
)
# visualize the OPS results with the ground-truth metrics
ops.visualize_cvar_for_validation(
ops.visualize_conditional_value_at_risk_for_validation(
input_dict,
alpha=0.3,
share_axes=True,
2 changes: 1 addition & 1 deletion docs/_templates/autosummary/module_head.rst
@@ -1,7 +1,7 @@
{{ fullname | escape | underline}}

.. automodule:: {{ fullname }}
:exclude-members: build_with_dataset,build_with_env,copy_policy_from,copy_q_function_from,fitter,generate_new_data,create_impl,get_action_type,get_params,load_model,save_model,from_json,save_params,save_policy,set_active_logger,set_grad_step,set_params,impl,grad_step,n_frames,action_size,batch_size,gamma,n_steps,reward_scaler,scaler,fit,fit_online,update,collect,action_logger,action_scalar,observation_space,predict,predict_value,fit_batch_online,sample_action
:exclude-members: build_with_dataset,build_with_env,copy_policy_from,copy_policy_optim_from,copy_q_function_from,copy_q_function_optim_from,fitter,update,inner_update,create_impl,inner_create_impl,get_action_type,load_model,save_model,from_json,save,save_policy,set_grad_step,reset_optimizer_states,impl,grad_step,action_size,batch_size,gamma,config,reward_scaler,observation_scaler,action_scaler,fit,fit_online,observation_shape,predict,predict_value,sample_action

{% block functions %}
{% if functions %}
6 changes: 3 additions & 3 deletions docs/documentation/examples/assessments.rst
@@ -8,7 +8,7 @@ Here, we show example codes for assessing OPE/OPS results.
For preparation, please also refer to the following pages:

* :doc:`What are Off-Policy Evaluation and Selection? </documentation/ope_ops>`
* :ref:`Supported Evaluation Protococols for OPE/OPS <implementation_eval_ope_ops>`
* :ref:`Supported Evaluation Protocols for OPE/OPS <implementation_eval_ope_ops>`
* :doc:`/documentation/sharpe_ratio`
* :doc:`Supported Implementations for data collection and Offline RL </documentation/learning_implementation>`
* :doc:`Example codes for basic OPE </documentation/examples/basic_ope>`
@@ -22,14 +22,14 @@ Here, we assume that an RL environment, a behavior policy, and evaluation polici

* ``behavior_policy``: an instance of :class:`BaseHead`
* ``evaluation_policies``: a list of instance(s) of :class:`BaseHead`
* ``env``: a gym environment (unecessary when using real-world datasets)
* ``env``: a gym environment (unnecessary when using real-world datasets)

Additionally, we assume that the logged datasets, inputs, and either ope or cd_ope instances are ready to use.
For initializing the ope and cd_ope instances, please refer to :doc:`this page </documentation/examples/basic_ope>`
and :doc:`this page </documentation/examples/cumulative_dist_ope>` as references, respectively.

* ``logged_dataset``: a dictionary containing the logged dataset
* ``input_dict``: a dictionaty containing inputs for OPE
* ``input_dict``: a dictionary containing inputs for OPE
* ``ope``: an instance of :class:`OffPolicyEvaluation`
* ``cd_ope``: an instance of :class:`CumulativeDistributionOPE`

23 changes: 11 additions & 12 deletions docs/documentation/examples/basic_ope.rst
@@ -17,7 +17,7 @@ Here, we assume that an RL environment, a behavior policy, and evaluation polici

* ``behavior_policy``: an instance of :class:`BaseHead`
* ``evaluation_policies``: a list of instance(s) of :class:`BaseHead`
* ``env``: a gym environment (unecessary when using real-world datasets)
* ``env``: a gym environment (unnecessary when using real-world datasets)

Then, we use the behavior policy to collect logged dataset as follows.

@@ -30,7 +30,7 @@ Then, we use the behavior policy to collect logged dataset as follows.
env=env,
max_episode_steps=env.step_per_episode,
)
# obtain logged dataset
# obtain a logged dataset
logged_dataset = dataset.obtain_episodes(
behavior_policies=behavior_policy,
n_trajectories=10000,
@@ -48,7 +48,7 @@ The next step is to create the inputs for OPE estimators. This procedure slightl
OPE with importance sampling-based estimators
----------
When using the importance sampling-based estimators including TIS, PDIS, SNTIS, and SNPDIS,
and hybrid estimators including DR and SNDR, make sure that "pscore" is recorded in the logged dataset.
and hybrid estimators including DR and SNDR, make sure that "pscore" (i.e., action choice probability of the behavior policy) is recorded in the logged dataset.
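
Before running these estimators, the recorded propensities can be checked with a quick sanity test (a minimal sketch; the key name follows the description above):

.. code-block:: python

    # the behavior policy's action choice probabilities must be stored
    # under the "pscore" key of the logged dataset
    assert "pscore" in logged_dataset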

Then, when using only importance sampling-based estimators, the minimal sufficient codes are the following:

@@ -58,7 +58,7 @@ Then, when using only importance sampling-based estimators, the minimal sufficie
# initialize class to create inputs
prep = CreateOPEInput(
env=env, # unecessary when using real-world dataset
env=env, # unnecessary when using real-world dataset
)
# create inputs
input_dict = prep.obtain_whole_inputs(
@@ -82,7 +82,6 @@ When using the model-based estimator (DM) or hybrid methods, we need to addition
"encoder_factory": VectorEncoderFactory(hidden_units=[30, 30]),
"q_func_factory": MeanQFunctionFactory(),
"learning_rate": 1e-4,
"use_gpu": torch.cuda.is_available(),
},
},
)
@@ -148,11 +147,11 @@ We can also apply scaling to either state observation or (continuous) action as

.. code-block:: python
from scope_rl.utils import MinMaxScaler
from d3rlpy.preprocessing import MinMaxObservationScaler, MinMaxActionScaler
prep = CreateOPEInput(
env=env,
state_scaler=MinMaxScaler( #
state_scaler=MinMaxObservationScaler( #
minimum=logged_dataset["state"].min(axis=0),
maximum=logged_dataset["state"].max(axis=0),
),
@@ -220,7 +219,7 @@ Note that, the following provides the complete list of estimators that are curre
* :doc:`Supported OPE estimators </documentation/evaluation_implementation>` summarizes the key properties of each estimator.


We can easily conduct OPE and obtain and the results as follows.
We can easily conduct OPE and obtain the results as follows.

.. code-block:: python
@@ -290,7 +289,7 @@ Users can also specify the compared OPE estimators as follows.
random_state=random_state,
)
When ``legend`` is unecessary, just disable this option.
When ``legend`` is unnecessary, just disable this option.

.. code-block:: python
@@ -300,7 +299,7 @@ When ``legend`` is unecessary, just disable this option.
random_state=random_state,
)
To save figure, specify the directory to save it.
To save the figure, specify the directory to save it.

.. code-block:: python
@@ -313,7 +312,7 @@ To save figure, specify the directory to save it.
Choosing the "Spectrum" of OPE for marginal estimators
----------
The implemented OPE estimators can interpolates among naive importance sampling and
The implemented OPE estimators can interpolate among naive importance sampling and
marginal importance sampling by specifying the steps to use per-decision importance weight
(See :ref:`Supported OPE estimators (SOPE) <implementation_sope>` for the details).
This is done by specifying ``n_step_pdis`` when initializing the class.
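
A hedged sketch of such an initialization, assuming the :class:`OffPolicyEvaluation` class and import path from the quickstart (the estimator list is a placeholder and the exact argument set may differ):

.. code-block:: python

    from scope_rl.ope import OffPolicyEvaluation as OPE

    # n_step_pdis controls how many steps use the per-decision importance
    # weight; the other steps rely on marginal importance weights
    ope = OPE(
        logged_dataset=logged_dataset,
        ope_estimators=ope_estimators,  # e.g., state(-action) marginal estimators
        n_step_pdis=5,
    )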
@@ -329,7 +328,7 @@ This is done by specifying ``n_step_pdis`` when initializing the class.
Choosing a kernel for continuous-action OPE
----------
In continuous-action OPE, the choices of kernel and the bandwith hyperparameter can affect the bias-variance tradeoff and the estimation accuracy.
In continuous-action OPE, the choices of the kernel and the bandwidth hyperparameter can affect the bias-variance tradeoff and the estimation accuracy.
To control the hyperparameter, please use the following arguments.

.. code-block:: python
4 changes: 2 additions & 2 deletions docs/documentation/examples/cumulative_dist_ope.rst
@@ -19,7 +19,7 @@ Here, we assume that an RL environment, a behavior policy, and evaluation polici

* ``behavior_policy``: an instance of :class:`BaseHead`
* ``evaluation_policies``: a list of instance(s) of :class:`BaseHead`
* ``env``: a gym environment (unecessary when using real-world datasets)
* ``env``: a gym environment (unnecessary when using real-world datasets)

Then, we use the behavior policy to collect logged dataset as follows.

@@ -60,7 +60,7 @@ Then, when using only importance sampling-based estimators, the minimal sufficie
# initialize class to create inputs
prep = CreateOPEInput(
env=env, # unecessary when using real-world dataset
env=env, # unnecessary when using real-world dataset
env=env, # unnecessary when using real-world dataset
)
# create inputs (e.g., calculating )
input_dict = prep.obtain_whole_inputs(
2 changes: 1 addition & 1 deletion docs/documentation/examples/custom_estimators.rst
@@ -327,7 +327,7 @@ Note that, the available inputs are the same with basic OPE.

.. seealso::

Finally, contribution to SCOPE-RL with a new OPE estimator is more than welcome! Please read `the guidelines for contribution (CONTRIBUTING.md) <https://github.com/hakuhodo-technologies/scope-rl/blob/main/CONTRIBUTING.md>`_.
Finally, contributions to SCOPE-RL with a new OPE estimator are more than welcome! Please read `the guidelines for contribution (CONTRIBUTING.md) <https://github.com/hakuhodo-technologies/scope-rl/blob/main/CONTRIBUTING.md>`_.

.. raw:: html

14 changes: 7 additions & 7 deletions docs/documentation/examples/multiple.rst
@@ -19,7 +19,7 @@ Here, we assume that an RL environment, behavior policies, and evaluation polici

* ``behavior_policy``: an instance of :class:`BaseHead` or a list of instance(s) of :class:`BaseHead`
* ``evaluation_policies``: a list of instance(s) of :class:`BaseHead`
* ``env``: a gym environment (unecessary when using real-world datasets)
* ``env``: a gym environment (unnecessary when using real-world datasets)

Then, we can collect multiple logged datasets with a single behavior policy as follows.

@@ -109,7 +109,7 @@ We first show the case of creating whole logged datasets stored in ``multiple_lo
# initialize class to create inputs
prep = CreateOPEInput(
env=env, # unecessary when using real-world dataset
env=env, # unnecessary when using real-world dataset
)
# create inputs (e.g., calculating )
multiple_input_dict = prep.obtain_whole_inputs(
@@ -120,7 +120,7 @@ We first show the case of creating whole logged datasets stored in ``multiple_lo
)
The above code returns ``multiple_input_dict`` as an instance of :class:`MultipleInputDict`.
Each input dictionary is accessble via the following code.
Each input dictionary is accessible via the following code.

.. code-block:: python
@@ -159,7 +159,7 @@ by specifying the behavior policy and the dataset id as follows.
Off-Policy Evaluation
~~~~~~~~~~
SCOPE-RL enables OPE with multiple logged datasets and multiple input dicts without additional efforts.
SCOPE-RL enables OPE with multiple logged datasets and multiple input dicts without additional effort.
Specifically, we can estimate the policy value via basic OPE as follows.

.. code-block:: python
@@ -531,7 +531,7 @@ Similar codes also work for the following functions.

Validating True and Estimated Policy Performance
~~~~~~~~~~
Finally, we also provide funnctions to compare the true and estimated policy performance.
Finally, we also provide functions to compare the true and estimated policy performance.

.. code-block:: python
@@ -545,7 +545,7 @@ Finally, we also provide funnctions to compare the true and estimated policy per
:img-top: ../../_static/images/multiple_validation_policy_value.png
:text-align: center

When using a single behavior policy, specify behavipr policy name.
When using a single behavior policy, specify the behavior policy name.

.. code-block:: python
@@ -556,7 +556,7 @@ When using a single behavior policy, specify behavipr policy name.
share_axes=True,
)
When using a single logged dataset, specify both behavior policy name and dataset id.
When using a single logged dataset, specify both the behavior policy name and dataset id.

.. code-block:: python