Note: This repository and its contents support the coursework of the INM363 module at City, University of London.
The code presented here is an updated version of the work found in the RL-BOED repository and the corresponding paper, Optimizing Sequential Experimental Design with Deep Reinforcement Learning. This version works with the latest versions of PyTorch, NumPy, and Gymnasium (as opposed to the older Gym). Since Akro and Garage do not support the latest versions of these libraries (at least as of June 2024), they are included in this repository as separate, edited folders. Note that we use the 2021.03 release of Garage, as this is the release used in RL-BOED.
- Python 3.9+ - we use Python 3.9.5
- PyTorch (with CUDA for GPU usage) - we use PyTorch 2.3.0
- All other requirements listed in requirements.txt - specific versions are listed
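For example, within a fresh Python 3.9 environment, the pinned dependencies can be installed with:

pip install -r requirements.txt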
Optimal experimental design is the field concerned with running experiments so as to make the best use of a limited allocation of resources. Much work on optimal experimental design falls under the Bayesian setting, where we seek to reduce the uncertainty about our parameters of interest through experimentation. Conducting these experiments sequentially has recently motivated the use of reinforcement learning, where an agent is trained to navigate the design space and select the most informative designs for experimentation. However, there is still a limited understanding of the benefits and drawbacks of using particular reinforcement learning algorithms to train these agents. In our work, we explore several reinforcement learning algorithms based on the state-of-the-art soft actor-critic method, and apply them to three Bayesian experimental design problems. We examine the amount of time needed to train agents with each algorithm, and assess the generalisability of each agent to different but related experimental design setups. We draw insights on which algorithm generally performs best, and under what circumstances one may wish to use a particular algorithm.
Our trained agents, alongside their results at evaluation time, can be found here. Each folder beginning with 'boed_results' represents a set of 10 agents, each trained on a unique random seed. The agents are trained with a specific algorithm, under a certain set of hyperparameters, on a particular Bayesian experimental design problem. Folders containing 'ces' hold agents trained on the Constant Elasticity of Substitution experiment, 'location' corresponds to the Location Finding experiment, and 'docking' to the Biomolecular Docking experiment.
We explore Randomised Ensembled Double Q-Learning (REDQ), Dropout Q-Functions for Doubly Efficient Reinforcement Learning (DroQ), Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier (SBR), and A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning (SUNRISE). These all extend the Soft Actor-Critic (SAC) algorithm. Proximal Policy Optimisation (PPO) and Trust Region Policy Optimisation (TRPO) are two other algorithms that can be used.
There are $K$ objects (signal sources) hidden in $d$-dimensional space, and we must choose where to measure the total signal intensity in order to infer the sources' locations.

The total intensity at point $\xi$ is the superposition of the intensities emitted by the individual sources.

For object $k$ located at $\theta_k$, the signal strength decays with the inverse square of the distance from $\theta_k$, giving a total intensity of $\mu(\theta, \xi) = b + \sum_{k=1}^{K} \frac{\alpha_k}{m + \lVert \theta_k - \xi \rVert^2}$, where $b$ is a constant background signal, $m$ governs the maximum signal, and $\alpha_k$ is the strength of source $k$. The experiment parameters are given in the table below.
Parameter | Value |
---|---|
Number of sources $K$ | 2 |
Dimension $d$ | 2 |
Source strength $\alpha_k$ | 1 |
Background signal $b$ | 0.1 |
Maximum-signal constant $m$ | 0.0001 |
Observation noise standard deviation $\sigma$ | 0.5 |
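As a minimal sketch of this signal model (assuming the inverse-square form given above and the parameter values from the table; the function below is illustrative and not taken from the repository code):

```python
import numpy as np

def total_intensity(theta, xi, b=0.1, m=1e-4, alpha=1.0):
    """Total signal intensity at measurement point xi.

    theta: (K, d) array of hidden source locations.
    xi:    (d,)  measurement point (the design).
    b, m, alpha: background signal, maximum-signal constant, source strength.
    """
    sq_dists = np.sum((theta - xi) ** 2, axis=-1)  # squared distance to each source
    return b + np.sum(alpha / (m + sq_dists))      # superposition of inverse-square signals

# Example: K = 2 sources hidden in d = 2 dimensions
theta = np.array([[0.5, -0.2], [-1.0, 1.3]])
print(total_intensity(theta, np.array([0.0, 0.0])))
```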
We have two baskets of goods, $\mathbf{x}$ and $\mathbf{x}'$, each containing quantities of three items, and a human participant indicates how strongly they prefer one basket over the other.

The CES model \citep{arrowchen} defines the utility of a basket $\mathbf{x}$ as $U(\mathbf{x}) = \left(\sum_i x_i^{\rho}\alpha_i\right)^{1/\rho}$, with latent parameters $\rho$ and $\boldsymbol{\alpha}$.

We use the following priors for the latent parameters: a Beta prior on $\rho$, a Dirichlet prior on $\boldsymbol{\alpha}$, and a log-normal prior on the scaling parameter $u$.

The likelihood function is the preference of the human on a sliding 0-1 scale, which is based on the difference in utilities $U(\mathbf{x}) - U(\mathbf{x}')$ between the two baskets.
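As a small sketch of the CES utility (assuming the standard functional form $U(\mathbf{x}) = (\sum_i x_i^{\rho}\alpha_i)^{1/\rho}$ given above; the function and example values below are illustrative, not taken from the repository):

```python
import numpy as np

def ces_utility(x, rho, alpha):
    """CES utility of a basket x under latent parameters rho and alpha."""
    return np.sum(alpha * x ** rho) ** (1.0 / rho)

# Compare two baskets of three goods under one draw of the latent parameters
rho, alpha = 0.5, np.array([0.4, 0.4, 0.2])
x, x_prime = np.array([10.0, 0.0, 50.0]), np.array([20.0, 20.0, 20.0])
print(ces_utility(x, rho, alpha) - ces_utility(x_prime, rho, alpha))
```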
Molecular docking \citep{Meng_XuanYu2011} explores how two or more molecular structures interact with each other. When a compound binds to a receptor, this is known as a 'hit'. In this experiment, we must select the most informative compounds in order to determine the ranges of predicted binding affinity (docking score) from which molecules should be picked for testing in an experiment.
The probability of outcome $y$ (hit or no hit) for a candidate compound is modelled as a function of its predicted docking score.

We use the same priors for the parameters of this model as in the original RL-BOED setup.

The likelihood function is Bernoulli distributed and provides a binary outcome as to whether or not the docking score leads to a hit: 1 means there is a hit, and 0 means there is no hit. For a given design $\xi$ (a docking score), the outcome is therefore drawn from a Bernoulli distribution with the corresponding hit probability.
See the arguments for each script at the end of its code. For example, process_results.py can be run from the command line as follows (with the relevant directories substituted):
python3 process_results.py --fpaths="Documents\Training Results\boed_results_sbr_430000\source\progress.csv, Documents\Training Results\boed_results_sbr_430000\source_1\progress.csv, Documents\Training Results\boed_results_sbr_430000\source_2\progress.csv, Documents\Training Results\boed_results_sbr_430000\source_3\progress.csv, Documents\Training Results\boed_results_sbr_430000\source_4\progress.csv" --dest="Documents\Training Results\sbr430000_results.npz"
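The output is a standard NumPy .npz archive, so it can be inspected directly; a minimal sketch (we only list the stored array names rather than assume them):

```python
import numpy as np

# Path from the command above
data = np.load(r"Documents\Training Results\sbr430000_results.npz")
print(data.files)                    # names of the arrays stored by process_results.py
for name in data.files:
    print(name, data[name].shape)
```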
- `Adaptive_{env}_{algo}.py`: File to initiate the training loop for the respective environment/experimental design problem {env} and algorithm {algo}; {env - Source}: Location Finding, {env - CES}: Constant Elasticity of Substitution, {env - Docking}: Biomolecular Docking.
- `process_results.py`: Produces training datasets with the training performance from several random seeds, for a particular environment and algorithm.
- `plot_results.py`: Plots training performance results using the data files produced by `process_results.py`.
- `select_policy_env.py`: Evaluates/tests a trained policy on a particular experimental design problem (the exact one it was trained on, or one with slightly different experimental parameters).
If you have access to a high-performance computer through SLURM, we recommend making use of the Bash scripts in useful-bash-scripts.
The experiments are quite expensive to run on a standard PC, so we advise using a high-performance computer. We use SLURM to run our experiments on NVIDIA A100 80GB PCIe and NVIDIA A100 40GB PCIe GPUs, with 4 cores of an Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz and 40GB of RAM.
We note that DroQ can be run through the REDQ scripts by setting 'ens-size' and 'M' equal to each other (say 2), 'layer-norm = True', and 'dropout' to the desired dropout probability (which must be greater than 0 for DroQ).
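For example, a DroQ-style agent can be trained through the REDQ script as follows (an illustrative sketch: the dropout probability of 0.01 is arbitrary, and the remaining arguments should be set as in the full training command shown further below):

python Adaptive_Source_REDQ.py --M=2 --ens-size=2 --layer-norm=True --dropout=0.01 [remaining arguments as in the training command below]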
We can choose a reward function based on sequential prior contrastive estimation (sPCE), which is a lower bound on the expected information gain. This is the standard choice because it is bounded above by \log (n-contr-samples + 1), unlike the sequential nested Monte Carlo (sNMC) estimator, which is an upper bound on the expected information gain. sPCE can be selected by setting 'bound-type = lower', and sNMC by setting 'bound-type = upper'. Rewards are dense: the incremental sPCE/sNMC is provided to the agent at each experiment during training.
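For example, with n-contr-samples = 100,000 (as in the training command below), the sPCE reward is capped at \log(100000 + 1) ≈ 11.51 nats (assuming the natural logarithm); a quick check:

```python
import math
print(math.log(100000 + 1))  # ≈ 11.5129, the sPCE limit for 100,000 contrastive samples
```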
After deciding on an environment and algorithm to use, either run the relevant Python file in your IDE, or use Bash/the command line to run the Python file with your chosen (environment- and algorithm-specific) arguments (more arguments can be parsed; see the Python files):
python Adaptive_Source_REDQ.py --n-parallel=100 --n-contr-samples=100000 --n-rl-itr=20001 --log-dir="run_outputs/boed_results_discount_0.99/source" --bound-type=lower --id=1 --budget=30 --discount=0.99 --buffer-capacity=10000000 --tau=0.001 --pi-lr=0.0001 --qf-lr=0.0003 --M=2 --ens-size=2 --lstm-q-function=False --layer-norm=False --dropout=0
Your choice of random seeds to experiment with can be entered near the beginning of the Python file, by replacing the 'seeds' variable with your chosen seeds. The code determines which seed to use through the 'id' value parsed from Bash: if 'id = 1', the first seed in the list is used for training, and so on. The Bash scripts we use loop over each of the 10 seeds listed in the Python file and submit 10 jobs to SLURM, each training an agent with one seed from the list.
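A minimal sketch of this mechanism (the variable and argument names mirror the description above, but the exact code in the training files may differ, and the seed values here are placeholders):

```python
import argparse

# Near the top of the training file: the pool of random seeds to experiment with.
seeds = [101, 202, 303, 404, 505, 606, 707, 808, 909, 1010]  # replace with your own seeds

parser = argparse.ArgumentParser()
parser.add_argument("--id", default=1, type=int)
args = parser.parse_args()

# id=1 selects the first seed in the list, id=2 the second, and so on.
seed = seeds[args.id - 1]
print(f"Training with seed {seed}")
```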
The saved agent and its training results will be available in the directory named in 'log-dir'.
Once an agent has been trained, it can be evaluated on the same environment it was trained on. The environment-specific parameters, such as the number of sources $K$ in the Location Finding experiment, can also be changed to test the agent's generalisability to a different but related design setup, as in the following example:
python3 select_policy_env.py --src="boed_results_sbr_430000/source_9/itr_20000.pkl" --dest="boed_results_sbr_430000/source_9/evaluation_lower.log" --edit_type="w" --seq_length=30 --bound_type=lower --n_contrastive_samples=1000000 --n_parallel=250 --n_samples=2000 --seed=1 --env="source" --source_d=2 --source_k=5 --source_b=0.1 --source_m=0.0001 --source_obs_sd=0.5 --ces_d=6 --ces_obs_sd=0.005 --docking_d=1