[Project Website] [Paper] [OpenReview]
Grace Zhang*1, Ayush Jain*1, Injune Hwang2, Shao-Hua Sun3, Joseph J. Lim2
1University of Southern California 2KAIST 3National Taiwan University
This is the official PyTorch implementation of the ICLR 2025 paper "QMP: Q-switch Mixture of Policies for Multi-Task Behavior Sharing".
Abstract: QMP is a multi-task reinforcement learning approach that shares behaviors between tasks using a mixture of policies for off-policy data collection. We show that using the Q-function as a switch for this mixture is guaranteed to improve sample efficiency. The Q-switch selects the policy in the mixture whose action maximizes the current task's Q-value at the current state. This works because other policies may have already learned overlapping behaviors that the current task's policy has not yet learned. QMP's behavior sharing provides gains that are complementary to common approaches such as parameter sharing and data sharing.
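As a shorthand for the selection rule described above (the notation here is ours, not necessarily the paper's exact formulation): for task $i$ in state $s$, each task policy $\pi_j$ proposes a candidate action, and the Q-switch executes the one that the current task's Q-function $Q_i$ scores highest.

```math
a = a_{j^\ast}, \qquad j^\ast = \arg\max_{j}\; Q_i(s, a_j), \qquad a_j \sim \pi_j(\cdot \mid s)
```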
- `run.py`: takes arguments and initializes experiments
- `garage_experiments.py`: defines experiments and starts training
- `learning/`: contains all learning code, baseline implementations, and our method
- `environments/`: registers environments
- Ubuntu 18.04 or above
- Python 3.8
- MuJoCo 2.1 (https://github.com/deepmind/mujoco/releases)
To install Python dependencies:

```
pip install -r requirements.txt
```
Our implementation of QMP is built on top of the garage RL codebase (https://github.com/rlworkgroup/garage). If you would like to re-implement QMP in your own codebase, it is fairly simple: we only replace the data collection policy for each task with the Q-switch mixture policy.

- We first initialize all the task policies and Q-function networks in the `setup` function in `experiment_utils.py`, and then initialize the mixture policy with all the task policies and Q-functions.
- We define the mixture policy in `learning/policies/qswitch_mixture_policies_wrapper.py`. Critically, its `get_action` function, given an input observation and task, samples candidate actions from all policies, evaluates them with the task's Q-function, and outputs the best action (see the sketch below).
- We pass the mixture policy to the sampler to gather data, and the individual policies and Q-functions to the RL algorithm to train.
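For reference, here is a minimal sketch of that Q-switch selection logic. It is not the repository's actual implementation (see `learning/policies/qswitch_mixture_policies_wrapper.py` for that, which follows garage's API); the class and function signatures below are hypothetical, and the sketch assumes each policy maps a batched observation to a `torch.distributions` object and each Q-function maps `(observation, action)` to a value.

```python
import torch


class QSwitchMixtureSketch:
    """Hypothetical sketch: act with the candidate action, proposed by any
    task policy, that the current task's Q-function scores highest."""

    def __init__(self, policies, q_functions):
        self.policies = policies        # one policy per task
        self.q_functions = q_functions  # one Q-function per task

    def get_action(self, observation, task_id):
        obs = torch.as_tensor(observation, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            # 1. Every task policy proposes a candidate action for this state.
            candidates = [policy(obs).sample() for policy in self.policies]
            # 2. The current task's Q-function scores each candidate.
            q_i = self.q_functions[task_id]
            scores = torch.stack([q_i(obs, a).squeeze() for a in candidates])
            # 3. The Q-switch executes the best candidate for data collection.
            best = torch.argmax(scores).item()
        return candidates[best].squeeze(0).numpy()
```

The data gathered this way is then used to train each task's own policy and Q-function with the underlying off-policy RL algorithm, as described above.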
To run our method in combination with other MTRL methods, follow the example commands below. Method X and Method X + QMP are always run with the same hyperparameters.
For data sharing, we tune `unsupervised_quantile` per task, and for parameter sharing, we increase the network size and tune the learning rates, as reported in our paper.
To run a different environment, simply replace the environment name with one of `--env=JacoReachMT5-v1`, `--env=MazeLarge-10-v0`, `--env=Walker2dMT4-v0`, `--env=MetaWorldCDS-v1`, `--env=MetaWorldMT10-v2`, `--env=KitchenMTEasy-v0`, or `--env=MetaWorldMT50-v2`, and update the data and parameter sharing hyperparameters (`unsupervised_quantile`, `lr`, `hidden_sizes`) according to the paper; an example with a different environment follows the command list below.
- Separated + QMP (Our Method)

  ```
  python run.py qmp_dnc --env=JacoReachMT5-v1
  ```

- Separated

  ```
  python run.py dnc_sac --env=JacoReachMT5-v1
  ```

- Parameters + QMP (Our Method)

  ```
  python run.py qmp_sac --env=JacoReachMT5-v1 --policy_architecture multihead --Q_architecture multihead --lr 0.001 --hidden_sizes 512 512
  ```

- Parameters

  ```
  python run.py mtsac --env=JacoReachMT5-v1 --policy_architecture multihead --Q_architecture multihead --lr 0.001 --hidden_sizes 512 512
  ```

- Data + QMP (Our Method)

  ```
  python run.py qmp_uds_dnc --env=JacoReachMT5-v1 --sharing_quantile 0
  ```

- Data

  ```
  python run.py uds_dnc --env=JacoReachMT5-v1 --sharing_quantile 0
  ```
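For example, to run Separated + QMP on MetaWorld MT10 instead of JacoReachMT5 (this only illustrates swapping the environment flag; the per-environment hyperparameter values are reported in the paper):

```
python run.py qmp_dnc --env=MetaWorldMT10-v2
```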
Please consider citing our work if you find it useful, and reach out to us with any questions!
```
@inproceedings{
  zhang2025qmp,
  title={{QMP}: Q-switch Mixture of Policies for Multi-Task Behavior Sharing},
  author={Grace Zhang and Ayush Jain and Injune Hwang and Shao-Hua Sun and Joseph J Lim},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=aUZEeb2yvK}
}
```