
Error in Example Run #5

Open
lky-violet opened this issue Feb 15, 2025 · 11 comments

Comments

@lky-violet commented Feb 15, 2025

Thank you very much for your amazing open-source work. When I tried to reproduce the code you provided on a node with four A100 GPUs, I encountered the following error: TypeError: JobSupervisor.__init__() takes 5 positional arguments but 7 were given. Additionally, the log shows:

Address already in use
Port 5000 is in use by another program. Either identify and stop that program, or start the server with a different port.

However, even after I changed the port number in the remote_rm_url HTTP address, I still encounter the same error and the same log. I would like to ask whether there is an issue with how I've set my parameters. Below is the content of my .sh file:

export DATASET="/data/lmm-r1/examples/data/mathlv345_8k_chatml.json"

MODEL_CPK_NAME="qwenvl25_3B_ins_rloo_math"
PRETRAIN_MODEL="/data/Qwen2.5-VL-3B-Instruct"
SAVE_PATH="./ckpts"
mkdir -p "${SAVE_PATH}/${MODEL_CPK_NAME}"
export CUDA_VISIBLE_DEVICES="4,5,6,7"
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
python -m openrlhf.models.remote_rm.math_verifier --dataset $DATASET --input_key prompt --prompt-template chatml > "${SAVE_PATH}/${MODEL_CPK_NAME}/remote_rm.log" 2>&1 &
childpid=$!

ray start --head --node-ip-address 0.0.0.0 --num-gpus 4 --temp-dir ~/.cache/ray

ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json='{"working_dir": "/data/lmm-r1/OpenRLHF"}' \
   -- python3 -m openrlhf.cli.train_ppo_ray \
   --ref_num_nodes 1 \
   --ref_num_gpus_per_node 4 \
   --remote_rm_url http://127.0.0.1:5233/get_reward \
   --actor_num_nodes 1 \
   --actor_num_gpus_per_node 4 \
   --vllm_num_engines 4 \
   --vllm_tensor_parallel_size 1 \
   --colocate_all_models \
   --vllm_enable_sleep \
   --vllm_gpu_memory_utilization 0.7 \
   --vllm_sync_backend gloo \
   --enable_prefix_caching \
   --pretrain $PRETRAIN_MODEL \
   --save_path $SAVE_PATH/$MODEL_CPK_NAME \
   --micro_train_batch_size 2 \
   --train_batch_size 128 \
   --micro_rollout_batch_size 4 \
   --rollout_batch_size 256 \
   --temperature 1 \
   --n_samples_per_prompt 16 \
   --max_epochs 1 \
   --num_episodes 30 \
   --prompt_max_len 1024 \
   --max_samples 100000 \
   --generate_max_len 3000 \
   --advantage_estimator rloo \
   --zero_stage 3 \
   --bf16 \
   --actor_learning_rate 1e-6 \
   --init_kl_coef 0.0 \
   --prompt_data $DATASET \
   --input_key prompt \
   --normalize_reward \
   --flash_attn \
   --gradient_checkpointing \
   --save_steps 10 \
   --ckpt_path $SAVE_PATH/$MODEL_CPK_NAME/ckpt \
   --save_hf_ckpt \
   --use_tensorboard $SAVE_PATH/$MODEL_CPK_NAME/logs
ray stop

@TideDra (Owner) commented Feb 15, 2025

Which line of code triggers TypeError: JobSupervisor.__init__() takes 5 positional arguments but 7 were given? Do you mean that even after you change the port to 5233, the log still shows 5000 is in use? It seems to be a problem with Flask, not the training code. You can check the remote_rm server standalone; it's a simple script.
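A quick standalone check (a sketch, assuming the verifier is started as in the posted script and switched to port 5233; the curl only verifies the server is up, since the exact request schema of /get_reward is defined inside math_verifier.py):

```bash
# Start only the reward-model server in the foreground so its errors print directly.
python -m openrlhf.models.remote_rm.math_verifier \
    --dataset "$DATASET" --input_key prompt --prompt-template chatml

# In another shell: find out which process already holds port 5000.
ss -ltnp 'sport = :5000'        # or: lsof -i :5000

# Confirm the verifier is actually listening on the port you configured (5233 here).
curl -v http://127.0.0.1:5233/get_reward
```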

@lky-violet (Author) commented

Thank you for your prompt reply. The error log is as follows. I would like to ask whether I need to modify the port number of the Flask app math_verifier in the code; I couldn't find any related setting in the .sh file.

[screenshot: error log]

@TideDra (Owner) commented Feb 16, 2025

The port number is currently hard-coded in openrlhf/models/remote_rm/math_verifier.py. You can try changing it and see.
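For anyone else hitting this, one way to locate the hard-coded value before editing it (a sketch; the exact line in math_verifier.py may look different, so check the grep output rather than editing blindly):

```bash
# Find where the Flask server binds its port; the line usually looks like
# app.run(host=..., port=5000) or a port constant near the top of the file.
grep -n "port" openrlhf/models/remote_rm/math_verifier.py

# After changing it to a free port (e.g. 5233), keep the trainer flag in sync:
#   --remote_rm_url http://127.0.0.1:5233/get_reward
```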

@lky-violet (Author) commented

Thank you for your reply. I have changed the port number and solved the port-in-use problem, but I still cannot run the training. The log and the error are shown below. I checked https://github.com/ray-project/ray/issues/24920 to solve the problem, but it doesn't work. I would like to ask, what might be the issue here?

[screenshots: training log and error message]

@TideDra (Owner) commented Feb 17, 2025

Did you put any large files under the working directory? Or is the temp dir ~/.cache/ray out of space?
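Two quick checks (a sketch using standard tools, with the paths taken from the script above):

```bash
# Free space on the partition that holds Ray's temp dir (logs, object spill files).
df -h ~/.cache/ray

# Size of the working_dir that ray job submit uploads with the job.
du -sh /data/lmm-r1/OpenRLHF
```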

@lky-violet (Author) commented

I have checked the available space in the current working directory and in the cache. There is 162T of space left in the current directory, but the cache partition has already reached 99% usage, as shown in the figure below. Could you please advise me on what I should do? I would be extremely grateful!

[screenshot: disk usage output showing the cache partition at 99%]

@TideDra (Owner) commented Feb 17, 2025

The path ~/.cache/ray is set in the script (the --temp-dir argument to ray start). Just change it to another path with free space.
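For example (a minimal sketch; /data/ray_tmp is just an illustrative path on a partition with space left):

```bash
# Put Ray's temp files on a partition with plenty of free space.
mkdir -p /data/ray_tmp
ray start --head --node-ip-address 0.0.0.0 --num-gpus 4 --temp-dir /data/ray_tmp
```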

@lky-violet (Author) commented

Thank you for your reply. I have changed the temp dir to a local path with available space, but I still get the error message shown below. I would like to ask whether the error is caused by the inability to connect to Redis, as shown in the log. I apologize for not being very familiar with distributed systems, and I appreciate your help.

[screenshot: error log]

@TideDra (Owner) commented Feb 18, 2025

It's weird to see the timestamp session_2025-01-16; maybe your system environment is not clean. I suggest running in a clean nvidia/cuda Docker container.
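Something along these lines (a sketch; the image tag, GPU list, and mount paths are illustrative and should be adapted to your driver and data layout):

```bash
# Start a clean CUDA container with the four GPUs, generous shared memory for
# Ray/vLLM, and the code/data mounted in; install the project inside it afterwards.
docker run --gpus '"device=4,5,6,7"' --shm-size=64g -it \
    -v /data:/data nvidia/cuda:12.1.1-devel-ubuntu22.04 bash
```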

@TideDra (Owner) commented Feb 20, 2025

I have encountered the same error. My solution is to remove the cache files under ~/.cache/ray.
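Concretely, something like the following (a sketch; only do this if no other run on the machine is using that Ray temp dir):

```bash
# Stop any leftover Ray processes from previous runs, then clear the stale session files.
ray stop --force
rm -rf ~/.cache/ray
```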

@lky-violet (Author) commented Feb 20, 2025

> I have encountered the same error. My solution is to remove the cache files under ~/.cache/ray.

Thank you for your reply. I tried rm -rf on ~/.cache/ray and changed the temp dir from the local dir back to ~/.cache/ray. However, I still get the error mentioned above. I searched for this error and found https://discuss.ray.io/t/assertionerror-session-name-does-not-match-persisted-value/13067, but sadly I have no permission to kill other people's processes.
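If the stale session belongs to another user's Ray instance on the shared node, one possible workaround (a sketch; the temp dir and port are illustrative) is to keep your own instance fully separate instead of killing theirs:

```bash
# Use a per-user temp dir and a non-default GCS port so this head node never
# touches the other users' Ray session files or ports.
ray start --head --node-ip-address 0.0.0.0 --num-gpus 4 \
    --temp-dir "/data/${USER}/ray_tmp" --port 6380
```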
