
Error in Example Run #5

Open
lky-violet opened this issue Feb 15, 2025 · 11 comments

Comments

@lky-violet commented Feb 15, 2025

Thank you very much for your amazing open-source work. When I tried to reproduce the code you provided on a node with four A100 GPUs, I encountered the following error: TypeError: JobSupervisor.__init__() takes 5 positional arguments but 7 were given. Additionally, the log shows:

Address already in use
Port 5000 is in use by another program. Either identify and stop that program, or start the server with a different port.

However, even after I changed the port number in the remote_rm_url HTTP address, I still encounter the same error and the same log. I would like to ask whether there is an issue with how I've set my parameters. Below is the content of my .sh file:

export DATASET="/data/lmm-r1/examples/data/mathlv345_8k_chatml.json"

MODEL_CPK_NAME="qwenvl25_3B_ins_rloo_math"
PRETRAIN_MODEL="/data/Qwen2.5-VL-3B-Instruct"
SAVE_PATH="./ckpts"
mkdir -p "${SAVE_PATH}/${MODEL_CPK_NAME}"
export CUDA_VISIBLE_DEVICES="4,5,6,7"
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
python -m openrlhf.models.remote_rm.math_verifier --dataset $DATASET --input_key prompt --prompt-template chatml > "${SAVE_PATH}/${MODEL_CPK_NAME}/remote_rm.log" 2>&1 &
childpid=$!

ray start --head --node-ip-address 0.0.0.0 --num-gpus 4 --temp-dir ~/.cache/ray

ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json='{"working_dir": "/data/lmm-r1/OpenRLHF"}' \
   -- python3 -m openrlhf.cli.train_ppo_ray \
   --ref_num_nodes 1 \
   --ref_num_gpus_per_node 4 \
   --remote_rm_url http://127.0.0.1:5233/get_reward \
   --actor_num_nodes 1 \
   --actor_num_gpus_per_node 4 \
   --vllm_num_engines 4 \
   --vllm_tensor_parallel_size 1 \
   --colocate_all_models \
   --vllm_enable_sleep \
   --vllm_gpu_memory_utilization 0.7 \
   --vllm_sync_backend gloo \
   --enable_prefix_caching \
   --pretrain $PRETRAIN_MODEL \
   --save_path $SAVE_PATH/$MODEL_CPK_NAME \
   --micro_train_batch_size 2 \
   --train_batch_size 128 \
   --micro_rollout_batch_size 4 \
   --rollout_batch_size 256 \
   --temperature 1 \
   --n_samples_per_prompt 16 \
   --max_epochs 1 \
   --num_episodes 30 \
   --prompt_max_len 1024 \
   --max_samples 100000 \
   --generate_max_len 3000 \
   --advantage_estimator rloo \
   --zero_stage 3 \
   --bf16 \
   --actor_learning_rate 1e-6 \
   --init_kl_coef 0.0 \
   --prompt_data $DATASET \
   --input_key prompt \
   --normalize_reward \
   --flash_attn \
   --gradient_checkpointing \
   --save_steps 10 \
   --ckpt_path $SAVE_PATH/$MODEL_CPK_NAME/ckpt \
   --save_hf_ckpt \
   --use_tensorboard $SAVE_PATH/$MODEL_CPK_NAME/logs
ray stop

@TideDra (Owner) commented Feb 15, 2025

Which line of code triggers TypeError: JobSupervisor.__init__() takes 5 positional arguments but 7 were given? Do you mean that even after you change the port to 5233, the log still shows 5000 is in use? It seems to be a problem with Flask, not the training code. You can check the remote_rm server standalone; it's a simple script.
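A quick standalone check (a sketch, assuming the verifier is started as in the posted script and switched to port 5233; the curl only verifies the server is up, since the exact request schema of /get_reward is defined inside math_verifier.py):

```bash
# Start only the reward-model server in the foreground so its errors print directly.
python -m openrlhf.models.remote_rm.math_verifier \
    --dataset "$DATASET" --input_key prompt --prompt-template chatml

# In another shell: find out which process already holds port 5000.
ss -ltnp 'sport = :5000'        # or: lsof -i :5000

# Confirm the verifier is actually listening on the port you configured (5233 here).
curl -v http://127.0.0.1:5233/get_reward
```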

@lky-violet (Author) commented

Thank you for your prompt reply. The error log is as follows. I would like to ask whether I need to modify the port number of the Flask app math_verifier in the code; I couldn't find any related setting in the .sh file.

[screenshot: error log]

@TideDra (Owner) commented Feb 16, 2025

The port number is currently hard-coded in openrlhf/models/remote_rm/math_verifier.py. You can try changing it and see.
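For anyone else hitting this, one way to locate the hard-coded value before editing it (a sketch; the exact line in math_verifier.py may look different, so check the grep output rather than editing blindly):

```bash
# Find where the Flask server binds its port; the line usually looks like
# app.run(host=..., port=5000) or a port constant near the top of the file.
grep -n "port" openrlhf/models/remote_rm/math_verifier.py

# After changing it to a free port (e.g. 5233), keep the trainer flag in sync:
#   --remote_rm_url http://127.0.0.1:5233/get_reward
```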

@lky-violet (Author) commented

Thank you for your reply. I have changed the port number and solved the port-in-use problem, but I still cannot run the training. The log and the error are shown below. I checked https://github.com/ray-project/ray/issues/24920 to solve the problem, but it doesn't work. I would like to ask, what might be the issue here?

[screenshots: training log and error message]

@TideDra (Owner) commented Feb 17, 2025

Did you put any large files under the working directory? Or is the temp dir ~/.cache/ray out of space?
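Two quick checks (a sketch using standard tools, with the paths taken from the script above):

```bash
# Free space on the partition that holds Ray's temp dir (logs, object spill files).
df -h ~/.cache/ray

# Size of the working_dir that ray job submit uploads with the job.
du -sh /data/lmm-r1/OpenRLHF
```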

@lky-violet (Author) commented

I have checked the available space in the current working directory and in the cache. There is 162T of space left in the current directory, but the cache partition has already reached 99% usage, as shown in the figure below. Could you please advise me on what I should do? I would be extremely grateful!

[screenshot: disk usage output showing the cache partition at 99%]

@TideDra (Owner) commented Feb 17, 2025

The path ~/.cache/ray is set in the script (the --temp-dir argument to ray start). Just change it to another path with free space.
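For example (a minimal sketch; /data/ray_tmp is just an illustrative path on a partition with space left):

```bash
# Put Ray's temp files on a partition with plenty of free space.
mkdir -p /data/ray_tmp
ray start --head --node-ip-address 0.0.0.0 --num-gpus 4 --temp-dir /data/ray_tmp
```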

@lky-violet (Author) commented

Thank you for your reply. I have changed the temp dir to a local path with available space, but I still get the error message shown below. I would like to ask whether the error is caused by the inability to connect to Redis, as shown in the log. I apologize for not being very familiar with distributed systems, and I appreciate your help.

[screenshot: error log]

@TideDra (Owner) commented Feb 18, 2025

It's weird to see the timestamp session_2025-01-16; maybe your system environment is not clean. I suggest running in a clean nvidia/cuda Docker container.
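Something along these lines (a sketch; the image tag, GPU list, and mount paths are illustrative and should be adapted to your driver and data layout):

```bash
# Start a clean CUDA container with the four GPUs, generous shared memory for
# Ray/vLLM, and the code/data mounted in; install the project inside it afterwards.
docker run --gpus '"device=4,5,6,7"' --shm-size=64g -it \
    -v /data:/data nvidia/cuda:12.1.1-devel-ubuntu22.04 bash
```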

@TideDra (Owner) commented Feb 20, 2025

I have encountered the same error. My solution is to remove the cache files under ~/.cache/ray.
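Concretely, something like the following (a sketch; only do this if no other run on the machine is using that Ray temp dir):

```bash
# Stop any leftover Ray processes from previous runs, then clear the stale session files.
ray stop --force
rm -rf ~/.cache/ray
```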

@lky-violet (Author) commented Feb 20, 2025

> I have encountered the same error. My solution is to remove the cache files under ~/.cache/ray.

Thank you for your reply. I tried rm -rf on ~/.cache/ray and changed the temp dir from the local dir back to ~/.cache/ray. However, I still get the error mentioned above. I searched for this error and found https://discuss.ray.io/t/assertionerror-session-name-does-not-match-persisted-value/13067, but sadly I have no permission to kill other people's processes.
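If the stale session belongs to another user's Ray instance on the shared node, one possible workaround (a sketch; the temp dir and port are illustrative) is to keep your own instance fully separate instead of killing theirs:

```bash
# Use a per-user temp dir and a non-default GCS port so this head node never
# touches the other users' Ray session files or ports.
ray start --head --node-ip-address 0.0.0.0 --num-gpus 4 \
    --temp-dir "/data/${USER}/ray_tmp" --port 6380
```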
