Error in Example Run #5
Which line of code triggers …
The port number is hard-coded in …
Thank you for your reply. I changed the port number, which solved the port-in-use problem, but I still cannot run the training. The log and the issue are shown below. I checked https://github.com/ray-project/ray/issues/24920 to solve the problem, but it didn't help. What might be the issue here?
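For anyone else hitting the port-in-use error: instead of guessing a free port by hand, you can let the kernel pick one (a sketch; assumes `python3` is on the PATH, and `FREE_PORT` is a name I made up for illustration):

```shell
# Bind to port 0 and the kernel assigns an unused TCP port
FREE_PORT=$(python3 -c 'import socket; s = socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
echo "picked free port: $FREE_PORT"
# Both the reward server and --remote_rm_url could then use $FREE_PORT
```

This avoids racing against whatever other service already claimed the hard-coded port.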
Did you put any large files under the working directory? Or does the temp dir … the path …
Thank you for your reply. I changed the tmp_dir to a locally available path, but I still get the error message shown below. Could the error be due to the inability to connect to Redis, as mentioned above? I apologize for not being very familiar with distributed systems, and I appreciate your help.
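Redis aside, one thing worth ruling out: a long or non-writable `--temp-dir` can also break Ray startup, because Ray creates Unix domain sockets under it and socket paths have a hard length limit. A short, node-local, writable path is a safe choice (a sketch; `/tmp/ray_tmp` is a placeholder, adjust to your machine):

```shell
# Restart Ray with a short, writable, node-local temp dir
ray stop --force
ray start --head --num-gpus 4 --temp-dir /tmp/ray_tmp
```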
It's weird to see the time …
I encountered the same error. My solution was to remove the cache files.
Thank you for your reply. I tried to rm -rf the …
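For reference, the cleanup pattern I'd suggest here is stop-then-remove, so no Ray process still holds files when you delete them (a sketch; demonstrated on a scratch directory so it's safe to copy-paste, the real target being the `--temp-dir` path such as `~/.cache/ray`):

```shell
# 1) Stop Ray first so nothing still holds the temp dir:
#      ray stop --force
# 2) Then remove the temp dir. Demonstrated on a scratch dir:
SCRATCH=$(mktemp -d)                 # stands in for ~/.cache/ray
mkdir -p "$SCRATCH/session_latest"   # Ray keeps session state here
rm -rf "$SCRATCH"
[ ! -e "$SCRATCH" ] && echo "cache cleared"
```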
Thank you very much for your amazing open-source work. When I tried to reproduce the code you provided on a 4-card A100 machine, I encountered the following error: `TypeError: JobSupervisor.__init__() takes 5 positional arguments but 7 were given`. Additionally, the log shows:
Address already in use
Port 5000 is in use by another program. Either identify and stop that program, or start the server with a different port.
However, even after I changed the port number of the remote_rm_url HTTP server, I still encounter the same error and the same log. Could there be an issue with how I've set my parameters? Below is the content of my sh file:
export DATASET="/data/lmm-r1/examples/data/mathlv345_8k_chatml.json"
MODEL_CPK_NAME="qwenvl25_3B_ins_rloo_math"
PRETRAIN_MODEL="/data/Qwen2.5-VL-3B-Instruct"
SAVE_PATH="./ckpts"
mkdir -p "${SAVE_PATH}/${MODEL_CPK_NAME}"
export CUDA_VISIBLE_DEVICES="4,5,6,7"
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1

python -m openrlhf.models.remote_rm.math_verifier --dataset $DATASET --input_key prompt --prompt-template chatml > "${SAVE_PATH}/${MODEL_CPK_NAME}/remote_rm.log" 2>&1 &
childpid=$!

ray start --head --node-ip-address 0.0.0.0 --num-gpus 4 --temp-dir ~/.cache/ray

ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json='{"working_dir": "/data/lmm-r1/OpenRLHF"}' \
   -- python3 -m openrlhf.cli.train_ppo_ray \
   --ref_num_nodes 1 \
   --ref_num_gpus_per_node 4 \
   --remote_rm_url http://127.0.0.1:5233/get_reward \
   --actor_num_nodes 1 \
   --actor_num_gpus_per_node 4 \
   --vllm_num_engines 4 \
   --vllm_tensor_parallel_size 1 \
   --colocate_all_models \
   --vllm_enable_sleep \
   --vllm_gpu_memory_utilization 0.7 \
   --vllm_sync_backend gloo \
   --enable_prefix_caching \
   --pretrain $PRETRAIN_MODEL \
   --save_path $SAVE_PATH/$MODEL_CPK_NAME \
   --micro_train_batch_size 2 \
   --train_batch_size 128 \
   --micro_rollout_batch_size 4 \
   --rollout_batch_size 256 \
   --temperature 1 \
   --n_samples_per_prompt 16 \
   --max_epochs 1 \
   --num_episodes 30 \
   --prompt_max_len 1024 \
   --max_samples 100000 \
   --generate_max_len 3000 \
   --advantage_estimator rloo \
   --zero_stage 3 \
   --bf16 \
   --actor_learning_rate 1e-6 \
   --init_kl_coef 0.0 \
   --prompt_data $DATASET \
   --input_key prompt \
   --normalize_reward \
   --flash_attn \
   --gradient_checkpointing \
   --save_steps 10 \
   --ckpt_path $SAVE_PATH/$MODEL_CPK_NAME/ckpt \
   --save_hf_ckpt \
   --use_tensorboard $SAVE_PATH/$MODEL_CPK_NAME/logs
ray stop