fix compatibility issues with nyuv2 experiments #47
Conversation
Co-authored-by: Copilot <[email protected]>
Copilot reviewed 13 out of 27 changed files in this pull request and generated 3 suggestions.
Files not reviewed (14)
- config/llama_weighted_average.yaml: Language not supported
- examples/lm_finetune/llama_fullfinetune.sh: Language not supported
- fusion_bench/dataset/llama/collate.py: Evaluated as low risk
- fusion_bench/compat/modelpool/init.py: Evaluated as low risk
- fusion_bench/compat/taskpool/init.py: Evaluated as low risk
- config/method/lm_finetune/peftfinetune_sft.yaml: Evaluated as low risk
- config/modelpool/SeqenceClassificationModelPool/llama_preference700k.yaml: Evaluated as low risk
- fusion_bench/method/lm_finetune/bradley_terry_rm.py: Evaluated as low risk
- config/nyuv2_config.yaml: Evaluated as low risk
- config/modelpool/CausalLMPool/llama_ultrachat.yaml: Evaluated as low risk
- config/modelpool/SeqenceClassificationModelPool/single_reward_model.yaml: Evaluated as low risk
- config/dataset/llm_sft/ultrachat_200k.yaml: Evaluated as low risk
- fusion_bench/dataset/llama/preference_700k.py: Evaluated as low risk
- config/taskpool/reward_model_evaluation.yaml: Evaluated as low risk
Comments skipped due to low confidence (3)
fusion_bench/dataset/llama/ultrachat.py:37
- [nitpick] The comment on line 37 is unnecessary and should be removed for clarity.
# ? is it necessary to `.replace(tokenizer.bos_token, '')`?
fusion_bench/dataset/llama/stanford_shp.py:69
- The assertion error message uses an undefined variable 'positive'. It should use 'chosen' instead.
assert (tokenizer.eos_token_id not in tokenized_pos["input_ids"][:-1]), f"Prompt contains EOS token: {sample['positive']}"
fusion_bench/dataset/llama/stanford_shp.py:79
- The assertion error message references an undefined variable. It should use 'rejected' instead (see the corrected sketch after this list).
assert (tokenizer.eos_token_id not in tokenized_neg["input_ids"][:-1]), f"Prompt contains EOS token: {sample['rejected']}"
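Both stanford_shp comments point at the same fix: the f-strings in the assertion messages should reference keys that actually exist in the sample. A minimal sketch of the corrected assertions, assuming the Stanford SHP samples expose 'chosen' and 'rejected' keys as the comments imply:

# Sketch only: the 'chosen'/'rejected' keys and surrounding variables are assumed from the comments above.
assert tokenizer.eos_token_id not in tokenized_pos["input_ids"][:-1], (
    f"Prompt contains EOS token: {sample['chosen']}"
)
assert tokenizer.eos_token_id not in tokenized_neg["input_ids"][:-1], (
    f"Prompt contains EOS token: {sample['rejected']}"
)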
cache_path: Optional[str] = None,
):
    R"""
    Load and tokenized Ultrachat 200k dataset for Bradley-Terry ranking model.
The docstring is misleading as it mentions 'winner' which is not relevant to the actual function. It should be updated to accurately describe the function's behavior.
Suggested change:
-    Load and tokenized Ultrachat 200k dataset for Bradley-Terry ranking model.
+    Load and tokenize the Ultrachat 200k dataset for Bradley-Terry ranking model.
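For readers unfamiliar with the term in this docstring: a Bradley-Terry ranking (reward) model scores a chosen and a rejected response and is trained so that the chosen one receives the higher reward, which is why the dataset is tokenized into chosen/rejected pairs. A background sketch of the standard loss, not code from this PR:

# Background sketch of the Bradley-Terry reward-model loss (illustrative, not from fusion_bench).
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected); minimize the negative log-likelihood.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, -0.1])
print(bradley_terry_loss(chosen, rejected))  # smaller when chosen rewards exceed rejected ones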
sample["rejected_input_ids"].append(tokenizer.eos_token_id) | ||
sample["rejected_attention_mask"].append(1) | ||
|
||
dataset = dataset.map(tokenize, num_proc=num_proc) |
The 'tokenize' function does not return the modified sample, which is required for 'dataset.map' to work correctly.
Suggested change:
-    dataset = dataset.map(tokenize, num_proc=num_proc)
+    return sample
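For reference, 'datasets.Dataset.map' only picks up new columns when the mapping function returns the updated example; mutating it in place without a return statement leaves the dataset unchanged. A minimal self-contained sketch of the intended pattern, reusing the field names from the snippet above (the toy in-memory dataset and the GPT-2 tokenizer are assumptions for the demo, not the PR's actual setup):

# Minimal runnable sketch of the pattern flagged above: the map function must return the updated example.
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer with an eos_token_id works here

def tokenize(sample):
    rejected = tokenizer(sample["rejected"])
    sample["rejected_input_ids"] = rejected["input_ids"] + [tokenizer.eos_token_id]
    sample["rejected_attention_mask"] = rejected["attention_mask"] + [1]
    return sample  # without this, dataset.map keeps only the original columns

dataset = Dataset.from_dict({"rejected": ["a short rejected reply"]})
dataset = dataset.map(tokenize, num_proc=1)
print(dataset[0]["rejected_input_ids"][-1] == tokenizer.eos_token_id)  # True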
@@ -0,0 +1,6 @@
alpaca-cleaned:
  _target_: fusion_bench.dataset.llama.alpaca.load_tokenized_alpaca_dataset
  tokenizer: ???
The tokenizer field is set to ???. It should be replaced with the appropriate tokenizer.
Suggested change:
-  tokenizer: ???
+  tokenizer: 'appropriate_tokenizer'
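One note on this last suggestion: in Hydra/OmegaConf configs, '???' is the standard marker for a mandatory value that must be supplied at runtime (typically via a command-line override), so it is not necessarily a mistake. A minimal OmegaConf sketch of how the marker behaves, independent of this repository's configs (the example tokenizer name is arbitrary):

# Demonstrates OmegaConf's '???' mandatory-value marker; unrelated to fusion_bench itself.
from omegaconf import OmegaConf
from omegaconf.errors import MissingMandatoryValue

cfg = OmegaConf.create({"tokenizer": "???"})
print(OmegaConf.is_missing(cfg, "tokenizer"))  # True: the value must be provided before use

try:
    _ = cfg.tokenizer  # reading the value while it is still missing raises
except MissingMandatoryValue as err:
    print("missing:", err)

cfg.tokenizer = "meta-llama/Llama-2-7b-hf"  # e.g. filled in by a Hydra override
print(cfg.tokenizer)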
No description provided.