
fix compatibility issues with nyuv2 experiments #47

Merged: 13 commits merged into main on Dec 3, 2024
Conversation

@tanganke (Owner) commented Dec 3, 2024

No description provided.

@tanganke requested a review from Copilot on December 3, 2024 16:17
@tanganke merged commit 299a481 into main on Dec 3, 2024
Copilot AI (Contributor) left a comment


Copilot reviewed 13 out of 27 changed files in this pull request and generated 3 suggestions.

Files not reviewed (14)
  • config/llama_weighted_average.yaml: Language not supported
  • examples/lm_finetune/llama_fullfinetune.sh: Language not supported
  • fusion_bench/dataset/llama/collate.py: Evaluated as low risk
  • fusion_bench/compat/modelpool/init.py: Evaluated as low risk
  • fusion_bench/compat/taskpool/init.py: Evaluated as low risk
  • config/method/lm_finetune/peftfinetune_sft.yaml: Evaluated as low risk
  • config/modelpool/SeqenceClassificationModelPool/llama_preference700k.yaml: Evaluated as low risk
  • fusion_bench/method/lm_finetune/bradley_terry_rm.py: Evaluated as low risk
  • config/nyuv2_config.yaml: Evaluated as low risk
  • config/modelpool/CausalLMPool/llama_ultrachat.yaml: Evaluated as low risk
  • config/modelpool/SeqenceClassificationModelPool/single_reward_model.yaml: Evaluated as low risk
  • config/dataset/llm_sft/ultrachat_200k.yaml: Evaluated as low risk
  • fusion_bench/dataset/llama/preference_700k.py: Evaluated as low risk
  • config/taskpool/reward_model_evaluation.yaml: Evaluated as low risk
Comments skipped due to low confidence (3)

fusion_bench/dataset/llama/ultrachat.py:37

  • [nitpick] The comment on line 37 is unnecessary and should be removed for clarity.
# ? is it necessary to `.replace(tokenizer.bos_token, '')`?

fusion_bench/dataset/llama/stanford_shp.py:69

  • The assertion error message uses an undefined variable 'positive'. It should use 'chosen' instead.
assert (tokenizer.eos_token_id not in tokenized_pos["input_ids"][:-1]), f"Prompt contains EOS token: {sample['positive']}"

fusion_bench/dataset/llama/stanford_shp.py:79

  • The assertion error message may reference an undefined variable; verify that 'rejected' is the correct key for this sample.
assert (tokenizer.eos_token_id not in tokenized_neg["input_ids"][:-1]), f"Prompt contains EOS token: {sample['rejected']}"
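The two assertions above check the same invariant: the EOS token may appear only as the final input id. A minimal self-contained sketch of that check, with matching keys; the helper name `check_no_early_eos` is hypothetical, and `sample`, `tokenized_pos`, and `tokenized_neg` are stand-ins for the names in stanford_shp.py:

```python
# Hypothetical helper illustrating the corrected assertions; not the
# repository's actual code.

def check_no_early_eos(tokenized, text, eos_token_id):
    """Assert the EOS token appears at most as the final input id."""
    assert eos_token_id not in tokenized["input_ids"][:-1], (
        f"Prompt contains EOS token: {text}"
    )

# Intended usage, with keys that actually exist in the sample:
# check_no_early_eos(tokenized_pos, sample["chosen"], tokenizer.eos_token_id)
# check_no_early_eos(tokenized_neg, sample["rejected"], tokenizer.eos_token_id)
```

Passing the same key to both the tokenized tensor and the error message keeps the assertion and its diagnostic consistent.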

cache_path: Optional[str] = None,
):
R"""
Load and tokenized Ultrachat 200k dataset for Bradley-Terry ranking model.

Copilot AI Dec 3, 2024


The docstring is misleading as it mentions 'winner' which is not relevant to the actual function. It should be updated to accurately describe the function's behavior.

Suggested change
- Load and tokenized Ultrachat 200k dataset for Bradley-Terry ranking model.
+ Load and tokenize the Ultrachat 200k dataset for Bradley-Terry ranking model.

Copilot is powered by AI, so mistakes are possible. Review output carefully before use.

sample["rejected_input_ids"].append(tokenizer.eos_token_id)
sample["rejected_attention_mask"].append(1)

dataset = dataset.map(tokenize, num_proc=num_proc)

Copilot AI Dec 3, 2024


The 'tokenize' function does not return the modified sample, which is required for 'dataset.map' to work correctly.

Suggested change
- dataset = dataset.map(tokenize, num_proc=num_proc)
+     return sample
+
+ dataset = dataset.map(tokenize, num_proc=num_proc)

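Copilot's point here is that `datasets.Dataset.map` uses the dictionary returned by the mapping function; a function that mutates `sample` but returns nothing silently drops the new fields. A minimal sketch of a map-style `tokenize` that returns its sample; the field names mirror the snippet above, while the tokenizer and EOS id are stand-ins, not the repository's actual ones:

```python
# Illustrative sketch only: the "tokenizer" and EOS id below are
# stand-ins so the example is self-contained.

EOS_TOKEN_ID = 2  # hypothetical EOS id

def fake_tokenize_text(text):
    """Stand-in tokenizer: one int id per whitespace-separated token."""
    return [hash(word) % 1000 for word in text.split()]

def tokenize(sample):
    ids = fake_tokenize_text(sample["rejected"])
    sample["rejected_input_ids"] = ids + [EOS_TOKEN_ID]
    sample["rejected_attention_mask"] = [1] * (len(ids) + 1)
    return sample  # required: Dataset.map uses the returned dict
```

With Hugging Face datasets, this would then be applied as `dataset = dataset.map(tokenize, num_proc=num_proc)`, as in the snippet above.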
@@ -0,0 +1,6 @@
alpaca-cleaned:
_target_: fusion_bench.dataset.llama.alpaca.load_tokenized_alpaca_dataset
tokenizer: ???
Copilot AI Dec 3, 2024


The tokenizer field is set to ???. It should be replaced with the appropriate tokenizer.

Suggested change
- tokenizer: ???
+ tokenizer: 'appropriate_tokenizer'

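For context, in Hydra/OmegaConf configs the literal `???` marks a mandatory value that must be supplied at runtime, so leaving it in a YAML file is often intentional rather than a bug. A small stand-in illustrating that contract in plain Python, without the omegaconf dependency; the `resolve` helper is hypothetical:

```python
# Illustration only: mimics OmegaConf's mandatory-value behavior,
# where reading a `???` field before overriding it is an error.

MANDATORY = "???"

def resolve(config, key, overrides=None):
    """Return config[key], honoring overrides; raise if still mandatory."""
    value = (overrides or {}).get(key, config.get(key))
    if value == MANDATORY:
        raise ValueError(f"Missing mandatory value: {key}")
    return value

config = {
    "_target_": "fusion_bench.dataset.llama.alpaca.load_tokenized_alpaca_dataset",
    "tokenizer": MANDATORY,  # must be provided by the caller
}
```

Under this reading, replacing `???` with a placeholder string like `'appropriate_tokenizer'` would hide the error instead of surfacing it at load time.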