
fix compatibility issues with nyuv2 experiments #47

Merged: 13 commits merged into main on Dec 3, 2024
Conversation

@tanganke (Owner) commented Dec 3, 2024

No description provided.

@tanganke requested a review from Copilot on December 3, 2024 16:17
@tanganke merged commit 299a481 into main on Dec 3, 2024
Copilot AI (Contributor) left a comment


Copilot reviewed 13 out of 27 changed files in this pull request and generated 3 suggestions.

Files not reviewed (14)
  • config/llama_weighted_average.yaml: Language not supported
  • examples/lm_finetune/llama_fullfinetune.sh: Language not supported
  • fusion_bench/dataset/llama/collate.py: Evaluated as low risk
  • fusion_bench/compat/modelpool/init.py: Evaluated as low risk
  • fusion_bench/compat/taskpool/init.py: Evaluated as low risk
  • config/method/lm_finetune/peftfinetune_sft.yaml: Evaluated as low risk
  • config/modelpool/SeqenceClassificationModelPool/llama_preference700k.yaml: Evaluated as low risk
  • fusion_bench/method/lm_finetune/bradley_terry_rm.py: Evaluated as low risk
  • config/nyuv2_config.yaml: Evaluated as low risk
  • config/modelpool/CausalLMPool/llama_ultrachat.yaml: Evaluated as low risk
  • config/modelpool/SeqenceClassificationModelPool/single_reward_model.yaml: Evaluated as low risk
  • config/dataset/llm_sft/ultrachat_200k.yaml: Evaluated as low risk
  • fusion_bench/dataset/llama/preference_700k.py: Evaluated as low risk
  • config/taskpool/reward_model_evaluation.yaml: Evaluated as low risk
Comments skipped due to low confidence (3)

fusion_bench/dataset/llama/ultrachat.py:37

  • [nitpick] The comment on line 37 is unnecessary and should be removed for clarity.
# ? is it necessary to `.replace(tokenizer.bos_token, '')`?

fusion_bench/dataset/llama/stanford_shp.py:69

  • The assertion error message uses an undefined variable 'positive'. It should use 'chosen' instead.
assert (tokenizer.eos_token_id not in tokenized_pos["input_ids"][:-1]), f"Prompt contains EOS token: {sample['positive']}"

fusion_bench/dataset/llama/stanford_shp.py:79

  • The assertion error message may reference an undefined variable; verify that 'rejected' is the correct key for this sample.
assert (tokenizer.eos_token_id not in tokenized_neg["input_ids"][:-1]), f"Prompt contains EOS token: {sample['rejected']}"
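The two assertions above check the same invariant: the EOS token may appear only as the final input id. A minimal self-contained sketch of that check, with matching keys; the helper name `check_no_early_eos` is hypothetical, and `sample`, `tokenized_pos`, and `tokenized_neg` are stand-ins for the names in stanford_shp.py:

```python
# Hypothetical helper illustrating the corrected assertions; not the
# repository's actual code.

def check_no_early_eos(tokenized, text, eos_token_id):
    """Assert the EOS token appears at most as the final input id."""
    assert eos_token_id not in tokenized["input_ids"][:-1], (
        f"Prompt contains EOS token: {text}"
    )

# Intended usage, with keys that actually exist in the sample:
# check_no_early_eos(tokenized_pos, sample["chosen"], tokenizer.eos_token_id)
# check_no_early_eos(tokenized_neg, sample["rejected"], tokenizer.eos_token_id)
```

Passing the same key to both the tokenized tensor and the error message keeps the assertion and its diagnostic consistent.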

cache_path: Optional[str] = None,
):
R"""
Load and tokenized Ultrachat 200k dataset for Bradley-Terry ranking model.

Copilot AI Dec 3, 2024


The docstring is misleading as it mentions 'winner' which is not relevant to the actual function. It should be updated to accurately describe the function's behavior.

Suggested change
- Load and tokenized Ultrachat 200k dataset for Bradley-Terry ranking model.
+ Load and tokenize the Ultrachat 200k dataset for Bradley-Terry ranking model.

Copilot is powered by AI, so mistakes are possible. Review output carefully before use.

sample["rejected_input_ids"].append(tokenizer.eos_token_id)
sample["rejected_attention_mask"].append(1)

dataset = dataset.map(tokenize, num_proc=num_proc)

Copilot AI Dec 3, 2024


The 'tokenize' function does not return the modified sample, which is required for 'dataset.map' to work correctly.

Suggested change
- dataset = dataset.map(tokenize, num_proc=num_proc)
+     return sample
+
+ dataset = dataset.map(tokenize, num_proc=num_proc)

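Copilot's point here is that `datasets.Dataset.map` uses the dictionary returned by the mapping function; a function that mutates `sample` but returns nothing silently drops the new fields. A minimal sketch of a map-style `tokenize` that returns its sample; the field names mirror the snippet above, while the tokenizer and EOS id are stand-ins, not the repository's actual ones:

```python
# Illustrative sketch only: the "tokenizer" and EOS id below are
# stand-ins so the example is self-contained.

EOS_TOKEN_ID = 2  # hypothetical EOS id

def fake_tokenize_text(text):
    """Stand-in tokenizer: one int id per whitespace-separated token."""
    return [hash(word) % 1000 for word in text.split()]

def tokenize(sample):
    ids = fake_tokenize_text(sample["rejected"])
    sample["rejected_input_ids"] = ids + [EOS_TOKEN_ID]
    sample["rejected_attention_mask"] = [1] * (len(ids) + 1)
    return sample  # required: Dataset.map uses the returned dict
```

With Hugging Face datasets, this would then be applied as `dataset = dataset.map(tokenize, num_proc=num_proc)`, as in the snippet above.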
@@ -0,0 +1,6 @@
alpaca-cleaned:
_target_: fusion_bench.dataset.llama.alpaca.load_tokenized_alpaca_dataset
tokenizer: ???
Copilot AI Dec 3, 2024


The tokenizer field is set to ???. It should be replaced with the appropriate tokenizer.

Suggested change
- tokenizer: ???
+ tokenizer: 'appropriate_tokenizer'

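For context, in Hydra/OmegaConf configs the literal `???` marks a mandatory value that must be supplied at runtime, so leaving it in a YAML file is often intentional rather than a bug. A small stand-in illustrating that contract in plain Python, without the omegaconf dependency; the `resolve` helper is hypothetical:

```python
# Illustration only: mimics OmegaConf's mandatory-value behavior,
# where reading a `???` field before overriding it is an error.

MANDATORY = "???"

def resolve(config, key, overrides=None):
    """Return config[key], honoring overrides; raise if still mandatory."""
    value = (overrides or {}).get(key, config.get(key))
    if value == MANDATORY:
        raise ValueError(f"Missing mandatory value: {key}")
    return value

config = {
    "_target_": "fusion_bench.dataset.llama.alpaca.load_tokenized_alpaca_dataset",
    "tokenizer": MANDATORY,  # must be provided by the caller
}
```

Under this reading, replacing `???` with a placeholder string like `'appropriate_tokenizer'` would hide the error instead of surfacing it at load time.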