
Add basic eval table logging for WandbCallback #31050

Open · wants to merge 9 commits into main

Conversation

@andrewtruong commented on May 27, 2024

What does this PR do?

This PR adds basic support for logging raw evals using WandbCallback.

  1. Calling trainer.evaluate() will log an eval table with inputs and outputs; and
  2. Logging is controlled by the WANDB_LOG_EVALS env var.

This also makes some changes to the trainer internals to support this, including:

  1. Updates EvalLoopOutput to include inputs
  2. Automatically adds a ref to the trainer when instantiating callbacks

Here's an example table that gets logged when the user calls trainer.evaluate():
(screenshot of the logged eval table)
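
For illustration, a minimal sketch of the kind of logging this enables (the helper name log_eval_table and the exact columns are illustrative only, not the PR's implementation; it assumes a text model whose predictions are per-token logits):

import os
import wandb

def log_eval_table(tokenizer, eval_loop_output):
    # Logging is opt-in via the env var introduced by this PR.
    if not os.getenv("WANDB_LOG_EVALS"):
        return
    # Decode the raw inputs and the predicted token ids back into text.
    inputs = tokenizer.batch_decode(eval_loop_output.inputs, skip_special_tokens=True)
    preds = tokenizer.batch_decode(eval_loop_output.predictions.argmax(-1), skip_special_tokens=True)
    # One row per eval example.
    table = wandb.Table(columns=["input", "prediction"], data=[list(row) for row in zip(inputs, preds)])
    wandb.log({"eval/table": table})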

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@amyeroberts are you the right person to review?

@andrewtruong (Author)

I'm not sure how to get the failing tests to pass, even on main. Am I missing something?

Failing tests are:

  • tests/utils/test_offline.py::OfflineTests::test_offline_mode
  • tests/utils/test_offline.py::OfflineTests::test_offline_mode_sharded_checkpoint
  • tests/utils/test_offline.py::OfflineTests::test_offline_model_dynamic_model

@andrewtruong (Author)

This PR also makes 2 changes to the trainer internals that could be split into their own PRs: (1) exposing inputs on EvalLoopOutput, and (2) adding a trainer reference to callbacks.

@andrewtruong marked this pull request as ready for review on May 27, 2024 17:02
@andrewtruong changed the title from "[WIP] Add basic eval table logging for WandbCallback" to "Add basic eval table logging for WandbCallback" on May 27, 2024
@andrewtruong (Author)

Hi @amyeroberts!

Sorry to bother -- are you the right person to take a look at this?

@amyeroberts (Collaborator)

Hi @andrewtruong, thanks for opening this PR!

I'm going to bring in @muellerzr here, who knows far more about Trainer and its interactions with W&B. One thing to note is that integrations like W&B aren't actively maintained by the Hugging Face team. Looking at the PR, I'm not sure we want to add attributes to core objects like Trainer to enable feature integrations, unless it would unlock other things (here @muellerzr can definitely advise!)

@andrewtruong (Author)

Thanks!

I'm hoping these are small and reasonable changes:

  1. It doesn't touch the public Trainer API, only the internal EvalLoopOutput object. It just surfaces a variable that was already there but not exposed.
  2. It adds a reference back to the trainer from a callback, which can be useful if, e.g., you need the tokenizer or dataset that are passed into the Trainer but not into the callback (sketch below).
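
A rough sketch of what (2) enables from inside a callback (MyEvalCallback is a hypothetical example, not part of this PR):

from transformers import TrainerCallback

class MyEvalCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, **kwargs):
        # Both of these live on the Trainer but are not passed to callbacks directly,
        # so they are only reachable through the trainer back-reference this PR injects.
        tokenizer = self.trainer.tokenizer
        eval_dataset = self.trainer.eval_dataset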

@andrewtruong (Author)

Hey @muellerzr, any chance you can take a look this week? I'd love to get this in :)

@muellerzr (Contributor) left a comment

On paper this seems okay. Let's try our best to come up with a solution that doesn't involve self-referencing the trainer, if possible, please.

Comment on lines +3915 to +3952
return EvalLoopOutput(
    predictions=all_preds, label_ids=all_labels, metrics=metrics, num_samples=num_samples, inputs=all_inputs
)
Contributor:

Some things to be very careful with here. I'd appreciate checking the memory usage before/after this change, to make sure we don't have a memory leak and we don't increase the VRAM used by the user by utilizing this.

Author:

Any profiling tools you recommend / would want to see?
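
For example, would a simple before/after peak-memory comparison along these lines be enough (a sketch assuming a single-GPU PyTorch setup and an already-constructed trainer)?

import torch

# Compare this number on main vs. this branch to check for extra VRAM usage during eval.
torch.cuda.reset_peak_memory_stats()
trainer.evaluate()
print(f"peak VRAM during evaluate: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")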

Contributor:

wandb logs work just fine :)

@@ -933,6 +984,36 @@ def on_predict(self, args, state, control, metrics, **kwargs):
        metrics = rewrite_logs(metrics)
        self._wandb.log(metrics)

    def on_evaluate(self, args, state, control, **kwargs):
        if os.getenv("WANDB_LOG_EVALS"):
            eval_loop_output = self.trainer.eval_loop_output
Contributor:

Where is eval_loop_output coming from exactly?

Author:

Thanks for catching that -- I missed pushing this commit. It's added here:
b8d5c6e

Comment on lines +575 to +579
def init_callback(cb):
    cb.trainer = self
    return cb

callbacks = [init_callback(cb) for cb in callbacks]
Contributor:

Is there a better way we can do this, rather than adding the trainer to the callback here? I'm not a fan of this because otherwise, if users are saving states, they have an entire reference to the trainer in there, which is not good.

Author:

I didn't love this either, but it's a tradeoff.

  1. The current Trainer reporting interface looks like report_to="wandb", which is fine if your callback doesn't need any args/kwargs, but in this case we do (the tokenizer, dataset, etc.), and the one object that has all of these is the Trainer.
  2. The alternative is also implemented in this PR, but then you can't pass report_to="wandb". Counter-intuitively, you have to NOT report to wandb; instead, you instantiate the callback yourself and manually pass it in -- not the end of the world, but it didn't seem idiomatic:

trainer = Trainer(...)
wandb_callback = WandbCallback(..., tokenizer=..., dataset=...)
trainer.add_callback(wandb_callback)

Contributor:

I'd much prefer the non-idiomatic way please

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@andrewtruong (Author) left a comment

whoops, just realized this was stuck in "pending review" instead of actually posting.

@andrewtruong (Author)

Hey @muellerzr any feedback?

@huggingface deleted a comment from the github-actions bot on Jul 28, 2024
@muellerzr (Contributor) left a comment

Thanks! This looks much better :)

@muellerzr (Contributor)

@andrewtruong can you fix the conflicts, then @amyeroberts can give it a final review!

@muellerzr requested a review from amyeroberts on July 31, 2024 19:48
@amyeroberts (Collaborator) left a comment

Thanks for working on this!

At the moment, there are too many hard-coded assumptions about the model being a text-based model. I'd suggest maybe a new, text-specific callback class in which we can safely make these assumptions.

Comment on lines +3915 to +3952
return EvalLoopOutput(
    predictions=all_preds, label_ids=all_labels, metrics=metrics, num_samples=num_samples, inputs=all_inputs
)
Collaborator:

What's the update here?

self.trainer = trainer

if tokenizer is None:
    tokenizer = self.trainer.tokenizer
Collaborator:

This assumes trainer is not None


if tokenizer is None:
    tokenizer = self.trainer.tokenizer
self.tokenizer = tokenizer
Collaborator:

Do we want to set this even if tokenizer is None?

self.tokenizer = tokenizer

if dataset is None:
    dataset = self.trainer.eval_dataset
Collaborator:

Same here -- this assumes self.trainer and self.trainer.dataset are not None.

"""

def __init__(self):
def __init__(
Collaborator:

There should be a docstring for all these args, in particular num_samples and freq, which aren't obvious.
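
For instance, something along these lines (the wording and semantics here are placeholders; the PR defines the real behavior):

    num_samples (`int`, *optional*):
        Number of rows sampled from the evaluation dataset for the logged table.
    freq (`int`, *optional*):
        How often (e.g. per evaluation call) the table is logged.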

try:
    sampled_dataset = dataset.select(range(num_samples))
except IndexError as e:
    print(f"WARNING: Could not get those indices: {e=}")
Collaborator:

Could we make this a bit clearer? The user never specifies indices, so it's a bit weird to refer to them as "those indices". Maybe something along the lines of "Could not select {num_samples=} rows from the dataset".

dataset = self.trainer.eval_dataset

try:
    sampled_dataset = dataset.select(range(num_samples))
Collaborator:

This assumes dataset is not None

print(f"WARNING: Could not get those indices: {e=}")
sampled_dataset = dataset

self.sample_dataset = sampled_dataset
Collaborator:

Do we want to store both the full and sampled dataset?

if ignore_tokens is None:
    ignore_tokens = [-100]

padding_token_id = self.tokenizer.pad_token_id
Collaborator:

Assumes tokenizer is not None. Confusingly, the tokenizer for Trainer may not be a tokenizer at all c.f. #32385

    a[mask] = padding_token_id
    return a

self._replace_ignored_tokens_func = replace_ignored_tokens
Collaborator:

Why do we need this functionality in the callback?
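
For context, label positions set to -100 are ignored by the loss and can't be decoded by a tokenizer, so the intent appears to be swapping them for the pad token before decoding -- roughly along these lines (a sketch with hypothetical defaults, not the PR's exact code):

import numpy as np

def replace_ignored_tokens(a, ignore_tokens=(-100,), padding_token_id=0):
    # Replace label ids that the loss ignores (e.g. -100) with the pad token id
    # so the array can safely be passed to tokenizer.batch_decode.
    a = np.asarray(a).copy()
    mask = np.isin(a, ignore_tokens)
    a[mask] = padding_token_id
    return a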


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
