Add basic eval table logging for WandbCallback #31050
Conversation
I'm not sure how to get the failing tests to pass. Failing tests are:

This PR also makes 2 changes to the trainer internals that could be split into their own PRs:
Hi @amyeroberts! Sorry to bother -- are you the right person to take a look at this?
Hi @andrewtruong, thanks for opening this PR! I'm going to bring in @muellerzr here, who knows far more about Trainer and its interactions with W&B. One thing to note is that integrations like W&B aren't actively maintained by the Hugging Face team. Looking at the PR, I'm not sure we want to add attributes to core objects like Trainer to enable feature integrations, unless it would unlock other things (here @muellerzr can definitely advise!)
Thanks! I'm hoping these are small and reasonable changes:

Hey @muellerzr, any chance you can take a look this week? I'd love to get this in :)
On paper this seems okay; let's try our best to come up with a solution that doesn't involve self-referencing the trainer, if possible, though, please.
return EvalLoopOutput(
    predictions=all_preds, label_ids=all_labels, metrics=metrics, num_samples=num_samples, inputs=all_inputs
)
Some things to be very careful with: I'd appreciate checking the memory usage before/after this change, to make sure we don't have a memory leak and we don't increase the VRAM used by the user by utilizing this.
Any profiling tools you recommend / would want to see?
wandb logs work just fine :)
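For a quick before/after check of peak VRAM during evaluation, a minimal sketch (assuming a single CUDA device and an already-built trainer; not part of the PR):

import torch

# Reset the peak-memory counter, run one eval pass, then read the high-water mark.
torch.cuda.reset_peak_memory_stats()
trainer.evaluate()
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM during evaluation: {peak_gib:.2f} GiB")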
What's the update here?
@@ -933,6 +984,36 @@ def on_predict(self, args, state, control, metrics, **kwargs):
        metrics = rewrite_logs(metrics)
        self._wandb.log(metrics)

    def on_evaluate(self, args, state, control, **kwargs):
        if os.getenv("WANDB_LOG_EVALS"):
            eval_loop_output = self.trainer.eval_loop_output
Where is eval_loop_output coming from exactly?
Thanks for catching, I missed this commit. It's added here: b8d5c6e
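For context, a rough sketch of what that commit presumably does inside Trainer.evaluate() (the attribute name matches the diff above; everything else here is an assumption):

# Inside Trainer.evaluate(), after running the eval loop:
output = self.evaluation_loop(eval_dataloader, description="Evaluation")
# Stash the full output on the Trainer so callbacks can read it in on_evaluate.
self.eval_loop_output = output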
def init_callback(cb):
    cb.trainer = self
    return cb

callbacks = [init_callback(cb) for cb in callbacks]
Is there a better way we can do this rather than adding the trainer to self here? I'm not a fan of this because otherwise, if users are saving states, they have an entire reference to the trainer in there, which is not good.
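One conceivable alternative, sketched here as an assumption rather than anything in the PR: hold the trainer behind a weak reference so the callback never keeps the Trainer alive on its own.

import weakref

def init_callback(cb):
    # Callbacks dereference with cb.trainer_ref(); this returns None once the
    # Trainer is garbage-collected, and the callback never pins the Trainer.
    cb.trainer_ref = weakref.ref(self)
    return cb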
I didn't love this either, but it's a tradeoff.

- The current Trainer reporting interface looks like report_to="wandb", which is fine if your callback doesn't need any args/kwargs, but in this case we do (the tokenizer, dataset, etc.) and the one object that has all of these is the Trainer.
- The alternative is also implemented in this PR, but you can't pass report_to="wandb". I think, counter-intuitively, you have to NOT report to wandb. Instead, you need to instantiate a callback and manually pass it in -- not the end of the world, but it didn't seem idiomatic:

trainer = Trainer(...)
wandb_callback = WandbCallback(..., tokenizer=..., dataset=...)
trainer.add_callback(wandb_callback)
I'd much prefer the non-idiomatic way please
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
whoops, just realized this was stuck in "pending review" instead of actually posting.
Force-pushed from 4235289 to a80db7a
Hey @muellerzr, any feedback?
Thanks! This looks much better :)
@andrewtruong can you fix the conflicts, then @amyeroberts can give it a final review!
Thanks for working on this!
At the moment, there are too many hard-coded assumptions that the model is a text-based model. I'd suggest maybe a new, text-specific callback class in which we can safely make these assumptions.
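A rough sketch of that suggestion (the class name, constructor signature, and defaults are all hypothetical):

from transformers.integrations import WandbCallback

class WandbTextEvalCallback(WandbCallback):
    # Text-specific variant: here it is safe to assume a tokenizer exists and
    # that predictions are token ids that can be decoded into strings.
    def __init__(self, tokenizer=None, dataset=None, num_samples=100, freq=1):
        super().__init__()
        self.tokenizer = tokenizer
        self.dataset = dataset
        self.num_samples = num_samples
        self.freq = freq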
return EvalLoopOutput(
    predictions=all_preds, label_ids=all_labels, metrics=metrics, num_samples=num_samples, inputs=all_inputs
)
What's the update here?
self.trainer = trainer

if tokenizer is None:
    tokenizer = self.trainer.tokenizer
This assumes trainer is not None.
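A defensive version of that fallback, as a sketch (the getattr guard is the only addition):

# Only fall back to the trainer's tokenizer if a trainer was actually attached.
if tokenizer is None and getattr(self, "trainer", None) is not None:
    tokenizer = self.trainer.tokenizer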
if tokenizer is None:
    tokenizer = self.trainer.tokenizer
self.tokenizer = tokenizer
Do we want to set this even if tokenizer is None?
self.tokenizer = tokenizer

if dataset is None:
    dataset = self.trainer.eval_dataset
Same here - this assumes self.trainer and self.trainer.eval_dataset are not None.
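One way to address this, sketched as an assumption: fail fast with a clear error instead of silently assuming a trainer is attached:

if dataset is None:
    if getattr(self, "trainer", None) is None:
        raise ValueError(
            "Eval-table logging needs either an explicit dataset "
            "or an attached trainer with an eval_dataset."
        )
    dataset = self.trainer.eval_dataset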
""" | ||
|
||
def __init__(self): | ||
def __init__( |
There should be a docstring for all these args - in particular num_samples and freq, which aren't obvious.
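Something along these lines would cover it (the defaults shown are hypothetical):

def __init__(self, tokenizer=None, dataset=None, num_samples=100, freq=1):
    """
    Args:
        num_samples (int, optional, defaults to 100):
            Number of rows sampled from the eval dataset into the logged table.
        freq (int, optional, defaults to 1):
            Log the eval table every freq calls to trainer.evaluate().
    """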
try:
    sampled_dataset = dataset.select(range(num_samples))
except IndexError as e:
    print(f"WARNING: Could not get those indices: {e=}")
Could we make this a bit clearer? The user never specifies indices, so it's a bit weird to refer to them as "those indices". Maybe something along the lines of "Could not select {num_samples=} rows from the dataset".
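Applied to the snippet above, that might look like this (switching from print to logging is an extra assumption):

import logging

logger = logging.getLogger(__name__)

try:
    sampled_dataset = dataset.select(range(num_samples))
except IndexError:
    # Clearer message: refer to the requested row count, not "those indices".
    logger.warning(f"Could not select {num_samples=} rows from the dataset; falling back to the full dataset.")
    sampled_dataset = dataset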
dataset = self.trainer.eval_dataset

try:
    sampled_dataset = dataset.select(range(num_samples))
This assumes dataset is not None.
print(f"WARNING: Could not get those indices: {e=}") | ||
sampled_dataset = dataset | ||
|
||
self.sample_dataset = sampled_dataset |
Do we want to store both the full and sampled dataset?
if ignore_tokens is None:
    ignore_tokens = [-100]

padding_token_id = self.tokenizer.pad_token_id
This assumes tokenizer is not None. Confusingly, the tokenizer for Trainer may not be a tokenizer at all, cf. #32385.
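A guard for that case, sketched here (the error message is an assumption):

# Trainer.tokenizer can be a processor rather than a tokenizer (cf. #32385),
# in which case there is no pad_token_id to fall back on.
padding_token_id = getattr(self.tokenizer, "pad_token_id", None)
if padding_token_id is None:
    raise ValueError("Eval-table logging requires a tokenizer with a pad_token_id.")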
    a[mask] = padding_token_id
    return a

self._replace_ignored_tokens_func = replace_ignored_tokens
Why do we need this functionality in the callback?
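For context, a hedged reconstruction of the full helper based on the fragment above (the lines before a[mask] are an assumption; ignore_tokens and padding_token_id come from the enclosing __init__ scope):

import numpy as np

def replace_ignored_tokens(a: np.ndarray) -> np.ndarray:
    # Swap ignored label ids (e.g. -100) for the pad token id so the
    # sequences can be decoded with the tokenizer.
    a = a.copy()
    mask = np.isin(a, ignore_tokens)
    a[mask] = padding_token_id
    return a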
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
This PR adds basic support for logging raw evals using WandbCallback:

- trainer.evaluate() will log an eval table with inputs and outputs; and
- the behavior is opt-in via the WANDB_LOG_EVALS env var.

This also adds some changes to trainer internals to support this, including:

- extending EvalLoopOutput to include inputs

Here's an example table that gets logged when the user calls trainer.evaluate():

[Screenshot: example eval table logged to W&B]
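Minimal usage, per the description above (a sketch; model/dataset setup is elided, and "1" is just any truthy value since the callback only checks os.getenv):

import os

from transformers import Trainer

# Opt in to eval-table logging (the callback checks os.getenv("WANDB_LOG_EVALS")).
os.environ["WANDB_LOG_EVALS"] = "1"

# training_args is assumed to have report_to="wandb" so WandbCallback is active.
trainer = Trainer(model=model, args=training_args, eval_dataset=eval_dataset)
trainer.evaluate()  # also logs an eval table of inputs/outputs to W&B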
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@amyeroberts are you the right person to review?