This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Add lm-eval correctness test #210

Merged
dbarbuzzi merged 32 commits into main from add-lm-eval-correctness-test on May 10, 2024

Conversation

@dbarbuzzi dbarbuzzi commented Apr 25, 2024

Note

The PR description has been rewritten for clarity based on the current state of the work.

The main goal of this PR is to update the actions added by @mgoin in #166 and translate one of them into an end-to-end test building off of the server framework from #200.

This also updates some of the actions and workflows to ensure they are running leanly and at appropriate times:

  • an initial draft of a release workflow is added (runs only on-demand/manually)
  • the 'build-test' workflow is updated to support a 'WEEKLY' category (the sketch after this list shows how the smoke and full runs might differ)
    • the lm-eval smoke test is run nightly (Mon-Sat)
    • the lm-eval full test is run weekly (Sun)
  • the lm-eval smoke flow/action is updated to use the prebuilt wheel
  • the lm-eval full flow/action is updated to use the prebuilt wheel and run the new test version
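
For context on what the smoke and full variants might look like in practice, below is a minimal sketch using lm-eval-harness's simple_evaluate entry point (assuming its v0.4-style API). The task lists, sample limit, model choice, and the SMOKE_TEST environment variable are illustrative assumptions, not the PR's actual configuration.

import os

import lm_eval  # lm-eval-harness; assumes the v0.4-style simple_evaluate API

# Hypothetical split: the nightly smoke run evaluates one small task with a
# sample limit, while the weekly full run evaluates every task unrestricted.
smoke = os.environ.get("SMOKE_TEST", "1") == "1"
tasks = ["gsm8k"] if smoke else ["gsm8k", "hellaswag", "winogrande"]
limit = 250 if smoke else None

results = lm_eval.simple_evaluate(
    model="hf",  # illustrative; the PR's test instead builds on the server framework from #200
    model_args="pretrained=neuralmagic/llama-2-7b-chat-marlin",
    tasks=tasks,
    limit=limit,
    batch_size=8,
)
print(results["results"])  # per-task metric dict to compare against ground truth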

Pending

Per a conversation with @robertgshaw2-neuralmagic, the problematic marlin models have been disabled while work is underway to reduce their non-determinism. They can be re-evaluated/re-enabled at a later point.

A few models currently fail the test on occasion; the failures may stem from marlin's non-determinism, implementation issues, or a combination of the two (see the sketch at the end of this section for what the rtol values below mean):

  • neuralmagic/llama-2-7b-chat-marlin has failed with an rtol as high as 0.2
  • neuralmagic/phi-2-super-marlin has failed occasionally with rtol=0.05

Separately, mistralai/Mixtral-8x7B-Instruct-v0.1 is untested due to the significant hardware required.
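
For reference, the rtol values above are relative tolerances in the numpy.isclose sense, the same check used by the test's assertion. The sketch below illustrates the difference between the two thresholds with made-up numbers, not measured values from any run:

import numpy

ground_truth = 0.700    # illustrative lm-eval metric value, not real data
measured_value = 0.600  # illustrative measured value (~14% below ground truth)

# numpy.isclose(a, b, rtol=r) checks abs(a - b) <= atol + r * abs(b), so a
# deviation this large slips past the loose 0.2 tolerance but trips 0.05.
print(numpy.isclose(ground_truth, measured_value, rtol=0.2))   # True
print(numpy.isclose(ground_truth, measured_value, rtol=0.05))  # False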

@dbarbuzzi dbarbuzzi self-assigned this Apr 25, 2024
A review comment quoted the test's assertion:

    measured_value,
)

assert numpy.isclose(ground_truth, measured_value, rtol=0.05)


it might be nice to collect results for all of the metrics, and maybe all of the tasks, that are not close, then assert if there are any in error. that way all of the problems are reported, rather than having a developer fix one issue then getting an error on the next that they didn't fix yet.
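
A minimal sketch of the collect-then-assert pattern suggested here; the shape of the expected/measured dictionaries and the helper name are illustrative assumptions, not the test's actual layout:

import numpy

def check_all_metrics(expected, measured, rtol=0.05):
    # Compare every (task, metric) pair and collect all mismatches instead of
    # stopping at the first failure, so one run reports every problem at once.
    failures = []
    for task, metrics in expected.items():
        for metric, ground_truth in metrics.items():
            measured_value = measured[task][metric]
            if not numpy.isclose(ground_truth, measured_value, rtol=rtol):
                failures.append(
                    f"{task}/{metric}: expected {ground_truth}, got {measured_value}"
                )
    assert not failures, "metrics outside tolerance:\n" + "\n".join(failures)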

dbarbuzzi added a commit that referenced this pull request May 1, 2024
This job will be reinstated shortly after with PR #210
@dbarbuzzi dbarbuzzi changed the base branch from add-test-server-framework to main May 3, 2024 15:01
@dbarbuzzi dbarbuzzi requested a review from andy-neuma May 8, 2024 14:51

@andy-neuma andy-neuma left a comment


looking good

@dbarbuzzi dbarbuzzi merged commit affd4f4 into main May 10, 2024
12 checks passed
@dbarbuzzi dbarbuzzi deleted the add-lm-eval-correctness-test branch May 10, 2024 17:24