This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Add lm-eval correctness test #210

Merged
dbarbuzzi merged 32 commits into main from add-lm-eval-correctness-test on May 10, 2024

Conversation

@dbarbuzzi dbarbuzzi commented Apr 25, 2024

Note

The PR description has been rewritten for clarity based on the current state of the work.

The main goal of this PR is to update the actions added by @mgoin in #166 and translate one of them into an end-to-end test building off of the server framework from #200.

This also updates some of the actions and workflows to ensure they are running leanly and at appropriate times:

  • an initial draft of a release workflow is added (runs only on-demand/manually)
  • the 'build-test' workflow is updated to support a 'WEEKLY' category (the sketch after this list shows how the smoke and full runs might differ)
    • the lm-eval smoke test is run nightly (Mon-Sat)
    • the lm-eval full test is run weekly (Sun)
  • the lm-eval smoke flow/action is updated to use the prebuilt wheel
  • the lm-eval full flow/action is updated to use the prebuilt wheel and run the new test version
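
For context on what the smoke and full variants might look like in practice, below is a minimal sketch using lm-eval-harness's simple_evaluate entry point (assuming its v0.4-style API). The task lists, sample limit, model choice, and the SMOKE_TEST environment variable are illustrative assumptions, not the PR's actual configuration.

import os

import lm_eval  # lm-eval-harness; assumes the v0.4-style simple_evaluate API

# Hypothetical split: the nightly smoke run evaluates one small task with a
# sample limit, while the weekly full run evaluates every task unrestricted.
smoke = os.environ.get("SMOKE_TEST", "1") == "1"
tasks = ["gsm8k"] if smoke else ["gsm8k", "hellaswag", "winogrande"]
limit = 250 if smoke else None

results = lm_eval.simple_evaluate(
    model="hf",  # illustrative; the PR's test instead builds on the server framework from #200
    model_args="pretrained=neuralmagic/llama-2-7b-chat-marlin",
    tasks=tasks,
    limit=limit,
    batch_size=8,
)
print(results["results"])  # per-task metric dict to compare against ground truth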

Pending

Per a conversation with @robertgshaw2-neuralmagic, the problematic marlin models have been disabled while work is underway to reduce their non-determinism. They can be re-evaluated/re-enabled at a later point.

A few models currently fail the test on occasion; the failures may stem from marlin's non-determinism, implementation issues, or a combination of the two (see the sketch at the end of this section for what the rtol values below mean):

  • neuralmagic/llama-2-7b-chat-marlin has failed with an rtol as high as 0.2
  • neuralmagic/phi-2-super-marlin has failed occasionally with rtol=0.05

Separately, mistralai/Mixtral-8x7B-Instruct-v0.1 is untested due to the significant hardware required.
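
For reference, the rtol values above are relative tolerances in the numpy.isclose sense, the same check used by the test's assertion. The sketch below illustrates the difference between the two thresholds with made-up numbers, not measured values from any run:

import numpy

ground_truth = 0.700    # illustrative lm-eval metric value, not real data
measured_value = 0.600  # illustrative measured value (~14% below ground truth)

# numpy.isclose(a, b, rtol=r) checks abs(a - b) <= atol + r * abs(b), so a
# deviation this large slips past the loose 0.2 tolerance but trips 0.05.
print(numpy.isclose(ground_truth, measured_value, rtol=0.2))   # True
print(numpy.isclose(ground_truth, measured_value, rtol=0.05))  # False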

@dbarbuzzi dbarbuzzi self-assigned this Apr 25, 2024
A review comment quoted the test's assertion:

    measured_value,
)

assert numpy.isclose(ground_truth, measured_value, rtol=0.05)


it might be nice to collect results for all of the metrics, and maybe all of the tasks, that are not close, then assert if there are any in error. that way all of the problems are reported, rather than having a developer fix one issue then getting an error on the next that they didn't fix yet.
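
A minimal sketch of the collect-then-assert pattern suggested here; the shape of the expected/measured dictionaries and the helper name are illustrative assumptions, not the test's actual layout:

import numpy

def check_all_metrics(expected, measured, rtol=0.05):
    # Compare every (task, metric) pair and collect all mismatches instead of
    # stopping at the first failure, so one run reports every problem at once.
    failures = []
    for task, metrics in expected.items():
        for metric, ground_truth in metrics.items():
            measured_value = measured[task][metric]
            if not numpy.isclose(ground_truth, measured_value, rtol=rtol):
                failures.append(
                    f"{task}/{metric}: expected {ground_truth}, got {measured_value}"
                )
    assert not failures, "metrics outside tolerance:\n" + "\n".join(failures)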

dbarbuzzi added a commit that referenced this pull request May 1, 2024
This job will be reinstated shortly after with PR #210
@dbarbuzzi dbarbuzzi changed the base branch from add-test-server-framework to main May 3, 2024 15:01
@dbarbuzzi dbarbuzzi requested a review from andy-neuma May 8, 2024 14:51

@andy-neuma andy-neuma left a comment


looking good

@dbarbuzzi dbarbuzzi merged commit affd4f4 into main May 10, 2024
12 checks passed
@dbarbuzzi dbarbuzzi deleted the add-lm-eval-correctness-test branch May 10, 2024 17:24