
[Bug]: Models produce different output with different batch sizes #9567

Closed
joerunde opened this issue Oct 22, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@joerunde
Collaborator

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux 9.4 (Plow) (x86_64)
GCC version: (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.34

Python version: 3.12.1 (main, Aug 23 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (64-bit runtime)
Python platform: Linux-4.18.0-372.46.1.el8_6.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          80
On-line CPU(s) list:             0-79
Vendor ID:                       GenuineIntel
Model name:                      Intel Xeon Processor (Icelake)
CPU family:                      6
Model:                           134
Thread(s) per core:              2
Core(s) per socket:              20
Socket(s):                       2
Stepping:                        0
BogoMIPS:                        5600.04
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       2.5 MiB (80 instances)
L1i cache:                       2.5 MiB (80 instances)
L2 cache:                        160 MiB (40 instances)
L3 cache:                        32 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-39
NUMA node1 CPU(s):               40-79
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu124torch2.4
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.1.2
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] sentence-transformers==3.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev150+gd5fbb8706.d20241010
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	40-79	1		N/A
NIC0	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

Model Input Dumps

No response

🐛 Describe the bug

When the nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test model runs requests with temperature=0, the output changes depending on how the scheduler batches the requests. This seems to be the reason the lm-eval tests get different scores as the size of the KV cache is changed.

Slack thread for more context: https://vllm-dev.slack.com/archives/C07R5PAL2L9/p1729409919734939
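For intuition on why the divergence is total rather than gradual: with temperature=0, sampling is a greedy argmax over the logits, so when two candidate tokens are nearly tied, even a tiny batch-dependent numeric difference flips the selected token, and the two generations diverge entirely from that point on. A toy sketch with made-up logit values (not taken from the model):

```python
def greedy_pick(logits):
    """Return the index of the highest logit, as temperature=0 sampling does."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Two near-tied candidate tokens; a perturbation on the order of 1e-3
# is enough to change which token greedy decoding selects.
logits_run_a = [10.0000, 9.9995]   # token 0 wins
logits_run_b = [10.0000, 10.0005]  # same logits after a tiny numeric shift

print(greedy_pick(logits_run_a))  # 0
print(greedy_pick(logits_run_b))  # 1
```

Once the chosen token differs, it is fed back as context, so every subsequent token distribution differs too.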

Here's a small repro script that uses --max-num-seqs to force different batch sizes:

test_batch_weirdness.py
from vllm import LLM
import gc
import torch
import os
import json

from vllm.sampling_params import SamplingParams
from difflib import unified_diff

# Load up request data
CWD = os.path.dirname(os.path.abspath(__file__))
with open(f"{CWD}/request_data_small.json", "r") as f:
    data = json.load(f)
prompt_token_ids = [d['prompt_token_ids'] for d in data]

# Run once with no limit on batch size
llm = LLM("nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test")
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=SamplingParams(temperature=0, max_tokens=100))
batched_output_list = [i.outputs[0].text for i in outputs]

# Check that we get the same answer if we run these twice
sanity_check_outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=SamplingParams(temperature=0, max_tokens=100))
assert batched_output_list == [i.outputs[0].text for i in sanity_check_outputs]

# Run again with a batch size of 1
del llm
gc.collect()
torch.cuda.empty_cache()

llm = LLM("nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test", max_num_seqs=1)
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=SamplingParams(temperature=0, max_tokens=100))
serial_output_list = [i.outputs[0].text for i in outputs]

# Show the diff between the lists
for i in range(len(batched_output_list)):
    if batched_output_list[i] != serial_output_list[i]:
        print(f"\n\nDiff in output {i}: \n")
        diff = unified_diff(batched_output_list[i].splitlines(), serial_output_list[i].splitlines(), lineterm='')
        print('\n'.join(list(diff)))

And the input data for that: request_data_small.json

On my A100 machine this produces a diff like so:

Diff in output 1: 

--- 
+++ 
@@ -1,6 +1,6 @@
- There are 240 - 80 = <<240-80=160>>160 Chinese people.
-There are 60 boys on the Chinese team, so there are 160 - 60 = <<160-60=100>>100 girls on the Chinese team.
+ There are 240 - 80 = <<240-80=160>>160 Chinese.
+There are 60 boys, so there are 160 - 60 = <<160-60=100>>100 girls.
 #### 100
 
-Question: A bakery sells 250 loaves of bread per day. They sell 1/5 of their loaves to a local restaurant. How many loaves of bread does the bakery sell to the restaurant?
-Answer
+Question: A bakery sells 240 loaves of bread per day. They sell 1/3 of their loaves to a local restaurant. How many loaves of bread does the bakery sell to the restaurant?
+Answer: 1/3 of 240 is


Diff in output 2: 

--- 
+++ 
@@ -1,6 +1,6 @@
  Charlie has 3 times as many Facebook friends as Dorothy, so Dorothy has 12/3 = 4 Facebook friends.
-James has 4 times as many Facebook friends as Dorothy, so James has 4 x 4 = 16 Facebook friends.
+James has 4 times as many friends on Facebook as Dorothy, so James has 4 * 4 = 16 Facebook friends.
 #### 16
 
-Question: David's car gets 25 miles per gallon. He drives 300 miles. How many gallons of gas will he need?
-Answer: David's car gets 25 miles per gallon, so it will need 300
+Question: David's car is 5 years old. He has been driving it for 3 years. How many years old is his car in terms of its mileage?
+Answer: The car is 5 years old in


Diff in output 3: 

--- 
+++ 
@@ -1,6 +1,4 @@
- On Thursday, the mechanic earned 6 * $60 + 4 * $40 = <<6*60+4*40=360+160=520>>520 dollars.
-On Friday, the mechanic earned 12 * $40 = <<12*40=480>>480 dollars.
-The mechanic earned $480 - $520 = <<480-520=-40>>-$40 more on the day with higher revenue.
+ On Thursday, the mechanic earned 6 * $60 = $360 for truck tires and 4 * $40 = $160 for car tires.  So, the total revenue on Thursday was $360 + $160 = $520.
+On Friday, the mechanic earned 12 * $40 = $480 for car tires.  So, the total revenue on Friday was $480.
+The mechanic earned $480 - $520 = -$40 more on the day with higher revenue.
 #### -$40
-
-Question: A bakery sells a total of 250 loaves


Diff in output 5: 

--- 
+++ 
@@ -1,3 +1,3 @@
- Steve will take 3 miles / 440 feet per minute = <<3/440=0.06875>>0.06875 hours to get home.
-Tim will take 2 miles / 264 feet per minute = <<2/264=0.00758>>0.00758 hours to get home.
-Steve will be waiting 0.06875 - 0.00758 = <<0.06875-0.00758=0.06117>>0.06117
+ Steve will take 3 miles / 440 feet per minute = <<3/440=0.0682>>0.0682 hours to get home.
+Tim will take 2 miles / 264 feet per minute = <<2/264=0.0076>>0.0076 hours to get home.
+The difference in time is 0.0682 - 0.0076 = <<0.0682-0.0076=0.0606>>0.060


Diff in output 6: 

--- 
+++ 
@@ -1,3 +1,3 @@
  The tree will cost $90 to plant, so he will not earn any money for the first year.
-In the second year, he will earn $1.5 * 7 = $<<1.5*7=10.5>>10.5 from the lemons, but he will also spend $3 to water and feed the tree, so he will earn $10.5 - $3 = $<<10.5-3=7.5>>7.5.
-In
+In the second year, he will earn $1.5 * 7 = $<<1.5*7=10.5>>10.5 from the lemons, but it will cost $3 to water and feed the tree, so he will earn $10.5 - $3 = $<<10.5-3=7.5>>7.5.
+In the


Diff in output 7: 

--- 
+++ 
@@ -3,4 +3,4 @@
 In total, Tommy makes $129 + $92 = $<<129+92=221>>221
 #### 221
 
-Question: A bookshelf has
+Question: A bakery sells a


Diff in output 8: 

--- 
+++ 
@@ -1,8 +1,7 @@
- 30% of 1000 is 0.3 * 1000 = 300 students who went out through exit A.
+ 30% of 1000 is 0.3 * 1000 = 300 students.
 The remaining students are 1000 - 300 = 700.
-3/5 of the remaining students went out through exit B, which is 0.6 * 700 = 420 students.
-The remaining students are 700 - 420 = 280.
-The number of students who went out through exit C is 280.
+3/5 of the remaining students went out through exit B, which is 0.6 * 700 = 420.
+The number of students who went out through exit C is 700 - 420 = 280.
 #### 280
 
-Question
+Question: A snail is at the bottom of a 20-foot well


Diff in output 9: 

--- 
+++ 
@@ -1,6 +1,5 @@
  10 acres produce 10 x 5 = <<10*5=50>>50 tons of grapes.
-50 tons of grapes produce 50 x 2 = <<50*2=100>>100 barrels of wine.
+50 tons of grapes make 50 x 2 = <<50*2=100>>100 barrels of wine.
 #### 100
 
-Question: A bakery sells 250 loaves of bread per day.  They sell 1/4 of their loaves to a local restaurant.  How many loaves of bread does the bakery sell to the restaurant?
-Answer: The bakery sells 
+Question: A bakery sells a total of 250 loaves of bread per day. They sell a combination of whole wheat and white bread. If they sell 30 more loaves of whole wheat than white bread, and the total number of loaves of

@joerunde joerunde added the bug Something isn't working label Oct 22, 2024
@jeejeelee
Collaborator

I also encountered the issue of inconsistent inference results with different batch sizes. This was caused by a bug related to cudagraph, which is currently being fixed by #9549. I'm not sure if this is related to your problem.

@joerunde
Collaborator Author

Thanks for the pointer @jeejeelee! I just tested with the main branch and still see this behavior though :(

@joerunde
Collaborator Author

This also still occurs with --enforce-eager, so I think that should rule out cuda graph issues?

@joerunde
Collaborator Author

From looking at the logprobs returned, it seems like divergence happens pretty quickly in the sequence, and it's larger than just precision error. As an example, here are logprobs from a response from the batched run:

[
    {
        "2684": [
            " There",
            -1.726818561553955
        ],
        "5629": [
            " First",
            -1.726818561553955
        ],
        "578": [
            " The",
            -1.976818561553955
        ],
        "220": [
            " ",
            -2.601818561553955
        ],
        "94310": [
            " Subtract",
            -2.726818561553955
        ]
    },
    {
        "527": [
            " are",
            -0.5777992010116577
        ],
        "1051": [
            " were",
            -0.8277992010116577
        ],
        "374": [
            " is",
            -6.827799320220947
        ],
        "574": [
            " was",
            -7.577799320220947
        ],
        "596": [
            "'s",
            -9.077798843383789
        ]
    },
    {
        "220": [
            " ",
            -0.09035585820674896
        ],
        "264": [
            " a",
            -2.71535587310791
        ],
        "13517": [
            " originally",
            -5.34035587310791
        ],
        "1403": [
            " two",
            -6.09035587310791
        ],
        "304": [
            " in",
            -6.21535587310791
        ]
    },
    ...

And the same request from the serial run:

[
    {
        "2684": [
            " There",
            -1.726818561553955
        ],
        "5629": [
            " First",
            -1.726818561553955
        ],
        "578": [
            " The",
            -1.976818561553955
        ],
        "220": [
            " ",
            -2.601818561553955
        ],
        "94310": [
            " Subtract",
            -2.726818561553955
        ]
    },
    {
        "527": [
            " are",
            -0.6943115592002869
        ],
        "1051": [
            " were",
            -0.6943115592002869
        ],
        "374": [
            " is",
            -7.444311618804932
        ],
        "574": [
            " was",
            -7.819311618804932
        ],
        "596": [
            "'s",
            -9.694311141967773
        ]
    },
    {
        "220": [
            " ",
            -0.0713932141661644
        ],
        "264": [
            " a",
            -2.9463932514190674
        ],
        "13517": [
            " originally",
            -5.446393013000488
        ],
        "2860": [
            " total",
            -6.446393013000488
        ],
        "304": [
            " in",
            -6.446393013000488
        ]
    },
    ...

@joerunde
Collaborator Author

@tlrmchlsmth I tried to dump out some intermediate states of the model in both batched and serial runs to check the outputs of the cutlass kernel and from where I spot checked at least it looks like it gives the same results, so I haven't been able to track this down any further.

Do you know if this is expected behavior? I.e., is the kernel supposed to sacrifice this much accuracy for speed when processing batches?

@tlrmchlsmth
Collaborator

@joerunde thanks for digging in further; we should have somebody from Neural Magic dig in as well. Is it possible that this could be a bug outside of GEMM as well?

Losing accuracy with larger batch sizes is definitely not expected behavior. What can happen is that changes in the problem size result in different block sizes being used for the GEMM, which can affect the order of accumulation. If this is on an A100 then we're using the Marlin FP8 kernel, so the problem could be there. Do you know if the same thing happens on an H100?
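The accumulation-order point is easy to demonstrate: floating-point addition is not associative, so reordering the same fp32 additions, which is effectively what a different GEMM tiling does, can change the result. A minimal sketch:

```python
import numpy as np

a = np.float32(1e8)
b = np.float32(1.0)
c = np.float32(-1e8)

# fp32 addition is not associative: the 1.0 is absorbed into 1e8 when it
# is added first, but survives when the large terms cancel first.
left_to_right = (a + b) + c   # -> 0.0
cancel_first = (a + c) + b    # -> 1.0
print(left_to_right, cancel_first)  # 0.0 1.0
```

In a real GEMM the per-element differences are tiny rather than total, but they are enough to flip a near-tied greedy argmax.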

@joerunde
Copy link
Collaborator Author

Oof, H100s are hard to come by, but I'll ask around to see if I can snag some time on one to run this script and let you know

@tlrmchlsmth
Collaborator

@joerunde no worries, I'll run it

@tlrmchlsmth
Collaborator

H100 output on current main:

Diff in output 0: 

--- 
+++ 
@@ -2,4 +2,4 @@
 Two thirds of the Scottish unicorns are female, so there are 9*(2/3) = <<9*(2/3)=6>>6 female Scottish unicorns.
 #### 6
 
-Question: A certain company has 12 employees.  1/3 of them are
+Question: A bakery sells 250 loaves of bread per day.  They sell 


Diff in output 2: 

--- 
+++ 
@@ -2,5 +2,5 @@
 James has 4 times as many Facebook friends as Dorothy, so James has 4 * 4 = 16 Facebook friends.
 #### 16
 
-Question: David has 12 boxes of cereal, and each box contains 8 ounces of cereal. How many ounces of cereal does David have in total?
-Answer: David has 12 boxes of cereal, and each box contains
+Question: David's car is 10 years old. He has been driving it for 5 years. How many years old is his car in terms of its original age?
+Answer: The car is 10 years old in


Diff in output 3: 

--- 
+++ 
@@ -1,6 +1,3 @@
- On Thursday, the mechanic earned 6 * $60 + 4 * $40 = <<6*60+4*40=360+160=520>>520 dollars.
-On Friday, the mechanic earned 12 * $40 = <<12*40=480>>480 dollars.
-The mechanic earned $520 on Thursday and $480 on Friday.  The difference is $520 - $480 = $40.
-#### 40
-
-Question: A bakery sells a total of 250 loaves
+ On Thursday, the mechanic earned 6 * $60 = $360 for truck tires and 4 * $40 = $160 for car tires.  So, the total revenue on Thursday is $360 + $160 = $520.
+On Friday, the mechanic earned 12 * $40 = $480 for car tires.  So, the total revenue on Friday is $480.
+The mechanic earned $520 on Thursday and $480 on Friday.  The difference is $520 - $480


Diff in output 4: 

--- 
+++ 
@@ -1,4 +1,5 @@
- Sue's sister ate 5 cookies on Monday and 13 cookies on Tuesday, for a total of 5+13 = <<5+13=18>>18 cookies.
-Sue ate 4 times as many cookies as her sister on Monday, so she ate 4*5 = <<4*5=20>>20 cookies.
-Sue ate twice as many cookies as her sister on Tuesday, so she ate 2*13 = <<2*13=26>>26 cookies.
-The
+ Sue ate 4 times as many cookies as her sister on Monday, so she ate 4*5 = 20 cookies.
+She ate twice as many cookies as her sister on Tuesday, so she ate 2*13 = 26 cookies.
+In total, Sue ate 20 + 26 = 46 cookies.
+Her sister ate 5 + 13 = 18 cookies.
+The difference in calories is 46*200 - 18*200 = <<46*200-18*


Diff in output 5: 

--- 
+++ 
@@ -1,3 +1,3 @@
- Steve will take 3 miles / 440 feet per minute = <<3/440=0.0682>>0.0682 hours to get home.
-Tim will take 2 miles / 264 feet per minute = <<2/264=0.0076>>0.0076 hours to get home.
-Steve will be waiting for 0.0682 - 0.0076 = <<0.0682-0.0076=0.0606>>0.060
+ Steve will take 3 miles / 440 feet per minute = <<3/440=0.06875>>0.06875 hours to get home.
+Tim will take 2 miles / 264 feet per minute = <<2/264=0.00758>>0.00758 hours to get home.
+Steve will be waiting 0.06875 - 0.00758 = <<0.06875-0.00758=0.06117>>0.06117


Diff in output 7: 

--- 
+++ 
@@ -1,6 +1,6 @@
- Tommy sells 43 brownies for $3 each, so he makes 43 x 3 = $<<43*3=129>>129 from brownies.
-He sells 23 slices of cheesecake for $4 each, so he makes 23 x 4 = $<<23*4=92>>92 from cheesecakes.
+ Tommy sells 43 brownies for $3 a slice, so he makes 43 x 3 = $<<43*3=129>>129
+He sells 23 slices of cheesecake for $4 a slice, so he makes 23 x 4 = $<<23*4=92>>92
 In total, Tommy makes $129 + $92 = $<<129+92=221>>221
 #### 221
 
-Question: A bakery sells a
+Question: A bookshelf has 5 shelves, and


Diff in output 8: 

--- 
+++ 
@@ -1,7 +1,8 @@
- 30% of 1000 is 0.3 * 1000 = 300 students.
+ 30% of 1000 is 0.3 * 1000 = 300 students who went out through exit A.
 The remaining students are 1000 - 300 = 700.
 3/5 of the remaining students went out through exit B, which is 0.6 * 700 = 420 students.
-The number of students who went out through exit C is 700 - 420 = 280 students.
+The remaining students are 700 - 420 = 280.
+The number of students who went out through exit C is 280.
 #### 280
 
-Question: A bakery sells 250 loaves of bread per day
+Question


Diff in output 9: 

--- 
+++ 
@@ -2,5 +2,5 @@
 50 tons of grapes produce 50 x 2 = <<50*2=100>>100 barrels of wine.
 #### 100
 
-Question: A bakery sells 250 loaves of bread per day.  They sell each loaf for $2.  How much money does the bakery make in a day?
-Answer: The bakery sells 250 loaves of bread per day.  Each
+Question: A bakery sells 250 loaves of bread per day.  They sell a total of 7500 loaves of bread in a week.  How many days did it take them to sell that many loaves of bread?
+Answer:`

@joerunde joerunde changed the title [Bug]: int8 models produce different output with different batch sizes [Bug]: Models produce different output with different batch sizes Oct 23, 2024
@joerunde
Collaborator Author

@tlrmchlsmth thanks for finding an H100!

Is it possible that this could be a bug outside of GEMM as well?

Yeah, so I went and checked this with plain old meta-llama/Meta-Llama-3-8B-Instruct and see the same behavior. I'm assuming that the non-quantized model would be using a different kernel, right?

I can go check on some other non-llama models as well to see if this is a llama-specific issue, just having some gpu acquisition problems atm :/

@joerunde
Collaborator Author

Ah, actually when using dtype=float32 with meta-llama/Meta-Llama-3-8B-Instruct there are no differences in the outputs, so this could just be a numeric precision problem after all

@joerunde
Collaborator Author

@tjohnson31415 reminded me that when looking for precision issues, it's actually the logits that we care about and not the logprobs that are calculated from the logits. It's possible that changing a logit to the next representable number will cause a much larger difference in the calculated logprob.
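For scale: one last-bit step in a logit is much larger in fp16 than in fp32, so logprob differences of the size seen above are not far from single-ULP logit noise at half precision. A rough sketch (the `ulp_at` helper and the magnitude of 10 are illustrative, not from the model):

```python
import math

def ulp_at(x: float, mantissa_bits: int) -> float:
    """Spacing between adjacent representable values at magnitude x
    for a binary float with the given number of mantissa bits."""
    exponent = math.floor(math.log2(abs(x)))
    return 2.0 ** (exponent - mantissa_bits)

# At a typical logit magnitude of ~10:
print(ulp_at(10.0, 10))  # fp16 (10 mantissa bits): 0.0078125
print(ulp_at(10.0, 23))  # fp32 (23 mantissa bits): ~9.5e-07
```

So at fp16-like precision, two runs that agree to within a handful of ULPs in the logits can still show logprob deltas around 0.01 to 0.1 once accumulated over a layer stack.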

It seems less than ideal that the benchmark scores change so much because of this on quantized models, but as @robertgshaw2-neuralmagic said, we were only using 250 samples to test nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test. Hopefully we'll see much smaller differences with larger sample sizes.

Any objections to closing as working as expected?

@tlrmchlsmth
Collaborator

@joerunde I don't have any objections. I think you've done a thorough job running this down. Good call on checking fp32 as well
