CUDA: an illegal memory access was encountered with Mistral FP8 Marlin kernels on NVIDIA driver 535.216.01 (AWS Sagemaker Real-time Inference) #2915

dwyatte opened this issue Jan 15, 2025 · 2 comments

dwyatte commented Jan 15, 2025

System Info

Tested with the text-generation-inference 2.4.0 and 3.0.0 Docker containers, running the CLI from within the container, on Sagemaker Real-time Inference (NVIDIA driver 535.216.01).

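For reference, this amounts to running text-generation-launcher inside the official TGI container. A rough local equivalent (image tag, port mapping, and volume mount are assumptions here, not the exact Sagemaker wiring) would be:

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:3.0.0 \
    --model-id prometheus-eval/prometheus-7b-v2.0 --quantize fp8
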
Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Launching a Mistral-based model with FP8 Marlin kernels raises a CUDA illegal memory access error on server startup when using NVIDIA driver 535.216.01. The error happens during model warmup. We have reproduced the error with multiple text-generation-inference versions (3.0.0 and 2.4.0) and multiple GPU models (A10G and L40S).

  • This seems limited to NVIDIA driver 535.216.01 (CUDA 12.2), which is used by AWS Sagemaker Real-time Inference. The error is not raised, for example, on driver 550.127.05 (CUDA 12.4), which is used by other AWS Sagemaker products (the driver and CUDA versions can be confirmed as shown below).
  • This seems limited to Mistral-based models. We have successfully run other models such as Qwen 2 with FP8 Marlin kernels on both of the above NVIDIA drivers.
  • Mistral-based models run fine without FP8 Marlin quantization, regardless of NVIDIA driver.

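For completeness, the driver and CUDA versions reported above can be confirmed on the instance with nvidia-smi, e.g.:

nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi   # the header also reports the CUDA version supported by the driver
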
text-generation-inference 3.0.0
text-generation-launcher --model-id=prometheus-eval/prometheus-7b-v2.0 --quantize=fp8

2025-01-15T18:32:53.009058Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 321, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 728, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 197, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
    return callback(**use_params)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
    server.serve(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 183, in Prefill
    generations, next_batch, timings = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 2145, in generate_token
    adapter_segments, _ = find_segments(batch.adapter_meta.adapter_indices)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/segments.py", line 15, in find_segments
    adapter_indices = adapter_indices.cpu().numpy()
RuntimeError: CUDA error: an illegal memory access was encountered

text-generation-inference 2.4.0

2025-01-15T18:44:06.110631Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 116, in serve
    server.serve(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 303, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 171, in Prefill
    generations, next_batch, timings = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 2024, in generate_token
    copy_next_input_ids_inplace(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/metadata_kernels.py", line 113, in copy_next_input_ids_inplace
    triton_copy_next_input_ids_inplace[grid](
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 691, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/opt/conda/lib/python3.11/site-packages/triton/compiler/compiler.py", line 381, in __getattribute__
    self._init_handles()
  File "/opt/conda/lib/python3.11/site-packages/triton/compiler/compiler.py", line 376, in _init_handles
    self.module, self.function, self.n_regs, self.n_spills = driver.active.utils.load_binary(
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered

Expected behavior

We would like to leverage FP8 Marlin quantization for Mistral-based models on Sagemaker Real-time Inference, which is currently limited to NVIDIA driver 535.216.01.

danieldk (Member) commented:

Any chance you could run with CUDA_LAUNCH_BLOCKING=1, which may help pinpoint the source of the error? It's also worth testing with USE_CUTLASS_W8A8=1, which will use the CUTLASS GEMM kernels instead (only works on compute capability 8.9, so not on A10).
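
For reference, this amounts to prefixing the launcher invocation from the reproduction above with the environment variable, e.g. (a sketch, not the exact Sagemaker command):

CUDA_LAUNCH_BLOCKING=1 text-generation-launcher --model-id=prometheus-eval/prometheus-7b-v2.0 --quantize=fp8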


dwyatte commented Jan 17, 2025

Assuming text-generation-inference 3.0.0 from here on unless otherwise noted.

With CUDA_LAUNCH_BLOCKING=1, the source of the error looks to be flashinfer/BatchPrefillWithPagedKVCache (full launcher log below).

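For reference, this run corresponds roughly to the following invocation (reconstructed from the Args dump in the log below; the weights were pre-staged at /tmp/tgi/model):

CUDA_LAUNCH_BLOCKING=1 text-generation-launcher --model-id=/tmp/tgi/model --quantize=fp8 --num-shard=1 --port=8081
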
2025-01-17T15:08:27.088671Z  INFO text_generation_launcher: Args {
    model_id: "/tmp/tgi/model",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: Some(
        1,
    ),
    quantize: Some(
        Fp8,
    ),
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "container-0.local",
    port: 8081,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
}
2025-01-17T15:08:29.214016Z  INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2025-01-17T15:08:29.236348Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 24147
2025-01-17T15:08:29.236367Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-01-17T15:08:29.236472Z  INFO download: text_generation_launcher: Starting check and download process for /tmp/tgi/model
2025-01-17T15:08:31.846913Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-01-17T15:08:32.344346Z  INFO download: text_generation_launcher: Successfully downloaded weights for /tmp/tgi/model
2025-01-17T15:08:32.344542Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-01-17T15:08:34.992143Z  INFO text_generation_launcher: Using prefix caching = True
2025-01-17T15:08:34.992236Z  INFO text_generation_launcher: Using Attention = flashinfer
2025-01-17T15:08:37.239049Z  INFO text_generation_launcher: GPU does not support FP8, using Marlin FP8 kernel
2025-01-17T15:08:42.358040Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-01-17T15:08:52.366719Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-01-17T15:08:56.671801Z  INFO text_generation_launcher: Using prefill chunking = True
2025-01-17T15:08:56.890531Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-01-17T15:08:56.970775Z  INFO shard-manager: text_generation_launcher: Shard ready in 24.621139621s rank=0
2025-01-17T15:08:57.063656Z  INFO text_generation_launcher: Starting Webserver
2025-01-17T15:08:57.108499Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-01-17T15:08:57.189235Z  INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-01-17T15:09:02.076025Z  INFO text_generation_launcher: KV-cache blocks: 241374, size: 1
2025-01-17T15:09:02.240286Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2025-01-17T15:09:04.185746Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 241374
2025-01-17T15:09:04.185767Z  WARN text_generation_router_v3::backend: backends/v3/src/backend.rs:39: Model supports prefill chunking. `waiting_served_ratio` and `max_waiting_tokens` will be ignored.
2025-01-17T15:09:04.185803Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
2025-01-17T15:09:04.185810Z  INFO text_generation_router: backends/v3/src/main.rs:162: Maximum input tokens defaulted to 241373
2025-01-17T15:09:04.185814Z  INFO text_generation_router: backends/v3/src/main.rs:168: Maximum total tokens defaulted to 241374
2025-01-17T15:09:06.016609Z  INFO text_generation_router::server: router/src/server.rs:1873: Using config Some(Mistral)
2025-01-17T15:09:06.084230Z  WARN text_generation_router::server: router/src/server.rs:1913: no pipeline tag found for model /tmp/tgi/model
2025-01-17T15:09:06.084248Z  WARN text_generation_router::server: router/src/server.rs:2015: Invalid hostname, defaulting to 0.0.0.0
2025-01-17T15:09:06.098654Z  INFO text_generation_router::server: router/src/server.rs:2402: Connected
2025-01-17T15:09:08.428099Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 321, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 728, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 197, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
    return callback(**use_params)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
    server.serve(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 183, in Prefill
    generations, next_batch, timings = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1953, in generate_token
    out, speculative_logits = self.forward(batch, adapter_data)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1848, in forward
    logits, speculative_logits = self.model.forward(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 524, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 448, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 373, in forward
    attn_output = self.self_attn(
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 223, in forward
    attn_output = attention(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/attention/cuda.py", line 233, in attention
    return prefill_with_paged_kv_state.get().forward(
  File "/opt/conda/lib/python3.11/site-packages/flashinfer/prefill.py", line 879, in forward
    return self.run(q, paged_kv_cache, k_scale=k_scale, v_scale=v_scale)
  File "/opt/conda/lib/python3.11/site-packages/flashinfer/prefill.py", line 939, in run
    out = self._wrapper.run(
RuntimeError: BatchPrefillWithPagedKVCache failed with error code an illegal memory access was encountered
2025-01-17T15:09:08.428381Z ERROR health:health:prefill{id=18446744073709551615 size=1}:prefill{id=18446744073709551615 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: BatchPrefillWithPagedKVCache failed with error code an illegal memory access was encountered
2025-01-17T15:09:09.381346Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2025-01-17 15:08:33.704 | INFO     | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
/opt/conda/lib/python3.11/site-packages/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py:79: FutureWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
  return func(*args, **kwargs)
CUDA Error: an illegal memory access was encountered (700) /tmp/build-via-sdist-fmqwe4he/flashinfer-0.1.6+cu124torch2.4/include/flashinfer/attention/prefill.cuh: line 2370 at function cudaLaunchKernel((void*)kernel, nblks, nthrs, args, smem_size, stream)
Traceback (most recent call last):
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 787, in grpc._cython.cygrpc._schedule_rpc_coro
asyncio.exceptions.CancelledError rank=0
2025-01-17T15:09:09.410357Z ERROR text_generation_launcher: Shard 0 crashed
2025-01-17T15:09:09.410578Z  INFO text_generation_launcher: Terminating webserver
2025-01-17T15:09:09.410735Z  INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
2025-01-17T15:09:09.411160Z  INFO text_generation_router::server: router/src/server.rs:2494: signal received, starting graceful shutdown
2025-01-17T15:09:09.711162Z  INFO text_generation_launcher: webserver terminated
2025-01-17T15:09:09.711187Z  INFO text_generation_launcher: Shutting down shards
Error: ShardFailed

It looks like using USE_CUTLASS_W8A8=1 to force the CUTLASS FP8 kernels on L40S does work (thanks for the suggestion; we had mixed success with this in the past), but as I understand it, we may take an additional accuracy hit compared to the Marlin W8A16 kernels.
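
Concretely, this amounts to the same launch with the override set, e.g. (a sketch of the workaround, not the exact Sagemaker command):

USE_CUTLASS_W8A8=1 text-generation-launcher --model-id=prometheus-eval/prometheus-7b-v2.0 --quantize=fp8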
