[Pass] PruneRelaxFunc to remove Relax function based on target #1555

Merged
junrushao merged 1 commit into mlc-ai:main on Jan 8, 2024

Conversation

MasterJH5574 (Member)

We recently noticed that when FlashInfer is not built (due to an unsupported CUDA architecture or platform), running the single-sequence ChatModule hits a VM function initialization error. The missing function is only used in `create_flashinfer_paged_kv_cache`, which is never actually invoked in the single-sequence flow.

This happens because the Relax VM eagerly looks up every referenced PackedFunc at initialization time rather than loading it lazily. So even though `create_flashinfer_paged_kv_cache` is never invoked, its PackedFuncs are still looked up, and the lookup fails whenever FlashInfer is unavailable.

This PR adds a compiler pass that removes `create_flashinfer_paged_kv_cache` (and other similar functions that may be introduced in the future) based on the target, which effectively addresses the issue.
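For reference, here is a minimal sketch of what such a pruning pass could look like as a TVM module pass. The constructor flag `flashinfer` and the overall structure are assumptions for illustration and may differ from the code in this PR:

```python
import tvm
from tvm import IRModule


@tvm.transform.module_pass(opt_level=0, name="PruneRelaxFunc")
class PruneRelaxFunc:  # pylint: disable=too-few-public-methods
    """Remove Relax functions that cannot be initialized under the current build."""

    def __init__(self, flashinfer: bool) -> None:
        # Whether FlashInfer kernels are available for the compilation target
        # (assumed flag name for this sketch).
        self.flashinfer = flashinfer

    def transform_module(self, mod: IRModule, _ctx: tvm.transform.PassContext) -> IRModule:
        func_dict = {}
        for gvar, func in mod.functions.items():
            # Drop the FlashInfer paged-KV-cache constructor when FlashInfer is
            # unavailable, so the VM never looks up its missing PackedFuncs.
            if gvar.name_hint == "create_flashinfer_paged_kv_cache" and not self.flashinfer:
                continue
            func_dict[gvar] = func
        ret_mod = IRModule(func_dict)
        if mod.attrs is not None:
            ret_mod = ret_mod.with_attrs(mod.attrs)
        return ret_mod
```

Because the pruned functions no longer exist in the module, the VM has nothing to eagerly initialize for them.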

```diff
@@ -75,7 +76,8 @@ def _mlc_llm_pipeline(  # pylint: disable=too-many-arguments
     def _pipeline(mod: tvm.ir.IRModule, _ctx: tvm.transform.PassContext) -> tvm.ir.IRModule:
         seq = tvm.transform.Sequential(
             [
-                # Phase 0. Add additional information for compilation
+                # Phase 0. Add additional information for compilation and remove unused Relax func
+                PruneRelaxFunc(),
```
Member

How about this: let's add a boolean flag `flashinfer_enabled` in `_mlc_llm_pipeline`, which comes from `OptimizationFlags`, and based on this flag we may choose to prune the FlashInfer-related functions.

MasterJH5574 (Member, Author)

Just updated with a new flag here. Actually, I think it might be more extensible to pass the entire compilation flag object to the pipeline, given that right now we already pass `flashinfer` and `cublas-gemm`. What do you think?

Member

The principle I'm trying to maintain here is that we want every `nn.Module` to be standalone-compilable and debuggable, meaning:

`SomeModule(...).jit(format="torch", pipeline=..., pipeline_args=...)`

and it would be ideal if `pipeline_args` were package-independent, i.e. one doesn't have to import anything extra to create a pipeline. In this sense, while there is indeed an abusively huge number of parameters to tweak, I believe we don't have to aggregate them into one dataclass that requires extra import logic, as long as we document them clearly.
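To make the trade-off concrete, here is a hedged sketch of the flag-based style discussed above: the pipeline accepts plain booleans, so callers need no extra imports, while any package-level flag object stays on the caller's side. The names `flashinfer`, `cublas_gemm`, and `OptimizationFlags` mirror this thread and are illustrative rather than the exact repository API; the `PruneRelaxFunc` stub stands in for the pass sketched earlier.

```python
from dataclasses import dataclass

import tvm


# Stand-in for the PruneRelaxFunc pass sketched earlier in this thread.
@tvm.transform.module_pass(opt_level=0, name="PruneRelaxFunc")
class PruneRelaxFunc:  # pylint: disable=too-few-public-methods
    def __init__(self, flashinfer: bool) -> None:
        self.flashinfer = flashinfer

    def transform_module(self, mod: tvm.ir.IRModule, _ctx: tvm.transform.PassContext) -> tvm.ir.IRModule:
        return mod  # pruning logic omitted here; see the sketch above


def _mlc_llm_pipeline(flashinfer: bool = True, cublas_gemm: bool = False):
    """Build the compilation pipeline from plain built-in flags (package-independent args)."""

    @tvm.transform.module_pass(opt_level=0)
    def _pipeline(mod: tvm.ir.IRModule, _ctx: tvm.transform.PassContext) -> tvm.ir.IRModule:
        seq = tvm.transform.Sequential(
            [
                # Phase 0. Prune Relax functions unsupported by the current build.
                PruneRelaxFunc(flashinfer=flashinfer),
                # cublas_gemm would gate a corresponding dispatch pass in the same way;
                # the remaining phases are elided in this sketch.
            ]
        )
        return seq(mod)

    return _pipeline


@dataclass
class OptimizationFlags:
    """Illustrative package-level flag object; only plain booleans cross the pipeline boundary."""

    flashinfer: bool = True
    cublas_gemm: bool = False


flags = OptimizationFlags(flashinfer=False)
pipeline = _mlc_llm_pipeline(flashinfer=flags.flashinfer, cublas_gemm=flags.cublas_gemm)
```

With this shape, a user can construct the pipeline without importing `OptimizationFlags` at all, which is the package-independence argued for above.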

junrushao merged commit eddc5b1 into mlc-ai:main on Jan 8, 2024
MasterJH5574 added a commit to MasterJH5574/mlc-llm that referenced this pull request Jan 9, 2024
MasterJH5574 added a commit to MasterJH5574/mlc-llm that referenced this pull request Jan 9, 2024
MasterJH5574 added a commit to MasterJH5574/mlc-llm that referenced this pull request Jan 9, 2024