
[Make] Add USE FLASHINFER into doc and gen cmake config #1743

Merged
merged 1 commit into from
Feb 13, 2024

Conversation

CharlieFRuan
Contributor

When using CUDA with compute capability of at least 80, the runtime must be built with USE_FLASHINFER ON, and both FLASHINFER_CUDA_ARCHITECTURES and CMAKE_CUDA_ARCHITECTURES must be specified. Otherwise, users may run into errors like Cannot find PackedFunc flashinfer.single_prefill in either Relax VM kernel library, or in TVM runtime PackedFunc registry, or in global Relax functions of the VM executable.

This is related to issues #1728 and #1551
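For reference, a minimal config.cmake fragment along these lines might look as follows. The arch value 80 (sm_80, Ampere) is illustrative only; set it to match the GPUs you target:

```cmake
# Illustrative sketch: enable FlashInfer and pin the CUDA architectures.
set(USE_FLASHINFER ON)
set(FLASHINFER_CUDA_ARCHITECTURES 80)
set(CMAKE_CUDA_ARCHITECTURES 80)
```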

@CharlieFRuan CharlieFRuan changed the title Add USE FLASHINFER into doc and gen cmake config [Make] Add USE FLASHINFER into doc and gen cmake config Feb 12, 2024
@tqchen
Contributor

tqchen commented Feb 12, 2024

cc @yzh119

@CharlieFRuan
Contributor Author

For more context: when compiling a model, the optimization flag flashinfer is true only when the target is CUDA and every detected compute capability is at least 80:

```python
# Nested helper; `self.flashinfer` and `logger` come from the enclosing scope.
def _flashinfer(target) -> bool:
    from mlc_chat.support.auto_target import (  # pylint: disable=import-outside-toplevel
        detect_cuda_arch_list,
    )

    if not self.flashinfer:
        return False
    if target.kind.name != "cuda":
        return False
    arch_list = detect_cuda_arch_list(target)
    for arch in arch_list:
        if arch < 80:
            logger.warning("flashinfer is not supported on CUDA arch < 80")
            return False
    return True
```
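The gating logic above can be sketched as a standalone function (names here are hypothetical, for illustration only):

```python
def flashinfer_supported(target_kind: str, arch_list: list, enabled: bool = True) -> bool:
    """Return True only for CUDA targets whose every arch is >= 80."""
    if not enabled:
        return False
    if target_kind != "cuda":
        return False
    return all(arch >= 80 for arch in arch_list)


print(flashinfer_supported("cuda", [80, 89]))  # True: all archs >= 80
print(flashinfer_supported("cuda", [75]))      # False: sm_75 < 80
print(flashinfer_supported("rocm", [90]))      # False: not a CUDA target
```

Note that a single arch below 80 disables flashinfer for the whole build, which mirrors the early-return in the loop above.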

Otherwise, we prune create_flashinfer_paged_kv_cache so that the corresponding methods are not looked up at runtime:

```python
func_dict = {}
for g_var, func in mod.functions_items():
    # Drop "create_flashinfer_paged_kv_cache" for unsupported targets
    if g_var.name_hint == "create_flashinfer_paged_kv_cache" and not self.flashinfer:
        continue
    func_dict[g_var] = func
ret_mod = IRModule(func_dict)
```
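The pruning step is just a conditional filter over the module's functions. A standalone sketch, using a plain dict in place of an IRModule (the real code iterates mod.functions_items()):

```python
def prune_functions(functions: dict, flashinfer_enabled: bool) -> dict:
    """Drop create_flashinfer_paged_kv_cache when flashinfer is disabled."""
    return {
        name: func
        for name, func in functions.items()
        if not (name == "create_flashinfer_paged_kv_cache" and not flashinfer_enabled)
    }


funcs = {"create_flashinfer_paged_kv_cache": "kv_init", "prefill": "prefill_fn"}
print(sorted(prune_functions(funcs, False)))  # ['prefill']
print(sorted(prune_functions(funcs, True)))   # ['create_flashinfer_paged_kv_cache', 'prefill']
```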

Therefore, if a user compiles the model on a "qualified" machine (CUDA arch >= 80) but does not build the TVM runtime with USE_FLASHINFER ON, they will hit the Cannot find PackedFunc error at runtime.

@tqchen tqchen merged commit 9f67f37 into mlc-ai:main Feb 13, 2024
1 check passed