Unable to build fast_layer_norm #1636

ZhiyuanChen · 2023-04-12T15:24:42Z

Describe the Bug

Minimal Steps/Code to Reproduce the Bug

pip install -v --no-cache-dir \
  --global-option="--cpp_ext" \
  --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" \
  --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" \
  --global-option="--fast_bottleneck" \
  --global-option="--fused_conv_bias_relu" \
  --global-option="--cudnn_gbn" \
  --global-option="--fmha" \
  --global-option="--focal_loss" \
  --global-option="--fast_layer_norm" \
  --global-option="--bnp" \
  ./

Output

Using pip 23.0.1 from /mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/pip (python 3.10)
WARNING: Implying --no-binary=:all: due to the presence of --build-option / --global-option / --install-option. Consider using --config-settings for more flexibility.
DEPRECATION: --no-binary currently disables reading from the cache of locally built wheels. In the future --no-binary will not influence the wheel cache. pip 23.1 will enforce this behaviour change. A possible replacement is to use the --no-cache-dir option. You can use the flag --use-feature=no-binary-enable-wheel-cache to test the upcoming behaviour. Discussion can be found at https://github.com/pypa/pip/issues/11453
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Processing /home/chenzhiyuan/apex
  Running command python setup.py egg_info


  torch.__version__  = 2.0.0


  running egg_info
  creating /tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info
  writing /tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/PKG-INFO
  writing dependency_links to /tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/dependency_links.txt
  writing requirements to /tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/requires.txt
  writing top-level names to /tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/top_level.txt
  writing manifest file '/tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/SOURCES.txt'
  reading manifest file '/tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/SOURCES.txt'
  adding license file 'LICENSE'
  writing manifest file '/tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/SOURCES.txt'
  Preparing metadata (setup.py) ... done
Requirement already satisfied: packaging>20.6 in /mnt/shared/mamba/envs/dev/lib/python3.10/site-packages (from apex==0.1) (23.0)
Installing collected packages: apex
  DEPRECATION: apex is being installed using the legacy 'setup.py install' method, because the '--no-binary' option was enabled for it and this currently disables local wheel building for projects that don't have a 'pyproject.toml' file. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/11451
  Running command Running setup.py install for apex


  torch.__version__  = 2.0.0



  Compiling cuda extensions with
  nvcc: NVIDIA (R) Cuda compiler driver
  Copyright (c) 2005-2022 NVIDIA Corporation
  Built on Wed_Sep_21_10:33:58_PDT_2022
  Cuda compilation tools, release 11.8, V11.8.89
  Build cuda_11.8.r11.8/compiler.31833905_0
  from /mnt/shared/mamba/envs/dev/bin

  running install
  /mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
    warnings.warn(
  running build
  running build_py
  running build_ext
  /mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/utils/cpp_extension.py:398: UserWarning: There are no g++ version bounds defined for CUDA version 11.8
    warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
  building 'apex_C' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/flatten_unflatten.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-310/apex_C.cpython-310-x86_64-linux-gnu.so
  building 'amp_C' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/amp_C_frontend.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_adagrad.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_adam.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_axpby_kernel.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_l2norm_kernel.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_l2norm_kernel_mp.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_l2norm_scale_kernel.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_lamb.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_lamb_mp.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_lamb_stage_1.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_lamb_stage_2.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_novograd.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_scale_kernel.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_sgd_kernel.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/amp_C.cpython-310-x86_64-linux-gnu.so
  building 'syncbn' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/syncbn.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/welford.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/syncbn.cpython-310-x86_64-linux-gnu.so
  building 'fused_layer_norm_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/layer_norm_cuda.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/layer_norm_cuda_kernel.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/fused_layer_norm_cuda.cpython-310-x86_64-linux-gnu.so
  building 'mlp_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/mlp.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/mlp_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/mlp_cuda.cpython-310-x86_64-linux-gnu.so
  building 'fused_dense_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/fused_dense.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/fused_dense_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/fused_dense_cuda.cpython-310-x86_64-linux-gnu.so
  building 'scaled_upper_triang_masked_softmax_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_upper_triang_masked_softmax.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_upper_triang_masked_softmax_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/scaled_upper_triang_masked_softmax_cuda.cpython-310-x86_64-linux-gnu.so
  building 'generic_scaled_masked_softmax_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/generic_scaled_masked_softmax.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/generic_scaled_masked_softmax_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/generic_scaled_masked_softmax_cuda.cpython-310-x86_64-linux-gnu.so
  building 'scaled_masked_softmax_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_masked_softmax.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_masked_softmax_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/scaled_masked_softmax_cuda.cpython-310-x86_64-linux-gnu.so
  building 'scaled_softmax_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_softmax.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_softmax_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/scaled_softmax_cuda.cpython-310-x86_64-linux-gnu.so
  building 'fused_weight_gradient_mlp_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/fused_weight_gradient_dense.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/fused_weight_gradient_dense_16bit_prec_cuda.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/fused_weight_gradient_dense_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/fused_weight_gradient_mlp_cuda.cpython-310-x86_64-linux-gnu.so
  building 'xentropy_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/xentropy/interface.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/xentropy/xentropy_kernel.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/xentropy_cuda.cpython-310-x86_64-linux-gnu.so
  building 'focal_loss_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/focal_loss/focal_loss_cuda.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/focal_loss/focal_loss_cuda_kernel.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/focal_loss_cuda.cpython-310-x86_64-linux-gnu.so
  building 'fused_adam_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/optimizers/fused_adam_cuda.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/optimizers/fused_adam_cuda_kernel.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/fused_adam_cuda.cpython-310-x86_64-linux-gnu.so
  building 'fast_layer_norm' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/2] /mnt/shared/mamba/envs/dev/bin/nvcc  -I/home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/TH -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/THC -I/mnt/shared/mamba/envs/dev/include -I/mnt/shared/mamba/envs/dev/include/python3.10 -c -c /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu -o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ -I./apex/contrib/csrc/layer_norm/ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fast_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  FAILED: /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.o
  /mnt/shared/mamba/envs/dev/bin/nvcc  -I/home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/TH -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/THC -I/mnt/shared/mamba/envs/dev/include -I/mnt/shared/mamba/envs/dev/include/python3.10 -c -c /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu -o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ -I./apex/contrib/csrc/layer_norm/ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fast_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(113): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(133): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(138): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(143): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(150): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(171): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(172): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(172): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(180): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(190): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(12): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(90): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(95): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(100): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(105): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(154): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(160): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(166): error: identifier "uint16_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(172): error: identifier "uint8_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(280): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(321): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(325): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(325): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(366): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(366): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(366): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(421): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(421): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Reducer<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(422): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Reducer<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): warning #842-D: constant "WARPS_N" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Reducer<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(467): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(562): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(562): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(562): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(584): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Stats<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): warning #842-D: constant "WARPS_N" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Stats<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(633): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(649): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(690): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(690): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Stats<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(691): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(705): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(7): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(13): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(32): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(38): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(39): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(87): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(88): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(89): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(90): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(91): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(85): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(86): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(87): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(88): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(89): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(91): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(92): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(93): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(94): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(95): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(97): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(98): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(99): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(100): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(101): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(103): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(104): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(105): error: identifier "uint32_t" is undefined

  Error limit reached.
  100 errors detected in the compilation of "/home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu".
  Compilation terminated.
  [2/2] /mnt/shared/mamba/envs/dev/bin/nvcc  -I/home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/TH -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/THC -I/mnt/shared/mamba/envs/dev/include -I/mnt/shared/mamba/envs/dev/include/python3.10 -c -c /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu -o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ -I./apex/contrib/csrc/layer_norm/ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fast_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  FAILED: /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.o
  /mnt/shared/mamba/envs/dev/bin/nvcc  -I/home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/TH -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/THC -I/mnt/shared/mamba/envs/dev/include -I/mnt/shared/mamba/envs/dev/include/python3.10 -c -c /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu -o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ -I./apex/contrib/csrc/layer_norm/ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fast_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(113): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(133): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(138): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(143): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(150): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(171): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(172): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(172): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(180): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(190): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(12): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(90): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(95): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(100): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(105): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(154): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(160): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(166): error: identifier "uint16_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(172): error: identifier "uint8_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(280): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(321): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(325): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(325): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(366): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(366): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(366): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(421): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(421): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Reducer<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(422): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Reducer<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): warning #842-D: constant "WARPS_N" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Reducer<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(467): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(562): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(562): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(562): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(584): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Stats<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): warning #842-D: constant "WARPS_N" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Stats<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(633): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(649): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(690): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(690): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Stats<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(691): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(705): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(7): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(13): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(32): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(38): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(39): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(87): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(88): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(89): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(90): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(91): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(73): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(74): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(75): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(76): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(77): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(79): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(80): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(81): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(82): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(83): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(85): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(86): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(87): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(88): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(89): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(91): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(92): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(93): error: identifier "uint32_t" is undefined

  Error limit reached.
  100 errors detected in the compilation of "/home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu".
  Compilation terminated.
  ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
      subprocess.run(
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/subprocess.py", line 526, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/home/chenzhiyuan/apex/setup.py", line 762, in <module>
      setup(
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/__init__.py", line 108, in setup
      return distutils.core.setup(**attrs)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/dist.py", line 1221, in run_command
      super().run_command(command)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/command/install.py", line 68, in run
      return orig.install.run(self)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/command/install.py", line 697, in run
      self.run_command('build')
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/dist.py", line 1221, in run_command
      super().run_command(command)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 131, in run
      self.run_command(cmd_name)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/dist.py", line 1221, in run_command
      super().run_command(command)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
      self.build_extensions()
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
      build_ext.build_extensions(self)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
      self._build_extensions_serial()
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
      self.build_extension(ext)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
      _build_ext.build_extension(self, ext)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
      objects = self.compiler.compile(
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  error: subprocess-exited-with-error

  × Running setup.py install for apex did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /mnt/shared/mamba/envs/dev/bin/python3.10 -u -c '
  exec(compile('"'"''"'"''"'"'
  # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
  #
  # - It imports setuptools before invoking setup.py, to enable projects that directly
  #   import from `distutils.core` to work with newer packaging standards.
  # - It provides a clear error message when setuptools is not installed.
  # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
  #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
  #     manifest_maker: standard file '"'"'-c'"'"' not found".
  # - It generates a shim setup.py, for handling setup.cfg-only projects.
  import os, sys, tokenize

  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)

  __file__ = %r
  sys.argv[0] = __file__

  if os.path.exists(__file__):
      filename = __file__
      with tokenize.open(__file__) as f:
          setup_py_code = f.read()
  else:
      filename = "<auto-generated setuptools caller>"
      setup_py_code = "from setuptools import setup; setup()"

  exec(compile(setup_py_code, filename, "exec"))
  '"'"''"'"''"'"' % ('"'"'/home/chenzhiyuan/apex/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' --cpp_ext --cuda_ext --deprecated_fused_adam --xentropy --fast_multihead_attn --fast_bottleneck --fused_conv_bias_relu --cudnn_gbn --fmha --focal_loss --fast_layer_norm install --record /tmp/pip-record-ew5oygz6/install-record.txt --single-version-externally-managed --compile --install-headers /mnt/shared/mamba/envs/dev/include/python3.10/apex
  cwd: /home/chenzhiyuan/apex/
  Running setup.py install for apex ... error
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> apex

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

ZhiyuanChen · 2023-04-12T15:24:50Z

Environment

Collecting environment information...
PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.8.2003 (Core) (x86_64)
GCC version: (conda-forge gcc 11.3.0-19) 11.3.0
Clang version: Could not collect
CMake version: version 2.8.12.2
Libc version: glibc-2.17

Python version: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.83.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 530.30.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                112
On-line CPU(s) list:   0-111
Thread(s) per core:    2
Core(s) per socket:    28
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 106
Model name:            Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz
Stepping:              6
CPU MHz:               2300.036
BogoMIPS:              4600.07
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             48K
L1i cache:             32K
L2 cache:              1280K
L3 cache:              55296K
NUMA node0 CPU(s):     0-55
NUMA node1 CPU(s):     56-111
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear spec_ctrl intel_stibp arch_capabilities

Versions of relevant libraries:
[pip3] mypy==1.2.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.2
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.0
[pip3] torcheval==0.0.6
[pip3] torchmetrics==0.11.4
[pip3] torchtnt==0.0.7
[pip3] torchvision==0.15.0
[conda] blas                      1.0                         mkl    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libblas                   3.9.0            16_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            16_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            16_linux64_mkl    conda-forge
[conda] mkl                       2022.2.1         h84fe81f_16997    conda-forge
[conda] numpy                     1.24.2          py310h8deb116_0    conda-forge
[conda] pytorch                   2.0.0           py3.10_cuda11.8_cudnn8.7.0_0    pytorch
[conda] pytorch-cuda              11.8                 h7e8668a_3    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                2.0.0               py310_cu118    pytorch
[conda] torcheval                 0.0.6                    pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchtnt                  0.0.7                    pypi_0    pypi
[conda] torchtriton               2.0.0                     py310    pytorch
[conda] torchvision               0.15.0              py310_cu118    pytorch

ZhiyuanChen · 2023-04-12T15:34:26Z

Adding the following lines at the top of apex/contrib/layer_norm/ln.h resolved the issue for me:

#include <stdint.h>
#include <stdio.h>
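For readers hitting the same error, here is a minimal sketch of why a header needs these two includes. It is not taken from ln.h: `pack_u16x2` and `describe_launch` are hypothetical helpers made up for illustration. The point is only that fixed-width types such as `uint32_t` come from `<stdint.h>` and `snprintf` comes from `<stdio.h>`, so a header that uses them should include both itself rather than rely on another header pulling them in, which presumably stopped happening with this toolchain.

```cpp
// Hypothetical stand-in for a header like ln.h; names are illustrative only.
#include <assert.h>  // assert
#include <stdint.h>  // uint16_t, uint32_t
#include <stdio.h>   // snprintf, size_t

// Packs two 16-bit values into one 32-bit word, low half first.
// Without <stdint.h>, uint16_t and uint32_t may be undeclared here.
inline uint32_t pack_u16x2(uint16_t lo, uint16_t hi) {
    return static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
}

// Writes a short "rows=R cols=C" summary into buf and returns the number
// of characters snprintf reports. Without <stdio.h>, snprintf may be undeclared.
inline int describe_launch(char *buf, size_t len, uint32_t rows, uint32_t cols) {
    assert(buf != nullptr && len > 0);
    return snprintf(buf, len, "rows=%u cols=%u",
                    static_cast<unsigned>(rows), static_cast<unsigned>(cols));
}
```

Removing either include from this sketch may still compile on some setups if another header happens to bring the declarations in, which is why this kind of breakage tends to show up only after a compiler or library upgrade.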

crcrpar · 2023-04-12T15:46:47Z

Would you mind opening a pull request with your diff?

ZhiyuanChen · 2023-04-12T15:48:59Z

> Would you mind opening a pull request with your diff?

Just opened~

ZhiyuanChen added the bug (Something isn't working) label Apr 12, 2023

ZhiyuanChen added a commit to ZhiyuanChen/apex that referenced this issue Apr 12, 2023

include stdint.h & stdio.h in fast_layernorm

4dcf43e

fixes NVIDIA#1636

ZhiyuanChen added a commit to ZhiyuanChen/apex that referenced this issue Apr 12, 2023

include stdint.h & stdio.h in fast_layer_norm/ln.h

b182ec0

fixes NVIDIA#1636

ZhiyuanChen mentioned this issue Apr 12, 2023

include stdint.h & stdio.h in fast_layer_norm/ln.h #1637

Merged

crcrpar closed this as completed in #1637 Apr 18, 2023

crcrpar pushed a commit that referenced this issue Apr 18, 2023

include stdint.h & stdio.h in fast_layer_norm/ln.h (#1637)

4e1ae43

fixes #1636

david-waterworth mentioned this issue Jun 7, 2023

Dockerfile apex build command fails NVIDIA/NeMo#6826

Closed

yuanzhedong pushed a commit to yuanzhedong/apex that referenced this issue Jul 14, 2023

include stdint.h & stdio.h in fast_layer_norm/ln.h (NVIDIA#1637)

9f5db2b

fixes NVIDIA#1636

kevingreenman mentioned this issue Jul 25, 2023

ImportError - fast_transformers/causal_product undefined symbol - unable to train or finetune IBM/molformer#6

Open

shjwudp mentioned this issue Aug 1, 2023

Follow the README guide. The installation of the apex failed NVIDIA/NeMo#7142

Closed
