Unable to build fast_layer_norm #1636

Closed
ZhiyuanChen opened this issue Apr 12, 2023 · 4 comments · Fixed by #1637
Labels
bug Something isn't working

Comments

@ZhiyuanChen
Contributor

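Describe the Bug

Building the fast_layer_norm extension fails: nvcc reports the fixed-width integer types (uint8_t, uint16_t, uint32_t, uint64_t) as undefined identifiers in ln.h and ln_utils.cuh.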

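An error such as identifier "uint32_t" is undefined, as in the output below, is what a compiler reports when fixed-width integer types are used without a header that declares them being visible. The following is a minimal sketch of that situation, assuming (the log does not confirm this) that the layer-norm headers rely on some other header pulling in <cstdint> for them. Reducer and total_warps are hypothetical stand-ins for illustration, not the apex sources.

```cpp
// Minimal sketch, not the apex sources: Reducer is a hypothetical
// stand-in for a template that takes uint32_t non-type parameters.
#include <cstdint>  // declares uint8_t, uint16_t, uint32_t, uint64_t

template <typename T, uint32_t WARPS_M, uint32_t WARPS_N>
struct Reducer {
    // Total number of warps cooperating in the reduction.
    static constexpr uint32_t kWarps = WARPS_M * WARPS_N;
};

// Convenience accessor so the value can also be checked at run time.
template <uint32_t WARPS_M, uint32_t WARPS_N>
constexpr uint32_t total_warps() {
    return Reducer<float, WARPS_M, WARPS_N>::kWarps;
}

// Compile-time checks: these only compile once <cstdint> is visible.
static_assert(sizeof(uint64_t) == 8, "uint64_t is 64 bits wide");
static_assert(total_warps<4, 2>() == 8, "4 x 2 warps");
```

Removing the #include <cstdint> line from this sketch should produce the same class of error, since nothing else in the file declares uint32_t or uint64_t.
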
Minimal Steps/Code to Reproduce the Bug

pip install -v --no-cache-dir \
  --global-option="--cpp_ext" \
  --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" \
  --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" \
  --global-option="--fast_bottleneck" \
  --global-option="--fused_conv_bias_relu" \
  --global-option="--cudnn_gbn" \
  --global-option="--fmha" \
  --global-option="--focal_loss" \
  --global-option="--fast_layer_norm" \
  --global-option="--bnp" \
  ./

Output

Using pip 23.0.1 from /mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/pip (python 3.10)
WARNING: Implying --no-binary=:all: due to the presence of --build-option / --global-option / --install-option. Consider using --config-settings for more flexibility.
DEPRECATION: --no-binary currently disables reading from the cache of locally built wheels. In the future --no-binary will not influence the wheel cache. pip 23.1 will enforce this behaviour change. A possible replacement is to use the --no-cache-dir option. You can use the flag --use-feature=no-binary-enable-wheel-cache to test the upcoming behaviour. Discussion can be found at https://github.com/pypa/pip/issues/11453
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Processing /home/chenzhiyuan/apex
  Running command python setup.py egg_info


  torch.__version__  = 2.0.0


  running egg_info
  creating /tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info
  writing /tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/PKG-INFO
  writing dependency_links to /tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/dependency_links.txt
  writing requirements to /tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/requires.txt
  writing top-level names to /tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/top_level.txt
  writing manifest file '/tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/SOURCES.txt'
  reading manifest file '/tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/SOURCES.txt'
  adding license file 'LICENSE'
  writing manifest file '/tmp/pip-pip-egg-info-rc4kykg9/apex.egg-info/SOURCES.txt'
  Preparing metadata (setup.py) ... done
Requirement already satisfied: packaging>20.6 in /mnt/shared/mamba/envs/dev/lib/python3.10/site-packages (from apex==0.1) (23.0)
Installing collected packages: apex
  DEPRECATION: apex is being installed using the legacy 'setup.py install' method, because the '--no-binary' option was enabled for it and this currently disables local wheel building for projects that don't have a 'pyproject.toml' file. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/11451
  Running command Running setup.py install for apex


  torch.__version__  = 2.0.0



  Compiling cuda extensions with
  nvcc: NVIDIA (R) Cuda compiler driver
  Copyright (c) 2005-2022 NVIDIA Corporation
  Built on Wed_Sep_21_10:33:58_PDT_2022
  Cuda compilation tools, release 11.8, V11.8.89
  Build cuda_11.8.r11.8/compiler.31833905_0
  from /mnt/shared/mamba/envs/dev/bin

  running install
  /mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
    warnings.warn(
  running build
  running build_py
  running build_ext
  /mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/utils/cpp_extension.py:398: UserWarning: There are no g++ version bounds defined for CUDA version 11.8
    warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
  building 'apex_C' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/flatten_unflatten.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -o build/lib.linux-x86_64-cpython-310/apex_C.cpython-310-x86_64-linux-gnu.so
  building 'amp_C' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/amp_C_frontend.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_adagrad.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_adam.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_axpby_kernel.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_l2norm_kernel.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_l2norm_kernel_mp.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_l2norm_scale_kernel.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_lamb.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_lamb_mp.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_lamb_stage_1.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_lamb_stage_2.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_novograd.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_scale_kernel.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/multi_tensor_sgd_kernel.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/amp_C.cpython-310-x86_64-linux-gnu.so
  building 'syncbn' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/syncbn.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/welford.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/syncbn.cpython-310-x86_64-linux-gnu.so
  building 'fused_layer_norm_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/layer_norm_cuda.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/layer_norm_cuda_kernel.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/fused_layer_norm_cuda.cpython-310-x86_64-linux-gnu.so
  building 'mlp_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/mlp.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/mlp_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/mlp_cuda.cpython-310-x86_64-linux-gnu.so
  building 'fused_dense_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/fused_dense.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/fused_dense_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/fused_dense_cuda.cpython-310-x86_64-linux-gnu.so
  building 'scaled_upper_triang_masked_softmax_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_upper_triang_masked_softmax.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_upper_triang_masked_softmax_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/scaled_upper_triang_masked_softmax_cuda.cpython-310-x86_64-linux-gnu.so
  building 'generic_scaled_masked_softmax_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/generic_scaled_masked_softmax.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/generic_scaled_masked_softmax_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/generic_scaled_masked_softmax_cuda.cpython-310-x86_64-linux-gnu.so
  building 'scaled_masked_softmax_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_masked_softmax.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_masked_softmax_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/scaled_masked_softmax_cuda.cpython-310-x86_64-linux-gnu.so
  building 'scaled_softmax_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_softmax.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/scaled_softmax_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/scaled_softmax_cuda.cpython-310-x86_64-linux-gnu.so
  building 'fused_weight_gradient_mlp_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/fused_weight_gradient_dense.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/fused_weight_gradient_dense_16bit_prec_cuda.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/csrc/megatron/fused_weight_gradient_dense_cuda.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/fused_weight_gradient_mlp_cuda.cpython-310-x86_64-linux-gnu.so
  building 'xentropy_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/xentropy/interface.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/xentropy/xentropy_kernel.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/xentropy_cuda.cpython-310-x86_64-linux-gnu.so
  building 'focal_loss_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/focal_loss/focal_loss_cuda.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/focal_loss/focal_loss_cuda_kernel.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/focal_loss_cuda.cpython-310-x86_64-linux-gnu.so
  building 'fused_adam_cuda' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  g++ -pthread -B /mnt/shared/mamba/envs/dev/compiler_compat -shared -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib -Wl,--allow-shlib-undefined -Wl,-rpath,/mnt/shared/mamba/envs/dev/lib -Wl,-rpath-link,/mnt/shared/mamba/envs/dev/lib -L/mnt/shared/mamba/envs/dev/lib /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/optimizers/fused_adam_cuda.o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/optimizers/fused_adam_cuda_kernel.o -L/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/lib -L/mnt/shared/mamba/envs/dev/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/fused_adam_cuda.cpython-310-x86_64-linux-gnu.so
  building 'fast_layer_norm' extension
  Emitting ninja build file /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/2] /mnt/shared/mamba/envs/dev/bin/nvcc  -I/home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/TH -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/THC -I/mnt/shared/mamba/envs/dev/include -I/mnt/shared/mamba/envs/dev/include/python3.10 -c -c /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu -o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ -I./apex/contrib/csrc/layer_norm/ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fast_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  FAILED: /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.o
  /mnt/shared/mamba/envs/dev/bin/nvcc  -I/home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/TH -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/THC -I/mnt/shared/mamba/envs/dev/include -I/mnt/shared/mamba/envs/dev/include/python3.10 -c -c /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu -o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ -I./apex/contrib/csrc/layer_norm/ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fast_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(113): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(133): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(138): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(143): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(150): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(171): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(172): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(172): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(180): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(190): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(12): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(90): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(95): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(100): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(105): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(154): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(160): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(166): error: identifier "uint16_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(172): error: identifier "uint8_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(280): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(321): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(325): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(325): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(366): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(366): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(366): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(421): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(421): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Reducer<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(422): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Reducer<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): warning #842-D: constant "WARPS_N" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Reducer<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(467): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(562): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(562): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(562): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(584): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Stats<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): warning #842-D: constant "WARPS_N" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Stats<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(633): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(649): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(690): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(690): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Stats<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(691): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(705): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(7): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(13): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(32): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(38): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(39): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(87): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(88): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(89): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(90): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(91): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(85): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(86): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(87): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(88): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(89): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(91): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(92): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(93): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(94): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(95): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(97): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(98): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(99): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(100): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(101): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(103): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(104): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu(105): error: identifier "uint32_t" is undefined

  Error limit reached.
  100 errors detected in the compilation of "/home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_bwd_semi_cuda_kernel.cu".
  Compilation terminated.
  [2/2] /mnt/shared/mamba/envs/dev/bin/nvcc  -I/home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/TH -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/THC -I/mnt/shared/mamba/envs/dev/include -I/mnt/shared/mamba/envs/dev/include/python3.10 -c -c /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu -o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ -I./apex/contrib/csrc/layer_norm/ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fast_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  FAILED: /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.o
  /mnt/shared/mamba/envs/dev/bin/nvcc  -I/home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/TH -I/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/include/THC -I/mnt/shared/mamba/envs/dev/include -I/mnt/shared/mamba/envs/dev/include/python3.10 -c -c /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu -o /home/chenzhiyuan/apex/build/temp.linux-x86_64-cpython-310/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ -I./apex/contrib/csrc/layer_norm/ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fast_layer_norm -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(113): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(133): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(138): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(143): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(150): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(171): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(172): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(172): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(180): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln.h(190): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(12): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(90): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(95): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(100): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(105): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(154): error: identifier "uint64_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(160): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(166): error: identifier "uint16_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(172): error: identifier "uint8_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(280): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(321): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(325): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(325): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(366): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(366): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(366): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(382): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(421): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(421): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Reducer<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(422): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(431): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Reducer<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(466): warning #842-D: constant "WARPS_N" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Reducer<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(467): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(479): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(562): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(562): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(562): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(573): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(584): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Stats<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(632): warning #842-D: constant "WARPS_N" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Stats<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(633): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(641): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(649): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(690): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(690): warning #842-D: constant "WARPS_M" is not used in or cannot be deduced from the template argument list of class template "layer_norm::Stats<T, <error>, <error>, <error>>"

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(691): error: the template argument list of the partial specialization includes a nontype argument whose type depends on a template parameter

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(700): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_utils.cuh(705): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(7): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(13): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(32): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(38): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(39): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(87): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(88): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(89): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(90): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_kernel_traits.h(91): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(73): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(74): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(75): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(76): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(77): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(79): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(80): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(81): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(82): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(83): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(85): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(86): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(87): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(88): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(89): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(91): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(92): error: identifier "uint32_t" is undefined

  /home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu(93): error: identifier "uint32_t" is undefined

  Error limit reached.
  100 errors detected in the compilation of "/home/chenzhiyuan/apex/apex/contrib/csrc/layer_norm/ln_fwd_cuda_kernel.cu".
  Compilation terminated.
  ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
      subprocess.run(
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/subprocess.py", line 526, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/home/chenzhiyuan/apex/setup.py", line 762, in <module>
      setup(
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/__init__.py", line 108, in setup
      return distutils.core.setup(**attrs)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/dist.py", line 1221, in run_command
      super().run_command(command)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/command/install.py", line 68, in run
      return orig.install.run(self)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/command/install.py", line 697, in run
      self.run_command('build')
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/dist.py", line 1221, in run_command
      super().run_command(command)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 131, in run
      self.run_command(cmd_name)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/dist.py", line 1221, in run_command
      super().run_command(command)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
      self.build_extensions()
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
      build_ext.build_extensions(self)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
      self._build_extensions_serial()
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
      self.build_extension(ext)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
      _build_ext.build_extension(self, ext)
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
      objects = self.compiler.compile(
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/mnt/shared/mamba/envs/dev/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  error: subprocess-exited-with-error

  × Running setup.py install for apex did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /mnt/shared/mamba/envs/dev/bin/python3.10 -u -c '
  exec(compile('"'"''"'"''"'"'
  # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
  #
  # - It imports setuptools before invoking setup.py, to enable projects that directly
  #   import from `distutils.core` to work with newer packaging standards.
  # - It provides a clear error message when setuptools is not installed.
  # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
  #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
  #     manifest_maker: standard file '"'"'-c'"'"' not found".
  # - It generates a shim setup.py, for handling setup.cfg-only projects.
  import os, sys, tokenize

  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)

  __file__ = %r
  sys.argv[0] = __file__

  if os.path.exists(__file__):
      filename = __file__
      with tokenize.open(__file__) as f:
          setup_py_code = f.read()
  else:
      filename = "<auto-generated setuptools caller>"
      setup_py_code = "from setuptools import setup; setup()"

  exec(compile(setup_py_code, filename, "exec"))
  '"'"''"'"''"'"' % ('"'"'/home/chenzhiyuan/apex/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' --cpp_ext --cuda_ext --deprecated_fused_adam --xentropy --fast_multihead_attn --fast_bottleneck --fused_conv_bias_relu --cudnn_gbn --fmha --focal_loss --fast_layer_norm install --record /tmp/pip-record-ew5oygz6/install-record.txt --single-version-externally-managed --compile --install-headers /mnt/shared/mamba/envs/dev/include/python3.10/apex
  cwd: /home/chenzhiyuan/apex/
  Running setup.py install for apex ... error
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> apex

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
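
Nearly every error in the log above is `identifier "uint32_t" is undefined` (or the `uint64_t`, `uint16_t`, `uint8_t` variants). The "nontype argument whose type depends on a template parameter" errors and the #842-D warnings appear only on templates whose parameters are already shown as `<error>`, so they look like follow-on failures. That pattern usually means the fixed-width integer types are used without `<cstdint>` (or `<stdint.h>`) being included directly, although the log alone does not confirm this is the cause here. A minimal sketch of that situation follows. The `Tile` template is a hypothetical stand-in, not code from apex.

```cpp
// Hypothetical reduction of the pattern in the log; not taken from apex.
// Removing this include reproduces "identifier uint32_t is undefined" on
// toolchains where no other header happens to pull it in transitively.
// <cstdint> guarantees std::uint32_t; mainstream implementations also expose
// the global names used below, and <stdint.h> guarantees them.
#include <cstdint>

// Illustrative template with uint32_t non-type parameters.
template <typename T, uint32_t WARPS_M, uint32_t WARPS_N>
struct Tile {
    static constexpr uint32_t kWarps = WARPS_M * WARPS_N;
    // Assumes 32 threads per warp and one element of T per thread.
    static constexpr uint64_t kBytes =
        static_cast<uint64_t>(kWarps) * 32 * sizeof(T);
};

// Partial specialization for WARPS_N == 1. If uint32_t is unknown, the
// compiler cannot type the parameters and reports errors here as well.
template <typename T, uint32_t WARPS_M>
struct Tile<T, WARPS_M, 1> {
    static constexpr uint32_t kWarps = WARPS_M;
    static constexpr uint64_t kBytes =
        static_cast<uint64_t>(WARPS_M) * 32 * sizeof(T);
};

// Compile-time check that both the primary template and the
// specialization are usable once the header is included.
static_assert(Tile<float, 4, 1>::kWarps == 4, "specialization selected");
static_assert(Tile<float, 2, 2>::kWarps == 4, "primary template selected");
```

If this is what is happening, adding the include at the top of the first header that uses these types would be the least invasive change, since it does not depend on the include order of other headers.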
@ZhiyuanChen ZhiyuanChen added the bug Something isn't working label Apr 12, 2023
@ZhiyuanChen
Contributor Author

Environment

Collecting environment information...
PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.8.2003 (Core) (x86_64)
GCC version: (conda-forge gcc 11.3.0-19) 11.3.0
Clang version: Could not collect
CMake version: version 2.8.12.2
Libc version: glibc-2.17

Python version: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.83.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 530.30.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                112
On-line CPU(s) list:   0-111
Thread(s) per core:    2
Core(s) per socket:    28
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 106
Model name:            Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz
Stepping:              6
CPU MHz:               2300.036
BogoMIPS:              4600.07
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             48K
L1i cache:             32K
L2 cache:              1280K
L3 cache:              55296K
NUMA node0 CPU(s):     0-55
NUMA node1 CPU(s):     56-111
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear spec_ctrl intel_stibp arch_capabilities

Versions of relevant libraries:
[pip3] mypy==1.2.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.2
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.0
[pip3] torcheval==0.0.6
[pip3] torchmetrics==0.11.4
[pip3] torchtnt==0.0.7
[pip3] torchvision==0.15.0
[conda] blas                      1.0                         mkl    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libblas                   3.9.0            16_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            16_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            16_linux64_mkl    conda-forge
[conda] mkl                       2022.2.1         h84fe81f_16997    conda-forge
[conda] numpy                     1.24.2          py310h8deb116_0    conda-forge
[conda] pytorch                   2.0.0           py3.10_cuda11.8_cudnn8.7.0_0    pytorch
[conda] pytorch-cuda              11.8                 h7e8668a_3    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                2.0.0               py310_cu118    pytorch
[conda] torcheval                 0.0.6                    pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchtnt                  0.0.7                    pypi_0    pypi
[conda] torchtriton               2.0.0                     py310    pytorch
[conda] torchvision               0.15.0              py310_cu118    pytorch

@ZhiyuanChen
Contributor Author

ZhiyuanChen commented Apr 12, 2023

I added the following lines at the top of apex/contrib/layer_norm/ln.h, and that resolved the issue.

#include <stdint.h>
#include <stdio.h>
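
For context, here is a minimal sketch of why these two includes can matter. It assumes that the header uses fixed-width integer types and printf-style functions without itself including the headers that declare them; the thread does not confirm this. With some compiler and standard-library combinations, such names are no longer pulled in transitively through other headers, and the build can then fail with undeclared-identifier errors. The struct and function names below are hypothetical and are not taken from apex.

```cpp
// Hypothetical header-style sketch; none of these names come from apex.
#include <stdint.h>  // declares uint32_t, uint64_t
#include <stdio.h>   // declares snprintf (and size_t)

// Illustrative parameter struct that relies on fixed-width integer types.
struct LaunchParamsSketch {
    uint32_t rows;
    uint32_t cols;
};

// Pack the two 32-bit fields into a single 64-bit key.
inline uint64_t make_key(const LaunchParamsSketch &p) {
    return (static_cast<uint64_t>(p.rows) << 32) | p.cols;
}

// Format the parameters into a caller-provided buffer.
// Returns the value reported by snprintf.
inline int describe(const LaunchParamsSketch &p, char *buf, size_t len) {
    return snprintf(buf, len, "rows=%u cols=%u",
                    static_cast<unsigned>(p.rows),
                    static_cast<unsigned>(p.cols));
}
```

If the two `#include` lines are removed, this snippet only compiles when some other header happens to provide the same declarations, which is the kind of fragility an explicit include avoids.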

ZhiyuanChen added a commit to ZhiyuanChen/apex that referenced this issue Apr 12, 2023
@crcrpar
Collaborator

crcrpar commented Apr 12, 2023

Would you mind opening a pull request with your diff?

@ZhiyuanChen
Contributor Author

> Would you mind opening a pull request with your diff?

Just opened~
