Problem

CUDA Minor Version Mismatch

The latest PyTorch release at the time of writing (v2.3.1) supports CUDA versions only up to 12.1. If PyTorch is installed with its embedded CUDA toolkit and the system CUDA version is not also 12.1, even if only the minor version differs, Apex compilation fails with a version mismatch error.

GitHub Issue
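The mismatch described above can be checked before starting the build. The following is a minimal sketch, not part of Apex itself: it compares the CUDA version PyTorch reports in `torch.version.cuda` with the release printed by `nvcc --version`. It assumes the `nvcc` found on `PATH` is the one the build will use, which may not hold if `CUDA_HOME` points elsewhere; the helper names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_major_minor(text):
    """Return the first 'X.Y' found in text as (X, Y), or None."""
    match = re.search(r"(\d+)\.(\d+)", text or "")
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version`."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output or "")
    return (int(match.group(1)), int(match.group(2))) if match else None


def system_cuda_version():
    """CUDA toolkit version reported by the nvcc on PATH, or None if absent."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def torch_cuda_version():
    """CUDA version PyTorch was built with, or None (CPU build or no torch)."""
    try:
        import torch
    except ImportError:
        return None
    return parse_major_minor(torch.version.cuda)


if __name__ == "__main__":
    system, bundled = system_cuda_version(), torch_cuda_version()
    print(f"system nvcc CUDA: {system}, PyTorch CUDA: {bundled}")
    if system is None or bundled is None:
        print("Could not determine both versions.")
    elif system != bundled:
        print("Mismatch: the Apex build is likely to reject this combination.")
    else:
        print("Versions match.")
```

Comparing only major and minor numbers is a deliberate simplification, since the error discussed here concerns the minor version.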

Random Compilation Errors

When compilation fails with a seemingly random error, such as a template error or an undefined reference or identifier, the likely cause is an incompatible gcc version. Note that this can happen even if the gcc version is within the maximum supported version for the CUDA version in use (for CUDA 12.1, the maximum supported GCC version is 12.2, ref).

Also see this post.
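As a quick first check, the installed gcc version can be compared against the maximum stated above (GCC 12.2 for CUDA 12.1). This is a minimal sketch under stated assumptions: the lookup table contains only the one pair given in this note, the helper names are hypothetical, and, as noted above, passing this check does not guarantee a clean build.

```python
import re
import shutil
import subprocess

# Only the pair stated in this note; extend from NVIDIA's documentation as needed.
MAX_GCC_FOR_CUDA = {(12, 1): (12, 2)}


def parse_gcc_version(text):
    """Parse '11.4.0' or '13' into (major, minor); minor defaults to 0."""
    match = re.search(r"(\d+)(?:\.(\d+))?", text or "")
    if not match:
        return None
    return (int(match.group(1)), int(match.group(2) or 0))


def installed_gcc_version():
    """Version of the gcc on PATH, or None if gcc is not installed."""
    gcc = shutil.which("gcc")
    if gcc is None:
        return None
    # Passing both flags yields the full version on old and new gcc releases.
    result = subprocess.run(
        [gcc, "-dumpfullversion", "-dumpversion"], capture_output=True, text=True
    )
    return parse_gcc_version(result.stdout.strip())


def gcc_within_limit(gcc_version, cuda_version):
    """True/False if the limit is known for this CUDA version, else None."""
    limit = MAX_GCC_FOR_CUDA.get(cuda_version)
    if limit is None or gcc_version is None:
        return None
    return gcc_version <= limit


if __name__ == "__main__":
    gcc = installed_gcc_version()
    print(f"gcc version: {gcc}")
    print(f"within limit for CUDA 12.1: {gcc_within_limit(gcc, (12, 1))}")
```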

Example:

/home/snowsr/.local/lib/python3.9/site-packages/torch/include/pybind11/detail/../cast.h: In function ‘typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type<T>::type>::cast_op_type<T> pybind11::detail::cast_op(make_caster<T>&)’:
  /home/snowsr/.local/lib/python3.9/site-packages/torch/include/pybind11/detail/../cast.h:45:120: error: expected template-name before ‘<’ token
     45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
        |                                                                                                                        ^
  /home/snowsr/.local/lib/python3.9/site-packages/torch/include/pybind11/detail/../cast.h:45:120: error: expected identifier before ‘<’ token
  /home/snowsr/.local/lib/python3.9/site-packages/torch/include/pybind11/detail/../cast.h:45:123: error: expected primary-expression before ‘>’ token
     45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
        |                                                                                                                           ^
  /home/snowsr/.local/lib/python3.9/site-packages/torch/include/pybind11/detail/../cast.h:45:126: error: expected primary-expression before ‘)’ token
     45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
        |                                                                                                                              ^
  ninja: build stopped: subcommand failed.
 
/usr/include/stdlib.h(141): error: identifier "_Float32" is undefined

Solution

Method

Code

# build apex from source
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
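After the build finishes, it can be useful to confirm that the compiled extensions were actually installed. The sketch below assumes the extension modules are named `apex_C` (built by `--cpp_ext`) and `amp_C` (built by `--cuda_ext`); these names are an assumption about Apex's build and may differ between versions. The helper name is hypothetical.

```python
import importlib.util

# Assumed module names: apex_C for --cpp_ext, amp_C for --cuda_ext.
EXPECTED_EXTENSIONS = ("apex_C", "amp_C")


def find_modules(names):
    """Map each top-level module name to whether it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in find_modules(EXPECTED_EXTENSIONS).items():
        status = "found" if found else "missing (extension may not have been built)"
        print(f"{name}: {status}")
```

`find_spec` only locates the modules without importing them, so this check does not need a GPU to run.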

Additional Notes


References

GitHub post on fixing Apex compilation error