#Wmma 3 patch serial
#Wmma 3 patch Patch
I don't think this patch prevents optimizations like these. Take the address space optimization for an example, when we translate a generic load to specific load, we can just change the pointer type. If NVidia would send a patch with the implementation of NVVM-IR style intrinsics, I would be glad to help reviewing and getting it into LLVM. For all practical purposes they should not conflict in any way with your downstream implementation. Intrinsics in the patch are llvm.nvvm.*W*mma, while the intrinsics in NVVM-IR-spec use the llvm.nvvm.*H*mma. Just in case - the naming of intrinsics is also different. with a bit of tablegen magic it should be possible to pattern-match ld_a_f16(addrpacecast(SHARED)) and replace it with ld_a_f16_shared. The patch does not block further optimizations. 1:1 mapping is relatively simple to implement with tablegen and is sufficient for its intended use of generating specific instruction variant. Even with reduced number of intrinsics that map to these instructions, someone/somewhere will have to match them to appropriate instruction variant. Reducing the number of intrinsics does not change the fact that the root cause of complexity here is the fact that PTX encodes instruction parameters in instruction *names*. We took this approach to reduce the number of intrinsic functions that opt and code-gen has to deal with, for example to have one ld_a_f16 instead of 12.