This talk presents Google’s effort of optimizing LLVM for CUDA. When we started this effort, LLVM was well-tuned for CPUs but there had been little public work on improving its GPU performance. We developed, tuned, and augmented several general and CUDA-specific optimization passes. As a result, our LLVM-based compiler generates better code than nvcc on key end-to-end internal benchmarks and is on par with nvcc on a variety of open-source benchmarks.