Authors
- Philippe Tillet, who’s leading the Triton team at OpenAI. Previously, he worked at pretty much every major chip maker, including NVIDIA, AMD, Intel, and Nervana.
- Sharan Chetlur, Principal Engineer working on TensorRT-LLM at NVIDIA. He has been working on CUDA since 2012, optimizing the performance of deep learning models from single-GPU to full data-center scale. Previously, he was Director of Engineering on the Kernels team at Cerebras.
- William Malpica, co-founder of Voltron Data and creator of BlazingSQL. He helped scale a GPU-native query engine to handle 100 TB queries!
- Mark Saroufim, PyTorch core developer and cofounder of CUDA MODE. He also ran the really fun NeurIPS LLM Efficiency challenge last year. Previously, he was at Graphcore and Microsoft.
A Crash Course on GPU Optimization
- How to make PyTorch models faster? (see the sketch after this list)
- Fuse more
- Use tensor cores
- Reduce overhead
- Quantize
- Use a custom kernel
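A minimal sketch of a few of these levers using standard PyTorch APIs (torch.compile, TF32, fp16, dynamic quantization). The toy model and shapes below are placeholders for illustration, not from the talk.

```python
import torch
import torch.nn as nn

# Placeholder model, just to have some Linear layers to work with.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))

# Use tensor cores: allow TF32 matmuls on Ampere and newer GPUs.
torch.backends.cuda.matmul.allow_tf32 = True

if torch.cuda.is_available():
    model = model.cuda().half()  # fp16 weights/activations also map onto tensor cores
    x = torch.randn(8, 1024, device="cuda", dtype=torch.half)

    # Fuse more + reduce overhead: torch.compile captures the graph and emits
    # fused Triton kernels, cutting Python and kernel-launch overhead.
    compiled_model = torch.compile(model)
    out = compiled_model(x)

# Quantize: dynamic int8 quantization of the Linear layers (CPU inference).
quantized_model = torch.ao.quantization.quantize_dynamic(
    model.float().cpu(), {nn.Linear}, dtype=torch.qint8
)

# Use a custom kernel: when the compiler can't fuse an op well, write it in
# Triton or CUDA and expose it to PyTorch as a custom op.
```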
- Arithmetic Intensity = Total number of operations / Data I/O movement
- If it is less than 1, the operation is memory bound, i.e. the GPU's compute utilization can be improved further (see the worked example after this list).
- A GPU operation is considered memory bound when the overall execution time is dominated by accessing data from memory rather than the actual computation itself.
- Conversely, a GPU operation is considered computation bound when the execution time is dominated by the actual calculations rather than memory access.
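A rough, illustrative calculation of arithmetic intensity for an elementwise add versus a large matmul, assuming fp32 (4 bytes per element) and that each operand moves to or from global memory exactly once. The numbers are back-of-the-envelope, not from the talk.

```python
# Arithmetic intensity = FLOPs / bytes moved.

def add_intensity(n: int) -> float:
    flops = n                     # one add per element
    bytes_moved = 3 * n * 4       # read a, read b, write c
    return flops / bytes_moved    # ~0.08 regardless of n -> memory bound

def matmul_intensity(n: int) -> float:
    flops = 2 * n ** 3            # n^2 outputs, each an n-element dot product
    bytes_moved = 3 * n * n * 4   # read A, read B, write C
    return flops / bytes_moved    # grows as n/6 -> compute bound for large n

print(add_intensity(1 << 20))     # ≈ 0.083
print(matmul_intensity(4096))     # ≈ 682.7
```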
Abandoned...