CUDA Engineering¶
The path to mastering GPU programming and High-Performance Computing (HPC), and to landing a role at NVIDIA or a core AI lab.
Zero to Hero Roadmap¶
0. The Foundation (Prerequisites)¶
- Modern C++: Pointers, memory layout, RAII, and `std::vector` internals. NVIDIA requires strong C++ skills, not just Python.
- Computer Architecture: Cache hierarchies (L1/L2), SIMD concepts, and latency vs. throughput.
1. Kernel Basics¶
- GPU Architecture: SMs, Warps, Scheduling, and the execution hierarchy.
- Structuring Kernels: Grids, Blocks, Threads mapping.
- Hello World: Vector Addition, Matrix Multiplication (Naive).
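The thread hierarchy above maps directly onto the "Hello World" of CUDA. A minimal vector-addition sketch (using unified memory to keep the host code short; names like `vecAdd` are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element: global index = block offset + thread offset.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];        // guard: the last block may overshoot n
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);         // unified memory keeps the example short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;                      // threads per block
    int grid  = (n + block - 1) / block;  // round up so every element is covered
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();              // kernel launches are asynchronous

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The `(n + block - 1) / block` rounding and the `i < n` guard together are the standard idiom for sizes that are not a multiple of the block size.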
2. Tools of the Trade (Crucial)¶
- Profiling: Using Nsight Compute (NCU) and Nsight Systems (NSYS) to identify bottlenecks.
- Debugging: Compute Sanitizer and `printf` debugging strategies.
3. Memory Mastery¶
- Global Memory: Coalescing patterns to maximize bandwidth.
- Shared Memory: Using the L1/Shared cache for tiling (Tiled MatMul).
- Register File: Optimization and preventing spills.
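The shared-memory tiling idea can be sketched with the classic tiled MatMul kernel. This is a teaching sketch, not a cuBLAS competitor; `TILE = 16` and square `n x n` row-major matrices are assumptions:

```cuda
#define TILE 16

// C = A * B. Each block computes one TILE x TILE tile of C, staging tiles of
// A and B through shared memory so each global value is loaded once per block
// instead of once per thread (TILE-fold reuse).
__global__ void tiledMatMul(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        // cooperative load; zero-pad where the tile runs off the matrix edge
        As[threadIdx.y][threadIdx.x] = (row < n && t * TILE + threadIdx.x < n)
            ? A[row * n + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (col < n && t * TILE + threadIdx.y < n)
            ? B[(t * TILE + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();                  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                  // done reading before the next overwrite
    }
    if (row < n && col < n) C[row * n + col] = acc;
}
```

Note the two `__syncthreads()` barriers: the first prevents reading a half-loaded tile, the second prevents the next iteration from overwriting a tile still in use.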
4. Compute Optimization¶
- Control Flow: Branch divergence minimization.
- Warp Primitives: Shuffle instructions for fast reductions.
- Occupancy: Calculating optimal thread/block usage.
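Warp shuffle instructions make reductions fast because lanes exchange registers directly, with no shared-memory round trip. A sketch of a sum reduction built this way (assumes `blockDim.x` is a multiple of 32 and `*out` is zero-initialized before launch):

```cuda
#include <cuda_runtime.h>

// Shuffle-down reduction: each step halves the number of live partial sums;
// after log2(32) = 5 steps, lane 0 holds the warp's total.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);  // read from lane + offset
    return val;  // result is valid only in lane 0
}

// Block-level reduction from warp reductions: one shared slot per warp.
__global__ void reduceSum(const float* in, float* out, int n) {
    __shared__ float warpSums[32];        // max 1024 threads = 32 warps per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    v = warpReduceSum(v);
    if (threadIdx.x % 32 == 0) warpSums[threadIdx.x / 32] = v;
    __syncthreads();

    if (threadIdx.x < 32) {               // first warp reduces the per-warp partials
        int nWarps = (blockDim.x + 31) / 32;
        v = (threadIdx.x < nWarps) ? warpSums[threadIdx.x] : 0.0f;
        v = warpReduceSum(v);
        if (threadIdx.x == 0) atomicAdd(out, v);  // combine across blocks
    }
}
```

Note the divergence-friendly structure: every lane in a warp follows the same shuffle loop, so there is no branch divergence inside the hot path.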
5. Advanced & Modern CUDA¶
- Tensor Cores: Using `wmma` intrinsics (Volta/Ampere/Hopper).
- CUTLASS: Understanding the template library for high-performance GEMMs.
- Triton: OpenAI's Pythonic language for writing GPU kernels without raw CUDA.
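The `wmma` intrinsics mentioned above operate on warp-wide "fragments" rather than individual threads. A minimal single-tile sketch, assuming compute capability 7.0+ and row-major 16x16 half-precision inputs (the kernel name and layouts are illustrative choices):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile: D = A * B + 0, on Tensor Cores.
// Fragments are opaque, warp-distributed register storage; all wmma calls
// below must be executed by every lane of the warp.
__global__ void wmmaGemm16(const __half* A, const __half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);       // start the accumulator at zero
    wmma::load_matrix_sync(aFrag, A, 16);   // 16 = leading dimension (row stride)
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // the actual Tensor Core op
    wmma::store_matrix_sync(D, cFrag, 16, wmma::mem_row_major);
}
```

A real GEMM loops `load_matrix_sync` + `mma_sync` over K-dimension tiles while keeping the accumulator fragment live; that loop structure is exactly what CUTLASS generalizes and optimizes.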
6. Capstone Projects (Resume Builders)¶
- Optimized MatMul: Implementing tiling and analyzing performance vs cuBLAS.
- Custom Attention: Writing FlashAttention from scratch.
- Requirement: Every project must include benchmarks (throughput in op/s, achieved bandwidth in GB/s) and an Nsight timeline analysis.
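For the benchmark requirement, a minimal timing harness using CUDA events is enough to report achieved bandwidth. A sketch, using a trivial copy kernel as the measured workload (swap in your own kernel; the warm-up launch is there so JIT and cache effects don't pollute the timed run):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copyKernel<<<(n + 255) / 256, 256>>>(in, out, n);   // warm-up launch

    cudaEventRecord(start);                             // timestamps on the GPU,
    copyKernel<<<(n + 255) / 256, 256>>>(in, out, n);   // so no CPU launch overhead
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gb = 2.0 * n * sizeof(float) / 1e9;          // one read + one write per element
    printf("%.3f ms, %.1f GB/s\n", ms, gb / (ms / 1e3));

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Compare the printed GB/s against your GPU's theoretical memory bandwidth, and cross-check the same kernel in Nsight Compute, which reports achieved bandwidth directly.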
📖 Deep Dive: PMPP Study Notes¶
I am currently working through the 4th Edition of "Programming Massively Parallel Processors".
- PMPP Book Overview: Syllabus and core concepts.
- Chapter 1: Introduction: The shift to throughput-oriented computing.
- Chapter 2: Data Parallel Computing: Kernels, Grids, Blocks, and Threads.
- Chapter 3: Multidimensional Grids: 2D/3D mapping and matrix operations.
- Chapter 4: Compute Architecture: SMs, warps, and occupancy.
- Chapter 5: Memory Architecture: Tiling and shared memory optimization.
- Chapter 6: Performance Considerations: Coalescing and thread coarsening.
- Chapter 7: Convolution: Constant memory and halo cells.
- Chapter 8: Stencil: 3D patterns and register tiling.
- Chapter 9: Parallel Histogram: Atomic operations and privatization.
- Chapter 10: Reduction: Minimizing divergence and hierarchical reduction.
- Chapter 11: Prefix Sum (Scan): Parallelizing sequential recursions.
- Chapter 12: Parallel Merge: Data-dependent workload boundaries.
- Chapter 13: Parallel Sorting: Efficient data movement for sorting.
- Chapter 14: Sparse Matrix Computation: Handling irregular data structures.
- Chapter 15: Parallel Graph Algorithms: Vertex and edge centric processing.
- Chapter 16: Deep Learning: Modern AI as matrix multiplications.
- Chapter 17: MRI Reconstruction: Hardware trigonometry and constant cache.
- Chapter 18: Molecular Dynamics: Spatial binning and cutoff summation.
- Chapter 19: Programming Strategy: Computational thinking and algorithm selection.
- Chapter 20: Heterogeneous Clusters: Scaling with MPI and CUDA Streams.
- Chapter 21: Dynamic Parallelism: Kernels launching other kernels.
- Chapter 22: Evolution and Trends: Unified Memory and task-level concurrency.
- Chapter 23: Conclusion: The future of throughput-oriented computing.
- Appendix A: Numerical Issues: Floating-point precision and stability.
- GPU Engineer Roadmap: Career advice and landing roles at top tech companies.
Key Resources¶
- Book: Programming Massively Parallel Processors (The "Bible" of GPU programming).
- Official Docs: CUDA C++ Programming Guide & Best Practices Guide.
- Community: CUDA Mode (Excellent lectures & repo).
- Interactive: GPU Mode Lectures (YouTube).