CUDA Engineering¶
The path to mastering GPU programming and High-Performance Computing (HPC), and to landing a role at NVIDIA or a core AI lab.
Zero to Hero Roadmap¶
0. The Foundation (Prerequisites)¶
- Modern C++: Pointers, memory layout, RAII, and `std::vector` internals. NVIDIA requires strong C++ skills, not just Python.
- Computer Architecture: Cache hierarchies (L1/L2), SIMD concepts, and latency vs. throughput.
1. Kernel Basics¶
- GPU Architecture: SMs, Warps, Scheduling, and the execution hierarchy.
- Structuring Kernels: Grids, Blocks, Threads mapping.
- Hello World: Vector Addition, Matrix Multiplication (Naive).
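The thread hierarchy above maps directly onto the "Hello World" of CUDA. A minimal vector-addition sketch (using unified memory to keep the host code short; names like `vecAdd` are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element: global index = block offset + thread offset.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];        // guard: the last block may overshoot n
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);         // unified memory keeps the example short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;                      // threads per block
    int grid  = (n + block - 1) / block;  // round up so every element is covered
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();              // kernel launches are asynchronous

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The `(n + block - 1) / block` rounding and the `i < n` guard together are the standard idiom for sizes that are not a multiple of the block size.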
2. Tools of the Trade (Crucial)¶
- Profiling: Using Nsight Compute (NCU) and Nsight Systems (NSYS) to identify bottlenecks.
- Debugging: Compute Sanitizer and `printf` debugging strategies.
3. Memory Mastery¶
- Global Memory: Coalescing patterns to maximize bandwidth.
- Shared Memory: Using the L1/Shared cache for tiling (Tiled MatMul).
- Register File: Optimization and preventing spills.
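The shared-memory tiling idea can be sketched with the classic tiled MatMul kernel. This is a teaching sketch, not a cuBLAS competitor; `TILE = 16` and square `n x n` row-major matrices are assumptions:

```cuda
#define TILE 16

// C = A * B. Each block computes one TILE x TILE tile of C, staging tiles of
// A and B through shared memory so each global value is loaded once per block
// instead of once per thread (TILE-fold reuse).
__global__ void tiledMatMul(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        // cooperative load; zero-pad where the tile runs off the matrix edge
        As[threadIdx.y][threadIdx.x] = (row < n && t * TILE + threadIdx.x < n)
            ? A[row * n + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (col < n && t * TILE + threadIdx.y < n)
            ? B[(t * TILE + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();                  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                  // done reading before the next overwrite
    }
    if (row < n && col < n) C[row * n + col] = acc;
}
```

Note the two `__syncthreads()` barriers: the first prevents reading a half-loaded tile, the second prevents the next iteration from overwriting a tile still in use.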
4. Compute Optimization¶
- Control Flow: Branch divergence minimization.
- Warp Primitives: Shuffle instructions for fast reductions.
- Occupancy: Calculating optimal thread/block usage.
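Warp shuffle instructions make reductions fast because lanes exchange registers directly, with no shared-memory round trip. A sketch of a sum reduction built this way (assumes `blockDim.x` is a multiple of 32 and `*out` is zero-initialized before launch):

```cuda
#include <cuda_runtime.h>

// Shuffle-down reduction: each step halves the number of live partial sums;
// after log2(32) = 5 steps, lane 0 holds the warp's total.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);  // read from lane + offset
    return val;  // result is valid only in lane 0
}

// Block-level reduction from warp reductions: one shared slot per warp.
__global__ void reduceSum(const float* in, float* out, int n) {
    __shared__ float warpSums[32];        // max 1024 threads = 32 warps per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    v = warpReduceSum(v);
    if (threadIdx.x % 32 == 0) warpSums[threadIdx.x / 32] = v;
    __syncthreads();

    if (threadIdx.x < 32) {               // first warp reduces the per-warp partials
        int nWarps = (blockDim.x + 31) / 32;
        v = (threadIdx.x < nWarps) ? warpSums[threadIdx.x] : 0.0f;
        v = warpReduceSum(v);
        if (threadIdx.x == 0) atomicAdd(out, v);  // combine across blocks
    }
}
```

Note the divergence-friendly structure: every lane in a warp follows the same shuffle loop, so there is no branch divergence inside the hot path.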
5. Advanced & Modern CUDA¶
- Tensor Cores: Using `wmma` intrinsics (Volta/Ampere/Hopper).
- CUTLASS: Understanding the template library for high-performance GEMMs.
- Triton: OpenAI's Pythonic language for writing GPU kernels without raw CUDA.
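The `wmma` intrinsics mentioned above operate on warp-wide "fragments" rather than individual threads. A minimal single-tile sketch, assuming compute capability 7.0+ and row-major 16x16 half-precision inputs (the kernel name and layouts are illustrative choices):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile: D = A * B + 0, on Tensor Cores.
// Fragments are opaque, warp-distributed register storage; all wmma calls
// below must be executed by every lane of the warp.
__global__ void wmmaGemm16(const __half* A, const __half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);       // start the accumulator at zero
    wmma::load_matrix_sync(aFrag, A, 16);   // 16 = leading dimension (row stride)
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // the actual Tensor Core op
    wmma::store_matrix_sync(D, cFrag, 16, wmma::mem_row_major);
}
```

A real GEMM loops `load_matrix_sync` + `mma_sync` over K-dimension tiles while keeping the accumulator fragment live; that loop structure is exactly what CUTLASS generalizes and optimizes.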
6. Capstone Projects (Resume Builders)¶
- Optimized MatMul: Implementing tiling and analyzing performance vs cuBLAS.
- Custom Attention: Writing FlashAttention from scratch.
- Requirement: Every project must include benchmarks (throughput in op/s, achieved bandwidth in GB/s) and an Nsight timeline analysis.
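For the benchmark requirement, a minimal timing harness using CUDA events is enough to report achieved bandwidth. A sketch, using a trivial copy kernel as the measured workload (swap in your own kernel; the warm-up launch is there so JIT and cache effects don't pollute the timed run):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copyKernel<<<(n + 255) / 256, 256>>>(in, out, n);   // warm-up launch

    cudaEventRecord(start);                             // timestamps on the GPU,
    copyKernel<<<(n + 255) / 256, 256>>>(in, out, n);   // so no CPU launch overhead
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gb = 2.0 * n * sizeof(float) / 1e9;          // one read + one write per element
    printf("%.3f ms, %.1f GB/s\n", ms, gb / (ms / 1e3));

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Compare the printed GB/s against your GPU's theoretical memory bandwidth, and cross-check the same kernel in Nsight Compute, which reports achieved bandwidth directly.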
📖 Deep Dive: PMPP Study Notes¶
I am currently working through the 4th Edition of "Programming Massively Parallel Processors".
- PMPP Book Overview: Syllabus and core concepts.
- Chapter 1: Introduction: The shift to throughput-oriented computing.
- Chapter 2: Data Parallel Computing: Kernels, Grids, Blocks, and Threads.
- Chapter 3: Multidimensional Grids: 2D/3D mapping and matrix operations.
- Chapter 4: Compute Architecture: SMs, warps, and occupancy.
- Chapter 5: Memory Architecture: Tiling and shared memory optimization.
- Chapter 6: Performance Considerations: Coalescing and thread coarsening.
- Chapter 7: Convolution: Constant memory and halo cells.
- Chapter 8: Stencil: 3D patterns and register tiling.
- Chapter 9: Parallel Histogram: Atomic operations and privatization.
- Chapter 10: Reduction: Minimizing divergence and hierarchical reduction.
- Chapter 11: Prefix Sum (Scan): Parallelizing sequential recursions.
- Chapter 12: Parallel Merge: Data-dependent workload boundaries.
- Chapter 13: Parallel Sorting: Efficient data movement for sorting.
- Chapter 14: Sparse Matrix Computation: Handling irregular data structures.
- Chapter 15: Parallel Graph Algorithms: Vertex and edge centric processing.
- Chapter 16: Deep Learning: Modern AI as matrix multiplications.
- Chapter 17: MRI Reconstruction: Hardware trigonometry and constant cache.
- Chapter 18: Molecular Dynamics: Spatial binning and cutoff summation.
- Chapter 19: Programming Strategy: Computational thinking and algorithm selection.
- Chapter 20: Heterogeneous Clusters: Scaling with MPI and CUDA Streams.
- Chapter 21: Dynamic Parallelism: Kernels launching other kernels.
- Chapter 22: Evolution and Trends: Unified Memory and task-level concurrency.
- Chapter 23: Conclusion: The future of throughput-oriented computing.
- Appendix A: Numerical Issues: Floating-point precision and stability.
- GPU Engineer Roadmap: Career advice and landing roles at top tech companies.
Key Resources¶
- Book: Programming Massively Parallel Processors (The "Bible" of GPU programming).
- Official Docs: CUDA C++ Programming Guide & Best Practices Guide.
- Community: CUDA Mode (Excellent lectures & repo).
- Interactive: GPU Mode Lectures (YouTube).