Programming Massively Parallel Processors (PMPP) - Overview


Syllabus: PMPP (4th Edition)

This syllabus is organized into four major phases: Foundations, Performance Optimization, Parallel Patterns, and Advanced Applications (AI/ML).

Phase 1: Foundations of GPU Computing

Introduction to Heterogeneous Computing: Why parallel computing? The shift from "faster clocks" to "more cores."
The CUDA Programming Model: Kernels, grids, blocks, and threads. Your first "Vector Addition" program.
Data-Parallel Execution Model: Understanding the SIMT (Single-Instruction, Multiple-Thread) architecture and hardware multithreading.
GPU Memory Hierarchy: Global, constant, and shared memory; registers and caches.
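
To make the kernel, grid, block, and thread vocabulary above concrete, here is a minimal vector-addition sketch in CUDA. The names `vecAddKernel`, `vecAdd`, and `check` are hypothetical, and the block size of 256 threads is an arbitrary assumption rather than a recommendation from the book.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// Hypothetical helper: turn a CUDA error code into a C++ exception.
static void check(cudaError_t err) {
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// Kernel: each thread adds exactly one pair of elements.
__global__ void vecAddKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard the partial last block
}

// Hypothetical host wrapper: copy inputs to global memory, launch, copy back.
// For brevity, device buffers are not released if an error is thrown.
std::vector<float> vecAdd(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) throw std::invalid_argument("vecAdd: size mismatch");
    const int n = static_cast<int>(a.size());
    std::vector<float> c(n);
    if (n == 0) return c;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    check(cudaMalloc(&dA, bytes));
    check(cudaMalloc(&dB, bytes));
    check(cudaMalloc(&dC, bytes));
    check(cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice));
    check(cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice));

    const int threadsPerBlock = 256;  // assumed block size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    vecAddKernel<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
    check(cudaGetLastError());  // reports launch-configuration errors

    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost));
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

Calling `vecAdd(a, b)` on two equal-length vectors returns their element-wise sum. The grid is sized with a ceiling division so that every element gets a thread, and the `i < n` check keeps the surplus threads in the last block from writing out of bounds.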

Phase 2: Performance and Hardware Architecture

Performance Considerations: Thread divergence, memory coalescing, and latency hiding.
Compute Capability and Occupancy: How to balance resources (registers/shared memory) to keep the GPU fully utilized.
Floating-Point Precision: Understanding precision (FP32, FP16, BF16) and numerical stability.
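
As a rough illustration of the occupancy topic above, the sketch below asks the CUDA runtime how many blocks of a given size can be resident on one streaming multiprocessor and converts that into a theoretical occupancy ratio. `scaleKernel` and `theoreticalOccupancy` are hypothetical names; the result depends on the kernel's register and shared-memory usage and on the device it runs on.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>

// Hypothetical helper: turn a CUDA error code into a C++ exception.
static void check(cudaError_t err) {
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// A trivial kernel whose resource usage we want to inspect.
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Theoretical occupancy = resident warps per SM / maximum warps per SM.
float theoreticalOccupancy(int blockSize, size_t dynamicSmemBytes = 0) {
    int device = 0;
    check(cudaGetDevice(&device));
    cudaDeviceProp prop;
    check(cudaGetDeviceProperties(&prop, device));

    // Ask the runtime how many blocks of this size fit on one SM,
    // given the kernel's register and shared-memory footprint.
    int blocksPerSm = 0;
    check(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, scaleKernel, blockSize, dynamicSmemBytes));

    const int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    const int maxWarpsPerSm = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return static_cast<float>(blocksPerSm * warpsPerBlock) / maxWarpsPerSm;
}
```

Calling `theoreticalOccupancy(256)` returns a value between 0 and 1. Comparing a few block sizes this way is a cheap first check before profiling, although higher theoretical occupancy does not by itself guarantee better performance.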

Phase 3: Fundamental Parallel Patterns

Convolution: Implementation of 1D and 2D filters.
Prefix Sum (Scan): Solving problems that seem inherently sequential.
Reduction: Efficiently summing or finding the max/min of billions of elements.
Stencil Computation: Grid-based updates (used in weather simulations).
Histogramming: Dealing with "atomic" operations and memory contention.
Sparse Matrix-Vector Multiplication (SpMV): Handling data where most values are zero.
Merge Sort: Implementing high-performance sorting on the GPU.
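
The reduction pattern listed above can be sketched as follows. This is one common formulation, not necessarily the version the book develops: each block reduces its slice in shared memory with a tree, then one thread per block combines the partial sums with `atomicAdd`, which also gives a first taste of the atomic operations mentioned under histogramming. The names `sumKernel`, `sumOnGpu`, `check`, and `BLOCK` are hypothetical, and the cap of 1024 blocks is an arbitrary assumption.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <stdexcept>
#include <vector>

constexpr int BLOCK = 256;  // threads per block; must be a power of two for the tree below

// Hypothetical helper: turn a CUDA error code into a C++ exception.
static void check(cudaError_t err) {
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

__global__ void sumKernel(const float* in, float* out, int n) {
    __shared__ float partial[BLOCK];
    const int tid = threadIdx.x;

    // Grid-stride loop: each thread accumulates a private sum in a register.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x) {
        acc += in[i];
    }
    partial[tid] = acc;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One atomic per block combines the per-block results.
    if (tid == 0) atomicAdd(out, partial[0]);
}

// Hypothetical host wrapper. Device buffers are not released if an error is thrown.
float sumOnGpu(const std::vector<float>& data) {
    const int n = static_cast<int>(data.size());
    if (n == 0) return 0.0f;

    float *dIn = nullptr, *dOut = nullptr;
    check(cudaMalloc(&dIn, static_cast<size_t>(n) * sizeof(float)));
    check(cudaMalloc(&dOut, sizeof(float)));
    check(cudaMemcpy(dIn, data.data(), static_cast<size_t>(n) * sizeof(float),
                     cudaMemcpyHostToDevice));
    check(cudaMemset(dOut, 0, sizeof(float)));  // all-zero bytes represent 0.0f

    const int blocks = std::min((n + BLOCK - 1) / BLOCK, 1024);  // assumed cap
    sumKernel<<<blocks, BLOCK>>>(dIn, dOut, n);
    check(cudaGetLastError());

    float result = 0.0f;
    check(cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost));
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Because the order in which blocks reach `atomicAdd` is not fixed, the floating-point result can differ in its last bits from run to run when the partial sums are not exactly representable. That ties this pattern back to the numerical-stability topic in Phase 2.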

Phase 4: Advanced Topics and Modern AI

Deep Learning and Tensor Cores: (New in the 4th Edition) How GPUs accelerate Transformers and CNNs.
Graph Processing: Navigating irregular data structures.
Dynamic Parallelism: Kernels launching other kernels.
Multi-GPU Programming: Using NVLink and MPI.
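
For the multi-GPU item above, the simplest starting point is a single process that splits an array across every visible device. The sketch below does only that: it does not use MPI or any NVLink-specific API, and it runs unchanged on a machine with one GPU. The names `scaleOnAllGpus`, `scaleKernel`, and `check` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <stdexcept>
#include <vector>

// Hypothetical helper: turn a CUDA error code into a C++ exception.
static void check(cudaError_t err) {
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Scales `data` in place, giving each visible GPU one contiguous chunk.
// Device buffers are not released if an error is thrown.
void scaleOnAllGpus(std::vector<float>& data, float factor) {
    const int n = static_cast<int>(data.size());
    if (n == 0) return;

    int deviceCount = 0;
    check(cudaGetDeviceCount(&deviceCount));  // fails if no device is present

    const int chunk = (n + deviceCount - 1) / deviceCount;  // ceiling division
    std::vector<float*> dPtr(deviceCount, nullptr);
    std::vector<int> count(deviceCount, 0);

    // Pass 1: give every device its chunk and launch its kernel.
    // Kernel launches return immediately, so the devices compute concurrently.
    for (int d = 0; d < deviceCount; ++d) {
        const int offset = d * chunk;
        count[d] = std::max(0, std::min(chunk, n - offset));
        if (count[d] == 0) continue;
        const size_t bytes = static_cast<size_t>(count[d]) * sizeof(float);
        check(cudaSetDevice(d));  // subsequent calls target device d
        check(cudaMalloc(&dPtr[d], bytes));
        check(cudaMemcpy(dPtr[d], data.data() + offset, bytes, cudaMemcpyHostToDevice));
        scaleKernel<<<(count[d] + 255) / 256, 256>>>(dPtr[d], factor, count[d]);
        check(cudaGetLastError());
    }

    // Pass 2: collect results; each blocking copy waits for that device's kernel.
    for (int d = 0; d < deviceCount; ++d) {
        if (count[d] == 0) continue;
        const size_t bytes = static_cast<size_t>(count[d]) * sizeof(float);
        check(cudaSetDevice(d));
        check(cudaMemcpy(data.data() + d * chunk, dPtr[d], bytes, cudaMemcpyDeviceToHost));
        cudaFree(dPtr[d]);
    }
}
```

Splitting the work into a launch pass and a collect pass lets the kernels on different devices overlap. Scaling beyond one node, or exchanging data directly between GPUs, is where MPI and peer-to-peer transfers come in.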

What makes this book unique?

The 4th Edition is a masterclass in "Thinking in Parallel."

Pattern-Based Teaching: Focuses on "Algorithmic Patterns" like Reduction, Scan, and Convolution rather than just syntax.
Bridging Software and Hardware: Explains why code is slow by looking at hardware limitations (bandwidth, warp scheduling).
Focus on Modern AI: Significant new content on Tensor Cores and Mixed Precision (FP16/INT8).

The Path Beyond

Ready to turn this knowledge into a career? Check out the GPU Engineer Roadmap for advice on projects, portfolios, and interviewing.
