AI Systems (SysML)
Bridging the gap between cutting-edge algorithms and massive-scale hardware.
SOTA Roadmap
1. Scaling Infrastructure
- Cluster Orchestration: Kubernetes for ML (Kubeflow, Ray on K8s), Slurm.
- Interconnects: InfiniBand vs Ethernet (RoCEv2), NVLink/NVSwitch topology (see the NCCL bring-up sketch after this list).
- Storage: High-throughput parallel file systems and direct GPU data paths (Lustre, GPUDirect Storage).
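To ground the interconnect and orchestration bullets, here is a minimal sketch of bringing up a multi-node NCCL process group, assuming a torchrun (or Slurm-wrapped) launch that sets the usual rendezvous environment variables:

```python
# Minimal sketch: multi-node process-group bring-up with NCCL, the
# collective library that runs over NVLink/InfiniBand/RoCE.
# Assumes torchrun (or srun + env setup) provides the variables below.
import os

import torch
import torch.distributed as dist

def init_distributed() -> None:
    rank = int(os.environ["RANK"])            # set by the launcher
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    torch.cuda.set_device(local_rank)
    # NCCL picks the fastest transport it detects: NVLink/NVSwitch
    # within a node, InfiniBand or RoCE across nodes, TCP as fallback.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

if __name__ == "__main__":
    init_distributed()
    # Sanity check: one all-reduce exercises the interconnect end to end.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    assert t.item() == dist.get_world_size()
    dist.destroy_process_group()
```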
2. Distributed Training Frameworks
- 3D Parallelism: Combining data, tensor, and pipeline parallelism into a single training recipe (Megatron-LM).
- Optimization: ZeRO Stages (DeepSpeed), FSDP (Fully Sharded Data Parallel); see the FSDP sketch after this list.
- Fault Tolerance: Checkpointing strategies, auto-recovery design.
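A minimal sketch of ZeRO-3-style sharding with PyTorch FSDP; the model shape, learning rate, and batch are toy placeholders, and the launch assumptions match the sketch in section 1:

```python
# Hedged sketch: sharding a toy model with PyTorch FSDP, which shards
# parameters, gradients, and optimizer state (ZeRO-3-style) across ranks.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")  # env:// rendezvous via torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).cuda()

# Each rank keeps only a shard of the weights; full parameters are
# all-gathered just-in-time for each layer's forward/backward pass.
model = FSDP(model)
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
model(x).square().mean().backward()  # grads are reduce-scattered to shards
optim.step()                         # each rank updates only its own shard
dist.destroy_process_group()
```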
3. Inference at Scale
- Serving Engines: Comparing the TGI, vLLM, and TensorRT-LLM architectures.
- Continuous Batching: Iteration-level scheduling, introduced by Orca (see the vLLM sketch after this list).
- Prefill/Decode Disaggregation: Running prefill and decode on separate compute pools (Splitwise).
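A sketch of the vLLM offline API; the engine applies continuous (iteration-level) batching and PagedAttention internally, so the caller only submits prompts. The model name and sampling values are illustrative placeholders:

```python
# Sketch: offline generation with vLLM. Continuous batching means
# requests of different lengths share each decoding step, and finished
# sequences release their KV-cache blocks so new prefills can slot in.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["What is 3D parallelism?", "Explain continuous batching."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```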
4. Data Engineering for AI
- Dataloaders: Ray Data, MosaicML Streaming.
- Formats: Parquet, Arrow, LanceDB (see the PyArrow sketch after this list).
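A small PyArrow sketch of the round trip these formats enable: write a Parquet file, then read it back in batches, the access pattern streaming dataloaders build on. The schema and file name are made up for illustration:

```python
# Sketch: writing and batch-reading Parquet with PyArrow (Arrow in
# memory, Parquet on disk). Schema and path are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "text": ["hello world", "systems for ML"],
    "label": [0, 1],
})
pq.write_table(table, "sample.parquet")

# Stream record batches instead of loading the whole file, the pattern
# dataloaders like Ray Data and MosaicML Streaming rely on.
pf = pq.ParquetFile("sample.parquet")
for batch in pf.iter_batches(batch_size=1):
    print(batch.to_pydict())
```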
Key Resources
- Course: CS329S: Machine Learning Systems Design (Stanford/Chip Huyen).
- Blog: Chip Huyen's Blog (Real-world MLSys).
- Paper: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.