
AI Systems (SysML)

Bridging the gap between cutting-edge algorithms and massive-scale hardware.

SOTA Roadmap

1. Scaling Infrastructure

  • Cluster Orchestration: Kubernetes for ML (KubeFlow, Ray on K8s), Slurm.
  • Interconnects: InfiniBand vs Ethernet (RoCEv2), NVLink/NVSwitch topology.
  • Storage: High-performance parallel filesystems and direct I/O paths (Lustre, GPUDirect Storage).
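To reason about the interconnect choices above, it helps to estimate communication cost: a ring all-reduce of M bytes across N ranks moves roughly 2·M·(N−1)/N bytes per GPU. A minimal sketch, assuming illustrative per-GPU link bandwidths (the function names and example numbers are not from any specific library):

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: int, world_size: int) -> float:
    """Approximate per-GPU traffic for a ring all-reduce.

    Each rank sends (N-1)/N of the payload in the reduce-scatter phase
    and again in the all-gather phase, so ~2*M*(N-1)/N bytes total.
    """
    if world_size < 2:
        return 0.0
    return 2 * payload_bytes * (world_size - 1) / world_size


def allreduce_time_seconds(payload_bytes: int, world_size: int, link_gbps: float) -> float:
    """Bandwidth-bound lower bound for one all-reduce over a given link.

    link_gbps: per-GPU network bandwidth in gigabits per second
    (e.g. ~400 for NDR InfiniBand, ~100 for commodity RoCEv2 -- illustrative).
    """
    traffic = ring_allreduce_bytes_per_gpu(payload_bytes, world_size)
    return traffic * 8 / (link_gbps * 1e9)


# Example: ~14 GB of fp16 gradients (7B params) all-reduced across 64 GPUs.
grad_bytes = int(7e9 * 2)
print(f"{allreduce_time_seconds(grad_bytes, 64, 400.0):.3f} s over 400 Gb/s")
```

Estimates like this are why gradient buckets are overlapped with compute: the all-reduce cost is paid every step regardless of model quality.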

2. Distributed Training Frameworks

  • 3D Parallelism: Combining Data, Tensor, and Pipeline parallelism into an efficient recipe (Megatron-LM).
  • Optimization: ZeRO Stages (DeepSpeed), FSDP (Fully Sharded Data Parallel).
  • Fault Tolerance: Checkpointing strategies, auto-recovery design.
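The memory savings behind the ZeRO stages can be sketched with the ZeRO paper's accounting for mixed-precision Adam: 16 bytes per parameter of model state, of which each stage shards progressively more across the data-parallel group. A back-of-envelope sketch (function name is illustrative, not a DeepSpeed API):

```python
def zero_model_state_bytes_per_gpu(num_params: float, world_size: int, stage: int) -> float:
    """Approximate model-state memory per GPU for mixed-precision Adam.

    Baseline (stage 0): 2 B fp16 params + 2 B fp16 grads + 12 B optimizer
    states (fp32 param copy, momentum, variance) = 16 B/param.
    Stage 1 shards optimizer states, stage 2 also shards gradients,
    stage 3 (full sharding, as in FSDP) also shards the parameters.
    Activations and buffers are excluded from this estimate.
    """
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= world_size
    if stage >= 2:
        grads /= world_size
    if stage >= 3:
        params /= world_size
    return num_params * (params + grads + optim)


# Example: 7B parameters sharded across 64 GPUs.
for stage in range(4):
    gb = zero_model_state_bytes_per_gpu(7e9, 64, stage) / 1e9
    print(f"ZeRO-{stage}: {gb:.1f} GB/GPU")
```

At stage 3 the model state approaches 16/N bytes per parameter, which is what makes FSDP-style full sharding attractive for models that do not fit on one device.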

3. Inference at Scale

  • Serving Engines: Deeper dive into the TGI vs vLLM vs TRT-LLM architectures.
  • Continuous Batching: Iteration-level scheduling (Orca).
  • Prefill/Decode Disaggregation: Separating prefill and decode compute onto distinct machine pools (Splitwise).
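The core idea of Orca-style continuous batching is that the scheduler revisits the batch after every decode iteration rather than after every completed request: finished sequences leave immediately and queued requests join the very next step. A toy sketch of that loop, under simplifying assumptions (one token per step, no prefill cost or KV-cache limits; names are illustrative, not any engine's API):

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    tokens_left: int  # decode steps remaining until this sequence finishes


def continuous_batching(requests, max_batch: int):
    """Iteration-level scheduling sketch in the spirit of Orca."""
    waiting = deque(requests)
    running = []
    steps = 0
    completed = []
    while waiting or running:
        # Admit queued requests into freed slots before each iteration.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode iteration: every running sequence emits one token.
        for r in running:
            r.tokens_left -= 1
        steps += 1
        # Retire finished sequences immediately (no head-of-line blocking).
        completed.extend(r.rid for r in running if r.tokens_left == 0)
        running = [r for r in running if r.tokens_left > 0]
    return steps, completed


reqs = [Request(0, 5), Request(1, 2), Request(2, 4)]
print(continuous_batching(reqs, max_batch=2))  # request 2 starts as soon as 1 finishes
```

With static batching, request 2 would wait for the entire first batch to drain; here it is admitted the iteration after request 1 completes, which is the throughput win engines like vLLM build on.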

4. Data Engineering for AI

  • Dataloaders: Ray Data, MosaicML Streaming.
  • Formats: Parquet, Arrow, LanceDB.
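Streaming dataloaders such as MosaicML Streaming make each rank's sample stream a deterministic function of (dataset, world size, rank), so workers need no runtime coordination and resumption is trivial. A minimal sketch of that partitioning idea (illustrative only, not the library's actual API):

```python
import itertools


def shard_iter(sample_ids, world_size: int, rank: int):
    """Assign each rank a disjoint, evenly interleaved slice of the dataset.

    Because the assignment depends only on (sample_ids, world_size, rank),
    every data-parallel worker can compute its own stream independently,
    and restarting from sample k is just re-slicing from the same inputs.
    """
    return itertools.islice(sample_ids, rank, None, world_size)


# Example: 10 samples split across 4 data-parallel ranks.
for rank in range(4):
    print(rank, list(shard_iter(range(10), 4, rank)))
```

Real streaming loaders layer shuffling, shard files, and remote fetch on top, but the deterministic rank-to-samples mapping is the property that makes elastic resumption possible.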

Key Resources