# Machine Learning Practice
This section focuses on implementation details, best practices, and code snippets.
## SOTA Roadmap

### 1. Distributed Training
- Parallelism: Data Parallel (DDP/FSDP), Tensor Parallel (TP), Pipeline Parallel (PP).
- Infrastructure: DeepSpeed, Megatron-LM, DTensor (PyTorch).
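The data-parallel (DDP) pattern above can be sketched in a few lines. The snippet below is illustrative only: it spins up a single-process `gloo` group so it runs on CPU, whereas real jobs launch one process per GPU via `torchrun` with the `nccl` backend.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train_step():
    # Single-process "gloo" group for illustration; real runs use
    # torchrun with one process per GPU and the "nccl" backend.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(10, 1)
    # DDP all-reduces gradients across ranks during backward()
    ddp_model = DDP(model)
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()
    opt.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    train_step()
```

FSDP follows the same wrapping pattern but additionally shards parameters, gradients, and optimizer state across ranks.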
### 2. High-Performance Kernels
- Triton: a Python-based DSL for writing custom GPU kernels (an alternative to hand-written CUDA).
- FlashAttention: IO-aware exact attention.
- Kernel Fusion: torch.compile (Inductor).
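FlashAttention-style kernels are exposed in PyTorch through `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a fused IO-aware kernel when the hardware supports one and falls back to an exact math implementation otherwise. A minimal sketch, checked against hand-written attention:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(2, 4, 16, 8)
k = torch.randn(2, 4, 16, 8)
v = torch.randn(2, 4, 16, 8)

# Fused path: selects FlashAttention / memory-efficient kernels when
# available; the result is exact attention either way.
out = F.scaled_dot_product_attention(q, k, v)

# Reference: the same attention written out by hand.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
ref = scores.softmax(dim=-1) @ v

assert torch.allclose(out, ref, atol=1e-4)
```

The gain from the fused kernel is memory traffic, not math: the full `seq_len × seq_len` score matrix is never materialized in global memory.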
### 3. MLOps for LLMs (LLMOps)
- Evaluation: Ragas, TruLens.
- Serving: vLLM, TGI, SGLang.
## Key Resources
- Guide: Effective PyTorch.
- Book: Machine Learning Engineering (Andriy Burkov).
- Repo: Hugging Face Transformers.
## Efficient Data Loading
When training large models, data loading can become a bottleneck. Here is a comparison of standard vs optimized loading patterns.
```python
import torch
from torch.utils.data import DataLoader, Dataset


class SimpleDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


# Standard loader
loader = DataLoader(
    SimpleDataset(range(1000)),
    batch_size=32,
    shuffle=True,
)
```
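The optimized variant below enables worker processes, pinned memory, and prefetching. The specific values are illustrative starting points, not tuned recommendations; `num_workers` in particular should be adjusted per machine.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SimpleDataset(Dataset):
    def __init__(self, data):
        self.data = list(data)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


# Optimized loader: worker processes overlap data preparation with
# compute, pinned memory speeds up host-to-device copies, and
# persistent workers avoid re-forking processes every epoch.
fast_loader = DataLoader(
    SimpleDataset(range(1000)),
    batch_size=32,
    shuffle=True,
    num_workers=2,                         # illustrative; tune per machine
    pin_memory=torch.cuda.is_available(),  # only helps when copying to GPU
    persistent_workers=True,               # requires num_workers > 0
    prefetch_factor=2,                     # batches pre-loaded per worker
)

total = sum(batch.numel() for batch in fast_loader)
```

With `num_workers=0` (the standard loader above), batches are built in the training process itself, so the GPU sits idle during data preparation; worker processes hide that latency.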
## Fun with Python
Sometimes we need to remember the roots of our tools.
"""
The Antigravity Module.
A classic Python Easter egg.
"""
import antigravity
def fly():
print("Flying with Python!")
# This module opens a web browser to the XKCD comic about Python.
# https://xkcd.com/353/
if __name__ == "__main__":
fly()
This snippet is loaded dynamically from docs/snippets/antigravity.py!