Machine Learning Practice

This section focuses on implementation details, best practices, and code snippets.

SOTA Roadmap

1. Distributed Training

  • Parallelism: Data Parallel (DDP/FSDP), Tensor Parallel (TP), Pipeline Parallel (PP); a minimal DDP sketch follows this list.
  • Infrastructure: DeepSpeed, Megatron-LM, DTensor (PyTorch).
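
As a concrete starting point, here is a minimal Data Parallel (DDP) sketch. It is a sketch only, assuming a single-node launch via torchrun; the toy model, shapes, learning rate, and script name are placeholders.

# Minimal DDP training loop. Launch with: torchrun --nproc_per_node=2 <this_file>.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE and the rendezvous env vars.
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU nodes
    model = torch.nn.Linear(10, 10)
    ddp_model = DDP(model)  # gradients are all-reduced across ranks
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(3):
        optimizer.zero_grad()
        loss = ddp_model(torch.randn(8, 10)).sum()
        loss.backward()   # backward() triggers the gradient synchronization
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()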

2. High-Performance Kernels

  • Triton: A Python DSL for writing custom GPU kernels.
  • FlashAttention: IO-Aware exact attention.
  • Kernel Fusion: torch.compile (Inductor); see the sketch after this list.
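
To make the kernel-fusion point concrete, here is a small torch.compile sketch. The gelu_mlp function and the tensor shapes are illustrative, not taken from any particular codebase.

import torch

def gelu_mlp(x, w1, w2):
    # Two matmuls plus a pointwise activation: a good candidate for fusion.
    return torch.nn.functional.gelu(x @ w1) @ w2

# torch.compile traces the function and lets Inductor fuse pointwise ops,
# generating Triton kernels on CUDA (or C++ loops on CPU).
compiled = torch.compile(gelu_mlp)

x = torch.randn(64, 128)
w1 = torch.randn(128, 256)
w2 = torch.randn(256, 128)
out = compiled(x, w1, w2)  # first call compiles; later calls reuse the compiled code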

3. MLOps for LLMs (LLMOps)

  • Evaluation: Ragas, TruLens.
  • Serving: vLLM, TGI, SGLang; see the minimal vLLM example after this list.
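
As one serving example, here is a minimal vLLM offline-inference sketch, assuming vllm is installed; the model id, prompt, and sampling settings are placeholders.

from vllm import LLM, SamplingParams

# Load any Hugging Face model id supported by vLLM (small placeholder model here).
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Explain kernel fusion in one sentence."], params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)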

Key Resources

Efficient Data Loading

When training large models, data loading can become a bottleneck. Below is a comparison of a standard loader and an optimized configuration.

import torch
from torch.utils.data import DataLoader, Dataset

class SimpleDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

# Standard loader: single-process loading with default settings (num_workers=0)
loader = DataLoader(
    SimpleDataset(range(1000)),
    batch_size=32,
    shuffle=True,
)
import torch
from torch.utils.data import DataLoader

dataset = SimpleDataset(range(1000))  # same toy dataset as above

# Optimized loader config
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,           # parallelize reading across worker processes
    pin_memory=True,         # page-locked host memory for faster CUDA transfers
    prefetch_factor=2        # batches pre-loaded per worker (requires num_workers > 0)
)
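
A typical consumption loop, assuming a CUDA device is available, pairs pin_memory=True with non_blocking=True so host-to-device copies can overlap with computation:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for batch in loader:
    # Pinned host memory plus non_blocking=True makes the copy asynchronous.
    batch = batch.to(device, non_blocking=True)
    # ... forward/backward pass on `batch` ...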

Fun with Python

Sometimes we need to remember the roots of our tools.

"""
The Antigravity Module.
A classic Python Easter egg.
"""

import antigravity

def fly():
    print("Flying with Python!")
    # This module opens a web browser to the XKCD comic about Python.
    # https://xkcd.com/353/

if __name__ == "__main__":
    fly()

This snippet is loaded dynamically from docs/snippets/antigravity.py!