Trustworthy AI
Research on bias, fairness, robustness, and interpretability in AI systems.
SOTA Roadmap
1. Mechanistic Interpretability
- Reverse Engineering LLMs: Induction Heads, Superposition, Monosemanticity.
- Probing: Linear Probing, Activation Steering (Representation Engineering).
- Dictionary Learning: Sparse Autoencoders (SAE) for feature extraction (Anthropic research).
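Of the techniques above, linear probing is the simplest to demonstrate: train a linear classifier on model activations to test whether a concept is linearly represented. The sketch below uses synthetic activations with a planted "concept direction" in place of real residual-stream activations; the data, dimensions, and training loop are illustrative assumptions, not any particular paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations: a hidden
# "concept direction" is added to positive examples (hypothetical data).
d_model, n = 64, 512
concept_dir = rng.normal(size=d_model)
concept_dir /= np.linalg.norm(concept_dir)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 2.0 * labels[:, None] * concept_dir

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe: can a linear direction predict the concept?"""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient of log-loss
        b -= lr * np.mean(p - y)
    return w, b

w, b = train_linear_probe(acts, labels)
accuracy = (((acts @ w + b) > 0) == labels).mean()
```

If the probe reaches high accuracy, the concept is (approximately) linearly decodable; the learned `w` can then double as a steering vector, which is the core idea behind activation steering.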
2. Adversarial Robustness & Safety
- Jailbreaking: GCG (Greedy Coordinate Gradient), PAIR, TAP (Tree of Attacks).
- Defenses: LlamaGuard, NeMo Guardrails, Circuit Breakers.
- Red Teaming: Automated Red Teaming via LLMs.
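Guardrail defenses like the ones listed share a common shape: classify the input, run the model, classify the output, and refuse if either check fails. The sketch below shows that pipeline with stub components; the keyword blocklist and echo model are placeholders (real systems such as LlamaGuard use a safety-tuned classifier model, not string matching).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GuardedModel:
    """Minimal input/output guardrail wrapper (illustrative pattern only)."""
    model: Callable[[str], str]          # the LLM being wrapped
    input_check: Callable[[str], bool]   # True -> prompt is allowed
    output_check: Callable[[str], bool]  # True -> response is allowed
    refusal: str = "Sorry, I can't help with that."

    def __call__(self, prompt: str) -> str:
        if not self.input_check(prompt):
            return self.refusal
        response = self.model(prompt)
        return response if self.output_check(response) else self.refusal

# Stub components for illustration (hypothetical blocklist and model).
BLOCKLIST = ("make a bomb", "synthesize vx")
def looks_safe(text: str) -> bool:
    return not any(k in text.lower() for k in BLOCKLIST)

def echo_model(prompt: str) -> str:
    return f"Echo: {prompt}"

guarded = GuardedModel(model=echo_model,
                       input_check=looks_safe,
                       output_check=looks_safe)
```

Attacks like GCG and PAIR target exactly this setup: they search for prompts that pass the input check yet still elicit unsafe completions, which is why output-side checking matters too.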
3. Evaluation & Benchmarks
- Benchmarks: MMLU-Pro, GPQA (graduate-level, "Google-proof" QA), MATH.
- Safety Benchmarks: TruthfulQA, RealToxicityPrompts, Do-Not-Answer.
- LLM-as-a-Judge: MT-Bench, AlpacaEval 2.0.
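The LLM-as-a-judge protocol behind MT-Bench-style evaluation is easy to sketch: ask a judge model to compare two answers, then ask again with the positions swapped to control for position bias, and declare a tie on disagreement. The judge template and the toy length-preferring judge below are illustrative stand-ins, not any benchmark's actual prompt.

```python
# Hypothetical judge prompt; real judges use much more detailed rubrics.
JUDGE_TEMPLATE = (
    "Compare the two answers to the question and reply with 'A', 'B', or 'tie'.\n"
    "Question: {q}\nAnswer A: {a}\nAnswer B: {b}"
)

def judge_pair(call_judge, question, ans1, ans2):
    """Query the judge in both orders to mitigate position bias."""
    v1 = call_judge(JUDGE_TEMPLATE.format(q=question, a=ans1, b=ans2))
    v2 = call_judge(JUDGE_TEMPLATE.format(q=question, a=ans2, b=ans1))
    v2 = {"A": "B", "B": "A", "tie": "tie"}[v2]  # map back to original labels
    return v1 if v1 == v2 else "tie"             # orders disagree -> tie

def length_judge(prompt):
    # Toy judge that prefers the longer answer (a documented judge bias).
    a = prompt.split("Answer A: ")[1].split("\nAnswer B: ")[0]
    b = prompt.split("Answer B: ")[1]
    return "A" if len(a) >= len(b) else "B"
```

The position swap catches exactly the failure the toy judge exhibits: a biased verdict that flips when the answers trade places collapses to a tie instead of a spurious win.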
4. Detection & Watermarking
- Watermarking: Tree-Ring Watermarking, Distillation-Resistant Watermarking.
- Hallucination: Detection via Uncertainty (SelfCheckGPT), RAG-based Fact Checking.
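The SelfCheckGPT idea above is that a hallucinated claim will not be consistently supported when the model's answer is resampled several times. The sketch below scores a claim against resampled responses; plain word overlap is a simplifying assumption (the actual method uses NLI, QA, or n-gram scoring), and the threshold is arbitrary.

```python
import re

def consistency_score(claim: str, samples: list) -> float:
    """Mean lexical support for `claim` across resampled responses.
    Word overlap is a stand-in for SelfCheckGPT's NLI/QA scorers."""
    claim_words = set(re.findall(r"\w+", claim.lower()))
    def support(sample):
        sample_words = set(re.findall(r"\w+", sample.lower()))
        return len(claim_words & sample_words) / max(len(claim_words), 1)
    return sum(support(s) for s in samples) / len(samples)

def flag_hallucination(claim, samples, threshold=0.5):
    # Low consistency across samples -> likely hallucination.
    return consistency_score(claim, samples) < threshold
```

In practice the `samples` come from re-querying the model at a nonzero temperature on the same prompt, so no external knowledge base is needed, unlike RAG-based fact checking.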
Key Resources
- Journal: Transformer Circuits Thread (Anthropic's interpretability research).
- Course: AI Safety Fundamentals (BlueDot Impact).
- Benchmark: Chatbot Arena Leaderboard.