Trustworthy AI
Research on bias, fairness, robustness, and interpretability in AI systems.
SOTA Roadmap
1. Mechanistic Interpretability
- Reverse Engineering LLMs: Induction Heads, Superposition, Monosemanticity.
- Probing: Linear Probing, Activation Steering (Representation Engineering).
- Dictionary Learning: Sparse Autoencoders (SAE) for feature extraction (Anthropic research).
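Of the techniques above, linear probing is the simplest to demonstrate: train a linear classifier on model activations to test whether a concept is linearly represented. The sketch below uses synthetic activations with a planted "concept direction" in place of real residual-stream activations; the data, dimensions, and training loop are illustrative assumptions, not any particular paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations: a hidden
# "concept direction" is added to positive examples (hypothetical data).
d_model, n = 64, 512
concept_dir = rng.normal(size=d_model)
concept_dir /= np.linalg.norm(concept_dir)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 2.0 * labels[:, None] * concept_dir

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe: can a linear direction predict the concept?"""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient of log-loss
        b -= lr * np.mean(p - y)
    return w, b

w, b = train_linear_probe(acts, labels)
accuracy = (((acts @ w + b) > 0) == labels).mean()
```

If the probe reaches high accuracy, the concept is (approximately) linearly decodable; the learned `w` can then double as a steering vector, which is the core idea behind activation steering.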
2. Adversarial Robustness & Safety
- Jailbreaking: GCG (Greedy Coordinate Gradient), PAIR, TAP (Tree of Attacks).
- Defenses: LlamaGuard, NeMo Guardrails, Circuit Breakers.
- Red Teaming: Automated Red Teaming via LLMs.
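Guardrail defenses like the ones listed share a common shape: classify the input, run the model, classify the output, and refuse if either check fails. The sketch below shows that pipeline with stub components; the keyword blocklist and echo model are placeholders (real systems such as LlamaGuard use a safety-tuned classifier model, not string matching).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GuardedModel:
    """Minimal input/output guardrail wrapper (illustrative pattern only)."""
    model: Callable[[str], str]          # the LLM being wrapped
    input_check: Callable[[str], bool]   # True -> prompt is allowed
    output_check: Callable[[str], bool]  # True -> response is allowed
    refusal: str = "Sorry, I can't help with that."

    def __call__(self, prompt: str) -> str:
        if not self.input_check(prompt):
            return self.refusal
        response = self.model(prompt)
        return response if self.output_check(response) else self.refusal

# Stub components for illustration (hypothetical blocklist and model).
BLOCKLIST = ("make a bomb", "synthesize vx")
def looks_safe(text: str) -> bool:
    return not any(k in text.lower() for k in BLOCKLIST)

def echo_model(prompt: str) -> str:
    return f"Echo: {prompt}"

guarded = GuardedModel(model=echo_model,
                       input_check=looks_safe,
                       output_check=looks_safe)
```

Attacks like GCG and PAIR target exactly this setup: they search for prompts that pass the input check yet still elicit unsafe completions, which is why output-side checking matters too.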
3. Evaluation & Benchmarks
- Benchmarks: MMLU-Pro, GPQA (graduate-level, "Google-proof" QA), MATH.
- Safety Benchmarks: TruthfulQA, RealToxicityPrompts, Do-Not-Answer.
- LLM-as-a-Judge: MT-Bench, AlpacaEval 2.0.
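The LLM-as-a-judge protocol behind MT-Bench-style evaluation is easy to sketch: ask a judge model to compare two answers, then ask again with the positions swapped to control for position bias, and declare a tie on disagreement. The judge template and the toy length-preferring judge below are illustrative stand-ins, not any benchmark's actual prompt.

```python
# Hypothetical judge prompt; real judges use much more detailed rubrics.
JUDGE_TEMPLATE = (
    "Compare the two answers to the question and reply with 'A', 'B', or 'tie'.\n"
    "Question: {q}\nAnswer A: {a}\nAnswer B: {b}"
)

def judge_pair(call_judge, question, ans1, ans2):
    """Query the judge in both orders to mitigate position bias."""
    v1 = call_judge(JUDGE_TEMPLATE.format(q=question, a=ans1, b=ans2))
    v2 = call_judge(JUDGE_TEMPLATE.format(q=question, a=ans2, b=ans1))
    v2 = {"A": "B", "B": "A", "tie": "tie"}[v2]  # map back to original labels
    return v1 if v1 == v2 else "tie"             # orders disagree -> tie

def length_judge(prompt):
    # Toy judge that prefers the longer answer (a documented judge bias).
    a = prompt.split("Answer A: ")[1].split("\nAnswer B: ")[0]
    b = prompt.split("Answer B: ")[1]
    return "A" if len(a) >= len(b) else "B"
```

The position swap catches exactly the failure the toy judge exhibits: a biased verdict that flips when the answers trade places collapses to a tie instead of a spurious win.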
4. Detection & Watermarking
- Watermarking: Tree-Ring Watermarking, Distillation-Resistant Watermarking.
- Hallucination: Detection via Uncertainty (SelfCheckGPT), RAG-based Fact Checking.
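The SelfCheckGPT idea above is that a hallucinated claim will not be consistently supported when the model's answer is resampled several times. The sketch below scores a claim against resampled responses; plain word overlap is a simplifying assumption (the actual method uses NLI, QA, or n-gram scoring), and the threshold is arbitrary.

```python
import re

def consistency_score(claim: str, samples: list) -> float:
    """Mean lexical support for `claim` across resampled responses.
    Word overlap is a stand-in for SelfCheckGPT's NLI/QA scorers."""
    claim_words = set(re.findall(r"\w+", claim.lower()))
    def support(sample):
        sample_words = set(re.findall(r"\w+", sample.lower()))
        return len(claim_words & sample_words) / max(len(claim_words), 1)
    return sum(support(s) for s in samples) / len(samples)

def flag_hallucination(claim, samples, threshold=0.5):
    # Low consistency across samples -> likely hallucination.
    return consistency_score(claim, samples) < threshold
```

In practice the `samples` come from re-querying the model at a nonzero temperature on the same prompt, so no external knowledge base is needed, unlike RAG-based fact checking.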
Key Resources
- Journal: Transformer Circuits Thread (Anthropic's interpretability research).
- Course: AI Safety Fundamentals (BlueDot Impact).
- Benchmark: Chatbot Arena Leaderboard.