Trustworthy AI

Research on bias, fairness, robustness, and interpretability in AI systems.

SOTA Roadmap

1. Mechanistic Interpretability

  • Reverse Engineering LLMs: Induction Heads, Superposition, Monosemanticity.
  • Probing & Steering: Linear Probes, Activation Steering (Representation Engineering).
  • Dictionary Learning: Sparse Autoencoders (SAEs) for feature extraction (Anthropic's monosemanticity work); a minimal SAE sketch follows this list.
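
To make the dictionary-learning idea concrete, here is a minimal sparse autoencoder sketch in PyTorch. The width, expansion factor, and L1 coefficient are illustrative assumptions, not values from any published SAE; real pipelines train on large buffers of cached residual-stream activations rather than random tensors.

```python
# Minimal SAE sketch: learn an overcomplete, sparse dictionary of features
# from model activations. All hyperparameters below are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_dict: int = 8 * 768, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)  # features -> reconstructed activations
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # non-negative, encouraged to be sparse
        recon = self.decoder(features)
        recon_loss = (recon - acts).pow(2).mean()  # reconstruction fidelity
        sparsity_loss = features.abs().mean()      # L1 penalty: few features fire per input
        return recon, recon_loss + self.l1_coeff * sparsity_loss

sae = SparseAutoencoder()
acts = torch.randn(32, 768)  # stand-in for cached residual-stream activations
recon, loss = sae(acts)
loss.backward()
```

After training, interpretability work proceeds by inspecting which inputs maximally activate each learned dictionary feature.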

2. Adversarial Robustness & Safety

  • Jailbreaking: GCG (Greedy Coordinate Gradient), PAIR, TAP (Tree of Attacks with Pruning).
  • Defenses: Llama Guard, NeMo Guardrails, Circuit Breakers; a guard-style input-screening sketch follows this list.
  • Red Teaming: Automated Red Teaming via LLMs.
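
A common defensive pattern here is input screening with a dedicated guard model before the prompt reaches the main model. The sketch below assumes the meta-llama/Llama-Guard-3-8B checkpoint (gated; requires access approval on Hugging Face) and the "safe"/"unsafe" first-line output format described on its model card; verify both for the version you deploy.

```python
# Guard-style screening sketch: a safety classifier labels user prompts.
# Checkpoint name and output format are assumptions from Meta's model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

GUARD_MODEL = "meta-llama/Llama-Guard-3-8B"  # assumed, gated checkpoint

tokenizer = AutoTokenizer.from_pretrained(GUARD_MODEL)
model = AutoModelForCausalLM.from_pretrained(GUARD_MODEL, torch_dtype=torch.bfloat16)

def screen(prompt: str) -> bool:
    """Return True if the guard model labels the prompt safe."""
    chat = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt")
    output = model.generate(input_ids, max_new_tokens=20, do_sample=False)
    # Decode only the newly generated tokens; Llama Guard's first line is
    # "safe" or "unsafe" (followed by a category code when unsafe).
    verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().startswith("safe")

print(screen("What's the capital of France?"))  # expected: True
```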

3. Evaluation & Benchmarks

  • Benchmarks: MMLU-Pro, GPQA (graduate-level, Google-proof Q&A), MATH.
  • Safety Benchmarks: TruthfulQA, RealToxicityPrompts, Do-Not-Answer.
  • LLM-as-a-Judge: MT-Bench, AlpacaEval 2.0; a pairwise-judge sketch follows this list.
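
As a sketch of the LLM-as-a-judge pattern behind MT-Bench and AlpacaEval: a strong model compares two candidate answers, and each pair is judged twice with positions swapped to control for position bias. `call_llm` is a hypothetical stand-in for whatever completion API you use, and the prompt wording is illustrative, not an official benchmark template.

```python
# Pairwise LLM-as-a-judge sketch. `call_llm` is a hypothetical callable:
# prompt string in, completion string out.
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to
the user question below and answer with exactly "A", "B", or "tie".

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}

Verdict:"""

def judge_pair(call_llm, question: str, answer_a: str, answer_b: str) -> str:
    prompt = JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)
    verdict = call_llm(prompt).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"  # tie on parse failure

def judge_debiased(call_llm, question: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings; keep the verdict only if it is order-invariant."""
    v1 = judge_pair(call_llm, question, answer_a, answer_b)
    v2 = judge_pair(call_llm, question, answer_b, answer_a)
    swapped = {"A": "B", "B": "A", "TIE": "TIE"}[v2]
    return v1 if v1 == swapped else "TIE"
```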

4. Detection & Watermarking

  • Watermarking: Tree-Ring Watermarks (diffusion models), Distillation-Resistant Watermarking.
  • Hallucination: detection via uncertainty and sampling consistency (SelfCheckGPT), RAG-based fact checking; a consistency-scoring sketch follows this list.
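
To illustrate the SelfCheckGPT idea: sample the model several times at nonzero temperature and flag answer sentences that the samples fail to support, on the premise that hallucinated content is unstable across samples. Real SelfCheckGPT scores support with NLI, BERTScore, or n-gram models; the unigram-overlap proxy below is a simplification chosen to keep the sketch dependency-free.

```python
# Simplified SelfCheckGPT-style consistency check. The unigram-overlap
# support score is an illustrative proxy, not the paper's scorer.
import re

def unigram_support(sentence: str, sample: str) -> float:
    words = set(re.findall(r"\w+", sentence.lower()))
    sample_words = set(re.findall(r"\w+", sample.lower()))
    return len(words & sample_words) / max(len(words), 1)

def hallucination_scores(answer_sentences, samples):
    """Higher score = less supported by resampled outputs = more suspect."""
    return [
        1.0 - max(unigram_support(sent, s) for s in samples)
        for sent in answer_sentences
    ]

sentences = ["Marie Curie won two Nobel Prizes.", "She was born in 1903."]
samples = [
    "Marie Curie won Nobel Prizes in physics and chemistry.",
    "Curie, born in 1867, won two Nobel Prizes.",
]
print(hallucination_scores(sentences, samples))  # second sentence scores higher
```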

Key Resources