RLHF and Human Preference
RLHF helps models better follow human preferences and instructions, but preference optimization does not by itself provide an operational test of whether deployed behavior stays centered on a product or governance objective over time.
Alignment Theory treats RLHF as part of the broader landscape while focusing on post-deployment behavioral QA.
Constitutional AI
Anthropic's Constitutional AI work helps frame principle-based alignment: model behavior can be shaped by explicit rules and critiques rather than by direct preference labels alone.
Alignment Theory is compatible with principle-based systems, but asks a different operational question: after principles and constraints are in place, does the system drift within the allowed zone?
Scalable Oversight
Scalable oversight addresses the problem of evaluating model behavior when direct human supervision is expensive, slow, or insufficiently expert.
The Alignment Theory contribution is to define drift categories and correction routes that can focus oversight on meaningful deviations rather than raw output volume.
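As a hedged illustration of how such categories and routes might be expressed, the sketch below uses invented category names and correction routes; none of these identifiers come from the Alignment Theory materials or from published scalable-oversight work.

from enum import Enum, auto

class DriftCategory(Enum):
    # Illustrative categories only; the names are assumptions for this sketch.
    NONE = auto()       # behavior stays centered on the stated objective
    TONE = auto()       # style or tone shifts while content stays on-objective
    SCOPE = auto()      # answers wander into topics outside the objective
    OBJECTIVE = auto()  # responses start optimizing for a different goal

# Hypothetical correction routes keyed by category, so human oversight is
# spent on meaningful deviations rather than on reviewing raw output volume.
CORRECTION_ROUTES = {
    DriftCategory.NONE: "no action",
    DriftCategory.TONE: "refresh style guidance in the system prompt",
    DriftCategory.SCOPE: "tighten routing or retrieval constraints",
    DriftCategory.OBJECTIVE: "escalate to human review and re-evaluation",
}

def correction_route(category: DriftCategory) -> str:
    # Map a detected drift category to the oversight action it should trigger.
    return CORRECTION_ROUTES[category]

The point of the mapping is that only the rarer, higher-severity categories reach a human reviewer; the rest are handled by cheaper automated corrections.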
Interpretability
Interpretability research examines internal model mechanisms and representations. Anthropic's interpretability program is especially important for understanding how model internals may support or undermine safe behavior.
Alignment Theory is behavior-first. It does not replace mechanistic interpretability; it supplies a production-facing layer for detecting drift in observable prompt-output behavior.
Model Behavior Specifications
OpenAI's Model Spec helps define desired assistant behavior, including how a system should respond under competing instructions, policies, and user goals.
Alignment Theory shares the same broad need for behavioral specification, but emphasizes objective state, detector categories, and correction once allowed-but-off-center behavior appears.
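A minimal sketch of what an "objective state" record could look like, assuming a small set of invented field names; nothing here is drawn from the Model Spec or from a published Alignment Theory schema.

from dataclasses import dataclass, field

@dataclass
class ObjectiveState:
    # Illustrative record of the objective a deployed assistant should stay
    # centered on; every field name here is an assumption for this sketch.
    objective: str                                              # plain-language statement of the goal
    allowed_behaviors: list[str] = field(default_factory=list)  # what the system may do
    center_examples: list[str] = field(default_factory=list)    # examples of on-center responses
    drift_tolerance: float = 0.15                               # how far off-center is still acceptable

# "Allowed but off-center" means a response can satisfy every item in
# allowed_behaviors and still sit near the edge of drift_tolerance; that is
# the condition a detector in this layer is meant to flag and correct.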
Runtime Monitoring
Runtime monitoring, logging, observability, and eval frameworks are necessary for deployed AI systems. They show what happened and can detect many known failure modes.
The drift detection gap is the missing layer between ordinary pass/fail compliance and long-term objective fidelity.
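To make the distinction concrete, here is a minimal sketch assuming a toy per-response compliance check and a rolling average of a hypothetical off-center score; both scoring functions are placeholders invented for this example, not metrics defined by any of the frameworks above.

from collections import deque

def passes_policy(response: str) -> bool:
    # Ordinary pass/fail compliance check; toy placeholder logic.
    return "forbidden phrase" not in response.lower()

def off_center_score(response: str, objective: str) -> float:
    # Toy stand-in for an off-center measure: fraction of response words that
    # do not appear in the objective statement. A real detector would use
    # embeddings or a judge model; this exists only so the sketch runs.
    objective_words = set(objective.lower().split())
    words = response.lower().split()
    if not words:
        return 0.0
    return sum(w not in objective_words for w in words) / len(words)

class DriftWindow:
    # Rolling window over recent compliant responses: each one can pass the
    # policy check individually while the average off-center score trends up.
    def __init__(self, objective: str, size: int = 200, threshold: float = 0.8):
        self.objective = objective
        self.scores = deque(maxlen=size)
        self.threshold = threshold  # arbitrary cutoff for this sketch

    def observe(self, response: str) -> bool:
        # Record a response and report whether long-run drift is flagged.
        if passes_policy(response):
            self.scores.append(off_center_score(response, self.objective))
        if not self.scores:
            return False
        return sum(self.scores) / len(self.scores) > self.threshold

Every response in such a stream can clear the per-output check while the window average still crosses the threshold, which is exactly the gap between pass/fail compliance and long-term objective fidelity.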
The Drift Detection Gap
OpenAI's alignment work helps frame the problem of aligning models to human intent and of identifying where methods scale or break. Anthropic's Constitutional AI and interpretability work help frame principle-based and mechanistic approaches.
Alignment Theory contributes a runtime behavioral drift detection and realignment layer for deployed systems.
How to Cite
Michael Bower. (2026). Literature Review: AI Alignment Approaches and the Drift Detection Gap. AlignmentTheory.org. https://alignmenttheory.org/pages/ai-alignment-literature-review.html
@misc{bower2026aialignmentliteraturereview,
  author = {Bower, Michael},
  title = {Literature Review: AI Alignment Approaches and the Drift Detection Gap},
  year = {2026},
  howpublished = {AlignmentTheory.org},
  url = {https://alignmenttheory.org/pages/ai-alignment-literature-review.html}
}