RLHF and Human Preference
RLHF helps models better follow human preferences and instructions, but preference optimization does not by itself provide an operational test of whether deployed behavior stays centered on a product or governance objective over time.
Alignment Theory treats RLHF as part of the broader landscape while focusing on post-deployment behavioral QA.
Constitutional AI
Anthropic's Constitutional AI work helps frame principle-based alignment: model behavior can be shaped by explicit rules and critiques rather than by direct preference labels alone.
Alignment Theory is compatible with principle-based systems, but asks a different operational question: after principles and constraints are in place, does the system drift within the allowed zone?
Scalable Oversight
Scalable oversight addresses the problem of evaluating model behavior when direct human supervision is expensive, slow, or insufficiently expert.
The Alignment Theory contribution is to define drift categories and correction routes that can focus oversight on meaningful deviations rather than raw output volume.
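As a hedged illustration of how such categories and routes might be expressed, the sketch below uses invented category names and correction routes; none of these identifiers come from the Alignment Theory materials or from published scalable-oversight work.

from enum import Enum, auto

class DriftCategory(Enum):
    # Illustrative categories only; the names are assumptions for this sketch.
    NONE = auto()       # behavior stays centered on the stated objective
    TONE = auto()       # style or tone shifts while content stays on-objective
    SCOPE = auto()      # answers wander into topics outside the objective
    OBJECTIVE = auto()  # responses start optimizing for a different goal

# Hypothetical correction routes keyed by category, so human oversight is
# spent on meaningful deviations rather than on reviewing raw output volume.
CORRECTION_ROUTES = {
    DriftCategory.NONE: "no action",
    DriftCategory.TONE: "refresh style guidance in the system prompt",
    DriftCategory.SCOPE: "tighten routing or retrieval constraints",
    DriftCategory.OBJECTIVE: "escalate to human review and re-evaluation",
}

def correction_route(category: DriftCategory) -> str:
    # Map a detected drift category to the oversight action it should trigger.
    return CORRECTION_ROUTES[category]

The point of the mapping is that only the rarer, higher-severity categories reach a human reviewer; the rest are handled by cheaper automated corrections.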
Interpretability
Interpretability research examines internal model mechanisms and representations. Anthropic's interpretability program is especially important for understanding how model internals may support or undermine safe behavior.
Alignment Theory is behavior-first. It does not replace mechanistic interpretability; it supplies a production-facing layer for detecting drift in observable prompt-output behavior.
Model Behavior Specifications
OpenAI's Model Spec helps define desired assistant behavior, including how a system should respond under competing instructions, policies, and user goals.
Alignment Theory shares the same broad need for behavioral specification, but emphasizes objective state, detector categories, and correction once allowed-but-off-center behavior appears.
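A minimal sketch of what an "objective state" record could look like, assuming a small set of invented field names; nothing here is drawn from the Model Spec or from a published Alignment Theory schema.

from dataclasses import dataclass, field

@dataclass
class ObjectiveState:
    # Illustrative record of the objective a deployed assistant should stay
    # centered on; every field name here is an assumption for this sketch.
    objective: str                                              # plain-language statement of the goal
    allowed_behaviors: list[str] = field(default_factory=list)  # what the system may do
    center_examples: list[str] = field(default_factory=list)    # examples of on-center responses
    drift_tolerance: float = 0.15                               # how far off-center is still acceptable

# "Allowed but off-center" means a response can satisfy every item in
# allowed_behaviors and still sit near the edge of drift_tolerance; that is
# the condition a detector in this layer is meant to flag and correct.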
Runtime Monitoring
Runtime monitoring, logging, observability, and eval frameworks are necessary for deployed AI systems. They show what happened and can detect many known failure modes.
The drift detection gap is the missing layer between ordinary pass/fail compliance and long-term objective fidelity.
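To make the distinction concrete, here is a minimal sketch assuming a toy per-response compliance check and a rolling average of a hypothetical off-center score; both scoring functions are placeholders invented for this example, not metrics defined by any of the frameworks above.

from collections import deque

def passes_policy(response: str) -> bool:
    # Ordinary pass/fail compliance check; toy placeholder logic.
    return "forbidden phrase" not in response.lower()

def off_center_score(response: str, objective: str) -> float:
    # Toy stand-in for an off-center measure: fraction of response words that
    # do not appear in the objective statement. A real detector would use
    # embeddings or a judge model; this exists only so the sketch runs.
    objective_words = set(objective.lower().split())
    words = response.lower().split()
    if not words:
        return 0.0
    return sum(w not in objective_words for w in words) / len(words)

class DriftWindow:
    # Rolling window over recent compliant responses: each one can pass the
    # policy check individually while the average off-center score trends up.
    def __init__(self, objective: str, size: int = 200, threshold: float = 0.8):
        self.objective = objective
        self.scores = deque(maxlen=size)
        self.threshold = threshold  # arbitrary cutoff for this sketch

    def observe(self, response: str) -> bool:
        # Record a response and report whether long-run drift is flagged.
        if passes_policy(response):
            self.scores.append(off_center_score(response, self.objective))
        if not self.scores:
            return False
        return sum(self.scores) / len(self.scores) > self.threshold

Every response in such a stream can clear the per-output check while the window average still crosses the threshold, which is exactly the gap between pass/fail compliance and long-term objective fidelity.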
The Drift Detection Gap
OpenAI's alignment work helps frame the problem of aligning models to human intent and of identifying where methods scale or break. Anthropic's Constitutional AI and interpretability work help frame principle-based and mechanistic approaches.
Alignment Theory contributes a runtime behavioral drift detection and realignment layer for deployed systems.
How to Cite
Michael Bower. (2026). Literature Review: AI Alignment Approaches and the Drift Detection Gap. AlignmentTheory.org. https://alignmenttheory.org/pages/ai-alignment-literature-review.html
@misc{bower2026aialignmentliteraturereview,
  author = {Bower, Michael},
  title = {Literature Review: AI Alignment Approaches and the Drift Detection Gap},
  year = {2026},
  howpublished = {AlignmentTheory.org},
  url = {https://alignmenttheory.org/pages/ai-alignment-literature-review.html}
}