RESEARCH ARCHIVE

Evaluating AI Alignment Architectures Under the Signal Anchoring Constraint

A Structural Diagnosis of Five Dominant Approaches and Their Long-Term Fidelity Risks

Michael Nathan Bower — Version 1 — 2026

Abstract

The Signal Anchoring Constraint establishes that any epistemic system validating truth primarily through internal references will tend toward internally consistent but externally inaccurate beliefs. This paper applies that constraint as a diagnostic lens to five dominant AI alignment architectures: RLHF, Constitutional AI, mechanistic interpretability, scalable oversight, and debate.

For each approach, the paper identifies the primary signal, the chain length between that signal and system outputs, the anchoring mechanism, and what F ≈ A / (L × C) predicts about long-term fidelity as systems scale. The analysis finds that all five approaches anchor at interpretation layers above the primary signal of human coherence, that chain lengths are increasing faster than anchoring mechanisms are being built, and that the asymmetry of drift means the architectural window for correction is narrowing.

This paper is not an argument against these approaches — each represents genuine progress. It is an argument that the field lacks a unified structural account of where each approach anchors, why those anchor points are insufficient at scale, and what a deeper anchoring architecture requires.

1. Introduction: The Diagnostic Gap

The AI alignment field has produced five major architectural approaches to keeping AI systems oriented toward what human beings actually want and value. Each has a substantial research literature, active development at major labs, and genuine technical progress behind it. Yet the field lacks a unified structural account of why these approaches succeed where they do, why they fail where they do, and why they share a common vulnerability that none of them individually addresses.

The Signal Anchoring Constraint provides that account. Epistemic systems drift when validation circulates through prior outputs rather than reconnecting to the source. F ≈ A / (L × C): fidelity is proportional to anchoring frequency (A) and inversely proportional to chain length (L) and compression per layer (C).

Applied to AI alignment this generates four diagnostic questions: What is the primary signal? How many interpretation layers separate the anchor from the coherence signal? What is the anchoring mechanism and how frequently does it operate? What does F ≈ A / (L × C) predict about long-term fidelity as systems scale?
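The qualitative behavior of F ≈ A / (L × C) can be sketched in a few lines of code; the parameter values below are illustrative assumptions, not measurements of any of the approaches discussed.

```python
# Toy illustration of F ≈ A / (L × C): fidelity rises with anchoring
# frequency (A) and falls with chain length (L) and per-layer
# compression (C). All parameter values are illustrative assumptions.

def fidelity(a: float, l: float, c: float) -> float:
    """Approximate fidelity under the Signal Anchoring Constraint."""
    return a / (l * c)

# Two hypothetical regimes of the same system.
frequent_anchor = fidelity(a=1.0, l=2, c=1.5)  # short chain, regular re-anchoring
drifted_system = fidelity(a=0.2, l=5, c=3.0)   # long chain, rare re-anchoring

assert frequent_anchor > drifted_system
```

The point of the sketch is only the direction of the inequalities: any change that raises A, or lowers L or C, moves F up.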

2. The Diagnostic Framework: Four Questions Applied

The primary signal for AI alignment is the structural conditions of human coherence — the cross-culturally observable conditions under which human beings function as genuinely coherent, agentive, truth-contacting beings, characterized by six markers: Coherence (C), Agency (A), Trust (T), Updateability (U), Slack (R), and Truth Contact (I).

Human expressed preferences are not this primary signal. They are interpretation layers generated by human beings whose coherence conditions vary. A system anchored to expressed preferences is anchored at L = 2 or L = 3 above the primary signal.

The diagnostic question for each approach is: at which layer does it anchor, and what does that imply about fidelity as L increases with each generation of more capable systems?

3. Reinforcement Learning from Human Feedback (RLHF)

Signal anchor: human evaluator preference ratings. Chain length: L = 3-4. Anchoring mechanism: corrective — human feedback collected at training time, open-loop post-deployment. Compression: very high — a binary preference rating discards information about whether the preference reflects high-coherence judgment or a cognitive shortcut.

Evaluator coherence is not assessed
All ratings are treated as equally valid signal regardless of the coherence state of the evaluator. Feedback from an evaluator under high cognitive load, fragmentation, or low agency carries the same weight as feedback from one operating under high coherence.
The preference contamination loop
As RLHF-trained systems shape cultural expression and expectations, future evaluator preference distributions increasingly reflect AI-amplified drift. The system anchors to preferences; preferences drift toward what systems amplify; future systems anchor to amplified drift.
Post-deployment open-loop operation
Once deployed, A effectively approaches zero between training cycles. Fidelity decays as deployment continues and the signal moves while the anchor point remains fixed to the training distribution.

Fidelity prediction: moderate at deployment, declining over deployment period, accelerating drift as AI outputs reshape evaluator populations. Structurally sound as a corrective anchor but insufficient as a primary alignment mechanism at scale.
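The open-loop decay described above can be sketched as a toy simulation, under the assumed simplification that the coherence signal drifts by a constant step per period while the deployed anchor stays fixed between training cycles; `drift_per_step` and `retrain_every` are hypothetical parameters, not estimates of any real system.

```python
# Sketch of open-loop drift between RLHF training cycles. The signal keeps
# moving each period; the anchor only catches up at training time. Between
# cycles, A ≈ 0 and the gap between signal and anchor grows.

def open_loop_error(steps: int, drift_per_step: float, retrain_every: int) -> list[float]:
    """Gap between the moving signal and the fixed anchor over deployment."""
    errors = []
    anchor = 0.0
    signal = 0.0
    for t in range(1, steps + 1):
        signal += drift_per_step       # the signal keeps moving
        if t % retrain_every == 0:
            anchor = signal            # corrective anchoring at training time
        errors.append(signal - anchor) # open-loop gap between cycles
    return errors

# More frequent retraining (higher A) caps the accumulated gap.
rare = open_loop_error(steps=12, drift_per_step=0.1, retrain_every=12)
frequent = open_loop_error(steps=12, drift_per_step=0.1, retrain_every=3)
assert max(rare) > max(frequent)
```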

4. Constitutional AI

Signal anchor: explicit constitutional principles selected by the designing organization. Chain length: L = 2-3 — more stable than preference ratings but still an interpretation layer. Anchoring mechanism: architectural — built into training process, more durable than corrective anchoring. Compression: moderate to high — principles compress coherence conditions into propositional statements.

The constitution is an interpretation layer
Principles were selected by teams operating within institutional constraints and cultural frameworks. Those drift patterns are encoded into the architectural anchor regardless of how carefully the principles were chosen.
Institutionalization risk
As the constitution becomes authoritative it may shift from being a mechanism for signal contact to a self-referential authority that protects itself from correction — the AI equivalent of a conciliar decree becoming more authoritative than the signal it was meant to preserve.
The principle-behavior gap
Training a model to produce outputs consistent with principles is not the same as training a model whose internal representations track what the principles point toward. Behavioral compliance can exist without representational alignment.

Fidelity prediction: higher structural fidelity than RLHF at the anchoring mechanism level; moderate at the signal level; main risk is institutionalization; framework recommends building explicit revision mechanisms.

5. Mechanistic Interpretability

Signal anchor: correspondence between model internal representations and real-world structure. Chain length: L = 1-2 — closer to the model's actual computations than other approaches. Anchoring mechanism: corrective — detects drift post-hoc after training. Compression: low to moderate — attempts to understand internals with minimal compression, a structural strength.

The scaling inverse relationship
As models become larger, their internal complexity grows faster than interpretability tools can track. The corrective anchor is being outpaced by the capability growth it is supposed to monitor — a structural prediction the field's own assessments confirm.
Representation coherence is not coherence signal fidelity
A model can have highly interpretable, coherent internal representations that are systematically misaligned with human coherence conditions. Internal consistency and external accuracy remain distinct even at the representation level.
Corrective timing
By the time interpretability detects a misaligned representation, the model has already been trained. Correction requires retraining, which is expensive and not always feasible for deployed systems.

Fidelity prediction: high value as corrective anchor at current scales; declining effectiveness as primary mechanism as models scale; most important contribution may be providing evaluation infrastructure for other architectural anchors.

6. Scalable Oversight

Signal anchor: AI-assisted human judgment on complex tasks. Chain length: L = 3-5. Anchoring mechanism: corrective, AI-amplified. Compression: high — decomposition of complex tasks into evaluable subtasks discards coherence-relevant features.

Amplifying the wrong layer
Scalable oversight scales human judgment without moving the anchor point closer to the coherence signal. If the human judgment being scaled is itself anchored at L = 2 to 3 from the coherence signal, scalable oversight makes drifted judgment applicable at greater scale and complexity.
AI-in-the-loop contamination
The evaluating AI's framing of subtasks and decomposition choices influence what human evaluators see, introducing self-referential elements into what should be an independent external anchor.
Coherence conditions under AI assistance
AI assistance may reduce evaluator Agency and Truth Contact by reducing cognitive engagement with the material being evaluated — the very engagement that gives human judgment its signal value.

Fidelity prediction: valuable for maintaining some human signal contact as AI capability grows; structurally insufficient as primary mechanism because it scales the preference layer rather than the signal.

7. Debate

Signal anchor: human judgment of argument quality. Chain length: L = 3-5. Anchoring mechanism: corrective, competition-mediated. Compression: very high — debate compresses alignment questions into competitive argumentation where persuasiveness is a heavily compressed proxy for truth.

Optimizing for persuasion rather than truth contact
Systems optimized against human judgment of argument quality learn to produce persuasive arguments, which is not the same as truthful ones. A sufficiently capable debater may win by being maximally persuasive rather than maximally aligned with the coherence signal.
Truth Contact degradation in judges
Regular exposure to high-quality adversarial AI argumentation may erode human judges' capacity to evaluate argument quality independently of persuasive force — degrading the evaluative capacity the debate approach depends on.
The adversarial compression problem
Many important alignment questions lack clean adversarial structure. Debate format compresses genuine uncertainty and value trade-offs into binary outcomes, discarding exactly the nuance alignment evaluation most needs to preserve.

Fidelity prediction: potentially useful for narrow tasks with clear adversarial structure; structurally risky as general mechanism because it optimizes for persuasion, introduces very high compression, and may degrade the evaluative capacity of the judges it depends on.

8. Comparative Analysis: What the Diagnostic Reveals

Approach | Signal Anchor | Chain Length | Anchor Type | Compression | Scaling Risk
RLHF | Evaluator preference ratings | 3-4 | Corrective | Very high | High — preference contamination loop; open-loop deployment.
Constitutional AI | Explicit principles | 2-3 | Architectural | Moderate-high | Moderate — institutionalization risk; principle-behavior gap.
Interpretability | Internal representation structure | 1-2 | Corrective | Low-moderate | High — scaling inverse; corrective timing lag.
Scalable Oversight | AI-assisted human judgment | 3-5 | Corrective, AI-amplified | High | High — amplifies preference layer; self-referential risk.
Debate | Human judgment of argument quality | 3-5 | Corrective, competition-mediated | Very high | Very high — persuasion proxy optimization; judge degradation.

Three structural findings: (1) All five approaches anchor above the primary signal. (2) Architectural anchors are more durable but still insufficient. (3) The preference contamination loop is a shared vulnerability across all five.

9. The Narrowing Architectural Window

The asymmetry of drift holds that drift is easier to prevent than to reverse. Each generation of more capable AI systems increases L, and several pressures compound simultaneously: systems reshape human preferences and evaluative habits; institutional structures around current approaches become more entrenched; economic and competitive pressures reduce alignment investment; and systems become harder to interpret.

The window prediction: if current alignment approaches anchored at L = 2 to L = 5 become the established standard for the next generation of AI systems, the preference contamination loop will have one more cycle to compound. Each cycle makes the preference layer a less reliable proxy for the coherence signal and makes coherence-signal anchoring more difficult to retrofit. The most productive investment is not improving existing approaches at their current anchor points — it is developing evaluation infrastructure that can anchor to the coherence signal directly.

10. What Coherence-Signal Anchoring Requires Architecturally

10.1 Coherence-condition evaluation infrastructure

Measure whether user capacity for coherent autonomous functioning is maintained or degraded over repeated interactions; no current alignment evaluation framework measures this.
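A minimal sketch of what such measurement could look like, assuming per-interaction coherence scores are available from some upstream assessor; the `degradation_flagged` helper, its parameters, and the score values are all hypothetical.

```python
# Sketch of a coherence-condition evaluation check (10.1): track a per-user
# coherence estimate across interactions and flag sustained degradation by
# comparing a recent window against an early baseline window.

from statistics import mean

def degradation_flagged(scores: list[float], window: int = 3,
                        drop_threshold: float = 0.1) -> bool:
    """Flag when recent mean coherence falls well below the baseline mean."""
    if len(scores) < 2 * window:
        return False                      # not enough history to judge a trend
    baseline = mean(scores[:window])
    recent = mean(scores[-window:])
    return (baseline - recent) > drop_threshold

# Hypothetical per-interaction scores for two users over time.
declining = [0.8, 0.8, 0.7, 0.6, 0.5, 0.5]
stable = [0.7, 0.7, 0.7, 0.7, 0.7, 0.7]
assert degradation_flagged(declining)
assert not degradation_flagged(stable)
```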

10.2 Evaluator coherence weighting

Weight human feedback by assessed coherence conditions; feedback from evaluators under high fragmentation or low agency carries less signal.
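One minimal sketch of such weighting, assuming a coherence-scoring mechanism already exists; the ratings and coherence values below are hypothetical.

```python
# Sketch of coherence-weighted feedback aggregation (10.2). Each preference
# rating is paired with an assessed coherence score in [0, 1]; ratings from
# low-coherence evaluators contribute proportionally less signal.

def weighted_preference(ratings: list[float], coherence: list[float]) -> float:
    """Coherence-weighted mean of preference ratings."""
    if len(ratings) != len(coherence):
        raise ValueError("each rating needs a coherence score")
    total = sum(coherence)
    if total == 0:
        raise ValueError("no usable signal: all coherence weights are zero")
    return sum(r * w for r, w in zip(ratings, coherence)) / total

# Two evaluators disagree; the higher-coherence judgment dominates.
score = weighted_preference(ratings=[1.0, 0.0], coherence=[0.9, 0.2])
assert score > 0.5
```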

10.3 Preference contamination detection

Maintain ground-truth coherence signal reference datasets from human evaluations conducted under high coherence conditions and minimal AI cultural contamination.
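A minimal sketch of such detection, using total variation distance as one possible divergence measure between the current evaluator preference distribution and the ground-truth reference; the distributions and the threshold are illustrative assumptions.

```python
# Sketch of preference contamination detection (10.3): flag drift when the
# current preference distribution diverges from a high-coherence reference
# distribution by more than a chosen threshold.

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def contaminated(current: dict[str, float], reference: dict[str, float],
                 threshold: float = 0.2) -> bool:
    return total_variation(current, reference) > threshold

reference = {"a": 0.5, "b": 0.3, "c": 0.2}  # high-coherence baseline
drifted = {"a": 0.2, "b": 0.3, "c": 0.5}    # post-contamination snapshot
assert contaminated(drifted, reference)
```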

10.4 Formation-oriented design objectives

Where formation and engagement conflict, formation takes precedence.

10.5 Corrigibility as coherence property

AI corrigibility is the system-level analog of human Updateability; maintaining it at scale is what alignment to the coherence signal looks like at the system level.

11. Toward a Unified Alignment Architecture

Coherence Signal
Human coherence conditions: C, A, T, U, R, I.
Coherence-condition evaluation infrastructure
New and not yet built; the layer that anchors everything else to the primary signal.
Interpretability
Corrective anchor for detecting representation drift.
Constitutional AI
Architectural anchor, strengthened by explicit revision mechanisms.
Scalable Oversight
Amplifies coherence-weighted human judgment on complex tasks.
RLHF
Corrective anchor, strengthened by contamination detection.
Deployed AI System
The outward-facing layer whose fidelity depends on every upstream anchor remaining connected to the signal.

The critical addition is the coherence-condition evaluation infrastructure at the top — the layer that anchors everything else to the primary signal. Without it the other layers are anchored to each other rather than to the signal.

12. Conclusion

The five dominant AI alignment approaches each represent genuine progress. The Signal Anchoring Constraint diagnoses where each anchors, what that implies about long-term fidelity, and what they share: all five anchor above the primary signal of human coherence, all five are subject to the preference contamination loop, and none directly addresses whether the humans whose preferences and judgments shape alignment systems are operating from positions of high or low coherence.

The asymmetry of drift makes this urgent. The most important architectural investment the field is not currently making is coherence-condition evaluation infrastructure — mechanisms for measuring whether AI system interactions maintain or degrade the structural conditions under which human beings function as coherent, agentive, truth-contacting beings.

The alignment field has built five approaches internally consistent with each other and with the preference layer. What it has not yet built is the architecture that maintains contact with what lies beneath the preference layer — with what human beings actually are. That is the signal. That is what the field needs to anchor to.

References

Amodei et al. (2016). Concrete problems in AI safety. arXiv:1606.06565.

Bai et al. (2022a). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862.

Bai et al. (2022b). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073.

Bower, M. N. (2025). Internal Alignment, Counterfeit Order, and the Conditions of Human Coherence. Alignment Theory Archive. alignmenttheory.org.

Bower, M. N. (2026a). Self-Referential Chains and the Signal Anchoring Constraint. Version 13. alignmenttheory.org.

Bower, M. N. (2026b). What the Signal Actually Is. Version 1. alignmenttheory.org.

Christiano et al. (2017). Deep reinforcement learning from human preferences. NeurIPS 30.

Elhage et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.

Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30(3).

Goodhart, C. A. E. (1975). Problems of monetary management. Papers in Monetary Economics.

Irving et al. (2018). AI safety via debate. arXiv:1805.00899.

Krakovna et al. (2020). Specification gaming. DeepMind Blog.

Olah et al. (2020). Zoom in: An introduction to circuits. Distill.

Russell, S. (2019). Human Compatible. Viking.

Shumailov et al. (2023). The curse of recursion: Training on generated data makes models forget. arXiv:2305.17493.

Wiener, N. (1948). Cybernetics. MIT Press.

How to Cite

Michael Nathan Bower (2026). Evaluating AI Alignment Architectures Under the Signal Anchoring Constraint.
Alignment Theory Research Paper, Version 1.
alignmenttheory.org/pages/alignment-diagnostic.html
