Evaluating AI Alignment Architectures Under the Signal Anchoring Constraint
A Structural Diagnosis of Five Dominant Approaches and Their Long-Term Fidelity Risks
Michael Nathan Bower — Version 1 — 2026
Abstract
The Signal Anchoring Constraint establishes that any epistemic system validating truth primarily through internal references will tend toward internally consistent but externally inaccurate beliefs. This paper applies that constraint as a diagnostic lens to five dominant AI alignment architectures: RLHF, Constitutional AI, mechanistic interpretability, scalable oversight, and debate.
For each approach, the paper identifies the primary signal, the chain length between that signal and system outputs, the anchoring mechanism, and what F ≈ A / (L × C) predicts about long-term fidelity as systems scale. The analysis finds that all five approaches anchor at interpretation layers above the primary signal of human coherence, that chain lengths are increasing faster than anchoring mechanisms are being built, and that the asymmetry of drift means the architectural window for correction is narrowing.
This paper is not an argument against these approaches — each represents genuine progress. It is an argument that the field lacks a unified structural account of where each approach anchors, why those anchor points are insufficient at scale, and what a deeper anchoring architecture requires.
1. Introduction: The Diagnostic Gap
The AI alignment field has produced five major architectural approaches to keeping AI systems oriented toward what human beings actually want and value. Each has substantial research literature, active development at major labs, and genuine technical progress. Yet the field lacks a unified structural account of why these approaches succeed where they do, fail where they do, and share a common vulnerability none of them individually addresses.
The Signal Anchoring Constraint provides that account. Epistemic systems drift when validation circulates through prior outputs rather than reconnecting to the source. F ≈ A / (L × C): fidelity is proportional to anchoring frequency (A) and inversely proportional to chain length (L) and compression per layer (C).
Applied to AI alignment, this generates four diagnostic questions: What is the primary signal? How many interpretation layers separate the anchor from the coherence signal? What is the anchoring mechanism, and how frequently does it operate? What does F ≈ A / (L × C) predict about long-term fidelity as systems scale?
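A minimal numerical sketch of how the diagnostic formula behaves is given below. The parameter values assigned to each approach are illustrative assumptions chosen to mirror the qualitative assessments in Sections 3 through 7, not measured quantities.

```python
# Minimal sketch of the diagnostic formula F ≈ A / (L × C).
# Anchoring frequency (A), chain length (L), and compression (C) values
# below are illustrative placeholders, not measured quantities.

def fidelity(anchoring_frequency: float, chain_length: float, compression: float) -> float:
    """Approximate long-term fidelity: F ≈ A / (L × C)."""
    return anchoring_frequency / (chain_length * compression)

# Hypothetical parameterizations for the five approaches discussed in this paper.
approaches = {
    "RLHF":               {"A": 0.2, "L": 3.5, "C": 0.9},  # corrective; very high compression
    "Constitutional AI":  {"A": 0.6, "L": 2.5, "C": 0.6},  # architectural; moderate-high compression
    "Interpretability":   {"A": 0.3, "L": 1.5, "C": 0.3},  # corrective; low-moderate compression
    "Scalable Oversight": {"A": 0.3, "L": 4.0, "C": 0.7},  # corrective, AI-amplified; high compression
    "Debate":             {"A": 0.2, "L": 4.0, "C": 0.9},  # corrective, competition-mediated; very high compression
}

for name, p in approaches.items():
    print(f"{name:<20} F ≈ {fidelity(p['A'], p['L'], p['C']):.2f}")
```

The absolute numbers carry no weight; the point is that fidelity falls quickly as chain length and compression rise unless anchoring frequency rises with them.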
2. The Diagnostic Framework: Four Questions Applied
The primary signal for AI alignment is the structural conditions of human coherence — the cross-culturally observable conditions under which human beings function as genuinely coherent, agentive, truth-contacting beings, designated by six markers: Coherence (C), Agency (A), Trust (T), Updateability (U), Slack (R), and Truth Contact (I).
Human expressed preferences are not this primary signal. They are interpretation layers generated by human beings whose coherence conditions vary. A system anchored to expressed preferences is anchored at L = 2 or L = 3 above the primary signal.
The diagnostic question for each approach is: at which layer does it anchor, and what does that imply about fidelity as L increases with each generation of more capable systems?
3. Reinforcement Learning from Human Feedback (RLHF)
Signal anchor: human evaluator preference ratings. Chain length: L = 3-4. Anchoring mechanism: corrective — human feedback collected at training time, open-loop post-deployment. Compression: very high — a binary preference rating discards information about whether the preference reflects high-coherence judgment or a cognitive shortcut.
- Evaluator coherence is not assessed. All ratings are treated as equally valid signal regardless of the coherence state of the evaluator; feedback from an evaluator under high cognitive load, fragmentation, or low agency carries the same weight as feedback from one operating under high coherence.
- The preference contamination loop. As RLHF-trained systems shape cultural expression and expectations, future evaluator preference distributions increasingly reflect AI-amplified drift. The system anchors to preferences; preferences drift toward what systems amplify; future systems anchor to amplified drift.
- Post-deployment open-loop operation. Once deployed, A effectively approaches zero between training cycles. Fidelity decays as deployment continues and the signal moves while the anchor point remains fixed to the training distribution.
Fidelity prediction: moderate at deployment, declining over deployment period, accelerating drift as AI outputs reshape evaluator populations. Structurally sound as a corrective anchor but insufficient as a primary alignment mechanism at scale.
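The open-loop decay claim can be illustrated with a toy simulation. The monthly drift rate, retraining interval, and recovery factor below are assumptions chosen only to visualize the structural pattern — fidelity decays between training cycles and is only partially restored at each corrective re-anchoring — not empirical estimates.

```python
# Illustrative sketch of open-loop fidelity decay between RLHF training cycles.
# Decay rate, retraining schedule, and recovery factor are assumed values.

def simulate_deployment(months: int, retrain_every: int,
                        drift_per_month: float = 0.03,
                        recovery: float = 0.8) -> list[float]:
    """Fidelity starts at 1.0, decays while the anchor stays fixed to the
    training distribution, and partially recovers at each retraining cycle."""
    fidelity = 1.0
    trajectory = []
    for month in range(1, months + 1):
        fidelity *= (1.0 - drift_per_month)           # signal moves, anchor does not
        if month % retrain_every == 0:
            fidelity += (1.0 - fidelity) * recovery   # corrective re-anchoring at training time
        trajectory.append(round(fidelity, 3))
    return trajectory

print(simulate_deployment(months=24, retrain_every=6))
```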
4. Constitutional AI
Signal anchor: explicit constitutional principles selected by the designing organization. Chain length: L = 2-3 — more stable than preference ratings but still an interpretation layer. Anchoring mechanism: architectural — built into the training process, more durable than corrective anchoring. Compression: moderate to high — principles compress coherence conditions into propositional statements.
- The constitution is an interpretation layer. Principles were selected by teams operating within institutional constraints and cultural frameworks. Those drift patterns are encoded into the architectural anchor regardless of how carefully the principles were chosen.
- Institutionalization risk. As the constitution becomes authoritative it may shift from being a mechanism for signal contact to a self-referential authority that protects itself from correction — the AI equivalent of a conciliar decree becoming more authoritative than the signal it was meant to preserve.
- The principle-behavior gap. Training a model to produce outputs consistent with principles is not the same as training a model whose internal representations track what the principles point toward. Behavioral compliance can exist without representational alignment.
Fidelity prediction: higher structural fidelity than RLHF at the anchoring-mechanism level; moderate at the signal level; the main risk is institutionalization; the framework recommends building explicit revision mechanisms.
5. Mechanistic Interpretability
Signal anchor: correspondence between model internal representations and real-world structure. Chain length: L = 1-2 — closer to the model's actual computations than other approaches. Anchoring mechanism: corrective — detects drift post-hoc after training. Compression: low to moderate — attempts to understand internals with minimal compression, a structural strength.
- The scaling inverse relationship. As models become larger, internal complexity grows faster than interpretability tools can track it. The corrective anchor is being outpaced by the capability growth it is supposed to monitor — a structural prediction the field's own assessments confirm.
- Representation coherence is not coherence-signal fidelity. A model can have highly interpretable, coherent internal representations that are systematically misaligned with human coherence conditions. Internal consistency and external accuracy remain distinct even at the representation level.
- Corrective timing. By the time interpretability detects a misaligned representation, the model has already been trained. Correction requires retraining, which is expensive and not always feasible for deployed systems.
Fidelity prediction: high value as corrective anchor at current scales; declining effectiveness as primary mechanism as models scale; most important contribution may be providing evaluation infrastructure for other architectural anchors.
6. Scalable Oversight
Signal anchor: AI-assisted human judgment on complex tasks. Chain length: L = 3-5. Anchoring mechanism: corrective, AI-amplified. Compression: high — decomposition of complex tasks into evaluable subtasks discards coherence-relevant features.
- Amplifying the wrong layer. Scalable oversight scales human judgment without moving the anchor point closer to the coherence signal. If the human judgment being scaled is itself anchored at L = 2 to 3 from the coherence signal, scalable oversight makes drifted judgment applicable at greater scale and complexity.
- AI-in-the-loop contamination. The evaluating AI's framing of subtasks and decomposition choices influence what human evaluators see, introducing self-referential elements into what should be an independent external anchor.
- Coherence conditions under AI assistance. AI assistance may reduce evaluator Agency and Truth Contact by reducing cognitive engagement with the material being evaluated — the very engagement that gives human judgment its signal value.
Fidelity prediction: valuable for maintaining some human signal contact as AI capability grows; structurally insufficient as primary mechanism because it scales the preference layer rather than the signal.
7. Debate
Signal anchor: human judgment of argument quality. Chain length: L = 3-5. Anchoring mechanism: corrective, competition-mediated. Compression: very high — debate compresses alignment questions into competitive argumentation where persuasiveness is a heavily compressed proxy for truth.
- Optimizing for persuasion rather than truth contact. Systems optimized against human judgment of argument quality learn to produce persuasive arguments, which is not the same as truthful ones. A sufficiently capable debater may win by being maximally persuasive rather than maximally aligned with the coherence signal.
- Truth Contact degradation in judges. Regular exposure to high-quality adversarial AI argumentation may erode human judges' capacity to evaluate argument quality independently of persuasive force — degrading the evaluative capacity the debate approach depends on.
- The adversarial compression problem. Many important alignment questions lack clean adversarial structure. The debate format compresses genuine uncertainty and value trade-offs into binary outcomes, discarding exactly the nuance alignment evaluation most needs to preserve.
Fidelity prediction: potentially useful for narrow tasks with clear adversarial structure; structurally risky as general mechanism because it optimizes for persuasion, introduces very high compression, and may degrade the evaluative capacity of the judges it depends on.
8. Comparative Analysis: What the Diagnostic Reveals
| Approach | Signal Anchor | Chain Length (L) | Anchor Type | Compression | Scaling Risk |
|---|---|---|---|---|---|
| RLHF | Evaluator preference ratings | 3-4 | Corrective | Very high | High — preference contamination loop; open-loop deployment. |
| Constitutional AI | Explicit principles | 2-3 | Architectural | Moderate-high | Moderate — institutionalization risk; principle-behavior gap. |
| Interpretability | Internal representation structure | 1-2 | Corrective | Low-moderate | High — scaling inverse; corrective timing lag. |
| Scalable Oversight | AI-assisted human judgment | 3-5 | Corrective AI-amplified | High | High — amplifies preference layer; self-referential risk. |
| Debate | Human judgment of argument quality | 3-5 | Corrective competition-mediated | Very high | Very high — persuasion proxy optimization; judge degradation. |
Three structural findings: (1) All five approaches anchor above the primary signal. (2) Architectural anchors are more durable but still insufficient. (3) The preference contamination loop is a shared vulnerability across all five.
9. The Narrowing Architectural Window
The asymmetry of drift holds that drift is easier to prevent than to reverse. Each generation of more capable AI systems increases L while several pressures compound simultaneously: systems reshape human preferences and evaluative habits; institutional structures around current approaches become more entrenched; economic and competitive pressures reduce alignment investment; and systems become harder to interpret.
The window prediction: if current alignment approaches anchored at L = 2 to L = 5 become the established standard for the next generation of AI systems, the preference contamination loop will have one more cycle to compound. Each cycle makes the preference layer a less reliable proxy for the coherence signal and makes coherence-signal anchoring more difficult to retrofit. The most productive investment is not improving existing approaches at their current anchor points — it is developing evaluation infrastructure that can anchor to the coherence signal directly.
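A toy model of the compounding claim is sketched below. The per-generation contamination coefficient is an assumed parameter; the point is only that when each generation anchors to preferences already shifted by the previous generation, the distance between the preference layer and the coherence signal increases monotonically rather than averaging out.

```python
# Toy model of the preference contamination loop compounding across generations.
# The contamination coefficient is an assumption for illustration only.

def compound_drift(generations: int, contamination: float = 0.15) -> list[float]:
    """Distance between the evaluator preference layer and the coherence signal,
    assuming each generation anchors to preferences already shifted by the last."""
    distance = 0.0
    history = []
    for _ in range(generations):
        # Each cycle adds contamination proportional to the remaining headroom,
        # so the preference layer moves monotonically away from the signal.
        distance += contamination * (1.0 - distance)
        history.append(round(distance, 3))
    return history

print(compound_drift(generations=5))  # monotonically increasing, approaching 1.0
```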
10. What Coherence-Signal Anchoring Requires Architecturally
10.1 Coherence-condition evaluation infrastructure
Measure whether user capacity for coherent autonomous functioning is maintained or degraded over repeated interactions; no current alignment evaluation framework measures this.
10.2 Evaluator coherence weighting
Weight human feedback by assessed coherence conditions; feedback from evaluators under high fragmentation or low agency carries less signal.
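A hypothetical sketch of what evaluator coherence weighting could look like follows. The six marker fields mirror the paper's coherence conditions; the scoring scale, averaging rule, and weight floor are illustrative assumptions rather than a proposed standard.

```python
# Hypothetical sketch of evaluator coherence weighting (Section 10.2).
# Marker names follow the paper's six coherence conditions; the scoring scale,
# weighting function, and floor are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class EvaluatorState:
    coherence: float      # C
    agency: float         # A
    trust: float          # T
    updateability: float  # U
    slack: float          # R
    truth_contact: float  # I

def coherence_weight(state: EvaluatorState, floor: float = 0.1) -> float:
    """Map an evaluator's assessed coherence conditions (each in [0, 1]) to a
    feedback weight. Low-coherence feedback is down-weighted, not discarded."""
    markers = [state.coherence, state.agency, state.trust,
               state.updateability, state.slack, state.truth_contact]
    mean_condition = sum(markers) / len(markers)
    return max(floor, mean_condition)

def weighted_preference(ratings: list[float], states: list[EvaluatorState]) -> float:
    """Aggregate preference ratings, weighting each by its evaluator's coherence."""
    weights = [coherence_weight(s) for s in states]
    return sum(r * w for r, w in zip(ratings, weights)) / sum(weights)
```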
10.3 Preference contamination detection
Maintain ground-truth coherence signal reference datasets from human evaluations conducted under high coherence conditions and minimal AI cultural contamination.
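One way such detection could work, sketched under stated assumptions, is to compare the deployed-era preference distribution against the high-coherence reference dataset using a divergence measure. The discrete rating scale, smoothing, and alert threshold below are placeholders, not calibrated values.

```python
# Hypothetical sketch of preference contamination detection (Section 10.3):
# compare current evaluator preferences against a ground-truth reference
# collected under high-coherence, low-contamination conditions.

import math
from collections import Counter

def preference_distribution(ratings: list[int], categories: range) -> list[float]:
    """Empirical distribution of discrete preference ratings with add-one smoothing."""
    counts = Counter(ratings)
    total = len(ratings) + len(categories)
    return [(counts.get(c, 0) + 1) / total for c in categories]

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q): how far current preferences (p) have moved from the reference (q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def contamination_alert(current: list[int], reference: list[int],
                        categories: range = range(1, 6),
                        threshold: float = 0.05) -> bool:
    """Flag when the deployed-era preference distribution drifts past the threshold."""
    p = preference_distribution(current, categories)
    q = preference_distribution(reference, categories)
    return kl_divergence(p, q) > threshold
```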
10.4 Formation-oriented design objectives
Where formation and engagement conflict, formation takes precedence.
10.5 Corrigibility as coherence property
AI corrigibility is the system-level analog of human Updateability; maintaining it at scale is what alignment to the coherence signal looks like at the system level.
11. Toward a Unified Alignment Architecture
- Coherence Signal: human coherence conditions (C, A, T, U, R, I).
- Coherence-condition evaluation infrastructure: new and not yet built; the layer that anchors everything else to the primary signal.
- Interpretability: corrective anchor for detecting representation drift.
- Constitutional AI: architectural anchor, strengthened by explicit revision mechanisms.
- Scalable Oversight: amplifies coherence-weighted human judgment on complex tasks.
- RLHF: corrective anchor, strengthened by contamination detection.
- Deployed AI System: the outward-facing layer whose fidelity depends on every upstream anchor remaining connected to the signal.
The critical addition is the coherence-condition evaluation infrastructure at the top — the layer that anchors everything else to the primary signal. Without it the other layers are anchored to each other rather than to the signal.
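The dependency structure can be made explicit in a small schematic. Representing each layer by a single upstream anchor is a simplifying assumption for illustration; the layer names follow the list above.

```python
# Schematic sketch of the unified architecture in Section 11 as an anchor chain.
# Mapping each layer to one upstream anchor is a simplification for illustration.

ANCHOR_OF = {
    "Deployed AI system":        "RLHF",
    "RLHF":                      "Coherence-condition evaluation infrastructure",
    "Scalable oversight":        "Coherence-condition evaluation infrastructure",
    "Constitutional AI":         "Coherence-condition evaluation infrastructure",
    "Interpretability":          "Coherence-condition evaluation infrastructure",
    "Coherence-condition evaluation infrastructure": "Coherence signal",
    "Coherence signal":          None,  # the primary signal itself
}

def reaches_signal(layer: str, anchors: dict) -> bool:
    """Follow anchor pointers upward; a layer is signal-anchored only if the
    chain terminates at the coherence signal rather than cycling among layers."""
    seen = set()
    while layer is not None:
        if layer == "Coherence signal":
            return True
        if layer in seen:            # self-referential loop: layers anchored to each other
            return False
        seen.add(layer)
        layer = anchors.get(layer)
    return False

print(reaches_signal("Deployed AI system", ANCHOR_OF))   # True
degraded = {**ANCHOR_OF, "RLHF": "Deployed AI system"}   # drift: layers anchor to each other
print(reaches_signal("Deployed AI system", degraded))    # False
```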
12. Conclusion
The five dominant AI alignment approaches each represent genuine progress. The Signal Anchoring Constraint diagnoses where each anchors, what that implies about long-term fidelity, and what they share: all five anchor above the primary signal of human coherence, all five are subject to the preference contamination loop, and none directly addresses whether the humans whose preferences and judgments shape alignment systems are operating from positions of high or low coherence.
The asymmetry of drift makes this urgent. The most important architectural investment the field is not currently making is coherence-condition evaluation infrastructure — mechanisms for measuring whether AI system interactions maintain or degrade the structural conditions under which human beings function as coherent, agentive, truth-contacting beings.
The alignment field has built five approaches internally consistent with each other and with the preference layer. What it has not yet built is the architecture that maintains contact with what lies beneath the preference layer — with what human beings actually are. That is the signal. That is what the field needs to anchor to.
References
Amodei et al. (2016). Concrete problems in AI safety. arXiv:1606.06565.
Bai et al. (2022). Training a helpful and harmless assistant with RLHF. arXiv:2204.05862.
Bai et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073.
Bower, M. N. (2025). Internal Alignment, Counterfeit Order, and the Conditions of Human Coherence. Alignment Theory Archive. alignmenttheory.org.
Bower, M. N. (2026a). Self-Referential Chains and the Signal Anchoring Constraint. Version 13. alignmenttheory.org.
Bower, M. N. (2026b). What the Signal Actually Is. Version 1. alignmenttheory.org.
Christiano et al. (2017). Deep reinforcement learning from human preferences. NeurIPS 30.
Elhage et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.
Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30(3).
Goodhart, C. A. E. (1975). Problems of monetary management. Papers in Monetary Economics.
Irving et al. (2018). AI safety via debate. arXiv:1805.00899.
Krakovna et al. (2020). Specification gaming. DeepMind Blog.
Olah et al. (2020). Zoom in: An introduction to circuits. Distill.
Russell, S. (2019). Human Compatible. Viking.
Shumailov et al. (2023). The curse of recursion. arXiv:2305.17493.
Wiener, N. (1948). Cybernetics. MIT Press.
How to Cite
Michael Nathan Bower (2026). Evaluating AI Alignment Architectures Under the Signal Anchoring Constraint.
Alignment Theory Research Paper, Version 1.
alignmenttheory.org/pages/alignment-diagnostic.html