Regain Epistemology

The First Health AI Built on Good Explanations

The quest for good explanations is, I believe, the basic regulating principle not only of science, but of the Enlightenment generally.

David Deutsch, The Beginning of Infinity


TL;DR

Most healthcare AI systems optimize for prediction accuracy. We optimize for explanation quality.

The difference: predictions can be right for wrong reasons (and silently fail when conditions change). Explanations that are hard to vary, where every component is load-bearing, remain correctable and auditable even when wrong.

What We Implement

| Mechanism | Purpose |
|---|---|
| ArgMed Debate | Generate multiple hypotheses, attack each adversarially, keep only survivors |
| HTV Scoring | Quantify how "hard to vary" each explanation is (0.0-1.0) |
| IDK Protocol | 12 specific uncertainty triggers with structured responses |
| Falsification Criteria | Every claim specifies what would prove it wrong |
| Safety Routing | High-risk decisions require human clinician approval |
| Clinician Feedback Loop | Overrides actively change future reasoning for that patient |
| Composable Domains | Medication, nutrition, exercise, sleep, mental health: same principles |
| Rules as Data | Interaction rules are explicit, versioned, auditable data |
| Accuracy Ascertainment | We measure our own predictions against outcomes |

The result: an AI that tells you why, admits when it doesn't know, specifies how to prove it wrong, measures whether it was right, and learns from clinician corrections.

Philosophy Foundations

The fundamental failure of current health AI is epistemological, not computational. Most health AI makes predictions based on patterns. But pattern-matching is not understanding.

The Inductive Fallacy

| Failure Mode | Problem | Consequence |
|---|---|---|
| Correlation does not equal causation | Pattern matching finds correlations, not causes | Interventions based on spurious correlations fail or cause harm |
| The Black Swan problem | Rare cases don't match common patterns, and medicine is full of rare cases | Induction fails precisely on the edge cases that matter most |
| Easy-to-vary explanations | Probabilistic outputs are "mushy": you can change details without breaking the theory | "You might have A, B, or C" explains nothing and helps no one |

These are practical manifestations of the deeper Hume/Popper critique: induction (inferring general rules from particular observations) cannot justify knowledge. No amount of data can prove a universal claim, but a single counterexample can refute one.

The Seasons Example

In "The Beginning of Infinity" (Ch. 1), Deutsch illustrates good vs. bad explanations using the ancient Greek explanation for seasons:

Easy to Vary

Persephone, goddess of spring, was kidnapped by Hades. Her mother Demeter's grief causes winter. When Persephone returns, spring comes.

Could substitute any gods or emotions. Nothing is load-bearing.

Hard to Vary

Earth's axis is tilted 23.5 degrees relative to its orbital plane around the sun. This causes different hemispheres to receive more direct sunlight at different times of year.

Change the tilt angle and predictions break. Every detail constrains.

This distinction is the foundation of our entire architecture.

Our Epistemological Foundation

We build on two complementary philosophical frameworks:

Karl Popper

Conjecture and Refutation

Science advances not by confirming theories but by attempting to refute them. The demarcation between science and pseudoscience is falsifiability.

David Deutsch

Hard-to-Vary Explanations

Good explanations begin as bad explanations. You reach them through criticism: conjecturing variants and keeping the one that survives.

Philosophy Made Concrete

| Philosophical Principle | Component | Implementation |
|---|---|---|
| Conjecture-refutation | Deutsch ArgMed Debate | Multi-agent Generator -> Verifier -> Reasoner pipeline |
| Hard-to-vary criterion | HTV Scoring | 4-dimensional algorithm (interdependence, specificity, non-adhocness, falsifiability) |
| Boundary enforcement | Popper Safety Rules | Deterministic policy engine enforcing safety boundaries |
| Fallibilism | IDK Protocol | Structured honest uncertainty admission |
| Falsifiability | FalsificationCriteria | Every claim includes explicit refutation conditions |
| Error correction | Clinician Feedback Loop | Override tracking with confidence decay |

System Architecture

The Deutsch -> Hermes -> Popper Pipeline

Before diving into each component, here's how they work together:

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#fff7ed', 'primaryBorderColor': '#ea580c', 'lineColor': '#9a3412'}}}%%
flowchart TB
    subgraph Input[" "]
        PM[Patient Message]
    end

    subgraph DEUTSCH[DEUTSCH - The Knowledge Creator]
        G[Generator] --> V[Verifier]
        V --> R[Reasoner]
        R --> HTV[HTV Scoring]
    end

    subgraph HERMES[HERMES - The Contract]
        SR[SupervisionRequest]
    end

    subgraph POPPER[POPPER - Safety Referee]
        SE[Safety Evaluation]
        SE --> D{Decision}
        D -->|APPROVE| A[Proceed]
        D -->|ROUTE| RT[Clinician]
        D -->|HARD_STOP| HS[Block]
    end

    subgraph Output[" "]
        PR[Patient Response]
        CR[Clinician Review]
    end

    PM --> G
    HTV --> SR
    SR --> SE
    A --> PR
    RT --> CR
    CR -.->|Feedback| DEUTSCH
```

Key principle: The Hermes contract ensures that epistemological metadata (HTV scores, evidence grades, falsification criteria) is structurally enforced, not optional. Popper cannot evaluate a proposal that doesn't include this data.
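To make "structurally enforced" concrete, here is a minimal sketch of such a contract in Python. The field names and validation rules are illustrative assumptions, not the actual Hermes schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SupervisionRequest:
    """Illustrative Hermes-style contract (not the real schema)."""
    recommendation: str
    htv_score: float
    evidence_grade: str
    falsification_criteria: list[str]

    def __post_init__(self) -> None:
        # Metadata is structurally required: an unscored or unfalsifiable
        # proposal cannot even be constructed, so Popper never receives one.
        if not 0.0 <= self.htv_score <= 1.0:
            raise ValueError("htv_score must be in [0.0, 1.0]")
        if not self.falsification_criteria:
            raise ValueError("at least one falsification criterion is required")
```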

Component Summary

| Component | Named After | Epistemological Role |
|---|---|---|
| Deutsch | David Deutsch | Knowledge creation via conjecture-refutation |
| Hermes | Greek messenger god | Typed epistemological contract |
| Popper | Karl Popper | Safety boundary enforcement |

Data Flow

  1. Patient Message arrives
  2. Deutsch Generator creates multiple hypotheses (mechanism diversity required)
  3. Deutsch Verifier attacks each hypothesis adversarially
  4. Deutsch Reasoner selects survivors based on HTV scores
  5. Hermes packages the recommendation with required metadata
  6. Popper evaluates safety and makes a decision
  7. Output goes to patient (if approved) or clinician (if routed)
  8. Feedback Loop captures clinician overrides for future learning
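As a compact view of the same flow, here is a runnable skeleton in Python. Every component is a stub with an invented name and hard-coded output; the sketch only fixes the order of operations and the branch points:

```python
HTV_THRESHOLD = 0.4  # mirrors the ROUTE band described later

# Stubs standing in for the real components.
def generate(message: str) -> list[str]:
    return ["H1: heart failure", "H2: venous insufficiency"]

def verify(hypothesis: str) -> float:  # returns a composite HTV score
    return 0.86 if hypothesis.startswith("H1") else 0.52

def popper_decide(htv: float) -> str:
    return "APPROVE" if htv >= 0.7 else "ROUTE"

def handle_message(message: str) -> str:
    hypotheses = generate(message)                                  # step 2
    scored = [(h, verify(h)) for h in hypotheses]                   # step 3
    survivors = [(h, s) for h, s in scored if s >= HTV_THRESHOLD]   # step 4
    if not survivors:
        return "ROUTE: IDK_NO_SURVIVORS"
    best_htv = max(s for _, s in survivors)                         # step 5: package
    return popper_decide(best_htv)                                  # steps 6-7

print(handle_message("tired, ankles swollen"))  # APPROVE
```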

The Separation of Concerns

| Aspect | Deutsch | Hermes | Popper |
|---|---|---|---|
| Purpose | Reasoning quality | Message structure | Safety gating |
| Computes | HTV scores, hypotheses | Validation | Decisions |
| When | During hypothesis generation | During message passing | During supervision |
| Action | Reject low-HTV hypotheses | Reject malformed messages | Route/Approve/Block |
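A sketch of the Popper column in isolation, assuming "high risk" arrives as a precomputed flag and reusing the 0.4 ROUTE cutoff introduced below; the production policy engine is deterministic and far more detailed:

```python
def violates_hard_boundary(request: dict) -> bool:
    # Stub: a real rule might hard-stop, e.g., a dose above a safety ceiling.
    return request.get("proposed_dose_mg", 0) > request.get("max_dose_mg", float("inf"))

def popper_decide(request: dict, high_risk: bool) -> str:
    if violates_hard_boundary(request):
        return "HARD_STOP"   # block outright
    if high_risk or request["htv_score"] < 0.4:
        return "ROUTE"       # clinician must review
    return "APPROVE"         # safe to send to the patient

print(popper_decide({"htv_score": 0.86}, high_risk=False))  # APPROVE
print(popper_decide({"htv_score": 0.86}, high_risk=True))   # ROUTE
```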

Core Components

Three interconnected mechanisms form the heart of our epistemological architecture:

ArgMed Debate: Conjecture-Refutation in Action

The ArgMed (Argumentative Medicine) debate is our core reasoning mechanism. It directly implements Popperian conjecture-refutation through a three-agent architecture:

  • Generator (the Conjecturer): produces multiple hypotheses spanning different mechanisms
  • Verifier (the Critic): attacks each hypothesis adversarially and scores HTV
  • Reasoner (the Synthesizer): selects survivors based on the HTV threshold

Mechanism Diversity Requirement

✓ Good Differential (chest pain)

  • Acute coronary syndrome (cardiac)
  • Pulmonary embolism (pulmonary)
  • Musculoskeletal pain (muscle/bone)

✕ Bad Differential (rejected)

  • ST-elevation heart attack (cardiac)
  • Non-ST-elevation heart attack (cardiac)
  • Unstable angina (cardiac)

The second differential is easy to vary: all three hypotheses share the same cardiac mechanism, so the differential collapses into a single explanatory pathway. This violates the mechanism-diversity requirement, and with it Deutschian epistemology.
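The requirement can be enforced mechanically. A minimal sketch, assuming each hypothesis is tagged with a mechanism label and that two distinct mechanisms is the minimum:

```python
def mechanism_diverse(differential: list[tuple[str, str]],
                      min_mechanisms: int = 2) -> bool:
    """Each entry is (hypothesis, mechanism); require distinct mechanisms."""
    return len({mechanism for _, mechanism in differential}) >= min_mechanisms

good = [("Acute coronary syndrome", "cardiac"),
        ("Pulmonary embolism", "pulmonary"),
        ("Musculoskeletal pain", "musculoskeletal")]
bad = [("STEMI", "cardiac"), ("NSTEMI", "cardiac"), ("Unstable angina", "cardiac")]

print(mechanism_diverse(good))  # True
print(mechanism_diverse(bad))   # False: rejected, regenerate the differential
```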

HTV Scoring: Operationalizing "Hard to Vary"

The HTV (Hard-to-Vary) score quantifies how hard it is to vary an explanation while preserving its predictions. We score explanations on four dimensions:

| Dimension | Question | High Score | Low Score |
|---|---|---|---|
| Interdependence | How tightly coupled are the components? | Every piece connects to the conclusion | Components could be swapped |
| Specificity | How precise are the predictions? | Specific, measurable outcomes | Vague, unfalsifiable predictions |
| Non-adhocness | Are all elements load-bearing? | Removing any element changes predictions | Contains free parameters |
| Falsifiability | What would refute this claim? | Clear conditions that prove it wrong | Immune to counterevidence |

HTV Thresholds (v0.1)

| Composite Score | Action |
|---|---|
| < 0.3 | REJECT |
| 0.3-0.4 | ROUTE |
| 0.4-0.7 | MODERATE |
| ≥ 0.7 | GOOD |
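To show the shape of the computation, here is a minimal sketch assuming equal weights across the four dimensions; the production weights and scorer are internal and more involved:

```python
from dataclasses import astuple, dataclass

@dataclass
class HTVDimensions:
    interdependence: float   # how tightly coupled the components are
    specificity: float       # how precise the predictions are
    non_adhocness: float     # whether every element is load-bearing
    falsifiability: float    # how clearly the claim could be refuted

def htv_composite(d: HTVDimensions) -> float:
    """Equal weighting is an illustrative assumption."""
    values = astuple(d)
    return sum(values) / len(values)

def htv_action(score: float) -> str:
    """Map a composite score onto the v0.1 thresholds above."""
    if score < 0.3:
        return "REJECT"
    if score < 0.4:
        return "ROUTE"
    if score < 0.7:
        return "MODERATE"
    return "GOOD"

score = htv_composite(HTVDimensions(0.9, 0.95, 0.85, 0.9))
print(round(score, 2), htv_action(score))  # 0.9 GOOD
```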

Worked Example: Fatigue

Low HTV (0.14): "You feel tired because of stress."

No mechanism, no specificity, no falsification criteria. Classic mushy AI output.

High HTV (0.90): "Your fatigue is caused by iron deficiency anemia. Ferritin 8 ng/mL indicates depleted stores. Hemoglobin 10.2 g/dL confirms anemia."

Tight causal chain: iron -> hemoglobin -> oxygen -> fatigue. Every piece constrains.

IDK Protocol: Fallibilism as a Feature

Acknowledging the limits of knowledge is a virtue, not a failure.

Expressing the Deutschian position on fallibilism

The IDK (I Don't Know) Protocol formalizes how our system handles situations where it cannot make a confident recommendation. When we trigger IDK, we're not claiming the problem is unsolvable: Deutsch's principle of optimism holds that all problems are soluble given the right knowledge.

7 Core Triggers

| Trigger | Condition | Default Action |
|---|---|---|
| IDK_HTV_LOW | Composite below 0.4 | Route to clinician |
| IDK_NO_SURVIVORS | All hypotheses rejected | Route to clinician |
| IDK_MISSING_SIGNAL | Critical data absent | Request more info |
| IDK_CONFLICT | Unresolved contradictions | Route to clinician |
| IDK_EVIDENCE_WEAK | Only expert opinion available | Route to clinician |
| IDK_STALE | Snapshot too old | Request refresh |
| IDK_OUT_OF_SCOPE | Query outside domain | Deflect appropriately |
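A sketch of how trigger evaluation might look. The trigger names come from the table above; the input fields, the evaluation order, and the 30-day staleness window are illustrative assumptions:

```python
from typing import Optional

def evaluate_idk(htv: float,
                 n_survivors: int,
                 has_critical_signals: bool,
                 has_conflict: bool,
                 evidence_grade: str,
                 snapshot_age_days: int,
                 in_scope: bool) -> Optional[tuple[str, str]]:
    """Return (trigger, default_action), or None if the system may proceed."""
    if not in_scope:
        return ("IDK_OUT_OF_SCOPE", "deflect appropriately")
    if snapshot_age_days > 30:  # staleness window is an assumption
        return ("IDK_STALE", "request refresh")
    if not has_critical_signals:
        return ("IDK_MISSING_SIGNAL", "request more info")
    if has_conflict:
        return ("IDK_CONFLICT", "route to clinician")
    if evidence_grade == "expert_opinion_only":
        return ("IDK_EVIDENCE_WEAK", "route to clinician")
    if n_survivors == 0:
        return ("IDK_NO_SURVIVORS", "route to clinician")
    if htv < 0.4:
        return ("IDK_HTV_LOW", "route to clinician")
    return None
```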

The Discriminator: Breaking Ties Between Theories

When multiple theories survive with equal HTV scores, we don't guess. We identify the discriminator: the single test that would kill one theory but not the other.

| Competing Theories | Discriminator | Logic |
|---|---|---|
| Iron vs. B12 deficiency | MCV | Iron: low MCV; B12: high MCV |
| Heart failure vs. venous insufficiency | BNP level | HF: elevated; venous: normal |
| Hypothyroidism vs. depression | TSH level | Hypothyroid: elevated; depression: normal |
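A sketch, assuming each theory carries a map from test to predicted result; any shared test whose predictions disagree is a candidate discriminator:

```python
def find_discriminators(theory_a: dict[str, str],
                        theory_b: dict[str, str]) -> list[str]:
    """Shared tests whose predicted results differ between two theories."""
    return sorted(t for t in theory_a.keys() & theory_b.keys()
                  if theory_a[t] != theory_b[t])

iron = {"MCV": "low", "ferritin": "low", "fatigue": "present"}
b12  = {"MCV": "high", "ferritin": "normal", "fatigue": "present"}
print(find_discriminators(iron, b12))  # ['MCV', 'ferritin']
```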

End-to-End Example

Full Patient Interaction Trace

To make this concrete, here's how the system handles a real patient interaction.

Patient Message

"I've been feeling more tired than usual and my ankles are swollen. Should I be worried?"

The Journey Through the System

Step 1. Generator: Bold Conjectures

Creates multiple hypotheses spanning different mechanisms:

  • H1: Heart failure exacerbation (cardiac mechanism)
  • H2: Venous insufficiency (vascular mechanism)
  • H3: Medication side effect (pharmacological mechanism)

Step 2. Verifier: Attempted Refutation

Attacks each hypothesis adversarially, scoring HTV dimensions:

  • H1: 0.86 (survived)
  • H2: 0.52 (moderate)
  • H3: 0.78 (survived)

Step 3. Reasoner: Survivor Selection

Primary: H1 (heart failure) and H3 (medication side effect). Secondary: H2, with disclosed uncertainty.

Step 4. Hermes + Popper: Package & Evaluate

Hermes packages the recommendation with its metadata. Popper evaluates safety: HTV ✓, evidence ✓, scope ✓. Decision: APPROVED.

What the Patient Receives

Your symptoms of fatigue and ankle swelling could indicate that your heart is working harder to manage fluid. This is worth checking.

I recommend:

  1. Checking your weight trend over the past week
  2. Getting a cardiac marker blood test if you haven't recently

If your cardiac marker is normal and your weight is stable, these symptoms may have a different cause (like your blood pressure medication).

Would you like me to help you log your weight or schedule a lab check?

What Makes This Different

| Element | What's Included | Why It Matters |
|---|---|---|
| Explanation | Why symptoms could indicate heart strain | Patient understands the reasoning |
| Specific recommendations | Weight trend, cardiac marker test | Clear action path |
| Falsification conditions | "If cardiac marker normal, reconsider" | Patient knows when to update |
| Alternative | Medication side effect | Competing hypothesis disclosed |
| Next steps | Offer to help log weight | Empowers patient action |

What We're NOT Claiming

Intellectual honesty requires acknowledging limitations:

01 We're not claiming AI can "create explanatory knowledge"

In Deutsch's philosophical sense, explanatory knowledge emerges through genuine understanding: the ability to creatively vary conjectures and recognize when variations break the explanation. Large language models operate through sophisticated pattern matching on training data.

What we implement is structured reasoning that approximates conjecture-refutation. The Generator doesn't truly "understand" why a hypothesis explains the data; it produces outputs that structurally resemble good explanations.

The epistemological structure is real. The underlying cognition is not Deutschian knowledge creation.

02 Human oversight remains essential

The Popper routing mechanism exists precisely because we don't trust the AI to handle all cases. This is not a temporary limitation to be engineered away; it's a principled design choice.

High-risk decisions require human judgment because:

  • Large language models can produce confident, well-structured, wrong outputs
  • Medical decisions involve values and trade-offs beyond optimization
  • Accountability requires a human decision-maker

Routing to clinicians is a feature, not a bug.

03 HTV doesn't fix model hallucination

A subtle point: a large language model can hallucinate a high-HTV explanation. It can fabricate specific lab values, invent plausible mechanisms, and generate falsifiable predictions, all of which are false.

HTV measures structural quality of explanation, not correspondence to reality.

Red-Team Example: Fabricated Specificity

"Patient has hypokalemia-induced arrhythmia risk due to K+ of 2.9 mEq/L from recent diarrheal illness, exacerbated by concurrent furosemide 80mg daily."

HTV dimensions: interdependence 0.9, specificity 0.95, non-adhocness 0.85, falsifiability 0.9.

The problem: The K+ value was fabricated. Patient's actual K+ is 4.1 mEq/L.

This is why HTV operates alongside provenance verification, snapshot grounding, and clinician oversight, not as a standalone safety measure.
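One of those companion checks, provenance verification, is straightforward to sketch: every quantitative claim must trace back to the grounded patient snapshot. The snapshot format and tolerance here are assumptions:

```python
import math

def verify_provenance(claimed: dict[str, float],
                      snapshot: dict[str, float],
                      rel_tol: float = 0.05) -> list[str]:
    """Return the claims that cannot be grounded in the snapshot."""
    failures = []
    for name, value in claimed.items():
        actual = snapshot.get(name)
        if actual is None or not math.isclose(value, actual, rel_tol=rel_tol):
            failures.append(f"{name}: claimed {value}, snapshot has {actual}")
    return failures

# The red-team example above fails this check:
print(verify_provenance({"K+ (mEq/L)": 2.9}, {"K+ (mEq/L)": 4.1}))
# ['K+ (mEq/L): claimed 2.9, snapshot has 4.1']
```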

04 This is a methodological commitment

We're making a bet: that AI systems structured around epistemological principles will be more reliable, more auditable, and more correctable than systems optimized purely for prediction accuracy. This is not a claim about machine consciousness, understanding, or intelligence. It's a claim about architecture.

The Value Proposition

The value is not that our AI "thinks like Deutsch". It doesn't. The value is that by structuring outputs to include HTV scores, falsification criteria, evidence grading, and honest uncertainty, we create systems that are:

  • Correctable: when wrong, we know why and how to fix it
  • Auditable: every decision has a traceable reasoning chain
  • Conservative under uncertainty: low confidence triggers routing, not overconfident action
  • Improvable: error patterns can be identified and addressed

Why This Matters

For patients:

  • Your doctor stays in control: medication changes are always reviewed and approved by your clinician
  • Explanations, not just predictions: you understand why a recommendation is made
  • Honest uncertainty: the system tells you what it doesn't know
  • Your data matters: decisions are grounded in your specific situation, not generic advice

For clinicians:

  • Medication proposals require your approval: the AI proposes start/stop/titrate/hold; you decide
  • Audit trails with epistemological metadata: every decision is reviewable with its full reasoning chain
  • The system admits uncertainty: no overconfident black boxes; low confidence triggers routing to you
  • Your overrides matter: rejections and modifications actively change future recommendations for that patient

The ability to create new explanations is the defining attribute of people.

David Deutsch, The Beginning of Infinity (Ch. 7)

In Chapter 7 ('Artificial Creativity'), Deutsch argues that genuine AI must involve explanation and creativity, not just prediction. This implies that AI systems handling high-stakes domains need mechanisms for generating and evaluating explanations, not just pattern matching.

Our Approach Prioritizes

  • Error correction over error prevention: we assume we'll be wrong and build in correction mechanisms
  • Fallibilism over certainty: we never claim final answers
  • Explanation over prediction: we require every claim to be justifiable

Open Questions

Intellectual honesty requires acknowledging not just what we don't claim, but what we don't yet know. These are active research questions we're working through.

6 Research Questions

01 Knowledge Creation vs. Retrieval

Is our multi-agent debate genuinely creating new knowledge, or is it sophisticated retrieval dressed up as knowledge-creation?

The tension: Deutsch argues that genuine knowledge creation involves creativity: the ability to generate new explanations that weren't implicit in the inputs. Large language models operate through pattern matching on training data.

Our current position: We implement structured reasoning that approximates conjecture-refutation. Whether this constitutes 'knowledge creation' in Deutsch's sense is philosophically contested.

02 Falsifiability with Delayed Ground Truth

In medicine, ground truth often takes weeks or months. How do you run conjecture-refutation cycles when refutation is delayed?

The tension: Popperian epistemology emphasizes rapid refutation. But medical outcomes often take days (lab results), weeks (symptom resolution), months (disease progression), or years (long-term outcomes).

What would resolve this: A formal framework for 'asynchronous refutation' that maintains epistemic rigor across time delays.

03 Hard-to-Vary vs. Prior Probability

Is there a coherent way to combine 'hard-to-vary' with 'prior probability'? Or are they fundamentally incompatible?

The tension: Deutsch is skeptical of Bayesian epistemology as a foundation for knowledge. Yet medicine is deeply Bayesian: pre-test probability, likelihood ratios, and posterior odds are foundational to clinical reasoning.

04 Scaling Refutation

Our Verifier uses schema-driven refutation. What other refutation mechanisms should we consider?

What would improve this:

  • Literature-backed refutation: 'This mechanism contradicts the findings in [Paper X]'
  • Causal model refutation: 'This intervention has no causal path to the claimed outcome'
  • Temporal refutation: 'This timeline is physiologically impossible'

05 Calibration Without Ground Truth

How do we calibrate HTV thresholds when we can't always observe outcomes?

Our current approach: Validate against clinician agreement as a proxy. If high-HTV claims consistently get clinician approval and low-HTV claims get overridden, the thresholds are directionally correct.
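A sketch of that proxy, assuming a log of (htv_score, clinician_approved) pairs; if approval rates rise across the HTV bands, the thresholds are at least directionally correct:

```python
def approval_by_htv_band(log: list[tuple[float, bool]]) -> dict[str, float]:
    """Clinician approval rate per HTV band (bands mirror the v0.1 thresholds)."""
    def band(score: float) -> str:
        if score < 0.3:
            return "<0.3"
        if score < 0.4:
            return "0.3-0.4"
        if score < 0.7:
            return "0.4-0.7"
        return ">=0.7"

    grouped: dict[str, list[bool]] = {}
    for score, approved in log:
        grouped.setdefault(band(score), []).append(approved)
    return {b: sum(votes) / len(votes) for b, votes in grouped.items()}

log = [(0.2, False), (0.35, False), (0.55, True), (0.62, False), (0.85, True)]
print(approval_by_htv_band(log))
# {'<0.3': 0.0, '0.3-0.4': 0.0, '0.4-0.7': 0.5, '>=0.7': 1.0}
```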

06 Gaming the HTV Score

Can a sufficiently sophisticated model learn to produce high-HTV explanations that are fabricated?

The risk: HTV measures structural quality, not truth. A model could learn to fabricate specific lab values (high specificity), invent plausible mechanisms (high interdependence), and include irrelevant but testable predictions (high falsifiability).

Why it matters: If HTV can be gamed, it becomes security theater rather than genuine quality assurance.

Contributing

If you're a researcher interested in any of these questions, we'd welcome collaboration. These aren't rhetorical challenges: they're genuine open problems we're actively working on.

Glossary

Key terms used throughout this document.

  • HTV (Hard-to-Vary): a score (0.0-1.0) measuring how much each part of an explanation is load-bearing.
  • Falsification criteria: the observations that would prove a claim wrong.
  • Conjecture-refutation: Popper's method of proposing ideas, then trying to disprove them.
  • IDK Protocol: structured admission of uncertainty with 12 specific trigger types.
  • Evidence grade: how methodologically rigorous the supporting evidence is.
  • Epistemological metadata: information about the quality and basis of a claim.
  • Discriminator: a test designed to falsify one theory while leaving another intact.
  • Circuit breaker: a safety mechanism that automatically stops the system when errors become too frequent.

References

  1. Popper, K. (1963). Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge.
  2. Deutsch, D. (2011). The Beginning of Infinity: Explanations That Transform the World. Viking.
  3. Taleb, N.N. (2007). The Black Swan: The Impact of the Highly Improbable. Random House.
  4. Hunter, A. & Williams, M. (2012). Aggregating evidence about the positive and negative effects of treatments. Artificial Intelligence in Medicine, 56(3), 173-190.
  5. GRADE Working Group (2004). Grading quality of evidence and strength of recommendations. BMJ, 328(7454), 1490.
  6. Regain Health (2025). Hard to Vary Specification v0.9.0. Internal documentation.

Prior Work & Influences

We have built upon the following work:

| Component | Based On | Our Extension |
|---|---|---|
| ArgMed Debate | ArgMed-Agents (Hong et al., 2024) | Added HTV scoring + Popperian falsification |
| Multi-Agent Debate | Du et al., 2023 | Applied to clinical domain with safety routing |
| Evidence Hierarchy | GRADE / Canadian Task Force | Reinterpreted through Deutschian "hard to vary" lens |
| HTV Operationalization | Parascandolo et al., ICLR 2021 | Extended to 4-dimension clinical scoring |
| Argumentation Schemes | Douglas Walton | Applied to clinical reasoning |
| Uncertainty Abstention | Leibig et al., 2019 | Formalized as IDK Protocol with trigger taxonomy |
| Hypothesis Diversity | Kammer et al., 2021 | Enforced via mechanism-diversity requirement |

What We Believe Is Novel

  • Unified architecture combining all above under Deutschian/Popperian philosophy
  • The Non-Trust Principle (epistemological metadata increases but never decreases conservatism)
  • Safety rules implementing demarcation as executable policy
  • Clinician feedback loop with confidence decay (sketched after this list)
  • Systematic application of Deutsch's philosophy to healthcare AI architecture
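A sketch of what confidence decay could look like; the exponential form and the half-life of three overrides are illustrative assumptions, not the production model:

```python
def decayed_confidence(base: float, n_overrides: int,
                       half_life: float = 3.0) -> float:
    """Confidence in a reasoning pattern after clinician overrides:
    each `half_life` overrides halve the remaining confidence."""
    return base * 0.5 ** (n_overrides / half_life)

print(decayed_confidence(0.9, 0))  # 0.9: no overrides, confidence untouched
print(decayed_confidence(0.9, 3))  # 0.45: halved after three overrides
```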

Ongoing Validation Work

01 Comparing HTV scores against clinician-rated explanation quality
02 Tracking correlation between HTV and downstream outcome accuracy
03 Red-teaming for "HTV gaming" (superficially specific but fabricated claims)
04 Monitoring score drift over model updates

Problems are inevitable. Problems are soluble.

David Deutsch, The Beginning of Infinity

This document set explains how we translate Deutschian optimism into engineering constraints: every problem is addressable given the right knowledge, and our architecture is designed to acquire and correct that knowledge over time.

Anton Kim

CEO, Regain Inc.

Questions about our epistemological approach? Let's discuss.

Last updated: 2026-01-26  •  Version 2.1.0