The Cathedral Architecture Manifesto

Section I

The Los Alamos Principle

In 1943, a small group of physicists gathered in the New Mexico desert to solve an "impossible" problem. They had no powerful computers. No decades of established methodology. What they had was complete knowledge of the physics, meticulous planning, and absolute precision in implementation.

They succeeded because they understood a fundamental truth: when you know everything about your problem at design time, you can build solutions that seem impossible to those constrained by runtime thinking.

Core Principle

Cathedral Architecture embodies the Los Alamos insight for modern systems programming. The most powerful optimizations come not from clever runtime tricks, but from encoding complete knowledge into the structure itself before execution begins.

When you carve the solution in stone before main() runs, you achieve what appears magical to those still building with runtime scaffolding.

Section II

Cathedral vs. The Bazaar: A Type System Perspective

Eric Raymond taught us about two development models: the Cathedral (designed, planned, structured) versus the Bazaar (emergent, dynamic, flexible). We propose a further dimension to this metaphor: when the architecture crystallizes, at runtime or at compile-time.

The Runtime Bazaar — Everything at Runtime

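class Tensor {
    void* data;
    std::vector<int> shape;
    std::string dtype;

    // Shape and dtype are only known at runtime, so every access
    // recomputes a byte offset and hands back an untyped pointer.
    void* index(const std::vector<int>& coords) {
        return static_cast<char*>(data) + calculate_offset(coords);
    }
};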

Cathedral Architecture — Everything at Compile-Time

template<uint32_t batch, uint32_t seq,
         uint32_t hidden, uint32_t heads>
struct AttentionTensor {
    static constexpr auto dims =
        dimensions<batch,seq,hidden,heads>;
    using dtype = float16_t;

    template<uint32_t b,uint32_t s,
             uint32_t h,uint32_t n>
    DEVICE dtype& at() {
        static_assert(b<batch && s<seq && h<hidden && n<heads);
        constexpr uint64_t offset =
            compile_time_index<dims,b,s,h,n>();
        return data[offset];
    }
    dtype* data;
};

The runtime version buys maximum flexibility at the price of minimal performance guarantees. The Cathedral version knows everything at compile-time, carries zero runtime overhead, and lets the compiler see the entire universe of your computation before generating a single instruction.

This is the essence of Cathedral Architecture: crystallizing decisions into the type system, transforming runtime computation into compile-time verification.

Section III

The Four Pillars of Cathedral Architecture

I

Complete Compile-Time Knowledge

Encode types in templates. Prove bounds with static_assert. Resolve dispatch at compile-time. Compute layouts in constexpr/consteval.

II

Zero-Cost Abstractions as a Hard Guarantee

Not a compiler-dependent hope — a provable guarantee. Template instantiation equals compile-time dispatch. No runtime polymorphism. No indirection.

III

The Type System as Computational Graph

Dependencies are types, not runtime values. The entire transformer layer is a TYPE. Memory planning is constexpr. Impossible graphs cannot compile.

IV

Cathedral Mathematics — Computation as Proof

Correct programs are mathematical proofs. If the code compiles, the mathematics is correct. Type constraints provide mechanical verification.

Pillar I — Complete Compile-Time Knowledge

Consider the humble operation thread_id / seq_len.

Runtime Thinking

// inside a __device__ function; seq_len is only known at runtime
uint32_t batch = thread_id / seq_len;

Cathedral Thinking

template<uint32_t divisor> struct Division {
    DEVICE static uint32_t div(uint32_t value) {
        if constexpr (is_power_of_2(divisor)) {
            return value >> log2_ct(divisor);
        } else {
            constexpr auto params =
                barrett_params(divisor);
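            // Widen first: a 32-bit product would overflow before the shift.
            return static_cast<uint32_t>(
                (static_cast<uint64_t>(value) * params.multiplier)
                    >> params.shift);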
        }
    }
};

15–40× Speedup on indexing operations

100% Division instructions eliminated

28.5% Reduction in total kernel instruction count
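The helpers `is_power_of_2`, `log2_ct`, and `barrett_params` are named above but not shown. The sketch below fills them in under stated assumptions: it uses a multiply-by-reciprocal scheme in the style of Lemire's fastdiv (a 64-bit magic constant, `ceil(2^64 / divisor)`), which may differ from the Barrett parameters the whitepaper actually uses. The `mul_hi` helper is hypothetical, and the DEVICE qualifier is left out so the block compiles as plain C++20.

```cpp
#include <cstdint>

// True when d has exactly one bit set.
constexpr bool is_power_of_2(std::uint32_t d) {
    return d != 0 && (d & (d - 1)) == 0;
}

// floor(log2(d)) for d > 0.
constexpr std::uint32_t log2_ct(std::uint32_t d) {
    std::uint32_t r = 0;
    while (d >>= 1) ++r;
    return r;
}

// (m * n) >> 64, computed with 64-bit arithmetic only (hypothetical helper).
constexpr std::uint64_t mul_hi(std::uint64_t m, std::uint32_t n) {
    const std::uint64_t lo = (m & 0xFFFFFFFFu) * n;  // low half of m times n
    const std::uint64_t hi = (m >> 32) * n;          // high half of m times n
    return (hi + (lo >> 32)) >> 32;
}

template <std::uint32_t divisor>
struct Division {
    static_assert(divisor != 0, "division by zero");

    static constexpr std::uint32_t div(std::uint32_t value) {
        if constexpr (is_power_of_2(divisor)) {
            return value >> log2_ct(divisor);
        } else {
            // ceil(2^64 / divisor), computed by the compiler. With a 64-bit
            // magic the quotient is exact for every 32-bit value.
            constexpr std::uint64_t magic = ~std::uint64_t{0} / divisor + 1;
            return static_cast<std::uint32_t>(mul_hi(magic, value));
        }
    }
};

// Usage: the results agree with the built-in operator.
static_assert(Division<8>::div(1000u) == 1000u / 8);
static_assert(Division<7>::div(1000000u) == 1000000u / 7);
```

On a GPU the `mul_hi` step would more likely map to a hardware high-multiply intrinsic; the portable version here is only meant to make the arithmetic visible.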

Pillar II — Zero-Cost Abstractions as a Hard Guarantee

Traditional — Compiler-Dependent Hope

struct Tensor {
    virtual ~Tensor() = default;
    // Resolved through the vtable on every call.
    virtual float compute(const Tensor& t) = 0;
};

struct FastTensor : Tensor {
    float compute(const Tensor& t) override;
};

Cathedral — Provable Guarantee

template<TensorConcept T>
DEVICE float compute(const T& tensor) {
    return tensor.compute_impl();
}

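Guaranteed zero cost because: template instantiation = compile-time dispatch; DEVICE = forced inlining (assuming the macro expands to __device__ __forceinline__, since plain inline is only a hint to the compiler); concept checking = compile-time verification; no runtime polymorphism = no indirection.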

Readable Compile-Time Diagnostics — The Cathedral Error Model

Cathedral Architecture doesn't just fail at compile-time — it fails informatively. A plain static_assert tells you that something is wrong. Cathedral error architecture tells you exactly what went wrong, with the actual values that caused it, baked directly into the type system.

Plain static_assert — A Fire Alarm

static_assert(A::dims[1] == B::dims[0],
              "Matrix dimension mismatch");

// → error: Matrix dimension mismatch
// You know something is wrong. You don't know what the values were.

Cathedral Error Architecture — The Full Incident Report

static_assert_printer<
    A::dims[1] == B::dims[0],
    MatMulError::dimension_mismatch,
    static_assert_printer_val_inserter<A::dims[1], B::dims[0]>
>::impl;

// → error: 'nonexistent_value' is not a member of
// 'static_assert_printer_impl<MatMulError::dimension_mismatch,
//  static_assert_printer_val_inserter<512, 768>>'
//
// You see the error enum. You see 512 vs 768. You fix it
// without running a single instruction.

Cathedral Error Principle

Plain static_assert is a fire alarm. Cathedral error architecture is the full incident report — with the floor number, the temperature, and the name of whoever left the stove on.
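The implementation of `static_assert_printer` is not shown in this text. The sketch below is a guess at one way such a mechanism could work, using hypothetical names (`assert_printer`, `printer_impl`, `val_inserter`, `check_matmul`): on failure it names a member that does not exist on a type whose template arguments are the error enumerator and the offending values, so the compiler has to spell them out in its diagnostic. The exact wording of that diagnostic varies by compiler.

```cpp
enum class MatMulError { dimension_mismatch };

// Carries the offending values as template arguments.
template <auto... values>
struct val_inserter {};

// Deliberately empty: naming any member of it is an error, and the error
// message must mention the full type, arguments included.
template <auto error, typename values>
struct printer_impl {};

// Condition holds: `impl` is simply true.
template <bool condition, auto error, typename values>
struct assert_printer {
    static constexpr bool impl = true;
};

// Condition fails: evaluating `impl` touches the missing member.
template <auto error, typename values>
struct assert_printer<false, error, values> {
    static constexpr bool impl = printer_impl<error, values>::nonexistent_value;
};

template <unsigned a_cols, unsigned b_rows>
constexpr bool check_matmul =
    assert_printer<a_cols == b_rows,
                   MatMulError::dimension_mismatch,
                   val_inserter<a_cols, b_rows>>::impl;

static_assert(check_matmul<768, 768>);
// static_assert(check_matmul<512, 768>);  // uncomment to see 512 and 768 in the error
```

The design choice is that the values travel as non-type template arguments, which is the one place a compiler reliably prints them.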

Pillar III — The Type System as Computational Graph

The Entire Transformer Layer — As a Type

template<ModelConfig config>
struct TransformerBlock {
    using qcur      = MatMul<config, attn_q, attn_norm_mul>;
    using kcur      = MatMul<config, attn_k, attn_norm_mul>;
    using vcur      = MatMul<config, attn_v, attn_norm_mul>;
    using qcur_rope = Rope<config, qcur, inp_pos, rope_freqs>;
    using kcur_rope = Rope<config, kcur, inp_pos, rope_freqs>;
    using kq        = MatMul<config, k_view, q_permute>;
    using kq_soft   = Softmax<config, kq, kq_mask>;
    using kqv       = MatMul<config, v_view, kq_soft>;
    using output    = MatMul<config, attn_output, kqv_merged_cont>;
};

Dependencies are compile-time relationships
The compiler can see the entire compute graph
Memory planning is constexpr — exact requirements computed by the compiler
Impossible to construct invalid graphs — the type system won't allow it
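The `MatMul`, `Rope`, and `Softmax` node types above belong to the author's codebase and are not reproduced here. To illustrate the four properties just listed on a much smaller scale, the sketch below uses two hypothetical node types, `Input` and `MatMulNode`, and a deliberately simple memory model (one float buffer per node output):

```cpp
#include <cstddef>

// Leaf node: a tensor with a fixed shape that owns no scratch space.
template <std::size_t R, std::size_t C>
struct Input {
    static constexpr std::size_t rows = R, cols = C;
    static constexpr std::size_t workspace_bytes = 0;
};

// Interior node: its operands are types, so each dependency edge is part of
// the node's own type.
template <typename A, typename B>
struct MatMulNode {
    static_assert(A::cols == B::rows, "inner dimensions must agree");
    static constexpr std::size_t rows = A::rows, cols = B::cols;
    // This node's output buffer plus everything upstream of it.
    static constexpr std::size_t workspace_bytes =
        A::workspace_bytes + B::workspace_bytes + rows * cols * sizeof(float);
};

using X   = Input<8, 64>;
using Wq  = Input<64, 64>;
using Wo  = Input<64, 32>;
using Q   = MatMulNode<X, Wq>;   // 8 x 64
using Out = MatMulNode<Q, Wo>;   // 8 x 32

// Shape and memory requirements are known before anything runs.
static_assert(Out::rows == 8 && Out::cols == 32);
static_assert(Out::workspace_bytes == (8 * 64 + 8 * 32) * sizeof(float));
// MatMulNode<Wo, X>::rows would not compile: 32 != 8.
```

A real planner would also reuse buffers whose lifetimes do not overlap; that refinement is omitted to keep the example short.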

Pillar IV — Cathedral Mathematics: Computation as Proof

Runtime — Assertions That Hope

void matmul(Tensor a, Tensor b, Tensor out) {
    assert(a.shape[1] == b.shape[0]);
    assert(out.shape[0] == a.shape[0]);
    assert(out.shape[1] == b.shape[1]);
}

Cathedral — Proofs That Compile

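// A and B are shape types exposing a constexpr dims array.
template<typename A, typename B>
    requires(A::dims[1] == B::dims[0])
auto matmul(Tensor<A> a, Tensor<B> b)
    -> Tensor<Dims<A::dims[0], B::dims[1]>>
{
    // body elided; the result shape is already fixed by the return type
}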

If the code compiles, the mathematics is correct. This isn't just type safety — this is mechanical verification of mathematical correctness. The Cathedral transformation: "Test that your matrix dimensions match" becomes "Prove that your matrix dimensions match, or the program doesn't exist."

Section IV

Division Elimination as Exemplar

The Division Elimination whitepaper demonstrates Cathedral thinking applied to one of computing's most fundamental operations. Integer division costs 20–40 cycles on GPUs. Every tensor indexing operation requires multiple divisions. This overhead dominates lightweight kernels.

Runtime thinking: "Division is expensive, we'll try to avoid it where we can."
Cathedral thinking: "We know at compile-time which divisors are constant, which are powers-of-2, and which are runtime values set once per request. Therefore division can be eliminated entirely."

The Cathedral Transformation

template<uint32_t divisor> struct Div {
    constexpr uint32_t operator()(uint32_t value) const {
        if constexpr (is_power_of_2(divisor))
            return value >> log2_ct(divisor);   // integer log2, not std::log2
        else
            return barrett_reduce<divisor>(value);
    }
};

100% Division instructions eliminated from device code

15–40× Speedup on indexing operations

28.5% Reduction in total kernel instruction count

This exemplifies the Cathedral approach: complete knowledge → zero cost → type safety → mathematical proof. When you build the cathedral properly, the "impossible" becomes routine.
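The third case named above, runtime values set once per request, cannot be a template argument, but the same reciprocal trick still applies if the constant is computed once when the value becomes known. The sketch below assumes a hypothetical `RuntimeDivisor` type and, as before, a fastdiv-style 64-bit magic constant rather than the whitepaper's own parameters:

```cpp
#include <cstdint>

// Hypothetical helper for a divisor that is fixed for the lifetime of a
// request (for example, a sequence length) but unknown at compile time.
struct RuntimeDivisor {
    std::uint64_t magic;  // ceil(2^64 / d) for d > 1
    std::uint32_t d;

    // The only real division happens here, once. Requires divisor >= 1.
    constexpr explicit RuntimeDivisor(std::uint32_t divisor)
        : magic(divisor > 1 ? ~std::uint64_t{0} / divisor + 1 : 0),
          d(divisor) {}

    // Multiply and shift on every subsequent call.
    constexpr std::uint32_t divide(std::uint32_t n) const {
        if (d == 1) return n;  // the magic for 1 would not fit in 64 bits
        const std::uint64_t lo = (magic & 0xFFFFFFFFu) * n;
        const std::uint64_t hi = (magic >> 32) * n;
        return static_cast<std::uint32_t>((hi + (lo >> 32)) >> 32);
    }
};

// Usage: constructed once, then reused for every index computation.
static_assert(RuntimeDivisor{12}.divide(1000u) == 1000u / 12);
```

In a CUDA setting the struct would typically be built on the host and passed to the kernel by value; that plumbing is not shown.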

Section V

Why Now? The Technology Convergence

Cathedral Architecture became possible only recently due to the convergence of five separate technological threads.

C++20 consteval & constexpr

Arbitrary computation at compile-time. Barrett reduction parameters computed before main(). Memory layouts calculated by the compiler.

Template Metaprogramming Maturity

Concepts for readable constraints. if constexpr for zero-cost branching. Fold expressions for variadic operations.

CUDA Compute Capability

Massive parallelism needs minimal per-thread overhead. L2 cache makes constant memory fast. Intrinsics enable Barrett reduction.

Modern Compilers

Aggressive template instantiation. Cross-translation-unit optimization. Reliable constant folding.

Hardware Trends

Computation is cheap (100+ TFLOPS). Memory bandwidth is precious. Control flow overhead dominates.

The Synthesis

These technologies existed separately for years. Cathedral Architecture is their convergence into a coherent, unified methodology.

Section VI

The Cathedral Development Model

Design Principles

Encode invariants in types. If a property can be expressed as a constraint, it must be expressed as a constraint.
Compute once, use forever. Static constexpr computation is not repeated — it is crystallized.
Fail at compile-time, not runtime. A program that violates its invariants does not compile, and so does not exist.
Let the compiler see everything. Opaque implementations defeat the Cathedral. Transparency is strength.

Development Workflow

Traditional Cycle

Write → Test → Profile
     → Optimize → Repeat

Cathedral Cycle

Design → Encode → Verify
       → Compile → Deploy

The Cathedral workflow frontloads effort. Longer design phase. Types capture all invariants. static_assert everything. The payoff: zero runtime debugging, zero "works on my machine" bugs, zero performance surprises, provably correct code.

The Leaf Class Cardinality Principle

Cathedral Architecture's emphasis on compile-time computation raises a legitimate concern: template instantiation can explode combinatorially. The solution is minimizing leaf class cardinality through strategic type propagation.

Naive approach — 614,400 instantiations

5 architectures × 4 sizes × 2 devices × 10 batch sizes × 8 sequence lengths × 4 hidden dims × 4 head configs × 2 kv configs × 3 quant types × 2 flash options = 614,400 combinations. Compile time: hours. Binary size: gigabytes. Impractical.

What Gets Templated Where

Top Level — Full Templating: Model architecture selection, quantization strategies, memory layout decisions, workspace computation, type construction.
Middle Level — Selective Templating: Tensor dimensions where varying, data type transformations, memory access patterns.
Leaf Level — Minimal Templating: Input/output types ONLY. NOT the full configuration space.

~50 Top-level configuration variations in Nihilus

~15 Leaf-level kernel instantiations

750 Total instantiations (50 × 15) vs. 614,400 naive

The Cathedral Principle

Compute once at the top, instantiate minimally at the leaves. The wisdom is knowing WHERE to spend your compile-time budget.
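The layering can be made concrete with a small sketch. The names `convert_kernel` and `Model` are hypothetical and the "kernel" is an ordinary host function, but the structure follows the three levels described above: the top level is templated on the full configuration, while the leaf sees only element types and receives sizes as ordinary arguments.

```cpp
#include <cstddef>

// Leaf: templated on input/output types only. Every configuration with the
// same types shares this one instantiation.
template <typename In, typename Out>
void convert_kernel(const In* in, Out* out, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) out[i] = static_cast<Out>(in[i]);
}

// Top level: fully templated; derived constants are computed here, once.
template <std::size_t batch, std::size_t seq, std::size_t hidden,
          typename In, typename Out>
struct Model {
    static constexpr std::size_t elements = batch * seq * hidden;
    static constexpr auto leaf = &convert_kernel<In, Out>;

    static void run(const In* in, Out* out) { leaf(in, out, elements); }
};

using Small = Model<1, 128, 64, float, double>;
using Large = Model<8, 2048, 4096, float, double>;

// Two distinct top-level types, one shared leaf function.
static_assert(Small::elements != Large::elements);
static_assert(Small::leaf == Large::leaf);
```

The cost of this choice is that the leaf no longer sees the sizes as constants; the principle above is a judgment that, at the leaves, the saved instantiations are worth more than that lost folding.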

Section VII

Cathedral Anti-Patterns

What Cathedral Architecture is NOT:

Template metaprogramming for its own sake. Cathedral uses templates to encode domain knowledge, not to compute Fibonacci numbers at compile-time.
Over-generalization. Cathedral specializes aggressively when specialization enables optimization.
Compile-time computation of runtime values. Cathedral distinguishes what CAN be compile-time (architecture) from what MUST be runtime (data).
Type gymnastics that obscure intent. If you need a PhD to understand the types, you're doing it wrong.
Stringly-typed compile-time errors. A string is not a diagnostic. Cathedral errors carry the values that caused them.

What Cathedral Architecture IS:

Encoding domain knowledge in types
Eliminating runtime overhead through compile-time reasoning
Proving correctness through type constraints
Specializing aggressively for known cases
Letting the compiler see the full computation
Minimizing leaf instantiation cardinality
Producing typed, value-carrying compile-time diagnostics

Section VIII

The Future of Cathedral Architecture

Near-Term Evolution

Auto-tuning through compile-time search. The compiler tries multiple implementations and selects the fastest — at compile-time.
Formal verification integration. Types that carry formal proofs of thread-safety, deadlock-freedom, or numerical stability.
Cross-kernel optimization. Pipeline-level type analysis enabling automatic kernel fusion and transfer minimization.
Hardware-specific specialization. Architecture-aware templating that generates maximally optimal code per target.

Long-Term Vision

Programs are proofs. The type system verifies correctness properties that currently require runtime testing or manual review.
Performance is guaranteed. "Zero-cost abstraction" becomes a formal, verifiable property rather than a compiler-dependent hope.
Bugs are eliminated. Type errors caught at compile-time → no runtime errors → no CVEs from memory corruption.
Optimization is automatic. Complete compiler information → globally optimal decisions → human expertise encoded once, applied everywhere.

The Los Alamos Dream Realized

When you know everything about your problem, you build solutions that appear impossible to those constrained by runtime thinking.

Section IX

Call to Action

Cathedral Architecture is not just a technique — it's a mindset. A recognition that the boundary between compile-time and runtime is not fixed by the language, but chosen by the programmer.

Every time you write this...

if (condition) { }

void* ptr = malloc(size);

virtual void compute() = 0;

static_assert(condition,
    "something went wrong");

...ask yourself this.

if constexpr (condition) { }

static std::byte storage[size];  // size is a compile-time constant

template<typename T> void compute() { }

static_assert_printer<
    condition,
    ErrorEnum::specific_failure,
    static_assert_printer_val_inserter<values>
>::impl;

The question is not "Can I move this to compile-time?" The question is "Why haven't I moved this to compile-time?"

Join the Cathedral

Theoreticians

Formalizing the principles, proving properties, establishing bounds.

Practitioners

Building libraries, writing kernels, pushing the boundaries of what compiles.

Toolsmiths

Improving compilers, creating analyzers, automating verification.

This is not about adopting a library or framework. This is about how we think about performance-critical systems. When you stare at your template metaprogramming code and see not complexity but crystallized knowledge — you're thinking Cathedral. When your compiler errors carry the exact values that caused them and you fix the bug without running a single line — you're thinking Cathedral.

Section X

The Cathedral Covenant

We, the cathedral builders, commit to:

Encode knowledge in types — What can be proven shall be proven.
Eliminate runtime uncertainty — What can be decided shall be decided at compile-time.
Guarantee zero cost — Abstractions shall have provably zero overhead.
Prove correctness — Type systems shall verify properties, not just prevent crashes.
Share knowledge freely — Cathedral techniques shall be open, documented, teachable.
Minimize leaf cardinality — Compute liberally at the top, instantiate conservatively at the leaves.
Diagnose with precision — Compile-time errors shall carry the values that caused them, not strings that describe them.

The Cathedral Architect's Creed

"Give me the complete design, and I will move the computation to compile-time."

Appendix A

Cathedral Reading List

Foundational Theory

Stroustrup, B. — The Design and Evolution of C++

Alexandrescu, A. — Modern C++ Design

Abrahams, D. & Gurtovoy, A. — C++ Template Metaprogramming

Performance Engineering

Warren, H. S. — Hacker's Delight

Fog, A. — Optimizing Software in C++

Lemire, D. — Fast Random Integer Generation in an Interval

Type Theory

Pierce, B. C. — Types and Programming Languages

Martin-Löf, P. — Intuitionistic Type Theory

The Cathedral Whitepapers

Compile-Time Division Elimination for Zero-Overhead Tensor Indexing in CUDA Kernels

Order-Agnostic Constexpr Configuration: A Type-Routed Compile-Time Parameter System with Enforced Uniqueness

More to come...