Cathedral Architecture

The Manifesto

VERSION 1.2  ·  TORONTO, MARCH 2026

Section I

The Los Alamos Principle

In 1943, a small group of physicists gathered in the New Mexico desert to solve an "impossible" problem. They had no powerful computers. No decades of established methodology. What they had was complete knowledge of the physics, meticulous planning, and absolute precision in implementation.

They succeeded because they understood a fundamental truth: when you know everything about your problem at design time, you can build solutions that seem impossible to those constrained by runtime thinking.

Core Principle

Cathedral Architecture embodies the Los Alamos insight for modern systems programming. The most powerful optimizations come not from clever runtime tricks, but from encoding complete knowledge into the structure itself before execution begins.

When you carve the solution in stone before main() runs, you achieve what appears magical to those still building with runtime scaffolding.

Section II

Cathedral vs. The Bazaar: A Type System Perspective

Eric Raymond taught us about two development models: the Cathedral (designed, planned, structured) versus the Bazaar (emergent, dynamic, flexible). We propose a third dimension to this metaphor: when the architecture crystallizes.

The Runtime Bazaar — Everything at Runtime
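class Tensor {
    void* data;          // untyped storage; the element type is only known at runtime
    vector<int> shape;   // dimensions discovered at runtime
    string dtype;        // element type carried as a string

    void* index(vector<int> coords) {
        // Arithmetic on void* is ill-formed in standard C++, so step in bytes.
        return static_cast<char*>(data) + calculate_offset(coords);
    }
};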
Cathedral Architecture — Everything at Compile-Time
template<uint32_t batch, uint32_t seq,
         uint32_t hidden, uint32_t heads>
struct AttentionTensor {
    static constexpr auto dims =
        dimensions<batch,seq,hidden,heads>;
    using dtype = float16_t;

    template<uint32_t b,uint32_t s,
             uint32_t h,uint32_t n>
    DEVICE dtype& at() {
        static_assert(b<batch && s<seq && h<hidden && n<heads);
        constexpr uint64_t offset =
            compile_time_index<dims,b,s,h,n>();
        return data[offset];
    }
    dtype* data;
};

The Runtime Bazaar offers maximum flexibility and minimum performance guarantees. Cathedral Architecture offers the opposite: everything is known at compile-time, runtime overhead is zero, and the compiler sees the entire universe of your computation before generating a single instruction.

This is the essence of Cathedral Architecture: crystallizing decisions into the type system, transforming runtime computation into compile-time verification.

Section III

The Four Pillars of Cathedral Architecture

I

Complete Compile-Time Knowledge

Encode types in templates. Prove bounds with static_assert. Resolve dispatch at compile-time. Compute layouts in constexpr/consteval.

II

Zero-Cost Abstractions as a Hard Guarantee

Not a compiler-dependent hope — a provable guarantee. Template instantiation equals compile-time dispatch. No runtime polymorphism. No indirection.

III

The Type System as Computational Graph

Dependencies are types, not runtime values. The entire transformer layer is a TYPE. Memory planning is constexpr. Impossible graphs cannot compile.

IV

Cathedral Mathematics — Computation as Proof

Correct programs are mathematical proofs. If the code compiles, the mathematics is correct. Type constraints provide mechanical verification.


Pillar I — Complete Compile-Time Knowledge

Consider the humble operation thread_id / sequence_length.

Runtime Thinking
__device__ uint32_t batch =
    thread_id / seq_len;
Cathedral Thinking
template<uint32_t divisor> struct Division {
    DEVICE static uint32_t div(uint32_t value) {
        if constexpr (is_power_of_2(divisor)) {
            return value >> log2_ct(divisor);
        } else {
            constexpr auto params =
                barrett_params(divisor);
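            // Widen to 64 bits so the 32-bit product cannot wrap.
            return static_cast<uint32_t>(
                (static_cast<uint64_t>(value) * params.multiplier)
                       >> params.shift);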
        }
    }
};
15–40× Speedup on indexing operations
100% Division instructions eliminated
28.5% Reduction in total kernel instruction count
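How the non-power-of-two branch can stay exact is worth spelling out. The sketch below is an assumption, not the whitepaper's implementation: it swaps in the reciprocal-multiplication scheme described by Lemire, Kaser, and Kurz, where the magic constant is ceil(2^64 / d) and the quotient is the high 64 bits of magic × value. `mul_hi_64x32` is a hypothetical helper; in CUDA device code the `__umul64hi` intrinsic could play that role.

```cpp
#include <cstdint>

constexpr bool is_power_of_2(std::uint32_t d) {
    return d != 0 && (d & (d - 1)) == 0;
}

constexpr std::uint32_t log2_ct(std::uint32_t d) {
    std::uint32_t r = 0;
    while (d > 1) { d >>= 1; ++r; }
    return r;
}

// Hypothetical helper: high 64 bits of a 64-bit by 32-bit product,
// using only 64-bit arithmetic (no intermediate can overflow).
constexpr std::uint64_t mul_hi_64x32(std::uint64_t m, std::uint32_t n) {
    const std::uint64_t lo = (m & 0xFFFFFFFFull) * n;
    const std::uint64_t hi = (m >> 32) * n;
    return (hi + (lo >> 32)) >> 32;
}

template <std::uint32_t Divisor>
struct Division {
    static_assert(Divisor != 0, "division by zero");

    static constexpr std::uint32_t div(std::uint32_t value) {
        if constexpr (is_power_of_2(Divisor)) {
            return value >> log2_ct(Divisor);  // includes Divisor == 1
        } else {
            // ceil(2^64 / Divisor), fixed per instantiation.
            constexpr std::uint64_t magic = UINT64_MAX / Divisor + 1;
            return static_cast<std::uint32_t>(mul_hi_64x32(magic, value));
        }
    }
};

static_assert(Division<7>::div(100) == 14);
static_assert(Division<8>::div(100) == 12);
```

No division instruction remains in `div`: the only division is the one that computes `magic`, and that one runs inside the compiler.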

Pillar II — Zero-Cost Abstractions as a Hard Guarantee

Traditional — Compiler-Dependent Hope
struct Tensor {
    // Resolved through the vtable on every call.
    virtual float compute(const Tensor& t) = 0;
};

struct FastTensor : Tensor {
    float compute(const Tensor& t) override;
};
Cathedral — Provable Guarantee
template<TensorConcept T>
DEVICE float compute(const T& tensor) {
    return tensor.compute_impl();
}

Guaranteed zero cost because: template instantiation = compile-time dispatch; DEVICE inline = forced inlining; concept checking = compile-time verification; no runtime polymorphism = no indirection.


Readable Compile-Time Diagnostics — The Cathedral Error Model

Cathedral Architecture doesn't just fail at compile-time — it fails informatively. A plain static_assert tells you that something is wrong. Cathedral error architecture tells you exactly what went wrong, with the actual values that caused it, baked directly into the type system.

Plain static_assert — A Fire Alarm
static_assert(A::dims[1] == B::dims[0],
              "Matrix dimension mismatch");

// → error: Matrix dimension mismatch
// You know something is wrong. You don't know what the values were.
Cathedral Error Architecture — The Full Incident Report
static_assert_printer<
    A::dims[1] == B::dims[0],
    MatMulError::dimension_mismatch,
    static_assert_printer_val_inserter<A::dims[1], B::dims[0]>
>::impl;

// → error: 'nonexistent_value' is not a member of
// 'static_assert_printer_impl<MatMulError::dimension_mismatch,
//  static_assert_printer_val_inserter<512, 768>>'
//
// You see the error enum. You see 512 vs 768. You fix it
// without running a single instruction.
Cathedral Error Principle

Plain static_assert is a fire alarm. Cathedral error architecture is the full incident report — with the floor number, the temperature, and the name of whoever left the stove on.
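One way such a printer could be built is sketched below. It is an illustration under assumptions, not the actual Cathedral implementation: `printer`, `printer_impl`, `val_inserter`, and `Shape` are hypothetical names. The trick is that the failing specialization names a member that does not exist, so the compiler has to spell out the full type, values included, in its diagnostic. The exact wording of that diagnostic differs between compilers.

```cpp
enum class MatMulError { dimension_mismatch };

// Carries the offending values as non-type template arguments.
template <auto... Values>
struct val_inserter {};

// Deliberately empty: naming a member of it forces an error
// whose message contains the error enum and the values.
template <auto Error, typename Values>
struct printer_impl {};

// Condition holds: nothing happens.
template <bool Ok, auto Error, typename Values>
struct printer {
    static constexpr bool impl = true;
};

// Condition fails: reference a member that does not exist.
template <auto Error, typename Values>
struct printer<false, Error, Values> {
    static constexpr bool impl = printer_impl<Error, Values>::nonexistent_value;
};

// Hypothetical shape type used to exercise the printer.
template <unsigned Rows, unsigned Cols>
struct Shape {
    static constexpr unsigned dims[2] = {Rows, Cols};
};

using A = Shape<64, 512>;
using B = Shape<512, 768>;

// Compiles, because the inner dimensions agree.
static_assert(printer<A::dims[1] == B::dims[0],
                      MatMulError::dimension_mismatch,
                      val_inserter<A::dims[1], B::dims[0]>>::impl);
```

Changing `B` to `Shape<768, 512>` would make the same line fail, and the diagnostic would mention `val_inserter<512, 768>` alongside `MatMulError::dimension_mismatch`.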


Pillar III — The Type System as Computational Graph

The Entire Transformer Layer — As a Type
template<ModelConfig config>
struct TransformerBlock {
    using qcur      = MatMul<config, attn_q, attn_norm_mul>;
    using kcur      = MatMul<config, attn_k, attn_norm_mul>;
    using vcur      = MatMul<config, attn_v, attn_norm_mul>;
    using qcur_rope = Rope<config, qcur, inp_pos, rope_freqs>;
    using kcur_rope = Rope<config, kcur, inp_pos, rope_freqs>;
    using kq        = MatMul<config, k_view, q_permute>;
    using kq_soft   = Softmax<config, kq, kq_mask>;
    using kqv       = MatMul<config, v_view, kq_soft>;
    using output    = MatMul<config, attn_output, kqv_merged_cont>;
};
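A reduced sketch of the same idea, assuming hypothetical `Input` and `MatMulNode` types rather than the real `MatMul` and `Rope` templates: each node is a type, its shape is derived from its operands, and a scratch-memory budget is accumulated as a constant. The accounting assumes a tree; a node shared by two consumers would be counted twice.

```cpp
#include <cstddef>

// Leaf of the graph: a tensor supplied by the caller.
template <std::size_t Rows, std::size_t Cols>
struct Input {
    static constexpr std::size_t rows = Rows, cols = Cols;
    static constexpr std::size_t scratch_bytes = 0;
};

// Interior node: its operands are types, so the edge is part of the type.
template <typename Lhs, typename Rhs>
struct MatMulNode {
    static_assert(Lhs::cols == Rhs::rows, "inner dimensions must match");
    static constexpr std::size_t rows = Lhs::rows, cols = Rhs::cols;
    // Output buffer for this node plus everything upstream of it.
    static constexpr std::size_t scratch_bytes =
        Lhs::scratch_bytes + Rhs::scratch_bytes + rows * cols * sizeof(float);
};

using X      = Input<8, 64>;       // activations
using Wq     = Input<64, 64>;      // projection weights
using KT     = Input<64, 8>;       // transposed keys
using Q      = MatMulNode<X, Wq>;  // 8 x 64
using Scores = MatMulNode<Q, KT>;  // 8 x 8

static_assert(Scores::rows == 8 && Scores::cols == 8);
static_assert(Scores::scratch_bytes == (8 * 64 + 8 * 8) * sizeof(float));
```

Writing `MatMulNode<Q, X>` instead would trip the `static_assert` as soon as the alias is used, which is the sense in which an impossible graph cannot compile.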

Pillar IV — Cathedral Mathematics: Computation as Proof

Runtime — Assertions That Hope
void matmul(Tensor a, Tensor b, Tensor out) {
    assert(a.shape[1] == b.shape[0]);
    assert(out.shape[0] == a.shape[0]);
    assert(out.shape[1] == b.shape[1]);
}
Cathedral — Proofs That Compile
template<typename A, typename B>
    requires(A::dims[1] == B::dims[0])
auto matmul(Tensor<A> a, Tensor<B> b)
    -> Tensor<Dims<A::dims[0], B::dims[1]>>
{
    // Shapes are already proven compatible here; the kernel body is elided.
}

If the code compiles, the mathematics is correct. This isn't just type safety — this is mechanical verification of mathematical correctness. The Cathedral transformation: "Test that your matrix dimensions match" becomes "Prove that your matrix dimensions match, or the program doesn't exist."

Section IV

Division Elimination as Exemplar

The Division Elimination whitepaper demonstrates Cathedral thinking applied to one of computing's most fundamental operations. Integer division costs 20–40 cycles on GPUs. Every tensor indexing operation requires multiple divisions. This overhead dominates lightweight kernels.

The Cathedral Transformation
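template<uint32_t divisor> struct Div {
    constexpr uint32_t operator()(uint32_t value) const {
        if constexpr (is_pow2(divisor))
            return value >> log2_ct(divisor);       // integer log2, known at compile time
        else
            return barrett_reduce<divisor>(value);  // multiply-shift with per-divisor constants
    }
};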
100% Division instructions eliminated from device code
15–40× Speedup on indexing operations
28.5% Reduction in total kernel instruction count

This exemplifies the Cathedral approach: complete knowledge, zero cost, type safety, mathematical proof. When you build the cathedral properly, the "impossible" becomes routine.

Section V

Why Now? The Technology Convergence

Cathedral Architecture became possible only recently due to the convergence of five separate technological threads.

C++20 consteval & constexpr

Arbitrary computation at compile-time. Barrett reduction parameters computed before main(). Memory layouts calculated by the compiler.

Template Metaprogramming Maturity

Concepts for readable constraints. if constexpr for zero-cost branching. Fold expressions for variadic operations.

CUDA Compute Capability

Massive parallelism needs minimal per-thread overhead. L2 cache makes constant memory fast. Intrinsics enable Barrett reduction.

Modern Compilers

Aggressive template instantiation. Cross-translation-unit optimization. Reliable constant folding.

Hardware Trends

Computation is cheap (100+ TFLOPS). Memory bandwidth is precious. Control flow overhead dominates.

The Synthesis

These technologies existed separately for years. Cathedral Architecture is their convergence into a coherent, unified methodology.

Section VI

The Cathedral Development Model

Design Principles

Development Workflow

Traditional Cycle
Write → Test → Profile
     → Optimize → Repeat
Cathedral Cycle
Design → Encode → Verify
       → Compile → Deploy

The Cathedral workflow frontloads effort. Longer design phase. Types capture all invariants. static_assert everything. The payoff: zero runtime debugging, zero "works on my machine" bugs, zero performance surprises, provably correct code.


The Leaf Class Cardinality Principle

Cathedral Architecture's emphasis on compile-time computation raises a legitimate concern: template instantiation can explode combinatorially. The solution is minimizing leaf class cardinality through strategic type propagation.

Naive approach — 614,400 instantiations

5 architectures × 4 sizes × 2 devices × 10 batch sizes × 8 sequence lengths × 4 hidden dims × 4 head configs × 2 kv configs × 3 quant types × 2 flash options. Compile time: Hours. Binary size: Gigabytes. Impractical.

What Gets Templated Where

~50 Top-level configuration variations in Nihilus
~15 Leaf-level kernel instantiations
750 Total instantiations vs. 614,400 naïve
The Cathedral Principle

Compute once at the top, instantiate minimally at the leaves. The wisdom is knowing WHERE to spend your compile-time budget.
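The propagation idea can be sketched as follows. All names (`TopLevelConfig`, `AttentionKernel`, `Quant`) and the parameter choices are hypothetical, not taken from Nihilus: the top-level configuration carries many parameters, but it forwards to the leaf only the values that change the generated code, so different configurations can share a single leaf instantiation.

```cpp
#include <cstdint>
#include <type_traits>

enum class Quant { f16, q8 };

// Leaf: templated only on what alters its generated code.
template <std::uint32_t HeadDim, Quant Q>
struct AttentionKernel {
    static constexpr std::uint32_t head_dim = HeadDim;
};

// Top level: many parameters, most of which stop here as constants.
template <std::uint32_t Layers, std::uint32_t Hidden, std::uint32_t Heads, Quant Q>
struct TopLevelConfig {
    static_assert(Hidden % Heads == 0, "hidden size must divide evenly across heads");
    static constexpr std::uint32_t layers = Layers;  // used at the top, never reaches the leaf
    using attention_kernel = AttentionKernel<Hidden / Heads, Q>;
};

using Small = TopLevelConfig<16, 2048, 16, Quant::q8>;  // head_dim 128
using Large = TopLevelConfig<32, 4096, 32, Quant::q8>;  // head_dim 128

// Two distinct configurations, one leaf instantiation.
static_assert(!std::is_same_v<Small, Large>);
static_assert(std::is_same_v<Small::attention_kernel, Large::attention_kernel>);
```

The leaf count then grows with the number of distinct (head dimension, quantization) pairs rather than with the full cross product of configuration parameters.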

Section VII

Cathedral Anti-Patterns

What Cathedral Architecture is NOT:


What Cathedral Architecture IS:

Section VIII

The Future of Cathedral Architecture

Near-Term Evolution

Long-Term Vision

The Los Alamos Dream Realized

When you know everything about your problem, you build solutions that appear impossible to those constrained by runtime thinking.

Section IX

Call to Action

Cathedral Architecture is not just a technique — it's a mindset. A recognition that the boundary between compile-time and runtime is not fixed by the language, but chosen by the programmer.

Every time you write this...
if (condition) { }

void* ptr = malloc(size);

virtual void compute() = 0;

static_assert(condition,
    "something went wrong");
...ask yourself this.
if constexpr (condition) { }

static byte storage[size];

template<typename T> void compute() { }

static_assert_printer<
    condition,
    ErrorEnum::specific_failure,
    static_assert_printer_val_inserter<values>
>::impl;

The question is not "Can I move this to compile-time?" The question is "Why haven't I moved this to compile-time?"


Join the Cathedral

Theoreticians

Formalizing the principles, proving properties, establishing bounds.

Practitioners

Building libraries, writing kernels, pushing the boundaries of what compiles.

Toolsmiths

Improving compilers, creating analyzers, automating verification.

This is not about adopting a library or framework. This is about how we think about performance-critical systems. When you stare at your template metaprogramming code and see not complexity but crystallized knowledge — you're thinking Cathedral. When your compiler errors carry the exact values that caused them and you fix the bug without running a single line — you're thinking Cathedral.

Section X

The Cathedral Covenant

We, the cathedral builders, commit to:

  1. Encode knowledge in types — What can be proven shall be proven.
  2. Eliminate runtime uncertainty — What can be decided shall be decided at compile-time.
  3. Guarantee zero cost — Abstractions shall have provably zero overhead.
  4. Prove correctness — Type systems shall verify properties, not just prevent crashes.
  5. Share knowledge freely — Cathedral techniques shall be open, documented, teachable.
  6. Minimize leaf cardinality — Compute liberally at the top, instantiate conservatively at the leaves.
  7. Diagnose with precision — Compile-time errors shall carry the values that caused them, not strings that describe them.
The Cathedral Architect's Creed

"Give me the complete design, and I will move the computation to compile-time."

Appendix A

Cathedral Reading List

Foundational Theory

Stroustrup, B. — The Design and Evolution of C++
Alexandrescu, A. — Modern C++ Design
Abrahams, D. & Gurtovoy, A. — C++ Template Metaprogramming

Performance Engineering

Warren, H. S. — Hacker's Delight
Fog, A. — Optimizing Software in C++
Lemire, D. — Fast Random Integer Generation

Type Theory

Pierce, B. C. — Types and Programming Languages
Martin-Löf, P. — Intuitionistic Type Theory

The Cathedral Whitepapers

Compile-Time Division Elimination for Zero-Overhead Tensor Indexing in CUDA Kernels
Order-Agnostic Constexpr Configuration: A Type-Routed Compile-Time Parameter System with Enforced Uniqueness
More to come...

Like the physicists at Los Alamos who changed the world through complete understanding crystallized into precise execution, we build systems where every optimization is intentional, every guarantee is proven, every abstraction is free, and every error is a typed, value-carrying proof of exactly what went wrong.

Additional Credit: To Claude (Sonnet), for the conversation he and I had on Jun 7 2025, in which we realized that a transformer has at most 2 runtime-mutable dimensions, hence the birth of Cathedral Architecture.