NIHILUS

THE CATHEDRAL OF INFERENCE

The Era of the Bazaar is Over.
Current LLM inference frameworks are built on a "Bazaar" philosophy: layers of abstraction, fragmented kernel launches, and a 10x "Latency Tax" paid to the Python runtime. They treat the GPU as a collection of disjointed math functions. We treat it as a Single, Synchronous Bit-Stream.

The Architecture of Necessity

Nihilus is a bare-metal C++20 inference engine synthesized from first principles. It does not "wrap" CUDA; it occupies it. By collapsing the entire Transformer forward pass—from embedding egress to logit sampling—into a Single, Monolithic Mega-Kernel, we have eliminated the physical boundaries that slow down the state-of-the-art.

Technical Invariants

  • Device-Side Model Instantiation

    We resolve the model's entire execution metadata into a trivially copyable struct and blast it into __constant__ memory once, before each forward pass.
  • Static Hardware Synthesis

    Using the CAFBERIHT pattern, the model's topology is resolved at compile time.
  • Coordinate Projection

    Every tensor index is a Direct Projection onto the memory arena via pre-calculated magic multipliers, eliminating integer division across the entire grid.
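The first two invariants can be sketched in plain C++20. The struct and template names below are illustrative, not Nihilus's actual API: the point is that a topology resolved entirely at compile time is a trivially copyable value, which is exactly the property required to push it byte-for-byte into __constant__ memory (via cudaMemcpyToSymbol on a device build).

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical topology descriptor -- field names are illustrative.
struct model_topology {
    uint32_t layer_count;
    uint32_t embedding_dim;
    uint32_t head_count;
    uint32_t head_dim;
};

// Resolve the topology from template parameters; invalid shapes fail the build.
template <uint32_t Layers, uint32_t Embed, uint32_t Heads>
constexpr model_topology make_topology() {
    static_assert(Embed % Heads == 0, "embedding dim must divide evenly across heads");
    return { Layers, Embed, Heads, Embed / Heads };
}

// Trivially copyable => safe to memcpy into __constant__ memory as raw bytes.
static_assert(std::is_trivially_copyable_v<model_topology>);

// Example: an 8B-class topology, fully resolved before the program ever runs.
inline constexpr model_topology llama_8b = make_topology<32, 4096, 32>();
static_assert(llama_8b.head_dim == 128);
```

Because every field is a compile-time constant, there is nothing to "trace" or "tune" at runtime: the binary already encodes the shape of the network.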
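The coordinate-projection trick is the classic divide-by-invariant-multiply technique (Granlund-Montgomery / Lemire style): precompute one "magic multiplier" per tensor dimension on the host, then replace every integer divide in the grid with a multiply and a shift. A minimal host-side sketch, assuming 32-bit indices, a divisor greater than 1, and a compiler with the `unsigned __int128` extension (GCC/Clang):

```cpp
#include <cstdint>

// One precomputed magic multiplier per tensor dimension.
struct magic_divisor {
    uint64_t multiplier;
    uint32_t divisor;
};

// M = ceil(2^64 / d). Exact for any 32-bit numerator; requires d > 1.
constexpr magic_divisor make_magic(uint32_t d) {
    return { UINT64_MAX / d + 1, d };
}

// floor(n / d) as the high 64 bits of a 64x64 -> 128-bit multiply:
// no division instruction in the hot path.
constexpr uint32_t fast_div(uint32_t n, magic_divisor m) {
    return static_cast<uint32_t>(
        (static_cast<unsigned __int128>(m.multiplier) * n) >> 64);
}

constexpr uint32_t fast_mod(uint32_t n, magic_divisor m) {
    return n - fast_div(n, m) * m.divisor;
}

// Project a flat grid index onto (row, col) of a row-major tensor.
struct coord { uint32_t row, col; };

constexpr coord project(uint32_t flat, magic_divisor cols) {
    return { fast_div(flat, cols), fast_mod(flat, cols) };
}
```

On the device the same arithmetic runs per-thread; since the multipliers are baked in before launch, every index is a pure multiply-shift projection onto the arena.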

Performance as a Moral Imperative

13ms Metadata Scan: Map the "Whole Fucker" (405B-FP16) and validate the HBM arena in less time than a single frame of a 60 FPS game (16.7 ms).
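A scan this fast is only possible if the file is never actually read end-to-end: memory-map it read-only and the OS faults in just the handful of pages the header touches, regardless of total weight size. A minimal POSIX sketch, with an illustrative header layout (not the real on-disk format):

```cpp
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Illustrative header layout -- not the real weight-file format.
struct file_header {
    uint32_t magic;
    uint32_t version;
    uint64_t tensor_count;
};

// Helper (for demos/tests): write a header to disk.
bool write_header(const char* path, file_header h) {
    FILE* f = fopen(path, "wb");
    if (!f) return false;
    bool ok = fwrite(&h, sizeof h, 1, f) == 1;
    fclose(f);
    return ok;
}

// Map the file read-only and read only the header pages.
// Returns a zeroed header on failure.
file_header scan_metadata(const char* path) {
    file_header out{};
    int fd = open(path, O_RDONLY);
    if (fd < 0) return out;
    struct stat st{};
    if (fstat(fd, &st) == 0 && st.st_size >= (off_t)sizeof(file_header)) {
        void* base = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base != MAP_FAILED) {
            out = *static_cast<const file_header*>(base);
            munmap(base, (size_t)st.st_size);
        }
    }
    close(fd);
    return out;
}
```

The same mapping can then back the HBM arena directly, so validation and upload planning happen against pages the kernel has already paged in.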

Line-Rate Throughput: If the H100 bus can move the data, Nihilus is already there waiting to compute it.

Zero Warm-up: No JIT, no "tracing," no "tuning." The binary is the optimized execution plan.

API ACCESS

Per-token inference with sub-millisecond overhead.

EXECUTIVE DEMO

For Distinguished Engineers at Hyperscale organizations.

LICENSING

Enterprise-grade HBM-saturation for private clusters.