The Era of the Bazaar is Over.
Current LLM inference frameworks are built on a "Bazaar" philosophy: layers of abstraction, fragmented kernel launches, and a 10x "Latency Tax" paid to the Python runtime. They treat the GPU as a collection of disjointed math functions. We treat it as a Single, Synchronous Bit-Stream.
The Architecture of Necessity
Nihilus is a bare-metal C++20 inference engine synthesized from first principles. It does not "wrap" CUDA; it occupies it. By collapsing the entire Transformer forward pass—from embedding lookup to logit sampling—into a Single, Monolithic Mega-Kernel, we have eliminated the physical boundaries that slow down the state-of-the-art.
Technical Invariants
- Device-Side Model Instantiation: We resolve the entirety of the model's execution metadata into a trivially copyable struct and blast it into __constant__ memory once before every forward pass.
- Static Hardware Synthesis: Using the CAFBERIHT pattern, the model's topology is resolved at compile-time.
- Coordinate Projection: Every tensor index is a Direct Projection onto the memory arena via pre-calculated magic multipliers, eliminating integer division across the entire grid.
Performance as a Moral Imperative
- 13ms Metadata Scan: Map the "Whole Fucker" (405B-FP16) and validate the HBM arena in less time than a single frame of a 60 FPS game.
- Line-Rate Throughput: If the H100 bus can move the data, Nihilus is already there waiting to compute it.
- Zero Warm-up: No JIT, no "tracing," no "tuning." The binary is the optimized execution plan.