Why from-scratch
Most LLM training stacks are assembled from vendor components: PyTorch, Megatron, DeepSpeed, and a cloud abstraction layer. Each link is a black box as far as the underlying math is concerned, and the deployment story is "rent a hyperscaler".
The KeiLab stack is written bottom-up: six hand-rolled gradient kernels, each verified against finite differences and parity-checked against a CPU reference layer-backward. The same code runs across the entire fleet (Blackwell, Hopper, ARM-DGX), with no vendor lock-in and no hidden randomness in the toolchain.
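To make "finite-difference verified" concrete, here is a minimal sketch of such a check for one of the six kernels, RMSNorm, in plain Python. It is not the lab's code: the function names, the `eps` value, and the test vectors are illustrative assumptions. The analytic backward is compared against central differences of a scalar loss formed by dotting the output with a fixed upstream gradient.

```python
import math

def rmsnorm_forward(x, g, eps=1e-6):
    # y_i = g_i * x_i / r, with r = sqrt(mean(x^2) + eps)
    n = len(x)
    r = math.sqrt(sum(v * v for v in x) / n + eps)
    return [gi * xi / r for xi, gi in zip(x, g)]

def rmsnorm_backward(x, g, dy, eps=1e-6):
    # Analytic gradients of sum_j dy_j * y_j with respect to x and g.
    n = len(x)
    r = math.sqrt(sum(v * v for v in x) / n + eps)
    dot = sum(dyi * gi * xi for dyi, gi, xi in zip(dy, g, x))
    dx = [gi * dyi / r - xi * dot / (n * r ** 3)
          for xi, gi, dyi in zip(x, g, dy)]
    dg = [dyi * xi / r for xi, dyi in zip(x, dy)]
    return dx, dg

def finite_difference(f, v, h=1e-5):
    # Central difference of a scalar function f, one coordinate at a time.
    grad = []
    for i in range(len(v)):
        plus, minus = list(v), list(v)
        plus[i] += h
        minus[i] -= h
        grad.append((f(plus) - f(minus)) / (2 * h))
    return grad

def max_abs_diff(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

# Illustrative inputs (assumed values, not taken from any real run).
x = [0.5, -1.2, 2.0, 0.3]
g = [1.0, 0.8, 1.5, -0.4]
dy = [0.1, -0.7, 0.25, 1.3]

def loss_wrt_x(xv):
    return sum(d * y for d, y in zip(dy, rmsnorm_forward(xv, g)))

def loss_wrt_g(gv):
    return sum(d * y for d, y in zip(dy, rmsnorm_forward(x, gv)))

dx, dg = rmsnorm_backward(x, g, dy)
err_x = max_abs_diff(dx, finite_difference(loss_wrt_x, x))
err_g = max_abs_diff(dg, finite_difference(loss_wrt_g, g))
print("max |analytic - numeric|:", err_x, err_g)
```

The same pattern extends to the other kernels: fix an upstream gradient, reduce the output to a scalar, and compare every input gradient against its central difference.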
Hardware actually exercised
- NVIDIA B300 (Blackwell, SXM6) — heaviest tier, largest per-device memory budget. Used for full-depth runs where activations and optimizer state would squeeze a smaller card.
- NVIDIA H200 (Hopper) — primary single-GPU tier for domain-injection fine-tunes; comfortably holds the model with QLoRA and a real batch.
- NVIDIA H100 (Hopper) — multi-GPU DDP tier when wall-clock pressure justifies the cost multiplier (per the lab's hardware-right-sizing rule).
- 2× NVIDIA GB10 (ARM-DGX workstations) — local fleet. Unified-memory ARM silicon, used for inference, KV-streaming, GPU kernel development, and zero-cost experimentation outside the cloud cost budget.
What is built
- Six gradient kernels on GPU — matrix-multiply, RMSNorm, SiLU / SwiGLU, RoPE, softmax cross-entropy, scaled-dot-product attention.
- GPU layer-backward orchestrating all six — parity-validated against the CPU reference.
- GPU forward path with paged key-value cache, saved activations for the backward pass, and FlashAttention-2 wired in for long context.
- GPU lm-head — softmax cross-entropy + argmax fused in a single launch.
- Layer-streaming forward — full 64-layer Qwen3-32B runs by holding one layer at a time. Depth-independent peak RAM, lossless against bulk forward.
- End-to-end smoke on real Qwen3-32B weights with a real next-token CE objective — loss decreasing monotonically under gradient clipping across the fleet.
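The layer-streaming idea in the list above can be sketched with a toy model. The following is an illustration under stated assumptions, not the lab's implementation: `LayerStore` is a hypothetical stand-in for on-disk weights, and each "layer" is a tiny residual block rather than a transformer layer. It shows why peak residency stays at one layer regardless of depth, and why the result matches a bulk forward that holds every layer at once.

```python
import math

class LayerStore:
    """Hypothetical stand-in for on-disk weights; counts resident layers."""

    def __init__(self, num_layers, width):
        self.num_layers = num_layers
        # Deterministic toy weights so both forward paths see identical values.
        self._disk = [
            [[math.sin(0.1 * (l + 1) * (i + 2 * j + 1)) / width
              for j in range(width)]
             for i in range(width)]
            for l in range(num_layers)
        ]
        self.resident = 0
        self.peak = 0

    def load(self, l):
        self.resident += 1
        self.peak = max(self.peak, self.resident)
        return [row[:] for row in self._disk[l]]  # copy simulates a read into RAM

    def release(self):
        self.resident -= 1

def apply_layer(w, x):
    # Toy residual block: x + tanh(W x).
    return [xi + math.tanh(sum(wij * xj for wij, xj in zip(row, x)))
            for xi, row in zip(x, w)]

def bulk_forward(store, x):
    layers = [store.load(l) for l in range(store.num_layers)]  # all resident
    for w in layers:
        x = apply_layer(w, x)
    for _ in layers:
        store.release()
    return x

def streaming_forward(store, x):
    for l in range(store.num_layers):
        w = store.load(l)        # only this layer is resident
        x = apply_layer(w, x)
        del w
        store.release()          # freed before the next layer is loaded
    return x

x0 = [0.2, -0.5, 1.0, 0.7]
bulk_store = LayerStore(num_layers=8, width=4)
stream_store = LayerStore(num_layers=8, width=4)
y_bulk = bulk_forward(bulk_store, x0)
y_stream = streaming_forward(stream_store, x0)
print("peak resident layers:", bulk_store.peak, "bulk vs", stream_store.peak, "streamed")
```

Because both paths apply the same operations in the same order, the outputs are identical in this toy; the trade-off in a real system is the repeated load cost per layer.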
Operating discipline
Every cloud launch passes a pre-flight memory and cost gate before the meter starts: measured per-GPU budget, dry-run loss, and single-GPU-first sizing, under the lab's standing rules. Launches are orchestrated end-to-end (spawn → upload → train → poll → download → terminate) so that no billed pod is left orphaned.
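One way to structure such a gated, orphan-free launch is sketched below. Everything here is hypothetical: `FakeCloud` stands in for a real provider client and only records calls, and the memory and cost figures are made-up illustrative numbers. The two points it demonstrates are that the gate runs before anything is spawned, and that termination sits in a `finally` block so it runs even when a later step fails.

```python
class PreflightError(RuntimeError):
    """Raised when a launch is rejected before any resource is billed."""

class FakeCloud:
    """Hypothetical stand-in for a cloud client; records each step."""

    def __init__(self, fail_at=None):
        self.calls = []
        self.fail_at = fail_at

    def step(self, name):
        self.calls.append(name)
        if name == self.fail_at:
            raise RuntimeError(f"{name} failed")

def preflight(need_gb, budget_gb, est_cost, cost_cap):
    # Memory and cost gate: reject before the meter starts.
    if need_gb > budget_gb:
        raise PreflightError(f"needs {need_gb} GB, budget is {budget_gb} GB")
    if est_cost > cost_cap:
        raise PreflightError(f"estimated cost {est_cost} exceeds cap {cost_cap}")

def launch(cloud, need_gb, budget_gb, est_cost, cost_cap):
    preflight(need_gb, budget_gb, est_cost, cost_cap)
    cloud.step("spawn")
    try:
        for name in ("upload", "train", "poll", "download"):
            cloud.step(name)
    finally:
        cloud.step("terminate")  # always runs once a pod exists
    return cloud.calls

# Happy path: all six steps in order.
ok = FakeCloud()
launch(ok, need_gb=60.0, budget_gb=80.0, est_cost=40.0, cost_cap=100.0)

# Training crashes: the pod is still terminated.
crashed = FakeCloud(fail_at="train")
try:
    launch(crashed, need_gb=60.0, budget_gb=80.0, est_cost=40.0, cost_cap=100.0)
except RuntimeError:
    pass

# Gate rejects the job: nothing is ever spawned.
blocked = FakeCloud()
try:
    launch(blocked, need_gb=200.0, budget_gb=80.0, est_cost=40.0, cost_cap=100.0)
except PreflightError:
    pass

print(ok.calls, crashed.calls, blocked.calls, sep="\n")
```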
Adapter transfer for runs that touch internal-IP corpora is rsync-direct, never through a public model hub. The audit trail is preserved per run.
Honesty
Combat-scale training at full depth still requires streaming offload and grad-clip discipline. The kernels are proven; integration into a continuous training loop is bounded engineering — not a missing capability. Partners can run the kernels in their own environment to verify before committing.
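For readers unfamiliar with the term, gradient clipping in its common global-norm form can be sketched as follows. Whether the lab clips by global norm or by some other rule is not stated here, so treat this as a generic illustration; the function name and threshold are assumptions.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # grads is a list of flat gradient lists, one per parameter tensor.
    total = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Scale down only when the joint norm exceeds the threshold.
    scale = min(1.0, max_norm / (total + 1e-12))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total

big = [[3.0, 4.0], [12.0]]                      # joint norm is 13
clipped_big, norm_big = clip_by_global_norm(big, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for t in clipped_big for g in t))

small = [[0.3, 0.4]]                            # joint norm is 0.5, under the threshold
clipped_small, norm_small = clip_by_global_norm(small, max_norm=1.0)
```

Scaling every tensor by the same factor preserves the direction of the update while bounding its size, which is what keeps a single bad batch from destabilizing a run.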