KeiLab — deep-tech research lab

Why from-scratch

Most LLM training stacks are assembled from vendored parts: PyTorch, Megatron, DeepSpeed, and a cloud abstraction layer. Each link is a mathematical black box, and the deployment story is "rent a hyperscaler".

The KeiLab stack is written bottom-up: six hand-rolled gradient kernels, each verified by finite differences and parity-checked against a CPU reference layer-backward. The same code runs across the entire fleet (Blackwell, Hopper, ARM-DGX), with no vendor lock-in and no hidden randomness in the toolchain.
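
As a minimal illustration of what finite-difference verification means for one of these kernels, the sketch below checks an analytic RMSNorm backward against central differences in plain Python. It is not KeiLab's kernel code: the function names, the `eps` value, the step size `h`, and the toy vector size are all assumptions chosen for the example.

```python
import math
import random


def rmsnorm_forward(x, gain, eps=1e-6):
    """y_i = gain_i * x_i / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def rmsnorm_backward(x, gain, grad_out, eps=1e-6):
    """Analytic gradients of sum(grad_out * y) with respect to x and gain."""
    n = len(x)
    rms = math.sqrt(sum(v * v for v in x) / n + eps)
    # Shared term that comes from differentiating through the rms denominator.
    dot = sum(go * g * v for go, g, v in zip(grad_out, gain, x))
    grad_x = [go * g / rms - v * dot / (n * rms ** 3)
              for go, g, v in zip(grad_out, gain, x)]
    grad_gain = [go * v / rms for go, v in zip(grad_out, x)]
    return grad_x, grad_gain


def finite_difference(f, params, h=1e-5):
    """Central-difference estimate of df/dparams for a scalar function f."""
    grads = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += h
        minus[i] -= h
        grads.append((f(plus) - f(minus)) / (2 * h))
    return grads


def max_abs_error(a, b):
    return max(abs(p - q) for p, q in zip(a, b))


def check_rmsnorm(n=8, seed=0):
    """Return the worst disagreement between analytic and numeric gradients."""
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in range(n)]
    gain = [rng.uniform(0.5, 1.5) for _ in range(n)]
    grad_out = [rng.uniform(-1, 1) for _ in range(n)]  # fixed upstream gradient

    def loss_of_x(xs):
        return sum(go * y for go, y in zip(grad_out, rmsnorm_forward(xs, gain)))

    def loss_of_gain(gs):
        return sum(go * y for go, y in zip(grad_out, rmsnorm_forward(x, gs)))

    grad_x, grad_gain = rmsnorm_backward(x, gain, grad_out)
    err_x = max_abs_error(grad_x, finite_difference(loss_of_x, x))
    err_gain = max_abs_error(grad_gain, finite_difference(loss_of_gain, gain))
    return err_x, err_gain
```

Calling `check_rmsnorm()` returns the largest absolute disagreement for the input and gain gradients. The same pattern extends to a parity check, with a CPU reference gradient taking the place of the finite-difference estimate.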

Hardware actually exercised

NVIDIA B300 (Blackwell, SXM6) — heaviest tier, largest per-device memory budget. Used for full-depth runs where activations and optimizer state would squeeze a smaller card.
NVIDIA H200 (Hopper) — primary single-GPU tier for domain-injection fine-tunes; comfortably holds the model with QLoRA and a real batch.
NVIDIA H100 (Hopper) — multi-GPU DDP tier when wall-clock pressure justifies the cost multiplier (per the lab's hardware-right-sizing rule).
2× NVIDIA GB10 (ARM-DGX workstations) — local fleet. Unified-memory ARM silicon, used for inference, KV-streaming, GPU kernel development, and zero-cost experimentation outside the cloud cost budget.

What is built

Six gradient kernels on GPU — matrix-multiply, RMSNorm, SiLU / SwiGLU, RoPE, softmax cross-entropy, scaled-dot-product attention.
GPU layer-backward orchestrating all six — parity-validated against the CPU reference.
GPU forward path with paged key-value cache, saved-activations for backward, and FlashAttention-2 wired for long context.
GPU lm-head — softmax cross-entropy + argmax fused in a single launch.
Layer-streaming forward — full 64-layer Qwen3-32B runs by holding one layer at a time. Depth-independent peak RAM, lossless against bulk forward.
End-to-end smoke on real Qwen3-32B weights with a real next-token CE objective — loss decreasing monotonically under gradient clipping across the fleet.
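
The layer-streaming item above can be sketched in a few lines. The toy below is a stand-in built on assumptions, not the lab's forward path: layers are tiny residual `tanh` blocks stored as JSON files, and `bulk_forward`, `streaming_forward`, and the other names are hypothetical. It shows why holding one layer at a time keeps peak weight memory independent of depth while producing the same output as loading every layer first.

```python
import json
import math
import os
import random
import tempfile


def apply_layer(weights, x):
    """Toy residual layer: x + tanh(W @ x)."""
    return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
            for xi, row in zip(x, weights)]


def save_layers(directory, num_layers, width, seed=0):
    """Write one small weight matrix per layer to disk and return the paths."""
    rng = random.Random(seed)
    paths = []
    for i in range(num_layers):
        weights = [[rng.gauss(0, 0.1) for _ in range(width)] for _ in range(width)]
        path = os.path.join(directory, f"layer_{i:03d}.json")
        with open(path, "w") as f:
            json.dump(weights, f)
        paths.append(path)
    return paths


def load_layer(path):
    with open(path) as f:
        return json.load(f)


def bulk_forward(paths, x):
    """Load every layer first: resident weights grow with depth."""
    layers = [load_layer(p) for p in paths]
    for weights in layers:
        x = apply_layer(weights, x)
    return x


def streaming_forward(paths, x):
    """Load, apply, and drop one layer at a time: resident weights stay at one layer."""
    for p in paths:
        weights = load_layer(p)
        x = apply_layer(weights, x)
        del weights  # release the reference before the next layer is loaded
    return x


def demo(num_layers=64, width=4):
    with tempfile.TemporaryDirectory() as directory:
        paths = save_layers(directory, num_layers, width)
        x0 = [0.1 * (i + 1) for i in range(width)]
        return bulk_forward(paths, list(x0)), streaming_forward(paths, list(x0))
```

`demo()` returns both outputs; they compare equal because the two paths apply identical arithmetic in the same order, and only the number of layers resident at once differs.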

Operating discipline

Every cloud launch goes through a pre-flight memory and cost gate before the meter starts — measured per-GPU budget, dry-run loss, single-GPU-first sizing — under the lab's standing rules. Launches are orchestrated end-to-end (spawn → upload → train → poll → download → terminate) so there are no orphan billed pods.
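
A gate-then-lifecycle flow of this kind might look like the sketch below. Everything in it is hypothetical: `LaunchPlan`, `preflight`, `RecordingPod`, the 0.9 memory headroom, and the two checks shown are illustrative assumptions, and the pod class only records calls rather than talking to any cloud API. The point is the shape: the gate runs before anything is spawned, and `terminate` sits in a `finally` block so a failed step cannot leave a billed pod behind.

```python
from dataclasses import dataclass


@dataclass
class LaunchPlan:
    """Hypothetical pre-flight inputs; real values would come from measurement."""
    est_memory_gb: float     # measured per-GPU peak for the planned run
    gpu_memory_gb: float     # capacity of the target card
    est_hours: float         # projected wall-clock time
    hourly_rate_usd: float   # price of the pod per hour
    budget_usd: float        # spending cap for this run
    headroom: float = 0.9    # assumed usable fraction of GPU memory


def preflight(plan):
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    if plan.est_memory_gb > plan.gpu_memory_gb * plan.headroom:
        failures.append("memory")
    if plan.est_hours * plan.hourly_rate_usd > plan.budget_usd:
        failures.append("cost")
    return failures


class RecordingPod:
    """Stand-in for a cloud pod: records each lifecycle call, optionally failing one."""

    def __init__(self, fail_at=None):
        self.fail_at = fail_at
        self.events = []

    def step(self, name):
        self.events.append(name)
        if name == self.fail_at:
            raise RuntimeError(f"{name} failed")


def run_launch(plan, pod):
    """Gate first, then run the lifecycle; terminate runs even if a step raises."""
    failures = preflight(plan)
    if failures:
        raise ValueError("pre-flight gate rejected launch: " + ", ".join(failures))
    pod.step("spawn")  # billing would start here
    try:
        for name in ("upload", "train", "poll", "download"):
            pod.step(name)
    finally:
        pod.step("terminate")  # always release the pod
```

With a plan that passes the gate, `run_launch(plan, RecordingPod(fail_at="train"))` raises the training error, but the pod's recorded events still end with `terminate`.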

Adapter transfer for runs that touch internal-IP corpora is rsync-direct, never through a public model hub. The audit trail is preserved per run.
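
One way to picture a direct transfer with a per-run record is the sketch below. It is an assumption about shape only: `rsync_command` and `audit_entry` are hypothetical helpers, the choice of rsync options is illustrative, and the command is built as an argument list rather than executed. The audit entry hashes the adapter file with SHA-256 so a record can be tied to exactly the bytes that were moved.

```python
import hashlib


def rsync_command(source_dir, dest_host, dest_dir):
    """Build an rsync-over-ssh argument list (not executed here)."""
    return [
        "rsync",
        "-a",          # archive mode: recurse, keep permissions and timestamps
        "--partial",   # keep partially transferred files so a retry can resume
        "--checksum",  # decide what to send by checksum rather than size and mtime
        "-e", "ssh",   # use ssh as the transport
        source_dir.rstrip("/") + "/",
        f"{dest_host}:{dest_dir}",
    ]


def audit_entry(run_id, adapter_path):
    """Hash the adapter file so the transfer can be tied to a specific run."""
    digest = hashlib.sha256()
    with open(adapter_path, "rb") as f:
        # Read in 1 MiB chunks so large adapter files are never fully in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {"run_id": run_id, "file": adapter_path, "sha256": digest.hexdigest()}
```

The returned list can be passed to `subprocess.run(..., check=True)` once the host and paths are real, and the dictionary from `audit_entry` can be appended to whatever log the run already keeps.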

Honesty

Production-scale training at full depth still requires streaming offload and grad-clip discipline. The kernels are proven; integrating them into a continuous training loop is bounded engineering, not a missing capability. Partners can run the kernels in their own environment to verify before committing.

Cross-references

→ MAX (language used to write the kernels)
→ KeiSeiKit runtime
→ KeiSei OS
@ About the lab