Rust + CUDA, nothing else
The entire stack — weights loading, paged KV cache, schedulers, kernels — is built from the ground up in Rust and CUDA, with Triton AOT and FlashInfer kernels compiled at build time. Python is build-time only.
Rust + CUDA, nothing else
The entire stack — weights loading, paged KV cache, schedulers, kernels — is built from the ground up in Rust and CUDA, with Triton AOT and FlashInfer kernels compiled at build time. Python is build-time only.
OpenAI-compatible API
Serves a /v1/completions endpoint with streaming. Point any OpenAI SDK
or curl at it and start generating.
One engine per model
No universal model abstraction. Each model line owns its scheduler, kernel plan, and execution path — full attention, hybrid linear attention, MLA, and MoE with expert parallelism.
CUDA Graph decode
The decode path is captured as a CUDA graph with pre-allocated buffers, keeping per-token overhead low and decode latency flat across context lengths.
| Model | Architecture |
|---|---|
| Qwen3-4B / 8B | Full attention, tensor parallel |
| Qwen3.5-4B | Hybrid: 24 linear + 8 full attention layers |
| DeepSeek-V4 | MoE + compressor + indexer, 8-GPU |
| DeepSeek-V2-Lite | MoE + expert parallelism, 2-GPU |
| Kimi-K2 | MLA + MoE + Marlin INT4, 8-GPU expert parallelism |