Getting Started

openinfer builds from source with Cargo. Python is needed once at build time for Triton AOT kernel compilation — the running server has no Python dependency.

  • Rust (2024 edition)
  • CUDA Toolkit (nvcc, cuBLAS) and a CUDA-capable GPU
  • NVIDIA driver R535 (CUDA 12.2) or newer
  • Python 3 + Triton (build-time only)
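
Before building, it can help to confirm that the tools listed above are on PATH and that the driver meets the R535 minimum. The following is a minimal preflight sketch and not part of openinfer: the function names are hypothetical, and it assumes the driver version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
import re
import shutil
import subprocess

MIN_DRIVER_MAJOR = 535  # R535, the minimum driver branch from the requirements list


def driver_ok(version):
    """Return True if a driver string such as '535.104.05' is R535 or newer."""
    match = re.match(r"\s*(\d+)", version)
    return bool(match) and int(match.group(1)) >= MIN_DRIVER_MAJOR


def find_tools(names=("cargo", "nvcc", "nvidia-smi", "python3")):
    """Map each required tool name to its path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


def installed_driver_version():
    """Ask nvidia-smi for the driver version; return None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    # With several GPUs there is one line per device; the first is enough here.
    return result.stdout.splitlines()[0].strip()
```

Any tool that `find_tools()` maps to `None` is missing from PATH, and `driver_ok(installed_driver_version() or "")` reports whether the installed driver is recent enough.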
Clone the repository, set up the build-time Python environment, download a model, and start the server:

git clone https://github.com/openinfer-project/openinfer
cd openinfer
# One-time Python setup for Triton AOT kernel compilation
uv venv && source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/cu128
# Download a model
huggingface-cli download Qwen/Qwen3-4B --local-dir models/Qwen3-4B
# Build & start the server on port 8000
export CUDA_HOME=/usr/local/cuda
export OPENINFER_TRITON_PYTHON=.venv/bin/python
cargo run --release -- --model-path models/Qwen3-4B

Always build with --release — debug builds of the CUDA paths are far too slow to be usable.

The server exposes an OpenAI-compatible /v1/completions endpoint:

curl -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "The capital of France is", "max_tokens": 32}'

Streaming:

curl -N http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"prompt": "The capital of France is", "max_tokens": 32, "stream": true}'
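
The streaming request can also be consumed from a script. The sketch below uses only the Python standard library and assumes the server follows the OpenAI streaming convention: server-sent events whose `data:` lines carry JSON chunks with a `choices[].text` field, terminated by `data: [DONE]`. That framing is inferred from the endpoint being OpenAI-compatible and is not confirmed here; the function names are hypothetical.

```python
import json
import urllib.request

URL = "http://localhost:8000/v1/completions"  # endpoint from the curl example


def iter_completion_text(lines):
    """Yield text fragments from an iterable of SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream marker
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("text")
            if text:
                yield text


def stream_completion(prompt, max_tokens=32, url=URL):
    """POST a streaming completion request and yield text as it arrives."""
    body = json.dumps(
        {"prompt": prompt, "max_tokens": max_tokens, "stream": True}
    ).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # The HTTP response object iterates line by line as bytes.
        yield from iter_completion_text(response)
```

Iterating over `stream_completion("The capital of France is")` and printing each fragment without a trailing newline shows the text as it is generated. Keeping the parser separate from the network call makes it possible to check the parsing against canned lines without a running server.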

Any OpenAI SDK works the same way — set the base URL to http://localhost:8000/v1.
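
As an illustration, here is a minimal sketch with the official `openai` Python package (installed separately with `pip install openai`). Two details are assumptions and not documented behavior: the `model` value, which the SDK requires but the curl examples omit, and the placeholder API key, which the SDK requires even if the server ignores it.

```python
BASE_URL = "http://localhost:8000/v1"  # base URL given in the text above
MODEL = "Qwen3-4B"  # assumption: hypothetical model identifier


def complete(prompt, max_tokens=32):
    """Send one completion request through the OpenAI SDK and return the text."""
    # Imported lazily so this module loads even where the SDK is not installed.
    from openai import OpenAI

    # Assumption: the local server does not validate the API key.
    client = OpenAI(base_url=BASE_URL, api_key="not-needed")
    response = client.completions.create(
        model=MODEL,
        prompt=prompt,
        max_tokens=max_tokens,
    )
    return response.choices[0].text
```

With the server running, `complete("The capital of France is")` returns the generated continuation as a string.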

Pick a model from the sidebar for model-specific launch flags, performance numbers, and architecture notes. Qwen3-4B is the most mature model line and the best place to start.