A runtime-first Rust workspace. `infer` serves OpenAI-compatible traffic on CUDA, Metal, and CPU; `arle` is the unified front door for `run`, `serve`, `train`, and `data` flows.
```
$ arle --doctor
cuda   ok    # nvidia-smi · cuda 12.x · ampere+
metal  beta  # apple m-series detected
cpu    ok    # dev-only smoke path
model  ok    # Qwen3-4B reachable
api    ok    # /v1/chat/completions · streaming

$ arle serve --backend cuda --model Qwen3-4B
listening on http://0.0.0.0:8000 · ready in 1.4s
```
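Once the server is up, it takes standard OpenAI chat-completions requests. A minimal streaming call against the transcript above (port and model name taken from that run; the JSON body is the stock OpenAI v1 shape):

```
$ curl -N http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen3-4B",
          "stream": true,
          "messages": [{"role": "user", "content": "Say hello in one sentence."}]
        }'
```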
One runnable line per platform. Pre-built tarballs and SHA256 checksums ship with each GitHub Release; the curl installer verifies the SHA256 before extracting.
**Homebrew**

```
$ brew install cklxx/tap/arle
$ arle --doctor
```
**curl installer**

```
$ curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh
$ arle --doctor
```
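Prefer to check the artifact yourself? A manual verification sketch (the tarball and checksum file names are illustrative; use the actual asset names from the Release page):

```
$ curl -fsSLO https://github.com/cklxx/arle/releases/latest/download/arle-x86_64-linux.tar.gz
$ curl -fsSLO https://github.com/cklxx/arle/releases/latest/download/arle-x86_64-linux.tar.gz.sha256
$ sha256sum -c arle-x86_64-linux.tar.gz.sha256 && tar -xzf arle-x86_64-linux.tar.gz
```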
**Docker**

```
$ docker run --rm --gpus all -p 8000:8000 \
    -v /path/to/Qwen3-4B:/model:ro \
    ghcr.io/cklxx/arle:latest \
    serve --backend cuda --model-path /model
```

**From source**

```
$ git clone https://github.com/cklxx/arle && cd arle
$ cargo install --path crates/cli --features cuda  # --features cuda is opt-in; cpu builds out of the box
```
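Because the cpu backend builds without any feature flags, a from-source dev smoke can be as small as this sketch (the model name is illustrative; the support matrix below keeps cpu to small checkpoints):

```
$ cargo install --path crates/cli        # no --features flag: cpu path builds out of the box
$ arle serve --backend cpu --model Qwen3.5-0.8B
```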
Dated, reproducible snapshots straight from docs/experience/wins/. Numbers come from scripts/bench_guidellm.sh and the canonical step-driver smokes; nothing is hand-curated.
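To reproduce a card locally, run the script with its profile name and look for the dated snapshot; a sketch (it assumes, per the text above, that snapshots land under docs/experience/wins/):

```
$ scripts/bench_guidellm.sh cuda-l4-hbm-tier-fp8-auto
$ ls docs/experience/wins/               # dated snapshot files
```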
- **cuda** · NVIDIA L4 · Qwen3-4B · BF16 + FP8 paged KV (auto) · c=16
  `scripts/bench_guidellm.sh cuda-l4-hbm-tier-fp8-auto` · snapshot in docs/experience/wins/
- **metal** · Apple M4 Pro · Qwen3.5-0.8B Q4_K_M · GGUF decode
  `metal_bench --model Qwen3.5-0.8B-Q4_K_M.gguf` · snapshot in docs/experience/wins/

Three backends, one runtime contract. Authoritative truth lives in docs/support-matrix.md.
| backend | stability | os / hardware | models | quants | api |
|---|---|---|---|---|---|
| cuda | stable | Linux + NVIDIA Ampere+ | Qwen3 / Qwen3.5 | FP16 / BF16, GGUF Q4_K | OpenAI v1 |
| metal | beta | Apple Silicon (M1+) | Qwen3 / Qwen3.5 | FP16 / BF16, dense GGUF | OpenAI v1 |
| cpu | dev only | portable smoke | Qwen3 / Qwen3.5 (small) | FP16 / BF16 | OpenAI v1 |
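The contract part is visible from the CLI: the same `serve` surface drives all three rows (flags from the transcripts above; model names and the GGUF path are illustrative):

```
$ arle serve --backend cuda  --model Qwen3-4B
$ arle serve --backend metal --model-path ./Qwen3.5-0.8B-Q4_K_M.gguf
$ arle serve --backend cpu   --model Qwen3.5-0.8B
```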
The repo at a glance. Everything links back to canonical paths in cklxx/arle.