conviction · 01From scratch is the point.
From-scratch autograd, scheduler, radix prefix cache, paged KV, CUDA graphs, MLX bridge. The concepts you’ve only met in papers are here as small Rust files — read one crate, own one concept.
The whole modern inference stack — continuous batching, radix prefix cache, paged KV, CUDA graphs, speculative decode — hand-built in Rust, small enough to read in a weekend.
$ arle --doctor cuda ok # nvidia-smi · cuda 12.x · ampere+ metal beta # apple m-series detected cpu ok # dev-only smoke path model ok # Qwen3-4B reachable api ok # /v1/chat/completions · streaming $ arle serve --backend cuda --model Qwen3-4B listening on http://0.0.0.0:8000 · ready in 1.4s
Not a wrapper, not a binding, not a fork. Every layer the big engines hide behind a pip install exists here as a few hundred lines you can step through — and a readable trail of why each one looks the way it does.
From-scratch autograd, scheduler, radix prefix cache, paged KV, CUDA graphs, MLX bridge. The concepts you’ve only met in papers are here as small Rust files — read one crate, own one concept.
Dead ends stay in the repo, marked KILL with the measurement that killed them. You inherit the measurements, not just the conclusions — months of A/B work, free to read.
Every benchmark is a dated snapshot in docs/experience/wins/ with env, params, and regressions. Nothing is curated. “Fast” is a number with a date, or it’s nothing.
No Python at serving time, no sidecar processes, no config sprawl. arle is the only binary the workspace builds — clone to first token in minutes, on a GPU box or a MacBook.
Post-cutover (2026-06-04) the monolithic infer crate is gone. The runtime is a device-neutral crate graph — dependencies flow strictly downward, infer-core carries no backend dependency, and backends plug in at the front door. Canonical topology lives in docs/codebase-map.md.
src/main.rs InferenceEngine · LoadedInferenceEngine · backends plug in here Engine<E,K> · scheduler · radix prefix BackendExecutor + KvPool seam · ForwardPlan IR — host-only One runnable line per platform. Pre-built tarballs and SHAs on each GitHub Release; the curl installer verifies SHA256 before extracting.
$ brew install cklxx/tap/arle $ arle --doctor
$ curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh \ | sh $ arle --doctor
$ docker run --rm --gpus all -p 8000:8000 \
-v /path/to/Qwen3-4B:/model:ro \
ghcr.io/cklxx/arle:latest \
serve --backend cuda --model-path /model $ git clone https://github.com/cklxx/arle && cd arle $ cargo build --release --features cuda --bin arle # cli is default-on; cpu: --no-default-features --features cpu,no-cuda
Dated, reproducible snapshots straight from docs/experience/wins/. Numbers come out of scripts/bench_guidellm.sh and the canonical step-driver smokes — nothing is curated.
cuda · 8×H20 TP=8 / EP=8 · DeepSeek-V4-Flash FP8 · B=1, official FlashMLA / DSA / DeepGEMM + MTP · 256K boots, needle-exact @230K
scripts/dsv4_lever_gate.sh snapshot ↗ cuda · RTX 4070 Ti SUPER 16GB · Qwen3-0.6B real-checkpoint OPD step · 32-commit session, kill-or-license-gated
cargo run -p train --example opd_step_cuda_realckpt_train --release --features cuda -- --lr 1e-7 --steps 5000 snapshot ↗ metal · Apple M4 Pro 48GB · Qwen3.6-35B-A3B 4-bit MLX · HTTP serve, streaming /v1/completions
arle serve --backend metal --model-path mlx-community/Qwen3.6-35B-A3B-4bit --port 8010 snapshot ↗ cuda · NVIDIA L4 · Qwen3-4B · BF16 + FP8 paged KV (auto) · c=16
scripts/bench_guidellm.sh cuda-l4-hbm-tier-fp8-auto snapshot ↗ metal · Apple M4 Pro · Qwen3.5-0.8B Q4_K_M · GGUF decode
arle serve --backend metal --model-path Qwen3.5-0.8B-Q4_K_M.gguf snapshot ↗ Three backends, one runtime contract. Authoritative truth lives in docs/support-matrix.md.
| backend | stability | os / hardware | models | quants | api |
|---|---|---|---|---|---|
cuda | stable | Linux + NVIDIA Ampere+ | Qwen3 / Qwen3.5 · DeepSeek-V4-Flash (8×H20) | FP16 / BF16 · FP8 KV (auto) · GGUF Q4_K | OpenAI v1 |
metal | beta | Apple Silicon (M1+) | Qwen3.5 · Qwen3.6 MoE (canonical) | FP16 / BF16 · MLX 4-bit · GGUF Q4_K | OpenAI v1 |
cpu | dev only | portable smoke | Qwen3 / Qwen3.5 (small) | FP16 / BF16 | OpenAI v1 |
No queue, no committee — a weekend PR here can move a headline number, and the battlefields are public: the serial phase plan lives in ROADMAP.md with one tracked issue per front. Start with CONTRIBUTING.md, not the maintainer plans tree.
Stars are the only metric a solo project has. If this repo saved you a read of someone else’s CUDA — or just proved it can be done in Rust — leave one. It decides how much time this gets.
★ Star cklxx/arleThe repo at a glance. Everything links back to canonical paths in cklxx/arle.