arle(1)
one person · pure rust · from scratch · in public
rle

Inference stops
being magic.

The whole modern inference stack — continuous batching, radix prefix cache, paged KV, CUDA graphs, speculative decode — hand-built in Rust, small enough to read in a weekend.

cuda stable · ampere+ metal beta · apple silicon cpu dev only api openai · v1 release v0.1.5 · 2026-04-28
arle — bash ~/projects/arle
$ arle --doctor
cuda    ok    # nvidia-smi · cuda 12.x · ampere+
metal   beta  # apple m-series detected
cpu     ok    # dev-only smoke path
model   ok    # Qwen3-4B reachable
api     ok    # /v1/chat/completions · streaming

$ arle serve --backend cuda --model Qwen3-4B
listening on http://0.0.0.0:8000  · ready in 1.4s

Why this exists

Not a wrapper, not a binding, not a fork. Every layer the big engines hide behind a pip install exists here as a few hundred lines you can step through — and a readable trail of why each one looks the way it does.

conviction · 01From scratch is the point.

From-scratch autograd, scheduler, radix prefix cache, paged KV, CUDA graphs, MLX bridge. The concepts you’ve only met in papers are here as small Rust files — read one crate, own one concept.

conviction · 02Kills are documented.

Dead ends stay in the repo, marked KILL with the measurement that killed them. You inherit the measurements, not just the conclusions — months of A/B work, free to read.

conviction · 03Numbers are dated.

Every benchmark is a dated snapshot in docs/experience/wins/ with env, params, and regressions. Nothing is curated. “Fast” is a number with a date, or it’s nothing.

conviction · 04One binary, no glue.

No Python at serving time, no sidecar processes, no config sprawl. arle is the only binary the workspace builds — clone to first token in minutes, on a GPU box or a MacBook.

Architecture

Post-cutover (2026-06-04) the monolithic infer crate is gone. The runtime is a device-neutral crate graph — dependencies flow strictly downward, infer-core carries no backend dependency, and backends plug in at the front door. Canonical topology lives in docs/codebase-map.md.

bin
arle the only binary the workspace builds · src/main.rs
control plane
cliagentchattools REPL · session loop · protocol · sandboxed tools
front door
infer-api InferenceEngine · LoadedInferenceEngine · backends plug in here
server · core
infer-serverinfer-core OpenAI v1 facade (axum) · Engine<E,K> · scheduler · radix prefix
seam · ir
infer-seaminfer-plan BackendExecutor + KvPool seam · ForwardPlan IR — host-only
backends
infer-cudainfer-metal feature-gated · metal’s host KV pool doubles as the cpu smoke path
kernels
cuda-kernelsmlx-syskv-native-sys CUDA C / TileLang · MLX C++ bridge · KV persistence
pure leaves · infer-topo · infer-moe · infer-utilspecs · qwen3 · qwen35 · deepseekffi · deepep-sys · xgrammar-systrain · autograd + train — OPD-only since 2026-05-18

Install

One runnable line per platform. Pre-built tarballs and SHAs on each GitHub Release; the curl installer verifies SHA256 before extracting.

Apple Silicon · Homebrew zsh / bash
$ brew install cklxx/tap/arle
$ arle --doctor
Linux x86_64 / macOS · curl sh-compatible
$ curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh \
    | sh
$ arle --doctor
CUDA · GPU container docker / nvidia
$ docker run --rm --gpus all -p 8000:8000 \
    -v /path/to/Qwen3-4B:/model:ro \
    ghcr.io/cklxx/arle:latest \
    serve --backend cuda --model-path /model
Source · Cargo workspace
$ git clone https://github.com/cklxx/arle && cd arle
$ cargo build --release --features cuda --bin arle
# cli is default-on; cpu: --no-default-features --features cpu,no-cuda

Bench

Dated, reproducible snapshots straight from docs/experience/wins/. Numbers come out of scripts/bench_guidellm.sh and the canonical step-driver smokes — nothing is curated.

2026-06-10 beta · gate-licensed

cuda · 8×H20 TP=8 / EP=8 · DeepSeek-V4-Flash FP8 · B=1, official FlashMLA / DSA / DeepGEMM + MTP · 256K boots, needle-exact @230K

prefill
23ms
decode
15ms/tok
decode +MTP
64.2tok/s
needle-exact
230Kctx
scripts/dsv4_lever_gate.sh snapshot ↗
2026-05-21 beta · cycle-wrap

cuda · RTX 4070 Ti SUPER 16GB · Qwen3-0.6B real-checkpoint OPD step · 32-commit session, kill-or-license-gated

step
0.164s
vs naive CPU
~170×
moderate vs PyTorch CUDA
1.71× ARLE faster
held-out overlap 5k
82.8% (from 50)
cargo run -p train --example opd_step_cuda_realckpt_train --release --features cuda -- --lr 1e-7 --steps 5000 snapshot ↗
2026-05-18 beta · ad-hoc

metal · Apple M4 Pro 48GB · Qwen3.6-35B-A3B 4-bit MLX · HTTP serve, streaming /v1/completions

decode
85.6tok/s
e2e
76.1tok/s
ttft
385ms
vs mlx-lm
≈100%
arle serve --backend metal --model-path mlx-community/Qwen3.6-35B-A3B-4bit --port 8010 snapshot ↗
2026-04-28 stable · ci-gated

cuda · NVIDIA L4 · Qwen3-4B · BF16 + FP8 paged KV (auto) · c=16

output
197tok/s
itl p50
77.9ms
vs legacy
+64%
kv util
69%
scripts/bench_guidellm.sh cuda-l4-hbm-tier-fp8-auto snapshot ↗
2026-04-27 beta · validated

metal · Apple M4 Pro · Qwen3.5-0.8B Q4_K_M · GGUF decode

gen
211tok/s
e2e
202tok/s
decode
4.7ms/tok
ttft
223ms
arle serve --backend metal --model-path Qwen3.5-0.8B-Q4_K_M.gguf snapshot ↗

Support matrix

Three backends, one runtime contract. Authoritative truth lives in docs/support-matrix.md.

backendstabilityos / hardwaremodelsquantsapi
cudastableLinux + NVIDIA Ampere+Qwen3 / Qwen3.5 · DeepSeek-V4-Flash (8×H20)FP16 / BF16 · FP8 KV (auto) · GGUF Q4_KOpenAI v1
metalbetaApple Silicon (M1+)Qwen3.5 · Qwen3.6 MoE (canonical)FP16 / BF16 · MLX 4-bit · GGUF Q4_KOpenAI v1
cpudev onlyportable smokeQwen3 / Qwen3.5 (small)FP16 / BF16OpenAI v1

Where a contribution lands

No queue, no committee — a weekend PR here can move a headline number, and the battlefields are public: the serial phase plan lives in ROADMAP.md with one tracked issue per front. Start with CONTRIBUTING.md, not the maintainer plans tree.

Stars are the only metric a solo project has. If this repo saved you a read of someone else’s CUDA — or just proved it can be done in Rust — leave one. It decides how much time this gets.

★ Star cklxx/arle

Files

The repo at a glance. Everything links back to canonical paths in cklxx/arle.