arle

A runtime-first Rust workspace. infer serves OpenAI-compatible traffic on CUDA, Metal, and CPU; arle is the unified front door for run, serve, train, and data flows.

cuda stable · ampere+ metal beta · apple silicon cpu dev only api openai · v1 release v0.1.4 · 2026-04-28

$ Quickstart cklxx/arle ↗

arle — bash ~/projects/arle

$ arle --doctor
cuda    ok    # nvidia-smi · cuda 12.x · ampere+
metal   beta  # apple m-series detected
cpu     ok    # dev-only smoke path
model   ok    # Qwen3-4B reachable
api     ok    # /v1/chat/completions · streaming

$ arle serve --backend cuda --model Qwen3-4B
listening on http://0.0.0.0:8000  · ready in 1.4s

Install

One runnable line per platform. Pre-built tarballs and SHAs on each GitHub Release; the curl installer verifies SHA256 before extracting.

Apple Silicon · Homebrew zsh / bash

$ brew install cklxx/tap/arle
$ arle --doctor

Linux x86_64 / macOS · curl sh-compatible

$ curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh \
    | sh
$ arle --doctor

CUDA · GPU container docker / nvidia

$ docker run --rm --gpus all -p 8000:8000 \
    -v /path/to/Qwen3-4B:/model:ro \
    ghcr.io/cklxx/arle:latest \
    serve --backend cuda --model-path /model

Source · Cargo workspace

$ git clone https://github.com/cklxx/arle && cd arle
$ cargo install --path crates/cli --features cuda
# --features cuda is opt-in; cpu builds out of the box

Bench

Dated, reproducible snapshots straight from docs/experience/wins/. Numbers come out of scripts/bench_guidellm.sh and the canonical step-driver smokes — nothing is curated.

2026-04-28 stable · ci-gated

cuda · NVIDIA L4 · Qwen3-4B · BF16 + FP8 paged KV (auto) · c=16

output

197tok/s

itl p50

77.9ms

vs legacy

+64%

kv util

69%

scripts/bench_guidellm.sh cuda-l4-hbm-tier-fp8-auto snapshot ↗

2026-04-27 beta · validated

metal · Apple M4 Pro · Qwen3.5-0.8B Q4_K_M · GGUF decode

gen

211tok/s

e2e

202tok/s

decode

4.7ms/tok

ttft

223ms

metal_bench --model Qwen3.5-0.8B-Q4_K_M.gguf snapshot ↗

Support matrix

Three backends, one runtime contract. Authoritative truth lives in docs/support-matrix.md.

backend	stability	os / hardware	models	quants	api
`cuda`	stable	Linux + NVIDIA Ampere+	Qwen3 / Qwen3.5	FP16 / BF16, GGUF Q4_K	OpenAI v1
`metal`	beta	Apple Silicon (M1+)	Qwen3 / Qwen3.5	FP16 / BF16, dense GGUF	OpenAI v1
`cpu`	dev only	portable smoke	Qwen3 / Qwen3.5 (small)	FP16 / BF16	OpenAI v1

Files

The repo at a glance. Everything links back to canonical paths in cklxx/arle.

/README.md public overview · install · CLI · architecture
/docs/http-api.md HTTP contract · streaming behavior
/docs/support-matrix.md backend / model / quant support
/docs/stability-policy.md stability levels · compatibility posture
/docs/experience/wins/ dated benchmark snapshots
/crates/cli/ arle binary · verbs · doctor
/infer/ runtime spine · scheduler · loader · http
/crates/cuda-kernels/ cuda kernel crate · csrc · prelude
/crates/mlx-sys/ metal bridge · cmake + cc
/examples/ copyable curl · Docker · Metal · tiny train smokes
/releases tagged binaries · checksums

.arle

Install

Bench

Support matrix

Files

arle