Why is my model slow?

Default models run on shared CPU infrastructure, so responses typically take 5–60s. Switch to a premium model for 1–5s responses, or check /status/ for current load.

Last updated: 2026-05-04

Expected response times

Model	Typical response	Long output (2,000 tokens)
qwen2.5-3b (Free fast)	3–8s	15–30s
qwen2.5-32b (Free quality)	10–30s	30–90s
GPT-4o-mini (Pro fast)	1–4s	5–15s
Claude Sonnet (Pro quality)	2–8s	10–30s
GPT-4o (Pro premium)	3–10s	15–45s
Claude Opus (Pro premium)	5–15s	30–90s
Cerebras Llama 70b (Pro fast-quality)	0.5–2s	1–5s ⚡

We self-host qwen2.5 on CPU-only Hetzner nodes (no GPU, roughly 30 tokens/s). Because it's free for us to run, it's free for you. The trade-off is latency.
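To see why long outputs take a while on CPU, here is a rough back-of-envelope sketch. It assumes generation time is dominated by output tokens at a constant rate of about 30 tok/s (the figure above) and ignores queueing, network time, and prompt processing, so real response times will vary. The function name is illustrative only.

```python
def estimate_generation_seconds(output_tokens: int, tokens_per_second: float = 30.0) -> float:
    """Rough time to generate `output_tokens` at a constant throughput.

    Assumes a steady decode rate and ignores queueing, network, and prompt
    processing, so treat the result as a ballpark figure only.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# A 2,000-token answer at ~30 tok/s takes roughly a minute...
print(round(estimate_generation_seconds(2000), 1))  # 66.7
# ...and halving the output roughly halves the time.
print(round(estimate_generation_seconds(1000), 1))  # 33.3
```

Under this simple model, time is proportional to output length, which is why shorter targets return faster.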

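Ways to speed things up:

Switch to Cerebras (Pro): the fastest option for text generation, by a factor of 5–20×. Tools with the ⚡ icon support it.
Reduce the target output length: generation time grows roughly in proportion to output length, so halving the words roughly halves the time.
Stream output: the UI shows tokens as they arrive, so responses feel faster even though the total time is unchanged. Streaming is enabled by default.
Use off-peak hours: load is about 30% lower between 02:00 and 08:00 UTC.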
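If you schedule batch jobs, a small check like the one below can tell whether a given moment falls in the 02:00–08:00 UTC off-peak window. This is a minimal sketch: the function name is hypothetical, and it assumes the window starts at 02:00 (inclusive) and ends at 08:00 (exclusive).

```python
from datetime import datetime, timezone

OFF_PEAK_START_HOUR = 2  # 02:00 UTC, inclusive (assumed boundary)
OFF_PEAK_END_HOUR = 8    # 08:00 UTC, exclusive (assumed boundary)


def is_off_peak(moment: datetime) -> bool:
    """Return True if `moment` falls in the 02:00-08:00 UTC off-peak window.

    `moment` must be timezone-aware; it is converted to UTC before checking,
    so local times in any zone are handled correctly.
    """
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    utc_hour = moment.astimezone(timezone.utc).hour
    return OFF_PEAK_START_HOUR <= utc_hour < OFF_PEAK_END_HOUR


# Check whether right now is off-peak.
print(is_off_peak(datetime.now(timezone.utc)))
```

Requiring an aware datetime avoids silently treating local time as UTC.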

If your normal model suddenly runs 3× slower:

Check /status/ for incidents.
Try a different model: if the alternative is fast, the original model's provider is likely degraded.
Check your network: curl -I -s -o /dev/null -w '%{time_total}\n' https://aicentraltools.com should report under 0.5 (seconds, i.e. under 500 ms).
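The checks above can be scripted. The sketch below uses only Python's standard library: it flags a slowdown when a response takes at least 3× your usual time, and it times a single HEAD request (roughly what curl -I sends) against the 500 ms guideline. The function names and defaults are illustrative assumptions based on the figures in this article, not part of any official tool.

```python
import time
import urllib.request

NETWORK_OK_SECONDS = 0.5  # guideline from this article: under 500 ms


def is_sudden_slowdown(current_s: float, usual_s: float, factor: float = 3.0) -> bool:
    """True if the current response took at least `factor` times the usual time."""
    return usual_s > 0 and current_s >= factor * usual_s


def head_latency_seconds(url: str, timeout: float = 10.0) -> float:
    """Time one HEAD request to `url`, including DNS and TLS setup.

    urlopen raises an error on non-2xx responses (e.g. if HEAD is rejected).
    """
    request = urllib.request.Request(url, method="HEAD")
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout):
        pass  # only the time until response headers arrive matters here
    return time.perf_counter() - start


def network_looks_ok(latency_s: float) -> bool:
    """Apply the under-500 ms guideline to a measured latency."""
    return latency_s < NETWORK_OK_SECONDS
```

For example, network_looks_ok(head_latency_seconds("https://aicentraltools.com")) returns False when the request takes 500 ms or more. A single request is noisy, so repeat it a few times before drawing conclusions.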

Some tools chain multiple model calls internally. For these tools, the latency shown is the total across all calls; click "Show timing breakdown" on the result to see step-by-step timing.
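Assuming the steps of a chained tool run one after another, the total you see is the sum of the per-step times, so the breakdown mainly helps you spot which step dominates. A minimal sketch; the step names and timings below are made up for illustration.

```python
def summarize_steps(step_seconds: dict[str, float]) -> tuple[float, str]:
    """Return (total seconds, name of the slowest step) for a sequential chain."""
    if not step_seconds:
        raise ValueError("no steps to summarize")
    total = sum(step_seconds.values())  # sequential steps: times add up
    slowest = max(step_seconds, key=step_seconds.get)
    return total, slowest


# Hypothetical breakdown for a three-step tool.
steps = {"outline": 4.0, "draft": 18.5, "polish": 6.0}
total, slowest = summarize_steps(steps)
print(f"total {total:.1f}s, slowest step: {slowest}")  # total 28.5s, slowest step: draft
```

If one step dominates, switching that step's model (where the tool allows it) usually saves the most time.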

Response time SLA at 95th percentile:

We don't issue refunds when response times exceed the SLA (model latency isn't a service guarantee), but the /status/ page shows historical p50 and p95 latency.
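To compare your own experience with the p50/p95 figures on /status/, you can compute the same percentiles from response times you have recorded. This sketch uses the nearest-rank method; the /status/ page may compute percentiles differently, so expect small discrepancies. The sample timings are invented for illustration.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to avoid float error (e.g. 0.3 * 10 != 3.0).
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical response times (seconds) recorded for one model.
times = [2.1, 2.4, 2.2, 3.0, 2.8, 2.5, 9.7, 2.3, 2.6, 2.9]
print(percentile_nearest_rank(times, 50))  # 2.5
print(percentile_nearest_rank(times, 95))  # 9.7
```

Note how a single slow outlier sets the p95 in a small sample; collect more measurements before comparing against the published figures.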
