
Why is my model slow?

Default models run on shared CPU infrastructure, so responses typically take 5–60s. Switch to a premium model for 1–5s responses, or check /status/ for current load.

Last updated: 2026-05-04

Expected response times

Model                                   Typical    Long output (2000 tokens)
qwen2.5-3b (Free fast)                  3–8s       15–30s
qwen2.5-32b (Free quality)              10–30s     30–90s
GPT-4o-mini (Pro fast)                  1–4s       5–15s
Claude Sonnet (Pro quality)             2–8s       10–30s
GPT-4o (Pro premium)                    3–10s      15–45s
Claude Opus (Pro premium)               5–15s      30–90s
Cerebras Llama 70b (Pro fast-quality)   0.5–2s     1–5s ⚡

Why default Free models are slower

We self-host qwen2.5 on CPU-only Hetzner nodes (no GPU, ~30 tok/s). It's free for us, free for you. Trade-off: latency.
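
As a rough back-of-the-envelope check, generation time can be approximated as output tokens divided by throughput. The sketch below assumes the ~30 tok/s figure quoted above and ignores queueing, prompt processing, and network time, so real responses can differ noticeably. The function name is illustrative and not part of any product API.

```python
def estimate_generation_seconds(output_tokens: int, tokens_per_second: float = 30.0) -> float:
    """Rough generation time: output tokens divided by throughput.

    The 30 tok/s default is the approximate CPU throughput quoted above;
    actual throughput varies with model size and current load.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


print(estimate_generation_seconds(600))  # 20.0
print(estimate_generation_seconds(300))  # 10.0
```

Under this simple model, halving the output length halves the estimate.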

Speed-ups

  1. Switch to Cerebras (Pro): the fastest option for text generation, by 5–20×. Tools with the ⚡ icon support it.
  2. Reduce the target output length: generation time grows roughly linearly with output length, so half the words takes about half the time.
  3. Stream output: the UI shows tokens as they arrive, so the response feels faster even though the total time is unchanged. Enabled by default.
  4. Use off-peak hours: 02:00–08:00 UTC has 30% less load.
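
If you schedule batch jobs yourself, the off-peak window can be checked in code. This is a minimal sketch, assuming the 02:00–08:00 UTC window from the list above and treating the 08:00 end as exclusive (an assumption; the source does not say). The helper name is hypothetical.

```python
from datetime import datetime, time, timezone

OFF_PEAK_START = time(2, 0)  # 02:00 UTC, from the list above
OFF_PEAK_END = time(8, 0)    # 08:00 UTC, treated as exclusive (assumption)


def is_off_peak(moment: datetime) -> bool:
    """Return True if `moment` falls inside the 02:00-08:00 UTC window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    # Convert to UTC first so local-time inputs are handled correctly.
    utc_time = moment.astimezone(timezone.utc).time()
    return OFF_PEAK_START <= utc_time < OFF_PEAK_END


print(is_off_peak(datetime(2026, 5, 4, 3, 30, tzinfo=timezone.utc)))  # True
print(is_off_peak(datetime(2026, 5, 4, 12, 0, tzinfo=timezone.utc)))  # False
```

In practice you would call is_off_peak(datetime.now(timezone.utc)) before submitting non-urgent work.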

Spike checks

If your normal model suddenly runs 3× slower:

  1. Check /status/ for incidents.
  2. Try a different model: if the alternative is fast, the original provider is likely degraded.
  3. Check your network: curl -I https://aicentraltools.com should complete in under 500ms.
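
The network check in step 3 can also be scripted. The sketch below times a single HEAD request (what curl -I sends) using Python's standard library and compares it with the 500ms figure above. The function names and the injectable opener parameter are illustrative choices, not part of any official tooling. Note that the measurement includes DNS lookup and TLS setup, and that urlopen raises an exception on 4xx/5xx responses.

```python
import time
import urllib.request


def head_latency_ms(url, timeout=10.0, opener=urllib.request.urlopen):
    """Time one HEAD request to `url` and return the elapsed milliseconds."""
    request = urllib.request.Request(url, method="HEAD")
    start = time.perf_counter()
    with opener(request, timeout=timeout) as response:
        response.read()  # HEAD responses have no body, so this returns at once
    return (time.perf_counter() - start) * 1000.0


def network_looks_healthy(latency_ms, threshold_ms=500.0):
    """Apply the <500ms rule of thumb from step 3."""
    return latency_ms < threshold_ms


# Example (requires network access):
# ms = head_latency_ms("https://aicentraltools.com")
# print(f"{ms:.0f} ms, healthy: {network_looks_healthy(ms)}")
```

A single measurement is noisy, so it is worth repeating the request a few times before concluding that your network is the bottleneck.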

Per-tool overhead

Some tools chain multiple model calls internally:

  • Blog Post Generator: 2 calls (outline + draft).
  • Agent runs: 5–20 calls.
  • Image tools: 1 model call + image generation provider (~10–60s).

The latency shown is the total across all steps. Click "Show timing breakdown" on the result to see step-by-step timing.
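
To illustrate how chained calls add up, the sketch below sums per-step timings into a total and prints each step's share. The step names and durations are invented for illustration (they are not measured values), and the dictionary format is an assumption rather than the actual shape of the timing breakdown.

```python
def total_latency_seconds(step_seconds: dict[str, float]) -> float:
    """Total latency of a chained tool run: the sum of its sequential steps."""
    return sum(step_seconds.values())


# Hypothetical timings for a two-call tool such as Blog Post Generator.
blog_post_steps = {"outline": 4.0, "draft": 12.5}

total = total_latency_seconds(blog_post_steps)
for name, seconds in blog_post_steps.items():
    print(f"{name}: {seconds:.1f}s ({seconds / total:.0%} of total)")
print(f"total: {total:.1f}s")
```

Seeing which step dominates tells you whether shortening the output or switching models is more likely to help.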

API users

Response time SLA at the 95th percentile (p95):

  • Default models: <30s.
  • Premium models: <10s.
  • Cerebras: <3s.

We don't issue refunds for requests that exceed the SLA (model latency isn't a service guarantee), but the /status/ page shows historical p50/p95 latency.
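
To compare your own measurements with the p95 figures above, you can compute percentiles from a list of observed response times. This is a minimal sketch using the nearest-rank method, which may differ slightly from how the /status/ page calculates its numbers. The tier keys and function names are hypothetical; only the 30s, 10s, and 3s limits come from the list above.

```python
import math

# p95 limits in seconds, taken from the list above; the keys are illustrative.
P95_LIMITS_SECONDS = {"default": 30.0, "premium": 10.0, "cerebras": 3.0}


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def exceeds_p95_limit(samples, tier):
    """True if the observed p95 is at or above the limit for `tier`."""
    return percentile_nearest_rank(samples, 95) >= P95_LIMITS_SECONDS[tier]


latencies = [1.2, 1.5, 1.4, 2.0, 1.8, 1.3, 2.6, 1.6, 1.7, 9.5]  # made-up samples
print(percentile_nearest_rank(latencies, 50))  # 1.6
print(percentile_nearest_rank(latencies, 95))  # 9.5
print(exceeds_p95_limit(latencies, "premium"))  # False
```

With only a handful of samples the p95 is dominated by the single slowest request, so collect a larger sample before drawing conclusions.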

