llm router instead of just retrying

2 min read · Updated February 6, 2026

ECHO-636, marked Urgent. Our chat feature, where hosts ask questions about collected conversations, kept dying with rate limit errors. Users mid-analysis just got a wall of errors.

Obvious fix: retry with exponential backoff. Tried that. Helped for small bursts, but during events with 50+ active conversations being analyzed simultaneously we’d just hit the ceiling repeatedly. Backoff doesn’t help when you’re consistently over the limit.
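For contrast, the retry approach we started with was roughly this shape (a minimal sketch; `RateLimitError` stands in for whatever exception your provider SDK actually raises):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider SDK's rate-limit exception."""

def call_with_backoff(make_request, max_retries=5, base_delay=1.0):
    """Retry with exponential backoff plus jitter.
    Fine for short bursts; useless when load stays above the limit."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```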

So we built llm_router.py. Sits between our application and the LLM providers.

Core idea: distribute requests across multiple models based on task type and current load. Not every request needs the most expensive model. Suggestions? Route to TEXT_FAST with structured outputs. Deep analysis? Full model. Status summaries? Fast model is fine.
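A stripped-down sketch of that routing decision. The tier names follow the post; the task labels and model IDs are placeholders, and the real module also weighs current load, which isn't shown here:

```python
from enum import Enum

class Tier(str, Enum):
    TEXT_FAST = "text_fast"        # cheap, quick, fine with structured outputs
    TEXT_CAPABLE = "text_capable"  # full model for analysis, reports, chat

# Which tier each kind of call gets.
ROUTES = {
    "suggestion": Tier.TEXT_FAST,
    "translation": Tier.TEXT_FAST,
    "status_text": Tier.TEXT_FAST,
    "analysis": Tier.TEXT_CAPABLE,
    "report": Tier.TEXT_CAPABLE,
    "chat": Tier.TEXT_CAPABLE,
}

# Placeholder model IDs; swap in whatever your providers call them.
MODELS = {
    Tier.TEXT_FAST: "fast-model-id",
    Tier.TEXT_CAPABLE: "capable-model-id",
}

def route(task_type: str) -> str:
    """Pick a model for a request; default unknown task types to the capable tier."""
    tier = ROUTES.get(task_type, Tier.TEXT_CAPABLE)
    return MODELS[tier]
```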

~288 lines. Key decisions:

Task-based routing. Categorized our LLM calls into tiers. Suggestions, translations, and status text go to fast/cheap tier. Analysis, report generation, and chat responses go to capable tier. This alone cut rate limit hits by ~60% because half our calls didn’t need the big model anyway.

Structured outputs for suggestions. Moving suggestions to TEXT_FAST with structured outputs actually improved UX. Faster response times, more consistent formatting. The “downgrade” was an upgrade.
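Provider SDKs expose structured outputs differently, so here's only the schema-validation side as a sketch, assuming Pydantic; the field names are illustrative, not a claim about the actual implementation:

```python
from pydantic import BaseModel

class Suggestion(BaseModel):
    """Schema the fast model is asked to fill. Fields are illustrative."""
    title: str
    body: str

class SuggestionList(BaseModel):
    suggestions: list[Suggestion]

def parse_suggestions(raw_json: str) -> SuggestionList:
    """Validate the model's JSON output against the schema.
    Malformed output raises a ValidationError instead of rendering as garbage,
    which is part of why the formatting got more consistent."""
    return SuggestionList.model_validate_json(raw_json)
```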

Coordinated with stream status. Added inline stream status under the “Thinking…” state with a 20-second threshold. If the model takes longer than 20s, users see what’s happening instead of staring at a spinner.
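A minimal sketch of the threshold logic, assuming an asyncio backend and a hypothetical `send_status` callback that pushes the inline status to the client:

```python
import asyncio

STATUS_THRESHOLD_S = 20  # show inline status if the model takes longer than this

async def run_with_status(llm_call, send_status):
    """Run an LLM call; if it runs past the threshold, surface an inline
    status update instead of leaving the user on a bare spinner.
    `llm_call` is an awaitable, `send_status` an async callback (both placeholders)."""
    task = asyncio.ensure_future(llm_call)
    done, _ = await asyncio.wait({task}, timeout=STATUS_THRESHOLD_S)
    if not done:
        await send_status("Still working: analyzing conversations...")
    return await task
```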

Routing + status display completely eliminated the rate limiting complaints. Users who would’ve seen errors now either get faster responses (fast model) or see transparent progress (status display while capable model works).

What I’d do differently: build the router from the start. We bolted it on after the fact, which meant refactoring every LLM call site. If we’d started with an abstraction layer between our code and the providers, routing would’ve been a config change instead of a 288-line module touching frontend and backend.
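Concretely, that day-one abstraction layer could be as small as a single entry point every call site imports. Everything below is hypothetical naming, just to show the shape:

```python
# llm.py: hypothetical single entry point for every LLM call site.
# With this in place from day one, routing is an edit to one module
# (or a config file it reads), not a change to every caller.

def complete(task_type: str, prompt: str) -> str:
    model = _pick_model(task_type)       # routing policy lives here
    return _call_provider(model, prompt)

def _pick_model(task_type: str) -> str:
    # Initially trivial: everything goes to one model.
    return "capable-model-id"            # placeholder model ID

def _call_provider(model: str, prompt: str) -> str:
    # Wrap the provider SDK here so callers never import it directly.
    raise NotImplementedError
```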