auditing our event-driven architecture
ECHO processes audio through a pipeline: upload, chunk, transcribe, analyze, report. Each step is a separate service. In theory, clean event-driven architecture. In practice, tightly coupled handlers with no clear error boundaries.
So I audited every business event in the system.
what we found
The resource management flow was the worst offender. Upload a file, it gets chunked, chunks get processed, audio gets transcribed. Sounds simple. But the event chain was linear and fragile. If any step failed, everything downstream just stopped. No retry logic. No compensation events. No dead letter queue.
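For illustration, here is roughly the shape of consumer we wanted instead: a retry budget, a dead-letter topic, and a compensation event. This is a minimal Python sketch, not our actual code; publish() and process_chunk() are stand-ins for the real broker client and worker.

import time

MAX_ATTEMPTS = 3

def publish(topic, payload):
    # stand-in for the real broker client
    print(f"publish {topic}: {payload}")

def process_chunk(payload):
    # stand-in for the real work; always fails here to exercise the failure path
    raise RuntimeError("transcoder unavailable")

def handle(payload):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process_chunk(payload)
            publish("chunk.processing.completed", payload)
            return
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(0.1 * 2 ** attempt)  # exponential backoff between retries
    # retry budget exhausted: park the event and emit a compensation event
    # instead of silently stopping everything downstream
    publish("chunk.processing.dead_letter", payload)
    publish("chunk.processing.failed", payload)

handle({"chunk_id": "c-42", "upload_id": "u-7"})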
We also had “god events”, massive payloads carrying 50 fields when a consumer only needed 3. And our event naming was inconsistent: some events described things that happened (upload.completed), others described things to do (process_chunk). Mixing commands and events makes reasoning about flow much harder.
naming fix
We standardized on past tense for events (things that happened), imperative for commands (things to do):

"chunk.process_request" -> "process 'chunk'"
"chunk.processing.requested" -> "chunk.processing.started" -> "chunk.processing.completed"

Sounds pedantic until you’re debugging a production issue at midnight trying to figure out whether chunk.process means “a chunk was processed” or “please process this chunk.”
domain boundaries
Split events into four bounded contexts:
├── Upload Context (pre-signed URL, upload events)
├── Processing Context (chunk operations)
├── Transcription Context (audio processing)
└── Resource Context (lifecycle management)

Before this, authentication events were directly triggering project management actions. Resource processing was tightly coupled to specific worker types. Cross-cutting concerns everywhere.
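A cheap way to keep the boundaries honest is to make event ownership explicit. A rough sketch, assuming ownership is decided by the first segment of the event name; the prefix-to-context mapping and the auth example are illustrative, not exhaustive:

CONTEXT_BY_PREFIX = {
    "upload": "Upload Context",
    "chunk": "Processing Context",
    "transcription": "Transcription Context",
    "resource": "Resource Context",
}

def owning_context(event_name):
    # the first segment of the event name decides which context owns it
    prefix = event_name.split(".", 1)[0]
    context = CONTEXT_BY_PREFIX.get(prefix)
    if context is None:
        raise ValueError(f"{event_name!r} has no owning bounded context")
    return context

for name in ("chunk.processing.completed", "auth.login.succeeded"):
    try:
        print(name, "->", owning_context(name))
    except ValueError as err:
        print(err)  # flags cross-cutting events like the auth one above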
anti-patterns we fixed
Synchronous event chains: event A waited for B, which waited for C. Latency was additive, and any failure blocked the entire chain. We moved to parallel processing where possible: A triggers B and C independently.
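Roughly what the fan-out looks like, sketched with asyncio standing in for the broker; the consumer names are placeholders:

import asyncio

async def consumer_b(event):
    await asyncio.sleep(0.1)            # simulated work
    print("B handled", event["id"])

async def consumer_c(event):
    await asyncio.sleep(0.1)
    print("C handled", event["id"])

async def on_event_a(event):
    # B and C no longer wait on each other; a failure in one is isolated
    # and reported instead of blocking the whole chain
    results = await asyncio.gather(
        consumer_b(event), consumer_c(event), return_exceptions=True
    )
    print(results)

asyncio.run(on_event_a({"id": "evt-1"}))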
Missing idempotency: processing the same event twice created duplicate data. Added correlation IDs and deduplication at the consumer level.
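A minimal sketch of that dedup check, with an in-memory set standing in for whatever store you would actually use (Redis, a unique index, etc.):

seen_correlation_ids = set()

def handle_once(event):
    correlation_id = event["correlation_id"]
    if correlation_id in seen_correlation_ids:
        print(f"duplicate {correlation_id}, skipping")  # at-least-once delivery is expected
        return
    seen_correlation_ids.add(correlation_id)
    print(f"processing {correlation_id}")               # the real handler work goes here

event = {"correlation_id": "u-7/c-42", "chunk_id": "c-42"}
handle_once(event)   # processed
handle_once(event)   # redelivery is skipped, no duplicate rows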
No timeout handling: if a chunk processing job hung, nothing noticed. Added temporal events (processing.timeout, upload.expired, resource.ttl.reached) that fire when deadlines are exceeded.
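The watchdog itself can be tiny. A sketch, assuming a publish() stand-in and an in-memory deadline table; in practice this runs as its own scheduled job against real state:

import time

deadlines = {"job-1": time.time() + 0.5}   # job_id -> absolute deadline
completed = set()                           # job_ids that finished in time

def publish(topic, payload):
    print("publish", topic, payload)        # stand-in for the broker client

def check_deadlines():
    now = time.time()
    for job_id, deadline in list(deadlines.items()):
        if job_id not in completed and now > deadline:
            publish("processing.timeout", {"job_id": job_id})
            del deadlines[job_id]           # fire the temporal event exactly once

time.sleep(0.6)      # nothing marked job-1 as completed before its deadline
check_deadlines()    # -> publish processing.timeout {'job_id': 'job-1'}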
target flow
After the audit:
upload.requested
├── upload.url.generated
├── upload.completed
│   ├── validation.started
│   ├── validation.passed
│   │   ├── processing.job.created
│   │   ├── chunk.processing.started (x N, parallel)
│   │   ├── chunk.processing.completed (x N)
│   │   ├── transcription.started
│   │   ├── transcription.completed
│   │   └── resource.ready
│   └── validation.failed
│       └── resource.rejected
└── upload.failed
    └── cleanup.initiated

Every event has an explicit failure path. Every long-running operation has a timeout. Each bounded context handles its own errors and emits compensation events when things go wrong.
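As one concrete example of a failure path, here is a sketch of a validation.failed handler answering with resource.rejected. The helpers are hypothetical stand-ins, not our actual services:

def publish(topic, payload):
    print("publish", topic, payload)       # stand-in for the broker client

def mark_rejected(resource_id, reason):
    print(f"resource {resource_id} rejected: {reason}")   # would update the database

def on_validation_failed(event):
    # compensation: undo the half-created resource, then announce it to the rest of the system
    mark_rejected(event["resource_id"], event["reason"])
    publish("resource.rejected", {"resource_id": event["resource_id"]})

on_validation_failed({"resource_id": "r-9", "reason": "unsupported codec"})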
The audit took a day. Refactoring has been ongoing. But the immediate win was clarity: when something fails in production, we can trace the event chain and see exactly where it broke instead of digging through logs.