Real-time transcription pipeline for in-person conversations

ECHO records in-person conversations at events, breakout rooms, consultation sessions. Participants open a link on their phone, the browser records audio, 30-second chunks get uploaded continuously. Here’s how that audio becomes searchable text.

Frontend records in 30-second chunks and uploads each one immediately. Not a design choice for elegance. It’s for reliability. If someone’s phone loses connection mid-conversation, you’ve lost at most 30 seconds, not the entire recording. Each chunk hits our FastAPI backend, gets validated, goes to S3.
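
For reference, the ingest path boils down to something like this. This is a hedged sketch rather than our exact code; the route, the accepted content types, and the bucket name are illustrative:

# chunk_upload.py - sketch of the per-chunk ingest path (route and names are illustrative)
import uuid
import boto3
from fastapi import FastAPI, UploadFile, HTTPException

app = FastAPI()
s3 = boto3.client("s3")
AUDIO_BUCKET = "echo-audio-chunks"  # assumption: the real bucket name differs

@app.post("/conversations/{conversation_id}/chunks")
async def upload_chunk(conversation_id: str, file: UploadFile):
    # Basic validation: reject anything that is clearly not audio.
    if file.content_type not in {"audio/webm", "audio/ogg", "audio/wav", "audio/mpeg"}:
        raise HTTPException(status_code=415, detail="unsupported audio type")

    chunk_id = str(uuid.uuid4())
    key = f"{conversation_id}/{chunk_id}.webm"
    body = await file.read()

    # Store the raw chunk immediately; all processing happens asynchronously after this.
    s3.put_object(Bucket=AUDIO_BUCKET, Key=key, Body=body)
    return {"chunk_id": chunk_id, "s3_key": key}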

Processing pipeline kicks off per chunk. Each audio chunk gets split if needed (some recordings come in as larger files), then dispatched for transcription. We run faster-whisper-large-v3 on RunPod GPU instances. Handles Dutch, German, French, English, Croatian, whatever language the participants are speaking.

# Dispatch transcription events for each sub-chunk
for index, inner_chunk_id in enumerate(split_chunk_ids):
    if inner_chunk_id is not None:
        event_bus.publish("TranscriptionRequested", {
            "chunk_id": inner_chunk_id,
            "conversation_id": conversation_id,
            "chunk_index": index,
            "total_chunks": total_chunks
        })
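
On the GPU side, the transcription step itself is thin. Here's a minimal sketch using the faster-whisper library with language auto-detection; the RunPod worker wiring around it is omitted and the parameters are assumptions:

# transcribe_worker.py - sketch of the GPU-side transcription step (faster-whisper)
from faster_whisper import WhisperModel

# large-v3 with 16-bit inference; device/compute_type depend on the GPU instance
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe_chunk(audio_path: str) -> dict:
    # language=None lets the model auto-detect Dutch, German, French, English, Croatian, ...
    segments, info = model.transcribe(audio_path, language=None, vad_filter=True)
    return {
        "language": info.language,
        "text": " ".join(segment.text.strip() for segment in segments),
    }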

Completion detection is the interesting part. When you split audio and transcribe chunks in parallel, how do you know when the entire conversation is done? Atomic Redis counters. Each completed transcription increments a counter. When the counter matches the total chunk count, the AllChunksTranscribed event fires and triggers downstream processing: summarization, analysis, report generation.
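
The pattern is small enough to show in full. A sketch assuming redis-py and the same event_bus as in the dispatch snippet; the key format is illustrative:

# completion.py - sketch of atomic completion detection (key names are illustrative)
import redis

r = redis.Redis()

def on_chunk_transcribed(conversation_id: str, total_chunks: int) -> None:
    # INCR is atomic, so concurrent workers can't double-count or race past the total.
    completed = r.incr(f"conversation:{conversation_id}:transcribed")
    if completed == total_chunks:
        # event_bus is the same publisher used in the dispatch snippet above.
        event_bus.publish("AllChunksTranscribed", {"conversation_id": conversation_id})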

Diarization runs in parallel on a separate RunPod service. We monitor audio quality metrics in real time: silence ratio, noise ratio, crosstalk instances. If quality drops, the participant portal shows contextual tips like “move closer to the microphone.” This health monitoring was born out of necessity. When transcription quality tanked, we needed to know if the problem was our model or the input audio.
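
The metrics themselves are cheap to compute per chunk. Here's a sketch of the silence-ratio check, assuming soundfile and numpy; the frame size and energy threshold are illustrative, not our production values:

# quality.py - sketch of a per-chunk audio health metric (threshold is illustrative)
import numpy as np
import soundfile as sf

def silence_ratio(audio_path: str, threshold: float = 0.01, frame_ms: int = 50) -> float:
    samples, sample_rate = sf.read(audio_path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)  # downmix to mono
    frame_len = int(sample_rate * frame_ms / 1000)
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    # A frame counts as silent when its RMS energy falls below the threshold.
    silent = sum(1 for f in frames if len(f) and np.sqrt(np.mean(f ** 2)) < threshold)
    return silent / max(len(frames), 1)

A rising silence ratio is what drives the “move closer to the microphone” tip in the portal.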

Multilingual audio adds a wrinkle. Participants at European government events frequently code-switch: Dutch peppered with English phrases, French meetings where someone quotes an English document. Most transcription models handle monolingual audio fine but struggle with intra-utterance language switching. We run a small LLM post-processing step for proper noun alignment and mixed-language cleanup:

TRANSLATION_PROMPT = """You are an expert multilingual translation assistant.
1. Identify phrases NOT in the target language.
2. Translate ONLY those identified parts.
3. Preserve text ALREADY in the target language.
4. Integrate into a single, seamless output.
**Hotwords:** The user has provided a list of "hotwords"..."""
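
The post-processing call itself is a single chat completion. A sketch assuming an OpenAI-style client; the provider, model name, and message layout are assumptions, and only TRANSLATION_PROMPT comes from above:

# postprocess.py - sketch of the mixed-language cleanup step (provider and model are assumptions)
from openai import OpenAI

client = OpenAI()

def clean_transcript(text: str, target_language: str, hotwords: list[str]) -> str:
    user_message = (
        f"Target language: {target_language}\n"
        f"Hotwords: {', '.join(hotwords)}\n\n"
        f"Transcript:\n{text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any small instruction-following model works here
        messages=[
            {"role": "system", "content": TRANSLATION_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content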

Hotwords list is project-configurable. Organization names, technical terms, participant names. Transcription gets biased toward these terms to improve accuracy on domain-specific vocabulary.
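
Recent faster-whisper versions accept a hotwords string at transcription time, so the biasing is roughly this. The project_config shape is hypothetical, and model is the instance from the worker sketch above:

# Hotword biasing - sketch building on the model from the worker snippet above
def transcribe_with_hotwords(audio_path: str, project_config: dict) -> list[str]:
    # project_config shape is hypothetical; imagine {"hotwords": ["ECHO", "RunPod", ...]}
    hotwords = " ".join(project_config.get("hotwords", []))
    # Newer faster-whisper releases take a hotwords string; older ones can
    # approximate the same biasing by passing the terms via initial_prompt.
    segments, _ = model.transcribe(audio_path, hotwords=hotwords or None)
    return [segment.text.strip() for segment in segments]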

Btw, start with the async model from day one if you’re building something like this. We originally built synchronous transcription (send audio, wait for text), which was simpler but didn’t scale. The migration to async processing (submit job, poll for status, handle timeouts) broke downstream assumptions and caused months of stability issues.
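
If you do go async, the contract the rest of the system has to honor is submit, poll, time out. A sketch of the client side; the endpoints, state names, and intervals are illustrative:

# async_client.py - sketch of the submit/poll/timeout contract (endpoints are illustrative)
import time
import requests

def transcribe_async(base_url: str, chunk_id: str, timeout_s: int = 300) -> dict:
    job = requests.post(f"{base_url}/jobs", json={"chunk_id": chunk_id}).json()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = requests.get(f"{base_url}/jobs/{job['id']}").json()
        if status["state"] == "completed":
            return status["result"]
        if status["state"] == "failed":
            raise RuntimeError(f"transcription failed: {status.get('error')}")
        time.sleep(2)  # fixed poll interval; backoff would be nicer in practice
    raise TimeoutError(f"job {job['id']} did not finish within {timeout_s}s")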