designing a transcription format for LLMs

Updated September 18, 2025

ECHO records conversations and runs them through a transcription pipeline: audio comes in, segments come out, segments feed into LLM analysis. The transcription format is the contract between our audio processing layer and our AI analysis layer.

Most transcription services give you bloated JSON with timestamps, confidence scores, word-level alignments, speaker embeddings, and a dozen metadata fields. Fine for analytics. But when your downstream consumer is an LLM with a context window limit, every unnecessary field is wasted tokens.

the format

Started with the kitchen-sink approach and stripped it down. What survived:

interface TranscriptionOutput {
  transcription: {
    language: string;
    duration_ms: number;
    speakers_detected: number;
    segments: TranscriptionSegment[];
    metadata: TranscriptionMetadata;
  };
}

interface TranscriptionSegment {
  speaker: string;
  text: string;
  meta: SegmentFlag[];
}

type SegmentFlag =
  | 'crosstalk'
  | 'unclear_audio'
  | 'technical_jargon'
  | 'laughter'
  | 'pause'
  | 'interruption';

No timestamps per segment. No confidence scores. No word-level alignments.

what we cut and why

Timestamps were the hardest cut. They feel essential. But for LLM consumption, they’re noise. The model doesn’t need to know that speaker A started at 3,200ms. It needs to know what speaker A said and whether the audio quality was reliable.

We kept timestamps at the global level (duration_ms) because the LLM benefits from knowing overall conversation length. Per-segment timing was removed.

Confidence scores seemed useful in theory: “This segment has 0.71 confidence, take it with a grain of salt.” In practice, we replaced per-segment confidence with inline uncertainty markers directly in the text:

{
  "speaker": "A",
  "text": "the [unsure: latency|legacy] system is causing issues",
  "meta": ["unclear_audio", "technical_jargon"]
}

The [unsure: option1|option2] inline syntax gives the LLM more actionable information than a numeric score. The model can reason about “it’s either latency or legacy” much better than “confidence: 0.71.”
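The markers are produced upstream in the transcription pipeline; as a minimal sketch of one way to generate them, assuming the recognizer exposes per-word alternatives (the WordHypothesis shape and the 0.8 cutoff are placeholders, not ECHO's actual internals):

// Sketch only: converts low-confidence words into inline [unsure: a|b]
// markers instead of attaching a numeric score to the segment.
interface WordHypothesis {
  word: string;            // best guess
  confidence: number;      // 0..1
  alternatives: string[];  // next-best guesses, possibly empty
}

const UNSURE_THRESHOLD = 0.8; // assumed cutoff

function renderSegmentText(words: WordHypothesis[]): string {
  return words
    .map((w) => {
      if (w.confidence >= UNSURE_THRESHOLD || w.alternatives.length === 0) {
        return w.word;
      }
      // Surface the top two candidates inline so the LLM can reason
      // about "latency or legacy" rather than "confidence: 0.71".
      return `[unsure: ${w.word}|${w.alternatives[0]}]`;
    })
    .join(' ');
}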

segment flags

Each segment gets a flat array of flags: crosstalk, unclear_audio, technical_jargon, laughter, pause, interruption. These tell the LLM how to interpret the text. Two people talking over each other means the text may be garbled. Laughter means the preceding statement might be sarcastic. A long pause suggests something significant may have happened.

Global metadata captures audio quality assessment, detected jargon domains, and background noise. Helps the LLM calibrate its overall confidence before processing individual segments.
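The metadata shape isn't spelled out above; here is a plausible sketch based on that description, with field names that are illustrative rather than ECHO's actual schema:

interface TranscriptionMetadata {
  // Names are assumptions drawn from the description above.
  audio_quality: 'good' | 'fair' | 'poor';  // overall quality assessment
  jargon_domains: string[];                 // e.g. ['cloud infrastructure', 'finance']
  background_noise: boolean;                // persistent noise detected
}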

streamability

The format is appendable. As chunks are processed in real time, we can push new segments into the array without restructuring anything. Each segment is self-contained with its own speaker label, text, and quality flags. No references to other segments or external state.

This matters because ECHO processes audio in chunks during live events. We don’t wait for the full recording. Segments stream in as they’re transcribed, and the analysis pipeline can start working on early segments while later ones are still being processed.
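Roughly what the appendable property looks like in code. transcribeChunk and analyzeSegments are stand-ins for the real pipeline stages, not ECHO's actual API:

declare function transcribeChunk(chunk: ArrayBuffer): Promise<TranscriptionSegment[]>;
declare function analyzeSegments(segments: TranscriptionSegment[]): Promise<void>;

async function processLiveAudio(
  chunks: AsyncIterable<ArrayBuffer>,
  output: TranscriptionOutput,
): Promise<void> {
  for await (const chunk of chunks) {
    const newSegments = await transcribeChunk(chunk);
    // Segments are self-contained, so appending never requires
    // touching or restructuring earlier segments.
    output.transcription.segments.push(...newSegments);
    // Analysis can start on what we have so far.
    void analyzeSegments(output.transcription.segments);
  }
}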

the trade-off

We lost precise audio navigation from the transcript format alone. If someone wants to jump to “that part where speaker B mentioned the budget,” they need to use the chunk-level audio player, not segment timestamps. Approximate timestamps (within a 30-second chunk) are good enough. Users just want to jump to “about that part,” not the exact word.
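One way to get "about that part" navigation back is a segment-to-chunk lookup kept outside the LLM-facing format. A sketch, assuming 30-second chunks and a side table the pipeline maintains (both assumptions, not part of the format above):

const CHUNK_DURATION_MS = 30_000; // assumed 30-second chunks

// Side table kept by the pipeline, never sent to the LLM:
// segmentChunk[i] = index of the audio chunk segment i came from.
type SegmentChunkIndex = number[];

function approximateSeekMs(
  segmentIndex: number,
  segmentChunk: SegmentChunkIndex,
): number {
  // Jump to the start of the chunk the segment came from:
  // precise to within 30 seconds, which is "about that part".
  return segmentChunk[segmentIndex] * CHUNK_DURATION_MS;
}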

For everything feeding into the LLM pipeline, the stripped-down format is better. Fewer tokens, more semantic signal per token, inline uncertainty the model can actually use. On a 60-minute conversation with hundreds of segments, the overhead of timestamps and confidence scores adds up.
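As a rough illustration (assumed numbers, not measurements): per-segment start and end timestamps plus a confidence float cost somewhere around 15 tokens each once JSON keys and punctuation are counted. At 500 segments, that is roughly 7,500 tokens of metadata before a single word of actual speech.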