Speech Transcription: File, Streaming & Realtime

One setup—three paths to speech-to-text. Learn when to use file uploads, chunked streaming, or realtime captions, and copy-paste patterns that ship.

Introduction

Speech-to-text has three practical shapes: upload a file, stream audio in chunks, or run realtime captions. This guide explains when each shines, then gives minimal scripts you can adapt for your product or internal tools.

Methods at a glance

  1. File transcription — simplest: send a .wav/.mp3 and get text back.
  2. Streaming transcription — send long audio in parts; see partial text as you go.
  3. Realtime captions — low-latency text for meetings, calls, or assistants.

How to choose

  • Have a finished recording? Pick file transcription.
  • Processing hours of audio? Use streaming to avoid huge uploads and to view partials.
  • Need live captions? Use the Realtime API for sub-second latency.
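The rules of thumb above could be captured in a small helper. This is only a sketch: the function name `choose_method` and the one-hour cutoff for "long" audio are illustrative assumptions, not values taken from any API.

```python
def choose_method(is_live: bool, duration_seconds: float = 0.0,
                  long_audio_threshold: float = 3600.0) -> str:
    """Return "realtime", "streaming", or "file" following the rules of thumb."""
    if is_live:
        # Live captions need the low-latency path regardless of length.
        return "realtime"
    if duration_seconds >= long_audio_threshold:
        # Long recordings benefit from partial results as they are processed.
        return "streaming"
    # A finished, reasonably short recording is the simplest case.
    return "file"


print(choose_method(is_live=False, duration_seconds=300))
```

Tune the threshold to your own content; the point is only that "live or not" is decided first and length second.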

Setup once

Shell
# 1) Install the OpenAI SDK
pip install --upgrade openai

# 2) Export your key (or use .env)
export OPENAI_API_KEY="sk-...redacted..."

Tip: keep keys out of source control; use env vars or a secrets manager.
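One way to follow that tip in your own scripts is to fail early with a clear message when the variable is missing. The helper name `load_api_key` is hypothetical; the OpenAI client also reads `OPENAI_API_KEY` on its own, so this is mainly useful when you pass the key to other code, such as a raw WebSocket connection.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the key from the environment and fail loudly if it is absent."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it or load it from a secrets manager"
        )
    return key
```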

File transcription (recordings → text)

Great for podcasts, interviews, voicemail, and call recordings.

transcribe_file.py
from openai import OpenAI
client = OpenAI()

with open("meeting.wav","rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="text"  # or "json" / "srt" / "verbose_json"
    )

print(result)

Streaming transcription (long audio, partials)

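Upload a recording and print partial text as the model produces it. Useful for long sessions. Note that the SDK streams the transcript back from an uploaded file; pushing audio up while it is still being captured is a job for the Realtime API. Long recordings still need to be split into files under the API's upload size limit (25 MB at the time of writing).

stream_transcribe.py (sketch)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("lecture.wav", "rb") as f:
    # stream=True returns an iterator of events instead of one final result.
    # Streaming needs a gpt-4o transcription model; whisper-1 does not stream.
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=f,
        response_format="text",
        stream=True,
        # add language/prompt if helpful for accuracy
    )

    final_text = ""
    for event in stream:
        if event.type == "transcript.text.delta":
            # incremental piece of the transcript
            print("partial:", event.delta)
        elif event.type == "transcript.text.done":
            # full transcript, sent once at the end
            final_text = event.text

print("final:", final_text)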

Realtime captions (low latency)

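Use a WebSocket to push base64-encoded PCM frames and receive interim/final text.

realtime_ws.py (outline)
import asyncio, base64, json, os
import sounddevice as sd
import websockets

API_KEY = os.environ["OPENAI_API_KEY"]  # never hard-code keys
# Transcription-only Realtime endpoint; confirm the URL and event names
# against the current Realtime API reference before relying on them.
URI = "wss://api.openai.com/v1/realtime?intent=transcription"
SAMPLE_RATE = 24000  # Realtime "pcm16" input is 24 kHz mono

async def mic_stream():
    # yield raw 16-bit PCM frames from the microphone
    loop = asyncio.get_running_loop()
    q = asyncio.Queue()

    def cb(indata, frames, time, status):
        # runs on the audio thread, so hand off to the event loop safely
        loop.call_soon_threadsafe(q.put_nowait, bytes(indata))

    with sd.RawInputStream(callback=cb, samplerate=SAMPLE_RATE, channels=1, dtype="int16"):
        while True:
            yield await q.get()

async def send_audio(ws):
    # audio goes up as JSON events carrying base64 data, not raw binary frames
    async for frame in mic_stream():
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(frame).decode("ascii"),
        }))

async def print_text(ws):
    # read independently of sending so a quiet server never blocks the mic
    async for msg in ws:
        data = json.loads(msg)
        kind = data.get("type", "")
        if kind.endswith("input_audio_transcription.delta"):
            print("partial:", data.get("delta", ""))
        elif kind.endswith("input_audio_transcription.completed"):
            print("final:", data.get("transcript", ""))

async def run():
    # websockets >= 14 uses additional_headers; older releases call it extra_headers
    async with websockets.connect(
        URI,
        additional_headers={"Authorization": f"Bearer {API_KEY}"}
    ) as ws:
        # Send a session-update event here to pick the transcription model
        # and audio format; its exact shape depends on the API version.
        await asyncio.gather(send_audio(ws), print_text(ws))

asyncio.run(run())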
Minimal realtime outline (WebSocket). Adapt buffering and error handling for production.

Best practices

  • Pick the right sample rate: 16 kHz mono PCM works well for speech.
  • Give context: A short domain prompt (product names, people) boosts accuracy.
  • Post-process: run a second pass for punctuation, casing, or diarization if you need it.
  • Privacy: scrub PII in call recordings when required; rotate keys; use HTTPS/WSS only.
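The sample-rate advice above is easy to verify before uploading. The sketch below uses only the standard-library `wave` module to report how a WAV file differs from 16 kHz, mono, 16-bit PCM; the function name `check_wav` and its defaults are illustrative assumptions, and it inspects the file without resampling it.

```python
import wave


def check_wav(path: str, want_rate: int = 16000, want_channels: int = 1,
              want_sample_width: int = 2) -> list[str]:
    """Return human-readable mismatches; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != want_rate:
            problems.append(
                f"sample rate is {w.getframerate()} Hz, expected {want_rate}"
            )
        if w.getnchannels() != want_channels:
            problems.append(
                f"{w.getnchannels()} channel(s), expected {want_channels}"
            )
        if w.getsampwidth() != want_sample_width:
            problems.append(
                f"{8 * w.getsampwidth()}-bit samples, expected {8 * want_sample_width}-bit"
            )
    return problems
```

If the list is not empty, convert the file with your audio tool of choice before transcribing so results stay comparable across runs.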

FAQ

Which method is cheapest? File transcription often is; realtime adds persistent connection overhead.

How fast is realtime? With small frames and good uplink, latency can be well under a second.
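"Small frames" can be made concrete with a little arithmetic: the size of a raw PCM frame is sample rate × duration × channels × bytes per sample. The helper below is a sketch; the name `frame_bytes` and the 20 ms default are assumptions chosen for illustration.

```python
def frame_bytes(sample_rate: int = 16000, frame_ms: int = 20,
                channels: int = 1, sample_width: int = 2) -> int:
    """Size in bytes of one raw PCM frame of the given duration."""
    samples = sample_rate * frame_ms // 1000  # samples per channel in the frame
    return samples * channels * sample_width


# 16 kHz mono 16-bit audio: 320 samples * 2 bytes = 640 bytes per 20 ms frame
print(frame_bytes())
```

Shorter frames reduce the delay before audio leaves the device but mean more messages per second, so it is worth measuring a few sizes on your own network.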

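Can I get timestamps? Yes. With whisper-1, request response_format="verbose_json" for segment start/end times, or ask for "srt" or "vtt" directly if you only need subtitles for a player.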
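If you start from JSON segments and want subtitles, the conversion is mostly time formatting. The sketch below assumes each segment is a dict with `start` and `end` in seconds plus a `text` field; that shape and the function names are assumptions, so adapt the field access to whatever your response object exposes.

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


print(segments_to_srt([
    {"start": 0.0, "end": 1.25, "text": " Hello and welcome. "},
    {"start": 1.25, "end": 3.0, "text": " Let's get started. "},
]))
```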

Conclusion & next steps

Start with file transcription to validate quality, move to streaming for long content, then add realtime when you need live captions. Keep prompts and codecs consistent so results are comparable.

Related: GPT-5 Prompt Guide, AI Voice Generator

Get in Touch

Building a voice feature? We can help with design, evals, and deployment.

Contact Us

We’ll respond promptly. Email: info@ondevtra.com
