Introduction
Speech-to-text comes in three practical shapes: upload a finished file, stream a transcript back as it's produced, or run realtime captions over a live connection. This guide explains when each shines, then gives minimal scripts you can adapt for your product or internal tools.
Methods at a glance
- File transcription — simplest: send a .wav/.mp3 and get text back.
- Streaming transcription — upload once, then receive transcript text incrementally instead of waiting for the full result.
- Realtime captions — low-latency text for meetings, calls, or assistants.
How to choose
- Have a finished recording? Pick file transcription.
- Processing long recordings? Use streaming so users see partial text while the transcript is produced; split very large files before upload.
- Need live captions? Use the Realtime API for sub-second latency.
Setup once
# 1) Install the OpenAI SDK
pip install --upgrade openai
# 2) Export your key (or use .env)
export OPENAI_API_KEY="sk-...redacted..."
Tip: keep keys out of source control; use env vars or a secrets manager.
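If you prefer loading the key in code rather than your shell, here is a minimal sketch using the python-dotenv package (an extra dependency: pip install python-dotenv). The OpenAI client reads OPENAI_API_KEY from the environment automatically:
import os
from dotenv import load_dotenv  # assumes: pip install python-dotenv

load_dotenv()  # copies values from a local .env file into the environment
assert os.environ.get("OPENAI_API_KEY"), "set OPENAI_API_KEY before creating a client"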
File transcription (recordings → text)
Great for podcasts, interviews, voicemail, and call recordings.
from openai import OpenAI
client = OpenAI()
with open("meeting.wav","rb") as f:
result = client.audio.transcriptions.create(
model="whisper-1",
file=f,
response_format="text" # or "json" / "srt" / "verbose_json"
)
print(result)
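Because response_format also accepts "srt", you can write caption files directly. A minimal variation of the script above; the file names are placeholders, and this assumes the SDK returns the SRT body as plain text, as it does for the "text" format:
from openai import OpenAI

client = OpenAI()
with open("meeting.wav", "rb") as f:
    srt = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="srt",  # timestamped captions, ready for most players
    )
with open("meeting.srt", "w") as out:
    out.write(srt)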
Streaming transcription (long audio, partials)
Request a streamed response: the file uploads once, then transcript text arrives incrementally, so users see words appear instead of waiting for the whole result. (For chunked live input from a microphone, see the Realtime section below.)
from openai import OpenAI

client = OpenAI()
# Streamed output needs a model that supports it (e.g. gpt-4o-mini-transcribe);
# whisper-1 only returns a finished transcript. The whole file still uploads
# up front; the transcript comes back incrementally.
with open("lecture.wav", "rb") as f:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=f,
        stream=True,
        # add language/prompt here if helpful for accuracy
    )
    final_text = ""
    for event in stream:
        if event.type == "transcript.text.delta":
            print("partial:", event.delta)
        elif event.type == "transcript.text.done":
            final_text = event.text
print("final:", final_text)
Realtime captions (low latency)
Use WebSockets to push PCM/Opus frames and receive interim/final text.
import asyncio, base64, json, os
import sounddevice as sd
import websockets

# Placeholder endpoint/model and event shapes: check the Realtime API
# reference for the exact session and message schema before shipping.
URI = "wss://api.openai.com/v1/realtime?model=realtime-stt"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

async def mic_stream():
    # yield raw 16 kHz mono PCM frames from the microphone
    q = asyncio.Queue()
    loop = asyncio.get_running_loop()
    def cb(indata, frames, time, status):
        # the audio callback runs on a PortAudio thread, so hand frames
        # to the event loop thread-safely
        loop.call_soon_threadsafe(q.put_nowait, bytes(indata))
    with sd.RawInputStream(callback=cb, samplerate=16000, channels=1, dtype="int16"):
        while True:
            yield await q.get()

async def sender(ws):
    # realtime endpoints expect audio base64-encoded inside JSON events
    async for frame in mic_stream():
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(frame).decode("ascii"),
        }))

async def receiver(ws):
    # read transcript events as they arrive, instead of blocking the
    # send loop on a recv after every frame
    async for msg in ws:
        data = json.loads(msg)
        if data.get("text"):
            print(data["text"])

async def run():
    async with websockets.connect(URI, extra_headers=HEADERS) as ws:
        await asyncio.gather(sender(ws), receiver(ws))

asyncio.run(run())

Best practices
- Pick the right sample rate: 16 kHz mono PCM works well for speech.
- Give context: a short domain prompt (product names, people) boosts accuracy; see the sketch after this list.
- Post-process: run a second pass for punctuation, casing, or diarization if you need it.
- Privacy: scrub PII in call recordings when required; rotate keys; use HTTPS/WSS only.
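To make the context tip above concrete: pass domain vocabulary through the transcription call's prompt parameter. The file name and vocabulary below are invented placeholders:
from openai import OpenAI

client = OpenAI()
with open("support_call.wav", "rb") as f:  # hypothetical recording
    text = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        # spellings the model should prefer; keep this short and relevant
        prompt="Acme Cloud, Kubernetes, Priya Raman, OKRs",
        response_format="text",
    )
print(text)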
FAQ
Which method is cheapest? File transcription often is; realtime adds persistent connection overhead.
How fast is realtime? With small frames and good uplink, latency can be well under a second.
Can I get timestamps? Use JSON/verbose formats, then render SRT/VTT for players.
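A sketch of the timestamp route: request verbose_json and read per-segment start/end times, which map directly onto SRT/VTT cues. timestamp_granularities is a whisper-1 parameter; verify the segment field names against the current API reference:
from openai import OpenAI

client = OpenAI()
with open("meeting.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )
for seg in result.segments:
    print(f"{seg.start:.2f}s-{seg.end:.2f}s  {seg.text}")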
Conclusion & next steps
Start with file transcription to validate quality, move to streaming for long content, then add realtime when you need live captions. Keep prompts and codecs consistent so results are comparable.
Related: GPT-5 Prompt Guide • AI Voice Generator