A self-hosted transcription tool for legal academics, built on OpenAI Whisper Large-v3 and packaged so that everything — model weights, inference, audio storage, the web UI — runs inside a single Docker stack on local hardware. No audio is sent to a cloud service. Nothing leaves the institution’s network. That constraint shapes the rest of the design.
Why air-gapped
The driving requirement is the kind of audio legal academics actually work with: client interviews, oral history with vulnerable participants, materials covered by research-ethics protocols that rule out third-party processing. Sending that audio to a transcription SaaS is the wrong answer to the wrong problem. The right answer is a local pipeline that performs as well as the SaaS but never asks the user to make a trust decision about an external service.
Whisper Large-v3 makes this newly viable. It runs on a single GPU, transcribes most languages well, and includes a native English-translation task for 21 source languages. The remaining work is operational: packaging it so a non-technical user can run it, and an institution can deploy it once and forget about it.
Capabilities
- Transcription of audio in MP3, WAV, FLAC, M4A, and other common formats. Container-based formats (M4A, MP4, MOV) are pre-converted to WAV via FFmpeg before reaching the model, so the pipeline behaves consistently regardless of input wrapper.
- Translation to English for any of the 21 source languages Whisper supports natively. A simple radio toggle on the UI switches between Transcribe and Translate to English; the source language can be auto-detected or set explicitly.
- Three output formats per job: a Word document (.docx), WebVTT subtitles (.vtt), and a JSON file with raw segment data.
- Two DOCX views — a timecoded version with timestamps per paragraph for citation work, and a reader view that strips the timestamps for prose review. Paragraph chunking prefers sentence boundaries (and always breaks on a question mark) so long monologues don’t render as one wall of text; a sketch of this segmentation logic follows the list.
- Readable VTT — captions conform to readability standards: max 37 characters per line, 2 lines per cue, sentence-boundary breaks, and question marks always end a cue.
- Hallucination filter — Whisper occasionally produces end-of-audio artefacts (“thank you for watching”, subtitle credits, bare dashes, music symbols). These are filtered before any output is written; a sketch of the filter also follows the list.
- Multi-user with strict per-user isolation — each authenticated user gets their own directory; one user’s audio and transcripts are never visible to another’s session.
- Optional audio retention — uploaded files are deleted once transcription completes; a checkbox preserves the original alongside the transcript when the workflow needs both.
- Processing speed indicator — shows audio duration, elapsed time, seconds-per-minute-of-audio, and the ×real-time factor at the end of each job, so users learn what to expect from their hardware.
- Advanced panel — a collapsed accordion exposes beam size (1–5) and a `condition_on_prev_tokens` toggle for users who want to trade speed for accuracy or recover from cascading errors on long recordings.
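
The DOCX paragraph chunking and the VTT cue layout described above rest on the same sentence-boundary rule: split on sentence-final punctuation, with a question mark always forcing a break, then wrap caption text to the readability limits. The sketch below shows one way that segmentation could work; the helper names and the exact splitting pattern are illustrative, not the project's actual code.

```python
import re
import textwrap

MAX_LINE_CHARS = 37    # readability limit per caption line
MAX_LINES_PER_CUE = 2  # at most two lines per cue

def split_sentences(text: str) -> list[str]:
    # Break on sentence-final punctuation; a question mark always ends a sentence.
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

def wrap_cue(sentence: str) -> list[str]:
    # Wrap one sentence into lines of at most 37 characters,
    # then group those lines into cues of at most two lines each.
    lines = textwrap.wrap(sentence, width=MAX_LINE_CHARS)
    return [
        "\n".join(lines[i : i + MAX_LINES_PER_CUE])
        for i in range(0, len(lines), MAX_LINES_PER_CUE)
    ]

if __name__ == "__main__":
    segment = ("Could you describe the hearing? It ran long, and the judge "
               "reserved judgment until the following week.")
    for sentence in split_sentences(segment):
        for cue_text in wrap_cue(sentence):
            print(cue_text, end="\n\n")
```

Cue timestamps would come from Whisper's segment-level offsets; they are left out here to keep the sketch focused on layout.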
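The hallucination filter is, in essence, a deny-list applied to each segment before any output is written. A minimal sketch follows; the patterns are examples of the artefact classes named above, not necessarily the deployed list.

```python
import re

# Example artefact patterns; the deployed deny-list may be longer or stricter.
HALLUCINATION_PATTERNS = [
    re.compile(r"^\s*thank(s| you) for watching\W*$", re.IGNORECASE),
    re.compile(r"^\s*subtitles? by\b", re.IGNORECASE),  # subtitle credits
    re.compile(r"^[\s\-–.]*$"),                          # bare dashes / stray punctuation
    re.compile(r"^[\s♪♫]*$"),                            # music symbols
]

def drop_hallucinations(segments: list[dict]) -> list[dict]:
    # Keep only segments whose text does not match a known end-of-audio artefact.
    return [
        seg for seg in segments
        if not any(p.match(seg["text"].strip()) for p in HALLUCINATION_PATTERNS)
    ]
```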
Stack
| Component | Role |
|---|---|
| Whisper Large-v3 | Speech recognition + translation |
| Hugging Face Transformers | Model loading and the inference pipeline |
| Gradio | Web UI and authentication |
| PyTorch (bfloat16) | GPU inference |
| python-docx | Word document generation |
| FFmpeg | Audio probing and container-format pre-conversion |
The whole stack runs in a single Docker Compose project. The Whisper Large-v3 weights are downloaded once at image build time and baked into the image — no runtime model downloads, ever. Inference runs in bfloat16 rather than float16, which is both faster on modern hardware and more numerically stable on noisy audio.
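
A minimal sketch of how the inference side could be wired up with Transformers: weights loaded from a path baked into the image, bfloat16 compute, the FFmpeg pre-conversion mentioned under Capabilities, and the two parameters the advanced panel exposes. The `/models/whisper-large-v3` path, the FFmpeg settings, and the file names are illustrative assumptions, not the project's exact configuration.

```python
import subprocess
import torch
from transformers import pipeline

def to_wav(src: str, dst: str) -> None:
    # Pre-convert container formats (M4A, MP4, MOV) to 16 kHz mono WAV
    # so every input reaches the model in the same shape.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst],
        check=True, capture_output=True,
    )

asr = pipeline(
    "automatic-speech-recognition",
    model="/models/whisper-large-v3",         # weights baked in at image build time
    torch_dtype=torch.bfloat16,               # bfloat16 rather than float16
    device="cuda:0",
    model_kwargs={"local_files_only": True},  # never reach for the network at runtime
)

to_wav("interview.m4a", "interview.wav")
result = asr(
    "interview.wav",
    return_timestamps=True,
    generate_kwargs={
        "task": "transcribe",                 # or "translate" for English output
        "num_beams": 1,                       # the advanced panel exposes 1-5
        "condition_on_prev_tokens": False,    # toggle for long, error-prone recordings
    },
)
print(result["text"])
```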
Security model
- Local inference only. No network egress is required at runtime; all model weights are present in the image at build time.
- Per-user file isolation. Each authenticated user gets their own subdirectory with restricted permissions; audio and transcripts cannot be enumerated across users. A sketch of this layout follows the list.
- No hardcoded credentials. Users are configured via an environment variable rotated by editing the deployment’s `.env` file and restarting the container — no rebuild required.
- Non-root container. The transcription process runs as a non-privileged user inside the container.
- Logs record events, not content. The container’s standard log stream contains startup banners and per-job event lines (username, filename, timestamp, audio duration); transcript text never appears in container logs.
- `.gitignore` discipline. Per-user data and `.env` are excluded from version control. The repo can be cloned to a fresh machine without leaking anyone’s audio or credentials.
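
A sketch of the per-user isolation, under the assumption of a `/data/users` mount point and a `user_dir` helper (both names are illustrative): the directory for each user is derived from the authenticated username, sanitised so it cannot traverse elsewhere, and created with owner-only permissions.

```python
import os
import re
from pathlib import Path

DATA_ROOT = Path("/data/users")   # illustrative mount point inside the container

def user_dir(username: str) -> Path:
    # Sanitise the username so the derived path can never escape DATA_ROOT,
    # then create (or reuse) an owner-only directory for this user.
    safe = re.sub(r"[^a-zA-Z0-9_-]", "_", username)
    path = DATA_ROOT / safe
    path.mkdir(parents=True, exist_ok=True)
    os.chmod(path, 0o700)         # readable and writable by the service user only
    return path
```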
Operational footprint
The service binds to a configurable IP — 127.0.0.1 for single-machine use, 0.0.0.0 for shared LAN access — and authenticates through Gradio’s built-in login. Once deployed, the operational surface is small: `docker compose up -d --build` after a code change, `docker compose up -d` after editing `.env`, `docker compose logs -f` to tail. There’s no external database, no separate auth service, no Redis or queue. The system is one container, one model file, and a directory of per-user transcripts on disk.
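
The bind address and the user list can both come from the environment, which is what makes credential rotation a restart rather than a rebuild. A sketch, with hypothetical variable names (`BIND_HOST`, `BIND_PORT`, `APP_USERS`) and a placeholder UI:

```python
import os
import gradio as gr

def load_users() -> list[tuple[str, str]]:
    # APP_USERS is a hypothetical variable, e.g. "alice:pw1,bob:pw2", set in .env.
    raw = os.environ.get("APP_USERS", "")
    pairs = [entry.split(":", 1) for entry in raw.split(",") if ":" in entry]
    return [(user.strip(), password) for user, password in pairs]

with gr.Blocks() as demo:
    gr.Markdown("Transcription UI placeholder")

demo.launch(
    server_name=os.environ.get("BIND_HOST", "127.0.0.1"),  # 0.0.0.0 for shared LAN use
    server_port=int(os.environ.get("BIND_PORT", "7860")),
    auth=load_users() or None,                             # Gradio's built-in login
)
```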
Status
In active use. Built for and deployed within an academic law school context where the air-gap requirement is the whole point — running it as a public hosted service would defeat the design. Future work includes a configurable legal-lexicon prompt to bias recognition toward standing legal terminology, and per-job custom-vocabulary entry for proper nouns specific to a recording.