A self-hosted transcription tool for legal academics, built on OpenAI Whisper Large-v3 and packaged so that everything — model weights, inference, audio storage, the web UI — runs inside a single Docker stack on local hardware. No audio is sent to a cloud service. Nothing leaves the institution’s network. That constraint shapes the rest of the design.
Why air-gapped
The driving requirement is the kind of audio legal academics actually work with: client interviews, oral history with vulnerable participants, materials covered by research-ethics protocols that rule out third-party processing. Sending that audio to a transcription SaaS is the wrong answer to the wrong problem. The right answer is a local pipeline that performs as well as the SaaS but never asks the user to make a trust decision about an external service.
Whisper Large-v3 makes this newly viable. It runs on a single GPU, transcribes most languages well, and includes a native English-translation task for 21 source languages. The remaining work is operational: packaging it so a non-technical user can run it, and an institution can deploy it once and forget about it.
Capabilities
- Transcription of audio in MP3, WAV, FLAC, M4A, and other common formats. Container-based formats (M4A, MP4, MOV) are pre-converted to WAV via FFmpeg before reaching the model, so the pipeline behaves consistently regardless of input wrapper.
- Translation to English for any of the 21 source languages Whisper supports natively. A simple radio toggle on the UI switches between Transcribe and Translate to English; the source language can be auto-detected or set explicitly.
- Three output formats per job: a Word document (.docx), WebVTT subtitles (.vtt), and a JSON file with raw segment data.
- Two DOCX views — a timecoded version with timestamps per paragraph for citation work, and a reader view that strips the timestamps for prose review. Paragraph chunking prefers sentence boundaries (and always breaks on a question mark) so long monologues don’t render as one wall of text; a sketch of this segmentation logic follows the list.
- Readable VTT — captions conform to readability standards: max 37 characters per line, 2 lines per cue, sentence-boundary breaks, and question marks always end a cue.
- Hallucination filter — Whisper occasionally produces end-of-audio artefacts (“thank you for watching”, subtitle credits, bare dashes, music symbols). These are filtered before any output is written; a sketch of the filter also follows the list.
- Multi-user with strict per-user isolation — each authenticated user gets their own directory; one user’s audio and transcripts are never visible to another’s session.
- Optional audio retention — uploaded files are deleted once transcription completes; a checkbox preserves the original alongside the transcript when the workflow needs both.
- Processing speed indicator — shows audio duration, elapsed time, seconds-per-minute-of-audio, and the ×real-time factor at the end of each job, so users learn what to expect from their hardware.
- Advanced panel — a collapsed accordion exposes beam size (1–5) and a `condition_on_prev_tokens` toggle for users who want to trade speed for accuracy or recover from cascading errors on long recordings.
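
The DOCX paragraph chunking and the VTT cue layout described above rest on the same sentence-boundary rule: split on sentence-final punctuation, with a question mark always forcing a break, then wrap caption text to the readability limits. The sketch below shows one way that segmentation could work; the helper names and the exact splitting pattern are illustrative, not the project's actual code.

```python
import re
import textwrap

MAX_LINE_CHARS = 37    # readability limit per caption line
MAX_LINES_PER_CUE = 2  # at most two lines per cue

def split_sentences(text: str) -> list[str]:
    # Break on sentence-final punctuation; a question mark always ends a sentence.
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

def wrap_cue(sentence: str) -> list[str]:
    # Wrap one sentence into lines of at most 37 characters,
    # then group those lines into cues of at most two lines each.
    lines = textwrap.wrap(sentence, width=MAX_LINE_CHARS)
    return [
        "\n".join(lines[i : i + MAX_LINES_PER_CUE])
        for i in range(0, len(lines), MAX_LINES_PER_CUE)
    ]

if __name__ == "__main__":
    segment = ("Could you describe the hearing? It ran long, and the judge "
               "reserved judgment until the following week.")
    for sentence in split_sentences(segment):
        for cue_text in wrap_cue(sentence):
            print(cue_text, end="\n\n")
```

Cue timestamps would come from Whisper's segment-level offsets; they are left out here to keep the sketch focused on layout.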
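The hallucination filter is, in essence, a deny-list applied to each segment before any output is written. A minimal sketch follows; the patterns are examples of the artefact classes named above, not necessarily the deployed list.

```python
import re

# Example artefact patterns; the deployed deny-list may be longer or stricter.
HALLUCINATION_PATTERNS = [
    re.compile(r"^\s*thank(s| you) for watching\W*$", re.IGNORECASE),
    re.compile(r"^\s*subtitles? by\b", re.IGNORECASE),  # subtitle credits
    re.compile(r"^[\s\-–.]*$"),                          # bare dashes / stray punctuation
    re.compile(r"^[\s♪♫]*$"),                            # music symbols
]

def drop_hallucinations(segments: list[dict]) -> list[dict]:
    # Keep only segments whose text does not match a known end-of-audio artefact.
    return [
        seg for seg in segments
        if not any(p.match(seg["text"].strip()) for p in HALLUCINATION_PATTERNS)
    ]
```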
Stack
| Component | Role |
|---|---|
| Whisper Large-v3 | Speech recognition + translation |
| Hugging Face Transformers | Model loading and the inference pipeline |
| Gradio | Web UI and authentication |
| PyTorch (bfloat16) | GPU inference |
| python-docx | Word document generation |
| FFmpeg | Audio probing and container-format pre-conversion |
The whole stack runs in a single Docker Compose project. The Whisper Large-v3 weights are downloaded once at image build time and baked into the image — no runtime model downloads, ever. Inference runs in bfloat16 rather than float16, which is both faster on modern hardware and more numerically stable on noisy audio.
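
A minimal sketch of how the inference side could be wired up with Transformers: weights loaded from a path baked into the image, bfloat16 compute, the FFmpeg pre-conversion mentioned under Capabilities, and the two parameters the advanced panel exposes. The `/models/whisper-large-v3` path, the FFmpeg settings, and the file names are illustrative assumptions, not the project's exact configuration.

```python
import subprocess
import torch
from transformers import pipeline

def to_wav(src: str, dst: str) -> None:
    # Pre-convert container formats (M4A, MP4, MOV) to 16 kHz mono WAV
    # so every input reaches the model in the same shape.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst],
        check=True, capture_output=True,
    )

asr = pipeline(
    "automatic-speech-recognition",
    model="/models/whisper-large-v3",         # weights baked in at image build time
    torch_dtype=torch.bfloat16,               # bfloat16 rather than float16
    device="cuda:0",
    model_kwargs={"local_files_only": True},  # never reach for the network at runtime
)

to_wav("interview.m4a", "interview.wav")
result = asr(
    "interview.wav",
    return_timestamps=True,
    generate_kwargs={
        "task": "transcribe",                 # or "translate" for English output
        "num_beams": 1,                       # the advanced panel exposes 1-5
        "condition_on_prev_tokens": False,    # toggle for long, error-prone recordings
    },
)
print(result["text"])
```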
Security model
- Local inference only. No network egress is required at runtime; all model weights are present in the image at build time.
- Per-user file isolation. Each authenticated user gets their own subdirectory with restricted permissions; audio and transcripts cannot be enumerated across users. A sketch of this layout follows the list.
- No hardcoded credentials. Users are configured via an environment variable rotated by editing the deployment’s `.env` file and restarting the container — no rebuild required.
- Non-root container. The transcription process runs as a non-privileged user inside the container.
- Logs record events, not content. The container’s standard log stream contains startup banners and per-job event lines (username, filename, timestamp, audio duration); transcript text never appears in container logs.
- `.gitignore` discipline. Per-user data and `.env` are excluded from version control. The repo can be cloned to a fresh machine without leaking anyone’s audio or credentials.
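
A sketch of the per-user isolation, under the assumption of a `/data/users` mount point and a `user_dir` helper (both names are illustrative): the directory for each user is derived from the authenticated username, sanitised so it cannot traverse elsewhere, and created with owner-only permissions.

```python
import os
import re
from pathlib import Path

DATA_ROOT = Path("/data/users")   # illustrative mount point inside the container

def user_dir(username: str) -> Path:
    # Sanitise the username so the derived path can never escape DATA_ROOT,
    # then create (or reuse) an owner-only directory for this user.
    safe = re.sub(r"[^a-zA-Z0-9_-]", "_", username)
    path = DATA_ROOT / safe
    path.mkdir(parents=True, exist_ok=True)
    os.chmod(path, 0o700)         # readable and writable by the service user only
    return path
```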
Operational footprint
The service binds to a configurable IP — 127.0.0.1 for single-machine use, 0.0.0.0 for shared LAN access — and authenticates through Gradio’s built-in login. Once deployed, the operational surface is small: `docker compose up -d --build` after a code change, `docker compose up -d` after editing `.env`, `docker compose logs -f` to tail. There’s no external database, no separate auth service, no Redis or queue. The system is one container, one model file, and a directory of per-user transcripts on disk.
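
The bind address and the user list can both come from the environment, which is what makes credential rotation a restart rather than a rebuild. A sketch, with hypothetical variable names (`BIND_HOST`, `BIND_PORT`, `APP_USERS`) and a placeholder UI:

```python
import os
import gradio as gr

def load_users() -> list[tuple[str, str]]:
    # APP_USERS is a hypothetical variable, e.g. "alice:pw1,bob:pw2", set in .env.
    raw = os.environ.get("APP_USERS", "")
    pairs = [entry.split(":", 1) for entry in raw.split(",") if ":" in entry]
    return [(user.strip(), password) for user, password in pairs]

with gr.Blocks() as demo:
    gr.Markdown("Transcription UI placeholder")

demo.launch(
    server_name=os.environ.get("BIND_HOST", "127.0.0.1"),  # 0.0.0.0 for shared LAN use
    server_port=int(os.environ.get("BIND_PORT", "7860")),
    auth=load_users() or None,                             # Gradio's built-in login
)
```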
Status
In active use. Built for and deployed within an academic law school context where the air-gap requirement is the whole point — running it as a public hosted service would defeat the design. Future work includes a configurable legal-lexicon prompt to bias recognition toward standing legal terminology, and per-job custom-vocabulary entry for proper nouns specific to a recording.