Deploy faster-whisper on Ubuntu with Docker for offline, private speech-to-text transcription — no cloud API needed.
Grab the automated bash script from GitHub to follow along with the video.
wget https://raw.githubusercontent.com/mhmdali94/Docker/main/ai/whisper/whisper-ubuntu.sh
chmod +x whisper-ubuntu.sh
sudo bash whisper-ubuntu.sh
Use wget to download the automated installer script from the GitHub repository to your Ubuntu server.
wget https://raw.githubusercontent.com/mhmdali94/Docker/main/ai/whisper/whisper-ubuntu.sh
Grant execute permission to the downloaded script so the shell can run it directly.
chmod +x whisper-ubuntu.sh
The script installs Docker if needed, pulls the faster-whisper image, downloads the selected model, and starts the container with an OpenAI-compatible REST API on port 9000.
sudo bash whisper-ubuntu.sh
Send audio files to the transcription endpoint using curl or any HTTP client. The API follows the OpenAI speech-to-text request format.
curl "http://<your-server-ip>:9000/v1/audio/transcriptions" \
  -F "file=@audio.mp3" \
  -F "model=Systran/faster-whisper-small" \
  -F "response_format=json"
| Port | Purpose |
|---|---|
| 9000 | Whisper REST API / OpenAI-compatible endpoint |
OpenAI Whisper is a general-purpose speech recognition model trained on 680,000 hours of multilingual audio. When self-hosted behind a REST API, using an inference engine such as whisper.cpp or faster-whisper, it transcribes recorded audio in 99 languages with high accuracy, all privately on your own hardware. Live streams can be handled by transcribing short chunks, at the cost of added latency.
Self-hosting Whisper eliminates per-minute transcription fees from cloud services like Amazon Transcribe or Google Speech-to-Text. Audio data stays on your server, making it ideal for sensitive conversations, medical transcription, and meeting recordings that must not be uploaded to third-party services.
The Whisper API server runs on port 9000. Do not expose this port publicly without authentication. Access it from trusted internal networks or through a reverse proxy with authentication. For local use only, bind to 127.0.0.1, for example by publishing the Docker port as 127.0.0.1:9000:9000.
Alternatives include Amazon Transcribe (cloud, per-minute billing), Google Speech-to-Text (cloud), Vosk (lightweight open source, lower accuracy), and Deepgram (cloud, high accuracy). Choose self-hosted Whisper for top-tier accuracy with no data leaving your server.
Avoid self-hosted Whisper if you need real-time streaming transcription with very low latency; cloud services are optimized for that use case. Also avoid running it on CPU-only servers for large audio files, because the large model on CPU is too slow for practical use.
Whisper offers model sizes from tiny (39M parameters, fast but less accurate) to large-v3 (1.5B parameters, most accurate but slow). For most uses the medium model offers the best balance of speed and accuracy. Use tiny or base for speed-critical applications, and large-v3 for high-accuracy requirements when a GPU is available.
Whisper does not require a GPU, but a GPU makes it dramatically faster. On CPU, the large model can take many minutes for each minute of audio. On an NVIDIA GPU with CUDA, the same transcription completes in roughly real time or faster. For the best CPU performance, use faster-whisper, which is built on the CTranslate2 inference engine and supports 8-bit quantization.
Whisper supports MP3, MP4, WAV, FLAC, OGG, WEBM, and most common audio and video formats via FFmpeg preprocessing. The API server accepts files via multipart form upload. Very large files should be split into chunks before submission.
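As a rough illustration of splitting a large file before submission, the sketch below plans fixed-length chunks and builds one FFmpeg command per chunk. The function names, the 10-minute default, and the `chunk_NNN.wav` naming are assumptions made for this example, not part of the installer script. The commands are only built here, not executed.

```python
def plan_chunks(duration_s: float, chunk_s: float = 600.0) -> list[tuple[float, float]]:
    """Return (start, length) pairs in seconds that cover the whole file."""
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    chunks = []
    start = 0.0
    while start < duration_s:
        # The last chunk is shortened so it does not run past the end.
        length = min(chunk_s, duration_s - start)
        chunks.append((start, length))
        start += chunk_s
    return chunks


def ffmpeg_split_command(src: str, index: int, start: float, length: float) -> list[str]:
    """Build an ffmpeg argument list that extracts one chunk as 16 kHz mono WAV."""
    out = f"chunk_{index:03d}.wav"  # hypothetical output naming
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",   # seek to the chunk start
        "-t", f"{length:.3f}",   # keep this many seconds
        "-i", src,
        "-ac", "1",              # mono
        "-ar", "16000",          # 16 kHz sample rate
        out,
    ]
```

Each argument list can be passed to `subprocess.run(cmd, check=True)`. Fixed boundaries may cut a word in half, so in practice you may prefer to overlap chunks slightly or split on silence.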
Whisper can detect the language automatically. If you do not specify a language, it analyzes the first 30 seconds of audio and detects the spoken language. You can also specify the language explicitly for faster transcription and better accuracy on short clips.
Whisper supports timestamps. It generates them at the segment level by default and at the word level when word-level timestamps are enabled. Timestamps are available in the API response in JSON format (in the OpenAI API format, segment timestamps appear when response_format is verbose_json) and can be used to generate SRT or VTT subtitle files.
Whisper can generate subtitles for video. Transcribe the audio track of a video using the Whisper API with timestamp output. Convert the JSON response to SRT or VTT format using a script or tool. Upload the subtitle file to PeerTube, Jellyfin, or any video player that supports external subtitles.
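The JSON-to-SRT conversion step could look like the minimal sketch below. It assumes the response contains a `segments` list whose items have `start` and `end` times in seconds plus a `text` field, which is the shape of the OpenAI `verbose_json` format; check your server's actual output before relying on it. The function names are illustrative.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Turn timestamped segments into the text of an .srt file."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        blocks.append(
            f"{number}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)
```

Write the returned string to a file such as `video.srt` with UTF-8 encoding. A VTT variant would need a `WEBVTT` header line and a dot instead of a comma before the milliseconds.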
Popular Whisper server implementations like faster-whisper-server and whisper.cpp server offer an OpenAI-compatible API endpoint at /v1/audio/transcriptions. This means any tool or library built for the OpenAI Whisper API can switch to your self-hosted instance by changing only the base URL.
Send a multipart POST request to http://YOUR_SERVER:9000/v1/audio/transcriptions with the audio file as the file field and model name as the model field. The response contains the transcribed text in JSON. You can use curl, Python requests, or any HTTP client library.