🎙️ Set Up Whisper: Self-Hosted Speech-to-Text

Deploy faster-whisper on Ubuntu with Docker for offline, private speech-to-text transcription — no cloud API needed.

⚠️ This script is provided for demo and testing purposes only. Not intended for production use.

📦 Resources & Setup Scripts

Grab the automated bash script from GitHub to follow along with the video.

Automated install — private speech-to-text running in one command.

Quick Install:

wget https://raw.githubusercontent.com/mhmdali94/Docker/main/ai/whisper/whisper-ubuntu.sh
chmod +x whisper-ubuntu.sh
sudo bash whisper-ubuntu.sh

Tutorial Steps

1 Download the Script

Use wget to download the automated installer script from the GitHub repository to your Ubuntu server.

wget https://raw.githubusercontent.com/mhmdali94/Docker/main/ai/whisper/whisper-ubuntu.sh

2 Make it Executable

Grant execute permission to the downloaded script so the shell can run it directly.

chmod +x whisper-ubuntu.sh

3 Run the Installer

The script installs Docker if needed, pulls the faster-whisper image, downloads the selected model, and starts the container with an OpenAI-compatible REST API on port 9000.

sudo bash whisper-ubuntu.sh

4 Use the API

Send audio files to the transcription endpoint using curl or any HTTP client. The API follows the OpenAI speech-to-text request format, so the same form fields (file, model, response_format) apply.

curl http://<your-server-ip>:9000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model=Systran/faster-whisper-small \
  -F response_format=json

Ports Used

Port    Purpose
9000    Whisper REST API / OpenAI-compatible endpoint

Overview

OpenAI Whisper is a general-purpose speech recognition model trained on 680,000 hours of multilingual audio. When self-hosted behind a REST API using an inference engine such as whisper.cpp or faster-whisper, it transcribes audio files and real-time streams in nearly 100 languages with high accuracy, all privately on your own hardware.

Why Use It

Self-hosting Whisper eliminates per-minute transcription fees from cloud services like Amazon Transcribe or Google Cloud Speech-to-Text. Audio data stays on your server, which suits sensitive conversations, medical transcription, and meeting recordings that must not be uploaded to third-party services.

Ports and Firewall Notes

The Whisper API server runs on port 9000. Do not expose this port publicly without authentication. Access it from trusted internal networks or through a reverse proxy with authentication. For local use only, bind to 127.0.0.1.
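One way to confirm that port 9000 is not exposed is to try connecting to it from a machine outside your trusted network. The sketch below is only an illustration: the function name is hypothetical, and it checks nothing more than whether a TCP connection can be opened, not whether authentication is in place.

```python
import socket


def is_port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from an untrusted network, is_port_reachable("<your-server-ip>", 9000) should return False once the firewall or the 127.0.0.1 binding is in place.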

Alternatives

Alternatives include Amazon Transcribe (cloud, per-minute billing), Google Cloud Speech-to-Text (cloud), Vosk (lightweight open source, lower accuracy), and Deepgram (cloud, high accuracy). Choose self-hosted Whisper when you want high accuracy with no audio leaving your server.

When Not to Use It

Avoid self-hosted Whisper if you need real-time streaming transcription with very low latency; cloud services are optimized for that use case. Also avoid running the large model on CPU-only servers for long audio files, where it is too slow for practical use.

Frequently Asked Questions

Which Whisper model size should I use?

Whisper offers model sizes from tiny (39M parameters, fast but less accurate) to large-v3 (1.5B parameters, most accurate but slow). For most uses the medium model offers the best balance. Use tiny or base for speed-critical applications, and large-v3 when accuracy matters most and a GPU is available.

Does Whisper require a GPU?

No, but a GPU makes it dramatically faster. On CPU, the large model can take several minutes of processing per minute of audio. On an NVIDIA GPU with CUDA, the same transcription completes in roughly real time or faster. Use faster-whisper, which is built on the CTranslate2 engine, for better CPU performance.

What audio formats does Whisper support?

Whisper supports MP3, MP4, WAV, FLAC, OGG, WEBM, and most common audio and video formats via FFmpeg preprocessing. The API server accepts files via multipart form upload. Very large files should be split into chunks before submission.
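Splitting a long recording can be sketched as two small helpers: one plans fixed-length chunks, the other builds an FFmpeg command for each chunk. The function names, the 10-minute default, and the chunk_NNN.wav naming are assumptions for illustration. Fixed boundaries can also cut a word in half, so treat this as a starting point rather than a finished splitter.

```python
def plan_chunks(total_seconds: float, chunk_seconds: float = 600.0) -> list[tuple[float, float]]:
    """Return (start, length) pairs, in seconds, that cover the whole recording."""
    if total_seconds <= 0 or chunk_seconds <= 0:
        raise ValueError("durations must be positive")
    chunks = []
    start = 0.0
    while start < total_seconds:
        # The final chunk may be shorter than chunk_seconds.
        length = min(float(chunk_seconds), total_seconds - start)
        chunks.append((start, length))
        start += chunk_seconds
    return chunks


def ffmpeg_command(source: str, index: int, start: float, length: float) -> list[str]:
    """Build an argv list that extracts one chunk as 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",      # seek to the chunk start
        "-t", f"{length:.3f}",      # limit how much input is read
        "-i", source,
        "-ar", "16000", "-ac", "1", # resample to 16 kHz, one channel
        f"chunk_{index:03d}.wav",   # hypothetical output naming scheme
    ]
```

Each command list can be passed to subprocess.run on a machine with FFmpeg installed, and the resulting chunk files submitted to the API one at a time.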

Can Whisper detect the language automatically?

Yes. If you do not specify a language, Whisper analyzes the first 30 seconds of audio and automatically detects the spoken language. You can also specify the language explicitly for faster transcription and better accuracy on short clips.

Does Whisper produce timestamps?

Yes. Whisper generates timestamps at the segment level by default and at the word level when word-level timestamps are enabled. In the OpenAI API format, the JSON response includes them when response_format is set to verbose_json, and they can be used to generate SRT or VTT subtitle files.

Can I use Whisper to generate subtitles for videos?

Yes. Transcribe the audio track of a video using the Whisper API with timestamp output. Convert the JSON response to SRT or VTT format using a script or tool. Upload the subtitle file to PeerTube, Jellyfin, or any video player that supports external subtitles.
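The JSON-to-SRT conversion could look roughly like the sketch below. It assumes the response carries a segments list whose items have start, end (both in seconds), and text keys, as in the OpenAI verbose_json format; whether your server returns exactly that shape is an assumption to verify. The function names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Turn a list of {start, end, text} segments into SRT subtitle text."""
    blocks = []
    for number, segment in enumerate(segments, start=1):
        blocks.append(
            f"{number}\n"
            f"{srt_timestamp(segment['start'])} --> {srt_timestamp(segment['end'])}\n"
            f"{segment['text'].strip()}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)
```

Writing the returned string to a .srt file next to the video is usually enough for players that load external subtitles.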

Is the self-hosted Whisper API compatible with OpenAI's API?

Whisper server implementations such as faster-whisper-server offer an OpenAI-compatible API endpoint at /v1/audio/transcriptions, and others, like the whisper.cpp server, can typically be configured to match it. This means most tools and libraries built for the OpenAI Whisper API can switch to your self-hosted instance by changing only the base URL.

How do I transcribe a file using the API?

Send a multipart POST request to http://<your-server-ip>:9000/v1/audio/transcriptions with the audio file in the file field and the model name in the model field. The response contains the transcribed text in JSON. You can use curl, Python requests, or any HTTP client library.
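As a rough Python counterpart to the curl command, the sketch below builds the multipart body by hand using only the standard library. The helper names are hypothetical, the optional language field follows the OpenAI request format, and the code assumes the JSON response has a text key; adjust it if your server responds differently.

```python
import json
import os
import urllib.request
import uuid
from typing import Optional


def build_multipart(fields: dict, filename: str, file_bytes: bytes) -> tuple:
    """Encode text fields plus one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(base_url: str, audio_path: str,
               model: str = "Systran/faster-whisper-small",
               language: Optional[str] = None) -> str:
    """POST an audio file to /v1/audio/transcriptions and return the transcript text."""
    fields = {"model": model, "response_format": "json"}
    if language:
        fields["language"] = language  # skips auto-detection when set
    with open(audio_path, "rb") as handle:
        audio = handle.read()
    body, content_type = build_multipart(fields, os.path.basename(audio_path), audio)
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.load(response)["text"]  # assumes a {"text": ...} response
```

Calling transcribe("http://<your-server-ip>:9000", "audio.mp3") mirrors the curl example; passing language="en" adds the optional language field.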