Video tutorial coming soon.
🧠 Set Up LocalAI: Self-Hosted OpenAI API
Deploy LocalAI on Ubuntu with Docker — a drop-in OpenAI API replacement that runs entirely on your hardware. Supports LLMs, Whisper speech-to-text, Stable Diffusion image generation, and text-to-speech. One API, every modality, zero cloud costs.
📦 Resources & Setup Scripts
Grab the automated bash script from GitHub to follow along with the video.
Quick Install:
wget https://raw.githubusercontent.com/mhmdali94/Docker/main/ai/localai/localai-ubuntu.sh
chmod +x localai-ubuntu.sh
sudo bash localai-ubuntu.sh
Tutorial Steps
1 Download & Run the Script
The script installs Docker and deploys LocalAI. GPU support is auto-detected — if an NVIDIA GPU is present, the CUDA image is used automatically.
wget https://raw.githubusercontent.com/mhmdali94/Docker/main/ai/localai/localai-ubuntu.sh
chmod +x localai-ubuntu.sh
sudo bash localai-ubuntu.sh
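Once the script finishes, it can take a while before the API answers, especially on first start. The sketch below polls a readiness endpoint until the server responds. It assumes LocalAI serves `/readyz` on the default port 8080 used in this guide; `wait_for_localai` is a hypothetical helper name, not part of the install script.

```shell
# Poll LocalAI's readiness endpoint until it answers or the retries run out.
# Usage: wait_for_localai [base_url] [max_attempts] [delay_seconds]
wait_for_localai() {
  base_url="${1:-http://localhost:8080}"   # assumed default from this guide
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s keeps the loop quiet
    if curl -fs "$base_url/readyz" >/dev/null 2>&1; then
      echo "LocalAI is ready at $base_url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "LocalAI did not become ready after $attempts attempts" >&2
  return 1
}
```

Call it as `wait_for_localai` with no arguments for the defaults, or pass a different base URL if the API is reached through another host.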
2 Download a Model
Place a GGUF model file into the LocalAI models directory, or use the built-in model gallery to download via the API:
curl http://localhost:8080/models/apply -H "Content-Type: application/json" \
-d '{"id": "huggingface@thebloke__mistral-7b-instruct-v0.2-gguf__mistral-7b-instruct-v0.2.Q4_K_M.gguf"}'
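The gallery download runs in the background, so the model is not usable the moment the request returns. As a rough sketch, the helpers below assume the apply call answers with a JSON object containing a `status` URL, and that fetching that URL returns a JSON object with a `processed` flag. Both response shapes are assumptions to check against your LocalAI version, and the function names are hypothetical.

```shell
# Extract the "status" URL from the JSON printed by /models/apply (read from stdin).
job_status_url() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeed when a job-status JSON document (stdin) reports "processed": true.
job_is_processed() {
  grep -q '"processed"[[:space:]]*:[[:space:]]*true'
}

# Example wiring (not executed here):
#   status_url=$(curl -s http://localhost:8080/models/apply \
#     -H "Content-Type: application/json" -d '{"id": "<gallery id>"}' | job_status_url)
#   until curl -s "$status_url" | job_is_processed; do sleep 5; done
```

Text matching keeps the sketch free of extra dependencies; if `jq` is installed, it is a more robust way to read these fields.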
3 Test the API
LocalAI exposes an OpenAI-compatible API on port 8080, so any OpenAI SDK works once its base URL points at the server. Test it with a chat completion request, replacing mistral with a model name that GET /v1/models actually lists:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"mistral","messages":[{"role":"user","content":"Hello!"}]}'
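For repeated testing, the request above can be wrapped in two small functions. This is a minimal sketch: `chat_payload` and `chat` are hypothetical helper names, `LOCALAI_URL` is an assumed environment variable, and the escaping only handles backslashes and double quotes.

```shell
# Build a single-turn chat payload. Escapes backslashes and double quotes only;
# prompts with newlines or control characters need a real JSON encoder.
chat_payload() {
  model="$1"
  prompt=$(printf '%s' "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$model" "$prompt"
}

# Send the prompt to LocalAI and print the raw JSON response.
chat() {
  base_url="${LOCALAI_URL:-http://localhost:8080}"   # assumed default from this guide
  curl -s "$base_url/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "$2")"
}
```

For example, `chat mistral "Hello!"` sends the same request as the curl command above.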
4 Connect Your Applications
Point any OpenAI-compatible application at your LocalAI endpoint. Use the same base URL pattern for chat, image generation, transcription, and TTS — all through one unified API.
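Many applications built on the official OpenAI client libraries pick up their endpoint from environment variables, which allows the switch without code edits. The snippet below is a sketch under that assumption; whether a given application honors `OPENAI_BASE_URL` depends on its SDK and version, so verify it for yours.

```shell
# Point OpenAI SDK-based applications at LocalAI through environment variables.
# Host and port are the defaults assumed in this guide; adjust for a remote server.
export OPENAI_BASE_URL="http://localhost:8080/v1"
# Placeholder value: clients usually refuse to start without a key being set.
export OPENAI_API_KEY="any-string"
```

Applications that take the base URL from a config file or a constructor argument need the same two values set there instead.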
Ports Used
| Port | Purpose |
|---|---|
| 8080 | LocalAI API (OpenAI-compatible) |
Overview
LocalAI is a free, open-source, self-hosted alternative to the OpenAI API. It exposes the same REST endpoints as OpenAI (/v1/chat/completions, /v1/images/generations, /v1/audio/transcriptions, /v1/audio/speech), so an application built for the OpenAI API can usually be pointed at LocalAI by changing the base URL and dropping the API key requirement. It supports GGUF language models, Whisper for transcription, Stable Diffusion for image generation, and several TTS engines, which gives it some of the broadest modality coverage among self-hosted inference servers.
Why Use It
LocalAI's defining advantage over Ollama is breadth: it is not just an LLM runner but a multi-modal inference server covering text, speech, and images through a single OpenAI-compatible API surface. This means you can move an application that uses OpenAI's chat, transcription, and image APIs onto local hardware by changing one base URL, usually with no SDK or prompt changes, as long as it sticks to the endpoints LocalAI implements. For organizations with strict data residency requirements, this drop-in compatibility is often the fastest path to keeping AI workloads fully in-house.
When You Need It
Who Should Use It
Real Use Cases
Main Features
How to Use After Installation
Security Best Practices
Ports and Firewall Notes
LocalAI serves its entire OpenAI-compatible API on a single port, 8080; all modalities (chat, images, audio, embeddings) share it. Do not expose this port to the public internet; put a reverse proxy with TLS in front of it instead. If LocalAI serves only other containers, attach it to an internal Docker network and do not publish the port to the host at all. If the client applications run on the same host, publish the port to localhost only (127.0.0.1:8080).
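To confirm how the port is actually published, the addresses Docker reports can be checked for anything other than loopback. This is a sketch: `only_loopback` is a hypothetical helper, and the container name `localai` in the usage comment is an assumption, since the name the install script chooses is not stated here.

```shell
# Succeed only if every published address on stdin (one per line, in the form
# printed by `docker port <container> 8080/tcp`) is a loopback address.
# Empty input (port not published at all) also counts as safe.
only_loopback() {
  ! grep -v -E '^(127\.0\.0\.1|\[::1\]):' | grep -q .
}

# Example wiring (not executed here; "localai" is an assumed container name):
#   docker port localai 8080/tcp | only_loopback \
#     || echo "WARNING: port 8080 is published beyond localhost" >&2
```

A loopback-only binding corresponds to publishing the port as `127.0.0.1:8080:8080` rather than `8080:8080` in the container's port mapping.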
Backup and Maintenance
Common Mistakes
Troubleshooting
Alternatives
Ollama is simpler to set up for LLM-only use cases — one command to pull and run a model, with a cleaner model management CLI. It does not support Whisper, Stable Diffusion, or TTS, making LocalAI the better choice for multi-modal applications. LM Studio provides a desktop GUI for Windows and Mac users who prefer visual model management without Docker. vLLM is optimized for high-throughput GPU LLM inference in production environments and supports tensor parallelism across multiple GPUs — better for large-scale serving. Triton Inference Server is the enterprise standard for multi-model GPU serving but requires significant DevOps expertise. LocalAI's unique position is the widest modality coverage in a single self-hosted Docker container.
When Not to Use It
Avoid LocalAI if you only need LLM text generation — Ollama is simpler, faster to set up, and has better model management. If you need maximum LLM throughput for many concurrent users, vLLM's optimized batching and tensor parallelism outperform LocalAI significantly. LocalAI's configuration via YAML files becomes complex for large model libraries — Ollama's model registry and pull system is easier to maintain at scale. If your team is non-technical and needs a managed AI service, cloud APIs remain simpler. LocalAI is also not a good fit if you need real-time streaming with very low latency — its GGUF inference backend is not as latency-optimized as dedicated inference servers.
Frequently Asked Questions
What is the difference between LocalAI and Ollama?
Ollama is a focused LLM runner with a clean CLI, automatic model management, and an OpenAI-compatible API — optimized purely for text generation. LocalAI is a broader multi-modal inference server covering LLMs, Whisper transcription, Stable Diffusion image generation, and TTS through one API. Ollama is easier to use for LLM-only tasks. LocalAI is the right choice when you need to replace multiple OpenAI API capabilities (not just chat) with a single self-hosted endpoint.
Can I use LocalAI as a drop-in replacement for the OpenAI Python SDK?
Yes. Set `base_url='http://your-server:8080/v1'` and `api_key='any-string'` in the OpenAI Python client. All API calls that LocalAI supports — chat completions, embeddings, image generation, and transcription — will route to your local server. Your application code, prompts, and parameters stay unchanged. The only limitation is that LocalAI does not support every OpenAI feature (e.g., Assistants API, file uploads, fine-tuning) — check the LocalAI documentation for the current compatibility list.
How do I add a new model to LocalAI?
The easiest way is using the model gallery API: send a POST to /models/apply with the gallery model ID. LocalAI downloads the GGUF file and creates the YAML configuration automatically. Alternatively, place a GGUF file in the models directory and create a matching YAML file defining the model name, backend (llama-cpp), and parameters. Restart LocalAI after adding models manually. Use `curl http://localhost:8080/v1/models` to verify the model appears in the list.
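For the manual route, a minimal model definition might look like the file this sketch writes. The keys shown (`name`, `backend`, `parameters.model`, `context_size`) and the value 4096 are assumptions for illustration; compare them with the configuration reference for your LocalAI version. `write_model_config` is a hypothetical helper.

```shell
# Write a minimal model definition next to the GGUF file.
# Usage: write_model_config <models_dir> <model_name> <gguf_filename>
write_model_config() {
  models_dir="$1"
  name="$2"
  gguf="$3"
  # The YAML file name is illustrative; the "name" field is what API clients request.
  cat > "$models_dir/$name.yaml" <<EOF
name: $name
backend: llama-cpp
parameters:
  model: $gguf
context_size: 4096
EOF
}
```

For example, `write_model_config ./models mistral mistral-7b-instruct-v0.2.Q4_K_M.gguf` creates `./models/mistral.yaml`, after which the model can be requested as `mistral`.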
Does LocalAI support Whisper for speech transcription?
Yes. Download a Whisper model via the gallery API (e.g., whisper-1 maps to the medium model). Then send audio files to /v1/audio/transcriptions using the same format as the OpenAI Whisper API. LocalAI accepts WAV, MP3, and OGG formats. For best accuracy, use 16kHz mono WAV. The medium Whisper model requires about 1.5 GB of RAM/VRAM. The large-v3 model gives near-human accuracy but needs 5+ GB VRAM.
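A transcription request is a multipart upload, so it looks different from the JSON calls elsewhere in this guide. The sketch below first converts the audio to 16 kHz mono WAV with ffmpeg, then uploads it. It assumes ffmpeg is installed and that a model is registered under the name `whisper-1`; both function names and the `LOCALAI_URL` variable are hypothetical.

```shell
# Convert any input ffmpeg can read to 16 kHz mono WAV.
# Usage: to_wav16k <input_file> <output.wav>
to_wav16k() {
  ffmpeg -y -loglevel error -i "$1" -ar 16000 -ac 1 "$2"
}

# Upload an audio file for transcription and print the JSON response.
# Usage: transcribe <audio_file> [model_name]
transcribe() {
  base_url="${LOCALAI_URL:-http://localhost:8080}"   # assumed default from this guide
  curl -s "$base_url/v1/audio/transcriptions" \
    -F "file=@$1" \
    -F "model=${2:-whisper-1}"
}
```

For example, `to_wav16k meeting.mp3 meeting.wav && transcribe meeting.wav` converts a recording and sends it in one step.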
Can LocalAI generate images like DALL-E?
Yes. LocalAI supports Stable Diffusion through its /v1/images/generations endpoint, which mirrors the OpenAI DALL-E API. Add a Stable Diffusion model from the gallery and send POST requests with a text prompt, size, and number of images. Existing code that calls `openai.images.generate()` works once the client's base URL points at LocalAI. Image generation on CPU is extremely slow (5–30 min per image); a GPU with 6+ GB VRAM produces images in 10–60 seconds depending on steps and resolution.
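An image request carries the three fields mentioned above: prompt, size, and number of images. The following sketch builds that body and posts it; the defaults of `512x512` and one image are illustrative assumptions, and `image_payload` and `generate_image` are hypothetical helpers.

```shell
# Build an image-generation payload from a prompt, a size, and an image count.
# Escapes backslashes and double quotes in the prompt only.
image_payload() {
  prompt=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"prompt":"%s","size":"%s","n":%d}' "$prompt" "${2:-512x512}" "${3:-1}"
}

# Post the payload to LocalAI and print the raw JSON response.
generate_image() {
  base_url="${LOCALAI_URL:-http://localhost:8080}"   # assumed default from this guide
  curl -s "$base_url/v1/images/generations" \
    -H "Content-Type: application/json" \
    -d "$(image_payload "$@")"
}
```

Given the CPU timings above, it is worth starting with a small size such as `generate_image "a red fox" 256x256` when no GPU is available.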
What model format does LocalAI use?
LocalAI primarily uses GGUF format models for LLMs (processed by the llama.cpp backend). GGUF supports quantization levels from Q2 to Q8; Q4_K_M is the recommended balance of quality and speed. For Whisper, it uses Whisper GGML format models. For Stable Diffusion, it uses safetensors or ckpt model files. Most major open-source LLMs on Hugging Face have GGUF quantized versions published by community contributors such as TheBloke.
How does LocalAI handle multiple concurrent requests?
LocalAI queues requests when a model is busy — concurrent requests are processed sequentially by default for each model. You can configure parallel slots per model in the YAML config to allow limited concurrency at the cost of higher memory usage. For high-concurrency production scenarios, running multiple LocalAI instances behind a load balancer, or switching to vLLM for the LLM workload, is more appropriate. LocalAI is best suited for low-to-medium concurrency use cases.
How do I update LocalAI without losing my models?
Run `docker compose pull && docker compose up -d` in your LocalAI directory. Model files in the models directory volume are not affected by image updates — only the LocalAI binary and dependencies change. After updating, verify that all models still appear in `GET /v1/models` and test a chat completion. Check the LocalAI GitHub releases for breaking changes in model YAML configuration format before major version upgrades, as the config schema occasionally changes.
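The post-update check can be made mechanical by recording the model list before the update and comparing it afterwards. This sketch assumes `/v1/models` returns the usual OpenAI-style list in which each entry has an `id` field; `model_ids` is a hypothetical helper that extracts those ids with plain text matching.

```shell
# Print the model ids found in a /v1/models JSON response (stdin),
# one per line, sorted so two snapshots can be compared with diff.
model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*"\([^"]*\)"$/\1/' \
    | sort
}

# Example wiring (not executed here):
#   curl -s http://localhost:8080/v1/models | model_ids > models-before.txt
#   docker compose pull && docker compose up -d
#   curl -s http://localhost:8080/v1/models | model_ids > models-after.txt
#   diff models-before.txt models-after.txt && echo "model list unchanged"
```

An empty diff only shows that the models are still registered, so a test chat completion remains worthwhile after major upgrades.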
