🦙 Setup Ollama — Run LLMs Locally
Deploy Ollama on Ubuntu with Docker — run Llama 3, Mistral, Gemma, Phi, and hundreds of open-source language models locally. GPU and CPU supported. OpenAI-compatible API with zero cloud dependency after the initial model download.
📦 Resources & Setup Scripts
Grab the automated bash script from GitHub to follow along with the video.
Quick Install:
wget https://raw.githubusercontent.com/mhmdali94/Docker/main/ai/ollama/ollama-ubuntu.sh
chmod +x ollama-ubuntu.sh
sudo bash ollama-ubuntu.sh
Tutorial Steps
1 Download & Run the Script
The interactive script asks for your GPU type (CPU / NVIDIA / AMD), available RAM or VRAM, install mode (Ollama + Open WebUI or API only), and picks a model for you from a curated list of 20 models.
wget https://raw.githubusercontent.com/mhmdali94/Docker/main/ai/ollama/ollama-ubuntu.sh
chmod +x ollama-ubuntu.sh
sudo bash ollama-ubuntu.sh
2 Manage Models
The script pulls your selected model automatically. To add more models or manage existing ones:
# Pull an additional model
docker exec -it ollama ollama pull qwen2.5-coder:7b
# List downloaded models
docker exec -it ollama ollama list
# Remove a model
docker exec -it ollama ollama rm llama3.2
3 Access Open WebUI
Open your browser and navigate to the Open WebUI interface to chat with your local models:
http://<your-server-ip>:3210
4 Use the API
Ollama exposes an OpenAI-compatible REST API on port 11434. Connect any compatible app — AnythingLLM, Dify, VS Code extensions (Cline, Continue, KiloCode), or your own scripts:
curl http://<your-server-ip>:11434/api/generate \
-d '{"model":"llama3.2","prompt":"Hello!"}'
# VS Code extensions — use OpenAI Compatible provider:
# Base URL: http://<server-ip>:11434/v1
# API Key: any-string
Ports Used
| Port | Purpose |
|---|---|
| 11434 | Ollama REST API |
| 3210 | Open WebUI (chat interface) |
Overview
Ollama is an open-source runtime for running large language models locally on your own hardware. It supports a wide library of models — including Meta's Llama 3, Mistral, Google's Gemma, Microsoft's Phi, Qwen, and Code Llama — packaged as simple pull commands. Ollama exposes an OpenAI-compatible REST API, which means any tool built for ChatGPT can point at your local server instead. With Open WebUI bundled alongside, you get a polished chat interface without needing any cloud subscription.
Why Use It
Ollama solves the biggest friction in self-hosted AI: getting a model running. What previously required Python environments, CUDA configuration, and manual model file management is now a single `ollama pull` command. It handles model quantization, GPU layer offloading, and context management automatically. For teams that need private AI inference — whether for data compliance, cost control, or offline operation — Ollama is the fastest path from zero to a working LLM endpoint.
When You Need It
Who Should Use It
Real Use Cases
Main Features
How to Use After Installation
Security Best Practices
Ports and Firewall Notes
Ollama runs on port 11434 (REST API) and Open WebUI on port 3210. At a minimum, block port 11434 from public access — it has no built-in authentication. Expose only port 3210 through a reverse proxy with HTTPS. If using Ollama as a backend for Dify or AnythingLLM on the same server, you can keep 11434 bound to localhost only (127.0.0.1:11434) so it is never reachable from outside.
Backup and Maintenance
Common Mistakes
Troubleshooting
Alternatives
LocalAI offers a broader API surface (Whisper, Stable Diffusion, TTS) alongside LLMs, making it better for multi-modal inference servers. LM Studio provides a desktop GUI for Windows and Mac users who prefer a graphical model manager without Docker. vLLM is optimized for high-throughput GPU inference in production environments with many concurrent users. Hugging Face Transformers gives you maximum flexibility through Python code but requires significantly more setup. For most self-hosters who want a simple, reliable LLM runtime with a web UI, Ollama remains the easiest starting point.
When Not to Use It
Avoid Ollama if you need high-throughput production inference serving hundreds of simultaneous users — vLLM or Triton Inference Server are better fits at that scale. If you need image generation, text-to-speech, or speech-to-text alongside LLMs, LocalAI provides a unified API for all of these. If your team is non-technical and needs a managed AI service without infrastructure work, commercial APIs like OpenAI remain simpler. Ollama is also not ideal if your server has less than 8 GB of RAM — even small models require memory headroom beyond the model file size.
Need Help Setting Up Ollama?
PrismaTechWork provides end-to-end infrastructure services — from initial deployment and security hardening to ongoing monitoring, automated backups, and dedicated support. Whether you need a single-server setup or a multi-site network, our team ensures your infrastructure is built right, secured properly, and maintained reliably.
Frequently Asked Questions
Do I need a GPU to run Ollama?
No, but GPU is strongly recommended for usable performance. On CPU-only servers, a 7B model generates about 3–10 tokens per second — a short response takes 30–120 seconds. On an NVIDIA GPU with sufficient VRAM, the same model generates 30–80 tokens per second. For development and testing, CPU is fine. For production or interactive chat, plan for a GPU with at least 8 GB VRAM for 7B models.
Which model should I start with?
For CPU-only servers, start with `llama3.2:3b` — it is fast even without GPU and capable enough for most Q&A tasks. For GPU servers with 8 GB VRAM, try `llama3.1:8b` or `mistral:7b`. For coding assistance, `codellama:7b` or `qwen2.5-coder:7b` are strong choices. For Arabic language tasks, `aya:8b` or `qwen2.5:7b` have solid multilingual support.
Can I run multiple models at the same time?
Yes. Ollama can load multiple models simultaneously, subject to available VRAM and RAM. Each model is loaded into memory when first called and stays loaded for a configurable idle timeout (default 5 minutes). You can configure `OLLAMA_MAX_LOADED_MODELS` to control how many stay resident. On memory-constrained servers, Ollama automatically unloads models to make room for new ones.
How do I connect Ollama to AnythingLLM or Dify?
In AnythingLLM, go to Settings → LLM Preference, select Ollama, and set the base URL to http://your-server-ip:11434. In Dify, go to Settings → Model Provider, add Ollama, and enter the same URL. Both tools will then list all models you have pulled in Ollama. If Ollama and AnythingLLM/Dify are on the same server, use http://host.docker.internal:11434 inside the container.
Is the Ollama API compatible with OpenAI SDKs?
Yes, Ollama exposes an OpenAI-compatible API at /v1/chat/completions. You can use the official OpenAI Python or Node.js SDK by changing the base_url to http://your-server:11434/v1 and passing any string as the API key. Most LangChain integrations, Open WebUI, and many third-party tools support Ollama this way without any code changes.
How much disk space do models use?
Model sizes depend on parameter count and quantization. A 7B model at Q4 quantization uses about 4–5 GB. A 7B model at Q8 uses about 8 GB. A 70B model at Q4 uses about 40 GB. Models are stored in the Ollama Docker volume (usually at /root/.ollama). Run `docker exec -it ollama ollama list` to see all downloaded models and their sizes. Delete unused models with `ollama rm model-name`.
Can I use Ollama with a custom system prompt?
Yes. You can set a system prompt in Open WebUI per-model or per-conversation. You can also create a custom Modelfile that bakes a system prompt and parameters into a named model variant. For API use, pass a system message in the messages array of your chat completion request. This lets you create specialized assistants (customer support, coding helper, etc.) without fine-tuning.
How do I update Ollama and Open WebUI?
Run `docker compose pull && docker compose up -d` in your Ollama directory to pull the latest images and recreate the containers. Your model files and Open WebUI data (conversations, users, settings) are stored in named Docker volumes and are not affected by image updates. Check the Ollama and Open WebUI GitHub releases before updating to catch any breaking changes.
