Video tutorial coming soon.
📊 Setup Langfuse — LLM Observability Platform
Deploy Langfuse on Ubuntu with Docker — an open-source LLM observability and evaluation platform. Trace every AI call, measure output quality, detect prompt regressions, and track costs across all your LLM applications from one self-hosted dashboard.
📦 Resources & Setup Scripts
Grab the automated bash script from GitHub to follow along with the video.
Quick Install:
wget https://raw.githubusercontent.com/mhmdali94/Docker/main/ai/langfuse/langfuse-ubuntu.sh
chmod +x langfuse-ubuntu.sh
sudo bash langfuse-ubuntu.sh
Tutorial Steps
1 Download & Run the Script
The script installs Docker and deploys Langfuse with a PostgreSQL database for storing traces, evaluations, and prompt versions.
wget https://raw.githubusercontent.com/mhmdali94/Docker/main/ai/langfuse/langfuse-ubuntu.sh
chmod +x langfuse-ubuntu.sh
sudo bash langfuse-ubuntu.sh
2 Create Your Account & Project
Open your browser and navigate to Langfuse. Register an account, create a project, and generate your API keys:
http://<your-server-ip>:3000
3 Instrument Your Application
Install the Langfuse SDK in your application and wrap your LLM calls. For Python with OpenAI, it takes two lines:
pip install langfuse
# In your code:
from langfuse.openai import openai # drop-in replacement
# All openai.chat.completions.create() calls are now traced automatically
4 Explore Traces & Evaluate
Open the Langfuse dashboard to see your traces, review individual LLM calls, add human or LLM-based evaluation scores, and set up automated evaluators for continuous quality monitoring.
Ports Used
| Port | Purpose |
|---|---|
| 3000 | Langfuse Web UI & API |
Overview
Langfuse is an open-source LLM engineering platform focused on observability, evaluation, and prompt management. It captures detailed traces of every LLM call in your application — inputs, outputs, token counts, latency, and cost — and organizes them into a searchable, filterable trace explorer. Teams use it to debug unexpected model outputs, run evaluation pipelines to score response quality, manage and version prompt templates, and track spending across multiple LLM providers. Its SDKs integrate with LangChain, LlamaIndex, OpenAI, and any custom LLM call.
Why Use It
LLM applications fail in ways that traditional monitoring cannot catch — a model gives a factually wrong answer, hallucinates a product feature, or becomes less helpful after a prompt change. Langfuse makes these invisible failures visible. By capturing the full context of every LLM interaction — the exact prompt, the model response, the retrieval context, and evaluation scores — it gives engineering teams the data they need to understand, improve, and maintain AI quality systematically. For organizations that cannot send conversation data to cloud observability tools due to privacy requirements, self-hosting Langfuse is the right approach.
When You Need It
Who Should Use It
Real Use Cases
Main Features
How to Use After Installation
Security Best Practices
Ports and Firewall Notes
Langfuse runs on port 3000 for both the web UI and its ingest API. Your instrumented applications send traces to this port, so it must be reachable from your application servers. Block it from public internet access and expose it through Nginx Proxy Manager with HTTPS. If applications and Langfuse are on the same server or Docker network, traces can be sent to the internal Docker network address without exposing port 3000 publicly.
Backup and Maintenance
Common Mistakes
Troubleshooting
Alternatives
Helicone is a cloud-based LLM proxy that adds observability as a transparent middleware layer — easier to set up but stores data externally and charges per request. Braintrust is a cloud-native evaluation platform with a stronger focus on dataset management and fine-tuning workflows — not self-hostable. Phoenix by Arize is an open-source alternative with a strong emphasis on embedding visualization and retrieval quality for RAG pipelines. LangSmith (LangChain's observability tool) is cloud-first but has a self-hosted option — tightly integrated with LangChain but more opinionated. Langfuse's advantage is the broadest SDK support, active open-source development, and clean self-hosted deployment that works with any LLM stack.
When Not to Use It
Langfuse is not a general application performance monitoring tool — for server metrics, error rates, and latency histograms, use Grafana with Prometheus instead. If you are still in early prototyping with no production traffic, the overhead of setting up tracing is not yet worth it — wait until you have real user conversations to analyze. Langfuse is also not a real-time alerting system; it does not send alerts when your error rate spikes. For real-time alerting on LLM failures, combine Langfuse with Uptime Kuma or a webhook-based notification system.
Need Help Setting Up Langfuse?
PrismaTechWork provides end-to-end infrastructure services — from initial deployment and security hardening to ongoing monitoring, automated backups, and dedicated support. Whether you need a single-server setup or a multi-site network, our team ensures your infrastructure is built right, secured properly, and maintained reliably.
Frequently Asked Questions
What is LLM observability and why do I need it?
LLM observability is the practice of capturing, storing, and analyzing the inputs and outputs of every LLM call in your application. Without it, when your chatbot gives a wrong answer, you have no record of what prompt was sent, what context was retrieved, or what the model actually said. With Langfuse, every interaction is stored as a trace you can replay, search, and score. This makes debugging, quality improvement, and cost control possible instead of guesswork.
Does Langfuse work with Ollama and local LLMs?
Yes. Langfuse works with any LLM — cloud or local. For Ollama, use the standard Langfuse SDK's manual tracing: wrap your Ollama API calls with `langfuse.generation()` context managers. Alternatively, if you use Dify or Flowise as your orchestration layer, both have native Langfuse integration that traces automatically without code changes in your application. Cost tracking for local models will show zero token cost since there is no billing, but latency and token count metrics still work.
How do I add Langfuse tracing to an existing OpenAI application?
A trace represents one complete user interaction — for example, a user asking a question and getting an answer. A span is a timed unit of work within a trace — for example, a retrieval step, a tool call, or a post-processing step. A generation is a specific span representing an LLM call with input prompt and output completion. Generations have special fields for model name, token counts, and cost. This hierarchy lets you see the full chain of steps in complex agents and pinpoint exactly which step caused a poor response.
What is the difference between a trace, a span, and a generation in Langfuse?
A trace represents one complete user interaction — for example, a user asking a question and getting an answer. A span is a timed unit of work within a trace — for example, a retrieval step, a tool call, or a post-processing step. A generation is a specific span representing an LLM call with input prompt and output completion. Generations have special fields for model name, token counts, and cost. This hierarchy lets you see the full chain of steps in complex agents and pinpoint exactly which step caused a poor response.
Can Langfuse automatically evaluate the quality of LLM responses?
Yes. Langfuse has an automated evaluator system. You define an evaluator with a criteria (e.g., 'Is the answer factually correct and helpful?') and an LLM model to run the evaluation (e.g., GPT-4o or a local Ollama model). Langfuse then runs this evaluator on every new trace and stores a numeric score. You can set up multiple evaluators for different quality dimensions — correctness, tone, safety, relevance — and track score trends over time to detect regressions after prompt or model changes.
How does Langfuse track LLM costs?
Langfuse captures token counts from every LLM call response and multiplies them by the configured price per token for the model. It has built-in pricing for OpenAI, Anthropic, and other major providers. For local models like Ollama, cost is tracked as zero. The cost dashboard shows spending broken down by model, project, user, and time period. You can also set custom pricing for any model. This gives you an accurate picture of your AI infrastructure cost without relying on provider billing dashboards.
Does Langfuse integrate with Dify and Flowise?
Yes. Dify has native Langfuse integration — go to Settings → Integrations → Langfuse in Dify and enter your Langfuse host URL and API keys. All Dify app calls, RAG retrievals, and workflow steps are automatically traced and visible in Langfuse. Flowise also supports Langfuse through its analytics integration settings. This means you can get full observability on your Dify or Flowise applications without modifying any application code.
How do I manage and version prompts in Langfuse?
Langfuse has a prompt management feature under the Prompts section of the dashboard. Create a named prompt, write your template with variables (e.g., {{user_name}}), and save it as version 1. In your application, fetch the prompt by name using the SDK — it always pulls the latest production version. When you update the prompt, the new version is saved separately and can be compared against the previous version in traces. You can roll back to any previous version instantly by changing the production pointer without redeploying your application.
