How to Build LLM Observability with Langfuse

TL;DR

The Goal: Build a robust LLM observability pipeline.
The Stack: Langfuse, OpenAI, Python.
The Outcome: Full visibility into token usage, latency, and prompt performance across your AI applications.

Why Build This?

When you build AI applications, logging raw inputs and outputs is not enough. You need visibility into multi-step agentic chains, token costs, and user feedback. Without a dedicated observability platform, you are flying blind when your AI misbehaves or when bills spike unexpectedly. Building a custom tracing solution from scratch is time-consuming and error-prone.

Langfuse is an open-source observability platform for LLM applications, filling the role that a tool like Datadog fills for conventional services. By integrating it into your stack, you get structured tracing, evaluation metrics, and a centralized prompt management system. This turns a black-box AI system into a debuggable engineering asset, so you can iterate on prompts with confidence and monitor production performance in near real time.

The Architecture

The architecture centers on the Langfuse SDK wrapping the calls between your application logic and the LLM provider. When a user triggers an action, your backend creates a top-level trace in Langfuse. As the request flows through processing steps, retriever nodes, and LLM inferences, child spans and generations are attached to this root trace. The SDK queues these events and sends them to the Langfuse backend in batches from a background thread, so tracing adds little latency to the request path. The Langfuse dashboard then aggregates this telemetry for visualization and analysis.
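To make the trace tree and the background batching concrete, here is a toy model written with the standard library only. It is not the Langfuse SDK: the `ToyTracer` and `Observation` names, the batch size, and the in-memory `sent_batches` list (standing in for the HTTP backend) are all assumptions made for illustration.

```python
import queue
import threading
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """One node in a trace tree: the root trace, a span, or a generation."""
    name: str
    kind: str  # "trace", "span", or "generation"
    trace_id: str
    parent_id: Optional[str] = None
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


class ToyTracer:
    """Queues observations and ships them in batches from a background thread."""

    def __init__(self, batch_size: int = 3):
        self.batch_size = batch_size
        self.sent_batches = []  # stands in for the remote backend
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, name, kind, trace_id=None, parent_id=None):
        # A root trace gets a fresh trace_id; children reuse their root's id.
        obs = Observation(name, kind, trace_id or uuid.uuid4().hex, parent_id)
        self._queue.put(obs)  # returns immediately; no network call here
        return obs

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: flush the remainder and stop
                if batch:
                    self.sent_batches.append(batch)
                return
            batch.append(item)
            if len(batch) >= self.batch_size:
                self.sent_batches.append(batch)
                batch = []

    def shutdown(self):
        """Flush pending observations and wait for the worker to finish."""
        self._queue.put(None)
        self._worker.join()


tracer = ToyTracer(batch_size=3)
root = tracer.record("handle_request", "trace")
tracer.record("retrieve_docs", "span", root.trace_id, root.id)
tracer.record("summarize", "generation", root.trace_id, root.id)
tracer.record("post_process", "span", root.trace_id, root.id)
tracer.shutdown()
```

The request path only pays for a queue insert. The worker thread does the sending, which is the property the architecture relies on.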

Step-by-Step Implementation

Setting up the Project

First, install the Langfuse Python SDK and the OpenAI client (pip install langfuse openai). Then set environment variables for your OpenAI API key (OPENAI_API_KEY) and your Langfuse project credentials: public key, secret key, and host URL.

import os
from langfuse import Langfuse
from langfuse.openai import openai

# Initialize Langfuse client
langfuse = Langfuse(
    public_key=os.environ.get("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.environ.get("LANGFUSE_SECRET_KEY"),
    host="https://cloud.langfuse.com" # or your self-hosted URL
)

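Initialize the client once, at module level, so every request shares the same connection pool and the same background queue for batching events. In short-lived scripts, call langfuse.flush() before exiting so queued events are not lost.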
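Because the snippet above reads credentials with os.environ.get, a missing variable silently becomes None. A small startup check can surface that earlier. This is a minimal sketch: the `missing_credentials` helper is hypothetical, and it assumes the OpenAI client reads its key from `OPENAI_API_KEY`.

```python
import os

# Variables this tutorial relies on.
REQUIRED_VARS = ("OPENAI_API_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```

Running the check before constructing the client turns a confusing authentication error into a clear message at startup.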

Tracing LLM Calls

The simplest way to get started is the Langfuse drop-in replacement for the OpenAI client. It automatically captures token usage, latency, and model parameters, so you do not have to create a span manually for every call.

def generate_summary(text: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful summarization assistant."},
            {"role": "user", "content": text}
        ],
        name="summary_generation",
        metadata={"domain": "financial_reports"}
    )
    return response.choices[0].message.content

print(generate_summary("Quarterly earnings exceeded expectations..."))

The name and metadata keyword arguments are Langfuse additions handled by the wrapper, not OpenAI parameters. They tag the generation in Langfuse, making it easy to filter and aggregate costs for specific features in the dashboard.
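The dashboard does this grouping for you, but the idea is simple enough to sketch offline. The example below assumes you have exported generations as dictionaries. The record shape and the token counts are invented for illustration and do not reflect a Langfuse export format.

```python
from collections import defaultdict


def tokens_by_feature(generations):
    """Sum prompt and completion tokens per generation name."""
    totals = defaultdict(int)
    for gen in generations:
        totals[gen["name"]] += gen["prompt_tokens"] + gen["completion_tokens"]
    return dict(totals)


# Hypothetical records; real numbers would come from your own traces.
sample = [
    {"name": "summary_generation", "prompt_tokens": 120, "completion_tokens": 40},
    {"name": "summary_generation", "prompt_tokens": 200, "completion_tokens": 60},
    {"name": "chat_reply", "prompt_tokens": 50, "completion_tokens": 25},
]

totals = tokens_by_feature(sample)
```

Consistent names matter here: two spellings of the same feature would be counted as two separate features.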

Managing Prompts

Hardcoding prompts in your codebase makes iteration slow. Langfuse offers a prompt management system where prompts are versioned in Langfuse and fetched at runtime.

# Fetch the production version of the summarizer prompt.
# type="chat" makes compile() return a list of chat messages rather than a string.
prompt = langfuse.get_prompt("summarizer_prompt", type="chat")

# Compile it with runtime variables
compiled_messages = prompt.compile(input_text="Quarterly earnings exceeded expectations...")

# Execute with the parameters defined in the Langfuse UI
response = openai.chat.completions.create(
    model=prompt.config.get("model", "gpt-4"),
    temperature=prompt.config.get("temperature", 0.7),
    messages=compiled_messages,
    name="managed_summary_generation",
    langfuse_prompt=prompt  # link this generation to the prompt version
)

This approach separates prompt engineering from application logic, allowing non-technical team members to tweak prompts and evaluate their performance without triggering a redeploy.
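Langfuse prompt templates mark variables with double braces, such as {{input_text}}. The toy function below mimics that substitution for a chat prompt so the compile step is easier to picture. It is a sketch, not the SDK's implementation, and the `compile_chat_prompt` name and the template text are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_chat_prompt(messages, **variables):
    """Fill {{name}} placeholders in each message; unknown names are left as-is."""
    def fill(text):
        return PLACEHOLDER.sub(
            lambda m: str(variables.get(m.group(1), m.group(0))), text
        )
    return [{"role": m["role"], "content": fill(m["content"])} for m in messages]


# Hypothetical template, as it might be stored in prompt management.
template = [
    {"role": "system", "content": "You summarize {{domain}} text."},
    {"role": "user", "content": "{{input_text}}"},
]

compiled = compile_chat_prompt(
    template, input_text="Quarterly earnings exceeded expectations..."
)
```

Leaving unknown placeholders untouched makes a missing variable visible in the traced input instead of silently producing an empty string.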

The Results & Takeaways

Visibility: Achieved granular tracking of token costs per user and per feature.
Speed: Reduced debugging time by visualizing exactly which step in a complex chain failed.
Agility: Enabled instant prompt updates via cloud management, bypassing deployment cycles.

What you can build next: You can extend this architecture with a user feedback loop. Pass the Langfuse trace ID to your frontend and let users upvote or downvote the AI response. Record each vote as a score on that trace in Langfuse, then use the scored traces to assemble an evaluation dataset for future model fine-tuning.
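One way to structure the backend half of that feedback loop is to convert each vote into a score record before handing it to the SDK. The helper below is a hypothetical sketch: the `feedback_to_score` name, the `user_feedback` score name, and the field names are assumptions chosen to resemble a Langfuse score, so check the SDK reference for the exact call and parameters in your version.

```python
def feedback_to_score(trace_id: str, vote: str, comment: str = "") -> dict:
    """Map a thumbs-up or thumbs-down vote to a score record for one trace."""
    values = {"up": 1, "down": 0}
    if vote not in values:
        raise ValueError(f"vote must be 'up' or 'down', got {vote!r}")
    score = {"trace_id": trace_id, "name": "user_feedback", "value": values[vote]}
    if comment:
        score["comment"] = comment
    return score


# Example: the frontend posts back the trace ID it received with the response.
score = feedback_to_score("trace-123", "down", comment="Missed the revenue figure")
```

Validating the vote on the server keeps malformed frontend input out of your evaluation data.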

Frequently Asked Questions

Why not log LLM calls directly to a database?

Logging directly to a database requires you to build custom schemas for nested traces, token counting logic, and visualization dashboards. Langfuse provides all of this out of the box, tailored to the non-deterministic nature of LLMs, which can save weeks of engineering effort.