
Langfuse vs. ClawMetry: an honest comparison for AI agent observability

8 min read · By Vivek Chand

I get asked about this comparison at least a few times a week: “We’re already using Langfuse, so why would we switch?” or “Langfuse vs. ClawMetry, which one?” So here is the honest answer, including the part where I tell you to pick Langfuse.

I’m the founder of ClawMetry, so take my framing with that in mind. But I’ve deliberately tried to be accurate here rather than promotional. If you catch anything wrong, open an issue.

What each tool actually is

Langfuse is an open-source LLM engineering platform. Its core abstraction is the trace: every LLM call you make gets wrapped in a span, and Langfuse records latency, cost, input/output tokens, and the prompt. On top of that you get prompt management, datasets for evaluation, and scoring pipelines for measuring output quality. You instrument your code with their SDK (Python, JS/TS) or use one of the many integrations (LangChain, OpenAI, LlamaIndex, etc.).

ClawMetry is an open-source observability dashboard for AI agents across 12 runtimes: OpenClaw, NVIDIA NeMoClaw, Claude Code, Codex, and 8 more. Its core abstraction is the agent session: what tasks the agent worked on, which tools it called, what sub-agents it spawned, what cron jobs ran, and how memory files changed. Zero instrumentation required for supported runtimes — ClawMetry auto-detects by reading each runtime’s session files and (for OpenClaw) connecting to the gateway WebSocket.

That difference — SDK instrumentation vs. zero-config auto-detection — cascades into almost every other dimension of this comparison.

Quick comparison

| Dimension | Langfuse | ClawMetry |
| --- | --- | --- |
| Instrumentation | SDK wraps (Python, JS/TS) or framework integrations | Zero-config: auto-detects 12 agent runtimes (OpenClaw, Claude Code, Codex, NVIDIA NeMoClaw…) |
| Data model | Trace → span → observation → score | Session → tool_call → sub-agent → cron_job → memory_diff |
| Data residency | Cloud: data sent to Langfuse servers. Self-host: local but needs Docker | Local-first by default. E2E encrypted cloud sync optional |
| OSS license | MIT | MIT |
| Cloud pricing | Free up to 50K observations/month, then per-event | OSS free forever. Cloud-Pro (alerts, fleet, Slack) on subscription |
| LLM provider support | OpenAI, Anthropic, Google, Mistral, Ollama, and more | OpenClaw gateway (which routes to any model) |
| Agent-native concepts | Traces and spans cover LLM calls. No cron jobs, memory diffs, or sub-agent trees | Cron jobs, memory file diffs, sub-agent cost attribution, fleet view |
| Prompt management | Yes: versioned prompts, A/B testing, datasets | No |
| Time-to-first-insight | Minutes to hours (SDK integration, trace config) | pip install clawmetry (literally one command) |
| Storage backend | Postgres (self-host) or Langfuse Cloud | DuckDB (local, columnar, zero config) |

Pricing model

Langfuse Cloud starts free at 50,000 observations per month. An “observation” is roughly one LLM call or one span in a trace. If you run a moderately busy agent that makes 20 LLM calls per session and does 100 sessions a day, you generate about 2,000 observations a day, or 60,000 over a 30-day month, so you cross the free tier in roughly 25 days. After that you’re paying per event.
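To sanity-check a workload against the free tier, a few lines of arithmetic are enough. The sketch below assumes one observation per LLM call (the post only says "roughly") and ignores any extra spans your integration might emit. The function names are illustrative and are not part of either tool.

```python
import math


def observations_per_day(calls_per_session: int, sessions_per_day: int) -> int:
    """Estimate daily observations, assuming one observation per LLM call."""
    return calls_per_session * sessions_per_day


def days_to_reach_allowance(
    calls_per_session: int,
    sessions_per_day: int,
    free_observations: int = 50_000,  # free-tier figure quoted in this post
) -> int:
    """Number of days until cumulative observations reach the free allowance."""
    per_day = observations_per_day(calls_per_session, sessions_per_day)
    if per_day <= 0:
        raise ValueError("workload must produce at least one observation per day")
    return math.ceil(free_observations / per_day)
```

For 20 calls per session and 100 sessions a day, this gives 2,000 observations a day and 25 days to reach the 50,000 allowance.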

That per-event model is completely reasonable for typical LLM apps — it scales with usage and the unit cost is low. But for teams running high-frequency agent loops (dozens of tool calls per task, hundreds of tasks per day), the meter runs fast.

ClawMetry OSS is free forever. The sync daemon, the dashboard, the DuckDB store — all free. Cloud-Pro adds the features that genuinely need a server: multi-node fleet view, alert rules, Slack/PagerDuty webhooks, and retention beyond 24 hours.

The right framing: if you measure cost in “observations per month,” you’re buying LLM tracing. If you measure value in “agent sessions observed without leaking data to a third party,” you’re buying agent observability. Those are different buying decisions.

Data residency

This is the clearest difference and the one that matters most to the teams I talk to in regulated industries.

Langfuse Cloud sends your traces to Langfuse’s servers. That includes your prompt content, model outputs, and token counts. Langfuse self-host keeps data on your own Postgres instance, but you need Docker, a Postgres setup, and ongoing infrastructure maintenance.

ClawMetry is local-first. The sync daemon reads OpenClaw’s filesystem and stores everything in a local DuckDB file on the agent’s machine. Nothing leaves your environment by default. If you opt into Cloud, the snapshot is AES-256-GCM encrypted before it leaves — we can’t read it, only your browser can decrypt it.

For a company where “agent prompt content touches internal documents” is a data sovereignty issue, that distinction is load-bearing.

AI agent specificity

Langfuse was designed for LLM tracing — instrumenting individual API calls to measure latency, cost, and output quality. That’s genuinely valuable, and Langfuse does it well.

But agent observability is a different problem. When an agent spawns three sub-agents to parallelize a task, you don’t just want to know that nine LLM calls happened. You want to know: which sub-agent made which calls, which one failed silently, and what the total cost across the tree was. When a cron job runs at 3 AM and silently completes zero tasks, you want that surfaced — not buried in a trace you have to know to go look for.

ClawMetry’s data model is built around these concepts from the start. A session is the top-level unit. Inside it you see tool calls (with latency and outcome), sub-agents (with their own sessions nested inside), cron jobs (with their schedule adherence), and memory diffs (which SOUL.md or MEMORY.md lines changed). These concepts don’t map cleanly onto Langfuse’s trace/span model, because that model was never designed to represent them.

ClawMetry sub-agent tree view: each node is a session with its own cost, duration, and outcome. Langfuse traces individual API calls inside those nodes; ClawMetry connects them.
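To make the session-tree idea concrete, here is a minimal sketch of cost attribution across nested sub-agent sessions. It is not ClawMetry's actual schema: the `Session` and `ToolCall` classes and all field names are hypothetical, chosen only to mirror the concepts described above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ToolCall:
    name: str
    latency_ms: float
    ok: bool  # False when the tool call failed


@dataclass
class Session:
    session_id: str
    own_cost_usd: float = 0.0  # cost of LLM calls made directly in this session
    tool_calls: List[ToolCall] = field(default_factory=list)
    sub_agents: List["Session"] = field(default_factory=list)  # nested sessions


def total_cost(session: Session) -> float:
    """Cost of this session plus every sub-agent session beneath it."""
    return session.own_cost_usd + sum(total_cost(child) for child in session.sub_agents)


def sessions_with_failures(session: Session) -> Iterator[str]:
    """Yield ids of sessions in the tree that had at least one failed tool call."""
    if any(not call.ok for call in session.tool_calls):
        yield session.session_id
    for child in session.sub_agents:
        yield from sessions_with_failures(child)
```

Because each sub-agent is itself a session, the total for a tree and the question of which node failed both reduce to a recursive walk. With flat traces you would have to rebuild that parent-child structure yourself.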

OSS posture

Both tools are genuinely open source. Langfuse is MIT licensed and has a healthy contributor base; you can self-host the full product including the frontend. ClawMetry is MIT licensed and the entire observability stack (sync daemon, DuckDB store, dashboard) is open source on GitHub.

One meaningful difference: Langfuse self-host requires running Postgres, a background worker, and a Next.js frontend. ClawMetry OSS is pip install clawmetry && clawmetry. No Docker, no database setup, no frontend build. The single-command install is a design constraint we’ve held deliberately — DuckDB is the reason we can do it (columnar, embedded, zero config).

Time-to-first-insight

Langfuse requires SDK instrumentation. You install the package, wrap your LLM calls, configure the tracing client with your project key, and deploy. For teams already using LangChain or the OpenAI SDK, the integration is 5–10 lines and takes under an hour. For teams with custom LLM pipelines or proprietary agent frameworks, the integration is non-trivial.

ClawMetry on any of its 12 supported runtimes is one command:

pip install clawmetry && clawmetry

That’s it. No config files, no API keys, no code changes to your agent. ClawMetry reads session transcripts from ~/.openclaw/agents/main/sessions/*.jsonl, connects to the gateway WebSocket, and starts surfacing data. The dashboard is live in under 60 seconds from install.
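If you are curious what reading session transcripts looks like in principle, the sketch below walks a directory of `.jsonl` files and counts the JSON objects in each. It is an illustration, not ClawMetry's implementation. The post does not document the record format, so the code assumes only that each line is a standalone JSON object, and the function names are made up.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one dict per well-formed JSON object line in a .jsonl file."""
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip partial lines, e.g. a file that is mid-write
            if isinstance(record, dict):
                yield record


def count_records(sessions_dir: Path) -> Dict[str, int]:
    """Map each *.jsonl file name in sessions_dir to its number of records."""
    return {
        path.name: sum(1 for _ in iter_records(path))
        for path in sorted(sessions_dir.glob("*.jsonl"))
    }
```

You could point `count_records` at `Path.home() / ".openclaw/agents/main/sessions"` to get a quick per-session record count without touching the agent itself.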

I’m not saying that because we’re better engineers. I’m saying it because we made a different scope decision: ClawMetry targets specific agent runtimes (12 of them today — OpenClaw, Claude Code, Codex, NVIDIA NeMoClaw, and 8 more) and invests in deep auto-detection for each one. Narrow scope is what makes zero-config possible.

The scope trade-off: Langfuse instruments any LLM call in any framework. ClawMetry auto-detects 12 specific agent runtimes with no instrumentation. If you’re on one of those 12, ClawMetry gives you instant visibility with no code changes. If your stack isn’t on the supported list, Langfuse’s SDK approach is the right answer.

Storage backend

Langfuse uses Postgres. That gives you familiar SQL query semantics, good write throughput for tracing workloads, and ecosystem tooling. Self-host means managing a Postgres instance.

ClawMetry uses DuckDB. The choice wasn’t obvious — we started with SQLite and migrated. DuckDB gives us columnar storage, which matters for the queries we run (aggregate token costs by session, top tool calls by latency, model cost attribution over a time window). These are analytical queries over append-only data, and DuckDB handles them significantly faster than row-store SQLite at the volume agent users generate. The embedded nature means no separate database process to manage.

If you want to run ad-hoc SQL against your ClawMetry data locally, you can: the DuckDB file is at ~/.openclaw/.clawmetry/store.duckdb and any DuckDB client can open it read-only while the daemon holds the write lock.

When to pick Langfuse (not us)

I want to be direct about this because “pick the competitor” sections in most comparison posts are mealy-mouthed. Here are the genuine cases where I’d tell you to use Langfuse:

  • Your runtime isn’t one of the 12 ClawMetry supports. ClawMetry auto-detects OpenClaw, Claude Code, Codex, NVIDIA NeMoClaw, and 8 other runtimes. If you’re building on LangChain, OpenAI Assistants, CrewAI, AutoGen, or a fully custom stack that isn’t on that list, ClawMetry won’t auto-detect anything. Langfuse will instrument you in minutes.
  • You need multi-provider LLM tracing. You run experiments across GPT-4o, Claude, Gemini, and local Ollama models and want unified latency + cost comparisons. Langfuse has integrations for all of them. ClawMetry traces whatever model the OpenClaw gateway routes to, but doesn’t give you cross-provider benchmarks.
  • You need prompt management and A/B testing. Langfuse lets you version prompts, run experiments, and score outputs. That’s a real product capability that ClawMetry doesn’t have and isn’t trying to build. If prompt iteration speed is the bottleneck in your team’s workflow, Langfuse is the right tool.
  • You need output evaluation pipelines. LLM-as-a-judge scoring, human annotation queues, dataset management for regression testing — all Langfuse. If you’re doing serious eval work, ClawMetry has nothing here.
  • Your team is already on it and happy. Switching observability tools has real cost. If Langfuse is working for your team, the grass isn’t meaningfully greener on our side unless you’re hitting one of the specific gaps above.

When to pick ClawMetry

  • You run any of the 12 supported runtimes. ClawMetry is the only tool with zero-config auto-detection for OpenClaw, Claude Code, Codex, NVIDIA NeMoClaw, and 8 others. No SDK, no code changes, no deployment ceremony.
  • Data residency is non-negotiable. Regulated industry, on-prem requirement, or you simply don’t want agent prompt content on a third-party server. Local-first means local-first: data stays on your machine unless you opt in to E2E-encrypted cloud sync.
  • You need agent-native visibility. Sub-agent cost attribution, memory file diffs, cron job health, tool call failure rates by tool type — these are ClawMetry-native. You can reconstruct some of this from Langfuse traces with custom spans, but you’d be building ClawMetry yourself on top of Langfuse.
  • You want a fleet view. Running agents on multiple machines or cloud nodes? ClawMetry’s multi-node fleet view aggregates all of them into a single dashboard.

The simple version: Langfuse is the best tool for instrumenting any LLM application, especially for evaluation and prompt experimentation. ClawMetry is the best tool for observing AI agents across the 12 runtimes it supports, without instrumentation, especially when data residency matters.

Using both at once

Some teams do this and it makes sense. Langfuse handles LLM call tracing and eval for the model layer. ClawMetry handles agent session observability for the orchestration layer. They operate at different levels of abstraction and don’t conflict.

If you go this route, ClawMetry + Langfuse gives you the full picture: ClawMetry tells you which sub-agent ran at 2 AM and what it cost at the session level; Langfuse tells you which of the twenty LLM calls inside that session had a cache miss and a 4-second TTFT.

Bottom line

I built ClawMetry because I was running OpenClaw agents and had no visibility into what they were actually doing. Token costs were a black box, cron job failures were silent, and sub-agent status required me to tail three separate log files. pip install clawmetry was the tool I wished existed.

Langfuse solves a different problem: how do you instrument and evaluate any LLM application across providers and frameworks? It’s a genuinely good tool and the team has built something with real depth, especially on the eval side.

Pick the tool that fits the problem you have. If you’re on one of the 12 runtimes ClawMetry supports and want to see what your agents are doing without writing any instrumentation code — and without your session data leaving your machine — that’s us. If you’re building multi-provider LLM applications and need prompt management and eval pipelines, that’s Langfuse.
