Making the other 90% observable

A joint post by OpenInfer and ClawMetry on what changes when the inference substrate becomes an operating system.

vivekchand with Behnam Bastani (OpenInfer) · 6 min read

TL;DR

The OpenInfer OS layer treats every agent request as a scheduling decision across heterogeneous silicon. Latency-critical turns go to premium GPU compute. The other 90% of agentic workloads (long-running, latency-tolerant, always-on) go to cheap CPU and Graviton silicon. This is multi-SLA inference, and it changes the operator's relationship with their dashboard. Token-by-model breakdowns from the LLM provider stop matching the bill. ClawMetry, the open-source observability layer for OpenClaw with 210k+ installs, closes the gap by attributing every scheduling decision back to the session, sub-agent, and tool call that caused it. Together you get the cost curve of heterogeneous inference and the explainability of a single-vendor stack.

[Figure: GPU and CPU pools under OpenInfer distributed inference, with ClawMetry above them as the observability layer (tokens, cost, sub-agents, spans).]
OpenInfer schedules each turn to the cheapest silicon that meets its SLA. ClawMetry observes every scheduling decision and attributes it back to the session, sub-agent, and tool call.

Why this matters now

For a decade, inference has been treated as a fixed cost. Pick a model, pick a provider, pay per token. That made sense when every turn looked the same. It does not make sense for agentic workloads, where most turns are background work that no human is waiting on. A coding agent that polls for build status, a research agent that drafts intermediate notes, a customer-support agent that summarizes a closed ticket: none of those need GPU latency, and most of them do not need GPU economics either.

This is the heterogeneous compute era. The interesting question stopped being which GPU you run on. It became which silicon, at which SLA, for which turn. Answering that question is what an operating system does. The OpenInfer OS layer schedules inference across GPUs, CPUs, NPUs, and accelerators the way an OS schedules processes across cores, memory tiers, and devices. Multi-SLA awareness is the kernel primitive that makes the scheduling decision possible: every request carries an SLA budget, and the OS routes it to the cheapest silicon that meets the budget.
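The routing rule in that last sentence can be sketched in a few lines of Python. This is an illustration only, not OpenInfer's scheduler: the Pool and Request types, the pool names, and every price and latency below are made-up assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Pool:
    """A hypothetical pool of silicon; all fields are illustrative."""
    name: str
    usd_per_1k_tokens: float  # assumed price, not a real quote
    est_latency_s: float      # assumed end-to-end latency for one turn


@dataclass(frozen=True)
class Request:
    """A hypothetical agent turn carrying its SLA budget."""
    turn_id: str
    sla_latency_s: float      # the turn must finish within this many seconds


def route(request: Request, pools: Sequence[Pool]) -> Optional[Pool]:
    """Return the cheapest pool whose estimated latency meets the SLA."""
    eligible = [p for p in pools if p.est_latency_s <= request.sla_latency_s]
    if not eligible:
        return None  # nothing meets the budget; the caller decides what to do
    return min(eligible, key=lambda p: p.usd_per_1k_tokens)


# Two illustrative pools: a fast, expensive one and a slow, cheap one.
POOLS = [
    Pool("gpu", usd_per_1k_tokens=0.0100, est_latency_s=0.5),
    Pool("cpu", usd_per_1k_tokens=0.0020, est_latency_s=8.0),
]

chat_turn = Request("reply-to-user", sla_latency_s=1.0)
background = Request("summarize-closed-ticket", sla_latency_s=60.0)
print(route(chat_turn, POOLS).name)   # the latency-critical turn
print(route(background, POOLS).name)  # the latency-tolerant turn
```

A real scheduler would also have to account for queue depth, capacity, and where a session's state lives; the sketch keeps only the "cheapest pool that meets the budget" rule.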

That reframe is why this post exists. When the substrate gets smart, the operator's existing tooling goes blind. ClawMetry is what gives the operator their eyes back.

When inference becomes a scheduling problem

Take the most common dashboard question: why did yesterday's spend spike?

In a single-vendor world the answer is one row deep: N calls times M tokens per call times $X per token.

In an OpenInfer-scheduled world the right answer is two layers deep:

"42% of yesterday's sessions ran on the L40S queue at ~40 tok/s. The other 58% ran on EPYC CPUs at ~20 tok/s aggregate. The spike came from one runaway agent that re-spawned 17 sub-agents inside a single Telegram thread, and the OS correctly scheduled all of those long-tail turns onto the cheap pool, which is why the spike was 4x smaller than it would have been on single-vendor infra."

You cannot get that answer from inference-layer metrics alone. The scheduler does not know which sub-agent is talking to which user. The agent runtime does not know which silicon ran each turn. The bill does not break down by route. This is the observability gap that comes with multi-SLA scheduling, and it is the price of admission for the cost savings the scheduling makes possible.

The combined product closes it.
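Conceptually, closing the gap is a join between what the agent layer knows (session, sub-agent, tokens) and what the scheduling layer knows (the route each call took). The Python sketch below shows that join on toy data. The record shapes, field names, and per-token prices are assumptions for illustration, not the ClawMetry or OpenInfer schema.

```python
from collections import defaultdict

# What the agent layer knows: who made each call and how many tokens it used.
agent_calls = [
    {"call_id": "c1", "session": "s1", "sub_agent": "root", "tokens": 1200},
    {"call_id": "c2", "session": "s1", "sub_agent": "worker-1", "tokens": 4000},
    {"call_id": "c3", "session": "s2", "sub_agent": "root", "tokens": 800},
]

# What the scheduling layer knows: which pool served each call.
routes = {"c1": "gpu", "c2": "cpu", "c3": "cpu"}

# Assumed per-token prices for each pool (illustrative numbers only).
USD_PER_TOKEN = {"gpu": 0.00001, "cpu": 0.000002}


def cost_by_session_and_route(calls, routes, prices):
    """Sum tokens * price for every (session, route) pair."""
    totals = defaultdict(float)
    for call in calls:
        pool = routes[call["call_id"]]
        totals[(call["session"], pool)] += call["tokens"] * prices[pool]
    return dict(totals)


for (session, pool), usd in sorted(
    cost_by_session_and_route(agent_calls, routes, USD_PER_TOKEN).items()
):
    print(f"{session} {pool} ${usd:.4f}")
```

Neither input answers the spend question alone: the first has no route, and the second has no session. Only the joined view does.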

What the OpenInfer OS layer gives you

For OpenClaw operators new to OpenInfer, the OS layer is a drop-in replacement for the inference layer. Concretely:

  • Single-file integration. Drop a config.json into your OpenClaw workspace. No changes to agent code, tools, or skills. Your agent calls its model the way it always did, and the OS layer routes the request behind an OpenAI-compatible endpoint to whichever processor in its mesh fits the SLA at the lowest cost.
  • The other 90% in the cheap pool. Latency-critical turns go to GPU. Long-running, routine, always-on turns (the dominant cost driver in a working agentic system) go to CPU and Graviton. You do not pick. The scheduler does, with multi-SLA awareness.
  • Cross-processor session migration. A file-backed three-tier KV cache (VRAM, RAM, NVMe) lets a session migrate between processors at prefill and decode boundaries without re-paying the prefill cost. The agent never knows it moved.
  • Capacity headroom on hardware you already pay for. OpenInfer's reference benchmark shows roughly 50% more capacity on a single AWS g6e.16xlarge (L40S + EPYC 7R13) by recruiting otherwise-idle CPUs into the inference fabric. Same dollar, more sessions.
  • Available now in beta at openinfer.io/beta at no cost during the trial.

For the deeper architecture (vertical disaggregation, the batchEngine, custom Q4_0 kernels, ISA-tuned dot products) see OpenInfer's vertical disaggregation post.

What ClawMetry shows you about the scheduling

ClawMetry is the open-source observability layer for OpenClaw agents. Install it with pip install clawmetry; it has 210k+ installs across 123+ countries. Because it observes at the agent layer, not the inference layer, it attributes every token, dollar, and second back to the session, sub-agent, and tool call that caused it, regardless of which processor the OS layer scheduled the turn onto. No new instrumentation is required. ClawMetry's HTTP interceptor is a small monkey-patch on the OpenClaw process's HTTPX/Requests stack, and it sees every inference call as it goes out.
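The monkey-patch technique itself can be shown without touching a real HTTP library. The sketch below is not ClawMetry's interceptor: FakeHTTPClient, install_interceptor, and the recorded fields are hypothetical stand-ins that demonstrate wrapping a client's post method so each outgoing call is recorded and then forwarded unchanged.

```python
import functools
import time


class FakeHTTPClient:
    """Stand-in for an HTTP client; returns a canned response, no network."""

    def post(self, url, json=None):
        return {"status": 200, "usage": {"total_tokens": 42}}


CAPTURED = []  # in-memory log of observed calls


def install_interceptor(client_cls):
    """Wrap client_cls.post so every call is recorded, then forwarded."""
    original_post = client_cls.post

    @functools.wraps(original_post)
    def observed_post(self, url, json=None):
        started = time.perf_counter()
        response = original_post(self, url, json=json)
        CAPTURED.append({
            "url": url,
            "model": (json or {}).get("model"),
            "tokens": response["usage"]["total_tokens"],
            "seconds": time.perf_counter() - started,
        })
        return response  # the caller sees the unmodified response

    client_cls.post = observed_post
    return original_post  # keep a handle so the patch can be undone


# Patch once, then use the client exactly as before.
install_interceptor(FakeHTTPClient)
reply = FakeHTTPClient().post(
    "http://localhost/v1/chat/completions", json={"model": "demo"}
)
print(reply["status"], CAPTURED[0]["model"], CAPTURED[0]["tokens"])
```

The calling code does not change, which is why this style of observation needs no new instrumentation in the agent.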

What that looks like in the dashboard:

  • Brain tab. The unified real-time stream annotates each LLM turn with the route taken (CPU pool vs GPU pool), so you can scroll a Telegram chat replay and see "this turn cost $0.0008, ran on EPYC, 1.4s end-to-end" alongside the user's message and the agent's reply. The substrate stops being a black box.
  • Tokens tab. Token and cost charts split by scheduling decision. The same model served on two pools shows up as two cost lines, so you can see exactly what fraction of the spend is the cheap path. "86% of our token spend went through the other-90% pool this week" becomes a number an operator can show their finance team.
  • Sessions tab. Per-session cost attribution with model and route mix. Useful for billing, quota, and the "which user is expensive?" question.
  • Sub-agent tracker. A runaway agent that fanned out 17 sub-agents shows up as a tree, each leaf with its own cost. You can see exactly where the spend went and which leaves the OS correctly scheduled away from the premium compute.
  • Alerts. Set a rule once: "page me if any single session exceeds $5, or any agent spawns more than 10 sub-agents in 60 seconds." Works across the OS-scheduled fleet without per-route configuration.
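The example alert rule above combines a cost threshold with a sliding-window rate limit. A minimal Python sketch of both checks follows. SpawnRateRule and session_over_budget are hypothetical names, and this is not ClawMetry's rule engine or configuration format.

```python
from collections import deque


class SpawnRateRule:
    """Fires when more than `limit` spawns land within `window_s` seconds."""

    def __init__(self, limit=10, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._times = deque()  # spawn timestamps still inside the window

    def record_spawn(self, now_s):
        """Record one spawn at time now_s; return True if the rule fires.

        Assumes timestamps arrive in non-decreasing order.
        """
        self._times.append(now_s)
        # Drop spawns that have fallen out of the sliding window.
        while self._times and now_s - self._times[0] >= self.window_s:
            self._times.popleft()
        return len(self._times) > self.limit


def session_over_budget(session_cost_usd, budget_usd=5.0):
    """True when a single session's cost exceeds the budget."""
    return session_cost_usd > budget_usd


# Eleven spawns in eleven seconds: only the eleventh trips the rule.
burst_rule = SpawnRateRule(limit=10, window_s=60.0)
burst = [burst_rule.record_spawn(float(t)) for t in range(11)]
print(burst[-1], session_over_budget(5.01))
```

Both checks depend only on agent-layer events (spawns and per-session cost), so neither needs to know which pool served a turn.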

Two pieces are worth calling out specifically, because they exist precisely because the substrate now makes scheduling decisions:

  1. The cost-by-route view. Two new lines in the Tokens tab: GPU pool and CPU pool. Both are computed locally on the operator's machine (no inference internals leak), and they make the "other 90%" cost story that OpenInfer already tells provable to the operator's own finance team, on the operator's own data.
  2. The kill switch. Most agentic systems today have no graceful stop. When a sub-agent recursion goes wrong, you watch the token meter run until the API key hits a quota wall. ClawMetry's per-session kill switch talks back to the OpenClaw gateway and ends the runaway in place. Combined with the OS layer's per-tier SLA budgets, operators get automatic cost control at the substrate and explicit user control at the agent, in the same dashboard.

Try it together

If you're already on the OpenInfer beta, ClawMetry plugs in without configuration:

pip install clawmetry && clawmetry

If you haven't tried OpenInfer yet:

  • Start the beta at openinfer.io/beta (free during the trial)
  • Install ClawMetry: pip install clawmetry && clawmetry

If you operate OpenClaw in production (multi-tenant workloads, sensitive data, or anything where cost and explainability both matter) this combination is the closest thing to a complete agent operations stack today. OpenInfer makes the substrate schedule smartly. ClawMetry makes everything that happens on it observable, attributable, and controllable. Neither requires changes to your agent code.

What's next

Two pieces of feedback from beta users in recent weeks are already on the ClawMetry roadmap and ship in the next release:

  • A raw payload toggle in the Brain tab so users studying OpenClaw's behaviour can flip between the structured view and the exact bytes sent upstream.
  • Memory-access history. Clicking on a memory file shows the sessions and turns that read or wrote it, so you can ask "why does the agent think this?" and trace the answer.

The next generation of AI infrastructure will not be defined by a single model or a single accelerator. It will be defined by intelligent scheduling across heterogeneous compute and by the operational visibility to trust those decisions in production. OpenInfer and ClawMetry are building that stack together.

Quotes

"OpenInfer turned the inference substrate from a fixed cost into a scheduling problem. ClawMetry's job is to make sure that, when the substrate gets smart, the operator does not get blind. Together you get the cost curve of heterogeneous compute and the explainability of a single-vendor stack, without the trade-off either has alone."
Vivek Chand, founder, ClawMetry
"For a decade we treated inference as a fixed cost. Pick a model, pick a provider, pay per token. That assumption breaks the moment you have multiple workloads, multiple models, and multiple SLOs operating on a constrained cluster, often a heterogeneous one. Inference stops being a fixed cost and becomes a scheduling problem. The OpenInfer OS layer makes this possible. In real time, it assigns each hardware resource to the workloads it handles best, within the operator's defined budget and SLO constraints. Multi-SLO scheduling across heterogeneous compute is one piece of the category. Dynamic placement as demand and capacity shift in real time, and the day-to-day operations of running that cluster end-to-end, are the rest. The OpenInfer OS is the system we built to make all of that work together."
Behnam Bastani, CEO, OpenInfer

The operator should never have to choose between cost and visibility

Open source. Free forever for one node. Works on OpenClaw, Claude Code, Codex, Cursor, Hermes, and every channel your agents touch.

Contact: hello@clawmetry.com · contact@openinfer.io