AI Agent Reliability: Achieve 99% Accuracy on Multi-Step Workflows Without Changing the Model
May 19, 2026: Antoine Zambelli, AI Director at Texas Instruments, released Forge, an open-source reliability layer for self-hosted LLM tool-calling. The key finding: an 8B local model with Forge guardrails hits 99.3% accuracy on multi-step agentic workflows, while Claude Sonnet without guardrails hits only 87.2%. With guardrails on both, the gap between a free local model on a $600 GPU and a paid frontier API is less than 1 percentage point (99.3% vs. 100%). Here's how to apply the same reliability engineering to your Telegram AI agents.
Here's the math problem that every AI agent builder eventually hits. If each individual step in your workflow succeeds 90% of the time, a 5-step task succeeds only 59% of the time. A 10-step task? 35%. This is the compounding failure problem — and it's the single biggest reason AI agents fail in production.
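The arithmetic is easy to check yourself. The sketch below assumes each step fails independently of the others, which is a simplification; the function names are illustrative and not taken from Forge.

```python
def workflow_success_rate(step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_success ** steps


def required_step_success(target: float, steps: int) -> float:
    """Per-step success rate needed to reach an end-to-end target."""
    return target ** (1 / steps)


if __name__ == "__main__":
    for steps in (1, 5, 10):
        rate = workflow_success_rate(0.9, steps)
        print(f"{steps:>2} steps at 90% each: {rate:.0%} end to end")
    # Working backwards: what each step must achieve for 99% over 10 steps.
    print(f"Per-step rate needed: {required_step_success(0.99, 10):.2%}")
```

Run in reverse, the same formula shows why per-step reliability matters so much: a 99% end-to-end target over 10 steps leaves each step room for roughly one failure in a thousand.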
The Forge paper (accepted to ACM CAIS '26) tested 97 model/backend configurations across 18 scenarios, 50 runs each. The headline numbers:
| Configuration | Accuracy | Cost |
|---|---|---|
| Ministral 8B + Forge guardrails | 99.3% | Free (local, $600 GPU) |
| Claude Sonnet + Forge guardrails | 100% | API cost |
| Ministral 8B (no guardrails) | ~53% | Free (local) |
| Claude Sonnet (no guardrails) | 87.2% | API cost |
Forge is a Cargo Rust framework with a Python SDK. It adds domain-and-tool-agnostic guardrails: retry nudges, step enforcement, error recovery, and VRAM-aware context management. It doesn't change the model — it changes the system around it.
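To make "changing the system around the model" concrete, here is a minimal sketch of a retry nudge in Python. It is not Forge's API: `call_with_retry_nudges`, `must_be_json`, and the canned stand-in model are all hypothetical names chosen for illustration, and a real setup would swap in an actual model call and a validator suited to the task.

```python
import json
from typing import Callable, Optional


def call_with_retry_nudges(
    model: Callable[[str], str],
    validate: Callable[[str], Optional[str]],
    prompt: str,
    max_attempts: int = 3,
) -> str:
    """Call `model`; on an invalid reply, retry with the validator's complaint appended."""
    current_prompt = prompt
    for _ in range(max_attempts):
        reply = model(current_prompt)
        problem = validate(reply)
        if problem is None:
            return reply
        # The nudge keeps the original task and adds what went wrong.
        current_prompt = (
            f"{prompt}\n\nYour previous reply was rejected: {problem}. "
            "Fix that and answer again."
        )
    raise RuntimeError(f"no valid reply after {max_attempts} attempts")


def must_be_json(reply: str) -> Optional[str]:
    """Hypothetical validator: None means OK, otherwise a short problem description."""
    try:
        json.loads(reply)
    except json.JSONDecodeError:
        return "reply was not valid JSON"
    return None


if __name__ == "__main__":
    # Stand-in for a real model: fails once, then returns valid JSON.
    canned = iter(["search for kubernetes", '{"tool": "web_search"}'])
    print(call_with_retry_nudges(lambda _p: next(canned), must_be_json, "Pick a tool."))
```

The model is untouched here. The wrapper only decides whether a reply is acceptable and, if not, tells the model why before asking again.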
You don't need to install Forge to benefit from its findings. The same reliability principles — guardrails, structured steps, retry logic, self-verification — can be encoded directly into your agent prompt on OpenClaw.
Why AI Agents Fail Without Reliability Engineering
The Forge paper identified six failure modes in agentic workflows. Every single one applies to your Telegram agents:
🔴 Failure Mode 1: Premature Execution
The agent acts before it has enough information. Jumping to conclusions on incomplete data is the #1 cause of bad outputs.
🟡 Failure Mode 2: Step Skipping
Multi-step workflows have natural chokepoints. Agents often skip analysis steps and guess the output, especially when the task feels "familiar."
🟠 Failure Mode 3: Error Cascading
One bad step corrupts every subsequent output. Without checkpoints, a small hallucination in step 2 becomes a large hallucination in step 10.
🔵 Failure Mode 4: Context Collapse
As the agent works through a task, it loses track of earlier context. Instructions, constraints, and intermediate findings get dropped.
⚪ Failure Mode 5: False Confidence
When the agent can't find an answer, it fabricates one instead of asking for clarification. This is most common in web research agents that can't access paywalled content.
⚫ Failure Mode 6: Tool Misuse
The agent calls the right tool with the wrong parameters, or calls the wrong tool entirely. Function-calling agents suffer most from this.
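Failure Mode 6 is the easiest of the six to catch mechanically: check a proposed tool call against a declared schema before running it. The sketch below is an assumption-laden illustration, not Forge's implementation; the `web_search` tool, its parameters, and the call format (`{"tool": ..., "args": ...}`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    required: dict[str, type]  # parameter name -> expected Python type


# Hypothetical tool registry for illustration only.
TOOLS = {
    "web_search": ToolSpec("web_search", {"query": str, "max_results": int}),
}


def validate_tool_call(call: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is safe to run."""
    spec = TOOLS.get(call.get("tool"))
    if spec is None:
        return [f"unknown tool: {call.get('tool')!r}"]
    args = call.get("args", {})
    problems = []
    for param, expected in spec.required.items():
        if param not in args:
            problems.append(f"missing parameter: {param}")
        elif not isinstance(args[param], expected):
            actual = type(args[param]).__name__
            problems.append(f"{param} should be {expected.__name__}, got {actual}")
    for extra in sorted(args.keys() - spec.required.keys()):
        problems.append(f"unexpected parameter: {extra}")
    return problems


if __name__ == "__main__":
    bad_call = {"tool": "web_search", "args": {"query": "k8s threats", "max_results": "5"}}
    print(validate_tool_call(bad_call))
```

The returned problem list can be fed straight back to the model as a correction message instead of letting the malformed call reach the tool.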
The Forge paper solved these with structural guardrails. The prompt below does the same for your OpenClaw Telegram agent — encoding retry nudges, step enforcement, error recovery, and self-verification into a single copy-paste prompt.
The Prompt: Your Reliable Multi-Step AI Agent
This prompt builds a research-and-analysis agent that targets near-100% accuracy on multi-step tasks by applying the same reliability principles Forge uses, but as pure prompt engineering with no framework needed.
How to use:
- Deploy OpenClaw on GetClawCloud (one click, zero server setup)
- Paste this prompt as your first message
- Send any complex research or analysis request — the agent handles the rest
💡 Works with any OpenClaw agent that has web search access. The retry nudge and self-verification logic works regardless of the underlying model. The idea is to replace "prompt harder" with "prompt smarter."
Why This Works: Reliability Is a System Property, Not a Model Property
The Forge paper's most important finding isn't a specific number — it's the proof of concept that reliability engineering trumps model quality. The same 8B model jumps from ~53% to 99.3% just by adding structural guardrails around it.
Here's the direct mapping between Forge's Cargo Rust guardrails and what the prompt above achieves as prompt engineering:
| Forge Component | What It Does | Prompt Equivalent |
|---|---|---|
| Retry nudges | Rephrase and retry failed tool calls | Retry Nudges section: try 3 approaches before escalating |
| Step enforcement | Force sequential execution through phases | Phase 1→2→3→4 with mandatory validation gates |
| Error recovery | Detect failure and pivot, not repeat | Error Recovery: if wrong → diagnose, correct, retry (not same query) |
| VRAM-aware context | Manage memory window to prevent collapse | Context Integrity: restate task before each new phase |
| Self-verification | Check output before delivering | Phase 4 self-verification with 5-point checklist |
| Eval harness | Score output for accuracy | Verification summary at end of every deliverable |
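The step enforcement and context integrity rows in the table can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Forge's code: `run_phases`, the phase names, and the stub phase functions are hypothetical, and a real agent would call a model inside each phase.

```python
from typing import Callable

# (phase name, phase function, gate that must accept the phase output)
Phase = tuple[str, Callable[[str, dict], object], Callable[[object], bool]]


def run_phases(task: str, phases: list[Phase]) -> dict:
    """Run phases in order; stop at the first phase whose gate rejects its output."""
    results: dict = {}
    for name, run, gate in phases:
        # Context integrity: every phase receives the original task again,
        # plus the outputs of the phases that already passed their gates.
        output = run(task, results)
        if not gate(output):
            raise ValueError(f"gate failed after phase {name!r}; later phases were not run")
        results[name] = output
    return results


if __name__ == "__main__":
    phases: list[Phase] = [
        ("scope", lambda task, done: f"Scope: {task}", lambda out: bool(out)),
        ("search", lambda task, done: ["source A", "source B"], lambda out: len(out) >= 2),
        ("synthesize", lambda task, done: f"{len(done['search'])} sources reviewed", bool),
    ]
    print(run_phases("top Kubernetes threats", phases))
```

Because a failed gate raises before the next phase starts, a bad intermediate result cannot silently flow into later steps.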
The gap between a $600 GPU running an 8B model and a paid API is less than 1 percentage point (99.3% vs. 100%) when both have proper guardrails. That's not a theory. It's the Forge paper's peer-reviewed result across 97 configurations.
Real Example: A Multi-Step Research Task With vs. Without Reliability
Let's say you ask an agent: "Research the top 5 threats to Kubernetes deployments in 2026, for each threat find a real-world incident, and rank them by severity with citations."
Without reliability (typical agent prompt):
❌ Step 1: Searches "Kubernetes threats 2026" → First result is a vendor blog from 2024
❌ Step 2: Pulls 3 more results from same SERP (no diversity)
❌ Step 3: Lists "API server vulnerabilities" as #1 threat — cites a blog, not an incident
❌ Step 4: Lists "container escape" as #2 — cites a 2021 paper (not 2026)
❌ Step 5: Fabricates a "major US bank Kubernetes breach" for #3 — no source exists
❌ Self-check: None — output delivered with false confidence
Result: 40% accurate. 1 hallucinated incident. 0 correct citations.
With reliability (the prompt above):
✅ Phase 1: Scope validated — "You want 5 threats, each with a real incident, ranked, with citations. I will search for [threat types]."
✅ Phase 2: 5 independent searches — finds a real S3 bucket misconfig incident (2026), a real RBAC exploit (2025), a real etcd compromise (2026)
✅ Phase 2 Gate: searches for threat #4 return shallow results → flags: "Limited sources for 'sidecar proxy vulnerabilities'. Continue with qualified finding?"
✅ Phase 3: Synthesis — identifies that 3 of 5 threats have real 2025-26 incidents, 2 depend on older data → flags appropriately
✅ Phase 4: Self-verification passes 5/5 — every claim cited, no uncertain assertions, format matches
Result: 100% accurate. 3 real incidents cited. 2 qualified findings. Full audit trail.
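The "every claim cited" part of that self-verification step can also be checked in code rather than left to the model. Below is a minimal sketch; the `Finding` structure, the `verification_summary` function, and the example.com URLs are hypothetical placeholders, and the rule it applies (a finding passes if it has at least one source or is explicitly flagged as qualified) is an assumption about how such a gate might work.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    claim: str
    sources: list[str] = field(default_factory=list)
    qualified: bool = False  # flagged as resting on limited or older data


def verification_summary(findings: list[Finding], min_sources: int = 1) -> dict:
    """Count cited, qualified, and unsupported findings; pass only if none are unsupported."""
    cited = [f for f in findings if len(f.sources) >= min_sources and not f.qualified]
    qualified = [f for f in findings if f.qualified]
    unsupported = [
        f.claim for f in findings if len(f.sources) < min_sources and not f.qualified
    ]
    return {
        "cited": len(cited),
        "qualified": len(qualified),
        "unsupported": unsupported,
        "passed": not unsupported,
    }


if __name__ == "__main__":
    report = [
        Finding("RBAC exploit", ["https://example.com/rbac-incident"]),
        Finding("Sidecar proxy vulnerabilities", qualified=True),
        Finding("Major bank breach"),  # no source and not flagged: should fail the gate
    ]
    print(verification_summary(report))
```

A fabricated incident like the one in the unreliable run has no source and no qualification flag, so it shows up under `unsupported` and the gate fails.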
The Forge numbers show the same pattern. Without guardrails, even a frontier model leaves a real error rate (Claude Sonnet at 87.2%), and the 8B model drops to ~53%. With guardrails, both land within a point of 100%, regardless of model size.
Beyond the Prompt: Making Reliability Automatic
A single prompt is a great start, but reliability engineering in production means scheduled checks and automated workflows. The same guardrails can be applied to recurring tasks:
Schedule reliable multi-step workflows:
```bash
# Daily competitor monitoring with structured reliability gates
openclaw cron add --every 24h --text "Using the reliable multi-step agent workflow: research 3 competitors for [industry]. For each: product changes, hiring moves, funding news. Verify every claim with 3 sources. Rank by significance."

# Weekly threat landscape scan
openclaw cron add --every 7d --text "Using the reliable multi-step agent workflow: scan for new security threats in [tech stack]. Phase 1: scope. Phase 2: search advisories + news + GitHub. Phase 3: synthesize severity. Phase 4: deliver ranked list with verified citations."
```
Combined with the reliability prompt, these cron jobs deliver verified, multi-source research — not glorified search summaries. Every claim has a source, every gap is flagged, and every output passes a self-verification gate.
Getting Started in 2 Minutes
- Deploy OpenClaw on GetClawCloud — one click, no server setup
- Paste the prompt above into your Telegram bot — the reliability framework is ready
- Send any multi-step task — the agent scopes, researches, analyzes, and verifies before delivering
The Forge paper makes the case that reliability is a system property, not a model property. The same 8B local model jumps from ~53% to 99.3% with structural guardrails. Your Telegram agents can apply the same idea with one prompt and the right workflow.
Build Your Reliable AI Agent
Deploy OpenClaw in one click, paste the reliability prompt, and get verified, multi-step research delivered to Telegram. No model swapping required — just a better system around it.