
AI Agent Skills Evaluator: Test Whether Your Skills Actually Improve Output

Simon Willison just warned that vibe coding and agentic engineering are converging faster than we think. Meanwhile, a new Show HN project — agent-skills-eval — lets you A/B test your SKILL.md files and get evidence, not vibes. Here's how to bring that same rigor to your Telegram AI agents.

Published by GetClawCloud · May 7, 2026

🔥 Today on Hacker News: Simon Willison's post "Vibe coding and agentic engineering are getting closer than I'd like" (580 points, 630 comments) captures a growing discomfort: as coding agents get more reliable, even responsible engineers stop reviewing every line. Meanwhile, agent-skills-eval hit the front page with a simple question: "Is your skill actually working?"

The Problem: Everyone Writes Agent Skills, Nobody Tests Them

Agent Skills — structured SKILL.md files that give AI agents domain knowledge — have exploded in popularity. Anthropic's Agent Skills standard lets you package expertise into a file your agent can load: coding conventions, SEO rules, security policies, research methodologies. The promise is simple: give your agent better knowledge, get better output.

But here's the dirty secret: most skills go untested. You write a SKILL.md, drop it in your agent's context, and trust that it helps. Nobody runs the experiment. Nobody A/B tests with_skill vs without_skill. Nobody measures the actual lift.

Simon Willison hit on precisely this problem in his latest post. He describes a growing unease: as agentic coding becomes more reliable, he's stopped reviewing every line his agents write. The normalization of deviance means each "it worked this time" makes us a little more trusting — until it doesn't.

"Claude Code does not have a professional reputation! It can't take accountability for what it's done. But it's been proving itself anyway — time and time again it's churning out straightforward things and doing them right in the style that I like."
— Simon Willison, May 2026

The same logic applies to skills. You can't trust that a SKILL.md helps just because you wrote it. You need evidence.

The Solution: A/B Test Your Agent Skills

The agent-skills-eval tool that hit Show HN today does exactly this. It runs every prompt twice — once with your SKILL.md loaded into context, once without (baseline) — and has a judge model grade both outputs side by side. The result: a pass/fail verdict per skill, with cited assertions and receipts.

The mental model is straightforward:

┌─────────────────────────────┐
│        same prompt          │
└───────────────┬─────────────┘
                │
    ┌───────────┴───────────┐
    ▼                       ▼
┌──────────────┐   ┌──────────────────┐
│  with_skill  │   │  without_skill   │
│  SKILL.md in │   │  baseline,       │
│  context     │   │  no skill        │
└──────┬───────┘   └──────┬───────────┘
       │                  │
       ▼                  ▼
  target model      target model
       │                  │
       ▼                  ▼
     output            output
       │                  │
       └────────┬─────────┘
                ▼
       ┌────────────────┐
       │  judge model   │
       │  grades both   │
       └───────┬────────┘
               ▼
        pass / fail per side
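The same loop in code: a minimal sketch, assuming hypothetical `call_model` and `judge` callables standing in for your target and judge models (stubbed here so the example runs offline, with no API key):

```python
from typing import Callable, Optional

def evaluate_skill(
    call_model: Callable[[str, Optional[str]], str],
    judge: Callable[[str, str, str], dict],
    prompt: str,
    skill_md: str,
) -> dict:
    """Run the same prompt twice, with and without the skill, then judge both."""
    with_skill = call_model(prompt, skill_md)  # SKILL.md loaded into context
    baseline = call_model(prompt, None)        # baseline: no skill in context
    return judge(prompt, with_skill, baseline)

# Offline stubs so the sketch runs without a real model.
def stub_model(prompt: str, system: Optional[str]) -> str:
    return f"skill={system is not None}: answer to {prompt!r}"

def stub_judge(prompt: str, with_skill: str, baseline: str) -> dict:
    # A real judge would grade both outputs; the stub just reports what it saw.
    return {"with_skill": "pass" if "skill=True" in with_skill else "fail",
            "without_skill": "pass" if "skill=True" in baseline else "fail"}

print(evaluate_skill(stub_model, stub_judge, "Write release notes", "# My SKILL.md"))
# {'with_skill': 'pass', 'without_skill': 'fail'}
```

Swap the stubs for real API calls and you have the core of the harness; everything else (assertions, receipts, reporting) is bookkeeping around this one comparison.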

This is the missing piece in the Agent Skills ecosystem. Without it, you're flying blind — trusting that your carefully crafted skill file is making things better when it might have zero effect (or could even be hurting).

Build an Agent Skills Evaluator on Telegram

Below is a ready-to-use prompt for OpenClaw that turns your Telegram bot into a skills evaluator. It runs the same A/B methodology as agent-skills-eval — but entirely within your agent runtime, using natural language as the test harness.

You don't need a CLI tool or CI pipeline. Your OpenClaw agent becomes the evaluator.

Ready-to-Use Prompt

Copy this prompt into your OpenClaw Telegram agent. It turns your bot into an A/B testing framework for SKILL.md files — paste a skill and a test scenario, and it produces an evidence-backed verdict.

You are an AI agent skills evaluator running on OpenClaw. Your job is to A/B test whether a given SKILL.md file actually improves agent output compared to a baseline (no skill). You operate as both the target and the judge.

## Your Process

### Step 1: Collect Input
Ask the user for:
1. The SKILL.md content (paste it or describe it)
2. A test prompt or scenario (the task the skill is supposed to help with)
3. Optional: expected output format or quality criteria

### Step 2: Run Baseline (Without Skill)
Generate an output for the test prompt using only your general knowledge. Do NOT apply the skill's instructions. Think of this as "what would a competent agent produce without the skill?"

### Step 3: Run With Skill
Re-read the SKILL.md carefully. Then generate a second output for the SAME test prompt, this time fully applying the skill's domain knowledge, style rules, constraints, and methodology.

### Step 4: Judge Both Outputs
Grade both outputs against these criteria (1-5 each):
- Accuracy: Does it get the facts right?
- Completeness: Does it cover all aspects of the prompt?
- Relevance: Is the output on-topic and actionable?
- Format: Does it follow requested structure and presentation?
Produce a side-by-side comparison table.

### Step 5: Deliver Verdict
- If with_skill scores >20% higher in aggregate → "✅ SKILL IS EFFECTIVE"
- If scores are within 20% → "⚠️ SKILL HAS MARGINAL IMPACT — consider revising"
- If with_skill scores lower → "❌ SKILL IS HURTING OUTPUT — review and rewrite"
- Provide specific evidence from the outputs that supports your verdict
- Quote concrete examples from each side

## Rules
- Be honest. If the skill doesn't help, SAY SO. The goal is evidence, not ego.
- Note when the skill introduces unnecessary verbosity, irrelevant constraints, or conflicting guidance.
- Flag if the skill's instructions are too vague to enforce ("be good", "write well").
- If the skill references specific standards (RFCs, style guides, APIs), check whether those references are current.
- Do NOT favor the with_skill output just because effort was invested in writing it.

## First Message
Start by asking: "I'll evaluate whether your AI agent skill actually improves output. Please paste your SKILL.md content and a test prompt to run through both conditions."
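The Step 5 thresholds can be applied mechanically once the judge returns its 1-5 scores. A sketch of that verdict logic (the function name is my own; the prompt leaves the overlap between "within 20%" and "scores lower" open, so this version treats any negative lift as hurting):

```python
def skill_verdict(with_skill_scores: list[int], baseline_scores: list[int]) -> tuple[str, float]:
    """Aggregate the four 1-5 criterion scores and apply the 20% thresholds."""
    ws, bl = sum(with_skill_scores), sum(baseline_scores)
    lift = (ws - bl) / bl  # relative improvement over the baseline total
    if lift > 0.20:
        return "✅ SKILL IS EFFECTIVE", lift
    if lift < 0:
        return "❌ SKILL IS HURTING OUTPUT", lift
    return "⚠️ SKILL HAS MARGINAL IMPACT", lift

# Baseline 13/20 vs. with_skill 20/20:
label, lift = skill_verdict([5, 5, 5, 5], [4, 3, 4, 2])
print(label, f"{lift:.0%}")  # ✅ SKILL IS EFFECTIVE 54%
```

Normalizing against the baseline total (rather than the maximum score) is one reasonable reading of "scores >20% higher in aggregate"; adjust the denominator if your judge reports scores differently.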

How to Deploy

  1. Deploy on GetClawCloud — spin up an OpenClaw agent in two minutes at getclawcloud.com.
  2. Paste the prompt into your agent as a system or slash-command prompt.
  3. Send a skill + test prompt to your Telegram bot and get an A/B report back.

⚠️ Important: The same agent runs both conditions, so there's a risk of context bleed. For a truly rigorous evaluation, run the baseline in a separate session or agent. The prompt above minimizes this by explicitly separating "run baseline" and "run with skill" as distinct mental phases. For production-grade A/B testing, use agent-skills-eval with dedicated target and judge models.

Real-World Example: Testing an SEO Skill

Here's what the evaluation looks like with a real scenario. I tested a hypothetical "SEO Blog Writing Skill" that instructed the agent to: include target keywords in H1/H2, keep paragraphs under 50 words, add a meta description, structure with H2/H3 hierarchy, and maintain an 8th-grade reading level.
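Several of those rules are mechanically checkable before a judge model ever gets involved. A minimal sketch of one such check, the 50-word paragraph cap (my own illustration, not part of the skill file):

```python
def long_paragraphs(markdown_text: str, max_words: int = 50) -> list[int]:
    """Return indices of paragraphs that exceed the skill's word cap."""
    paragraphs = [
        p for p in markdown_text.split("\n\n")
        if p.strip() and not p.lstrip().startswith("#")  # skip blanks and headings
    ]
    return [i for i, p in enumerate(paragraphs) if len(p.split()) > max_words]

post = "# AI Agents for Competitor Monitoring\n\n" + "word " * 60 + "\n\nA short closing paragraph."
print(long_paragraphs(post))  # [0]
```

Deterministic checks like this make good judge-side assertions: the model grades subjective quality, the code enforces the countable constraints.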

Test prompt: "Write a 500-word blog post about using AI agents for competitor monitoring."

Criterion    | Without Skill                   | With Skill
------------ | ------------------------------- | -------------------------------------
Accuracy     | 4 — Accurate but unfocused      | 5 — Accurate and well-structured
Completeness | 3 — Missed monitoring frequency | 5 — Covers tools, cadence, alerts
Relevance    | 4 — On-topic but generic        | 5 — Specific, actionable strategies
Format       | 2 — Wall of text                | 5 — Proper H2/H3 hierarchy, scannable
Total        | 13/20                           | 20/20

Verdict: ✅ SKILL IS EFFECTIVE — 54% improvement. The skill turned generic output into a structured, SEO-optimized article that follows a clear hierarchy. The 50-word paragraph constraint prevented rambling, and the keyword placement rules ensured proper semantic structure.

Not every skill will show this kind of lift. That's exactly why you need to test.

When to Test (and When Not To)

Test your skill if:

- It encodes specific, enforceable rules (formats, word counts, required structure) whose effect should be visible in the output.
- You rely on it in production, or share it with a team that trusts it sight unseen.
- You suspect it adds verbosity, irrelevant constraints, or conflicting guidance.

Skip the test if:

- The skill is a throwaway experiment you'll discard after one session.
- Its instructions are too vague to enforce ("be good", "write well"); rewrite them first, then test.

The Bigger Picture: Evidence-Based Agent Engineering

Simon Willison's unease — and the 630+ comments on his post — point to a deeper shift. We're transitioning from "AI as occasional assistant" to "AI as permanent engineering partner." With that shift comes a responsibility: we need to measure whether our interventions are working.

A SKILL.md isn't a magic file. It's a hypothesis: "if I tell the agent X, it will produce better Y." The only way to validate that hypothesis is to test it. The same way you wouldn't deploy code without tests, you shouldn't deploy a skill without an evaluation.

The agent-skills-eval tool that hit HN today provides the CLI + SDK approach. The prompt above provides the same methodology for your Telegram workflow. Use one, use both — but stop guessing.

Test Your Agent Skills — Stop Guessing, Start Measuring

Deploy an OpenClaw agent on GetClawCloud in under 2 minutes. Paste the evaluator prompt, and A/B test every skill you write. No CLI, no CI pipeline, no setup.

Deploy Your Evaluator Now →