AI Agent Skills Evaluator: Test Whether Your Skills Actually Improve Output
Simon Willison just warned that vibe coding and agentic engineering are converging faster than we think. Meanwhile, a new Show HN project — agent-skills-eval — lets you A/B test your SKILL.md files and get evidence, not vibes. Here's how to bring that same rigor to your Telegram AI agents.
🔥 Today on Hacker News: Simon Willison's post "Vibe coding and agentic engineering are getting closer than I'd like" (580 points, 630 comments) captures a growing discomfort: as coding agents get more reliable, even responsible engineers stop reviewing every line. Meanwhile, agent-skills-eval hit the front page with a simple question: "Is your skill actually working?"
The Problem: Everyone Writes Agent Skills, Nobody Tests Them
Agent Skills — structured SKILL.md files that give AI agents domain knowledge — have exploded in popularity. Anthropic's Agent Skills standard lets you package expertise into a file your agent can load: coding conventions, SEO rules, security policies, research methodologies. The promise is simple: give your agent better knowledge, get better output.
But here's the dirty secret: most skills go untested. You write a SKILL.md, drop it in your agent's context, and trust that it helps. Nobody runs the experiment. Nobody A/B tests with_skill vs without_skill. Nobody measures the actual lift.
Simon Willison hit on precisely this problem in his latest post (May 2026). He describes a growing unease: as agentic coding becomes more reliable, he's stopped reviewing every line his agents write. The normalization of deviance means each "it worked this time" makes us a little more trusting — until it doesn't.
The same logic applies to skills. You can't trust that a SKILL.md helps just because you wrote it. You need evidence.
The Solution: A/B Test Your Agent Skills
The agent-skills-eval tool that hit Show HN today does exactly this. It runs every prompt twice — once with your SKILL.md loaded into context, once without (baseline) — and has a judge model grade both outputs side by side. The result: a pass/fail verdict per skill, with cited assertions and receipts.
The mental model is straightforward:
```
              ┌─────────────────┐
              │   same prompt   │
              └────────┬────────┘
                       │
          ┌────────────┴─────────────┐
          ▼                          ▼
  ┌────────────────┐        ┌────────────────┐
  │   with_skill   │        │ without_skill  │
  │  SKILL.md in   │        │  baseline,     │
  │  context       │        │  no skill      │
  └───────┬────────┘        └────────┬───────┘
          │                          │
          ▼                          ▼
     target model              target model
          │                          │
          ▼                          ▼
       output                     output
          │                          │
          └────────────┬─────────────┘
                       ▼
               ┌────────────────┐
               │  judge model   │
               │  grades both   │
               └───────┬────────┘
                       ▼
              pass / fail per side
```
This is the missing piece in the Agent Skills ecosystem. Without it, you're flying blind — trusting that your carefully crafted skill file is making things better when it might have zero effect (or could even be hurting).
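If you prefer code to boxes, here is a minimal sketch of that with/without loop. It is not the agent-skills-eval implementation; it assumes the OpenAI Python SDK purely for illustration, and the model names, rubric wording, and JSON shape are placeholders you would tune for real use.

```python
# Minimal sketch of the with/without loop, NOT the agent-skills-eval source.
# Model names, rubric wording, and the JSON shape below are placeholders.
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()
TARGET_MODEL = "gpt-4o-mini"   # placeholder target model
JUDGE_MODEL = "gpt-4o"         # placeholder judge model


def run_target(prompt: str, skill: str | None = None) -> str:
    """Run the same prompt against the target model, with or without the skill."""
    messages = []
    if skill:
        messages.append({"role": "system", "content": skill})  # with_skill condition
    messages.append({"role": "user", "content": prompt})
    resp = client.chat.completions.create(model=TARGET_MODEL, messages=messages)
    return resp.choices[0].message.content


def judge(prompt: str, with_skill: str, without_skill: str) -> dict:
    """Have a judge model grade both outputs against the same rubric."""
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Score each output 1-5 on accuracy, completeness, relevance, and "
                "format. Return JSON like "
                '{"with_skill": {"total": 0}, "without_skill": {"total": 0}}.\n\n'
                f"Prompt:\n{prompt}\n\n"
                f"Output A (with skill):\n{with_skill}\n\n"
                f"Output B (without skill):\n{without_skill}"
            ),
        }],
    )
    return json.loads(resp.choices[0].message.content)


skill = Path("SKILL.md").read_text()
prompt = "Write a 500-word blog post about using AI agents for competitor monitoring."
report = judge(prompt, run_target(prompt, skill), run_target(prompt))
print(json.dumps(report, indent=2))
```

The property that matters is symmetry: both conditions get exactly the same prompt, and the judge sees both outputs in a single call against the same rubric.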
Build an Agent Skills Evaluator on Telegram
Below is a ready-to-use prompt for OpenClaw that turns your Telegram bot into a skills evaluator. It runs the same A/B methodology as agent-skills-eval — but entirely within your agent runtime, using natural language as the test harness.
You don't need a CLI tool or CI pipeline. Your OpenClaw agent becomes the evaluator.
Ready-to-Use Prompt
Copy this prompt into your OpenClaw Telegram agent. It turns your bot into an A/B testing framework for SKILL.md files — paste a skill and a test scenario, and it produces an evidence-backed verdict.
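The exact wording is yours to tune. A minimal sketch of such an evaluator prompt could look like the following; the rubric and the pass/fail thresholds are illustrative assumptions, not something taken from agent-skills-eval:

```
You are a Skill Evaluator. When I send you a SKILL.md plus a test prompt:

1. BASELINE: Answer the test prompt as if you had never seen the skill.
   Do not apply any of its rules here.
2. WITH SKILL: Answer the same test prompt again, strictly following every
   instruction in the SKILL.md.
3. JUDGE: Score both answers 1-5 on accuracy, completeness, relevance, and
   format (max 20 per side). Quote specific lines as evidence for each score.
4. VERDICT: Report both score tables, the percentage change, and one of:
   - SKILL IS EFFECTIVE   (with-skill beats baseline by 3+ points)
   - NO MEASURABLE EFFECT (scores within 2 points)
   - SKILL MAY BE HARMFUL (baseline wins)
```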
How to Deploy
- Deploy on GetClawCloud — spin up an OpenClaw agent in two minutes at getclawcloud.com.
- Paste the prompt into your agent as a system or slash-command prompt.
- Send a skill + test prompt to your Telegram bot and get an A/B report back.
⚠️ Important: The same agent runs both conditions, so there's a risk of context bleed. For a truly rigorous evaluation, run the baseline in a separate session or agent. The prompt above mitigates the risk by treating "run baseline" and "run with skill" as two strictly separated phases. For production-grade A/B testing, use agent-skills-eval with dedicated target and judge models.
Real-World Example: Testing an SEO Skill
Here's what the evaluation looks like with a real scenario. I tested a hypothetical "SEO Blog Writing Skill" that instructed the agent to: include target keywords in H1/H2, keep paragraphs under 50 words, add a meta description, structure with H2/H3 hierarchy, and maintain an 8th-grade reading level.
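For reference, a condensed version of that hypothetical skill file might look like this. It is illustrative only; the frontmatter simply follows the usual SKILL.md pattern of a name plus a description:

```markdown
---
name: seo-blog-writing
description: Rules for writing SEO-optimized blog posts with clean heading structure.
---

# SEO Blog Writing Skill

- Put the target keyword in the H1 and at least one H2.
- Keep every paragraph under 50 words.
- Add a meta description of roughly 150-160 characters.
- Structure the post with a clear H2/H3 hierarchy.
- Write at an 8th-grade reading level.
```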
Test prompt: "Write a 500-word blog post about using AI agents for competitor monitoring."
| Criterion | Without Skill | With Skill |
|---|---|---|
| Accuracy | 4 — Accurate but unfocused | 5 — Accurate and well-structured |
| Completeness | 3 — Missed monitoring frequency | 5 — Covers tools, cadence, alerts |
| Relevance | 4 — On-topic but generic | 5 — Specific, actionable strategies |
| Format | 2 — Wall of text | 5 — Proper H2/H3 hierarchy, scannable |
| Total | 13/20 | 20/20 |
Verdict: ✅ SKILL IS EFFECTIVE — a 54% improvement, i.e. (20 - 13) / 13 ≈ 0.54. The skill turned generic output into a structured, SEO-optimized article with a clear heading hierarchy. The 50-word paragraph constraint prevented rambling, and the keyword placement rules ensured proper semantic structure.
Not every skill will show this kind of lift. That's exactly why you need to test.
When to Test (and When Not To)
Test your skill if:
- You wrote it from scratch and haven't validated it yet
- You're asking an agent to follow specific rules (style guides, security policies, code conventions)
- You're sharing a skill with a team — prove it helps before asking others to use it
- The skill makes strong claims ("follows RFC 2119", "guarantees XSS protection")
Skip the test if:
- The skill is trivially testable (e.g., "always format dates as ISO 8601" — just check the output format directly, as in the sketch after this list)
- The skill is well-established in the agentskills.io registry with community validation
- You're in rapid prototyping and don't need rigor yet
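To make the "trivially testable" case above concrete: for rules like date formatting, a deterministic check is cheaper and more repeatable than an LLM judge. A rough sketch in Python (the patterns only distinguish plain YYYY-MM-DD dates from US-style slashed dates):

```python
import re

ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")          # e.g. 2026-01-31
SLASH_DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")  # e.g. 1/31/26


def dates_are_iso(output: str) -> bool:
    """Pass if the output contains at least one ISO date and no slashed dates."""
    return bool(ISO_DATE.search(output)) and not SLASH_DATE.search(output)


assert dates_are_iso("Report generated on 2026-01-31.")
assert not dates_are_iso("Report generated on 1/31/26.")
```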
The Bigger Picture: Evidence-Based Agent Engineering
Simon Willison's unease — and the 630+ comments on his post — point to a deeper shift. We're transitioning from "AI as occasional assistant" to "AI as permanent engineering partner." With that shift comes a responsibility: we need to measure whether our interventions are working.
A SKILL.md isn't a magic file. It's a hypothesis: "if I tell the agent X, it will produce better Y." The only way to validate that hypothesis is to test it. The same way you wouldn't deploy code without tests, you shouldn't deploy a skill without an evaluation.
The agent-skills-eval tool that hit HN today provides the CLI + SDK approach. The prompt above provides the same methodology for your Telegram workflow. Use one, use both — but stop guessing.
Test Your Agent Skills — Stop Guessing, Start Measuring
Deploy an OpenClaw agent on GetClawCloud in under 2 minutes. Paste the evaluator prompt, and A/B test every skill you write. No CLI, no CI pipeline, no setup.
Deploy Your Evaluator Now →