Introduction

Evalify.sh is a community-driven registry of expert-authored evaluation criteria for AI agent skills.

What are evals?

Evaluation criteria — evals — are structured tests that define what a skill should do and how well it should do it. They're used by tools like skill-creator to generate accuracy reports and improve skill design.

A single eval looks like this:

{
  "prompt": "Summarize this pull request diff in under 100 words",
  "expectations": [
    "Response is under 100 words",
    "Response mentions the core change",
    "Response does not include file paths or line numbers"
  ]
}
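To make the structure concrete, here is a minimal sketch of how an eval like this might be represented and checked in code. The Eval class and the word-count helper are illustrative assumptions, not part of Evalify or skill-creator; most expectations are natural-language criteria that a model, not code, would judge.

```python
from dataclasses import dataclass

@dataclass
class Eval:
    # Minimal representation of a single eval: a prompt plus
    # natural-language expectations about the response.
    prompt: str
    expectations: list[str]

def check_word_limit(response: str, limit: int = 100) -> bool:
    # One of the few mechanically checkable expectations:
    # "Response is under 100 words".
    return len(response.split()) < limit

eval_case = Eval(
    prompt="Summarize this pull request diff in under 100 words",
    expectations=[
        "Response is under 100 words",
        "Response mentions the core change",
        "Response does not include file paths or line numbers",
    ],
)

print(check_word_limit("Refactors the auth middleware to cache token lookups."))
```

The other two expectations are subjective and would typically be graded by an LLM rather than a deterministic check like the one above.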

Better evals → better-designed skills → more reliable AI agents.

Who is this for?

Skill consumers — you're building or using an AI agent and want battle-tested evals to validate that it behaves correctly. Browse the registry, pull a pack, and point skill-creator at it.

Skill authors — you've designed a skill and want to share your evaluation criteria with the community. Write a pack, publish it, and let others improve on your work.

Supported formats

Evalify supports two eval formats:

  • Anthropic skill-creator v2 — the format used by skills built with skill-creator. Includes skill_name, id, expected_output, files, and expectations.
  • Evalify (native) — a minimal format: just prompt and expectations.

Both formats normalize to the same internal model, so packs are usable regardless of which format they were authored in. See Import & Export for the full format reference.
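As a rough sketch of what that normalization might look like, the function below maps either format onto one internal shape of prompt plus expectations. The internal model, the format-detection heuristic, and the handling of expected_output are assumptions for illustration; the authoritative field mapping lives in the Import & Export reference.

```python
def normalize(raw: dict) -> dict:
    # Normalize either supported format to one assumed internal shape:
    # {"prompt": str, "expectations": [str, ...]}.
    if "skill_name" in raw:
        # Looks like skill-creator v2: carries extra fields such as
        # skill_name, id, expected_output, and files.
        expectations = list(raw.get("expectations", []))
        if raw.get("expected_output"):
            # Fold the expected output into a checkable expectation
            # (an illustrative choice, not a documented rule).
            expectations.append(f"Response matches: {raw['expected_output']}")
        return {"prompt": raw["prompt"], "expectations": expectations}
    # Evalify native format is already prompt + expectations.
    return {"prompt": raw["prompt"], "expectations": list(raw["expectations"])}

v2_eval = {
    "skill_name": "summarize-pr",
    "id": "summarize-pr-001",
    "prompt": "Summarize this pull request diff in under 100 words",
    "expectations": ["Response is under 100 words"],
    "expected_output": "A concise summary of the core change",
}
print(normalize(v2_eval))
```

Because both formats reduce to the same shape, downstream tooling only ever sees one model and never needs to branch on the source format.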

Guides