# Testing Framework
Traditional software testing verifies deterministic behavior. AI plugins are different: skills are probabilistic, routing is inferred, and quality degrades silently when one skill steals triggers from another.
The plugin uses a four-layer testing model designed for these challenges. Each layer tests a different aspect of the system, and each runs faster than the one after it.

## The Four Layers
| Layer | Name | Speed | Purpose |
|---|---|---|---|
| L1 | Structure | 0.1s | Linting, spec validation, naming errors |
| L2 | Triggers | ~30s | Trigger evaluation sets |
| L3 | Sessions | 2–3 min | Multi-turn integration tests via tmux |
| L4 | Benchmarking | 5+ min | Skill value comparison (does it help?) |
The goal is L1+L2 coverage for every skill, L3 for critical workflows, and L4 for skills whose value is in question.
## L1: Structure Validation
Static checks that run instantly. Validates file structure, naming conventions, cross-references, and spec compliance.
What it checks:
- Skill directories match the `SKILL.md` `name` field
- Agent files follow `{name}.agent.md` naming
- Cross-references between agents and skills are valid
- No orphan skills (referenced but missing) or phantom skills (present but unreferenced)
- Frontmatter contains only allowed fields (`name`, `description`)
## L2: Trigger Evaluation
Tests that a skill fires for the right prompts and stays silent for the wrong ones. Each skill has an eval set with positive and negative test cases.
Each eval set is a JSON file at `tests/evals/triggers/{name}.json`:

```json
{
  "skill_name": "brain",
  "evals": [
    {
      "query": "what did we decide about using CNPG vs Azure Postgres?",
      "should_trigger": true
    },
    {
      "query": "run the smoke tests on azure/ship",
      "should_trigger": false
    }
  ]
}
```
Requirements:
- Minimum 8 positive entries (`should_trigger: true`)
- Minimum 8 negative entries (`should_trigger: false`)
- Include near misses: prompts that are topically close but belong to a different skill
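The requirements above can be enforced mechanically. A sketch of a validator for one eval-set file, using the field names from the example and the minimum counts stated above:

```python
MIN_POSITIVE = 8
MIN_NEGATIVE = 8

def validate_eval_set(data: dict) -> list[str]:
    """Check a parsed L2 eval set against the minimum-coverage requirements."""
    errors = []
    evals = data.get("evals", [])
    positives = [e for e in evals if e.get("should_trigger") is True]
    negatives = [e for e in evals if e.get("should_trigger") is False]
    if not data.get("skill_name"):
        errors.append("missing skill_name")
    if len(positives) < MIN_POSITIVE:
        errors.append(f"only {len(positives)} positive entries (need {MIN_POSITIVE})")
    if len(negatives) < MIN_NEGATIVE:
        errors.append(f"only {len(negatives)} negative entries (need {MIN_NEGATIVE})")
    return errors
```

Running this in L1 alongside the structure checks catches thin eval sets before the slower trigger run ever starts.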
## L3: Session Integration
Multi-turn integration tests that validate real workflows via tmux. These test that skills work end-to-end in an actual Copilot CLI session.
Session scenarios are JSON files at `tests/evals/scenarios/{name}-workflow.json`:

```json
{
  "name": "glab-workflow",
  "steps": [
    {
      "name": "list-open-mrs",
      "prompt": "what merge requests are currently open for me?",
      "timeout": 120,
      "assertions": [
        {
          "pattern": "glab mr list|MR|PR|open",
          "type": "regex",
          "description": "Discusses merge requests"
        },
        {
          "pattern": "--state=opened",
          "type": "not_contains",
          "description": "Does not use hallucinated flags"
        }
      ]
    }
  ]
}
```
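The two assertion types shown in the example (`regex` and `not_contains`) can be evaluated with a small helper. This is a sketch; the real runner may support additional types:

```python
import re

def assertion_passes(output: str, assertion: dict) -> bool:
    """Evaluate one scenario assertion against captured session output."""
    pattern = assertion["pattern"]
    kind = assertion["type"]
    if kind == "regex":
        # Pass if the pattern matches anywhere in the output.
        return re.search(pattern, output) is not None
    if kind == "not_contains":
        # Pass if the literal string is absent (guards against hallucinated flags).
        return pattern not in output
    raise ValueError(f"unknown assertion type: {kind}")
```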
How it works: The test runner spawns a tmux session, launches the CLI process inside it, sends each prompt, waits for a response, then runs assertions against the captured output. This tests the full stack: skill loading, routing, tool execution, and response quality.
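The send/capture loop can be sketched with tmux's standard `send-keys` and `capture-pane` subcommands. The idle-detection marker below is a hypothetical placeholder; the real runner's completion detection may differ:

```python
import subprocess
import time

def tmux(*args: str) -> str:
    """Run a tmux subcommand and return its stdout."""
    return subprocess.run(
        ["tmux", *args], check=True, capture_output=True, text=True
    ).stdout

def wait_for_idle(capture, timeout: int, poll: float = 2.0, marker: str = ">") -> str:
    """Poll capture() until the pane ends with the prompt marker or the timeout expires."""
    deadline = time.time() + timeout
    output = capture()
    while time.time() < deadline and not output.rstrip().endswith(marker):
        time.sleep(poll)
        output = capture()
    return output

def run_step(session: str, prompt: str, timeout: int = 120) -> str:
    """Send one prompt into the tmux session and return the captured pane text."""
    tmux("send-keys", "-t", session, prompt, "Enter")
    return wait_for_idle(lambda: tmux("capture-pane", "-p", "-t", session), timeout)
```

The returned pane text is what the per-step assertions run against.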
Design rules:
- All prompts must be read-only (no mutations)
- Last step should test a boundary (trigger a different skill)
- Both `copilot` and `claude` CLI targets are supported
- Timeouts are per-step
## L4: Benchmarking
Measures whether a skill provides measurable value. Runs the same scenario twice (with the skill loaded and without) and compares assertion pass rates.
| Delta | Verdict |
|---|---|
| > +10% | Valuable — skill clearly helps |
| +1% to +10% | Marginal — review context cost |
| 0% to +1% | Redundant — skill adds no value |
| Negative | Harmful — skill makes things worse |
Run this quarterly, after model upgrades, or when questioning whether a skill is still needed.
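The verdict table maps onto a small classifier. A sketch, with boundary handling (whether exactly +1% or +10% falls in the lower or upper band) chosen arbitrarily here since the table leaves it open:

```python
def verdict(with_skill_pass_rate: float, without_skill_pass_rate: float) -> str:
    """Map the L4 pass-rate delta (in percentage points) onto the verdict table."""
    delta = with_skill_pass_rate - without_skill_pass_rate
    if delta > 10:
        return "valuable"   # skill clearly helps
    if delta > 1:
        return "marginal"   # review context cost
    if delta >= 0:
        return "redundant"  # skill adds no value
    return "harmful"        # skill makes things worse
```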
## Running Tests
### Quick Validation (After Every Change)
### Skill-Specific Testing
### Full Suite
## Writing Good Tests
### Trigger Evals (L2)
- Positive entries should be natural language a real user would type
- Negative entries should include near misses that test discrimination
- Cover edge cases: abbreviations, ambiguous phrasing, partial matches
- Include both short ("scan deps") and long ("run a full dependency analysis on the partition service") prompts
### Session Scenarios (L3)
- Keep scenarios focused: 2–4 steps per workflow
- First step should clearly activate the target skill
- Last step should test a boundary (verifies the skill does not over-claim)
- Use read-only operations to avoid side effects
For the full contribution workflow and conventions, see CONTRIBUTING.md.