
Testing Framework

Traditional software testing verifies deterministic behavior. AI plugins are different: skills are probabilistic, routing is inferred, and quality degrades silently when one skill steals triggers from another.

The plugin uses a four-layer testing model designed for these challenges. Each layer tests a different aspect of the system, and each is faster than the next.

Testing Architecture

The Four Layers

Layer | Name         | Speed   | Purpose
------|--------------|---------|------------------------------------------
L1    | Structure    | 0.1s    | Linting, spec validation, naming errors
L2    | Triggers     | ~30s    | Trigger evaluation sets
L3    | Sessions     | 2–3 min | Multi-turn integration tests via tmux
L4    | Benchmarking | 5+ min  | Skill value comparison (does it help?)

The goal is L1+L2 coverage for every skill, L3 for critical workflows, and L4 for skills whose value is in question.

L1: Structure Validation

Static checks that run instantly. Validates file structure, naming conventions, cross-references, and spec compliance.

make lint

What it checks:

  • Skill directories match SKILL.md name field
  • Agent files follow {name}.agent.md naming
  • Cross-references between agents and skills are valid
  • No orphan skills (referenced but missing) or phantom skills (present but unreferenced)
  • Frontmatter contains only allowed fields (name, description)
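As a sketch of the frontmatter rule, a minimal L1-style check might look like the following. The function name `lint_frontmatter` and the YAML-lite parsing are illustrative assumptions, not the plugin's actual linter:

```python
# Hypothetical sketch of the L1 frontmatter check (not the real linter).
ALLOWED_FIELDS = {"name", "description"}

def lint_frontmatter(text: str) -> list[str]:
    """Return L1-style errors for a SKILL.md frontmatter block."""
    errors = []
    lines = text.splitlines()
    if not lines or lines[0] != "---":
        return ["missing frontmatter"]
    fields = {}
    for line in lines[1:]:
        if line == "---":
            break  # end of frontmatter block
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    for key in fields:
        if key not in ALLOWED_FIELDS:
            errors.append(f"disallowed frontmatter field: {key}")
    if "name" not in fields:
        errors.append("missing required field: name")
    return errors
```

A clean file returns an empty list; any extra field (say, `version`) produces one error per violation, which is the shape a `make lint` report needs.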

L2: Trigger Evaluation

Tests that a skill fires for the right prompts and stays silent for the wrong ones. Each skill has an eval set with positive and negative test cases.

make unit

Each eval set is a JSON file at tests/evals/triggers/{name}.json:

{
  "skill_name": "brain",
  "evals": [
    {
      "query": "what did we decide about using CNPG vs Azure Postgres?",
      "should_trigger": true
    },
    {
      "query": "run the smoke tests on azure/ship",
      "should_trigger": false
    }
  ]
}

Requirements:

  • Minimum 8 positive entries (should_trigger: true)
  • Minimum 8 negative entries (should_trigger: false)
  • Include near misses: prompts that are topically close but belong to a different skill
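The minimum-coverage rules above are easy to enforce mechanically. A sketch of such a validator (the function name and error wording are assumptions; the real `make unit` runner may differ):

```python
import json

# Hypothetical validator for the L2 coverage rules (8 positive / 8 negative).
def validate_eval_set(raw: str, min_each: int = 8) -> list[str]:
    """Check an L2 eval file against the minimum-coverage rules."""
    data = json.loads(raw)
    evals = data.get("evals", [])
    positives = sum(1 for e in evals if e.get("should_trigger") is True)
    negatives = sum(1 for e in evals if e.get("should_trigger") is False)
    errors = []
    if positives < min_each:
        errors.append(f"need >= {min_each} positive entries, found {positives}")
    if negatives < min_each:
        errors.append(f"need >= {min_each} negative entries, found {negatives}")
    return errors
```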

L3: Session Integration

Multi-turn integration tests that validate real workflows via tmux. These test that skills work end-to-end in an actual Copilot CLI session.

make integration S=glab     # Test one skill
make integration            # Test all skills with scenarios

Session scenarios are JSON files at tests/evals/scenarios/{name}-workflow.json:

{
  "name": "glab-workflow",
  "steps": [
    {
      "name": "list-open-mrs",
      "prompt": "what merge requests are currently open for me?",
      "timeout": 120,
      "assertions": [
        {
          "pattern": "glab mr list|MR|PR|open",
          "type": "regex",
          "description": "Discusses merge requests"
        },
        {
          "pattern": "--state=opened",
          "type": "not_contains",
          "description": "Does not use hallucinated flags"
        }
      ]
    }
  ]
}

How it works: The test runner spawns a tmux session, launches the CLI process inside it, sends each prompt, waits for a response, then runs assertions against the captured output. This tests the full stack: skill loading, routing, tool execution, and response quality.
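The assertion step of that pipeline can be sketched in a few lines. This handles only the two assertion types shown in the scenario above (`regex` and `not_contains`); the real runner may support more:

```python
import re

# Minimal sketch of scenario assertion evaluation (illustrative, not the runner).
def check_assertion(output: str, assertion: dict) -> bool:
    """Evaluate one scenario assertion against captured tmux output."""
    pattern = assertion["pattern"]
    kind = assertion["type"]
    if kind == "regex":
        return re.search(pattern, output) is not None
    if kind == "not_contains":
        return pattern not in output
    raise ValueError(f"unknown assertion type: {kind}")
```

Note the asymmetry: `regex` passes when the pattern appears anywhere in the output, while `not_contains` passes when the literal string is absent, which is what makes it useful for catching hallucinated flags.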

Design rules:

  • All prompts must be read-only (no mutations)
  • Last step should test a boundary (trigger a different skill)
  • Both copilot and claude CLI targets are supported
  • Timeouts are per-step

L4: Benchmarking

Measures whether a skill provides measurable value. Runs the same scenario twice (with the skill loaded and without) and compares assertion pass rates.

make benchmark S=brain
Delta       | Verdict
------------|------------------------------------------
> +10%      | Valuable — skill clearly helps
+1% to +10% | Marginal — review context cost
0% to +1%   | Redundant — skill adds no value
Negative    | Harmful — skill makes things worse
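The verdict mapping reduces to a comparison of the two pass rates. A sketch, assuming pass rates in the 0–1 range; the exact boundary handling (inclusive vs. exclusive) is an assumption:

```python
# Hypothetical mapping of a benchmark delta onto the verdict table above.
def verdict(with_skill: float, without_skill: float) -> str:
    """Classify a skill from its assertion pass rates with and without it loaded."""
    delta = with_skill - without_skill
    if delta > 0.10:
        return "valuable"
    if delta > 0.01:
        return "marginal"
    if delta >= 0.0:
        return "redundant"
    return "harmful"
```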

Run this quarterly, after model upgrades, or when questioning whether a skill is still needed.

Running Tests

Quick Validation (After Every Change)

make test    # L1 + L2, runs in ~2 seconds

Skill-Specific Testing

make test-skill S=brain    # All layers for one skill

Full Suite

make test                  # L1 + L2
make integration           # L3 (all scenarios)
make report                # Coverage inventory

Writing Good Tests

Trigger Evals (L2)

  • Positive entries should be natural language a real user would type
  • Negative entries should include near misses that test discrimination
  • Cover edge cases: abbreviations, ambiguous phrasing, partial matches
  • Include both short ("scan deps") and long ("run a full dependency analysis on the partition service") prompts
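Putting those tips together, a few illustrative entries for a hypothetical dependency-scanning skill might look like this (the negative is a near miss that belongs to the brain skill instead):

```json
{
  "evals": [
    { "query": "scan deps", "should_trigger": true },
    { "query": "run a full dependency analysis on the partition service", "should_trigger": true },
    { "query": "what did we decide about pinning dependency versions?", "should_trigger": false }
  ]
}
```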

Session Scenarios (L3)

  • Keep scenarios focused: 2–4 steps per workflow
  • First step should clearly activate the target skill
  • Last step should test a boundary (verifies the skill does not over-claim)
  • Use read-only operations to avoid side effects

For the full contribution workflow and conventions, see CONTRIBUTING.md.