SKILL.md Authoring Guide: Two-Zone Design

Reference Document

This document explains WHY the SKILL.md format works the way it does. Read this after completing Module 7 labs when you want to understand the reasoning behind what you built.

Companion labs: Module 7 — Agent Skills | Module 7 Reference


1. Why Skills, Not Prompts?

The Limitation of Ad-Hoc Prompting

When you first use an LLM for infrastructure diagnostics, the natural approach is to describe the problem in natural language: "The RDS CPU is high. What should I check?" The model produces a reasonable-sounding checklist. It works, sort of.

The problems appear at scale:

Inconsistency: Ask the same question twice and you get a different procedure. Ask with a slightly different framing and you get a completely different set of checks. You cannot audit whether the agent followed the correct procedure because "the correct procedure" was never defined — it emerged from a natural language prompt each time.

Incompleteness: An LLM answering a vague question does not know which data points are critical for YOUR environment, which pg_stat_statements fields indicate which failure mode in YOUR PostgreSQL version, or which CloudWatch metric thresholds YOU have calibrated. It improvises from training data.

Scope creep: A vague prompt creates a vague agent. "Help me investigate the database" gives the agent latitude to try anything — including things your DBA team has decided are off-limits.

No auditability: After an incident, can you answer "did the agent follow the correct procedure?" With a prompt-based agent: no. With a skill-based agent: yes, you can compare the agent's actions against the SKILL.md procedure step by step.

What Skills Encode

A SKILL.md file encodes five things that an ad-hoc prompt cannot:

  1. When to activate. Specific, observable trigger conditions — not "when the database is slow" but "when CloudWatch alarm rds-cpu-high fires on $RDS_INSTANCE_ID."

  2. What data to gather. Exact CLI commands with exact expected output.

  3. How to reason about that data. IF/THEN/ELSE decision trees with numeric thresholds.

  4. What is forbidden. A NEVER DO list specific to this domain.

  5. When to stop. Escalation rules with specific triggering conditions.

Skills as Context Engineering Artifacts

A SKILL.md file is a context engineering artifact. When an agent loads a skill, the skill text becomes part of the LLM's context window at the system prompt level. The Brain reasons over:

[SOUL.md — who I am, what I never do]
[SKILL.md — what procedure to follow, what thresholds to apply]
[Tool results — what I have observed so far]

The quality of the agent's diagnostic decisions is directly proportional to the quality of the SKILL.md content. This is why the course teaches SKILL.md authoring as the primary skill, not Python agent code. The code (Hermes) is fixed. The context (SKILL.md) is the variable. Your domain expertise lives in the context, not in the code.
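The layering described above can be sketched as plain concatenation — a toy illustration only, assuming each layer is a file (the names and `/tmp` paths are invented for this example, not Hermes internals):

```shell
# Sketch: assemble the context window in layer order (file names illustrative).
mkdir -p /tmp/ctx
printf '%s\n' "I never run destructive SQL."           > /tmp/ctx/SOUL.md
printf '%s\n' "IF mean_exec_time_ms > 5000: escalate." > /tmp/ctx/SKILL.md
printf '%s\n' "mean_exec_time_ms=6200"                 > /tmp/ctx/tool_results.txt

# Identity layer first, then the active procedure, then observations.
cat /tmp/ctx/SOUL.md /tmp/ctx/SKILL.md /tmp/ctx/tool_results.txt > /tmp/ctx/context.txt
```

The order matters: constraints and procedure precede observations, so the Brain reasons about new data against rules it has already loaded.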


2. The Two-Zone Design

The Problem the Two Zones Solve

Without the two-zone constraint, agents exhibit a failure mode called mid-loop data discovery:

  1. Agent starts reasoning over the initial data (high CPU, slow queries visible)
  2. During reasoning, the agent realizes it needs more data (what's the table size? is there a lock?)
  3. Agent runs a new query to get that data
  4. New data reveals a new dimension to the problem
  5. Agent needs more data to understand the new dimension
  6. Loop continues — the agent is not converging on a diagnosis

The result: unpredictable session duration, escalating token costs, and a diagnosis that arrived at different conclusions depending on what data happened to be discovered in what order.
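The two-zone fix is a hard contract: collect everything once, then diagnose as a pure function of that snapshot. A minimal sketch — the function names and the stand-in value are invented for illustration:

```shell
# Phase 1 — deterministic collection: no branching, no interpretation.
collect() {
  echo "4200"   # stand-in for mean_exec_time_ms from pg_stat_statements
}

# Phase 2 — a pure function of the Phase 1 snapshot: no new queries allowed.
diagnose() {
  if   [ "$1" -gt 5000 ]; then echo "CRITICAL_SLOW_QUERY"
  elif [ "$1" -gt 1000 ]; then echo "SLOW_QUERY_OTHER_CAUSE"
  else                         echo "NO_ISSUE_FOUND"
  fi
}

snapshot=$(collect)
diagnose "$snapshot"   # → SLOW_QUERY_OTHER_CAUSE
```

Because `diagnose` cannot call `collect`, the session length is bounded by construction: one collection pass, one reasoning pass, done.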

Scripts Zone — Deterministic Data Collection

Purpose: Run all the CLI commands. Collect all the data. No decisions. No interpretation.

The Scripts Zone is idempotent and deterministic. Running Phase 1 twice on the same database produces the same output. There is no branching based on intermediate results.

Because Scripts Zone commands are exact CLI commands with exact expected outputs, the skill can be tested independently of the agent loop:

export HERMES_LAB_MODE=mock
export HERMES_LAB_SCENARIO=messy
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c \
"SELECT mean_exec_time_ms, calls, query FROM pg_stat_statements ORDER BY mean_exec_time_ms DESC"

The output is deterministic. It matches the **Expected output** block in the SKILL.md file. If it does not match, the mock data or the command is wrong — not an LLM reasoning failure.

The RUBRIC.md Tier 1 check: "Every CLI command in a Scripts Zone phase is followed immediately by an **Expected output** block." The expected output blocks are testable contracts, not documentation.
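That contract can be spot-checked mechanically. A rough sketch — it assumes one `### Step` heading per Scripts Zone command, which is an assumption about your skill's layout, and uses an inline sample so the check is runnable as-is:

```shell
# Inline sample standing in for a real SKILL.md (layout assumed).
cat > /tmp/sample-skill.md <<'EOF'
### Step 1.1
psql -c "SELECT 1"
**Expected output**
### Step 1.2
psql -c "SELECT 2"
**Expected output**
EOF

# Every Step heading should be matched by at least one Expected output block.
steps=$(grep -c '^### Step' /tmp/sample-skill.md)
outputs=$(grep -c '\*\*Expected output\*\*' /tmp/sample-skill.md)
[ "$outputs" -ge "$steps" ] && echo "coverage OK ($outputs/$steps)"   # → coverage OK (2/2)
```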

Agents Zone — Bounded Reasoning

Purpose: Reason over the complete dataset collected in Phase 1. Apply the decision tree. Produce a named diagnosis or escalation. No new data collection.

The Agents Zone is where the LLM's reasoning capability is applied — but to a fixed input dataset, not an open-ended exploration.

Why Agents Zone has no CLI commands: Placing CLI commands inside an Agents Zone phase creates a mixed-concern violation:

  • Breaks testability: You can no longer test data collection separately from reasoning
  • Creates feedback loops: An LLM in a reasoning loop that can also run queries will run additional queries when its reasoning is uncertain — the exact failure mode the two-zone design prevents

Two-Zone Example

Scripts Zone (Phase 1, Step 1.3) — correct:

psql -h $DB_HOST -p ${DB_PORT:-5432} -U $DB_USER -d $DB_NAME --csv -c \
"SELECT mean_exec_time_ms, total_exec_time_ms, calls, rows_per_call,
LEFT(query, 200) as query
FROM pg_stat_statements
WHERE mean_exec_time_ms > 1000
ORDER BY mean_exec_time_ms DESC
LIMIT 20"

No interpretation. No branching. Run this, get that.

Agents Zone (Phase 2) — correct:

IF mean_exec_time_ms > 5000:
THEN Diagnosis = "CRITICAL_SLOW_QUERY"
→ Escalate immediately

ELSE IF mean_exec_time_ms > 1000 AND sequential_scan_pct > 80:
THEN Diagnosis = "SLOW_QUERY_INDEX_GAP"
→ Proceed to Phase 3

ELSE IF mean_exec_time_ms > 1000 AND sequential_scan_pct <= 80:
THEN Diagnosis = "SLOW_QUERY_OTHER_CAUSE"

No CLI commands. Only IF/THEN/ELSE logic on data already collected.
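Because the Phase 1 output is a fixed CSV, this decision tree can be exercised offline, no agent required. A sketch with inline sample data; the two-column layout (`mean_exec_time_ms`, `sequential_scan_pct`) is an assumption made for the example:

```shell
# Sample Phase 1 snapshot (column layout assumed for this sketch).
cat > /tmp/phase1.csv <<'EOF'
mean_exec_time_ms,sequential_scan_pct
6200,40
1800,92
1200,30
EOF

# Apply the Phase 2 thresholds row by row; every branch names a diagnosis.
awk -F, 'NR > 1 {
  if      ($1 > 5000)            print "CRITICAL_SLOW_QUERY"
  else if ($1 > 1000 && $2 > 80) print "SLOW_QUERY_INDEX_GAP"
  else if ($1 > 1000)            print "SLOW_QUERY_OTHER_CAUSE"
}' /tmp/phase1.csv
# → CRITICAL_SLOW_QUERY, SLOW_QUERY_INDEX_GAP, SLOW_QUERY_OTHER_CAUSE
```

The same inputs always yield the same diagnoses — which is exactly what lets you test the Agents Zone logic independently of the LLM.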

Mixed violation (Tier 4 FAIL — do not do this):

Phase 2 — Analysis:
Check if the queries are slow:
aws cloudwatch get-metric-statistics ... ← CLI command in Agents Zone!
IF the metrics show high CPU... ← vague condition

3. The Hermes SKILL.md Implementation

YAML Frontmatter Fields

---
name: dba-rds-slow-query
description: "Investigate RDS PostgreSQL slow query performance using pg_stat_statements.
  Use when CloudWatch RDS CPUUtilization alarm fires, application reports slow queries,
  or pg_stat_statements shows queries with mean_time > 1000ms."
version: 1.0.0
compatibility: "aws cli v2, psql, HERMES_LAB_MODE=mock|live"
metadata:
  hermes:
    category: devops
    tags: [rds, postgresql, slow-query, pg-stat-statements, index, performance]
---

description: This is what the skills_search tool queries when an agent needs to locate the right skill. It must answer: "When should I use this skill?" Start with an action verb.

compatibility: Lists required tool versions AND the HERMES_LAB_MODE=mock|live declaration. This field is checked by the Tier 1 rubric.

version: Semantic versioning. Skills can be versioned and updated without changing the agent configuration. Starting at 1.0.0 and incrementing with each revision creates an audit trail.
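A quick presence check for the required fields (a sketch: it assumes the fields sit at column 0 inside the frontmatter block, as in the example above, and writes an inline sample so it is runnable):

```shell
# Inline sample frontmatter (values from the example above).
cat > /tmp/frontmatter.md <<'EOF'
---
name: dba-rds-slow-query
description: "Investigate RDS PostgreSQL slow query performance."
version: 1.0.0
compatibility: "aws cli v2, psql, HERMES_LAB_MODE=mock|live"
---
EOF

# Report any required field that is missing.
for field in name description version compatibility; do
  grep -q "^$field:" /tmp/frontmatter.md || echo "MISSING: $field"
done   # prints nothing when all fields are present
```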

Nine Required Sections (in order)

| Section | Purpose | Rubric Tier |
|---|---|---|
| YAML frontmatter | Metadata for skill loading, discovery, versioning | Tier 1 |
| `## When to Use` | Specific trigger conditions; anti-cases | Tier 1 |
| `## Inputs` | Input table with HERMES_LAB_MODE row | Tier 1 |
| `## Prerequisites` | Tools, permissions, env var setup, mock setup | Tier 1 |
| `## Procedure` | Alternating Scripts Zone / Agents Zone phases | Tier 1 |
| `## Escalation Rules` | Observable triggers + handoff template | Tier 1 |
| `## NEVER DO` | 3+ domain-specific prohibited actions with consequences | Tier 1 |
| `## Rollback Procedure` | Steps to undo Phase 3 changes | Tier 1 |
| `## Verification` | 4+ checkboxes confirming skill run complete | Tier 1 |

The template at course/skills/SKILL-TEMPLATE.md uses [square bracket] placeholder syntax throughout:

# Check for unfilled template placeholders (should return 0 once
# verification checkboxes are excluded; inspect any remaining hits —
# YAML tags and markdown links also legitimately contain "["):
grep -v -- "- \[ \]" skills/my-skill/SKILL.md | grep -c "\["

The agentskills.io Spec Alignment

The SKILL.md format used in this course aligns with the agentskills.io spec published in December 2025. This is a cross-platform standard — skills authored in the SKILL.md format are compatible with any framework that implements the spec (LangGraph, AutoGen, CrewAI, etc.), not just Hermes.


4. RUBRIC.md: Four-Tier Quality Framework

The quality rubric at course/skills/RUBRIC.md has 62 checkboxes organized in four tiers:

Tier 1 — Blockers (must ALL pass before skill can be used)

  • Frontmatter completeness (name, description, version, compatibility, category, tags, YAML delimiters)
  • Section completeness (all 8 required sections present)
  • When to Use quality (specific named triggers, no vague conditions)
  • Inputs table format (includes HERMES_LAB_MODE row)
  • Two-zone design enforcement (SCRIPTS ZONE and AGENTS ZONE labels present)
  • Scripts Zone: CLI commands with expected output blocks (inline, not external references)
  • Agents Zone: decision trees with numeric thresholds, named termination conditions
  • Escalation Rules: 2+ triggers with observable conditions + handoff template
  • NEVER DO: 3+ domain-specific items with stated consequences
  • Rollback Procedure: numbered steps covering Phase 3 changes
  • Verification checklist: 4+ checkboxes
  • Mock mode documentation present

Tier 2 — Quality

Clean two-zone separation throughout. Escalation handoff is copy-paste ready. Expected output matches real API field names. Skill tested end-to-end in mock mode.

Tier 3 — Production-Grade

Messy scenario tested. Mock and live produce equivalent diagnostic decisions. Tested with Haiku-tier model.

Tier 4 — Anti-Patterns (one FAIL disqualifies the skill)

  • Any decision branch ending in "investigate further" without a stopping criterion
  • CLI commands inside an Agents Zone phase
  • Expected output blocks that reference external files instead of inline output
  • Subjective decision conditions ("slow," "high," "elevated" without numeric threshold)
  • AWS field names in camelCase instead of PascalCase (DBInstanceStatus, not dbInstanceStatus)
  • No named diagnosis strings — just prose descriptions

Automated Tier 1 Checks

# All 8 required sections present? (should return 8)
grep -c "## When to Use\|## Inputs\|## Prerequisites\|## Procedure\|## Escalation Rules\|## NEVER DO\|## Rollback Procedure\|## Verification" SKILL.md

# Both zone labels present? (should return 2+)
grep -c "SCRIPTS ZONE\|AGENTS ZONE" SKILL.md

# NEVER DO has 3+ items? (should return 3+)
grep -c "^\- \*\*NEVER\|^- NEVER" SKILL.md

# HERMES_LAB_MODE documented? (should return 1+)
grep -c "HERMES_LAB_MODE" SKILL.md

# Verification has 4+ checkboxes? (should return 4+)
grep -c "\- \[ \]" SKILL.md

# No unfilled placeholders? (should return 0; verification checkboxes
# are excluded — inspect remaining hits such as YAML tags manually)
grep -v -- "- \[ \]" SKILL.md | grep -c "\["
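The individual checks can be bundled into one gate. A sketch as a shell function — the thresholds mirror the greps above, and the demo skeleton is invented so the example is self-contained:

```shell
# Combined Tier 1 gate: returns non-zero on the first failed check.
tier1_gate() {
  f="$1"
  n=$(grep -c '## When to Use\|## Inputs\|## Prerequisites\|## Procedure\|## Escalation Rules\|## NEVER DO\|## Rollback Procedure\|## Verification' "$f")
  [ "$n" -ge 8 ] || { echo "FAIL: sections ($n/8)"; return 1; }
  [ "$(grep -c 'SCRIPTS ZONE\|AGENTS ZONE' "$f")" -ge 2 ] || { echo "FAIL: zone labels"; return 1; }
  grep -q 'HERMES_LAB_MODE' "$f" || { echo "FAIL: mock mode undocumented"; return 1; }
  [ "$(grep -c -- '- \[ \]' "$f")" -ge 4 ] || { echo "FAIL: fewer than 4 checkboxes"; return 1; }
  echo "Tier 1: PASS"
}

# Demo on a minimal skeleton that satisfies every check.
cat > /tmp/skill.md <<'EOF'
## When to Use
## Inputs
HERMES_LAB_MODE
## Prerequisites
## Procedure
[SCRIPTS ZONE]
[AGENTS ZONE]
## Escalation Rules
## NEVER DO
## Rollback Procedure
## Verification
- [ ] check 1
- [ ] check 2
- [ ] check 3
- [ ] check 4
EOF
tier1_gate /tmp/skill.md   # → Tier 1: PASS
```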

5. DevOps Skill Examples

Track A: dba-rds-slow-query Skill Anatomy

Phase 1 — Gather RDS and CloudWatch Data [SCRIPTS ZONE]:

  • Step 1.1: aws rds describe-db-instances — instance status, class, engine version
  • Step 1.2: aws cloudwatch get-metric-statistics — CPUUtilization last 60 minutes
  • Step 1.3: psql -c "SELECT ... FROM pg_stat_statements WHERE mean_exec_time_ms > 1000" — slow query list
  • Step 1.4: psql -c "SELECT ... FROM pg_stat_user_tables" — sequential scan ratios per table

Each step has an **Expected output** block with the exact JSON or CSV format the tool returns.

Phase 2 — Diagnose Root Cause [AGENTS ZONE]:

  • SLOW_QUERY_INDEX_GAP — high mean_exec_time_ms AND > 80% sequential scans
  • CPU_SPIKE_NO_QUERY_MATCH — CPU spike but no query > 1000ms (connection storm)
  • PARAMETER_GROUP_PENDING — PendingModifiedValues.DBParameterGroupName is set
  • NO_ISSUE_FOUND — all metrics within normal range

Every branch terminates at a named string. No open-ended paths.

Track B: cost-anomaly Skill Anatomy

Phase 1 [SCRIPTS ZONE]:

  • aws ce get-cost-and-usage — last 14 days of daily cost grouped by service
  • aws ce get-cost-and-usage — same period, previous month (baseline)
  • aws cloudwatch describe-alarms --alarm-name-prefix "billing-" — billing alarm status

Phase 2 [AGENTS ZONE]:

IF current_day_cost > 1.5x baseline_daily_average
AND specific_service_cost increased > 200%:
THEN Diagnosis = "SERVICE_COST_SPIKE"

IF cost still > 1.2x baseline at day 7 of anomaly:
THEN Diagnosis = "SUSTAINED_ELEVATED_SPEND"
NOTE: Partial resolution — spike not fully resolved

IF all services within 10% of baseline:
THEN Diagnosis = "NO_ANOMALY_CURRENT_PERIOD"
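A collapsed sketch of this tree with illustrative numbers (the per-service >200% condition and the day-7 clock are omitted for brevity, so this is a simplification, not the full Phase 2 logic):

```shell
# Illustrative values; baseline is the prior month's daily average.
baseline_daily_average=100
current_day_cost=210

awk -v c="$current_day_cost" -v b="$baseline_daily_average" 'BEGIN {
  if      (c > 1.5 * b) print "SERVICE_COST_SPIKE"
  else if (c > 1.2 * b) print "SUSTAINED_ELEVATED_SPEND"
  else                  print "NO_ANOMALY_CURRENT_PERIOD"
}'   # → SERVICE_COST_SPIKE
```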

Track C: sre-k8s-pod-health Skill Anatomy

The sre-k8s-pod-health skill exemplifies the read-only escalation model: the agent diagnoses but never remediates. In Phase 1 [SCRIPTS ZONE], four kubectl commands gather pod state: kubectl get pods -o json, kubectl describe pod, kubectl logs --previous, and kubectl top pods. In Phase 2 [AGENTS ZONE], six decision branches cover the K8S-02 failure modes — ImagePullBackOff, CrashLoopBackOff, OOMKilled, Liveness probe failure, CreateContainerConfigError, and Service port mismatch — each ending in a named diagnosis string or escalation. Phase 3 does not exist — the skill escalates directly from Phase 2 diagnosis to structured handoff. This is NOT a limitation of the skill. It is a deliberate governance decision: SRE agents operate with principle of least privilege. Read-only agents can be trusted with continuous monitoring because their blast radius is zero.


6. Quick Reference

Decision Tree Patterns — Numeric vs Subjective

| Acceptable (numeric threshold) | Fails Tier 4 (subjective) |
|---|---|
| `IF CPUUtilization > 80 AND mean_exec_time_ms > 1000` | "IF CPU is high and queries are slow" |
| `IF DBInstanceStatus == "modifying"` | "IF instance seems like it is changing" |
| `IF sequential_scan_pct > 80` | "IF sequential scan ratio is elevated" |
| `IF daily_cost > 1.5 * baseline_daily_average` | "IF costs look unusual compared to normal" |
| All branches terminate at a diagnosis string or escalation | Branch ends: "investigate further" |

Two-Zone Design Summary

| Aspect | Scripts Zone | Agents Zone |
|---|---|---|
| Purpose | Data collection | Reasoning and diagnosis |
| Contains | CLI commands + expected output | IF/THEN/ELSE decision trees |
| Does NOT contain | Prose decisions, IF/THEN logic | CLI commands (aws, kubectl, psql) |
| Is it deterministic? | Yes — same input → same output | No — LLM reasoning varies |
| Is it testable independently? | Yes — run commands, compare expected output | Yes — feed Phase 1 output, verify diagnosis |
| Phase label | `[SCRIPTS ZONE — deterministic]` | `[AGENTS ZONE — reasoning]` |

Skill Anti-Patterns

| Anti-pattern | Problem | Fix |
|---|---|---|
| "Check the usual metrics" | Undefined — agent will hallucinate "usual" | List exact metric names and thresholds |
| "Escalate if needed" | No condition | Define explicit escalation conditions |
| "Restart the service" | No context | Full command: `systemctl restart nginx --host {host}` |
| 1000-line skill covering all scenarios | Context budget exceeded | Split into domain-specific sub-skills |
| CLI command in Agents Zone | Breaks testability; creates feedback loops | Move CLI commands to Scripts Zone |