SKILL.md Authoring Guide: Two-Zone Design
This document explains WHY the SKILL.md format works the way it does. Read this after completing Module 7 labs when you want to understand the reasoning behind what you built.
Companion labs: Module 7 — Agent Skills | Module 7 Reference
1. Why Skills, Not Prompts?
The Limitation of Ad-Hoc Prompting
When you first use an LLM for infrastructure diagnostics, the natural approach is to describe the problem in natural language: "The RDS CPU is high. What should I check?" The model produces a reasonable-sounding checklist. It works, sort of.
The problems appear at scale:
Inconsistency: Ask the same question twice and you get a different procedure. Ask with a slightly different framing and you get a completely different set of checks. You cannot audit whether the agent followed the correct procedure because "the correct procedure" was never defined — it emerged from a natural language prompt each time.
Incompleteness: An LLM answering a vague question does not know which data points are critical for YOUR environment, which pg_stat_statements fields indicate which failure mode in YOUR PostgreSQL version, or which CloudWatch metric thresholds YOU have calibrated. It improvises from training data.
Scope creep: A vague prompt creates a vague agent. "Help me investigate the database" gives the agent latitude to try anything — including things your DBA team has decided are off-limits.
No auditability: After an incident, can you answer "did the agent follow the correct procedure?" With a prompt-based agent: no. With a skill-based agent: yes, you can compare the agent's actions against the SKILL.md procedure step by step.
What Skills Encode
A SKILL.md file encodes five things that an ad-hoc prompt cannot:
- **When to activate.** Specific, observable trigger conditions — not "when the database is slow" but "when CloudWatch alarm `rds-cpu-high` fires on `$RDS_INSTANCE_ID`."
- **What data to gather.** Exact CLI commands with exact expected output.
- **How to reason about that data.** IF/THEN/ELSE decision trees with numeric thresholds.
- **What is forbidden.** A NEVER DO list specific to this domain.
- **When to stop.** Escalation rules with specific triggering conditions.
Skills as Context Engineering Artifacts
A SKILL.md file is a context engineering artifact. When an agent loads a skill, the skill text becomes part of the LLM's context window at the system prompt level. The Brain reasons over:
[SOUL.md — who I am, what I never do]
[SKILL.md — what procedure to follow, what thresholds to apply]
[Tool results — what I have observed so far]
The quality of the agent's diagnostic decisions is directly proportional to the quality of the SKILL.md content. This is why the course teaches SKILL.md authoring as the primary skill, not Python agent code. The code (Hermes) is fixed. The context (SKILL.md) is the variable. Your domain expertise lives in the context, not in the code.
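The layering above can be sketched as a plain prompt-assembly step. This is an illustrative sketch, not Hermes code; `build_context` and its section labels are assumptions about how the pieces are concatenated.

```python
def build_context(soul_md: str, skill_md: str, tool_results: list[str]) -> str:
    """Assemble the system-level context the Brain reasons over.

    Order mirrors the stack above: identity and hard limits first,
    then procedure and thresholds, then observations.
    """
    sections = [
        "[SOUL.md]\n" + soul_md,
        "[SKILL.md]\n" + skill_md,
        "[Tool results]\n" + "\n".join(tool_results),
    ]
    return "\n\n".join(sections)

context = build_context(
    soul_md="NEVER run destructive SQL.",
    skill_md="Phase 1: collect pg_stat_statements output.",
    tool_results=["mean_exec_time_ms=4200 for query 1771"],
)
```

The point of the ordering is auditability: everything the model knew when it decided is one concatenated, inspectable string.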
2. The Two-Zone Design
The Problem the Two Zones Solve
Without the two-zone constraint, agents exhibit a failure mode called mid-loop data discovery:
- Agent starts reasoning over the initial data (high CPU, slow queries visible)
- During reasoning, the agent realizes it needs more data (what's the table size? is there a lock?)
- Agent runs a new query to get that data
- New data reveals a new dimension to the problem
- Agent needs more data to understand the new dimension
- Loop continues — the agent is not converging on a diagnosis
The result: unpredictable session duration, escalating token costs, and diagnoses that vary depending on which data happened to be discovered in which order.
Scripts Zone — Deterministic Data Collection
Purpose: Run all the CLI commands. Collect all the data. No decisions. No interpretation.
The Scripts Zone is idempotent and deterministic. Running Phase 1 twice on the same database produces the same output. There is no branching based on intermediate results.
Because Scripts Zone commands are exact CLI commands with exact expected outputs, the skill can be tested independently of the agent loop:
export HERMES_LAB_MODE=mock
export HERMES_LAB_SCENARIO=messy
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c \
"SELECT mean_exec_time_ms, calls, query FROM pg_stat_statements ORDER BY mean_exec_time_ms DESC"
The output is deterministic. It matches the **Expected output** block in the SKILL.md file. If it does not match, the mock data or the command is wrong — not an LLM reasoning failure.
The RUBRIC.md Tier 1 check: "Every CLI command in a Scripts Zone phase is followed immediately by an **Expected output** block." The expected output blocks are testable contracts, not documentation.
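Because the contract is exact text, a test harness can be as simple as run-and-compare. A minimal sketch, assuming a harness like this exists outside the agent loop; `check_contract` and the stand-in command are illustrative:

```python
import os
import subprocess
import sys

def check_contract(command: list[str], expected_output: str, env: dict) -> bool:
    """Run one Scripts Zone command and compare stdout to its Expected output block."""
    result = subprocess.run(command, capture_output=True, text=True, env=env)
    return result.stdout.strip() == expected_output.strip()

# A stand-in command; a real harness would invoke psql/aws with HERMES_LAB_MODE=mock.
ok = check_contract(
    command=[sys.executable, "-c", "print('mean_exec_time_ms,calls'); print('4200,17')"],
    expected_output="mean_exec_time_ms,calls\n4200,17",
    env={**os.environ, "HERMES_LAB_MODE": "mock", "HERMES_LAB_SCENARIO": "messy"},
)
print(ok)   # True
```

If the comparison fails, the mock data or the command is wrong — no LLM is involved in this test at all.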
Agents Zone — Bounded Reasoning
Purpose: Reason over the complete dataset collected in Phase 1. Apply the decision tree. Produce a named diagnosis or escalation. No new data collection.
The Agents Zone is where the LLM's reasoning capability is applied — but to a fixed input dataset, not an open-ended exploration.
Why Agents Zone has no CLI commands: Placing CLI commands inside an Agents Zone phase creates a mixed-concern violation:
- Breaks testability: You can no longer test data collection separately from reasoning
- Creates feedback loops: An LLM in a reasoning loop that can also run queries will run additional queries when its reasoning is uncertain — the exact failure mode the two-zone design prevents
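The separation can be made structural: give Phase 1 the executor and give Phase 2 only the frozen dataset. A sketch with hypothetical function names (`run_scripts_zone` and `run_agents_zone` are not Hermes API):

```python
def run_scripts_zone(commands, executor):
    """Phase 1: run every command in order, no branching, return a frozen dataset."""
    return {cmd: executor(cmd) for cmd in commands}

def run_agents_zone(dataset, decide):
    """Phase 2: decide() receives only the dataset. No executor is in scope,
    so uncertain reasoning cannot trigger new queries mid-loop."""
    return decide(dataset)

def mock_executor(cmd):
    return {"pg_stat_statements": "mean_exec_time_ms=4200"}.get(cmd, "")

dataset = run_scripts_zone(["pg_stat_statements"], mock_executor)
diagnosis = run_agents_zone(
    dataset,
    lambda d: "SLOW_QUERY_FOUND" if "4200" in d["pg_stat_statements"] else "NO_ISSUE_FOUND",
)
print(diagnosis)   # SLOW_QUERY_FOUND
```

The feedback loop is prevented by scoping, not by instructions: Phase 2 simply has no handle to run anything.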
Two-Zone Example
Scripts Zone (Phase 1, Step 1.3) — correct:
psql -h $DB_HOST -p ${DB_PORT:-5432} -U $DB_USER -d $DB_NAME --csv -c \
"SELECT mean_exec_time_ms, total_exec_time_ms, calls, rows_per_call,
LEFT(query, 200) as query
FROM pg_stat_statements
WHERE mean_exec_time_ms > 1000
ORDER BY mean_exec_time_ms DESC
LIMIT 20"
No interpretation. No branching. Run this, get that.
Agents Zone (Phase 2) — correct:
IF mean_exec_time_ms > 5000:
THEN Diagnosis = "CRITICAL_SLOW_QUERY"
→ Escalate immediately
ELSE IF mean_exec_time_ms > 1000 AND sequential_scan_pct > 80:
THEN Diagnosis = "SLOW_QUERY_INDEX_GAP"
→ Proceed to Phase 3
ELSE IF mean_exec_time_ms > 1000 AND sequential_scan_pct <= 80:
THEN Diagnosis = "SLOW_QUERY_OTHER_CAUSE"
No CLI commands. Only IF/THEN/ELSE logic on data already collected.
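Because the Agents Zone is pure logic over already-collected data, the tree above can be rendered as a pure function and unit-tested. This Python rendering is a sketch; the final `NO_ISSUE_FOUND` branch is added here so every input terminates, matching the Track A diagnosis list later in this guide.

```python
def diagnose(mean_exec_time_ms: float, sequential_scan_pct: float) -> str:
    """Phase 2 decision tree: every branch terminates at a named diagnosis string."""
    if mean_exec_time_ms > 5000:
        return "CRITICAL_SLOW_QUERY"     # escalate immediately
    if mean_exec_time_ms > 1000 and sequential_scan_pct > 80:
        return "SLOW_QUERY_INDEX_GAP"    # proceed to Phase 3
    if mean_exec_time_ms > 1000:
        return "SLOW_QUERY_OTHER_CAUSE"
    return "NO_ISSUE_FOUND"              # added so the function is total
```

Feeding it canned Phase 1 outputs is exactly how the Agents Zone is tested independently of data collection.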
Mixed violation (Tier 4 FAIL — do not do this):
Phase 2 — Analysis:
Check if the queries are slow:
aws cloudwatch get-metric-statistics ... ← CLI command in Agents Zone!
IF the metrics show high CPU... ← vague condition
3. The Hermes SKILL.md Implementation
YAML Frontmatter Fields
---
name: dba-rds-slow-query
description: "Investigate RDS PostgreSQL slow query performance using pg_stat_statements.
  Use when CloudWatch RDS CPUUtilization alarm fires, application reports slow queries,
  or pg_stat_statements shows queries with mean_time > 1000ms."
version: 1.0.0
compatibility: "aws cli v2, psql, HERMES_LAB_MODE=mock|live"
metadata:
  hermes:
    category: devops
    tags: [rds, postgresql, slow-query, pg-stat-statements, index, performance]
---
- `description`: This is what the `skills_search` tool queries when an agent needs to locate the right skill. It must answer: "When should I use this skill?" Start with an action verb.
- `compatibility`: Lists required tool versions AND the `HERMES_LAB_MODE=mock|live` declaration. This field is checked by the Tier 1 rubric.
- `version`: Semantic versioning. Skills can be versioned and updated without changing the agent configuration. Starting at 1.0.0 and incrementing with each revision creates an audit trail.
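A loader can enforce the required frontmatter fields in a few lines. A sketch using a naive regex scan rather than a full YAML parser; the field list is taken from the Tier 1 rubric, everything else is illustrative:

```python
import re

REQUIRED_FIELDS = ("name", "description", "version", "compatibility")

def check_frontmatter(skill_md: str) -> list[str]:
    """Return the required top-level frontmatter fields that are missing."""
    match = re.match(r"^---\n(.*?)\n---", skill_md, re.DOTALL)
    if not match:
        return list(REQUIRED_FIELDS)
    body = match.group(1)
    return [f for f in REQUIRED_FIELDS
            if not re.search(rf"^{f}:", body, re.MULTILINE)]

sample = "---\nname: dba-rds-slow-query\nversion: 1.0.0\n---\n## When to Use\n"
missing = check_frontmatter(sample)   # ['description', 'compatibility']
```

A real loader would also validate the nested `metadata.hermes` keys; a regex scan only covers the flat top-level fields.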
Nine Required Sections (in order)
| Section | Purpose | Rubric Tier |
|---|---|---|
| YAML frontmatter | Metadata for skill loading, discovery, versioning | Tier 1 |
| `## When to Use` | Specific trigger conditions; anti-cases | Tier 1 |
| `## Inputs` | Input table with HERMES_LAB_MODE row | Tier 1 |
| `## Prerequisites` | Tools, permissions, env var setup, mock setup | Tier 1 |
| `## Procedure` | Alternating Scripts Zone / Agents Zone phases | Tier 1 |
| `## Escalation Rules` | Observable triggers + handoff template | Tier 1 |
| `## NEVER DO` | 3+ domain-specific prohibited actions with consequences | Tier 1 |
| `## Rollback Procedure` | Steps to undo Phase 3 changes | Tier 1 |
| `## Verification` | 4+ checkboxes confirming skill run complete | Tier 1 |
The template at course/skills/SKILL-TEMPLATE.md uses [square bracket] placeholder syntax throughout:
# Check for unfilled placeholders (should return 0):
grep -c "\[" skills/my-skill/SKILL.md
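One caveat with counting raw `[` characters: a finished skill still contains brackets in its verification checkboxes and frontmatter tags list. A more forgiving sketch, where the exclusion rules are assumptions about the template rather than part of the rubric:

```python
import re

def unfilled_placeholders(skill_md: str) -> list[str]:
    """Find [square bracket] placeholders, ignoring checkboxes and YAML tag lists."""
    hits = []
    for line in skill_md.splitlines():
        if re.match(r"\s*- \[[ xX]\]", line):   # verification checkbox
            continue
        if re.match(r"\s*tags:", line):         # frontmatter tags list
            continue
        hits.extend(re.findall(r"\[[^\]\[]+\]", line))
    return hits

doc = "tags: [rds, index]\n- [ ] checked output\nInstance: [YOUR_INSTANCE_ID]\n"
print(unfilled_placeholders(doc))   # ['[YOUR_INSTANCE_ID]']
```

The goal is the same as the grep: a finished skill should report zero unfilled placeholders.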
The agentskills.io Spec Alignment
The SKILL.md format used in this course aligns with the agentskills.io spec published in December 2025. This is a cross-platform standard — skills authored in the SKILL.md format are compatible with any framework that implements the spec (LangGraph, AutoGen, CrewAI, etc.), not just Hermes.
4. RUBRIC.md: Four-Tier Quality Framework
The quality rubric at course/skills/RUBRIC.md has 62 checkboxes organized in four tiers:
Tier 1 — Blockers (must ALL pass before skill can be used)
- Frontmatter completeness (name, description, version, compatibility, category, tags, YAML delimiters)
- Section completeness (all 8 required sections present)
- When to Use quality (specific named triggers, no vague conditions)
- Inputs table format (includes `HERMES_LAB_MODE` row)
- Two-zone design enforcement (`SCRIPTS ZONE` and `AGENTS ZONE` labels present)
- Scripts Zone: CLI commands with expected output blocks (inline, not external references)
- Agents Zone: decision trees with numeric thresholds, named termination conditions
- Escalation Rules: 2+ triggers with observable conditions + handoff template
- NEVER DO: 3+ domain-specific items with stated consequences
- Rollback Procedure: numbered steps covering Phase 3 changes
- Verification checklist: 4+ checkboxes
- Mock mode documentation present
Tier 2 — Quality
Clean two-zone separation throughout. Escalation handoff is copy-paste ready. Expected output matches real API field names. Skill tested end-to-end in mock mode.
Tier 3 — Production-Grade
Messy scenario tested. Mock and live modes produce equivalent diagnostic decisions. Tested with a Haiku-tier model.
Tier 4 — Anti-Patterns (one FAIL disqualifies the skill)
- Any decision branch ending in "investigate further" without a stopping criterion
- CLI commands inside an Agents Zone phase
- Expected output blocks that reference external files instead of inline output
- Subjective decision conditions ("slow," "high," "elevated" without numeric threshold)
- AWS field names in camelCase instead of PascalCase (`DBInstanceStatus`, not `dbInstanceStatus`)
- No named diagnosis strings, just prose descriptions
Automated Tier 1 Checks
# All 8 required sections present? (should return 8)
grep -c "## When to Use\|## Inputs\|## Prerequisites\|## Procedure\|## Escalation Rules\|## NEVER DO\|## Rollback Procedure\|## Verification" SKILL.md
# Both zone labels present? (should return 2+)
grep -c "SCRIPTS ZONE\|AGENTS ZONE" SKILL.md
# NEVER DO has 3+ items? (should return 3+)
grep -c "^\- \*\*NEVER\|^- NEVER" SKILL.md
# HERMES_LAB_MODE documented? (should return 1+)
grep -c "HERMES_LAB_MODE" SKILL.md
# Verification has 4+ checkboxes? (should return 4+)
grep -c "\- \[ \]" SKILL.md
# No unfilled placeholders? (should return 0)
grep -c "\[" SKILL.md
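The same checks can be bundled into a single pass/fail report. A sketch with the grep patterns translated to Python regexes; the minimum counts are taken from the rubric text above, the harness itself is illustrative:

```python
import re

TIER1_CHECKS = {
    "8 required sections": (r"^## (When to Use|Inputs|Prerequisites|Procedure|"
                            r"Escalation Rules|NEVER DO|Rollback Procedure|Verification)", 8),
    "both zone labels":     (r"(SCRIPTS|AGENTS) ZONE", 2),
    "3+ NEVER DO items":    (r"^- (\*\*)?NEVER", 3),
    "mock mode documented": (r"HERMES_LAB_MODE", 1),
    "4+ checkboxes":        (r"- \[ \]", 4),
}

def tier1_report(skill_md: str) -> dict[str, bool]:
    """True for each Tier 1 check whose minimum line-match count is met."""
    lines = skill_md.splitlines()
    return {
        name: sum(bool(re.search(pattern, line)) for line in lines) >= minimum
        for name, (pattern, minimum) in TIER1_CHECKS.items()
    }
```

A skill passes Tier 1 automation only when every value in the report is True; like the greps, this catches structure, not quality.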
5. DevOps Skill Examples
Track A: dba-rds-slow-query Skill Anatomy
Phase 1 — Gather RDS and CloudWatch Data [SCRIPTS ZONE]:
- Step 1.1: `aws rds describe-db-instances` — instance status, class, engine version
- Step 1.2: `aws cloudwatch get-metric-statistics` — CPUUtilization last 60 minutes
- Step 1.3: `psql -c "SELECT ... FROM pg_stat_statements WHERE mean_exec_time_ms > 1000"` — slow query list
- Step 1.4: `psql -c "SELECT ... FROM pg_stat_user_tables"` — sequential scan ratios per table
Each step has an **Expected output** block with the exact JSON or CSV format the tool returns.
Phase 2 — Diagnose Root Cause [AGENTS ZONE]:
- `SLOW_QUERY_INDEX_GAP` — high mean_exec_time_ms AND > 80% sequential scans
- `CPU_SPIKE_NO_QUERY_MATCH` — CPU spike but no query > 1000ms (connection storm)
- `PARAMETER_GROUP_PENDING` — `PendingModifiedValues.DBParameterGroupName` set
- `NO_ISSUE_FOUND` — all metrics within normal range
Every branch terminates at a named string. No open-ended paths.
Track B: cost-anomaly Skill Anatomy
Phase 1 [SCRIPTS ZONE]:
- `aws ce get-cost-and-usage` — last 14 days of daily cost grouped by service
- `aws ce get-cost-and-usage` — same period, previous month (baseline)
- `aws cloudwatch describe-alarms --alarm-name-prefix "billing-"` — billing alarm status
Phase 2 [AGENTS ZONE]:
IF current_day_cost > 1.5x baseline_daily_average
  AND specific_service_cost increased > 200%:
THEN Diagnosis = "SERVICE_COST_SPIKE"
IF cost still > 1.2x baseline at day 7 of anomaly:
THEN Diagnosis = "SUSTAINED_ELEVATED_SPEND"
NOTE: Partial resolution — spike not fully resolved
IF all services within 10% of baseline:
THEN Diagnosis = "NO_ANOMALY_CURRENT_PERIOD"
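As with Track A, this tree reduces to a pure function over Phase 1 data. A sketch: the per-service 200% condition is passed in precomputed, the 10% band stands in for "all services within 10% of baseline," and the final `COST_PATTERN_UNCLASSIFIED` fallback is an assumption added so every path terminates.

```python
def diagnose_cost(current_day_cost: float, baseline_daily_average: float,
                  service_increase_pct: float, days_elevated: int) -> str:
    """Phase 2 cost tree: thresholds mirror the 1.5x / 200% / 1.2x / 10% rules above."""
    if (current_day_cost > 1.5 * baseline_daily_average
            and service_increase_pct > 200):
        return "SERVICE_COST_SPIKE"
    if days_elevated >= 7 and current_day_cost > 1.2 * baseline_daily_average:
        return "SUSTAINED_ELEVATED_SPEND"
    if abs(current_day_cost - baseline_daily_average) <= 0.10 * baseline_daily_average:
        return "NO_ANOMALY_CURRENT_PERIOD"
    return "COST_PATTERN_UNCLASSIFIED"   # fallback added so no branch is open-ended
```

Every numeric threshold in the function traces back to a line in the SKILL.md tree, which is what makes the reasoning auditable.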
Track C: sre-k8s-pod-health Skill Anatomy
The sre-k8s-pod-health skill exemplifies the read-only escalation model: the agent diagnoses but never remediates.

Phase 1 [SCRIPTS ZONE] — four kubectl commands gather pod state:
- `kubectl get pods -o json`
- `kubectl describe pod`
- `kubectl logs --previous`
- `kubectl top pods`

Phase 2 [AGENTS ZONE] — six decision branches cover the K8S-02 failure modes (ImagePullBackOff, CrashLoopBackOff, OOMKilled, Liveness probe failure, CreateContainerConfigError, and Service port mismatch), each ending in a named diagnosis string or escalation.

There is no Phase 3: the skill escalates directly from the Phase 2 diagnosis to a structured handoff. This is NOT a limitation of the skill. It is a deliberate governance decision: SRE agents operate with the principle of least privilege. Read-only agents can be trusted with continuous monitoring because their blast radius is zero.
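The Phase 2 classification can be sketched as a pure function over the `kubectl get pods -o json` output. The JSON paths follow the Kubernetes pod status schema, but the diagnosis strings and the `NAMED` mapping here are illustrative, not the skill's actual definitions:

```python
import json

NAMED = {"ImagePullBackOff": "IMAGE_PULL_BACKOFF",
         "CrashLoopBackOff": "CRASH_LOOP_BACKOFF",
         "CreateContainerConfigError": "CREATE_CONTAINER_CONFIG_ERROR"}

def classify_pod(pod: dict) -> str:
    """Map container waiting/terminated reasons to a named diagnosis.
    Read-only: this function never calls kubectl, it only reads Phase 1 data."""
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in NAMED:
            return NAMED[waiting["reason"]]
        terminated = status.get("lastState", {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            return "OOM_KILLED"
    return "ESCALATE_UNRECOGNIZED_STATE"

raw = json.dumps({"status": {"containerStatuses": [
    {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}})
print(classify_pod(json.loads(raw)))   # CRASH_LOOP_BACKOFF
```

Note the terminal fallback: an unrecognized pod state escalates rather than prompting further exploration, which is the read-only model in miniature.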
6. Quick Reference
Decision Tree Patterns — Numeric vs Subjective
| Acceptable (Numeric threshold) | Fails Tier 4 (Subjective) |
|---|---|
| `IF CPUUtilization > 80 AND mean_exec_time_ms > 1000` | "IF CPU is high and queries are slow" |
| `IF DBInstanceStatus == "modifying"` | "IF instance seems like it is changing" |
| `IF sequential_scan_pct > 80` | "IF sequential scan ratio is elevated" |
| `IF daily_cost > 1.5 * baseline_daily_average` | "IF costs look unusual compared to normal" |
| All branches terminate at diagnosis string or escalation | Branch ends: "investigate further" |
Two-Zone Design Summary
| Aspect | Scripts Zone | Agents Zone |
|---|---|---|
| Purpose | Data collection | Reasoning and diagnosis |
| Contains | CLI commands + expected output | IF/THEN/ELSE decision trees |
| Does NOT contain | Prose decisions, IF/THEN logic | CLI commands (aws, kubectl, psql) |
| Is it deterministic? | Yes — same input → same output | No — LLM reasoning varies |
| Is it testable independently? | Yes — run commands, compare expected output | Yes — feed Phase 1 output, verify diagnosis |
| Phase label | [SCRIPTS ZONE — deterministic] | [AGENTS ZONE — reasoning] |
Skill Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| "Check the usual metrics" | Undefined — agent will hallucinate "usual" | List exact metric names and thresholds |
| "Escalate if needed" | No condition | Define explicit escalation conditions |
| "Restart the service" | No context | Full command: `systemctl restart nginx --host {host}` |
| 1000-line skill covering all scenarios | Context budget exceeded | Split into domain-specific sub-skills |
| CLI command in Agents Zone | Breaks testability; creates feedback loops | Move CLI commands to Scripts Zone |