SKILL.md Authoring Guide: Two-Zone Design

Reference Document

This document explains WHY the SKILL.md format works the way it does. Read this after completing Module 7 labs when you want to understand the reasoning behind what you built.

Companion labs: Module 7 — Agent Skills | Module 7 Reference


1. Why Skills, Not Prompts?

The Limitation of Ad-Hoc Prompting

When you first use an LLM for infrastructure diagnostics, the natural approach is to describe the problem in natural language: "The RDS CPU is high. What should I check?" The model produces a reasonable-sounding checklist. It works, sort of.

The problems appear at scale:

Inconsistency: Ask the same question twice and you get a different procedure. Ask with a slightly different framing and you get a completely different set of checks. You cannot audit whether the agent followed the correct procedure because "the correct procedure" was never defined — it emerged from a natural language prompt each time.

Incompleteness: An LLM answering a vague question does not know which data points are critical for YOUR environment, which pg_stat_statements fields indicate which failure mode in YOUR PostgreSQL version, or which CloudWatch metric thresholds YOU have calibrated. It improvises from training data.

Scope creep: A vague prompt creates a vague agent. "Help me investigate the database" gives the agent latitude to try anything — including things your DBA team has decided are off-limits.

No auditability: After an incident, can you answer "did the agent follow the correct procedure?" With a prompt-based agent: no. With a skill-based agent: yes, you can compare the agent's actions against the SKILL.md procedure step by step.

What Skills Encode

A SKILL.md file encodes five things that an ad-hoc prompt cannot:

  1. When to activate. Specific, observable trigger conditions — not "when the database is slow" but "when CloudWatch alarm rds-cpu-high fires on $RDS_INSTANCE_ID."

  2. What data to gather. Exact CLI commands with exact expected output.

  3. How to reason about that data. IF/THEN/ELSE decision trees with numeric thresholds.

  4. What is forbidden. A NEVER DO list specific to this domain.

  5. When to stop. Escalation rules with specific triggering conditions.

Skills as Context Engineering Artifacts

A SKILL.md file is a context engineering artifact. When an agent loads a skill, the skill text becomes part of the LLM's context window at the system prompt level. The Brain reasons over:

[SOUL.md — who I am, what I never do]
[SKILL.md — what procedure to follow, what thresholds to apply]
[Tool results — what I have observed so far]

The quality of the agent's diagnostic decisions is directly proportional to the quality of the SKILL.md content. This is why the course teaches SKILL.md authoring as the primary skill, not Python agent code. The code (Hermes) is fixed. The context (SKILL.md) is the variable. Your domain expertise lives in the context, not in the code.
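The layering described above can be sketched as plain concatenation — a toy illustration only, assuming each layer is a file (the names and `/tmp` paths are invented for this example, not Hermes internals):

```shell
# Sketch: assemble the context window in layer order (file names illustrative).
mkdir -p /tmp/ctx
printf '%s\n' "I never run destructive SQL."           > /tmp/ctx/SOUL.md
printf '%s\n' "IF mean_exec_time_ms > 5000: escalate." > /tmp/ctx/SKILL.md
printf '%s\n' "mean_exec_time_ms=6200"                 > /tmp/ctx/tool_results.txt

# Identity layer first, then the active procedure, then observations.
cat /tmp/ctx/SOUL.md /tmp/ctx/SKILL.md /tmp/ctx/tool_results.txt > /tmp/ctx/context.txt
```

The order matters: constraints and procedure precede observations, so the Brain reasons about new data against rules it has already loaded.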


2. The Two-Zone Design

The Problem the Two Zones Solve

Without the two-zone constraint, agents exhibit a failure mode called mid-loop data discovery:

  1. Agent starts reasoning over the initial data (high CPU, slow queries visible)
  2. During reasoning, the agent realizes it needs more data (what's the table size? is there a lock?)
  3. Agent runs a new query to get that data
  4. New data reveals a new dimension to the problem
  5. Agent needs more data to understand the new dimension
  6. Loop continues — the agent is not converging on a diagnosis

The result: unpredictable session duration, escalating token costs, and a diagnosis that arrived at different conclusions depending on what data happened to be discovered in what order.
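The two-zone fix is a hard contract: collect everything once, then diagnose as a pure function of that snapshot. A minimal sketch — the function names and the stand-in value are invented for illustration:

```shell
# Phase 1 — deterministic collection: no branching, no interpretation.
collect() {
  echo "4200"   # stand-in for mean_exec_time_ms from pg_stat_statements
}

# Phase 2 — a pure function of the Phase 1 snapshot: no new queries allowed.
diagnose() {
  if   [ "$1" -gt 5000 ]; then echo "CRITICAL_SLOW_QUERY"
  elif [ "$1" -gt 1000 ]; then echo "SLOW_QUERY_OTHER_CAUSE"
  else                         echo "NO_ISSUE_FOUND"
  fi
}

snapshot=$(collect)
diagnose "$snapshot"   # → SLOW_QUERY_OTHER_CAUSE
```

Because `diagnose` cannot call `collect`, the session length is bounded by construction: one collection pass, one reasoning pass, done.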

Scripts Zone — Deterministic Data Collection

Purpose: Run all the CLI commands. Collect all the data. No decisions. No interpretation.

The Scripts Zone is idempotent and deterministic. Running Phase 1 twice on the same database produces the same output. There is no branching based on intermediate results.

Because Scripts Zone commands are exact CLI commands with exact expected outputs, the skill can be tested independently of the agent loop:

export HERMES_LAB_MODE=mock
export HERMES_LAB_SCENARIO=messy
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c \
"SELECT mean_exec_time_ms, calls, query FROM pg_stat_statements ORDER BY mean_exec_time_ms DESC"

The output is deterministic. It matches the **Expected output** block in the SKILL.md file. If it does not match, the mock data or the command is wrong — not an LLM reasoning failure.

The RUBRIC.md Tier 1 check: "Every CLI command in a Scripts Zone phase is followed immediately by an **Expected output** block." The expected output blocks are testable contracts, not documentation.
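That contract can be spot-checked mechanically. A rough sketch — it assumes one `### Step` heading per Scripts Zone command, which is an assumption about your skill's layout, and uses an inline sample so the check is runnable as-is:

```shell
# Inline sample standing in for a real SKILL.md (layout assumed).
cat > /tmp/sample-skill.md <<'EOF'
### Step 1.1
psql -c "SELECT 1"
**Expected output**
### Step 1.2
psql -c "SELECT 2"
**Expected output**
EOF

# Every Step heading should be matched by at least one Expected output block.
steps=$(grep -c '^### Step' /tmp/sample-skill.md)
outputs=$(grep -c '\*\*Expected output\*\*' /tmp/sample-skill.md)
[ "$outputs" -ge "$steps" ] && echo "coverage OK ($outputs/$steps)"   # → coverage OK (2/2)
```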

Agents Zone — Bounded Reasoning

Purpose: Reason over the complete dataset collected in Phase 1. Apply the decision tree. Produce a named diagnosis or escalation. No new data collection.

The Agents Zone is where the LLM's reasoning capability is applied — but to a fixed input dataset, not an open-ended exploration.

Why Agents Zone has no CLI commands: Placing CLI commands inside an Agents Zone phase creates a mixed-concern violation:

  • Breaks testability: You can no longer test data collection separately from reasoning
  • Creates feedback loops: An LLM in a reasoning loop that can also run queries will run additional queries when its reasoning is uncertain — the exact failure mode the two-zone design prevents

Two-Zone Example

Scripts Zone (Phase 1, Step 1.3) — correct:

psql -h $DB_HOST -p ${DB_PORT:-5432} -U $DB_USER -d $DB_NAME --csv -c \
"SELECT mean_exec_time_ms, total_exec_time_ms, calls, rows_per_call,
LEFT(query, 200) as query
FROM pg_stat_statements
WHERE mean_exec_time_ms > 1000
ORDER BY mean_exec_time_ms DESC
LIMIT 20"

No interpretation. No branching. Run this, get that.

Agents Zone (Phase 2) — correct:

IF mean_exec_time_ms > 5000:
THEN Diagnosis = "CRITICAL_SLOW_QUERY"
→ Escalate immediately

ELSE IF mean_exec_time_ms > 1000 AND sequential_scan_pct > 80:
THEN Diagnosis = "SLOW_QUERY_INDEX_GAP"
→ Proceed to Phase 3

ELSE IF mean_exec_time_ms > 1000 AND sequential_scan_pct <= 80:
THEN Diagnosis = "SLOW_QUERY_OTHER_CAUSE"

No CLI commands. Only IF/THEN/ELSE logic on data already collected.
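Because the Phase 1 output is a fixed CSV, this decision tree can be exercised offline, no agent required. A sketch with inline sample data; the two-column layout (`mean_exec_time_ms`, `sequential_scan_pct`) is an assumption made for the example:

```shell
# Sample Phase 1 snapshot (column layout assumed for this sketch).
cat > /tmp/phase1.csv <<'EOF'
mean_exec_time_ms,sequential_scan_pct
6200,40
1800,92
1200,30
EOF

# Apply the Phase 2 thresholds row by row; every branch names a diagnosis.
awk -F, 'NR > 1 {
  if      ($1 > 5000)            print "CRITICAL_SLOW_QUERY"
  else if ($1 > 1000 && $2 > 80) print "SLOW_QUERY_INDEX_GAP"
  else if ($1 > 1000)            print "SLOW_QUERY_OTHER_CAUSE"
}' /tmp/phase1.csv
# → CRITICAL_SLOW_QUERY, SLOW_QUERY_INDEX_GAP, SLOW_QUERY_OTHER_CAUSE
```

The same inputs always yield the same diagnoses — which is exactly what lets you test the Agents Zone logic independently of the LLM.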

Mixed violation (Tier 4 FAIL — do not do this):

Phase 2 — Analysis:
Check if the queries are slow:
aws cloudwatch get-metric-statistics ... ← CLI command in Agents Zone!
IF the metrics show high CPU... ← vague condition

3. The Hermes SKILL.md Implementation

YAML Frontmatter Fields

---
name: dba-rds-slow-query
description: "Investigate RDS PostgreSQL slow query performance using pg_stat_statements.
  Use when CloudWatch RDS CPUUtilization alarm fires, application reports slow queries,
  or pg_stat_statements shows queries with mean_time > 1000ms."
version: 1.0.0
compatibility: "aws cli v2, psql, HERMES_LAB_MODE=mock|live"
metadata:
  hermes:
    category: devops
    tags: [rds, postgresql, slow-query, pg-stat-statements, index, performance]
---

description: This is what the skills_search tool queries when an agent needs to locate the right skill. It must answer: "When should I use this skill?" Start with an action verb.

compatibility: Lists required tool versions AND the HERMES_LAB_MODE=mock|live declaration. This field is checked by the Tier 1 rubric.

version: Semantic versioning. Skills can be versioned and updated without changing the agent configuration. Starting at 1.0.0 and incrementing with each revision creates an audit trail.
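A quick presence check for the required fields (a sketch: it assumes the fields sit at column 0 inside the frontmatter block, as in the example above, and writes an inline sample so it is runnable):

```shell
# Inline sample frontmatter (values from the example above).
cat > /tmp/frontmatter.md <<'EOF'
---
name: dba-rds-slow-query
description: "Investigate RDS PostgreSQL slow query performance."
version: 1.0.0
compatibility: "aws cli v2, psql, HERMES_LAB_MODE=mock|live"
---
EOF

# Report any required field that is missing.
for field in name description version compatibility; do
  grep -q "^$field:" /tmp/frontmatter.md || echo "MISSING: $field"
done   # prints nothing when all fields are present
```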

Nine Required Sections (in order)

| Section | Purpose | Rubric Tier |
|---|---|---|
| YAML frontmatter | Metadata for skill loading, discovery, versioning | Tier 1 |
| `## When to Use` | Specific trigger conditions; anti-cases | Tier 1 |
| `## Inputs` | Input table with HERMES_LAB_MODE row | Tier 1 |
| `## Prerequisites` | Tools, permissions, env var setup, mock setup | Tier 1 |
| `## Procedure` | Alternating Scripts Zone / Agents Zone phases | Tier 1 |
| `## Escalation Rules` | Observable triggers + handoff template | Tier 1 |
| `## NEVER DO` | 3+ domain-specific prohibited actions with consequences | Tier 1 |
| `## Rollback Procedure` | Steps to undo Phase 3 changes | Tier 1 |
| `## Verification` | 4+ checkboxes confirming skill run complete | Tier 1 |

The template at course/skills/SKILL-TEMPLATE.md uses [square bracket] placeholder syntax throughout:

# Check for unfilled template placeholders (should return 0 once
# verification checkboxes are excluded; inspect any remaining hits —
# YAML tags and markdown links also legitimately contain "["):
grep -v -- "- \[ \]" skills/my-skill/SKILL.md | grep -c "\["

The agentskills.io Spec Alignment

The SKILL.md format used in this course aligns with the agentskills.io spec published in December 2025. This is a cross-platform standard — skills authored in the SKILL.md format are compatible with any framework that implements the spec (LangGraph, AutoGen, CrewAI, etc.), not just Hermes.


4. RUBRIC.md: Four-Tier Quality Framework

The quality rubric at course/skills/RUBRIC.md has 62 checkboxes organized in four tiers:

Tier 1 — Blockers (must ALL pass before skill can be used)

  • Frontmatter completeness (name, description, version, compatibility, category, tags, YAML delimiters)
  • Section completeness (all 8 required sections present)
  • When to Use quality (specific named triggers, no vague conditions)
  • Inputs table format (includes HERMES_LAB_MODE row)
  • Two-zone design enforcement (SCRIPTS ZONE and AGENTS ZONE labels present)
  • Scripts Zone: CLI commands with expected output blocks (inline, not external references)
  • Agents Zone: decision trees with numeric thresholds, named termination conditions
  • Escalation Rules: 2+ triggers with observable conditions + handoff template
  • NEVER DO: 3+ domain-specific items with stated consequences
  • Rollback Procedure: numbered steps covering Phase 3 changes
  • Verification checklist: 4+ checkboxes
  • Mock mode documentation present

Tier 2 — Quality

Clean two-zone separation throughout. Escalation handoff is copy-paste ready. Expected output matches real API field names. Skill tested end-to-end in mock mode.

Tier 3 — Production-Grade

Messy scenario tested. Mock and live produce equivalent diagnostic decisions. Tested with Haiku-tier model.

Tier 4 — Anti-Patterns (one FAIL disqualifies the skill)

  • Any decision branch ending in "investigate further" without a stopping criterion
  • CLI commands inside an Agents Zone phase
  • Expected output blocks that reference external files instead of inline output
  • Subjective decision conditions ("slow," "high," "elevated" without numeric threshold)
  • AWS field names in camelCase instead of PascalCase (DBInstanceStatus, not dbInstanceStatus)
  • No named diagnosis strings — just prose descriptions

Automated Tier 1 Checks

# All 8 required sections present? (should return 8)
grep -c "## When to Use\|## Inputs\|## Prerequisites\|## Procedure\|## Escalation Rules\|## NEVER DO\|## Rollback Procedure\|## Verification" SKILL.md

# Both zone labels present? (should return 2+)
grep -c "SCRIPTS ZONE\|AGENTS ZONE" SKILL.md

# NEVER DO has 3+ items? (should return 3+)
grep -c "^\- \*\*NEVER\|^- NEVER" SKILL.md

# HERMES_LAB_MODE documented? (should return 1+)
grep -c "HERMES_LAB_MODE" SKILL.md

# Verification has 4+ checkboxes? (should return 4+)
grep -c "\- \[ \]" SKILL.md

# No unfilled placeholders? (should return 0; verification checkboxes
# are excluded — inspect remaining hits such as YAML tags manually)
grep -v -- "- \[ \]" SKILL.md | grep -c "\["
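The individual checks can be bundled into one gate. A sketch as a shell function — the thresholds mirror the greps above, and the demo skeleton is invented so the example is self-contained:

```shell
# Combined Tier 1 gate: returns non-zero on the first failed check.
tier1_gate() {
  f="$1"
  n=$(grep -c '## When to Use\|## Inputs\|## Prerequisites\|## Procedure\|## Escalation Rules\|## NEVER DO\|## Rollback Procedure\|## Verification' "$f")
  [ "$n" -ge 8 ] || { echo "FAIL: sections ($n/8)"; return 1; }
  [ "$(grep -c 'SCRIPTS ZONE\|AGENTS ZONE' "$f")" -ge 2 ] || { echo "FAIL: zone labels"; return 1; }
  grep -q 'HERMES_LAB_MODE' "$f" || { echo "FAIL: mock mode undocumented"; return 1; }
  [ "$(grep -c -- '- \[ \]' "$f")" -ge 4 ] || { echo "FAIL: fewer than 4 checkboxes"; return 1; }
  echo "Tier 1: PASS"
}

# Demo on a minimal skeleton that satisfies every check.
cat > /tmp/skill.md <<'EOF'
## When to Use
## Inputs
HERMES_LAB_MODE
## Prerequisites
## Procedure
[SCRIPTS ZONE]
[AGENTS ZONE]
## Escalation Rules
## NEVER DO
## Rollback Procedure
## Verification
- [ ] check 1
- [ ] check 2
- [ ] check 3
- [ ] check 4
EOF
tier1_gate /tmp/skill.md   # → Tier 1: PASS
```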

5. DevOps Skill Examples

Track A: dba-rds-slow-query Skill Anatomy

Phase 1 — Gather RDS and CloudWatch Data [SCRIPTS ZONE]:

  • Step 1.1: aws rds describe-db-instances — instance status, class, engine version
  • Step 1.2: aws cloudwatch get-metric-statistics — CPUUtilization last 60 minutes
  • Step 1.3: psql -c "SELECT ... FROM pg_stat_statements WHERE mean_exec_time_ms > 1000" — slow query list
  • Step 1.4: psql -c "SELECT ... FROM pg_stat_user_tables" — sequential scan ratios per table

Each step has an **Expected output** block with the exact JSON or CSV format the tool returns.

Phase 2 — Diagnose Root Cause [AGENTS ZONE]:

  • SLOW_QUERY_INDEX_GAP — high mean_exec_time_ms AND > 80% sequential scans
  • CPU_SPIKE_NO_QUERY_MATCH — CPU spike but no query > 1000ms (connection storm)
  • PARAMETER_GROUP_PENDING — PendingModifiedValues.DBParameterGroupName is set
  • NO_ISSUE_FOUND — all metrics within normal range

Every branch terminates at a named string. No open-ended paths.

Track B: cost-anomaly Skill Anatomy

Phase 1 [SCRIPTS ZONE]:

  • aws ce get-cost-and-usage — last 14 days of daily cost grouped by service
  • aws ce get-cost-and-usage — same period, previous month (baseline)
  • aws cloudwatch describe-alarms --alarm-name-prefix "billing-" — billing alarm status

Phase 2 [AGENTS ZONE]:

IF current_day_cost > 1.5x baseline_daily_average
AND specific_service_cost increased > 200%:
THEN Diagnosis = "SERVICE_COST_SPIKE"

IF cost still > 1.2x baseline at day 7 of anomaly:
THEN Diagnosis = "SUSTAINED_ELEVATED_SPEND"
NOTE: Partial resolution — spike not fully resolved

IF all services within 10% of baseline:
THEN Diagnosis = "NO_ANOMALY_CURRENT_PERIOD"
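A collapsed sketch of this tree with illustrative numbers (the per-service >200% condition and the day-7 clock are omitted for brevity, so this is a simplification, not the full Phase 2 logic):

```shell
# Illustrative values; baseline is the prior month's daily average.
baseline_daily_average=100
current_day_cost=210

awk -v c="$current_day_cost" -v b="$baseline_daily_average" 'BEGIN {
  if      (c > 1.5 * b) print "SERVICE_COST_SPIKE"
  else if (c > 1.2 * b) print "SUSTAINED_ELEVATED_SPEND"
  else                  print "NO_ANOMALY_CURRENT_PERIOD"
}'   # → SERVICE_COST_SPIKE
```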

Track C: sre-k8s-pod-health Skill Anatomy

The sre-k8s-pod-health skill exemplifies the read-only escalation model: the agent diagnoses but never remediates. In Phase 1 [SCRIPTS ZONE], four kubectl commands gather pod state: kubectl get pods -o json, kubectl describe pod, kubectl logs --previous, and kubectl top pods. In Phase 2 [AGENTS ZONE], six decision branches cover the K8S-02 failure modes — ImagePullBackOff, CrashLoopBackOff, OOMKilled, Liveness probe failure, CreateContainerConfigError, and Service port mismatch — each ending in a named diagnosis string or escalation. Phase 3 does not exist — the skill escalates directly from Phase 2 diagnosis to structured handoff. This is NOT a limitation of the skill. It is a deliberate governance decision: SRE agents operate with principle of least privilege. Read-only agents can be trusted with continuous monitoring because their blast radius is zero.


6. Quick Reference

Decision Tree Patterns — Numeric vs Subjective

| Acceptable (numeric threshold) | Fails Tier 4 (subjective) |
|---|---|
| `IF CPUUtilization > 80 AND mean_exec_time_ms > 1000` | "IF CPU is high and queries are slow" |
| `IF DBInstanceStatus == "modifying"` | "IF instance seems like it is changing" |
| `IF sequential_scan_pct > 80` | "IF sequential scan ratio is elevated" |
| `IF daily_cost > 1.5 * baseline_daily_average` | "IF costs look unusual compared to normal" |
| All branches terminate at a diagnosis string or escalation | Branch ends: "investigate further" |

Two-Zone Design Summary

| Aspect | Scripts Zone | Agents Zone |
|---|---|---|
| Purpose | Data collection | Reasoning and diagnosis |
| Contains | CLI commands + expected output | IF/THEN/ELSE decision trees |
| Does NOT contain | Prose decisions, IF/THEN logic | CLI commands (aws, kubectl, psql) |
| Is it deterministic? | Yes — same input → same output | No — LLM reasoning varies |
| Is it testable independently? | Yes — run commands, compare expected output | Yes — feed Phase 1 output, verify diagnosis |
| Phase label | `[SCRIPTS ZONE — deterministic]` | `[AGENTS ZONE — reasoning]` |

Skill Anti-Patterns

| Anti-pattern | Problem | Fix |
|---|---|---|
| "Check the usual metrics" | Undefined — agent will hallucinate "usual" | List exact metric names and thresholds |
| "Escalate if needed" | No condition | Define explicit escalation conditions |
| "Restart the service" | No context | Full command: `systemctl restart nginx --host {host}` |
| 1000-line skill covering all scenarios | Context budget exceeded | Split into domain-specific sub-skills |
| CLI command in Agents Zone | Breaks testability; creates feedback loops | Move CLI commands to Scripts Zone |