# Reference: SKILL.md Format and Skill Lifecycle

Quick-reference material for Module 7 — writing domain-specific skills for Hermes agents.


## 1. SKILL.md Structure

A complete SKILL.md has nine required sections in a specific order:

| Section | Purpose | Required? | Rubric Tier |
|---|---|---|---|
| YAML frontmatter | Metadata for skill loading, discovery, versioning | Yes | Tier 1 |
| `## When to Use` | Specific trigger conditions; anti-cases | Yes | Tier 1 |
| `## Inputs` | Input table with HERMES_LAB_MODE row | Yes | Tier 1 |
| `## Prerequisites` | Tools, permissions, env var setup, mock setup | Yes | Tier 1 |
| `## Procedure` | Alternating Scripts Zone / Agents Zone phases | Yes | Tier 1 |
| `## Escalation Rules` | Observable triggers + handoff template | Yes | Tier 1 |
| `## NEVER DO` | 3+ domain-specific prohibited actions with consequences | Yes | Tier 1 |
| `## Rollback Procedure` | Steps to undo Phase 3 changes | Yes | Tier 1 |
| `## Verification` | 4+ checkboxes confirming skill run complete | Yes | Tier 1 |

### YAML Frontmatter Fields

Every SKILL.md file begins with a YAML frontmatter block:

```yaml
---
name: dba-rds-slow-query
description: "Investigate RDS PostgreSQL slow query performance using pg_stat_statements.
  Use when CloudWatch RDS CPUUtilization alarm fires, application reports slow queries,
  or pg_stat_statements shows queries with mean_time > 1000ms."
version: 1.0.0
compatibility: "aws cli v2, psql, HERMES_LAB_MODE=mock|live"
metadata:
  hermes:
    category: devops
    tags: [rds, postgresql, slow-query, pg-stat-statements, index, performance]
---
```

- `name`: Must match the skill directory name in kebab-case. Used for `hermes skill info`, skill selection, and audit logging.
- `description`: The skill's searchable summary. This is what the `skills_search` tool queries when an agent has multiple skills and needs to locate the right one. It must answer "When should I use this skill?" — not "What does this skill do?" Start with an action verb; include the domain, service, and trigger condition.
- `version`: Semantic versioning (1.0.0). Skills can be versioned and updated without changing the agent configuration. Starting at 1.0.0 and incrementing with each revision creates an audit trail.
- `compatibility`: Lists required tool versions AND the `HERMES_LAB_MODE=mock|live` declaration. This field is checked by the Tier 1 rubric — a skill that does not declare mock/live compatibility cannot be used in course labs.
- `metadata.hermes.category`: One of `devops`, `sre`, `dba`, `observability`. Used for skill discovery and filtering.
- `metadata.hermes.tags`: Used for keyword search across the skills hub. At minimum: domain, service, key operations.
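
These field rules lend themselves to a mechanical pre-review pass. A minimal sketch in Python; the helper name and its checks are illustrative, not part of Hermes:

```python
import re

# Hypothetical pre-review helper: sanity-check the frontmatter fields
# described above before a skill goes to human review.
REQUIRED = ["name", "description", "version", "compatibility"]

def check_frontmatter(skill_md: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    m = re.match(r"^---\n(.*?)\n---\n", skill_md, re.DOTALL)
    if not m:
        return ["missing YAML frontmatter block"]
    fm = m.group(1)
    for field in REQUIRED:
        if not re.search(rf"^{field}:", fm, re.MULTILINE):
            problems.append(f"missing field: {field}")
    name = re.search(r"^name:\s*(\S+)", fm, re.MULTILINE)
    if name and not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name.group(1)):
        problems.append("name is not kebab-case")
    if not re.search(r"^version:\s*\d+\.\d+\.\d+\s*$", fm, re.MULTILINE):
        problems.append("version is not MAJOR.MINOR.PATCH")
    if "HERMES_LAB_MODE" not in fm:
        problems.append("compatibility does not declare HERMES_LAB_MODE")
    return problems
```

A regex pass like this cannot validate nested YAML (the `metadata.hermes.*` block needs a real YAML parser), but it catches the Tier 1 blockers cheaply.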

### The agentskills.io Spec

The SKILL.md format used in this course aligns with the agentskills.io spec published in December 2025. This is a cross-platform standard — skills authored in the SKILL.md format are compatible with any framework that implements the spec (LangGraph, AutoGen, CrewAI, etc.), not just Hermes.

The hermes metadata block (metadata.hermes.*) is a vendor extension — framework-specific extensions are explicitly supported by the spec. The operational knowledge encoded in a SKILL.md file is a reusable asset.


## 2. Annotated SKILL.md Example: EC2 Health Check

# EC2 Health Check

## Metadata
- version: 1.2.0
- domain: SRE / EC2
- author: Platform Engineering
- last_validated: 2026-03-15
- triggers: ["ec2 health check", "instance unreachable", "high cpu alert"]

## When to Use
- CloudWatch alarm `ec2-cpu-high` fires (CPUUtilization > 80% for 5+ minutes)
- Instance status check reports "impaired"
- Application team reports instance unreachable

NOT this skill: RDS performance issues (use dba-rds-slow-query), cost analysis (use cost-anomaly)

## Inputs

| Input | Source | Required | Description |
|---|---|---|---|
| `INSTANCE_ID` | Environment variable | Yes | AWS EC2 instance ID (format: i-[17 hex chars]) |
| `AWS_REGION` | Environment variable | Yes | AWS region (default: us-east-1) |
| `HERMES_LAB_MODE` | Environment variable | Yes | `mock` for lab; `live` for real AWS |

## Prerequisites

**Tools required:** aws cli v2.15+

**Permissions required:** `ec2:DescribeInstances`, `ec2:DescribeInstanceStatus`, `cloudwatch:GetMetricStatistics`

**Environment setup:**
```bash
export INSTANCE_ID="i-0abc123def456"
export AWS_REGION="us-east-1"
export HERMES_LAB_MODE=mock  # or live
```

## Procedure

### Phase 1 — Gather Instance Data [SCRIPTS ZONE — deterministic]

**Step 1.1 — Verify instance state:**

```bash
aws ec2 describe-instances --instance-ids $INSTANCE_ID --region $AWS_REGION \
  --query 'Reservations[0].Instances[0].{State:State.Name,Type:InstanceType,AZ:Placement.AvailabilityZone}'
```

Expected output:

```json
{"State": "running", "Type": "t3.medium", "AZ": "us-east-1a"}
```

**Step 1.2 — Check system status checks:**

```bash
aws ec2 describe-instance-status --instance-ids $INSTANCE_ID --region $AWS_REGION \
  --query 'InstanceStatuses[0].{SystemStatus:SystemStatus.Status,InstanceStatus:InstanceStatus.Status}'
```

Expected output:

```json
{"SystemStatus": "ok", "InstanceStatus": "ok"}
```

**Step 1.3 — Retrieve CPU metrics (last 30 minutes):**

```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --start-time $(date -u -v-30M +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 --statistics Average,Maximum \
  --region $AWS_REGION
```

Note: `date -u -v-30M` is BSD/macOS syntax; on GNU/Linux, use `date -u -d '-30 minutes' +%Y-%m-%dT%H:%M:%SZ` instead.

### Phase 2 — Diagnose Root Cause [AGENTS ZONE — reasoning]

```
IF State != "running":
    THEN Diagnosis = "INSTANCE_NOT_RUNNING" → Escalate immediately (Case A in Escalation Rules)

IF SystemStatus == "impaired" OR InstanceStatus == "impaired":
    THEN Diagnosis = "STATUS_CHECK_IMPAIRED" → Escalate immediately (Case B in Escalation Rules)

IF CPUUtilization_average_30min > 80 AND CPUUtilization_peak > 95 sustained 10+ min:
    THEN Diagnosis = "CPU_CRITICAL" → Escalate (Case C)

IF CPUUtilization_average_30min > 80:
    THEN Diagnosis = "CPU_ELEVATED" → Report finding with recommendation (no Phase 3 in this read-only skill)

IF all metrics within normal range:
    THEN Diagnosis = "NO_ISSUE_FOUND" → Report false positive
```

## Escalation Rules

**Case A (INSTANCE_NOT_RUNNING):** Escalate to on-call via PagerDuty P2.
- Subject: `EC2 Health — INSTANCE_NOT_RUNNING on $INSTANCE_ID`
- Findings: State=[value], Time discovered=[timestamp], Last known healthy=[from audit log]

**Case B (STATUS_CHECK_IMPAIRED):** Escalate to on-call via PagerDuty P1. Do not attempt remediation.
- Subject: `EC2 Health — STATUS_CHECK_IMPAIRED on $INSTANCE_ID`
- Findings: Impaired check=[SystemStatus or InstanceStatus], Region=$AWS_REGION

**Case C (CPU_CRITICAL):** Escalate to on-call via PagerDuty P2. Do not restart without approval.
- Subject: `EC2 Health — CPU_CRITICAL on $INSTANCE_ID`
- Findings: CPU average=[value]%, CPU peak=[value]% sustained [duration], Service=[name]

## NEVER DO

- **NEVER restart the instance without human approval** — reason: restart during active traffic causes service interruption; root cause may persist after restart
- **NEVER run commands that modify instance state** — reason: this skill is a read-only diagnostic; all changes require human execution
- **NEVER report "no issue found" without checking all three Phase 1 data sources** — reason: a single metric can appear normal while others indicate a problem

## Rollback Procedure

This skill is read-only (Phases 1-2 only; there is no Phase 3 and no changes to undo). No rollback required.

## Verification

- [ ] Phase 1 data collected: instance state, status checks, CPU metrics (30 min)
- [ ] Diagnosis string assigned from the Phase 2 decision tree
- [ ] Escalation sent if diagnosis is INSTANCE_NOT_RUNNING, STATUS_CHECK_IMPAIRED, or CPU_CRITICAL
- [ ] All numeric values in the report sourced from Phase 1 output (no estimated values)

---

## 3. Runbook Wiki vs. SKILL.md Comparison

| Property | Wiki Runbook | SKILL.md |
|----------|-------------|---------|
| **Reader** | Human operator | AI agent |
| **Ambiguity** | Human fills gaps with judgment | Gaps cause agent errors |
| **Conditions** | "If high CPU" (implied threshold) | `if cpu_avg > 80: step 4a` (explicit) |
| **Commands** | "Check CloudWatch" (general direction) | Exact command with parameters |
| **Escalation** | "Escalate if needed" | Named target, channel, required info |
| **Versioning** | Updated in place, history unclear | Semantic versioning, changelog |
| **Testing** | Tested by running through incidents | Tested against simulated scenarios |
| **Context cost** | Irrelevant | Matters — verbose runbooks consume context budget |
| **Improvement** | Updated by whoever edited it last | Structured improvement cycle |

---

## 4. Decision Tree Patterns: Specific vs Vague

The most common Tier 4 anti-pattern (an automatic disqualifier) is the vague decision condition:

**FAIL — vague condition:**

```
IF the CPU is elevated:
    THEN investigate further
```

**PASS — specific condition with numeric threshold:**

```
IF CPUUtilization > 80 AND mean_exec_time_ms > 1000:
    THEN Diagnosis = "SLOW_QUERY_INDEX_GAP"
    CONFIDENCE: High — both metrics confirm
```


The specific version has three properties the vague version lacks:

1. **Reproducible:** Two agents running the same skill on the same data reach the same conclusion.
2. **Auditable:** After an incident, you can verify the agent's diagnosis was consistent with the decision tree.
3. **Bounded:** Every branch terminates at a named diagnosis string or an explicit escalation trigger. There is no path through the decision tree that ends in "continue investigating."
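
To make the contrast concrete, the Phase 2 decision tree from the EC2 example can be written as a pure function. This is an illustrative sketch, not Hermes code, and the parameter names are assumptions:

```python
# The EC2 example's Phase 2 tree as a pure function: same inputs, same output.
def diagnose(state: str, system_status: str, instance_status: str,
             cpu_avg_30min: float, cpu_peak: float,
             peak_sustained_min: float) -> str:
    if state != "running":
        return "INSTANCE_NOT_RUNNING"    # Case A: escalate immediately
    if "impaired" in (system_status, instance_status):
        return "STATUS_CHECK_IMPAIRED"   # Case B: escalate immediately
    if cpu_avg_30min > 80 and cpu_peak > 95 and peak_sustained_min >= 10:
        return "CPU_CRITICAL"            # Case C: escalate
    if cpu_avg_30min > 80:
        return "CPU_ELEVATED"            # elevated but not critical
    return "NO_ISSUE_FOUND"              # report false positive
```

Every code path returns a named diagnosis string, so two runs on the same data cannot disagree, and no path ends in open-ended investigation.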

### The Diagnosis String Pattern

The Agents Zone produces a named diagnosis string — not a description, a string:

```
Diagnosis = "SLOW_QUERY_INDEX_GAP"
Diagnosis = "PARAMETER_GROUP_DRIFT"
Diagnosis = "LOCK_CONTENTION_PEAK_HOURS"
Diagnosis = "NO_ISSUE_FOUND"
```


Why strings? They are greppable. They are consistent. They are actionable. They are comparable across sessions, agents, and incidents.

### Trigger Condition Patterns — Good vs Bad

| Good (Specific) | Bad (Vague) | Why Good Is Better |
|---|---|---|
| `When CloudWatch alarm rds-cpu-high fires (CPUUtilization > 80%)` | `When database is slow` | Names the specific alarm; maps to a specific metric |
| `When pg_stat_statements shows mean_exec_time_ms > 1000ms` | `When queries seem slow` | Numeric threshold; greppable in audit logs |
| `When aws ce get-cost-and-usage shows current day > 1.5x baseline` | `When costs are elevated` | Specific metric and specific formula |
| `When kubectl get pods shows STATUS=OOMKilled` | `When pods are having issues` | Named status string; no interpretation required |

---

## 5. NEVER DO Rules: The Specificity Requirement

Generic safety rules ("never do anything dangerous," "always be careful") are useless in a SKILL.md context. The Brain already knows to be careful — that is part of its training. What it does not know, without explicit encoding, is which specific commands are catastrophic in YOUR domain and WHY.

**Generic (useless):**

NEVER do anything that could harm the production database.


**Domain-specific (useful):**

- NEVER execute `VACUUM FULL` during business hours — reason: acquires an exclusive lock on the table, blocks all reads and writes for the duration (minutes to hours on large tables), causes application timeout cascade.
- NEVER run `CREATE INDEX` without `CONCURRENTLY` — reason: locks the table, blocks writes. Use `CREATE INDEX CONCURRENTLY` instead (slower, but does not block).
- NEVER modify `max_connections` without scheduling a restart — reason: this is a static parameter requiring a DB restart; the change applies immediately on the parameter group but does not take effect until the restart window, creating a false expectation that the change is live.


The domain-specific version tells the Brain exactly which actions to avoid and exactly what catastrophic outcome each action causes.

### NEVER DO Patterns by Track

| Track | Domain | Example NEVER DO |
|---|---|---|
| Track A (DBA) | RDS PostgreSQL | NEVER execute ALTER TABLE or CREATE INDEX without explicit human approval — causes table lock, blocks production writes |
| Track A (DBA) | RDS PostgreSQL | NEVER run VACUUM FULL during business hours — acquires exclusive lock, blocks all reads and writes |
| Track B (FinOps) | AWS Cost Explorer | NEVER execute `aws ec2 terminate-instances` based on cost findings alone — requires cross-team approval |
| Track B (FinOps) | AWS Cost Explorer | NEVER modify Reserved Instance or Savings Plan coverage without finance team approval |
| Track C (K8s) | Kubernetes | NEVER run `kubectl delete pod` during active traffic — use rollout restart for controlled pod cycling |
| Track C (K8s) | Kubernetes | NEVER modify resource limits on running deployments without checking PodDisruptionBudget |
| All tracks | General | NEVER skip Phase 2 diagnosis and jump to Phase 3 remediation — blind remediation risks making the problem worse |

---

## 6. Skill Lifecycle

Design → Validate → Version → Deploy → Improve


### Design

Define: domain, inputs, procedure steps, decision trees, escalation paths.

Key questions to answer:
- What triggers this skill? (What does the agent see that makes it select this skill?)
- What are the typed inputs? (What information must be available before execution begins?)
- What are the step-by-step commands? (Exact shell commands, not general descriptions)
- What are the failure modes? (What does the agent do when a step produces unexpected output?)
- What are the escalation conditions? (When does the agent stop and hand off to a human?)

### Validate

Test the skill against realistic scenarios before deploying. Methods:
- **Simulated data run:** Execute the skill against mock CLI responses — verify the agent follows the correct decision tree branches
- **Dry-run on real infra:** Execute the skill with `read-only` tool constraints — verify commands are correct, verify output parsing
- **Edge case table:** Define 5-10 realistic inputs (normal, high-load, impaired, missing data) and verify the agent produces the expected action
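
The edge case table method can be sketched as a tiny replay harness. Everything here is illustrative: the table layout and the simplified `diagnose()` stand in for your skill's real Phase 1 outputs and Phase 2 tree.

```python
# Pair mock Phase 1 data with the diagnosis the decision tree should produce,
# then replay the table and report any deviation.
EDGE_CASES = [
    # (description,      mock Phase 1 data,                  expected diagnosis)
    ("normal load",      {"state": "running", "cpu": 35.0},  "NO_ISSUE_FOUND"),
    ("elevated CPU",     {"state": "running", "cpu": 88.0},  "CPU_ELEVATED"),
    ("instance stopped", {"state": "stopped", "cpu": 0.0},   "INSTANCE_NOT_RUNNING"),
]

def diagnose(data: dict) -> str:
    """Simplified stand-in for the skill's Phase 2 decision tree."""
    if data["state"] != "running":
        return "INSTANCE_NOT_RUNNING"
    if data["cpu"] > 80:
        return "CPU_ELEVATED"
    return "NO_ISSUE_FOUND"

def failing_cases() -> list[str]:
    """Descriptions of edge cases where the tree deviates from expectations."""
    return [desc for desc, data, expected in EDGE_CASES
            if diagnose(data) != expected]
```

An empty result from `failing_cases()` is the pass condition; each new incident that exposes a gap becomes a new row in the table.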

### Version

Use semantic versioning: `MAJOR.MINOR.PATCH`
- MAJOR: Breaking change to inputs or procedure structure
- MINOR: New decision tree branch, new escalation case, new step
- PATCH: Clarification, command syntax fix, threshold update
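
The bump rules can be sketched as a one-screen helper (the function name is hypothetical):

```python
# Bump a MAJOR.MINOR.PATCH version string according to the rules above.
def bump(version: str, change: str) -> str:
    """Return the next semantic version for a given change type."""
    major, minor, patch = (int(p) for p in version.split("."))
    if change == "major":   # breaking change to inputs or procedure structure
        return f"{major + 1}.0.0"
    if change == "minor":   # new branch, new escalation case, new step
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # clarification, syntax fix, threshold update
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```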

Maintain a changelog at the top of the file:
```markdown
## Changelog
- 1.2.0 (2026-03-15): Added disk I/O evaluation step and root cause routing table
- 1.1.0 (2026-02-01): Added network throughput check in Step 5
- 1.0.0 (2026-01-15): Initial version
```
### Deploy

Place the skill file in your Hermes agent's skills/ directory:

```
~/.hermes/profiles/track-a/
  config.yaml
  SOUL.md
  skills/
    dba-rds-slow-query/
      SKILL.md
```

At agent startup, Hermes scans the skills/ directory, reads each SKILL.md file, and prepends their content to the system prompt. The Brain sees the complete skill procedure as part of its initial context — not retrieved on demand, but present from turn 1.
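
That startup scan can be sketched as follows. The real Hermes loader is internal; only the directory layout is taken from the text, so treat this as an assumption-laden illustration:

```python
from pathlib import Path

# Concatenate every skills/*/SKILL.md under a profile directory, ready to
# prepend to the system prompt at startup.
def load_skills(profile_dir: str) -> str:
    skill_files = sorted(Path(profile_dir, "skills").glob("*/SKILL.md"))
    return "\n\n".join(f.read_text() for f in skill_files)
```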

### Improve

After each real-world execution:

  1. Review agent output against expected behavior
  2. Identify where the agent deviated from the decision tree (or where the decision tree was ambiguous)
  3. Update the skill to eliminate the ambiguity
  4. Increment MINOR or PATCH version
  5. Re-validate against the updated edge case table

The improvement loop is the mechanism by which your agent gets better over time — not by retraining the model, but by refining the procedural context it reads.


## 7. RUBRIC.md Quality Tiers

The quality rubric at course/skills/RUBRIC.md has 62 checkboxes organized in four tiers:

**Tier 1 — Blockers** (must ALL pass before the skill can be used in any lab):

- Frontmatter completeness (7 items: name, description, version, compatibility, category, tags, YAML delimiters)
- Section completeness (8 required sections present)
- When to Use quality (specific named triggers, no vague conditions)
- Inputs table format (includes HERMES_LAB_MODE row)
- Two-zone design enforcement (SCRIPTS ZONE and AGENTS ZONE labels present; no CLI in Agents Zone; no prose decisions in Scripts Zone)
- Scripts Zone: CLI commands with expected output blocks (inline, not external references)
- Agents Zone: decision trees with numeric thresholds, named termination conditions
- Escalation Rules: 2+ triggers with observable conditions + handoff template
- NEVER DO: 3+ domain-specific items with stated consequences
- Rollback Procedure: numbered steps covering Phase 3 changes
- Verification checklist: 4+ checkboxes
- Mock mode documentation present

**Tier 2 — Quality** (should fix before the Module 10 agent build): Clean two-zone separation throughout. Escalation handoff is copy-paste ready. Expected output matches real API field names. Skill tested end-to-end in mock mode.

**Tier 3 — Production-Grade** (required before shipping to participants as take-home material): Messy scenario tested. Mock and live produce equivalent diagnostic decisions. Tested with a Haiku-tier model. Skills Hub metadata validated.

**Tier 4 — Anti-Patterns** (one FAIL disqualifies the skill):

- Any decision branch ending in "investigate further" without a stopping criterion
- CLI commands inside an Agents Zone phase
- Expected output blocks that reference external files instead of inline output
- Subjective decision conditions ("slow," "high," "elevated" without a numeric threshold)
- AWS field names in camelCase instead of PascalCase (real AWS uses DBInstanceStatus, not dbInstanceStatus)
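
Two of these anti-patterns are mechanically detectable. A heuristic sketch, not the rubric tooling:

```python
import re

# Quick pre-review pass for two Tier 4 anti-patterns: open-ended branches
# and camelCase AWS field names in expected-output blocks.
def tier4_findings(skill_md: str) -> list[str]:
    findings = []
    if re.search(r"investigate further", skill_md, re.IGNORECASE):
        findings.append("open-ended branch: 'investigate further'")
    # Real AWS output uses PascalCase keys (DBInstanceStatus, CPUUtilization);
    # a JSON key starting with a lowercase letter is likely fabricated.
    for key in re.findall(r'"([A-Za-z]+)":', skill_md):
        if key[0].islower():
            findings.append(f"camelCase field name: {key}")
    return findings
```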

## 8. Tier 1 Quick-Check with Grep

Run these commands on any skill before human review:

```bash
# All 8 required sections present? (should return 8)
grep -c "## When to Use\|## Inputs\|## Prerequisites\|## Procedure\|## Escalation Rules\|## NEVER DO\|## Rollback Procedure\|## Verification" SKILL.md

# Both zone labels present? (should return 2+)
grep -c "SCRIPTS ZONE\|AGENTS ZONE" SKILL.md

# NEVER DO has 3+ items? (should return 3+)
grep -c "^- \*\*NEVER\|^- NEVER" SKILL.md

# HERMES_LAB_MODE documented? (should return 1+)
grep -c "HERMES_LAB_MODE" SKILL.md

# Verification has 4+ checkboxes? (should return 4+)
grep -c "\- \[ \]" SKILL.md

# Unfilled placeholders? (checkbox lines "- [ ]" are excluded first;
# intentional template fields like [value] in escalation handoffs still
# match, so review any hits manually)
grep -v "\- \[ \]" SKILL.md | grep -c "\["
```

## 9. Skill Context Budget

Skills consume context tokens. Each skill loaded into the context window reduces the space available for operational data, conversation history, and reasoning.

Approximate token costs:

| Skill Complexity | Approximate Tokens |
|---|---|
| Simple skill (3-5 steps, 1 decision tree) | 500-800 tokens |
| Medium skill (5-10 steps, 2-3 decision trees) | 1,200-2,000 tokens |
| Complex skill (10+ steps, multiple trees, full escalation matrix) | 2,500-4,000 tokens |
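
A rough pre-flight check can be built on the common "about 4 characters per token" approximation. Both the 4:1 ratio and the 10% skill budget below are assumptions for illustration, not Hermes constants:

```python
# Estimate skill token cost and check it against a fixed share of the window.
def estimate_tokens(skill_text: str) -> int:
    return len(skill_text) // 4   # ~4 chars/token rule of thumb

def fits_budget(skill_texts: list[str], context_window: int = 100_000) -> bool:
    """True when all loaded skills together stay within 10% of the window."""
    budget = context_window // 10
    return sum(estimate_tokens(t) for t in skill_texts) <= budget
```

Under these assumptions, three or four complex skills fit in a 100K-token window, which is consistent with the guideline to load at most 3-4 skills concurrently.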

Context budget guidelines:

- Load at most 3-4 skills concurrently for a standard 100K-token context window
- Split large skills into focused sub-skills (e.g., rds-health-diagnosis.md + rds-health-remediation.md)
- Compress verbose skills: eliminate prose, keep only commands, conditions, and escalation data

## 10. Skill Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| "Check the usual metrics" | Undefined — agent will hallucinate "usual" | List exact metric names and thresholds |
| "Escalate if needed" | No condition — agent never knows when "needed" | Define explicit escalation conditions |
| "Restart the service" | No context — which service, which command? | Full command: `systemctl restart nginx --host {host}` |
| "See runbook section 4" | Cross-reference not loadable at runtime | Inline the referenced content |
| 1000-line skill covering all scenarios | Context budget exceeded | Split into domain-specific sub-skills |
| No version or date | No auditability | Always include version and last_validated |
| Decision branch ends in "investigate further" | Open-ended path — Tier 4 FAIL | Every branch must terminate at a named diagnosis or escalation |
| CLI command inside Agents Zone | Mixed-concern violation — breaks testability | Move CLI commands to Scripts Zone |

## 11. Two-Zone Design Summary

| Aspect | Scripts Zone | Agents Zone |
|---|---|---|
| Purpose | Data collection | Reasoning and diagnosis |
| Contains | CLI commands + expected output | IF/THEN/ELSE decision trees |
| Does NOT contain | Prose decisions, IF/THEN logic | CLI commands (aws, kubectl, psql) |
| Deterministic? | Yes — same input → same output | No — LLM reasoning varies |
| Testable independently? | Yes — run commands, compare expected output | Yes — feed Phase 1 output, verify diagnosis |
| Phase label | `[SCRIPTS ZONE — deterministic]` | `[AGENTS ZONE — reasoning]` |
| Typical phases | Phase 1 (data collection), Phase 3 (remediation) | Phase 2 (diagnosis), Phase 4 (verification) |