SKILL.md Reference Files
Skills are the primary mechanism for encoding domain expertise into Hermes agents. Each SKILL.md file is a structured runbook — machine-readable by Hermes, human-readable by the engineer who authors it.
You work with these files in Modules 7 and 8. The course repo ships five complete example skills and three starter scaffolds under skills/. A blank authoring template is at skills/SKILL-TEMPLATE.md.
What a Skill File Contains
Every SKILL.md has six required sections:
| Section | Purpose |
|---|---|
| Frontmatter | Machine metadata: name, description, version, compatibility, tags |
| When to Use | Exact trigger conditions — specific alarm names, metric thresholds, CLI output patterns |
| Inputs | Environment variables, required tools, permissions |
| Procedure | Two-phase structure: Scripts Zone (deterministic CLI) + Agents Zone (reasoning/decision tree) |
| Escalation / NEVER DO | Human-approval gates and hard prohibitions |
| Verification | Checklist confirming the investigation is complete |
The split between Scripts Zone and Agents Zone is the key design pattern: deterministic data gathering separated from LLM-driven interpretation.
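The section layout in the table above can be checked mechanically. The following is a minimal sketch, not part of Hermes: `missing_sections` is a hypothetical helper, and it assumes that frontmatter is a leading `---` block and that each section appears as a Markdown heading named after its table row. Neither convention is confirmed here, so check skills/SKILL-TEMPLATE.md before relying on it.

```python
import re

# Section names come from the table above; the heading format is an assumption.
REQUIRED_HEADINGS = ["When to Use", "Inputs", "Procedure", "Verification"]


def missing_sections(skill_text: str) -> list[str]:
    """Return the required parts that do not appear in a SKILL.md body."""
    missing = []
    # Assumed convention: frontmatter is a '---' delimited block at the top.
    if not skill_text.lstrip().startswith("---"):
        missing.append("Frontmatter")
    # Collect the text of every Markdown heading, lower-cased for comparison.
    headings = {
        m.group(1).strip().lower()
        for m in re.finditer(r"^#+\s+(.+)$", skill_text, flags=re.MULTILINE)
    }
    for name in REQUIRED_HEADINGS:
        if name.lower() not in headings:
            missing.append(name)
    # The escalation section may be titled either way, so accept both keywords.
    if not any("escalation" in h or "never do" in h for h in headings):
        missing.append("Escalation / NEVER DO")
    return missing
```

Calling `missing_sections(open(path).read())` on a draft returns an empty list when every section is present.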
Course Skills
sre-k8s-pod-health
File: skills/sre-k8s-pod-health/SKILL.md
Diagnose Kubernetes pod health issues. Use when a pod enters CrashLoopBackOff,
ImagePullBackOff, OOMKilled, or CreateContainerConfigError, when a Service has
no endpoints, or when a deployment is reporting unhealthy replicas. Covers pod
status, container states, events, logs, and resource consumption across six
failure modes.
Tags: kubernetes, sre, pod-health, kubectl, k8s, diagnosis, incidents
What it does:
- Phase 1 (Scripts Zone): Runs `kubectl get pods -o json`, `kubectl describe pod`, `kubectl logs --previous`, `kubectl top pods`, `kubectl get endpoints`, and `kubectl get events`: six deterministic data-gather steps.
- Phase 2 (Agents Zone): Six decision branches covering ImagePullBackOff, CrashLoopBackOff, OOMKilled, Liveness probe failure, CreateContainerConfigError, and Service port mismatch, each with explicit field-value conditions and escalation criteria.
Key constraint: Read-only. The skill never runs `kubectl delete`, `kubectl patch`, or `kubectl exec`. Any remediation action requires explicit human approval.
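To illustrate what "explicit field-value conditions" might look like, here is a minimal sketch of a classifier over one item from `kubectl get pods -o json`. The field paths (`status.containerStatuses[].state.waiting.reason` and `.terminated.reason`) follow the Kubernetes Pod status schema. The function name `classify_pod` is hypothetical, and the skill's actual branches may test different fields.

```python
# Waiting reasons named in the skill description above.
WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "CreateContainerConfigError"}


def classify_pod(pod: dict) -> list[str]:
    """Return the failure modes found in one item of `kubectl get pods -o json`."""
    findings = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in WAITING_REASONS:
            findings.append(waiting["reason"])
        # An OOM kill can show up on the current or the previous container instance.
        terminated_reasons = {
            (cs.get(key, {}).get("terminated") or {}).get("reason")
            for key in ("state", "lastState")
        }
        if "OOMKilled" in terminated_reasons:
            findings.append("OOMKilled")
    return findings
```

The function only reads the parsed JSON, which keeps it consistent with the read-only constraint.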
Used in: Module 7 lab (Track C primary example), Module 10 lab (Kiran reference agent), Module 11 fleet lab (Kiran delegation target).
sre-k8s-node-health
File: skills/sre-k8s-node-health/SKILL.md
Tags: kubernetes, sre, node-health, kubectl, k8s, diagnosis
Starter scaffold for diagnosing Kubernetes node health issues — NotReady, MemoryPressure, DiskPressure, PIDPressure. Phase 2 is a participant extension point for Module 7 lab.
sre-k8s-resource-quota
File: skills/sre-k8s-resource-quota/SKILL.md
Tags: kubernetes, sre, resource-quota, kubectl, k8s, capacity
Starter scaffold for diagnosing Kubernetes resource quota saturation — blocked pod creation, missing LimitRange, namespace quota exhaustion. Phase 2 is a participant extension point for Module 7 lab.
sre-k8s-rollback-investigator
File: skills/sre-k8s-rollback-investigator/SKILL.md
Tags: kubernetes, sre, rollback, deployment, kubectl, k8s
Starter scaffold for investigating Kubernetes deployment rollback scenarios — ProgressDeadlineExceeded, image regression, replica drift, ReplicaSet sprawl. Phase 2 is a participant extension point for Module 7 lab.
sre-ec2-health-check
File: skills/sre-ec2-health-check/SKILL.md
Diagnose EC2 instance health issues. Use when CloudWatch alert fires on
EC2 CPU, network, or disk metrics, or when instance becomes unreachable
or performance-degraded. Covers status checks, CloudWatch metrics,
CloudTrail events, and health report generation.
Tags: ec2, sre, health-check, cloudwatch, aws, monitoring, incidents
What it does:
- Phase 1 (Scripts Zone): Runs `aws ec2 describe-instance-status`, CloudWatch CPU/network/disk metrics, CloudTrail events, and active alarm checks: six deterministic data-gather steps.
- Phase 2 (Agents Zone): Four decision branches (status check failures, CPU saturation, network isolation, no active issue), each with explicit numeric thresholds and escalation criteria.
Key constraint: Read-only. The skill never reboots, stops, or modifies the instance. Any remediation action requires explicit human approval.
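The four branches could be sketched roughly as follows. `status_entry` is one element of `InstanceStatuses` from `aws ec2 describe-instance-status`, where `SystemStatus.Status` and `InstanceStatus.Status` report values such as `ok` or `impaired`. The function name, the 90% CPU limit, and the zero-traffic test for network isolation are placeholder assumptions; the skill's real thresholds are defined in its SKILL.md and are not reproduced here.

```python
def ec2_branch(status_entry: dict, cpu_avg_pct: float, net_bytes: float,
               cpu_limit_pct: float = 90.0) -> str:
    """Pick one of the four branches for a single instance.

    cpu_limit_pct and the zero-traffic check are illustrative placeholders,
    not the thresholds used by the actual skill.
    """
    system = status_entry.get("SystemStatus", {}).get("Status")
    instance = status_entry.get("InstanceStatus", {}).get("Status")
    # Branch 1: either status check reports an impaired state.
    if "impaired" in (system, instance):
        return "status-check-failure"
    # Branch 2: average CPU above the (assumed) limit.
    if cpu_avg_pct > cpu_limit_pct:
        return "cpu-saturation"
    # Branch 3: no network traffic at all in the sampled window (assumed test).
    if net_bytes == 0:
        return "network-isolation"
    # Branch 4: nothing matched.
    return "no-active-issue"
```

Branch order matters in this sketch: a failed status check takes precedence over metric-based findings.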
Used in: Module 7 lab (the primary hands-on example), Module 8 tool-wiring lab.
dba-rds-slow-query
File: skills/dba-rds-slow-query/SKILL.md
Investigate RDS PostgreSQL slow query performance using pg_stat_statements.
Use when CloudWatch RDS CPUUtilization alarm fires, application reports slow
queries, or pg_stat_statements shows queries with mean_time > 1000ms. Covers
slow query identification, index gap analysis, parameter group review, and
tuning recommendations.
Tags: rds, postgresql, slow-query, pg-stat-statements, index, performance, dba, tuning
What it does:
- Phase 1: Queries `pg_stat_statements` for top slow queries, checks CloudWatch RDS CPU and IOPS metrics, reviews active connections and wait events.
- Phase 2: Decision branches for query-plan issues (runs `EXPLAIN`), parameter group drift, connection pool saturation, and no active issue. Produces a tuning recommendation with estimated impact.
Key constraint: Recommends index creation and parameter changes but never executes DDL without approval. VACUUM FULL is flagged as a business-hours risk.
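As a rough illustration of the slow-query trigger, the sketch below filters rows already fetched from `pg_stat_statements` against the 1000 ms threshold from the description. The row shape (a list of dicts keyed by column name) and the helper name `slow_queries` are assumptions. One detail worth knowing: the mean-time column is named `mean_exec_time` on PostgreSQL 13 and later, and `mean_time` on earlier versions, so the sketch accepts either.

```python
SLOW_MS = 1000.0  # threshold taken from the skill description above


def slow_queries(rows: list[dict], limit: int = 5) -> list[dict]:
    """Return the slowest statements above SLOW_MS, worst first.

    Each row is assumed to be a dict of pg_stat_statements columns.
    """
    def mean_ms(row: dict) -> float:
        # PostgreSQL 13+ uses mean_exec_time; older versions use mean_time.
        return float(row.get("mean_exec_time", row.get("mean_time", 0.0)))

    slow = [r for r in rows if mean_ms(r) > SLOW_MS]
    return sorted(slow, key=mean_ms, reverse=True)[:limit]
```

The filter is strictly greater-than, so a statement averaging exactly 1000 ms is not reported.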
Used in: Module 10 Domain Agent Build — Track A (Database Health).
devops-deployment-safety-check
File: skills/devops-deployment-safety-check/SKILL.md
Validate deployment readiness and monitor canary rollout health. Use before
deploying to production, after canary release, or when automated deployment
gate reports failure. Covers pre-deploy validation, canary health monitoring,
rollback criteria, and post-deploy verification.
Tags: deployment, canary, rollback, safety, cicd, production, ec2, kubernetes, ecs
What it does:
- Phase 1: Pre-deploy checks (resource availability, circuit breaker state, error rate baseline), canary traffic split validation, post-canary error rate and latency comparison against baseline.
- Phase 2: Go/no-go decision logic using configurable thresholds. If canary error rate exceeds baseline by 2x or p99 latency increases more than 50%, the skill recommends rollback with the exact `kubectl rollout undo` or ECS equivalent.
Key constraint: Never executes rollback autonomously. Outputs the rollback command and rationale for human execution.
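The go/no-go rule can be sketched as a small pure function. This assumes "exceeds baseline by 2x" means the canary error rate is more than twice the baseline rate, which is one reading of the rule above. The function and parameter names are hypothetical, and the defaults mirror the 2x and 50% figures from the text.

```python
def canary_verdict(baseline_err: float, canary_err: float,
                   baseline_p99_ms: float, canary_p99_ms: float,
                   err_factor: float = 2.0, latency_increase: float = 0.5) -> str:
    """Return "rollback" or "go" for one canary comparison.

    The function only recommends; it never runs a rollback itself.
    """
    # Error-rate gate: canary more than err_factor times the baseline.
    if canary_err > baseline_err * err_factor:
        return "rollback"
    # Latency gate: p99 grew by more than latency_increase (0.5 means 50%).
    if canary_p99_ms > baseline_p99_ms * (1 + latency_increase):
        return "rollback"
    return "go"
```

With a baseline error rate of zero, any canary error trips the first gate in this sketch, so a real skill would likely need a minimum-traffic or absolute-floor condition as well.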
Used in: Module 8 (tool wiring), Module 12 (trigger-based automation).
observability-alert-noise-analyzer
File: skills/observability-alert-noise-analyzer/SKILL.md
Analyze CloudWatch alarm patterns to identify noise, duplicates, and
correlated alerts. Use when on-call engineer reports alert fatigue, when
alert volume spikes without a corresponding incident, or as part of weekly
observability hygiene. Produces dedup candidates, correlation clusters, and
snooze window recommendations.
Tags: cloudwatch, observability, alerts, noise, deduplication, correlation, on-call, sre
What it does:
- Phase 1: Retrieves last 24 hours of alarm state history, groups by alarm name and time window, identifies alarms that fire within 60 seconds of each other (correlation candidates).
- Phase 2: Classifies alarms as noisy (fired more than 3x per hour with no incident), correlated (co-firing with root cause alarm), or healthy. Produces a ranked list of dedup and snooze candidates with estimated on-call burden reduction.
Key constraint: Read-only analysis only. Does not modify alarm thresholds or suppress alarms — outputs a structured recommendation report.
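The classification can be approximated as below. This is a simplified sketch under several assumptions: firings arrive as `(alarm_name, unix_timestamp)` pairs, "more than 3x per hour" is measured as the average rate over the whole window, and any alarm firing within 60 seconds of a different alarm counts as a correlation candidate, without identifying which one is the root cause. `classify_alarms` and `incident_alarms` are hypothetical names.

```python
from collections import defaultdict


def classify_alarms(firings: list[tuple[str, float]], window_hours: float = 24.0,
                    incident_alarms: frozenset[str] = frozenset()) -> dict[str, str]:
    """Label each alarm as "noisy", "correlated", or "healthy"."""
    times = defaultdict(list)
    for name, ts in firings:
        times[name].append(ts)

    verdicts = {}
    for name, stamps in times.items():
        # Average firings per hour across the analysis window (assumed metric).
        rate = len(stamps) / window_hours
        # True if any firing lands within 60 s of a firing of another alarm.
        co_fires = any(
            abs(ts - other) <= 60
            for other_name, others in times.items() if other_name != name
            for ts in stamps
            for other in others
        )
        if rate > 3 and name not in incident_alarms:
            verdicts[name] = "noisy"
        elif co_fires:
            verdicts[name] = "correlated"
        else:
            verdicts[name] = "healthy"
    return verdicts
```

The pairwise comparison is quadratic in the number of firings, which is acceptable for a 24-hour window but would need sorting or bucketing for larger histories.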
Used in: Module 7 stretch lab, Module 9 (design patterns — noise reduction pattern).
Authoring Your Own Skill
The blank template is at skills/SKILL-TEMPLATE.md. It includes all required sections with authoring guidance in comments.
Quality gate: before submitting a new skill, every Tier 1 item in skills/RUBRIC.md must pass. Tier 1 blockers are:
- Frontmatter `description` starts with an action verb and names the trigger condition
- `When to Use` lists specific alert names or metric thresholds (not vague descriptions)
- Scripts Zone contains only CLI commands and expected output (no prose decisions)
- Agents Zone contains only IF/THEN/ELSE trees (no raw CLI commands)
- All numeric thresholds are explicit (e.g., `> 80%`, `> 1000ms`), not qualitative
See Module 7 reading material for a full walkthrough of the authoring process.
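Two of the Tier 1 checks lend themselves to a rough pre-submission lint. The sketch below is a heuristic only and is not the rubric's official tooling: `lint_agents_zone`, the list of CLI prefixes, and the list of qualitative words are all assumptions chosen for illustration. It flags Agents Zone lines that start with a CLI command, and lines that use a qualitative word without any number.

```python
import re

CLI_PREFIXES = ("kubectl ", "aws ", "psql ")  # assumed set of CLI tools
VAGUE_WORDS = re.compile(r"\b(high|low|elevated|excessive)\b", re.IGNORECASE)


def lint_agents_zone(text: str) -> list[str]:
    """Flag Agents Zone lines with raw CLI or qualitative thresholds."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Drop list markers and backticks before testing for a command prefix.
        stripped = line.strip().lstrip("-*` ").lower()
        if stripped.startswith(CLI_PREFIXES):
            problems.append(f"line {number}: raw CLI command")
        # A vague word with no digit on the line suggests a missing threshold.
        if VAGUE_WORDS.search(line) and not re.search(r"\d", line):
            problems.append(f"line {number}: qualitative threshold")
    return problems
```

A clean result does not mean the skill passes Tier 1; it only catches the most mechanical violations before a human review against skills/RUBRIC.md.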