SKILL.md Reference Files

Skills are the primary mechanism for encoding domain expertise into Hermes agents. Each SKILL.md file is a structured runbook — machine-readable by Hermes, human-readable by the engineer who authors it.

You work with these files in Modules 7 and 8. Four complete example skills ship with the course repo under skills/. A blank authoring template is at skills/SKILL-TEMPLATE.md.


What a Skill File Contains

Every SKILL.md has six required sections:

| Section | Purpose |
|---|---|
| Frontmatter | Machine metadata: name, description, version, compatibility, tags |
| When to Use | Exact trigger conditions: specific alarm names, metric thresholds, CLI output patterns |
| Inputs | Environment variables, required tools, permissions |
| Procedure | Two-phase structure: Scripts Zone (deterministic CLI) + Agents Zone (reasoning/decision tree) |
| Escalation / NEVER DO | Human-approval gates and hard prohibitions |
| Verification | Checklist confirming the investigation is complete |

The split between Scripts Zone and Agents Zone is the key design pattern: deterministic data gathering separated from LLM-driven interpretation.
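
The split can be pictured with a small Python sketch. This is only an illustration under stated assumptions: `SCRIPTS_ZONE`, `gather`, `decide`, and `fake_run` are hypothetical names, not part of Hermes, and the command runner is injected so the example runs without a cluster.

```python
from typing import Callable, Dict, List

# Hypothetical Scripts Zone: a fixed, ordered list of read-only commands.
SCRIPTS_ZONE: List[List[str]] = [
    ["kubectl", "get", "pods", "-o", "json"],
    ["kubectl", "get", "events"],
]


def gather(run: Callable[[List[str]], str]) -> Dict[str, str]:
    """Phase 1: run every command in order and keep the raw output."""
    return {" ".join(cmd): run(cmd) for cmd in SCRIPTS_ZONE}


def decide(evidence: Dict[str, str]) -> str:
    """Phase 2 stand-in: interpretation only ever sees gathered evidence."""
    if any("CrashLoopBackOff" in out for out in evidence.values()):
        return "escalate: CrashLoopBackOff observed"
    return "no active issue"


def fake_run(cmd: List[str]) -> str:
    # Stub runner so the sketch works offline; a real runner might wrap subprocess.run.
    return "web-1 CrashLoopBackOff" if cmd[:3] == ["kubectl", "get", "pods"] else ""
```

Because `decide` never issues commands itself, the gathered evidence can be logged and replayed, and the interpretation step can be tested against recorded output.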


Course Skills

sre-k8s-pod-health

File: skills/sre-k8s-pod-health/SKILL.md

Diagnose Kubernetes pod health issues. Use when a pod enters CrashLoopBackOff,
ImagePullBackOff, OOMKilled, or CreateContainerConfigError, when a Service has
no endpoints, or when a deployment is reporting unhealthy replicas. Covers pod
status, container states, events, logs, and resource consumption across six
failure modes.

Tags: kubernetes, sre, pod-health, kubectl, k8s, diagnosis, incidents

What it does:

  • Phase 1 (Scripts Zone): Runs kubectl get pods -o json, kubectl describe pod, kubectl logs --previous, kubectl top pods, kubectl get endpoints, and kubectl get events — six deterministic data-gather steps.
  • Phase 2 (Agents Zone): Six decision branches covering ImagePullBackOff, CrashLoopBackOff, OOMKilled, Liveness probe failure, CreateContainerConfigError, and Service port mismatch — each with explicit field-value conditions and escalation criteria.

Key constraint: Read-only. The skill never runs kubectl delete, kubectl patch, or kubectl exec. Any remediation action requires explicit human approval.

Used in: Module 7 lab (Track C primary example), Module 10 lab (Kiran reference agent), Module 11 fleet lab (Kiran delegation target).
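
As a rough illustration of what "explicit field-value conditions" can look like, the sketch below checks four of the six branches against the container status fields of a pod object as returned by kubectl get pods -o json. The function name `classify_pod` is hypothetical and this is not the skill's actual decision tree. The liveness-probe and Service port branches are left out because they depend on events and endpoints data rather than on container status.

```python
from typing import Any, Dict, List

# Waiting reasons that map directly to a decision branch.
WAITING_REASONS = {"ImagePullBackOff", "CrashLoopBackOff", "CreateContainerConfigError"}


def classify_pod(pod: Dict[str, Any]) -> List[str]:
    """Return the failure modes visible in one pod object's containerStatuses."""
    findings: List[str] = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # state.waiting.reason carries the back-off and config error reasons.
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in WAITING_REASONS:
            findings.append(waiting["reason"])
        # lastState.terminated.reason records why the previous container exited.
        last = cs.get("lastState", {}).get("terminated") or {}
        if last.get("reason") == "OOMKilled":
            findings.append("OOMKilled")
    return findings
```

A pod can match more than one branch at once, for example a container that is in CrashLoopBackOff because its previous run was OOMKilled, which is why the sketch returns a list rather than a single label.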


sre-k8s-node-health

File: skills/sre-k8s-node-health/SKILL.md

Tags: kubernetes, sre, node-health, kubectl, k8s, diagnosis

Starter scaffold for diagnosing Kubernetes node health issues — NotReady, MemoryPressure, DiskPressure, PIDPressure. Phase 2 is a participant extension point for Module 7 lab.


sre-k8s-resource-quota

File: skills/sre-k8s-resource-quota/SKILL.md

Tags: kubernetes, sre, resource-quota, kubectl, k8s, capacity

Starter scaffold for diagnosing Kubernetes resource quota saturation — blocked pod creation, missing LimitRange, namespace quota exhaustion. Phase 2 is a participant extension point for Module 7 lab.


sre-k8s-rollback-investigator

File: skills/sre-k8s-rollback-investigator/SKILL.md

Tags: kubernetes, sre, rollback, deployment, kubectl, k8s

Starter scaffold for investigating Kubernetes deployment rollback scenarios — ProgressDeadlineExceeded, image regression, replica drift, ReplicaSet sprawl. Phase 2 is a participant extension point for Module 7 lab.


sre-ec2-health-check

File: skills/sre-ec2-health-check/SKILL.md

Diagnose EC2 instance health issues. Use when CloudWatch alert fires on
EC2 CPU, network, or disk metrics, or when instance becomes unreachable
or performance-degraded. Covers status checks, CloudWatch metrics,
CloudTrail events, and health report generation.

Tags: ec2, sre, health-check, cloudwatch, aws, monitoring, incidents

What it does:

  • Phase 1 (Scripts Zone): Runs aws ec2 describe-instance-status, pulls CloudWatch CPU, network, and disk metrics, reviews CloudTrail events, and checks active alarms. That makes six deterministic data-gather steps.
  • Phase 2 (Agents Zone): Four decision branches — status check failures, CPU saturation, network isolation, no active issue — each with explicit numeric thresholds and escalation criteria.

Key constraint: Read-only. The skill never reboots, stops, or modifies the instance. Any remediation action requires explicit human approval.

Used in: Module 7 lab (the primary hands-on example), Module 8 tool-wiring lab.


dba-rds-slow-query

File: skills/dba-rds-slow-query/SKILL.md

Investigate RDS PostgreSQL slow query performance using pg_stat_statements.
Use when CloudWatch RDS CPUUtilization alarm fires, application reports slow
queries, or pg_stat_statements shows queries with mean_time > 1000ms. Covers
slow query identification, index gap analysis, parameter group review, and
tuning recommendations.

Tags: rds, postgresql, slow-query, pg-stat-statements, index, performance, dba, tuning

What it does:

  • Phase 1: Queries pg_stat_statements for top slow queries, checks CloudWatch RDS CPU and IOPS metrics, reviews active connections and wait events.
  • Phase 2: Decision branches for query-plan issues (runs EXPLAIN), parameter group drift, connection pool saturation, and no active issue. Produces a tuning recommendation with estimated impact.

Key constraint: Recommends index creation and parameter changes but never executes DDL without approval. VACUUM FULL is flagged as a business-hours risk because it holds an exclusive lock on the table while it runs.

Used in: Module 10 Domain Agent Build — Track A (Database Health).
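
The mean_time > 1000ms trigger can be sketched as a filter over rows already fetched from pg_stat_statements. This is an assumption-laden sketch: `slow_queries` and the row shape (a list of dicts keyed by column name) are hypothetical, and the column is called mean_time only in older versions of the extension (PostgreSQL 13 and later expose it as mean_exec_time).

```python
from typing import Any, Dict, List

SLOW_MEAN_MS = 1000.0  # threshold from the skill description: mean_time > 1000ms


def slow_queries(rows: List[Dict[str, Any]], limit: int = 5) -> List[Dict[str, Any]]:
    """Keep rows above the threshold, slowest first, capped at `limit`."""
    slow = [r for r in rows if float(r["mean_time"]) > SLOW_MEAN_MS]
    return sorted(slow, key=lambda r: float(r["mean_time"]), reverse=True)[:limit]
```

Keeping the threshold in a named constant matches the rubric's requirement that numeric thresholds be explicit rather than qualitative.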


devops-deployment-safety-check

File: skills/devops-deployment-safety-check/SKILL.md

Validate deployment readiness and monitor canary rollout health. Use before
deploying to production, after canary release, or when automated deployment
gate reports failure. Covers pre-deploy validation, canary health monitoring,
rollback criteria, and post-deploy verification.

Tags: deployment, canary, rollback, safety, cicd, production, ec2, kubernetes, ecs

What it does:

  • Phase 1: Pre-deploy checks (resource availability, circuit breaker state, error rate baseline), canary traffic split validation, post-canary error rate and latency comparison against baseline.
  • Phase 2: Go/no-go decision logic using configurable thresholds. If the canary error rate is more than 2x the baseline, or p99 latency increases by more than 50%, the skill recommends rollback and outputs the exact kubectl rollout undo command or its ECS equivalent.

Key constraint: Never executes rollback autonomously. Outputs the rollback command and rationale for human execution.

Used in: Module 8 (tool wiring), Module 12 (trigger-based automation).
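
A minimal sketch of the go/no-go logic, assuming the error-rate rule means "more than twice the baseline" and the latency rule means "more than 50% above baseline p99". `Metrics` and `canary_decision` are hypothetical names, and the function only returns the rollback command as text; it never runs it.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Metrics:
    error_rate: float  # fraction of failed requests, e.g. 0.01 for 1%
    p99_ms: float      # p99 latency in milliseconds


def canary_decision(baseline: Metrics, canary: Metrics, deployment: str,
                    error_factor: float = 2.0,
                    latency_increase: float = 0.5) -> Dict[str, Any]:
    """Compare canary to baseline and recommend, but never execute, a rollback."""
    reasons: List[str] = []
    if canary.error_rate > baseline.error_rate * error_factor:
        reasons.append("error rate above configured multiple of baseline")
    if canary.p99_ms > baseline.p99_ms * (1 + latency_increase):
        reasons.append("p99 latency above configured increase over baseline")
    if reasons:
        return {"decision": "no-go", "reasons": reasons,
                "command": f"kubectl rollout undo deployment/{deployment}"}
    return {"decision": "go", "reasons": [], "command": None}
```

Returning the command as a string alongside the reasons keeps the human in the loop: the engineer sees why rollback is recommended and runs the command themselves.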


observability-alert-noise-analyzer

File: skills/observability-alert-noise-analyzer/SKILL.md

Analyze CloudWatch alarm patterns to identify noise, duplicates, and
correlated alerts. Use when on-call engineer reports alert fatigue, when
alert volume spikes without a corresponding incident, or as part of weekly
observability hygiene. Produces dedup candidates, correlation clusters, and
snooze window recommendations.

Tags: cloudwatch, observability, alerts, noise, deduplication, correlation, on-call, sre

What it does:

  • Phase 1: Retrieves last 24 hours of alarm state history, groups by alarm name and time window, identifies alarms that fire within 60 seconds of each other (correlation candidates).
  • Phase 2: Classifies alarms as noisy (fired more than 3 times per hour with no incident), correlated (co-firing with a root-cause alarm), or healthy. Produces a ranked list of dedup and snooze candidates with estimated on-call burden reduction.

Key constraint: Read-only analysis. The skill does not modify alarm thresholds or suppress alarms; it outputs a structured recommendation report.

Used in: Module 7 stretch lab, Module 9 (design patterns — noise reduction pattern).
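
The two numeric rules above (more than 3 firings per hour, co-firing within 60 seconds) might be sketched as follows. `noisy_alarms` and `correlated_pairs` are hypothetical helpers operating on (alarm name, timestamp) pairs. The "no incident" condition is omitted because it needs incident data this sketch does not model, and clock-hour buckets are an assumed simplification.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Set, Tuple

Event = Tuple[str, datetime]  # (alarm name, time it entered ALARM state)


def noisy_alarms(events: List[Event], max_per_hour: int = 3) -> List[str]:
    """Alarms that fired more than `max_per_hour` times in any clock hour."""
    buckets: Dict[Tuple[str, datetime], int] = defaultdict(int)
    for name, ts in events:
        buckets[(name, ts.replace(minute=0, second=0, microsecond=0))] += 1
    return sorted({name for (name, _), n in buckets.items() if n > max_per_hour})


def correlated_pairs(events: List[Event],
                     window: timedelta = timedelta(seconds=60)) -> List[Tuple[str, ...]]:
    """Pairs of distinct alarms that fired within `window` of each other."""
    ordered = sorted(events, key=lambda e: e[1])
    pairs: Set[Tuple[str, ...]] = set()
    for i, (a, ta) in enumerate(ordered):
        for b, tb in ordered[i + 1:]:
            if tb - ta > window:
                break  # events are time-sorted, so later ones are further away
            if a != b:
                pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)
```

Both functions are pure and read-only, in keeping with the skill's constraint: they produce candidates for a report and change nothing in CloudWatch.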


Authoring Your Own Skill

The blank template is at skills/SKILL-TEMPLATE.md. It includes all required sections with authoring guidance in comments.

Quality gate: before submitting a new skill, every Tier 1 item in skills/RUBRIC.md must pass. Tier 1 blockers are:

  • Frontmatter description starts with an action verb and names the trigger condition
  • When to Use lists specific alert names or metric thresholds (not vague descriptions)
  • Scripts Zone contains only CLI commands and expected output (no prose decisions)
  • Agents Zone contains only IF/THEN/ELSE trees (no raw CLI commands)
  • All numeric thresholds are explicit (e.g., > 80%, > 1000ms) — not qualitative
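
Two of these checks lend themselves to a quick self-review script before submission. The sketch below is not the course's rubric tooling: `lint_agents_zone`, the CLI prefix list, and the list of qualitative words are all assumptions chosen for illustration. It flags raw CLI commands inside Agents Zone text and qualitative wording that has no numeric comparison on the same line.

```python
import re
from typing import List

CLI_PREFIXES = ("kubectl ", "aws ", "psql ")  # assumed set of CLI tools
QUALITATIVE = re.compile(r"\b(high|low|slow|elevated|excessive)\b", re.IGNORECASE)
NUMERIC_COMPARISON = re.compile(r"[<>]=?\s*\d")


def lint_agents_zone(text: str) -> List[str]:
    """Return one message per line that breaks the two checked rules."""
    problems: List[str] = []
    for n, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip().lstrip("$ ")  # tolerate a shell prompt prefix
        if stripped.startswith(CLI_PREFIXES):
            problems.append(f"line {n}: raw CLI command in Agents Zone")
        if QUALITATIVE.search(line) and not NUMERIC_COMPARISON.search(line):
            problems.append(f"line {n}: qualitative threshold without a number")
    return problems
```

A word-list check like this will produce false positives and misses, so it is best treated as a prompt for manual review against skills/RUBRIC.md rather than a pass/fail gate.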

See Module 7 reading material for a full walkthrough of the authoring process.