Module 7 Lab: Write a Kubernetes Pod Health SKILL.md (Track C)
Duration: 60 minutes
Track: C — Kubernetes Health & Self-Healing
Outcome: A complete SKILL.md that diagnoses Kubernetes pod failures and passes the Track C rubric quality gate
This lab and the rest of your course work (Modules 8, 10, and beyond) use Track C.
The skill you write here is the skill you will attach to your Track C agent profile
in Module 8. Do not mix kubectl commands with aws or psql commands — cross-track
contamination is a Tier 1 rubric failure.
Prerequisites (5 min)
Verify your KIND cluster is running and you can access it:
# Check that KIND cluster exists and is ready
kubectl cluster-info --context kind-lab
# Expected output: similar to:
# Kubernetes control plane is running at https://127.0.0.1:XXXXX
# CoreDNS is running at https://127.0.0.1:XXXXX/api/v1/namespaces/kube-system/services/coredns:dns/proxy
# Verify you can see nodes
kubectl get nodes
# Expected output:
# NAME STATUS ROLES AGE VERSION
# kind-control-plane Ready control-plane XXd vX.XX.X
That's it. No environment variables. No wrapper setup. Your KIND cluster from Module 6 is all you need.
Your reference material for Track C:
- 6 broken pod manifests at infrastructure/scenarios/k8s/0[1-6]-*.yaml — you'll apply these during the lab to create failure scenarios on your cluster
- The production reference skill at modules/module-07-skills/solution/track-c-kubernetes/SKILL.md (287 lines, all 6 failure modes — your 60-minute version will be smaller, that's fine)
- The starter file at modules/module-07-skills/starter/track-c-kubernetes/SKILL.md
File Structure
modules/module-07-skills/
├── starter/
│ └── track-c-kubernetes/SKILL.md ← your starting point
└── solution/
└── track-c-kubernetes/SKILL.md ← reference implementation (look only after finishing)
Copy your starter file to a working location:
cp modules/module-07-skills/starter/track-c-kubernetes/SKILL.md /tmp/my-k8s-skill.md
Edit /tmp/my-k8s-skill.md throughout this lab, filling in each section as you work through the steps below.
Step 1: Inject a Failure Scenario (10 min)
Before you diagnose pod failures, you need to CREATE them. You'll use one of the broken pod manifests to inject a failure into your KIND cluster.
Pick one scenario:
# The six available failure scenarios are:
# 01-image-pull-backoff.yaml — Pod cannot pull image (ImagePullBackOff)
# 02-crashloop-backoff.yaml — Container crashes on startup (CrashLoopBackOff)
# 03-oom-killed.yaml — Container hits memory limit (OOMKilled)
# 04-liveness-probe.yaml — Liveness probe fails repeatedly
# 05-missing-secret.yaml — Pod references a non-existent Secret (CreateContainerConfigError)
# 06-port-mismatch.yaml — Service selector exists but port doesn't match (no endpoints)
# For this first run, apply the ImagePullBackOff scenario:
kubectl apply -f infrastructure/scenarios/k8s/01-image-pull-backoff.yaml
Verify the failure exists:
# List pods to see the broken one
kubectl get pods -A
# You should see a pod in ImagePullBackOff state in the default namespace
# Example output:
# NAMESPACE NAME READY STATUS RESTARTS AGE
# default broken-image-pull-XXX 0/1 ImagePullBackOff 0 10s
Understand what you did:
You just created a real, broken Kubernetes pod state — not a mock, not a simulation. Your SKILL.md will diagnose this real failure. This is how Kubernetes debugging actually works: you find a broken pod, and you diagnose it.
Step 2: Write Your SKILL.md Structure (50 min)
Your skill needs to diagnose real pod failures. Fill in each section below, using kubectl commands against your KIND cluster to understand what failure modes look like.
Section 1: Metadata — Skill Identity (5 min)
Fill in every [placeholder] in your SKILL.md frontmatter:
- name: kebab-case, e.g., sre-k8s-pod-health
- description: one sentence describing what it does and when to use it
- compatibility: "kubectl 1.28+, KIND v0.31+"
- metadata.hermes.category: sre
- metadata.hermes.tags: 3-5 tags from [kubernetes, sre, pod-health, kubectl, diagnosis, crashloop, oomkilled]
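Filled in, the frontmatter might look like the sketch below. The description wording and tag selection are example values, not the reference solution, and the exact nesting under metadata may differ slightly in your starter file:

```yaml
---
name: sre-k8s-pod-health
description: Diagnoses failing Kubernetes pods (image pulls, crash loops, OOM kills) from read-only kubectl output.
compatibility: "kubectl 1.28+, KIND v0.31+"
metadata:
  hermes:
    category: sre
    tags: [kubernetes, sre, pod-health, kubectl, diagnosis]
---
```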
Section 2: When to Use — Trigger Conditions (8 min)
Add a ## When to Use section naming 3-5 specific trigger conditions using exact Kubernetes field names:
- Pod in CrashLoopBackOff with restartCount > N
- Pod in ImagePullBackOff / ErrImagePull
- Container terminated with reason == "OOMKilled" and exitCode == 137
- Pod Pending with CreateContainerConfigError (missing Secret/ConfigMap)
- kubectl describe pod Events contain "Liveness probe failed"
- Service exists but kubectl get endpoints returns <none>
Add 2 "Do NOT use this skill for" anti-cases (out-of-scope scenarios).
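Trigger conditions like these can be probed mechanically against pod JSON. As a minimal sketch (assuming jq is installed; the helper name is_crashlooping and the threshold of 5 restarts are illustrative, not from the rubric):

```shell
# Reads one pod's JSON on stdin; prints "yes" if restartCount exceeds the
# example threshold of 5, "no" otherwise. Hypothetical helper for this lab.
is_crashlooping() {
  jq -r --argjson n 5 \
    'if (.status.containerStatuses[0].restartCount // 0) > $n then "yes" else "no" end'
}

# Usage against the live KIND cluster:
# kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o json | is_crashlooping
```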
Section 3: Inputs — Parameterize Your Skill (5 min)
Add an ## Inputs section with a table listing:
- NAMESPACE (env var, required)
- POD_NAME (env var, optional)
- Required tools: kubectl 1.28+
For live KIND mode, that's all you need. No HERMES_LAB_MODE, no mock data paths.
Section 4: Phase 1 — Scripts Zone (12 min)
Add a ## Procedure section with ### Phase 1: Gather Pod Data.
Write 3-5 kubectl commands that collect diagnostic data. For each command, include:
- Exact command (e.g., kubectl get pods -n $NAMESPACE -o json)
- Expected output format (what fields to look for in the JSON or text)
- What a healthy pod vs. a broken pod looks like
Example Phase 1 commands:
# Step 1.1: Pod inventory
kubectl get pods -n $NAMESPACE -o json
# Step 1.2: Pod details with events
kubectl describe pod $POD_NAME -n $NAMESPACE
# Step 1.3: Container logs (current and previous instance)
kubectl logs $POD_NAME -n $NAMESPACE --tail=100
kubectl logs $POD_NAME -n $NAMESPACE --tail=100 --previous
These commands will work against your real KIND cluster with real failures injected.
Tip: Run these commands manually right now against the broken pod you created in Step 1. See what the output looks like for ImagePullBackOff. Your Phase 1 section should document exactly what you see.
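When documenting what fields to look for, kubectl's built-in JSONPath output can pull out exactly the field your decision tree will test, with no extra tooling. A sketch (the helper name waiting_reason is hypothetical):

```shell
# Prints just the waiting reason for the first container of a pod, e.g.
# "ImagePullBackOff" for the broken pod from Step 1. Args: pod name, namespace.
waiting_reason() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
}

# Usage:
# waiting_reason broken-image-pull-XXX default
```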
Section 5: Phase 2 — Agents Zone (12 min)
Add ### Phase 2: Interpret and Decide.
Write a decision tree covering at least 3 of the 6 failure modes. Use exact field paths:
IF status.containerStatuses[].state.waiting.reason == "ImagePullBackOff":
THEN: Diagnosis = "IMAGE_PULL_FAILURE"
ELSE IF lastState.terminated.reason == "OOMKilled" AND exitCode == 137:
THEN: Diagnosis = "OOM_KILLED"
ELSE:
THEN: Diagnosis = "NO_ISSUE_FOUND"
Every branch must end in a named diagnosis or escalation. No "investigate further."
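The decision tree above can be sketched as executable shell, which is a good way to convince yourself every branch really does end in a named diagnosis. This assumes jq is installed; diagnose_pod is a hypothetical helper name, and the branches mirror the pseudocode plus one CrashLoopBackOff case:

```shell
# Takes one pod's JSON as its argument and prints a named diagnosis.
diagnose_pod() {
  local waiting terminated code
  waiting=$(printf '%s' "$1" | jq -r '.status.containerStatuses[0].state.waiting.reason // empty')
  terminated=$(printf '%s' "$1" | jq -r '.status.containerStatuses[0].lastState.terminated.reason // empty')
  code=$(printf '%s' "$1" | jq -r '.status.containerStatuses[0].lastState.terminated.exitCode // empty')

  if [ "$waiting" = "ImagePullBackOff" ] || [ "$waiting" = "ErrImagePull" ]; then
    echo "IMAGE_PULL_FAILURE"
  elif [ "$terminated" = "OOMKilled" ] && [ "$code" = "137" ]; then
    echo "OOM_KILLED"
  elif [ "$waiting" = "CrashLoopBackOff" ]; then
    echo "CRASH_LOOP"
  else
    echo "NO_ISSUE_FOUND"
  fi
}

# Usage: diagnose_pod "$(kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o json)"
```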
Section 6: Escalation & Safety (8 min)
Add ## Escalation Rules and ## NEVER DO sections.
Escalation Rules: 3-4 specific conditions (e.g., restartCount > 10, any unrecognized failure mode).
NEVER DO: 4-5 hard prohibitions, e.g.:
- NEVER execute kubectl delete — prevents accidental resource removal
- NEVER execute kubectl exec — this skill is read-only
- NEVER modify resource limits without approval
- NEVER execute kubectl patch, kubectl edit, or kubectl apply
Section 7: Verification (5 min)
Add ## Verification section with a checklist confirming:
- All Phase 1 commands ran and returned data
- Phase 2 decision tree reached a named diagnosis
- Evidence includes specific pod name, namespace, and Kubernetes field paths
- No write-verb kubectl commands were executed
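One way to back the "no write-verb kubectl" checklist item with an actual check is to scan a transcript of the commands you ran. A sketch, assuming you keep such a transcript (the file name session.log and the helper name audit_readonly are hypothetical):

```shell
# Prints FAIL if the transcript contains any of the prohibited write verbs
# from the NEVER DO section, OK otherwise. Arg: path to the transcript file.
audit_readonly() {
  if grep -Eq 'kubectl[[:space:]]+(delete|exec|patch|edit|apply)' "$1"; then
    echo "FAIL: write-verb kubectl command found in $1"
  else
    echo "OK: transcript is read-only"
  fi
}

# Usage: audit_readonly session.log
```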
Step 3: Test Your Skill Against a Real Broken Pod (Optional: 10 min)
Once you've completed your SKILL.md, test it against the real broken pod:
# If you still have the broken pod from Step 1:
kubectl get pods -A
# Run your Phase 1 commands manually and see if you can identify the failure mode
# using your Phase 2 decision tree
# For example:
export NAMESPACE=default
export POD_NAME=<pod-name-from-step-1>
# Run your Phase 1 commands
kubectl get pods -n $NAMESPACE -o json
kubectl describe pod $POD_NAME -n $NAMESPACE
# Work through your Phase 2 decision tree
# Can you reach a diagnosis?
If you want to test against another scenario, clean up and apply a different one:
# Clean up the old pod
kubectl delete -f infrastructure/scenarios/k8s/01-image-pull-backoff.yaml
# Apply a different scenario
kubectl apply -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
# Test your skill again
# Does your Phase 2 decision tree work for this failure mode too?
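If you want to repeat that cleanup-and-apply cycle for several scenarios, a small helper keeps it to one command per run. This is a sketch: run_scenario is a hypothetical name, and the 10-second sleep is a guess at how long the kubelet needs to surface the failure, so adjust as needed:

```shell
# Injects one broken manifest, waits for the failure state to materialize,
# shows the resulting pods, then cleans up. Arg: path to a scenario YAML.
run_scenario() {
  kubectl apply -f "$1"     # inject the failure
  sleep 10                  # let the failure state materialize
  kubectl get pods -A       # observe the broken pod
  kubectl delete -f "$1"    # clean up before the next scenario
}

# Usage, cycling through all six scenarios:
# for f in infrastructure/scenarios/k8s/0[1-6]-*.yaml; do run_scenario "$f"; done
```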
Quality Gate
Before comparing to the solution, verify your SKILL.md is complete:
# Check 1: No unfilled placeholders
grep -c '\[' /tmp/my-k8s-skill.md
# Expected: 0
# Check 2: Decision tree has specific conditions
grep -E '(reason ==|restartCount|exitCode)' /tmp/my-k8s-skill.md | head -5
# Expected: at least 3 lines with specific Kubernetes conditions
# Check 3: NEVER DO section exists with specific kubectl commands
grep "NEVER.*kubectl" /tmp/my-k8s-skill.md
# Expected: at least 4 NEVER DO rules
# Check 4: Both PROCEDURE and decision branches present
grep "Phase 1\|Phase 2\|Decision Branch" /tmp/my-k8s-skill.md
# Expected: both Phase 1 and Phase 2 sections present
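The four checks above can be rolled into a single pass/fail gate so you get one answer instead of eyeballing four grep outputs. A sketch (the function name gate is hypothetical; the thresholds come straight from the checks above):

```shell
# Runs all four quality-gate checks against a SKILL.md. Arg: path to the file.
# Prints one FAIL line per failed check; returns 0 only if all checks pass.
gate() {
  local f="$1" ok=0
  [ "$(grep -c '\[' "$f")" -eq 0 ] || { echo "FAIL: unfilled placeholders remain"; ok=1; }
  [ "$(grep -cE 'reason ==|restartCount|exitCode' "$f")" -ge 3 ] || { echo "FAIL: fewer than 3 specific conditions"; ok=1; }
  [ "$(grep -c 'NEVER.*kubectl' "$f")" -ge 4 ] || { echo "FAIL: fewer than 4 NEVER DO kubectl rules"; ok=1; }
  { grep -q "Phase 1" "$f" && grep -q "Phase 2" "$f"; } || { echo "FAIL: Phase 1/Phase 2 missing"; ok=1; }
  [ "$ok" -eq 0 ] && echo "PASS: all four checks passed"
  return "$ok"
}

# Usage: gate /tmp/my-k8s-skill.md
```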
Compare with Solution
Your completed skill vs. the reference implementation:
diff /tmp/my-k8s-skill.md modules/module-07-skills/solution/track-c-kubernetes/SKILL.md
Differences are expected and fine — this is YOUR skill for the scenarios you focused on. The solution file covers all 6 failure modes in 287 lines; your 60-minute version will be smaller. What must match: structure (all sections present), format (specific conditions, named diagnoses), completeness (0 placeholders).
Save Your Work
Your completed skill carries directly into Module 8.
cp /tmp/my-k8s-skill.md modules/module-07-skills/my-track-c-skill.md
Next Steps
Continue to Module 8 where you will:
- Create a Hermes track-c agent profile
- Examine and copy the reference SOUL.md and config.yaml into your profile
- Attach your Module 7 skill to the profile
- Run your Track C agent against real Kubernetes scenarios with your KIND cluster