Module 7 Lab: Write a Kubernetes Pod Health SKILL.md (Track C)
Duration: 60 minutes
Track: C — Kubernetes Health & Self-Healing
Outcome: A complete SKILL.md that diagnoses Kubernetes pod failures and passes the Track C rubric quality gate
This lab and the rest of your course work (Modules 8, 10, and beyond) use Track C.
The skill you write here is the skill you will attach to your Track C agent profile
in Module 8. Do not mix kubectl commands with aws or psql commands — cross-track
contamination is a Tier 1 rubric failure.
Prerequisites (5 min)
Verify your KIND cluster is running and you can access it:
# Check that KIND cluster exists and is ready
kubectl cluster-info --context kind-lab
# Expected output: similar to:
# Kubernetes control plane is running at https://127.0.0.1:XXXXX
# CoreDNS is running at https://127.0.0.1:XXXXX/api/v1/namespaces/kube-system/services/coredns:dns/proxy
# Verify you can see nodes
kubectl get nodes
# Expected output:
# NAME STATUS ROLES AGE VERSION
# kind-control-plane Ready control-plane XXd vX.XX.X
That's it. No environment variables. No wrapper setup. Your KIND cluster from Module 6 is all you need.
Your reference material for Track C:
- 6 broken pod manifests at infrastructure/scenarios/k8s/0[1-6]-*.yaml — you'll apply these during the lab to create failure scenarios on your cluster
- The production reference skill at modules/module-07-skills/solution/track-c-kubernetes/SKILL.md (287 lines, all 6 failure modes — your 60-minute version will be smaller, that's fine)
- The starter file at modules/module-07-skills/starter/track-c-kubernetes/SKILL.md
File Structure
modules/module-07-skills/
├── starter/
│ └── track-c-kubernetes/SKILL.md ← your starting point
└── solution/
└── track-c-kubernetes/SKILL.md ← reference implementation (look only after finishing)
Copy your starter file to a working location:
cp modules/module-07-skills/starter/track-c-kubernetes/SKILL.md /tmp/my-k8s-skill.md
Edit /tmp/my-k8s-skill.md throughout this lab, filling in each section as you work through the steps below.
Step 1: Inject a Failure Scenario (10 min)
Before you diagnose pod failures, you need to CREATE them. You'll use one of the broken pod manifests to inject a failure into your KIND cluster.
Pick one scenario:
# The six available failure scenarios are:
# 01-image-pull-backoff.yaml — Pod cannot pull image (ImagePullBackOff)
# 02-crashloop-backoff.yaml — Container crashes on startup (CrashLoopBackOff)
# 03-oom-killed.yaml — Container hits memory limit (OOMKilled)
# 04-liveness-probe.yaml — Liveness probe fails repeatedly
# 05-missing-secret.yaml — Pod references a non-existent Secret (CreateContainerConfigError)
# 06-port-mismatch.yaml — Service selector exists but port doesn't match (no endpoints)
# For this first run, apply the ImagePullBackOff scenario:
kubectl apply -f infrastructure/scenarios/k8s/01-image-pull-backoff.yaml
Verify the failure exists:
# List pods to see the broken one
kubectl get pods -A
# You should see a pod in ImagePullBackOff state in the default namespace
# Example output:
# NAMESPACE NAME READY STATUS RESTARTS AGE
# default broken-image-pull-XXX 0/1 ImagePullBackOff 0 10s
Understand what you did:
You just created a real, broken Kubernetes pod state — not a mock, not a simulation. Your SKILL.md will diagnose this real failure. This is how Kubernetes debugging actually works: you find a broken pod, and you diagnose it.
Step 2: Write Your SKILL.md Structure (50 min)
Your skill needs to diagnose real pod failures. Fill in each section below, using kubectl commands against your KIND cluster to understand what failure modes look like.
Section 1: Metadata — Skill Identity (5 min)
Fill in every [placeholder] in your SKILL.md frontmatter:
- name: kebab-case, e.g., sre-k8s-pod-health
- description: one sentence describing what it does and when to use it
- compatibility: "kubectl 1.28+, KIND v0.31+"
- metadata.hermes.category: sre
- metadata.hermes.tags: 3-5 tags from [kubernetes, sre, pod-health, kubectl, diagnosis, crashloop, oomkilled]
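Filled in, the frontmatter might look like the sketch below. The description wording and tag selection are example values, not the reference solution, and the exact nesting under metadata may differ slightly in your starter file:

```yaml
---
name: sre-k8s-pod-health
description: Diagnoses failing Kubernetes pods (image pulls, crash loops, OOM kills) from read-only kubectl output.
compatibility: "kubectl 1.28+, KIND v0.31+"
metadata:
  hermes:
    category: sre
    tags: [kubernetes, sre, pod-health, kubectl, diagnosis]
---
```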
Section 2: When to Use — Trigger Conditions (8 min)
Add a ## When to Use section naming 3-5 specific trigger conditions using exact Kubernetes field names:
- Pod in CrashLoopBackOff with restartCount > N
- Pod in ImagePullBackOff / ErrImagePull
- Container terminated with reason == "OOMKilled" and exitCode == 137
- Pod Pending with CreateContainerConfigError (missing Secret/ConfigMap)
- kubectl describe pod Events contain "Liveness probe failed"
- Service exists but kubectl get endpoints returns <none>
Add 2 "Do NOT use this skill for" anti-cases (out-of-scope scenarios).
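Trigger conditions like these can be probed mechanically against pod JSON. As a minimal sketch (assuming jq is installed; the helper name is_crashlooping and the threshold of 5 restarts are illustrative, not from the rubric):

```shell
# Reads one pod's JSON on stdin; prints "yes" if restartCount exceeds the
# example threshold of 5, "no" otherwise. Hypothetical helper for this lab.
is_crashlooping() {
  jq -r --argjson n 5 \
    'if (.status.containerStatuses[0].restartCount // 0) > $n then "yes" else "no" end'
}

# Usage against the live KIND cluster:
# kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o json | is_crashlooping
```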
Section 3: Inputs — Parameterize Your Skill (5 min)
Add an ## Inputs section with a table listing:
- NAMESPACE (env var, required)
- POD_NAME (env var, optional)
- Required tools: kubectl 1.28+
For live KIND mode, that's all you need. No HERMES_LAB_MODE, no mock data paths.
Section 4: Phase 1 — Scripts Zone (12 min)
Add a ## Procedure section with ### Phase 1: Gather Pod Data.
Write 3-5 kubectl commands that collect diagnostic data. For each command, include:
- Exact command (e.g., kubectl get pods -n $NAMESPACE -o json)
- Expected output format (what fields to look for in the JSON or text)
- What a healthy pod vs. a broken pod looks like
Example Phase 1 commands:
# Step 1.1: Pod inventory
kubectl get pods -n $NAMESPACE -o json
# Step 1.2: Pod details with events
kubectl describe pod $POD_NAME -n $NAMESPACE
# Step 1.3: Container logs (current and previous instance)
kubectl logs $POD_NAME -n $NAMESPACE --tail=100
kubectl logs $POD_NAME -n $NAMESPACE --tail=100 --previous
These commands will work against your real KIND cluster with real failures injected.
Tip: Run these commands manually right now against the broken pod you created in Step 1. See what the output looks like for ImagePullBackOff. Your Phase 1 section should document exactly what you see.
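When documenting what fields to look for, kubectl's built-in JSONPath output can pull out exactly the field your decision tree will test, with no extra tooling. A sketch (the helper name waiting_reason is hypothetical):

```shell
# Prints just the waiting reason for the first container of a pod, e.g.
# "ImagePullBackOff" for the broken pod from Step 1. Args: pod name, namespace.
waiting_reason() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
}

# Usage:
# waiting_reason broken-image-pull-XXX default
```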
Section 5: Phase 2 — Agents Zone (12 min)
Add ### Phase 2: Interpret and Decide.
Write a decision tree covering at least 3 of the 6 failure modes. Use exact field paths:
IF status.containerStatuses[].state.waiting.reason == "ImagePullBackOff":
THEN: Diagnosis = "IMAGE_PULL_FAILURE"
ELSE IF lastState.terminated.reason == "OOMKilled" AND exitCode == 137:
THEN: Diagnosis = "OOM_KILLED"
ELSE:
THEN: Diagnosis = "NO_ISSUE_FOUND"
Every branch must end in a named diagnosis or escalation. No "investigate further."
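The decision tree above can be sketched as executable shell, which is a good way to convince yourself every branch really does end in a named diagnosis. This assumes jq is installed; diagnose_pod is a hypothetical helper name, and the branches mirror the pseudocode plus one CrashLoopBackOff case:

```shell
# Takes one pod's JSON as its argument and prints a named diagnosis.
diagnose_pod() {
  local waiting terminated code
  waiting=$(printf '%s' "$1" | jq -r '.status.containerStatuses[0].state.waiting.reason // empty')
  terminated=$(printf '%s' "$1" | jq -r '.status.containerStatuses[0].lastState.terminated.reason // empty')
  code=$(printf '%s' "$1" | jq -r '.status.containerStatuses[0].lastState.terminated.exitCode // empty')

  if [ "$waiting" = "ImagePullBackOff" ] || [ "$waiting" = "ErrImagePull" ]; then
    echo "IMAGE_PULL_FAILURE"
  elif [ "$terminated" = "OOMKilled" ] && [ "$code" = "137" ]; then
    echo "OOM_KILLED"
  elif [ "$waiting" = "CrashLoopBackOff" ]; then
    echo "CRASH_LOOP"
  else
    echo "NO_ISSUE_FOUND"
  fi
}

# Usage: diagnose_pod "$(kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o json)"
```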
Section 6: Escalation & Safety (8 min)
Add ## Escalation Rules and ## NEVER DO sections.
Escalation Rules: 3-4 specific conditions (e.g., restartCount > 10, any unrecognized failure mode).
NEVER DO: 4-5 hard prohibitions, e.g.:
- NEVER execute kubectl delete — prevents accidental resource removal
- NEVER execute kubectl exec — this skill is read-only
- NEVER modify resource limits without approval
- NEVER execute kubectl patch, kubectl edit, or kubectl apply
Section 7: Verification (5 min)
Add ## Verification section with a checklist confirming:
- All Phase 1 commands ran and returned data
- Phase 2 decision tree reached a named diagnosis
- Evidence includes specific pod name, namespace, and Kubernetes field paths
- No write-verb kubectl commands were executed
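One way to back the "no write-verb kubectl" checklist item with an actual check is to scan a transcript of the commands you ran. A sketch, assuming you keep such a transcript (the file name session.log and the helper name audit_readonly are hypothetical):

```shell
# Prints FAIL if the transcript contains any of the prohibited write verbs
# from the NEVER DO section, OK otherwise. Arg: path to the transcript file.
audit_readonly() {
  if grep -Eq 'kubectl[[:space:]]+(delete|exec|patch|edit|apply)' "$1"; then
    echo "FAIL: write-verb kubectl command found in $1"
  else
    echo "OK: transcript is read-only"
  fi
}

# Usage: audit_readonly session.log
```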
Step 3: Test Your Skill Against a Real Broken Pod (Optional: 10 min)
Once you've completed your SKILL.md, test it against the real broken pod:
# If you still have the broken pod from Step 1:
kubectl get pods -A
# Run your Phase 1 commands manually and see if you can identify the failure mode
# using your Phase 2 decision tree
# For example:
export NAMESPACE=default
export POD_NAME=<pod-name-from-step-1>
# Run your Phase 1 commands
kubectl get pods -n $NAMESPACE -o json
kubectl describe pod $POD_NAME -n $NAMESPACE
# Work through your Phase 2 decision tree
# Can you reach a diagnosis?
If you want to test against another scenario, clean up and apply a different one:
# Clean up the old pod
kubectl delete -f infrastructure/scenarios/k8s/01-image-pull-backoff.yaml
# Apply a different scenario
kubectl apply -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
# Test your skill again
# Does your Phase 2 decision tree work for this failure mode too?
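If you want to repeat that cleanup-and-apply cycle for several scenarios, a small helper keeps it to one command per run. This is a sketch: run_scenario is a hypothetical name, and the 10-second sleep is a guess at how long the kubelet needs to surface the failure, so adjust as needed:

```shell
# Injects one broken manifest, waits for the failure state to materialize,
# shows the resulting pods, then cleans up. Arg: path to a scenario YAML.
run_scenario() {
  kubectl apply -f "$1"     # inject the failure
  sleep 10                  # let the failure state materialize
  kubectl get pods -A       # observe the broken pod
  kubectl delete -f "$1"    # clean up before the next scenario
}

# Usage, cycling through all six scenarios:
# for f in infrastructure/scenarios/k8s/0[1-6]-*.yaml; do run_scenario "$f"; done
```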
Quality Gate
Before comparing to the solution, verify your SKILL.md is complete:
# Check 1: No unfilled placeholders
grep -c '\[' /tmp/my-k8s-skill.md
# Expected: 0
# Check 2: Decision tree has specific conditions
grep -E '(reason ==|restartCount|exitCode)' /tmp/my-k8s-skill.md | head -5
# Expected: at least 3 lines with specific Kubernetes conditions
# Check 3: NEVER DO section exists with specific kubectl commands
grep "NEVER.*kubectl" /tmp/my-k8s-skill.md
# Expected: at least 4 NEVER DO rules
# Check 4: Both PROCEDURE and decision branches present
grep "Phase 1\|Phase 2\|Decision Branch" /tmp/my-k8s-skill.md
# Expected: both Phase 1 and Phase 2 sections present
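The four checks above can be rolled into a single pass/fail gate so you get one answer instead of eyeballing four grep outputs. A sketch (the function name gate is hypothetical; the thresholds come straight from the checks above):

```shell
# Runs all four quality-gate checks against a SKILL.md. Arg: path to the file.
# Prints one FAIL line per failed check; returns 0 only if all checks pass.
gate() {
  local f="$1" ok=0
  [ "$(grep -c '\[' "$f")" -eq 0 ] || { echo "FAIL: unfilled placeholders remain"; ok=1; }
  [ "$(grep -cE 'reason ==|restartCount|exitCode' "$f")" -ge 3 ] || { echo "FAIL: fewer than 3 specific conditions"; ok=1; }
  [ "$(grep -c 'NEVER.*kubectl' "$f")" -ge 4 ] || { echo "FAIL: fewer than 4 NEVER DO kubectl rules"; ok=1; }
  { grep -q "Phase 1" "$f" && grep -q "Phase 2" "$f"; } || { echo "FAIL: Phase 1/Phase 2 missing"; ok=1; }
  [ "$ok" -eq 0 ] && echo "PASS: all four checks passed"
  return "$ok"
}

# Usage: gate /tmp/my-k8s-skill.md
```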
Compare with Solution
Your completed skill vs. the reference implementation:
diff /tmp/my-k8s-skill.md modules/module-07-skills/solution/track-c-kubernetes/SKILL.md
Differences are expected and fine — this is YOUR skill for the scenarios you focused on. The solution file covers all 6 failure modes in 287 lines; your 60-minute version will be smaller. What must match: structure (all sections present), format (specific conditions, named diagnoses), completeness (0 placeholders).
Save Your Work
Your completed skill carries directly into Module 8.
cp /tmp/my-k8s-skill.md modules/module-07-skills/my-track-c-skill.md
Next Steps
Continue to Module 8 where you will:
- Create a Hermes track-c agent profile
- Examine and copy the reference SOUL.md and config.yaml into your profile
- Attach your Module 7 skill to the profile
- Run your Track C agent against real Kubernetes scenarios with your KIND cluster