Skip to main content

Module 10 Lab: Build the Kubernetes Health Agent (Track C)

Duration: 90 minutes (45 min guided + 45 min free explore) + optional KIND live extension Track: C — Kubernetes Health & Self-Healing Prerequisite: Hermes installed, HERMES_LAB_MODE understood (from Module 8) Outcome: A running Kiran agent that diagnoses pod OOM events against both clean and messy mock scenarios; optionally against a live KIND cluster

tip

Track C is the workshop "wow moment." Running an agent that talks to a real Kubernetes cluster demonstrates live infrastructure interaction — not just simulated data. Mock mode works identically for all participants. KIND makes it real if you have it.

If you have KIND installed — run this first

After laptop sleep or restart, your KIND cluster may have stopped and kubeconfig may be stale. Run these commands before starting the lab:

# Verify KIND cluster is running
kind get clusters
# Expected: hermes-lab (or similar)

# If hermes-lab is NOT listed, create the cluster:
kind create cluster --name hermes-lab

# Re-export kubeconfig (required after every restart — kubeconfig changes on KIND restart)
kind export kubeconfig --name hermes-lab

# Verify kubectl connects
kubectl get nodes
# Expected: <node-name> Ready control-plane ...

Skip this box if you are using mock mode only (the default path for this lab).


Prerequisites

# Verify Hermes is installed
hermes --version

# Set lab mode (default: mock — works for all participants)
export HERMES_LAB_MODE=mock
export HERMES_LAB_SCENARIO=clean
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
Optional live KIND path

If you have KIND running and want to connect to a real cluster, set export HERMES_LAB_MODE=live instead. The agent will use your real kubectl connection. Mock mode and live mode use the same SOUL.md identity — only the kubectl routing changes. The remainder of this lab works in either mode; live-specific notes are clearly marked [LIVE MODE].

Token budget note

This lab uses anthropic/claude-haiku-4 (configured in config.yaml). Expected cost: less than $0.05 per complete diagnostic run. If you encounter rate limit errors, wait 60 seconds and retry — or switch to Google AI Studio / HF Inference as documented in course/setup/llm-access.md.


GUIDED PHASE (45 min)


Step 1: Prerequisites + Environment Setup (5 min)

Confirm the environment is ready before installing the agent.

# Confirm mock data wrapper is in PATH
which kubectl
# Expected: .../infrastructure/wrappers/kubectl (mock mode)
# [LIVE MODE]: Expected: /usr/local/bin/kubectl or similar

# Confirm mock data directory is set
ls $MOCK_DATA_DIR/kubernetes/
# Expected: get-pods-healthy.json get-pods-crashloop.json describe-pod-oom.json

If the wrappers are not found: Verify you ran the PATH export above from the course/ directory.


Step 2: Install the Reference Agent (5 min)

Copy the complete Track C reference agent into your Hermes profiles directory.

cp -r course/agents/track-c-kubernetes/ ~/.hermes/profiles/track-c/

# Verify the profile structure
ls ~/.hermes/profiles/track-c/
# Expected: SOUL.md config.yaml skills/

ls ~/.hermes/profiles/track-c/skills/
# Expected: sre-k8s-pod-health/

This installs Kiran — the Kubernetes Health Agent — with its identity, safety config, and attached skill.


Step 3: Meet the Agent (5 min)

Start a chat session and ask Kiran to introduce itself.

hermes -p track-c chat

Ask:

Who are you and what is your operating mode?

Expected: Kiran introduces itself as a Kubernetes health agent, confirms the HERMES_LAB_MODE (MOCK or LIVE) in its first line, and describes its diagnostic scope: detecting pod failures, OOM events, and node pressure — then recommending targeted self-healing actions with approval.

Note: Track C agents confirm BOTH the lab mode AND the cluster connection. This is the Kubernetes-specific pattern. In mock mode, kubectl connects to pre-built JSON fixtures. In live mode, kubectl connects to your KIND cluster. The agent's behavior rules are the same in both modes.

Exit the session when done: type exit or press Ctrl+C.


Step 4: Examine the Attached Skill (5 min)

ls ~/.hermes/profiles/track-c/skills/
# Expected: sre-k8s-pod-health/
Kiran ships with a Kubernetes skill

Kiran's attached skill is sre-k8s-pod-health — a Kubernetes diagnostic skill covering six pod failure modes: ImagePullBackOff, CrashLoopBackOff, OOMKilled, Liveness probe failure, missing Secret/ConfigMap, and Service port mismatch.

The SOUL.md Behavior Rules (start with kubectl get pods, cite failure-reason codes like OOMKilled and CrashLoopBackOff, never execute kubectl delete) work together with this skill's two-phase procedure: Phase 1 [SCRIPTS ZONE] gathers pod state via four kubectl commands, Phase 2 [AGENTS ZONE] applies six named decision branches to identify the specific failure mode.

To verify the skill is loaded: start a chat session and ask "List your available skills." Kiran should report sre-k8s-pod-health.


Step 5: Run the Clean Scenario — Interactive Investigation (15 min)

Start a chat session:

hermes -p track-c chat

Paste the following context block to kick off the diagnosis:

Alert received: PagerDuty alert fires at 10:15 UTC.

PagerDuty: Pod Restarting — api-deployment
Severity: HIGH
Namespace: default
Pod: api-deployment-def456
Restarts: 2 in the last 30 minutes
Message: Application intermittently unavailable — 502 errors on health check

On-call engineer gets paged. The api-deployment handles the core REST API for the
application. Two restarts in 30 minutes means users are seeing intermittent 502 errors
during the restart window (typically 15-30 seconds each restart).

No recent deployments. The pod has been running this version for 3 days without issues.

Please investigate the pod health status.

Suggested follow-up questions to drive complete diagnosis:

What is the exit code for the last termination?
What memory limit is set on the api-deployment container?
What kubectl command would increase the memory limit? (Do not execute it — just propose it.)

Expected findings:

  • Kiran runs kubectl get pods --all-namespaces as its first action
  • Identifies api-deployment-def456 with restartCount: 2, status Ready: False
  • Reads lastState.terminated.exitCode: 137 and reason: OOMKilled — the definitive OOM signature
  • Notes resources.limits.memory: 256Mi
  • Recommends increasing limit to 512Mi (doubling for headroom) — marks this REQUIRES-APPROVAL
  • Does NOT recommend a manual pod restart (Kubernetes handles this automatically via restartPolicy)

Verification check: Did the agent propose the exact kubectl patch command to increase the memory limit? If it said "increase the memory" without a specific command, ask: "What is the specific kubectl patch command to update resources.limits.memory to 512Mi?"

[LIVE MODE] note

In live mode, the agent runs kubectl get pods against your real cluster. Your clean KIND cluster will show healthy pods — no OOM condition exists unless you injected it. The mock data shows the OOM scenario; live mode shows your actual cluster state. This difference is intentional: use mock mode to practice the OOM diagnosis workflow with predictable data.

Exit the chat session when you have the kubectl patch command.


Step 6: Produce the Structured Report (5 min)

In a new chat session (or continue the current one), paste this Stage 2 prompt verbatim:

Based on your investigation above, produce a structured pod health incident report with:
1. Alert: [timestamp, affected namespaces, pod names with states]
2. Findings: [each pod — exitCode, restartCount, resource limits, last failure reason]
3. Ambiguity Statement: [what the data cannot determine without node-level metrics]
4. Recommended Actions: [each labeled REQUIRES-APPROVAL: <kubectl command> or INVESTIGATE-FIRST]
5. Version Correlation: [note any deployment timestamps that coincide with incident]

Note on the Ambiguity Statement: This section is required. Even in the clean scenario, there are things the agent cannot determine from pod-level data alone — such as whether memory growth is a one-time spike or a sustained trend, or whether the node itself is under memory pressure from other workloads. If Kiran skips the Ambiguity Statement, ask:

What information would you need from the node level to increase diagnostic confidence?

This drives the agent to articulate the limits of its current data — a critical skill for production on-call diagnosis.


Step 7: "Break It" Exercise (5 min)

Track C safety explanation — read before running the exercise

Track C safety (like Track B) uses SOUL.md NEVER rules — not the Hermes mechanical approval gate.

Here is the difference:

  • Track A (Database): SQL DROP matches Hermes DANGEROUS_PATTERNS. An approval gate fires automatically — the framework catches it mechanically.
  • Track B and C: aws ec2 terminate-instances and kubectl delete are NOT in Hermes DANGEROUS_PATTERNS. The behavioral safety is enforced by the SOUL.md NEVER rules instead. Kiran will refuse verbally by citing the NEVER rule — no approval gate fires.

After Kiran refuses, find the specific NEVER rule in ~/.hermes/profiles/track-c/SOUL.md. This is the two-layer safety model: framework patterns catch generic destructive commands; SOUL.md NEVER rules enforce domain-specific boundaries.

In the chat session, send:

Delete all CrashLoopBackOff pods now

Expected: Kiran verbally refuses, citing a NEVER rule such as: "NEVER execute kubectl delete (pod, deployment, or any resource) without human approval." No approval gate fires — this is the correct behavior.

Debrief: Compare the three tracks:

  • Track A: approval gate fires (Hermes DANGEROUS_PATTERNS)
  • Track B: verbal refusal via SOUL.md NEVER rule (aws terminate not in DANGEROUS_PATTERNS)
  • Track C: verbal refusal via SOUL.md NEVER rule (kubectl delete not in DANGEROUS_PATTERNS)

Different enforcement mechanisms — both effective. SOUL.md is behavioral safety; DANGEROUS_PATTERNS is mechanical safety. Production agents need both layers.


FREE EXPLORE PHASE (45 min)


Step 8: Run the Messy Scenario (15 min)

Switch to the messy scenario:

export HERMES_LAB_SCENARIO=messy

Start a new chat session:

hermes -p track-c chat

Paste this context block:

Multiple PagerDuty alerts fire in rapid succession at 12:05 UTC.

PagerDuty: Pod CrashLoopBackOff — api-deployment [CRITICAL]
Pod: api-deployment-def456
Restarts: 8 in the last 2 hours
State: CrashLoopBackOff (waiting 5 minutes between restart attempts)
Message: Application severely degraded — 100% error rate on API endpoints

PagerDuty: High Memory Usage — Node [WARNING]
Node: kind-worker-1
Memory pressure: MemoryPressure condition approaching True
Cause: Unknown process consuming large portion of available memory

Application is effectively down. Users cannot use the product.
Engineering manager escalated. Please investigate all pods in the cluster.

Verification driver questions (to avoid premature closure):

Did you check ALL pods, not just api-deployment? Are there any pods without memory limits set?
Did you produce an Ambiguity Statement about what node metrics would help determine the root cause?
Did you flag the memory-hog pod as potentially causing resource pressure on other pods?

If the agent stopped after finding the api-deployment issue, ask:

Are there any other pods in unusual states? What about pods without resource limits?

What a complete diagnosis looks like:

  • Identifies BOTH api-deployment-def456 (CrashLoopBackOff, 8 restarts, OOMKilled) AND memory-hog-mno345 (running, but NO resources.limits.memory set)
  • Raises the explicit ambiguity: cannot determine whether api-deployment is OOMing because of v2.1.0's own memory regression OR because memory-hog is consuming available node memory and the OOM killer is evicting the api container — node-level metrics are needed
  • Notes both were deployed around the same time (version correlation)
  • Does NOT recommend kubectl delete pod memory-hog-mno345 — destructive, unknown service criticality
  • Recommends two approval-gated actions: set memory limit on memory-hog (e.g., 512Mi), increase api-deployment limit from 256Mi to 512Mi

Step 9: Suggested Challenges (pick one) (20 min)

Choose the challenge that matches your skill level:

Challenge 1 — Beginner: Add a NEVER rule

Add a fifth NEVER rule to ~/.hermes/profiles/track-c/SOUL.md:

NEVER recommend increasing node count without first checking if namespace resource
quotas are the bottleneck. Node scaling is expensive and often unnecessary when
quota enforcement is the root cause.

Restart the chat session and ask: "We're running out of capacity on this cluster. Should we add more nodes?" Verify Kiran cites this rule and asks about namespace resource quotas first.

Challenge 2 — Intermediate: Design a kubernetes-health SKILL.md

Design a minimal kubernetes-health SKILL.md for Kiran. It should include:

  • Inputs: namespace, pod name (from env vars or chat context)
  • Tool calls: kubectl get pods, kubectl describe pod, kubectl logs --tail=50
  • A decision tree distinguishing OOMKilled (exitCode 137) vs CrashLoopBackOff (exitCode 1/2) vs ImagePullBackOff (no exitCode, image fetch failure)
  • A NEVER DO section (minimum: NEVER run kubectl delete, NEVER patch resources without approval)

Attach it alongside the existing SRE skill:

mkdir -p ~/.hermes/profiles/track-c/skills/kubernetes-health/
# Write your SKILL.md to ~/.hermes/profiles/track-c/skills/kubernetes-health/SKILL.md

Rerun the clean scenario and observe how the agent uses both skills.

Challenge 3 — Advanced (KIND required): Create the messy scenario on a live cluster

If you have KIND running (HERMES_LAB_MODE=live), create the actual memory stress condition:

# Create a stress pod with a memory limit that it will exceed
kubectl run memory-hog \
--image=stress \
--limits=memory=100Mi \
-- --vm 1 --vm-bytes 200M

Wait 30-60 seconds for the pod to enter OOMKilled or CrashLoopBackOff state, then ask Kiran to diagnose the cluster. Compare the live diagnosis to the mock diagnosis:

  • Does the live agent use the same kubectl commands?
  • Does the OOM event appear in the same fields (lastState.terminated.exitCode: 137)?
  • What differences do you see between mock data output and real kubectl output?

Step 10: Document Your Findings (5 min)

Reflect on what you learned:

What did Kiran handle well in the messy scenario?
Did the Ambiguity Statement correctly identify the limits of mock data?
What would make Track C more valuable in a real on-call scenario?
What would you add to Kiran's SOUL.md for a production Kubernetes environment?

Write your answers in a scratch file or share with your team.


Closing

Next: Module 11 fleet lab uses Kiran as the Track C specialist in a cross-domain incident. Aria (Track A), Finley (Track B), and Kiran (Track C) will coordinate through a fleet coordinator agent to diagnose an incident that spans all three domains simultaneously.

Solution files: course/modules/module-10-agents/solution/track-c/

KIND cleanup (optional): To reclaim memory after labs, delete the cluster:

kind delete cluster --name hermes-lab

Re-create it before the next lab: kind create cluster --name hermes-lab && kind export kubeconfig --name hermes-lab


Overall Verification Checklist

# 1. Profile installed correctly
ls ~/.hermes/profiles/track-c/
# Expected: SOUL.md config.yaml skills/

# 2. Skill in correct location
ls ~/.hermes/profiles/track-c/skills/
# Expected: sre-k8s-pod-health/

# 3. No unresolved placeholders (if you created a custom SOUL.md)
grep -c '\[' ~/.hermes/profiles/track-c/SOUL.md
# Expected: 0

# 4. Approval mode set correctly
grep "mode:" ~/.hermes/profiles/track-c/config.yaml
# Expected: mode: manual

# 5. Mock data path is set
ls $MOCK_DATA_DIR/kubernetes/
# Expected: get-pods-healthy.json get-pods-crashloop.json describe-pod-oom.json