Exploratory: Fleet Orchestration Stretch Projects
These are exploratory stretch projects — not required to complete Module 12. They extend fleet orchestration concepts into more realistic operational scenarios.
Project 1: Incident Response Fleet
Estimated time: 60 minutes
Extends: Module 12 lab (fleet with coordinator)
Prerequisites: Fleet lab completed; coordinator and three specialists running
What You Will Build
A complete incident response simulation: the coordinator receives a multi-signal incident, delegates to all three specialists in parallel, receives their findings, performs correlation analysis, and generates an executive-level incident report in the format your organization uses for P2 postmortems.
Challenge
The challenge is synthesis quality. With three specialists producing separate findings, the coordinator must distinguish: findings that are independent (separate issues that happened to surface at the same time), findings that are causally linked (one finding explains another), and findings that are correlated but not causal (same root cause, same time window, different symptoms).
These three cases produce different incident conclusions and different recommended actions.
Steps
- Design a realistic cross-domain incident scenario using simulated data with a clear causal chain across domains (a mock-data sketch follows this list). Example:
- EC2 autoscaling added 5 new instances at 02:10 (FinOps data: cost spike)
- 5 new instances were added to the K8s node pool at 02:12 (K8s data: new nodes)
- New pods were scheduled on the new nodes at 02:14 (K8s data: pod events)
- RDS received 50 new connections from new pods at 02:15 (DB data: connection count spike)
- API latency increased at 02:16 as new pods initialized (K8s data: pod readiness)
- Run the coordinator against the simulated data and observe whether it reconstructs the causal chain
- Write a "postmortem format" SOUL.md constraint: the coordinator's output must match your organization's postmortem format (Timeline, Root Cause, Contributing Factors, Recommendations, Action Items)
- If the coordinator misses the causal chain, update the coordination skill to include a "timestamp correlation" step that explicitly checks for events occurring within 2-minute windows across all specialist findings
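One way to capture the scenario-design step as mock data files is sketched below, one file per specialist domain. The directory, file names, and field names are illustrative only; match whatever format your specialists already consume.

# Illustrative mock data files for the causal-chain scenario above
# (paths, file names, and fields are hypothetical)
mkdir -p mock-data

cat <<'EOF' > mock-data/finops-events.json
[
  {"time": "02:10", "source": "ec2-autoscaling", "event": "scale_out", "instances_added": 5, "note": "cost spike begins"}
]
EOF

cat <<'EOF' > mock-data/k8s-events.json
[
  {"time": "02:12", "event": "nodes_joined_pool", "count": 5},
  {"time": "02:14", "event": "pods_scheduled_on_new_nodes"},
  {"time": "02:16", "event": "api_latency_increase", "note": "new pods still initializing"}
]
EOF

cat <<'EOF' > mock-data/rds-events.json
[
  {"time": "02:15", "event": "connection_spike", "new_connections": 50, "note": "connections from new pods"}
]
EOF

Keeping the events two minutes apart, as in the scenario above, gives the timestamp-correlation step something concrete to detect.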
Expected Deliverable
A simulated incident scenario (mock data files), coordinator output in postmortem format, and notes on whether the coordinator correctly identified the causal chain vs. treated each finding as independent.
Project 2: Auto-Routing Coordinator
Estimated time: 45 minutes
Extends: Module 12 lab (coordinator skill)
Prerequisites: Fleet lab completed
What You Will Build
An improved routing decision for the coordinator: instead of routing on keyword matches in the incident description, a routing skill that uses structured triage questions to determine which specialists to engage.
Challenge
Keyword routing is fragile. "Latency spike" might be a database issue, a Kubernetes scheduling issue, a network issue, or a cost-related instance degradation. Better routing asks: "What is observable? What time window? What scale?" and then decides which specialists can contribute based on the answers.
Steps
- Write a triage questionnaire that the coordinator runs before delegating:
## Triage (Step 0 — before delegation)
Ask user (if information is not in the incident report):
1. What is the primary observable symptom? (high latency / error rate / cost spike / pod failures / etc.)
2. What time did it start?
3. What is the affected scope? (specific service, all services, a region, all regions)
4. Is this P1 (customer-facing SLA breach), P2 (degraded service), or P3 (internal only)?
Based on answers, route to:
- High latency → all three specialists (API latency is multi-domain)
- Pod failures only → k8s-health-agent only
- Cost spike, no user impact → finops-agent only
- DB slow queries + latency → rds-health-agent + k8s-health-agent (confirm no pod pressure causing connection load)
- Update the coordinator's coordination SKILL.md to include the triage step before delegation
- Test with three different incident descriptions: one that routes to a single specialist, one to two specialists, and one to all three (sample prompts are sketched after this list)
- Measure: does triage-based routing produce faster, more focused specialist responses than routing all incidents to all specialists?
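A sketch of the three incident descriptions for the testing step, run against the coordinator. The profile name fleet-coordinator and the incident wording are invented for illustration; the hermes flags mirror the CLI usage elsewhere in this module, so adjust them to however your coordinator is actually invoked.

# Hypothetical coordinator profile name; expected routing follows the triage table above
hermes -p fleet-coordinator chat \
  --prompt "Pods in the checkout namespace are in CrashLoopBackOff since 09:40. No latency or cost alerts."
# Expected route: k8s-health-agent only

hermes -p fleet-coordinator chat \
  --prompt "Slow queries on the orders database since 14:05 and API p95 latency is up. No cost anomaly."
# Expected route: rds-health-agent + k8s-health-agent

hermes -p fleet-coordinator chat \
  --prompt "API latency spiked across all regions at 02:16; FinOps also flagged an overnight cost increase."
# Expected route: all three specialists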
Expected Deliverable
Updated coordination SKILL.md with triage step, test results for three incident types, and a comparison: triage routing vs. broadcast-to-all routing in terms of synthesis quality.
Which Project Should You Do?
| Your Interest | Recommended Project |
|---|---|
| Incident response realism | Project 1 (incident simulation) |
| Routing intelligence | Project 2 (auto-routing) |
| Under 30 minutes available | Project 2 — skill update is focused and measurable |
Both projects prepare you for Module 13 governance work: Project 1 produces a postmortem format that can be incorporated into audit logging; Project 2 produces routing logic that can be extended with approval gates for high-severity incidents.
Project 3: K8s Agent Sandbox — Isolated Agent Deployment (Alpha, Exploratory)
Estimated time: 75 minutes
Extends: Module 12 lab (fleet coordinator)
Prerequisites: KIND cluster running; Phase 6 Track C agent profile installed
K8s Agent Sandbox is at v0.2.1 (alpha) as of 2026-04-07. The CRDs are under
agents.x-k8s.io/v1alpha1 and are subject to breaking changes between releases. Phase 9 ships
this as an exploratory project only — the main Module 12 lab does NOT depend on Sandbox.
If you run into breakage, fall back to the main lab's Morgan fleet coordinator flow.
This project will be promoted to a required GUIDED lab step in a future course version (v1.2+) once the Sandbox CRDs reach beta/stable.
What You Will Build
A sandboxed deployment of one of your Phase 6 Track C agents (the K8s diagnostic specialist) running inside a Kubernetes-managed isolation boundary using the K8s Agent Sandbox CRDs. You will observe the isolation: the sandboxed agent cannot access resources outside its namespace, cannot escape its resource quota, and can be cleanly destroyed via CRD delete.
This is the Kubernetes-native answer to the multi-tenancy question that arises in production fleets: "how do I run N agents for N teams without them stepping on each other?"
Prerequisites
# Confirm KIND is running
kubectl cluster-info --context kind-kind
# Confirm Track C agent profile is available
ls agents/track-c-kubernetes/SOUL.md
Step 1: Install the Sandbox CRDs (pinned v0.2.1)
# Primary CRDs: Sandbox, SandboxTemplate
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.2.1/manifest.yaml
# Extension CRDs: SandboxClaim, SandboxWarmPool
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.2.1/extensions.yaml
# Verify CRDs installed
kubectl get crd | grep agents.x-k8s.io
# Expected:
# sandboxes.agents.x-k8s.io
# sandboxtemplates.agents.x-k8s.io
# sandboxclaims.agents.x-k8s.io
# sandboxwarmpools.agents.x-k8s.io
If the v0.2.1 release URL 404s, check https://github.com/kubernetes-sigs/agent-sandbox/releases for the current release. This project pins v0.2.1 for reproducibility at course release time.
Step 2: Create a SandboxTemplate for Your Track C Agent
Write sandbox-template-track-c.yaml:
apiVersion: agents.x-k8s.io/v1alpha1
kind: SandboxTemplate
metadata:
name: track-c-diagnostic-template
namespace: default
spec:
image: your-registry/track-c-kubernetes:v0.4.2 # build from agents/track-c-kubernetes/
resources:
limits:
memory: "512Mi"
cpu: "250m"
requests:
memory: "256Mi"
cpu: "100m"
networkPolicy:
ingress: [] # no incoming traffic
egress:
- to: kube-system # allow DNS
- to: k8s-trouble-crashloop # only the target diagnostic namespace
env:
- name: HERMES_LAB_GOVERNANCE
value: "L2" # read-only diagnostic (not L4 for exploratory)
- name: HERMES_LAB_TRACK
value: "track-c"
Apply:
kubectl apply -f sandbox-template-track-c.yaml
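Optionally confirm the template object was accepted before moving on (the resource kind comes from the CRDs installed in Step 1):

kubectl get sandboxtemplate track-c-diagnostic-template -n default
# Expected: the template listed with a recent AGE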
Step 3: Instantiate a Sandbox from the Template
Write track-c-sandbox-001.yaml:
apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
name: track-c-sandbox-001
namespace: default
spec:
template: track-c-diagnostic-template
ttl: 30m # auto-destroy after 30 minutes
Apply and inspect:
kubectl apply -f track-c-sandbox-001.yaml
kubectl get sandbox track-c-sandbox-001 -o yaml
Wait for status.phase: Running before proceeding.
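If you prefer to block rather than re-run kubectl get, a sketch using a JSONPath wait condition (requires kubectl v1.23+ and assumes the alpha CRD reports status.phase as described here):

kubectl wait sandbox/track-c-sandbox-001 -n default \
  --for=jsonpath='{.status.phase}'=Running \
  --timeout=120s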
Step 4: Verify Isolation
Test 1 — Agent can access its target namespace:
kubectl exec -n default track-c-sandbox-001 -- kubectl get pods -n k8s-trouble-crashloop
# Expected: pod list including the crasher pod
Test 2 — Agent CANNOT access unrelated namespaces:
kubectl exec -n default track-c-sandbox-001 -- kubectl get pods -n kube-system
# Expected: Error from server (Forbidden): pods is forbidden
Test 3 — Agent cannot exceed its resource quota:
# Trigger a memory-intensive agent run — observe eviction
kubectl exec -n default track-c-sandbox-001 -- hermes -p track-c chat \
--prompt "Analyze all logs from all pods in all namespaces in extreme detail"
# Expected: OOMKilled or eviction event visible in Sandbox status
kubectl get events -n default | grep track-c-sandbox-001
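To back up the Test 3 result, you can also check the container's last termination reason directly. This assumes the Sandbox runs its workload as a pod of the same name, which is what the exec commands above imply:

kubectl get pod track-c-sandbox-001 -n default \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
# Expected if the memory limit was hit: OOMKilled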
Document the actual outputs of all three tests against the expected outputs above. Note any deviations — Sandbox is alpha and behavior may differ.
Step 5: Clean Up
# Delete the Sandbox (CRD-managed teardown)
kubectl delete sandbox track-c-sandbox-001
# Delete the SandboxTemplate
kubectl delete sandboxtemplate track-c-diagnostic-template
# Optionally uninstall the CRDs (removes agent-sandbox from your cluster entirely)
kubectl delete -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.2.1/extensions.yaml
kubectl delete -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/v0.2.1/manifest.yaml
Expected Deliverable
- sandbox-template-track-c.yaml and track-c-sandbox-001.yaml applied to KIND
- Isolation Tests 1, 2, and 3 documented with expected vs. actual outputs
- Notes on what you would change for a production Sandbox deployment (image registry, network policy hardening, TTL tuning, governance level)
Why This Matters
The Phase 9 FLEET-01 chain runs all agents in a single gateway process. That works for a solo operator or a small team, but at scale you need per-agent isolation: different teams, different trust levels, different blast radius boundaries.
K8s Agent Sandbox is the emerging Kubernetes-native answer. It models the pattern that will
become standard: agents are CRDs, lifecycle is K8s-managed, isolation is network+quota
enforced, and destruction is declarative (kubectl delete sandbox).
For now: explore, document what breaks, and stay aware. When Sandbox reaches beta/stable, this project becomes a required lab step.
Related
- Module 12 reading reference §7.4 Scaling — multi-tenancy isolation discussion
- https://github.com/kubernetes-sigs/agent-sandbox — upstream project (alpha v0.2.1)
- Module 13 governance reading — per-agent blast radius concepts that Sandbox enforces at the K8s layer