Exploratory: Fleet Orchestration Stretch Projects

These are exploratory stretch projects — not required to complete Module 11. They extend fleet orchestration concepts into more realistic operational scenarios.

Project 1: Incident Response Fleet

Estimated time: 60 minutes
Extends: Module 11 lab (fleet with coordinator)
Prerequisites: Fleet lab completed, coordinator and three specialists running

What You Will Build

A complete incident response simulation: the coordinator receives a multi-signal incident, delegates to all three specialists in parallel, receives their findings, performs correlation analysis, and generates an executive-level incident report in the format your organization uses for P2 postmortems.

Challenge

The challenge is synthesis quality. With three specialists producing separate findings, the coordinator must identify which findings are independent (separate issues that happened to occur at the same time), which are causally linked (one finding explains another), and which are correlated but not causal (different symptoms of the same root cause appearing at the same time, with neither finding causing the other).

These three cases produce different incident conclusions and different recommended actions.

Steps

1. Design a realistic cross-domain incident scenario using simulated data that has a clear causal chain across domains. Example:
   - EC2 autoscaling added 5 new instances at 02:10 (FinOps data: cost spike)
   - The 5 new instances were added to the K8s node pool at 02:12 (K8s data: new nodes)
   - New pods were scheduled on the new nodes at 02:14 (K8s data: pod events)
   - RDS received 50 new connections from the new pods at 02:15 (DB data: connection count spike)
   - API latency increased at 02:16 as the new pods initialized (K8s data: pod readiness)
2. Run the coordinator against the simulated data and observe whether it reconstructs the causal chain.
3. Write a "postmortem format" SOUL.md constraint: the coordinator's output must match your organization's postmortem format (Timeline, Root Cause, Contributing Factors, Recommendations, Action Items).
4. If the coordinator misses the causal chain, update the coordination skill to include a "timestamp correlation" step that explicitly checks for events occurring within 2-minute windows across all specialist findings.
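
The "timestamp correlation" step can be prototyped outside the fleet before it is written into the skill. The following is a minimal sketch, not the lab's implementation: the `Finding` record, the `at`, `correlate`, and `cross_domain` helpers, and the calendar date are all assumptions made for illustration. Only the clock times, the specialist names, and the 2-minute window come from the text above. It chains findings whose gap to the previous finding is at most the window, then keeps chains that span more than one specialist.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one specialist finding."""
    specialist: str   # which specialist reported it
    time: datetime    # when the event occurred
    summary: str


def at(hhmm: str) -> datetime:
    # The calendar date is arbitrary; only the clock times come from the scenario.
    return datetime.strptime(f"2000-01-01 {hhmm}", "%Y-%m-%d %H:%M")


def correlate(findings, window=timedelta(minutes=2)):
    """Chain findings whose gap to the previous finding is at most `window`."""
    chains = []
    for finding in sorted(findings, key=lambda f: f.time):
        if chains and finding.time - chains[-1][-1].time <= window:
            chains[-1].append(finding)
        else:
            chains.append([finding])
    return chains


def cross_domain(chains):
    """Keep only chains that involve more than one specialist."""
    return [c for c in chains if len({f.specialist for f in c}) > 1]


# The example scenario, deliberately listed out of order.
FINDINGS = [
    Finding("rds-health-agent", at("02:15"), "50 new connections from new pods"),
    Finding("finops-agent", at("02:10"), "autoscaling added 5 instances, cost spike"),
    Finding("k8s-health-agent", at("02:16"), "API latency up while pods initialize"),
    Finding("k8s-health-agent", at("02:12"), "5 new nodes joined the node pool"),
    Finding("k8s-health-agent", at("02:14"), "new pods scheduled on new nodes"),
]

for chain in cross_domain(correlate(FINDINGS)):
    print(" -> ".join(f"{f.time:%H:%M} {f.specialist}" for f in chain))
```

Temporal proximity only nominates candidates for a causal link. The coordinator still has to decide whether a chain is causal, correlated through a shared root cause, or coincidental.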

Expected Deliverable

A simulated incident scenario (mock data files), coordinator output in postmortem format, and notes on whether the coordinator correctly identified the causal chain vs. treated each finding as independent.
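
When reviewing the coordinator output, a quick structural check can confirm that the postmortem sections are all present before you assess the content. This is a rough sketch under one assumption: each section name appears on its own line, optionally prefixed with markdown `#` characters. The `missing_sections` helper is hypothetical; adjust the matching to whatever your organization's format actually looks like.

```python
REQUIRED_SECTIONS = [
    "Timeline",
    "Root Cause",
    "Contributing Factors",
    "Recommendations",
    "Action Items",
]


def missing_sections(report: str) -> list[str]:
    """Return required section names that do not appear as a heading line."""
    # Assumption: a heading is a line consisting only of the section name,
    # with optional leading '#' characters and surrounding whitespace.
    headings = {
        line.strip().lstrip("#").strip().lower()
        for line in report.splitlines()
        if line.strip()
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


sample = """## Timeline
02:10 autoscaling event
## Root Cause
Scale-out increased DB connections
## Contributing Factors
No connection pooling
## Recommendations
Add a connection pooler
"""
print(missing_sections(sample))
```

An empty list means the format constraint was met structurally. It says nothing about whether the causal chain in the report is correct.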

Project 2: Auto-Routing Coordinator

Estimated time: 45 minutes
Extends: Module 11 lab (coordinator skill)
Prerequisites: Fleet lab completed

What You Will Build

Improve the coordinator's routing decision: instead of routing based on keyword matching in the incident description, build a routing skill that uses structured triage questions to determine which specialists to engage.

Challenge

Keyword routing is fragile. "Latency spike" might be a database issue, a Kubernetes scheduling issue, a network issue, or a cost-related instance degradation. Better routing asks: "What is observable? What time window? What scale?" and then decides which specialists can contribute based on the answers.

Steps

1. Write a triage questionnaire that the coordinator runs before delegating:

## Triage (Step 0 — before delegation)

Ask user (if information is not in the incident report):
1. What is the primary observable symptom? (high latency / error rate / cost spike / pod failures / etc.)
2. What time did it start?
3. What is the affected scope? (specific service, all services, a region, all regions)
4. Is this P1 (customer-facing SLA breach), P2 (degraded service), or P3 (internal only)?

Based on answers, route to:
- High latency → all three specialists (API latency is multi-domain)
- Pod failures only → k8s-health-agent only
- Cost spike, no user impact → finops-agent only
- DB slow queries + latency → rds-health-agent + k8s-health-agent (confirm no pod pressure causing connection load)

2. Update the coordinator's coordination SKILL.md to include the triage step before delegation.
3. Test with 3 different incident descriptions: one that routes to a single specialist, one that routes to two specialists, and one that routes to all three.
4. Measure: does triage-based routing produce faster, more focused specialist responses than routing all incidents to all specialists?
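
To reason about the routing rules before encoding them in the skill, it can help to express them as a small decision function and run your three test incidents through it. The sketch below is illustrative only: the `route` function, the symptom strings, and the `user_impact` flag are assumptions, not part of the lab. It also makes two choices the triage text leaves open: the more specific "DB slow queries + latency" rule is checked before the general "high latency" rule, and anything unmatched falls back to all three specialists.

```python
ALL_SPECIALISTS = ["k8s-health-agent", "rds-health-agent", "finops-agent"]


def route(symptoms: set[str], user_impact: bool = True) -> list[str]:
    """Map triage answers to the specialists that should be engaged."""
    # Assumption: the specific DB rule takes precedence over plain latency.
    if {"high latency", "db slow queries"} <= symptoms:
        return ["rds-health-agent", "k8s-health-agent"]
    if "high latency" in symptoms:
        return list(ALL_SPECIALISTS)
    if symptoms == {"pod failures"}:
        return ["k8s-health-agent"]
    if "cost spike" in symptoms and not user_impact:
        return ["finops-agent"]
    # Assumption: unmatched incidents are broadcast rather than dropped.
    return list(ALL_SPECIALISTS)


for answers in (
    ({"pod failures"}, True),
    ({"high latency", "db slow queries"}, True),
    ({"high latency"}, True),
):
    print(sorted(answers[0]), "->", route(*answers))
```

The three sample inputs correspond to the single-specialist, two-specialist, and all-three cases used in the test step.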

Expected Deliverable

Updated coordination SKILL.md with triage step, test results for three incident types, and a comparison: triage routing vs. broadcast-to-all routing in terms of synthesis quality.

Which Project Should You Do?

Your Interest	Recommended Project
Incident response realism	Project 1 (incident simulation)
Routing intelligence	Project 2 (auto-routing)
Limited time available	Project 2 (shorter estimate at 45 minutes; the skill update is focused and measurable)

Both projects prepare you for Module 13 governance work: Project 1 produces a postmortem format that can be incorporated into audit logging; Project 2 produces routing logic that can be extended with approval gates for high-severity incidents.
