
Reference: Fleet Configuration and Coordinator Templates

Quick-reference for Module 12 — configuring a Hermes fleet with a coordinator and specialist agents.


1. Fleet Architecture Overview

┌─────────────────────────────────────────────────────┐
│                  Coordinator Agent                  │
│                                                     │
│  soul.md      "I route, collect, synthesize"        │
│  config.yaml  Has delegate_task tool enabled        │
│  skills/      coordination-skill.md                 │
└─────────────┬───────────────┬───────────────┬───────┘
              │               │               │
     ┌────────▼───┐   ┌───────▼────┐   ┌──────▼─────┐
     │ DB Health  │   │  FinOps    │   │ K8s Health │
     │  Agent     │   │  Agent     │   │  Agent     │
     │ (Track A)  │   │ (Track B)  │   │ (Track C)  │
     └────────────┘   └────────────┘   └────────────┘

2. Coordinator SOUL.md Template

# Hermes — Incident Coordinator

## Identity
You are Hermes Coordinator, the fleet orchestrator for the Platform Engineering team.
Your role is to receive cross-domain incidents, delegate to specialist agents, and synthesize their findings into a unified diagnosis.

You are NOT a domain specialist. You have no deep expertise in databases, Kubernetes, or cost analysis individually. Your expertise is in knowing which specialist to ask and how to synthesize their responses.

## Specialist Agents Available
- **rds-health-agent**: Database performance, connection pool, slow query analysis
- **k8s-health-agent**: Kubernetes pod health, resource pressure, deployment issues
- **finops-agent**: AWS cost anomalies, EC2 utilization, right-sizing

## Coordination Procedure
1. Analyze the incident: identify which domains are involved
2. Delegate to each relevant specialist with a bounded, specific task
3. Wait for all specialist responses
4. Identify cross-domain patterns (same timestamp, correlated metrics)
5. Generate unified incident report

## Communication Style
- Lead the output with: "Cross-Domain Incident Report"
- Structure: Executive Summary → Domain Findings (one section per specialist) → Correlation Analysis → Root Cause Hypothesis → Recommended Actions → Escalation Decision
- Label each domain finding with the specialist agent name that produced it

## Behavioral Constraints
- You NEVER attempt domain-specific diagnosis yourself — delegate to specialists
- You ALWAYS include correlation analysis even if specialists found independent issues
- You ESCALATE the entire fleet report to on-call if any specialist escalates at P1 or P2
- You DO NOT add context that was not in the specialist outputs — your job is synthesis, not speculation

## What You Do Not Do
- Domain-specific commands (no direct kubectl, aws, psql calls — delegates handle this)
- Recommendations without grounding in specialist evidence
- Pretend to have domain expertise you do not have

3. Fleet config.yaml with Delegation

profile_name: "incident-coordinator"
soul: "./soul.md"
model: "claude-opus-4-5"

tools:
  delegation:
    enabled: true
    agents:
      rds-health-agent:
        profile_path: "../rds-health-agent/"
        timeout: 60   # seconds to wait for specialist response
      k8s-health-agent:
        profile_path: "../k8s-health-agent/"
        timeout: 45
      finops-agent:
        profile_path: "../finops-agent/"
        timeout: 60
    max_concurrent_delegations: 3   # all three can run in parallel
    delegation_timeout: 90          # overall timeout if specialists don't respond

skills:
  - path: "./skills/coordination.md"
    triggers: ["incident", "investigate", "analyze", "cross-domain", "latency", "spike", "anomaly"]
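
The timeout semantics above (per-agent timeout, overall `delegation_timeout`, and `max_concurrent_delegations`) can be approximated in plain Python. This is a minimal sketch, not the real Hermes delegation engine — the `delegate` callable is a stand-in for the `delegate_task` tool:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Per-agent timeouts mirroring the config above (seconds).
AGENT_TIMEOUTS = {"rds-health-agent": 60, "k8s-health-agent": 45, "finops-agent": 60}
OVERALL_TIMEOUT = 90   # delegation_timeout
MAX_CONCURRENT = 3     # max_concurrent_delegations

def fan_out(delegate, tasks):
    """Run one delegation per agent in parallel, honoring per-agent timeouts.

    `delegate(agent, task)` stands in for the real delegate_task tool and
    returns the specialist's response. A None result marks a timed-out agent.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        futures = {agent: pool.submit(delegate, agent, task)
                   for agent, task in tasks.items()}
        for agent, fut in futures.items():
            try:
                results[agent] = fut.result(timeout=AGENT_TIMEOUTS.get(agent, OVERALL_TIMEOUT))
            except FutureTimeout:
                results[agent] = None  # specialist did not respond in time
    return results
```

The coordinator then synthesizes from `results`, treating `None` entries as "specialist unavailable" rather than blocking the whole report.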

4. Coordinator Skill Template

# Cross-Domain Incident Coordination

## Metadata
- version: 1.0.0
- domain: Coordination / Fleet
- author: Platform Engineering
- triggers: ["incident", "investigate", "cross-domain analysis"]

## Inputs
- incident_description: string — what the engineer reported
- time_window: string — incident start and duration (e.g., "02:00-06:00 UTC April 1")
- severity: string — P1/P2/P3 or Unknown

## Procedure

1. Analyze incident description to identify affected domains:
- Keywords suggesting DB domain: latency, query, connection, RDS, slow, database
- Keywords suggesting K8s domain: pod, crashloop, restart, deploy, container, OOMKilled
- Keywords suggesting cost domain: bill, spend, cost, charge, usage, anomaly

2. For each identified domain, delegate with bounded task:

delegate_task(
    agent="[specialist-agent-name]",
    task="[specific-domain-question]",
    context="Incident: {incident_description}. Time window: {time_window}. Specifically: [domain-specific question]"
)


3. Collect all specialist responses. Note which domains reported issues and which reported normal.

4. Correlation analysis:
- Do any specialists report anomalies at the same timestamp?
- Does one finding explain another? (e.g., pod increase → DB connection spike)
- Are there independent issues that happen to coincide?

5. Generate cross-domain report per format in SOUL.md.
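
Step 1's keyword routing can be sketched as a small lookup. The keyword lists are copied from the procedure above; the function name is illustrative:

```python
# Keyword lists from step 1 of the procedure, lowercased for matching.
DOMAIN_KEYWORDS = {
    "rds-health-agent": ["latency", "query", "connection", "rds", "slow", "database"],
    "k8s-health-agent": ["pod", "crashloop", "restart", "deploy", "container", "oomkilled"],
    "finops-agent": ["bill", "spend", "cost", "charge", "usage", "anomaly"],
}

def route_incident(description: str) -> list[str]:
    """Return the specialist agents whose keywords appear in the incident text."""
    text = description.lower()
    return [agent for agent, words in DOMAIN_KEYWORDS.items()
            if any(w in text for w in words)]
```

An incident mentioning both pods and billing routes to two specialists; an incident matching nothing routes to none, which should prompt the coordinator to ask the reporter for more detail.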

## Decision Trees

### Domain Routing

| Incident Keywords | Delegate To |
|------------------|-------------|
| RDS, database, query, latency, connection | rds-health-agent |
| pod, crashloop, deploy, kubernetes, OOMKilled | k8s-health-agent |
| cost, spend, bill, EC2 usage, unused | finops-agent |
| API latency, service slow, timeout | All three (API latency crosses all domains) |

### Escalation Aggregation

| Specialist Escalations | Coordinator Action |
|-----------------------|-------------------|
| Any P1 escalation | Escalate full fleet report at P1 immediately |
| Any P2 escalation | Escalate full fleet report at P2 |
| All specialists: no action | Document as normal — no escalation |
| Mixed P3 findings | Escalate at P3 with correlation analysis |
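
The aggregation table folds down to "highest severity wins". A sketch, with return labels that are assumptions mirroring the table rows:

```python
def aggregate_escalation(specialist_levels: list[str]) -> str:
    """Fold specialist escalation levels ('P1', 'P2', 'P3', 'none') into the
    coordinator action from the table above: highest severity wins."""
    levels = [lvl.upper() for lvl in specialist_levels]
    if "P1" in levels:
        return "escalate-P1"
    if "P2" in levels:
        return "escalate-P2"
    if "P3" in levels:
        return "escalate-P3-with-correlation"
    return "document-no-escalation"
```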

5. Delegation Message Examples

Good Delegation (Bounded and Context-Rich)

To: rds-health-agent
Task: Analyze RDS db-prod-01 connection pool and query latency for 2026-04-01 02:00-06:00 UTC.
Context: Incident report: API service response times increased 300% starting 02:15 UTC. EC2 CPU is normal (35% average). Specifically: is RDS connection pool saturation contributing to the API latency increase?
Expected output: Structured diagnosis with Evidence, Root Cause Hypothesis, and Escalation Decision.

Poor Delegation (Too Broad)

To: rds-health-agent
Task: Check if the database is okay.

The second form makes the specialist do the scoping work the coordinator should have done. The specialist has no time window, no incident context, and no specific question to answer.
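
The difference between the two forms can be checked mechanically before delegating. A sketch of a pre-flight validator — the heuristics and field names are illustrative, not part of the Hermes API:

```python
import re

def validate_delegation(task: str, context: str = "") -> list[str]:
    """Return a list of problems with a delegation message; an empty list
    means it is bounded and context-rich in the sense described above."""
    problems = []
    combined = f"{task} {context}"
    # A bounded task names a time window (e.g. "02:00-06:00 UTC").
    if not re.search(r"\d{1,2}:\d{2}", combined):
        problems.append("no time window")
    # A context-rich delegation asks a specific question.
    if "?" not in combined and "Specifically" not in combined:
        problems.append("no specific question")
    if len(task.split()) < 6:
        problems.append("task too vague")
    return problems
```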


6. Solo Learner Fleet Setup

If completing the fleet lab solo, configure all three agents sequentially and run a simulated incident:

# Directory structure for solo fleet
solo-fleet/
├── coordinator/
│   ├── soul.md
│   ├── config.yaml
│   └── skills/coordination.md
├── rds-health-agent/    # From Module 10 Track A
├── k8s-health-agent/    # From Module 10 Track C
└── finops-agent/        # From Module 10 Track B

# Run the coordinator with the cross-domain incident
hermes --profile ./coordinator --task "Investigate: API latency spike started at 02:15. All three infrastructure domains potentially involved."

The coordinator will delegate to each specialist, collect their analyses of the simulated data, and produce a cross-domain synthesis. You are playing the role of all three specialists' "infrastructure" — the mock data files provide the evidence each specialist reads.


7. Productionizing Hermes Agents

Phase 9 closes v1.1 with a production-grade incident response chain running on KIND. This section answers the natural follow-up question: "How do I take this to production?" It covers four topics — packaging, deployment, monitoring, and scaling — with real Hermes config examples and cross-references to the Phase 6/7/8 components you built earlier in the course.

This is not generic cloud architecture theory. Every example below ties to a specific artifact you created in the course repo.


7.1 Packaging

Agents are shipped as three kinds of artifacts: profile directories (agents/), skills libraries (skills/), and the Hermes runtime (pip-installed or container-packaged).

Container image structure

The canonical Hermes agent container has four layers:

FROM python:3.12-slim

# Layer 1: Hermes runtime (pinned version from PyPI or GitHub source)
RUN pip install 'hermes-agent[messaging,cron]==0.4.2'

# Layer 2: Agent profile (copy your agents/fleet-coordinator/ or agents/track-c-kubernetes/)
COPY agents/fleet-coordinator /app/profiles/fleet
COPY agents/track-c-kubernetes /app/profiles/track-c

# Layer 3: Skills library (required skills for the agent's domain)
COPY skills/sre-k8s-pod-health /app/skills/sre-k8s-pod-health

# Layer 4: Governance configs (per-track L1-L4 allowlists)
COPY governance /app/governance

# Runtime env
ENV HERMES_PROFILE_DIR=/app/profiles \
    HERMES_SKILLS_DIR=/app/skills \
    HERMES_GOVERNANCE_DIR=/app/governance

ENTRYPOINT ["hermes", "-p", "fleet", "gateway", "run"]

Build with explicit version tags — NEVER :latest. Tag by git sha + semver:

docker build -t your-registry/fleet-coordinator:v0.4.2-$(git rev-parse --short HEAD) .
docker push your-registry/fleet-coordinator:v0.4.2-$(git rev-parse --short HEAD)

Version pinning

Hermes runtime version pinning matters because inter-agent delegation semantics change between releases. The toolset intersection logic you observed in Morgan's Phase 9 fix (why terminal must be in Morgan's platform_toolsets.cli) was introduced in a specific minor version. Pin exact versions in production:

| Artifact | Pin strategy |
|----------|--------------|
| hermes-agent pip package | ==0.4.2 exact version, bump in a dedicated PR |
| Container base image | python:3.12.8-slim exact patch version |
| Agent profile | Git commit SHA tracked in your agent registry |
| Skills | Git commit SHA tracked per skill |
| Governance configs | Git commit SHA, synced together with skills |

Tag all four together: fleet-coordinator:0.4.2-skills-a1b2c3d-gov-e4f5g6h.
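
A tiny helper keeps the composite tag consistent across CI jobs. This is just the naming convention shown above, expressed as code:

```python
def image_tag(runtime_version: str, skills_sha: str, gov_sha: str) -> str:
    """Compose the combined image tag from the three pinned components."""
    return f"{runtime_version}-skills-{skills_sha}-gov-{gov_sha}"
```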

Dependency management

Agents with external dependencies (gh CLI for Path B, kubectl for Track C, aws CLI for Track B) must ship those binaries in the container. Phase 9 uses:

# Additional Layer 5: external CLIs the agent will invoke via terminal toolset
RUN apt-get update && apt-get install -y curl git jq \
    && curl -LO "https://dl.k8s.io/release/v1.32.0/bin/linux/amd64/kubectl" \
    && install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl \
    && curl -LO "https://github.com/cli/cli/releases/download/v2.63.0/gh_2.63.0_linux_amd64.tar.gz" \
    && tar -xzf gh_2.63.0_linux_amd64.tar.gz \
    && mv gh_2.63.0_linux_amd64/bin/gh /usr/local/bin/gh

This is also where you install infrastructure/wrappers/mock-kubectl (Phase 7) if you want wrapper enforcement to travel with the container rather than requiring runtime PATH manipulation.

Pitfall: profile directory naming

Hermes loads profiles by the directory name inside HERMES_PROFILE_DIR. If you copy your agents/fleet-coordinator/ to /app/profiles/fleet/, Hermes launches it as hermes -p fleet. If the directory is named fleet-coordinator (not fleet), the launch command changes to hermes -p fleet-coordinator. The Module 12 lab uses -p fleet — match the convention in your Dockerfile's COPY destination path.


7.2 Deployment

Three deployment patterns are in scope for Hermes agents running in production.

Pattern A — Kubernetes Deployment (long-lived agents)

Long-lived agents like Morgan (webhook-triggered fleet coordinator) run as K8s Deployments. The Phase 9 FLEET-01 chain runs this way in production — a single fleet-coordinator Deployment receives webhook traffic from AlertManager and scales horizontally under alert volume.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fleet-coordinator
  namespace: hermes-agents
spec:
  replicas: 2
  selector:
    matchLabels:
      app: fleet-coordinator
  template:
    metadata:
      labels:
        app: fleet-coordinator
    spec:
      containers:
        - name: fleet-coordinator
          image: your-registry/fleet-coordinator:0.4.2-a1b2c3d
          env:
            - name: HERMES_LAB_GOVERNANCE
              value: "L4"
            - name: HERMES_LAB_TRACK
              value: "track-c"
            - name: TELEGRAM_BOT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: telegram-bot
                  key: token
            - name: TELEGRAM_ALLOWED_USERS
              valueFrom:
                secretKeyRef:
                  name: telegram-bot
                  key: allowed-users
          ports:
            - containerPort: 8644
              name: webhook
          livenessProbe:
            httpGet:
              path: /health
              port: 8644
            periodSeconds: 30
          resources:
            limits:
              memory: "1Gi"
              cpu: "500m"
            requests:
              memory: "512Mi"
              cpu: "250m"

Create the Telegram secret from Phase 8 setup:

kubectl create namespace hermes-agents
kubectl create secret generic telegram-bot \
  --from-literal=token="$TELEGRAM_BOT_TOKEN" \
  --from-literal=allowed-users="$TELEGRAM_ALLOWED_USERS" \
  -n hermes-agents

Pattern B — Kubernetes CronJob (scheduled agents)

Scheduled agents like periodic health-check bots run as K8s CronJobs. This is the Phase 8 TRIG-02 pattern — see infrastructure/scenarios/k8s/cronjob/ for the manifests you built in Module 11.

Key production considerations for scheduled agents:

  • Use restartPolicy: Never and backoffLimit: 2 — failed health checks should not retry forever
  • Mount a secret volume for API tokens (not env vars embedded in the manifest)
  • Set explicit activeDeadlineSeconds to kill runaway agents
  • Use successfulJobsHistoryLimit: 3 and failedJobsHistoryLimit: 5 to prevent audit log bloat

apiVersion: batch/v1
kind: CronJob
metadata:
  name: track-c-health-check
  namespace: hermes-agents
spec:
  schedule: "*/15 * * * *"    # every 15 minutes
  concurrencyPolicy: Forbid   # never run two simultaneously
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      activeDeadlineSeconds: 300
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: track-c-check
              image: your-registry/track-c-kubernetes:0.4.2-a1b2c3d
              # command overrides the image ENTRYPOINT, so the full hermes
              # invocation runs regardless of how the image was built
              command: ["hermes", "-p", "track-c", "run", "--task", "Periodic pod health check — all namespaces"]
              envFrom:
                - secretRef:
                    name: hermes-agent-secrets

Pattern C — GitOps deployment via PR merge

The Phase 9 FLEET-01 Path B pattern (specialist opens PR → human merges → apply.sh syncs) is the same pattern used in production GitOps deployments. ArgoCD, Flux, or plain CI-triggered helm upgrade calls all implement this pattern. Decision table:

| Your setup | Recommended sync mechanism |
|------------|----------------------------|
| Team already runs ArgoCD | ArgoCD Application → auto-sync on merge |
| Team runs Flux | Flux Kustomization → auto-sync on merge |
| Team runs CI-driven deploys | CI pipeline on merge → helm upgrade or kubectl apply |
| Small shop / solo ops | apply.sh equivalent script triggered manually or via webhook |

Phase 9 Path B Sub-path B2 (infrastructure/scenarios/k8s/gitops/apply.sh) models the last row. ArgoCD (Sub-path B1) is a v1.2 course alternative once ArgoCD infrastructure is established.

GitOps repo structure recommendation

For agent-managed changes, keep the GitOps repo structure simple:

hermes-fleet-fixes/
├── README.md                      # from gitops-repo-template/README.md
├── patches/                       # YAML overlays generated by agents
│   └── memory-patch-<sha>.yaml    # each agent run creates a new file
└── applied/                       # move patches here after apply.sh syncs
    └── memory-patch-<sha>.yaml

The applied/ subdirectory provides a lightweight audit trail without a separate CMDB.
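
The patches/ → applied/ flow can be sketched as a small sync loop. This is an illustrative stand-in for the course's apply.sh, not the script itself; the apply callable is injected so the sketch stays cluster-agnostic:

```python
import shutil
from pathlib import Path

def sync_patches(repo: Path, apply_fn) -> list[str]:
    """Apply every pending patch in patches/ and archive it to applied/.

    `apply_fn(path)` is a stand-in for `kubectl apply -f <path>`; it should
    raise on failure so failed patches stay in patches/ for retry.
    """
    patches, applied = repo / "patches", repo / "applied"
    applied.mkdir(exist_ok=True)
    synced = []
    for patch in sorted(patches.glob("*.yaml")):
        apply_fn(patch)                                 # raises on failure -> patch not moved
        shutil.move(str(patch), applied / patch.name)   # lightweight audit trail
        synced.append(patch.name)
    return synced
```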


7.3 Monitoring

Four classes of telemetry matter for production Hermes agents: agent metrics, audit logs, governance events, and business outcome metrics.

Agent metrics (Prometheus-native)

Hermes emits Prometheus-compatible metrics at /metrics on the gateway port. Prometheus scrape configuration:

- job_name: 'hermes-fleet'
  static_configs:
    - targets: ['fleet-coordinator.hermes-agents.svc:8644']
  metrics_path: /metrics
  scrape_interval: 15s

Key metrics to alert on:

| Metric | Alert condition | What it means |
|--------|-----------------|---------------|
| hermes_agent_run_duration_seconds | p99 > 120s | Runaway agent — likely delegation loop |
| hermes_delegate_task_errors_total | rate > 0.05/min | Delegation failures (toolset intersection, allowlist rejection) |
| hermes_webhook_events_received_total | rate > 20/min | Alert storm — consider circuit breaker |
| hermes_approval_pending_total | count > 10 | Human approval queue backing up |
| hermes_governance_denied_total | rate > 0.1/5m | Agent attempting out-of-allowlist commands |

Audit logs — Phase 7 governance event stream

The infrastructure/wrappers/mock-kubectl wrapper (Phase 7) emits an audit event for every intercepted command. In production, pipe these to a central log aggregator via a fluent-bit sidecar or the wrapper's JSON output mode.

Cross-reference: governance/governance-L4-track-c.yaml defines the allowlist. Every allowlist hit and miss becomes an audit event. A blocked command produces:

{
  "ts": "2026-04-07T14:32:01Z",
  "wrapper": "mock-kubectl",
  "command": "kubectl delete deployment crasher",
  "governance_level": "L4",
  "track": "track-c",
  "allowlist_hit": false,
  "decision": "BLOCKED",
  "caller_profile": "track-c",
  "caller_agent_run_id": "run-abc123"
}

Send these to your SIEM or compliance tooling. Audit logs are the "black box recorder" for agent actions — critical for postmortems after automated remediation incidents.
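
Pulling blocked commands out of the audit stream takes only a few lines. A sketch over newline-delimited JSON events shaped like the example above:

```python
import json

def blocked_commands(audit_lines):
    """Yield (caller_profile, command) for every BLOCKED audit event."""
    for line in audit_lines:
        event = json.loads(line)
        if event.get("decision") == "BLOCKED":
            yield event["caller_profile"], event["command"]
```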

Governance event stream

Beyond raw audit logs, aggregate governance events by (agent_profile, track, decision) to produce operational dashboards:

| Panel | PromQL aggregation |
|-------|--------------------|
| Denied command rate | sum by (profile) (rate(hermes_governance_denied_total[5m])) |
| L4 escalation rate | sum by (profile) (rate(hermes_governance_escalated_total[5m])) |
| Apply volume | sum(rate(hermes_kubectl_apply_total[1h])) |
| Mean approval latency | histogram_quantile(0.5, rate(hermes_approval_duration_seconds_bucket[5m])) |

Alert when denied rate spikes — it usually means an agent is attempting commands outside its allowlist, which indicates either a misconfigured profile or unexpected agent behavior.

Phase 8 AlertManager integration for self-monitoring

Cross-reference: infrastructure/scenarios/k8s/alertmanager/prometheus-rules.yaml — the same Prometheus stack that fires alerts INTO Morgan can also fire alerts ON Morgan. Self-monitoring PrometheusRules belong next to the cluster monitoring rules:

# Add to infrastructure/scenarios/k8s/alertmanager/prometheus-rules.yaml
groups:
  - name: hermes-agent-health
    rules:
      - alert: FleetCoordinatorHighErrorRate
        expr: rate(hermes_delegate_task_errors_total{profile="fleet"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
          release: monitoring  # must match your Helm release name (verify: kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.ruleSelector}')
        annotations:
          summary: "Morgan delegation error rate > 10% over 5m"
          description: "Delegation errors may indicate toolset intersection failures or allowlist misconfig"

      - alert: HermesPendingApprovalQueueDepth
        expr: hermes_approval_pending_total > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Human approval queue depth > 10 for 10m"
          description: "Check Telegram bot connectivity and admin allowlist configuration"

Note the release: monitoring label — this must match the Helm release name (the same requirement you discovered in Phase 8). The course setup uses monitoring as the release name; adjust if your release is named differently.

Structured logging for delegation traces

Enable structured logging in the gateway for delegation trace reconstruction in postmortems:

# In agents/fleet-coordinator/config.yaml
logging:
  level: INFO
  format: json   # structured output for log aggregators
  fields:
    - agent_run_id
    - delegation_depth
    - governance_level
    - profile_name

JSON logs make it possible to reconstruct the full delegation trace per agent_run_id across all tool calls, even when Morgan delegates across multiple in-process child agents.
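
Given JSON logs carrying the fields above, trace reconstruction is a group-by on agent_run_id. A sketch:

```python
import json
from collections import defaultdict

def delegation_traces(log_lines):
    """Group structured log records by agent_run_id, preserving order of
    appearance, so a postmortem can replay one run's tool calls end to end."""
    traces = defaultdict(list)
    for line in log_lines:
        record = json.loads(line)
        run_id = record.get("agent_run_id")
        if run_id:
            traces[run_id].append(record)
    return dict(traces)
```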


7.4 Scaling

Production agent scaling has four axes: horizontal replica scaling, queue-based trigger rate limiting, multi-tenancy isolation, and Sandbox CRD isolation.

Horizontal replica scaling

Morgan (webhook receiver) scales horizontally under alert volume. The webhook gateway is stateless between requests — each webhook fires a fresh agent run. Use a K8s HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fleet-coordinator-hpa
  namespace: hermes-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fleet-coordinator
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: hermes_webhook_events_received_rate
        target:
          type: AverageValue
          averageValue: "5"   # scale up if > 5 events/min/pod

Track C specialists are spawned in-process as children of a parent Morgan agent run. Scaling specialists horizontally requires scaling the parent Morgan deployment. If Morgan runs 5 replicas, you have up to 5 concurrent Track C children — each in its own Morgan process.

Queue-based vs trigger-based

Two patterns for agent execution at scale:

| Pattern | When to use | Example in this course |
|---------|-------------|------------------------|
| Trigger-based (webhook → agent run) | Bursty, low-volume, latency-sensitive | Morgan FLEET-01 (Phase 9) |
| Queue-based (SQS/Kafka → worker pool) | High-volume, latency-tolerant, needs retry | Not in v1.1 — v1.2 candidate |

Phase 9 uses trigger-based for FLEET-01. If your alert volume exceeds ~10 alerts/second, switch to queue-based: drop the webhook gateway in front of a queue, run a pool of agent workers pulling from the queue, with retry + dead-letter handling.
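
The queue-based variant can be sketched with an in-memory queue standing in for SQS or Kafka. Names and the retry policy are illustrative, not Hermes APIs:

```python
import queue

def run_worker(alerts, handle, max_attempts=3):
    """Drain the alert queue, retrying each alert up to max_attempts before
    routing it to the dead-letter list. Returns (handled, dead_letter)."""
    handled, dead_letter = [], []
    while True:
        try:
            alert = alerts.get_nowait()
        except queue.Empty:
            return handled, dead_letter
        for attempt in range(1, max_attempts + 1):
            try:
                handle(alert)   # stand-in for a full agent run
                handled.append(alert)
                break
            except Exception:
                if attempt == max_attempts:
                    dead_letter.append(alert)   # park for human review
```

In production the dead-letter list would be a real DLQ, and the worker loop would run as a pool of replicas pulling from the shared queue.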

The Phase 8 K8s CronJob pattern (TRIG-02) is a time-triggered variant — not queue-based, but it decouples the trigger (time) from the agent execution via the K8s scheduler.

Multi-tenant isolation

Production fleets often serve multiple teams. Two isolation models:

Model 1 — Profile-per-team: Each team gets its own fleet profile (fleet-teamA, fleet-teamB, ...) with team-specific allowlists and skill libraries. Shared Hermes runtime and container image.

# fleet-teamA/config.yaml
platform_toolsets:
  cli: [terminal, web, skills]
delegation:
  max_iterations: 15   # tighter limit for team A's use cases
# ...

Model 2 — Deployment-per-team: Each team gets its own Hermes deployment and namespace. Stronger isolation at cost of operational overhead.

kubectl create namespace hermes-agents-teamA
kubectl create namespace hermes-agents-teamB
# Deploy fleet-coordinator per namespace with team-specific secrets

Small shops use Model 1. Regulated environments or teams with conflicting governance requirements use Model 2.

Network policies for agent isolation

Even within a shared deployment, apply Kubernetes NetworkPolicy to prevent agents from accessing resources they should not:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fleet-coordinator-netpol
  namespace: hermes-agents
spec:
  podSelector:
    matchLabels:
      app: fleet-coordinator
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring   # can reach AlertManager
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: k8s-trouble-crashloop   # can apply patches here
    - ports:
        - port: 53    # DNS
        - port: 443   # HTTPS (Telegram API, GitHub API)

K8s Agent Sandbox CRD isolation (alpha — exploratory)

The K8s Agent Sandbox project (kubernetes-sigs/agent-sandbox) is an emerging approach to per-agent isolation using CRDs. It is currently alpha (v0.2.1 as of 2026-04-07) and Phase 9 ships it only as an exploratory project — see course-site/docs/module-12-fleet/exploratory/PROJECTS.mdx Project 3.

The Sandbox CRD model provides:

  • Namespace-scoped agent deployment with lifecycle management
  • Resource quota enforcement per agent instance
  • Network policy isolation via SandboxTemplate spec
  • Automatic cleanup via ttl field on Sandbox objects

When Sandbox reaches beta/stable (v1.2+ course candidate), it becomes the recommended isolation mechanism for multi-tenant agent fleets. Until then, treat it as forward-looking architecture.

Scaling decision checklist

Before adding replicas or switching to queue-based, ask:

  1. Is the bottleneck latency (trigger → agent response) or throughput (agents/second)?

    • Latency → reduce model context window, simplify SOUL.md, remove unused skills
    • Throughput → add replicas (Pattern A HPA) or queue-based worker pool
  2. Are agents failing because of governance rejections (over-restrictive allowlist) or timeout (agent takes too long)?

    • Governance rejections → review allowlist configuration
    • Timeout → increase approvals.timeout in config.yaml or reduce task scope
  3. Is the queue of human approvals backing up?

    • If hermes_approval_pending_total stays high, the bottleneck is human review latency, not agent throughput. Adding replicas does not help.

7.5 Production Decision Table

When you are ready to take a Phase 6-9 agent to production, use this table to choose the right deployment strategy:

| Your constraints | Packaging | Deployment | Monitoring | Scaling |
|------------------|-----------|------------|------------|---------|
| Small team, single cluster, bursty alerts | Single container image, pinned versions | K8s Deployment, replicas: 2 | Prometheus scrape + audit logs to stdout | HPA on webhook event rate |
| Regulated environment, audit mandate | Container + image signing (cosign), SBOM | K8s Deployment + GitOps sync only (no direct apply) | SIEM integration, all audit logs persisted off-cluster | Manual scaling, change-control gates before replica changes |
| Multi-team, shared infra | Multi-profile container, team-specific configs via ConfigMap | Namespace-per-team, shared runtime | Per-team dashboards, team-scoped PrometheusRules | HPA per team namespace, resource quotas |
| High-volume scheduled checks | CronJob-optimized image (no gateway runtime) | K8s CronJob, multiple schedules | CronJob success/failure metrics, time-to-report SLA | More concurrent jobs, adjusted backoffLimit |
| Experimental / learning | Minimal image, Hermes from GitHub source | K8s Deployment, single replica | Local log inspection only | No autoscaling, manual scale-to-zero |
| Production + ArgoCD already deployed | Container + GitOps repo | GitOps sync via ArgoCD Application (Sub-path B1) | ArgoCD sync status as alert surface | ArgoCD-managed replica count |

Rule of thumb: Start with the smallest deployment option that satisfies your audit requirements. Add complexity only when a specific scaling or isolation requirement drives it.


7.6 Cross-References

The following artifacts from earlier in the course map directly to the productionization topics in this section:

| Artifact | Phase | Relevant section |
|----------|-------|------------------|
| infrastructure/wrappers/mock-kubectl | Phase 7 | 7.3 Monitoring — governance audit events |
| governance/governance-L4-track-c.yaml | Phase 7 | 7.3 Monitoring — allowlist definition |
| infrastructure/scenarios/k8s/alertmanager/prometheus-rules.yaml | Phase 8 | 7.3 Monitoring — self-monitoring rules |
| infrastructure/scenarios/k8s/cronjob/ | Phase 8 | 7.2 Deployment Pattern B — CronJob manifests |
| infrastructure/scenarios/k8s/gitops/apply.sh | Phase 9 Plan 01 | 7.2 Deployment Pattern C — GitOps sync |
| agents/fleet-coordinator/config.yaml | Phase 9 Plan 01 | 7.1 Packaging — platform_toolsets.cli |
| agents/fleet-coordinator/SOUL.md | Phase 9 Plan 01 | 7.1 Packaging — behavioral spec travels with container |
| course-site/docs/module-12-fleet/exploratory/PROJECTS.mdx Project 3 | Phase 9 | 7.4 Scaling — Sandbox CRD exploratory |
| Phase 9 Module 12 Lab | Phase 9 | Full live FLEET-01 walkthrough for all sections |

End of Section 7. The rest of the reading above covers fleet patterns and coordinator templates that are the architectural foundation. Section 7 builds on that foundation to answer "how do I run this in production." Module 13 reference covers governance depth; Module 11 reference covers trigger patterns. This section covers everything that sits between those and the production cluster.