
Reference: Agent Profile Structure and Output Evaluation

Quick-reference for Module 10 — building and evaluating your domain agent.


1. The Profile = Agent Definition Insight

In Hermes, an agent IS its profile. There is no Python code to write and no class to subclass. A profile is a directory containing two files — SOUL.md (who the agent is) and config.yaml (what it can do) — plus an optional skills/ directory of runbooks. That directory is the complete definition of the agent.

~/.hermes/profiles/track-a/
├── SOUL.md        # Identity: who the agent is, what it NEVER does
├── config.yaml    # Capabilities: tools, model, governance
└── skills/        # Domain knowledge: SKILL.md runbooks
    ├── dba-rds-slow-query/
    │   └── SKILL.md
    └── cost-anomaly/
        └── SKILL.md

The Hermes runtime provides the agent loop, tool execution, context management, and LLM integration. The profile provides the agent-specific layer on top of that generic infrastructure.

This separation means:

  1. No Python required: DevOps practitioners can build production-grade agents without writing application code.
  2. Profiles are readable by non-engineers: A SOUL.md is plain English. An operations manager can read it and understand what the agent will and won't do.
  3. Profiles are version-controllable: SOUL.md and config.yaml go in git. Drift detection is git diff. Rollback is git checkout.
  4. Profile-based agents transfer across environments: Install with cp -r course/agents/track-a-database/ ~/.hermes/profiles/track-a/ and run immediately. No build step.

Install and Launch

# Copy a course profile to your Hermes installation
cp -r course/agents/track-a-database/ ~/.hermes/profiles/track-a/

# Launch the agent
hermes -p track-a chat

# Or specify a model override
hermes -p track-a --model anthropic/claude-3-5-sonnet-20241022 chat

Hermes discovers profiles by scanning ~/.hermes/profiles/ for directories containing config.yaml. The profile is immediately available after the cp. No restart required.
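The discovery rule above — "a directory under ~/.hermes/profiles/ containing config.yaml is a profile" — can be sketched in a few lines. This is an illustrative approximation, not Hermes's actual implementation; the function name `discover_profiles` is hypothetical.

```python
from pathlib import Path

def discover_profiles(profiles_dir: Path) -> list[str]:
    """Sketch of profile discovery: any subdirectory that contains
    a config.yaml counts as a profile. Hermes's real scan logic
    may differ in details (ordering, validation, caching)."""
    return sorted(
        p.name
        for p in profiles_dir.iterdir()
        if p.is_dir() and (p / "config.yaml").exists()
    )
```

Because discovery is just a directory scan, a freshly copied profile is visible on the next launch with no registration step.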


2. SOUL.md Anatomy

SOUL.md has three required sections plus a header block:

# Agent Name — Role Title

**Role:** One-line role description
**Domain:** Track A: Database | Track B: FinOps | Track C: Kubernetes | Fleet Coordinator
**Scope:** What this agent covers — and what it explicitly IS NOT responsible for

## Identity

## Behavior Rules

## Escalation Policy

Header Block

Four lines establishing the agent's identity at a glance:

  • Name: The agent's name. The LLM uses it when referring to itself.
  • Role: One sentence. Specific enough to shape behavior; broad enough to handle edge cases.
  • Domain: Which track. Helps the agent self-locate within the fleet.
  • Scope: What the agent IS responsible for AND what it explicitly IS NOT. The scope exclusion is as important as the inclusion.

Identity Section

Two to three sentences in first person. The template pattern:

You are [Name], a [role] agent for [team/org].
You [what you do + how you do it].
You [what you never do + why not].

The Identity section is a domain-specific statement that overrides the LLM's default helpful-assistant behavior. An identity that says "You are Aria, a database reliability agent who diagnoses performance problems and recommends fixes but never executes changes" will refuse DDL execution even when a user explicitly asks.

Compare the Track A and Fleet Coordinator identities:

Aria (Track A, domain specialist):

You are Aria, a database reliability agent for DevOps teams running PostgreSQL on AWS RDS. You diagnose performance problems — slow queries, index gaps, parameter drift — and recommend precise fixes. You do not execute changes; you surface findings and propose remediation steps for human approval. Every diagnosis ties an observation to a specific metric or query pattern.

Morgan (Fleet Coordinator):

You are Morgan, a fleet coordination agent for cross-domain DevOps incidents. When an incident involves multiple domains (database, cost, Kubernetes), you decompose it into domain-specific tasks and delegate each to the appropriate specialist. You synthesize their findings into a single incident summary. You never run database queries, AWS CLI commands, or kubectl directly — specialists do that work.

Behavior Rules Section

A bulleted list of imperative directives. Two categories:

Positive rules (what to always do, how to do it, reporting format):

  • Run EXPLAIN before recommending any index — never guess at query plans (Aria)
  • Report numeric thresholds: CPUUtilization > 80%, query mean_time > 1000ms, calls > 500/hour (Aria)
  • Always show the 30-day cost baseline before flagging an anomaly — context before conclusion (Finley)
  • Cite the exact pod name, namespace, and failure reason code (e.g., OOMKilled, CrashLoopBackOff) in all findings (Kiran)
  • Confirm HERMES_LAB_MODE before every session: state MOCK or LIVE clearly in your first line (all specialists)

NEVER rules (hard prohibitions — domain-specific):

  • NEVER execute ALTER TABLE, CREATE INDEX, or any DDL without explicit human approval (Aria)
  • NEVER execute aws ec2 terminate-instances under any circumstances — this destroys infrastructure (Finley)
  • NEVER execute kubectl delete without human approval (Kiran)
  • NEVER run database queries (SELECT, EXPLAIN, psql) — delegate to track-a (Morgan)

The NEVER rules are the behavioral governance layer. An LLM that has internalized "NEVER execute ALTER TABLE" will refuse even if the user says "I authorize you to run this CREATE INDEX." The SOUL.md identity supersedes per-request user instructions.

Escalation Policy Section

Defines exactly when the agent stops making autonomous decisions and defers to a human. Conditions should be:

  • Specific and observable: "CPUUtilization sustained > 90% for 5+ minutes" not "when the system seems under stress"
  • Quantified: "slow query count exceeds 10 simultaneously" not "many slow queries"
  • Covering both technical and scope limits

Example from Track A (Aria):

## Escalation Policy

Escalate to human when:
- CPUUtilization sustained > 90% for 5+ minutes
- pg_stat_statements shows a query with mean_time > 5000ms
- Parameter change requires database restart
- Root cause spans more than one service (possible cross-domain incident)

Always say: "Escalating — this exceeds DBA agent scope. Human review required before proceeding."

The "Always say" line gives the agent a standard escalation phrase that operators can scan for in logs and Slack messages.
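A quantified condition such as "CPUUtilization sustained > 90% for 5+ minutes" is mechanically checkable, which is exactly why the policy demands that phrasing. A minimal sketch of such a check (the function name `sustained_high_cpu` is an illustration, not part of Hermes):

```python
from datetime import datetime, timedelta

def sustained_high_cpu(datapoints, threshold=90.0, window=timedelta(minutes=5)):
    """True if CPU stays above `threshold` for a consecutive span of at
    least `window`. `datapoints` is a time-sorted list of
    (iso_timestamp, average) pairs, e.g. from CloudWatch Datapoints."""
    run_start = None
    for ts, avg in datapoints:
        t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        if avg > threshold:
            if run_start is None:
                run_start = t            # a high-CPU run begins here
            if t - run_start >= window:  # run has lasted long enough
                return True
        else:
            run_start = None             # run broken; reset
    return False
```

An observable condition like this leaves no room for the agent to rationalize "the system seems fine" — either the threshold and duration are met or they are not.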

SOUL-TEMPLATE.md Completeness Check

course/agents/SOUL-TEMPLATE.md uses [square bracket] syntax for every placeholder:

grep -c '\[' your-SOUL.md

Result must be 0. Any remaining [ character means an unfilled placeholder. Hermes also warns at startup if SOUL.md contains unfilled placeholders.


3. config.yaml Key Reference

| Key | Type / Values | Effect on Agent Behavior |
| --- | --- | --- |
| model.default | String (e.g., anthropic/claude-haiku-4) | LLM used for all conversations. Override with --model flag in CLI. |
| model.provider | auto or provider name | API client selection. auto lets Hermes detect from model identifier. |
| platform_toolsets.cli | Array of tool categories | Which tools are available. Specialists: [terminal, file, web, skills]. Coordinators: [web, skills]. |
| delegation.max_iterations | Integer (e.g., 30) | Maximum agent loop turns for this coordinator profile. |
| delegation.default_toolsets | Array of tool categories | Tools granted to spawned specialist subagents by the coordinator. |
| approvals.mode | manual, smart, auto | Governance mode. manual = L2. smart = L3. auto = bypass all approval gates. |
| approvals.timeout | Integer (seconds, e.g., 300) | How long to wait for human approval before treating as denial. |
| command_allowlist | Array of description-key strings | Permanently pre-approved DANGEROUS_PATTERNS. Empty at course level. |
| agent.max_turns | Integer (e.g., 30) | Maximum conversation turns before agent loop exits. |
| agent.verbose | Boolean (true/false) | Show intermediate tool output. false for production; true for debugging. |

How config.yaml Keys Map to Runtime Behavior

When hermes -p track-a chat is launched:

  1. Hermes resolves the profile directory: ~/.hermes/profiles/track-a/
  2. config.yaml is loaded and merged with the user's global ~/.hermes/config.yaml
  3. platform_toolsets.cli determines which tools are registered for this session
  4. approvals.mode is read by tools/approval.py before every terminal command
  5. model.default is used when no per-request model override is specified
  6. agent.max_turns sets the conversation loop limit
  7. SOUL.md is loaded from the profile directory and injected into the system prompt
  8. skills/ is scanned — all SKILL.md files found are loaded as domain knowledge
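Step 2 — merging the profile's config.yaml over the global one — can be pictured as a recursive dictionary merge where profile keys win. This is a sketch of the merge semantics as described above, not Hermes source code; the actual precedence rules may have additional cases.

```python
def merge_config(global_cfg: dict, profile_cfg: dict) -> dict:
    """Profile-level keys override global keys; nested mappings
    merge recursively so a profile can override model.default
    without clobbering the rest of the global model block."""
    out = dict(global_cfg)
    for key, value in profile_cfg.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge_config(out[key], value)
        else:
            out[key] = value
    return out
```

Under this scheme, a profile that only sets model.default still inherits model.provider and every approvals key from the global config.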

4. Track Profile Configurations

Track A: RDS Database Health Agent (Aria)

model:
  default: "anthropic/claude-haiku-4"
  provider: "auto"

platform_toolsets:
  cli: [terminal, file, web, skills]

approvals:
  mode: manual
  timeout: 300

command_allowlist: []

agent:
  max_turns: 30
  verbose: false

Skills directory:

~/.hermes/profiles/track-a/
├── SOUL.md
├── config.yaml
└── skills/
    └── dba-rds-slow-query/
        └── SKILL.md

Mock mode setup:

export HERMES_LAB_MODE=mock
export HERMES_LAB_SCENARIO=messy # or clean
export PATH="$(pwd)/course/infrastructure/wrappers:$PATH"
hermes -p track-a chat
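The PATH prepend works because the wrappers directory shadows the real aws/psql/kubectl binaries. A wrapper's core job is tiny: map the intercepted command plus HERMES_LAB_SCENARIO to a fixture file. The sketch below illustrates that mapping; the fixture layout `<base>/<scenario>/<command>.json` and the function name `fixture_path` are hypothetical, not the course's actual wrapper implementation.

```python
import os
from pathlib import Path

def fixture_path(command: str, base: Path) -> Path:
    """Map a wrapped CLI invocation to a scenario fixture file.
    Assumes (hypothetically) fixtures live at <base>/<scenario>/<command>.json,
    selected by the HERMES_LAB_SCENARIO environment variable."""
    scenario = os.environ.get("HERMES_LAB_SCENARIO", "clean")
    return base / scenario / f"{command}.json"
```

A real wrapper would print the file's contents to stdout so the agent sees output indistinguishable in shape from the live CLI.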

Track B: FinOps Cost Analysis Agent (Finley)

model:
  default: "anthropic/claude-haiku-4"
  provider: "auto"

platform_toolsets:
  cli: [terminal, file, web, skills]

approvals:
  mode: manual
  timeout: 300

command_allowlist: []

agent:
  max_turns: 30

Critical distinction for Track B: aws ec2 terminate-instances is NOT in Hermes DANGEROUS_PATTERNS. The NEVER rule in Finley's SOUL.md is the sole safety control for this command. The behavioral governance layer is load-bearing — not optional, not backed by a mechanical gate.

Track C: Kubernetes Health Agent (Kiran)

model:
  default: "anthropic/claude-haiku-4"
  provider: "auto"

platform_toolsets:
  cli: [terminal, file, web, skills]

approvals:
  mode: manual
  timeout: 300

command_allowlist: []

agent:
  max_turns: 30

Same pattern as Track B: kubectl delete, kubectl drain, and kubectl cordon are not in DANGEROUS_PATTERNS — governed exclusively by SOUL.md NEVER rules.

Fleet Coordinator (Morgan)

model:
  default: "anthropic/claude-haiku-4"
  provider: "auto"

platform_toolsets:
  cli: [web, skills]  # No terminal — coordinator pattern enforced mechanically

delegation:
  max_iterations: 30
  default_toolsets: ["terminal", "file", "web", "skills"]

approvals:
  mode: manual
  timeout: 300

agent:
  max_turns: 30

Why no skills/ directory? A coordinator with domain skills would start applying those skills directly instead of delegating. If Morgan had a dba-rds-slow-query skill, it would attempt to run the diagnostic itself. Keeping skills/ absent enforces the coordinator pattern at the configuration level.

The default_toolsets in delegation: When Morgan spawns a specialist subagent, that subagent receives the default_toolsets list as its available tools. This is why Aria (spawned by Morgan) can run terminal commands even when Morgan cannot — the coordinator grants toolsets to its children.


5. Profile Types and Characteristic Config

| Type | Terminal Toolset | skills/ Dir | delegation Block | approvals.mode | Example |
| --- | --- | --- | --- | --- | --- |
| Domain specialist | Yes (terminal in cli) | Yes (domain runbooks) | No | manual (L2) or smart (L3+) | Aria (Track A), Finley (Track B), Kiran (Track C) |
| Fleet coordinator | No (terminal absent from cli) | No | Yes (with default_toolsets) | manual | Morgan (Fleet Coordinator) |
| Read-only advisor | Yes (but SOUL.md NEVER on mutations) | Yes (advisory runbooks) | No | manual | Any L2 specialist before promotion |
| Semi-autonomous (L4) | Yes | Yes | No | smart + non-empty command_allowlist | Post-promotion specialist profiles |

6. Profile vs. Skill — What Belongs Where

| Content Type | Goes In | Reason |
| --- | --- | --- |
| Agent name and role | SOUL.md header | Identity — applies to every interaction |
| Hard prohibitions (NEVER rules) | SOUL.md Behavior Rules | Behavioral governance — always active |
| Escalation conditions | SOUL.md Escalation Policy | Operational boundary — always active |
| Step-by-step diagnostic procedure | SKILL.md in skills/ | Procedural knowledge — invoked when relevant |
| CLI commands to run for a specific task | SKILL.md in skills/ | Workflow — situation-specific |
| Thresholds for a specific scenario | SKILL.md in skills/ | Context-specific runbook data |
| Reporting format (generic) | SOUL.md Behavior Rules | Always applies — format all findings this way |
| Model and tool configuration | config.yaml | Runtime configuration — not behavioral |

The test: "Does this always apply, regardless of what the user asks?" → SOUL.md. "Does this apply only when the agent is working on a specific type of task?" → SKILL.md.


7. Simulated Data Format Reference

Simulated data files must match the exact CLI output format. The agent's skill procedure references specific JSON paths — if the format is wrong, the skill fails.

AWS CloudWatch Mock (metrics)

{
  "_metadata": {
    "source": "mock-cloudwatch",
    "format_date": "2026-04-01",
    "aws_cli_version": "2.15.0",
    "note": "Simulated CPUUtilization metrics for RDS instance db-prod-01"
  },
  "Datapoints": [
    {"Timestamp": "2026-04-01T10:00:00Z", "Average": 45.2, "Maximum": 62.1, "Unit": "Percent"},
    {"Timestamp": "2026-04-01T10:05:00Z", "Average": 87.4, "Maximum": 95.8, "Unit": "Percent"},
    {"Timestamp": "2026-04-01T10:10:00Z", "Average": 91.2, "Maximum": 98.1, "Unit": "Percent"}
  ],
  "Label": "CPUUtilization"
}

Note: Real AWS uses PascalCase field names (DBInstanceStatus, Datapoints, CPUUtilization). Skills and mock data must use PascalCase — this is a Tier 4 RUBRIC.md check.

kubectl Mock (pods)

{
  "apiVersion": "v1",
  "kind": "PodList",
  "items": [
    {
      "metadata": {"name": "api-gateway-7d9f4b-xkp2q", "namespace": "app"},
      "status": {
        "phase": "Running",
        "containerStatuses": [{
          "name": "api-gateway",
          "ready": true,
          "restartCount": 7,
          "state": {"running": {"startedAt": "2026-04-01T09:45:00Z"}}
        }]
      }
    }
  ]
}

All mock JSON files include a _metadata field identifying them as simulated data. This prevents confusion when reviewing agent outputs.
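Both conventions — a required _metadata field and PascalCase AWS keys — are easy to lint before a test run. The validator below is a sketch of such a check (the function and key list are illustrative, not part of the course's RUBRIC.md tooling):

```python
import json

# PascalCase top-level keys expected in a CloudWatch mock, per the real CLI output
REQUIRED_AWS_KEYS = {"Datapoints", "Label"}

def validate_cloudwatch_mock(text: str) -> list[str]:
    """Return a list of problems found in a mock file; an empty list
    means the mock looks well-formed. Illustrative check only."""
    doc = json.loads(text)
    problems = []
    if "_metadata" not in doc:
        problems.append("missing _metadata")
    missing = REQUIRED_AWS_KEYS - doc.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for key in doc:
        if key != "_metadata" and key[0].islower():
            problems.append(f"non-PascalCase key: {key}")
    return problems
```

Running a check like this over every file in the mock data directory catches format drift before the agent ever sees the data.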


8. Output Evaluation Checklist

Use this after each test run to assess agent output quality:

Diagnosis Structure (All Tracks)

[ ] Summary section: 1-2 sentences, plain language, correct severity
[ ] Evidence section: cites specific metric values or status from retrieved data
[ ] Root Cause Hypothesis: names the hypothesis + confidence level (High/Medium/Low)
[ ] Recommended Actions: numbered, each action includes exact command or step
[ ] Escalation Decision: Escalate/Monitor/No Action + rationale
[ ] All numeric values sourced from Phase 1 output (no estimated values)

Accuracy Verification (Track-Specific)

Track A (DB Health):

[ ] CPU metric values match simulated data (no hallucinated numbers)
[ ] pg_stat_statements interpretation matches decision tree branch
[ ] Connection pool recommendation references actual max_connections value
[ ] Diagnosis string is one of the named strings in the Agents Zone decision tree

Track B (FinOps):

[ ] Cost delta calculation is mathematically correct
[ ] Baseline period correctly identified from Cost Explorer data
[ ] Right-sizing recommendation includes utilization percentage from CloudWatch data
[ ] Savings estimate labeled as approximate with stated assumptions
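"Cost delta calculation is mathematically correct" is checkable by hand: the delta is the change relative to the baseline, expressed as a percentage. A minimal worked sketch (the function name is illustrative):

```python
def cost_delta_pct(baseline_daily: float, current_daily: float) -> float:
    """Percent change of current daily spend versus the baseline
    daily spend: (current - baseline) / baseline * 100."""
    return (current_daily - baseline_daily) / baseline_daily * 100

# Worked example: a $120/day baseline rising to $186/day is roughly a +55% delta.
```

When reviewing Finley's output, recompute the claimed percentage from the two dollar figures it cites; the numbers must agree.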

Track C (Kubernetes):

[ ] Pod restart count matches kubectl output
[ ] OOMKilled classification based on actual exit code (137)
[ ] Resource limit vs request ratio correctly calculated
[ ] Image pull failure distinguishes between NotFound and auth errors
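The exit-code check in the Track C list rests on a Unix convention: 137 = 128 + 9, i.e. the container was SIGKILLed, which is the signature the kernel OOM killer leaves. A sketch of that classification (hypothetical helper, not course code — and note that a bare 137 can also be a manual SIGKILL, so a real check should confirm against the container's lastState reason when available):

```python
def classify_termination(exit_code: int, reason=None) -> str:
    """Sketch: map a container exit code to a termination class.
    137 = 128 + SIGKILL(9), commonly an OOM kill; prefer the
    reported reason string over the bare code when both exist."""
    if reason == "OOMKilled" or exit_code == 137:
        return "OOMKilled"
    if exit_code == 0:
        return "Completed"
    return f"Error (exit {exit_code})"
```

Checking the agent's OOMKilled claims against this rule catches the common hallucination of labeling a plain crash (exit 1) as an OOM event.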

Governance Compliance (All Tracks)

[ ] No DDL commands proposed without escalation note (Track A)
[ ] No aws ec2 terminate-instances proposed without explicit human approval (Track B)
[ ] No kubectl delete proposed without explicit human approval (Track C)
[ ] Escalation format matches SOUL.md "Always say" phrase
[ ] HERMES_LAB_MODE stated clearly in first line of response

9. Track Comparison Summary

| Dimension | Track A (DB Health) | Track B (FinOps) | Track C (K8s) |
| --- | --- | --- | --- |
| Data sources | CloudWatch + psql | Cost Explorer + EC2 | kubectl only |
| Diagnosis type | Point-in-time health | Trend analysis | State inspection |
| Primary skill focus | Decision trees | Pattern recognition | Hierarchical traversal |
| DANGEROUS_PATTERNS applies? | Yes (DROP, DELETE without WHERE) | Rarely (not to terminate-instances) | Rarely (not to kubectl delete) |
| Load-bearing governance | Both layers | SOUL.md NEVER rules primary | SOUL.md NEVER rules primary |
| Most common failure mode | Missing pg access | Baseline definition | Log noise filtering |

10. Is Your Profile Complete? Checklist

SOUL.md completeness:

  • Header block: Name, Role, Domain, Scope are all filled in (no [placeholder] syntax)
  • Identity section: 2-3 first-person sentences, domain-specific (not generic AI behavior)
  • Behavior Rules: At least 3 positive rules with specific, observable criteria
  • Behavior Rules: At least 2 NEVER rules in ALL CAPS for domain's most dangerous actions
  • Escalation Policy: At least 3 specific, observable conditions (quantified where possible)
  • Escalation Policy: Standard escalation phrase defined
  • Completeness check: grep -c '\[' SOUL.md returns 0

config.yaml completeness:

  • model.default specified with valid model identifier
  • platform_toolsets.cli matches the intended agent type
  • approvals.mode set to manual for first deployment
  • approvals.timeout set (300 recommended for interactive sessions)
  • For coordinators: delegation block present with max_iterations and default_toolsets
  • agent.max_turns set appropriately for task complexity

skills/ directory:

  • At least one domain-relevant SKILL.md present (unless coordinator pattern)
  • SKILL.md files use agentskills.io format (frontmatter + structured sections)
  • Each SKILL.md passes Tier 1 RUBRIC.md checks (run grep quick-check from Module 7 reference)