Concepts: Agent Skills, Memory, and RAG
This module's lab has you writing a SKILL.md — a machine-readable runbook your agent can execute. Here's why that format matters, along with the knowledge-retrieval concepts behind it. Understanding these concepts will make you a better skill author and help you design agents that stay reliable under operational pressure.
1. The Runbook Reliability Problem
Consider an AI agent given your wiki runbook as plain text. It reads it correctly. It understands the intent. And then on step 7, where the runbook says "check the usual CloudWatch metrics," the agent checks... what it thinks is usual. Maybe that matches your intent. Maybe it doesn't.
This is the ambiguity gap — the space between what your runbook implies and what the agent executes. Experienced engineers fill this gap with institutional knowledge. AI agents fill it with prediction.
DevOps analogy: Think of the difference between documentation for a human operator and a configuration file for a service. A human reads "configure with appropriate timeouts" and applies judgment. A service needs timeout: 30s. SKILL.md is the timeout: 30s version of your runbook.
The specific problems with prose runbooks for agents:
- Ambiguous conditions — "if the service is slow" requires defining slow (threshold, metric, window)
- Implied steps — "verify the deployment" requires specifying which verification commands, in which order, with what success criteria
- Missing escalation specifics — "escalate if needed" requires naming who, via which channel, with what information
SKILL.md resolves all three by making conditions, steps, and escalation explicit.
What Skills Encode
A SKILL.md file encodes five things that an ad-hoc prompt cannot:
- When to activate. Specific, observable trigger conditions — not "when the database is slow" but "when CloudWatch alarm `rds-cpu-high` fires on `$RDS_INSTANCE_ID`."
- What data to gather. Exact CLI commands with exact expected output. Not "check the metrics" but `aws rds describe-db-instances --db-instance-identifier $RDS_INSTANCE_ID --query 'DBInstances[0].{Status:DBInstanceStatus,...}'`.
- How to reason about that data. IF/THEN/ELSE decision trees with numeric thresholds. Not "if CPU is high, investigate further" but `IF CPUUtilization > 80 AND mean_exec_time_ms > 1000: Diagnosis = SLOW_QUERY_INDEX_GAP`.
- What is forbidden. A NEVER DO list specific to this domain and this agent's scope. Not "be safe" but "NEVER execute `CREATE INDEX` without explicit human approval — reason: it locks the table for the duration, blocking production writes during business hours."
- When to stop. Escalation rules with specific triggering conditions. The agent does not decide when to escalate based on its own judgment; the skill defines the escalation threshold.
The Expert Vocabulary Effect
Consider two versions of a skill excerpt:
Vague (poor skill):
Check if the database is having performance problems.
Look at the slow queries and see if anything looks wrong.
Expert vocabulary (good skill):
Step 1.2 — Query pg_stat_statements for high-latency queries:
```sql
-- pg_stat_statements (PG 13+) reports times in milliseconds as mean_exec_time /
-- total_exec_time; aliases keep the skill's field names explicit.
SELECT mean_exec_time AS mean_exec_time_ms,
       total_exec_time AS total_exec_time_ms,
       calls,
       rows / NULLIF(calls, 0) AS rows_per_call,
       query
FROM pg_stat_statements
WHERE mean_exec_time > 1000
ORDER BY mean_exec_time_ms DESC
LIMIT 20;
```
The vague version forces the LLM to invent the query structure. The expert vocabulary version leaves the LLM no room to improvise. The Brain's job is to execute this query and interpret the output — not to decide how to investigate slow queries.
This is the expert vocabulary effect: writing SKILL.md in the language of your domain (field names, threshold values, service identifiers, procedure steps) activates the LLM's training on that domain precisely.
2. Memory Types: State Management for Agents
Agents need to remember things. Not all memory works the same way, and choosing the wrong type creates either agents that forget too quickly or agents that accumulate stale context.
DevOps analogy: State management in distributed systems. You have in-memory state (fast, volatile), database state (persistent, queryable), and cached procedures (pre-compiled, reusable). Agents have the same architecture.
Short-Term Memory (Conversation Context)
Short-term memory is the active conversation context — everything the agent has received and generated in the current session. It lives in the context window.
Characteristics:
- Bounded by context window size (typically 100K-200K tokens in modern models)
- Lost when the session ends
- Perfect for: in-session reasoning, multi-step tasks, intermediate results
Operational equivalent: Your working memory during an incident. You hold the current log lines, the timeline, the commands you've run. When the incident closes, that context disperses.
Failure mode: Context window overflow. When a session accumulates too much context, older information gets truncated. An agent investigating a long-running incident may "forget" early observations by the time it reaches conclusions. Skill design must account for this — break long tasks into discrete sessions with explicit state handoffs.
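The truncation behavior can be sketched as a token-budget trim that drops the oldest turns first. This is a minimal illustration: real frameworks use model-specific tokenizers, so a rough characters-per-token heuristic stands in for a token count here.

```python
def trim_context(messages, budget):
    """Drop the oldest messages until the approximate token cost fits the budget.

    `messages` is a chronological list of strings; cost is approximated as
    len(text) // 4 (a rough chars-per-token heuristic). Newest messages
    are kept preferentially -- the failure mode above in miniature.
    """
    cost = lambda m: len(m) // 4
    kept, total = [], 0
    for msg in reversed(messages):       # walk newest -> oldest
        if total + cost(msg) > budget:
            break                        # everything older is truncated
        kept.append(msg)
        total += cost(msg)
    return list(reversed(kept))          # restore chronological order


history = ["early observation " * 50, "mid-incident note " * 10, "latest finding"]
trimmed = trim_context(history, budget=60)
# The earliest (largest) observation is dropped; recent context survives.
```

This is exactly why long investigations need explicit state handoffs: anything the agent wants to survive truncation must be written out of short-term memory before it ages off the end of the window.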
Long-Term Memory (Cross-Session Storage)
Long-term memory persists across sessions. The agent can store findings, learned patterns, and operational context in a database and retrieve them in future sessions.
Characteristics:
- Unlimited (bounded only by storage)
- Persistent across sessions
- Requires explicit retrieval (does not auto-appear in context)
- Tools: vector databases, key-value stores, structured databases
Operational equivalent: Your postmortem database, runbook wiki, incident history. Valuable knowledge accumulated over time. The problem is the same for both humans and agents: finding the relevant piece when you need it.
In Hermes: The memory tool lets agents store and retrieve long-term observations. A DB Health agent can store "this RDS instance consistently hits connection limits on Mondays at 09:00" — and retrieve that context in the next investigation session.
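A long-term store with explicit retrieval can be sketched in a few lines. This is illustrative only: a JSON file with keyword matching, not the Hermes memory tool's actual interface, and production agents would use a vector or key-value database.

```python
import json
import os
import tempfile

class MemoryStore:
    """Minimal long-term memory sketch: JSON file on disk, keyword retrieval."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return json.load(f)

    def store(self, observation):
        entries = self._load()
        entries.append(observation)
        with open(self.path, "w") as f:
            json.dump(entries, f)

    def retrieve(self, keyword):
        # Explicit retrieval: nothing appears in context unless queried.
        return [e for e in self._load() if keyword.lower() in e.lower()]


store = MemoryStore(os.path.join(tempfile.mkdtemp(), "memory.json"))
store.store("rds-prod-1 consistently hits connection limits on Mondays at 09:00")
hits = store.retrieve("connection limits")
```

The key property it demonstrates: persistence is cheap, but retrieval is the hard part. The observation survives the session, yet only reappears if a future session asks the right question.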
Procedural Memory (Skills)
Procedural memory is encoded expertise — not facts to recall, but procedures to execute. For agents, this is SKILL.md files loaded at runtime.
Characteristics:
- Instruction-based, not fact-based
- Loaded selectively (only the relevant skills for the current task)
- Versioned and improvable over time
- The agent follows the procedure, it does not recall it from training
Operational equivalent: Ansible playbooks. You don't need to remember every step of provisioning a new server — you run the playbook and it executes the procedure correctly every time. SKILL.md is the Ansible playbook for your agent's operational expertise.
Why this matters: An AI agent without procedural memory makes up its own procedure for every task. That procedure may be reasonable or it may miss your organization's specific safety steps, escalation paths, or conditional branches. SKILL.md replaces "AI improvises" with "AI executes your expertise."
How Skills Are Loaded at Runtime
When Hermes loads an agent profile, it scans the skills/ subdirectory and reads each SKILL.md file. The skill content is prepended to the system prompt — every skill the agent has is visible to the Brain from the first turn.
The session startup sequence:
1. Load `config.yaml` — determines model, toolsets, approval mode
2. Load `SOUL.md` — agent identity and behavioral constraints
3. Scan the `skills/` directory — load all SKILL.md files
4. Initialize the tool registry with enabled toolsets
5. Make the first LLM call with full context: SOUL.md + all skills + tool schemas
The agent's context window contains everything it needs from the first token. This is a deliberate tradeoff: simplicity and reliability over token efficiency.
For agents with 1-3 skills (typical for course agents), this is the right approach: the agent does not need to decide which skill to use before using it, because all skills are already in context. For production agents with many skills (10+), the skills_search tool becomes important — the agent queries it to locate the right skill by keyword before loading its full text.
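The startup assembly described above amounts to file concatenation. A sketch, assuming a `SOUL.md` plus `skills/<name>/SKILL.md` layout — the directory structure and function name are illustrative, not Hermes's exact loader:

```python
import tempfile
from pathlib import Path

def build_system_prompt(profile_dir):
    """Concatenate SOUL.md and every skills/**/SKILL.md into one system prompt.

    Sketch of the 'everything in context from the first turn' approach:
    identity first, then all skills, so the Brain sees them from token one.
    """
    profile = Path(profile_dir)
    parts = [(profile / "SOUL.md").read_text()]
    for skill in sorted(profile.glob("skills/**/SKILL.md")):
        parts.append(skill.read_text())
    return "\n\n".join(parts)


# Demo with a throwaway profile directory:
root = Path(tempfile.mkdtemp())
(root / "SOUL.md").write_text("You are a cautious DB health agent.")
(root / "skills" / "rds-health").mkdir(parents=True)
(root / "skills" / "rds-health" / "SKILL.md").write_text("Step 1: gather metrics.")
prompt = build_system_prompt(root)
```

The tradeoff is visible in the code: nothing is conditional, so nothing can be mis-selected — but every loaded skill spends context tokens whether or not the task needs it.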
3. RAG: When Skills Are Not Enough
SKILL.md covers procedural knowledge. But some agent tasks require factual retrieval — looking up current documentation, recent incidents, pricing data, or service configurations that change frequently and cannot be hardcoded.
RAG (Retrieval-Augmented Generation) is the pattern for giving agents access to a knowledge base they can query at runtime:
Query → Retrieve relevant documents → Augment context → Generate response
DevOps analogy: Incident response. A skilled SRE doesn't just remember everything — they know which dashboards to check, which runbooks to pull, and which recent tickets to review. RAG is this retrieval pattern formalized for AI agents.
Embeddings and Semantic Search
Traditional search is keyword-based. "EC2 high CPU" returns documents containing those exact words. But the document you need might use "compute instance memory pressure" or "worker node resource saturation" — the same concept, different words.
Embeddings solve this. An embedding is a vector representation of text meaning — a list of numbers that encodes semantic content. Texts with similar meanings have similar vectors, regardless of word choice.
DevOps analogy: Container image layers. Each image layer is a content hash of its contents. Images with similar contents have similar fingerprints (partial layer reuse). Embeddings do the same for text meaning — similar content, similar vector fingerprint.
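The "similar meaning, similar vector" property reduces to a similarity computation. A sketch with tiny hand-made 3-dimensional vectors — real embeddings come from an embedding model and have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings", hand-made for illustration:
ec2_high_cpu = [0.9, 0.1, 0.2]      # "EC2 high CPU"
node_saturation = [0.8, 0.2, 0.3]   # "worker node resource saturation"
dns_docs = [0.1, 0.9, 0.1]          # "DNS zone transfer guide"

# Same concept in different words lands close; an unrelated topic does not.
assert cosine_similarity(ec2_high_cpu, node_saturation) > 0.9
assert cosine_similarity(ec2_high_cpu, dns_docs) < 0.3
```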
Vector Databases
Vector databases store embeddings and support similarity search: "find the N documents whose meaning is most similar to this query."
| Database | Operational Analogy | Use Case |
|---|---|---|
| ChromaDB | Local file storage | Development, small knowledge bases |
| Pinecone | Managed S3 | Production, large knowledge bases |
| pgvector | PostgreSQL extension | Existing Postgres infrastructure |
| Weaviate | Elasticsearch for meaning | Hybrid keyword + semantic search |
The key operation: Given a query embedding, find documents whose embeddings are within a similarity threshold (cosine similarity, dot product). This is "find what means the same thing" as opposed to "find what contains the same words."
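The key operation can be sketched as a brute-force top-N search — what a vector database does at scale with approximate-nearest-neighbor indexes. The vectors and document texts here are hand-made stand-ins for real embeddings:

```python
import math

def top_n(query_vec, corpus, n=2):
    """Rank documents by cosine similarity to the query embedding, return top n."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    ranked = sorted(corpus, key=lambda d: cos(query_vec, d["embedding"]), reverse=True)
    return [d["text"] for d in ranked[:n]]


corpus = [
    {"text": "worker node resource saturation", "embedding": [0.8, 0.2, 0.3]},
    {"text": "DNS zone transfer guide",         "embedding": [0.1, 0.9, 0.1]},
    {"text": "instance CPU throttling",         "embedding": [0.9, 0.2, 0.1]},
]
results = top_n([0.9, 0.1, 0.2], corpus, n=2)   # query: "EC2 high CPU"
# Both CPU-related documents outrank the DNS guide, despite sharing no keywords.
```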
Agentic RAG: The Agent Decides What to Retrieve
Standard RAG retrieves once, adds results to context, generates response. Agentic RAG is more dynamic: the agent decides when and what to retrieve, mid-task, based on what it discovers.
Example flow:
- Agent receives task: "Investigate high P99 latency on the payments service"
- Agent retrieves: recent architecture docs for payments service
- Agent runs diagnostic commands, discovers connection pool saturation
- Agent decides to retrieve: documentation on PostgreSQL connection pool tuning
- Agent generates structured diagnosis + recommendations
DevOps analogy: An SRE who knows when to consult documentation and when to work from memory. They don't look everything up (too slow), and they don't rely purely on memory (too error-prone). Agentic RAG is this calibrated retrieval — triggered by what the agent discovers, not just by the initial query.
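The flow above can be sketched as a loop where a diagnostic finding, not the initial query, triggers the second retrieval. The helper names and data shapes are hypothetical; a real implementation would call an LLM and a vector store:

```python
def investigate(task, knowledge_base, diagnostics):
    """Agentic RAG sketch: retrieval is triggered by findings, not fixed up front.

    `knowledge_base` maps topic -> document and `diagnostics` maps check -> finding;
    both are stand-ins for a vector store and live CLI output.
    """
    context = [knowledge_base.get("architecture", "")]   # initial retrieval
    finding = diagnostics.get("connection_pool")         # run diagnostic commands
    if finding == "saturated":
        # Mid-task decision: the discovery itself triggers a second retrieval.
        context.append(knowledge_base.get("pg_pool_tuning", ""))
        return {"diagnosis": "connection pool saturation", "context": context}
    return {"diagnosis": "inconclusive", "context": context}


kb = {
    "architecture": "payments service talks to rds-payments via PgBouncer",
    "pg_pool_tuning": "raise default_pool_size cautiously; watch max_connections",
}
report = investigate("high P99 latency on payments", kb,
                     {"connection_pool": "saturated"})
```

The branch is the point: the tuning document is only pulled into context because the diagnostics surfaced pool saturation — standard RAG would have retrieved once and stopped.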
Graph RAG: When Relationships Matter
Graph RAG extends standard RAG with a knowledge graph — a network of entities and relationships. Instead of finding similar documents, the agent can traverse relationships: "this service depends on this database, which is also used by this other service, which recently had an incident."
DevOps analogy: A service mesh dependency map. You don't just need the most similar service definition — you need to understand which services are upstream and downstream of an incident, which share infrastructure, which have correlated failure patterns.
Graph RAG is more complex to set up than standard RAG and is typically introduced in advanced agent architectures. For Module 7's lab, standard RAG and procedural skills are the primary tools.
4. The Two-Zone Design: Why SKILL.md Has Distinct Phases
The most important structural concept in SKILL.md authoring is the two-zone design. This separates data collection from reasoning into distinct, sequential phases.
Without this constraint, agents exhibit a failure mode called mid-loop data discovery:
- Agent starts reasoning over the initial data (high CPU, slow queries visible)
- During reasoning, the agent realizes it needs more data (what's the table size? is there a lock?)
- Agent runs a new query to get that data
- New data reveals a new dimension to the problem
- Agent needs more data to understand the new dimension
- Loop continues — the agent is not converging on a diagnosis, it is discovering new data indefinitely
The result: unpredictable session duration, escalating token costs, and diagnoses that reach different conclusions depending on which data happened to be discovered in which order.
The two-zone design solves this:
Scripts Zone (Phase 1 and Phase 3) — deterministic: Run all the CLI commands. Collect all the data. No decisions. No interpretation. No "if this result looks concerning, also run X." Just: run this, get that, move to Phase 2.
The Scripts Zone is idempotent and deterministic. Running Phase 1 twice on the same database produces the same output. There is no branching based on intermediate results.
Agents Zone (Phase 2 and Phase 4) — reasoning: Reason over the complete dataset collected in Phase 1. Apply the decision tree. Produce a named diagnosis or escalation. No new data collection. The reasoning is bounded.
Correct Scripts Zone example:
```bash
# pg_stat_statements (PG 13+) columns are mean_exec_time / total_exec_time /
# rows; aliases keep the skill's field names explicit.
psql -h $DB_HOST -p ${DB_PORT:-5432} -U $DB_USER -d $DB_NAME --csv -c \
  "SELECT mean_exec_time AS mean_exec_time_ms,
          total_exec_time AS total_exec_time_ms,
          calls,
          rows / NULLIF(calls, 0) AS rows_per_call,
          LEFT(query, 200) AS query
   FROM pg_stat_statements
   WHERE mean_exec_time > 1000
   ORDER BY mean_exec_time_ms DESC
   LIMIT 20"
```
No interpretation. No branching. Run this, get that.
Correct Agents Zone example:
```
IF mean_exec_time_ms > 5000:
    THEN Diagnosis = "CRITICAL_SLOW_QUERY"
    → Escalate immediately (see Escalation Rules)
ELSE IF mean_exec_time_ms > 1000 AND sequential_scan_pct > 80:
    THEN Diagnosis = "SLOW_QUERY_INDEX_GAP"
    → Proceed to Phase 3 (index recommendation)
```
No CLI commands. Only IF/THEN/ELSE logic on the data already collected.
Mixed violation (Tier 4 FAIL — do not do this):
```
Phase 2 — Analysis:
Check if the queries are slow:
aws cloudwatch get-metric-statistics ...    ← CLI command in Agents Zone!
IF the metrics show high CPU...             ← vague condition
```
This violates both zones. The result is an agent that runs unpredictable queries during reasoning, making its behavior session-dependent and unauditable.
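The zone separation maps naturally onto code structure: Phase 1 is a fixed command list, Phase 2 a pure function over the collected data. A sketch with command execution stubbed out — the command names and data shapes are illustrative:

```python
PHASE_1_COMMANDS = [        # Scripts Zone: fixed list, no branching
    "check_cpu",
    "check_slow_queries",
]

def run_scripts_zone(executor):
    """Run every Phase 1 command unconditionally; return the complete dataset."""
    return {cmd: executor(cmd) for cmd in PHASE_1_COMMANDS}

def agents_zone(data):
    """Phase 2: pure reasoning over collected data -- no new data collection."""
    slow = data["check_slow_queries"]["mean_exec_time_ms"]
    if slow > 5000:
        return "CRITICAL_SLOW_QUERY"
    if slow > 1000 and data["check_cpu"]["cpu_pct"] > 80:
        return "SLOW_QUERY_INDEX_GAP"
    return "NO_DIAGNOSIS"


# Stub executor standing in for real CLI calls:
fake = {"check_cpu": {"cpu_pct": 85},
        "check_slow_queries": {"mean_exec_time_ms": 1500}}
diagnosis = agents_zone(run_scripts_zone(lambda cmd: fake[cmd]))
```

Because `agents_zone` receives data and returns a label without performing I/O, mid-loop data discovery is structurally impossible: reasoning cannot trigger new collection, and rerunning Phase 1 on the same inputs yields the same dataset.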
5. Hallucination and Why Skills Provide Guardrails
AI agents hallucinate. This is not a bug — it is a property of how language models work. The model generates the most probable next token, and sometimes the most probable token is wrong.
The risk in operational contexts:
An agent without guardrails might:
- Recommend deleting a table that "probably" isn't in use
- Generate a kubectl command with a flag that does not exist
- Describe a resolution step from a different service's runbook that happens to look similar
Why SKILL.md reduces hallucination risk:
Skills constrain the agent's action space. Instead of "what should I do about high CPU?", the agent reads: "for CPU > 80% sustained 5 min: run `aws cloudwatch get-metric-statistics`, check against the threshold table, if confirmed: escalate via PagerDuty at priority P2, do not restart the instance."
The agent cannot hallucinate steps that are not in the skill. It follows the decision tree. Gaps in the skill are gaps in behavior — which is why you test skills against realistic scenarios before deploying them.
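In harness code, one way this constraint is enforced is a forbidden-pattern check applied before any proposed command executes. The patterns below are hypothetical examples; in practice they would come from the skill's own NEVER DO list:

```python
import re

# Hypothetical NEVER DO patterns, as a skill's forbidden list might define them:
FORBIDDEN = [
    r"\bDROP\s+TABLE\b",
    r"\bCREATE\s+INDEX\b",        # forbidden without explicit human approval
    r"\bdelete-db-instance\b",
]

def check_command(cmd):
    """Return the violated pattern, or None if the command passes the guardrail."""
    for pattern in FORBIDDEN:
        if re.search(pattern, cmd, flags=re.IGNORECASE):
            return pattern
    return None


assert check_command("SELECT * FROM pg_stat_statements LIMIT 20") is None
assert check_command("create index idx_orders ON orders(user_id)") is not None
```

A check like this is a backstop, not a substitute for skill design: it catches a hallucinated destructive command, but only the decision tree keeps the agent on the right diagnostic path.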
Temperature and skill execution: Skills should be executed at low temperature (0.0-0.1). High temperature introduces variability in which branch of a decision tree the agent follows. For operational reliability, you want deterministic behavior — which requires both good skill design and low temperature configuration.
6. Skills as Procedural Memory: The Architecture Connection
Skills are loaded into the agent's context window at runtime. They are not training data — the agent does not "learn" skills in the machine learning sense. It reads them the same way it reads any text in its context.
This has important implications:
Skills compete for context space: An agent with 20 loaded skills is using significant context tokens on skill definitions, leaving less space for operational data and conversation history. Selective skill loading (load only the skills relevant to the current task) is essential for context efficiency.
Skills are versionable: Like Ansible playbooks, you can maintain multiple versions of a skill. A skill for an older RDS version can be replaced with a new one for the upgraded version. The agent picks up the new procedure without retraining.
Skills are auditable: Every action the agent takes is grounded in a specific step in a specific skill version. When an agent recommends "increase max_connections to 200", that recommendation traces back to rds-health-v1.2.md, step 4.b.iii. This auditability matters for enterprise governance (Module 13).
The improvement loop: Skills can be improved iteratively. Run the skill, observe where the agent's execution diverges from expected behavior, refine the skill, retest. This is the Design → Validate → Version → Deploy → Improve lifecycle.
Summary
| Concept | What It Is | Operational Analogy |
|---|---|---|
| Short-term memory | Active conversation context (context window) | Working memory during an incident |
| Long-term memory | Cross-session persistent storage | Postmortem database, incident history |
| Procedural memory | Skills loaded at runtime (SKILL.md) | Ansible playbooks |
| RAG | Retrieve documents to augment context before generation | SRE checking dashboards and runbooks before diagnosis |
| Embeddings | Vector representation of text meaning | Container image layer hashes (similar content, similar fingerprint) |
| Agentic RAG | Agent-triggered retrieval mid-task | SRE knowing when to look something up vs. work from memory |
| Scripts Zone | Phase of a skill that runs deterministic CLI commands | Automation runbook — exact commands, no judgment |
| Agents Zone | Phase of a skill that applies IF/THEN/ELSE reasoning to collected data | Post-incident analysis using gathered facts |
The core principle: Skills encode what your agent knows how to do. RAG retrieves what your agent needs to know for this specific situation. Together, they replace "AI improvises" with "AI executes expertise."