Module 11 Lab: Triggers & Scheduling for Kubernetes (Track C)
Duration: 90 minutes (70 min guided + 20 min free explore)
Track: C -- Kubernetes Health & Self-Healing
Prerequisite: Module 8 Track C lab complete (working track-c profile); KIND cluster running
Outcome: Working examples of all 5 trigger patterns using real KIND infrastructure
This lab moves your Hermes agent from reactive (you type a prompt) to proactive (the agent runs on a schedule or reacts to events). You will build five trigger mechanisms: a cron schedule for daily health checks, an AlertManager webhook for incident response, a K8s CronJob for containerized agent runs, a GitHub webhook for PR review automation, and a Telegram bot for slash-command diagnostics from your phone.
All exercises use real infrastructure on your KIND cluster -- no mock mode, no simulated data.
Prerequisites (5 min)
Verify your environment before starting.
Confirm the KIND cluster is running and reachable:
kubectl get nodes
kubectl get pods -A
Verify the Track C agent responds:
hermes -p track-c chat
# Type "hello" to confirm the agent responds, then exit with Ctrl+C
Verify Hermes gateway capability:
hermes --version
Start the gateway (runs in the background as a service):
hermes gateway start
hermes gateway status
# Expected: Gateway running
Throughout this lab, keep a separate terminal open with:
tail -f ~/.hermes/logs/gateway.log
This is your real-time view into everything the gateway does — cron ticks, webhook receipts, agent runs, delivery results. When something doesn't work, this is the first place to check.
GUIDED PHASE -- 70 minutes
Step 1: Hermes Cron -- Daily K8s Health Check (10 min)
Cron expressions use the standard five-field format: minute hour day-of-month month day-of-week
| Expression | Meaning |
|---|---|
| 0 8 * * * | 8:00 AM every day |
| 30 9 * * 1-5 | 9:30 AM weekdays only |
| */5 * * * * | Every 5 minutes |
| 0 0 * * 0 | Midnight every Sunday |
One-time setup: install skill globally and configure API key
Hermes cron jobs run outside any profile context — the cron scheduler creates its own agent instance with access to globally installed skills only. Before creating your first cron job, install the Track C skill globally and ensure the API key is available:
# Install the sre-k8s-pod-health skill globally
cp -r ~/.hermes/profiles/track-c/skills/sre-k8s-pod-health ~/.hermes/skills/
# Verify it is visible globally
ls ~/.hermes/skills/sre-k8s-pod-health/
# Expected: SKILL.md (and possibly other files)
# Ensure the Anthropic API key is in the global .env
# (Cron jobs don't read profile-level .env files)
grep -q ANTHROPIC ~/.hermes/.env 2>/dev/null || \
grep ANTHROPIC ~/.hermes/profiles/track-c/.env >> ~/.hermes/.env
When you run hermes -p track-c chat, Hermes loads the profile's SOUL.md, config, and
skills. But the cron scheduler runs jobs in its own thread — it builds a fresh agent with
no profile context. It resolves skills from ~/.hermes/skills/ (global) only. The API key
must also be in ~/.hermes/.env since the cron agent does not read profile .env files.
Create the cron job
hermes cron create "0 8 * * *" \
"Run morning pod health check across all namespaces. Report only if pods or nodes show issues." \
--name "daily-k8s-check" \
--skill "sre-k8s-pod-health" \
--deliver local
What each argument does
| Argument | Purpose |
|---|---|
| schedule (1st positional) | Cron expression defining when the job runs (e.g. "0 8 * * *", "*/5 * * * *", or a shorthand like "30m", "every 2h"). |
| prompt (2nd positional) | What the agent is asked to do when it fires. Must be self-contained -- the cron agent has no chat history. |
| --name | Human-readable job name (kebab-case). Used to reference the job in other commands. |
| --skill | Skill to load before running the prompt. The agent reads the SKILL.md runbook first. Repeat the flag to attach multiple skills. |
| --deliver local | Output goes to your terminal. In production, use --deliver slack or --deliver telegram to route findings to your notification system. |
| --repeat N | Optional -- limits the job to N executions. Omit for indefinite scheduling. |
schedule must come first, prompt second. You can put the --name, --skill,
and --deliver flags anywhere (before, between, or after the positionals) since
they are keyword arguments.
Verify the job was registered
hermes cron list
Expected output:
ID Name Schedule Next Run Skill State
f8081a091378 daily-k8s-check 0 8 * * * 2026-04-10 08:00:00 sre-k8s-pod-health scheduled
Using --deliver local routes the agent's output to your terminal
session. This is the right choice for lab work -- you see output immediately without needing
to configure Slack or Telegram. In production, you would use --deliver slack (configured
in ~/.hermes/config.yaml) or --deliver telegram so findings reach your on-call channel
even if you are not at your terminal.
Trigger manually and verify output
You scheduled the job for 8 AM -- but you do not need to wait. Manual trigger fires the job immediately and is your primary verification tool.
Note the Job ID (the hex hash like f8081a091378) from hermes cron list -- hermes cron run
takes the job ID, not the name:
# Use the Job ID from hermes cron list output (first column)
hermes cron run <job-id>
# Example: hermes cron run f8081a091378
How cron execution works:
hermes cron run does NOT execute the job in your terminal. It marks the job as "due"
and waits for the gateway's scheduler to pick it up on the next tick (up to 60 seconds).
The gateway runs the job in its own thread, and with --deliver local, the output is
saved to a file.
If you don't want to wait for the next tick, force it immediately:
# Option 1: Wait for the gateway's scheduler (~60 seconds)
hermes cron run <job-id>
# then wait...
# Option 2: Force the tick immediately (recommended for lab work)
hermes cron run <job-id>
hermes cron tick
hermes cron tick manually triggers the scheduler — it checks for any due jobs and
executes them right now. You will see the agent's tool calls (kubectl commands) stream
to your terminal in real time. This is the fastest way to verify your cron job works.
Tip: use hermes cron tick during lab work whenever you want to see results immediately
after hermes cron run. In production, the gateway runs ticks automatically every
60 seconds — you never need to call tick manually.
Read the saved output file:
ls ~/.hermes/cron/output/
# You should see a directory named with your job ID
# Read the output (replace with your job ID)
cat ~/.hermes/cron/output/<job-id>/*.md
Expected output shape (from a healthy KIND cluster):
# Kubernetes Health Check -- kind-lab
Namespaces scanned: kube-system, default, local-path-storage
All pods running. No restarts detected above threshold.
Node kind-control-plane: Ready, allocatable CPU 4 cores, memory 8Gi.
When the cron agent finds nothing to report, it may respond with [SILENT].
This suppresses delivery to external channels -- you will not receive a Slack or Telegram
notification. This is by design: agents that cry wolf on every run lose their usefulness.
The agent only delivers a full report when it finds something worth reporting.
With --deliver local, output is always saved to ~/.hermes/cron/output/<job-id>/ regardless
of whether the agent reports findings or stays silent. Check the file to see what happened.
Lifecycle demo: pause and resume
Pause the job without deleting it:
hermes cron pause <job-id>
Check the paused state:
hermes cron status
Expected output shows the job in paused state:
Scheduler: running
Jobs registered: 1
daily-k8s-check PAUSED
Next tick: in ~43s
Resume the job:
hermes cron resume <job-id>
Verify it is back to scheduled state:
hermes cron status
Expected:
Scheduler: running
Jobs registered: 1
daily-k8s-check scheduled next: 2026-04-10 08:00:00
Next tick: in ~51s
Step 2: AlertManager -- Prometheus Stack + Live Webhook (20 min)
You used hermes cron run to fire an agent manually. Now you wire a REAL alert source: the Prometheus + AlertManager stack on your KIND cluster, firing on a real broken pod. This is the moment Hermes stops being a chat agent and starts being an incident-response agent -- alerts arrive without you typing anything.
Sub-step 2a: Verify the Prometheus + AlertManager stack
The Prometheus stack was installed during initial course setup (make monitoring in the
reference-app directory, Helm release name monitoring). Verify it is running:
# Verify Prometheus and AlertManager pods are running
kubectl get pods -n monitoring | grep -E "prometheus|alertmanager"
# Expected: prometheus-monitoring-* and alertmanager-monitoring-* pods in Running state
# Verify AlertManager is accessible
curl -s http://localhost:30093/-/healthy
# Expected: OK
If you don't see the monitoring pods, install the stack:
cd reference-app && make monitoring && cd ..
This installs kube-prometheus-stack with Helm release name monitoring.
Sub-step 2b: Apply the PrometheusRule
Apply the PrometheusRule that fires when a pod in the k8s-trouble-crashloop namespace has more than 1 restart:
kubectl apply -f infrastructure/scenarios/k8s/alertmanager/prometheus-rules.yaml
Expected output:
prometheusrule.monitoring.coreos.com/hermes-lab-rules created
Verify the rule loaded:
kubectl get prometheusrule -n monitoring hermes-lab-rules -o yaml | yq '.spec.groups[0].rules[0].alert'
Expected output:
PodCrashLooping
Open the Prometheus UI at http://localhost:30091 and navigate to Status then Rules. You should see the hermes-lab.k8s-crashloop group with the PodCrashLooping rule listed and state "inactive" (no broken pods yet).
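For reference, the rule manifest is shaped roughly like this. This is a sketch reconstructed from what this step observes (group name, alert name, metric, threshold, release label); the severity and annotation values are illustrative assumptions, and the file you applied is authoritative:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hermes-lab-rules
  namespace: monitoring
  labels:
    release: monitoring        # required -- see the note below
spec:
  groups:
    - name: hermes-lab.k8s-crashloop
      rules:
        - alert: PodCrashLooping
          # Fires when any container in the namespace has restarted more than once
          # (no "for:" delay, so PENDING -> FIRING is immediate)
          expr: kube_pod_container_status_restarts_total{namespace="k8s-trouble-crashloop"} > 1
          labels:
            severity: warning  # illustrative assumption
          annotations:
            summary: Pod {{ $labels.pod }} in {{ $labels.namespace }} is crash-looping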
Why the release label matters: the PrometheusRule manifest includes labels: release: monitoring. The kube-prometheus-stack
Helm chart configures Prometheus to ONLY load rules whose release label matches the Helm
release name. In this course the release is named monitoring (from make monitoring), so
the label must match. Without it, your rule would be silently ignored — kubectl get prometheusrule
would show it, but the Prometheus UI Rules page would not. Verify with:
kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.ruleSelector}'
Sub-step 2c: Configure the webhook subscription and restart the gateway
The webhook subscription needs two settings that differ from the defaults:
- No HMAC signature — AlertManager doesn't sign its POST requests, so the subscription must use INSECURE_NO_AUTH to skip signature validation
- Deliver to log — webhook-triggered output uses log delivery (saved to the gateway log), not local (which is cron-only)
Edit the subscription file directly (or create it if it doesn't exist):
cat > ~/.hermes/webhook_subscriptions.json << 'EOF'
{
"alertmanager": {
"description": "AlertManager PodCrashLooping webhook",
"events": [],
"secret": "INSECURE_NO_AUTH",
"prompt": "AlertManager PodCrashLooping alert fired. Details: {alerts}. Load the sre-k8s-pod-health skill and diagnose the affected pod in the namespace shown in the alert labels.",
"skills": [
"sre-k8s-pod-health"
],
"deliver": "log"
}
}
EOF
Why {alerts} and not {alerts[0].labels.pod}? The Hermes prompt template only supports dot-notation access to dict keys, NOT array index
access. The agent receives the full alerts[] JSON array as a string and parses it to find
the pod and namespace. If you use {alerts[0].labels.pod}, it renders as a literal string.
Why INSECURE_NO_AUTH? In production, webhook endpoints should validate HMAC signatures to prevent unauthorized
requests from triggering agent runs. AlertManager does not support HMAC signing natively.
For lab work, INSECURE_NO_AUTH skips validation. In production, you would place
AlertManager behind a reverse proxy that adds HMAC signatures, or use network-level
access control (Kubernetes NetworkPolicy, firewall rules).
"events": []?AlertManager POSTs do not include event-type headers (X-GitHub-Event, X-GitLab-Event).
An empty events list means "accept all POSTs to this route" — no event filtering.
Now restart the gateway to pick up the new subscription:
hermes gateway stop
hermes gateway start
# Verify the subscription loaded
hermes webhook list
Sub-step 2d: Apply the crashloop scenario and observe
Apply the broken pod manifest and watch the restart count climb:
# Terminal 3: Apply the broken pod (reuses the crashloop scenario manifest)
kubectl apply -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
# Watch the pod restart count climb
watch kubectl get pods -n k8s-trouble-crashloop
How the AlertManager → Hermes chain works
┌──────────────────────────────────────────────────────────────────────┐
│ KIND Cluster │
│ │
│ ┌─────────────┐ scrapes ┌──────────────┐ │
│ │ kube-state- │──────────────▶│ Prometheus │ │
│ │ metrics │ every 30s │ :30091 │ │
│ └─────────────┘ └──────┬───────┘ │
│ ▲ │ rule fires │
│ │ watches │ (restarts > 1) │
│ ┌─────┴───────┐ ┌──────▼───────┐ │
│ │ crasher pod │ │ AlertManager │ │
│ │ CrashLoop │ │ :30093 │ │
│ │ BackOff │ └──────┬───────┘ │
│ └─────────────┘ │ POST webhook │
│ │ │
└─────────────────────────────────────────┼────────────────────────────┘
│
▼
┌───────────────────────┐
│ Hermes Gateway │ ← your Mac
│ :8644 │
│ /webhooks/alertmanager│
└───────────┬───────────┘
│ spawns agent
▼
┌───────────────────────┐
│ Hermes Agent │
│ loads sre-k8s-pod- │
│ health skill │
│ runs kubectl get/ │
│ describe/logs │
│ produces diagnosis │
└───────────────────────┘
Expected timeline
- t=0s: pod applied, status ContainerCreating
- t=10s: pod status CrashLoopBackOff, restartCount=1
- t=30s: Prometheus scrapes kube-state-metrics, sees restartCount > 1
- t=60s: rule fires → alert goes PENDING → FIRING (no for: delay)
- t=70s: AlertManager dispatches POST to host.docker.internal:8644
- t=75s: Hermes gateway accepts the webhook, spawns an agent run
- t=110s: agent completes the diagnosis (kubectl calls + LLM reasoning ~35s)
Open the AlertManager UI at http://localhost:30093 to see the active alert and its receiver
routing to hermes-webhook.
How to verify each step
# 1. Pod is crashing?
kubectl get pods -n k8s-trouble-crashloop
# Expected: CrashLoopBackOff with restarts > 1
# 2. Prometheus sees the metric?
# Open http://localhost:30091 → Query tab → run:
# kube_pod_container_status_restarts_total{namespace="k8s-trouble-crashloop"}
# Expected: value > 1
# 3. Alert is firing?
# Open http://localhost:30091 → Alerts tab
# Expected: PodCrashLooping in FIRING state
# 4. AlertManager received the alert?
# Open http://localhost:30093
# Expected: PodCrashLooping alert with receiver hermes-webhook
# 5. Gateway received the POST?
tail -20 ~/.hermes/logs/gateway.log | grep -i "accepted\|alertmanager"
# Expected: {"status": "accepted", "route": "alertmanager", ...}
# 6. Agent ran and produced output?
tail -50 ~/.hermes/logs/gateway.log | grep -i "response ready"
# Expected: response ready: platform=webhook ... response=NNNN chars
Where to see the agent's diagnosis
With --deliver log, the agent's full diagnosis is written to the gateway log:
# View the full agent response
tail -200 ~/.hermes/logs/gateway.log | grep -A100 "Response for webhook:alertmanager"
These use different delivery mechanisms:
- Cron jobs (--deliver local): saved to ~/.hermes/cron/output/<job-id>/*.md
- Webhook triggers (--deliver log): written to ~/.hermes/logs/gateway.log
In production, both can deliver to Telegram or Slack instead.
You now have both trigger mechanisms running:
- Cron asks "is anything wrong?" It fires on a schedule whether or not an alarm has fired. Use it for proactive health checks and daily summaries.
- Webhook reacts to "something IS wrong." It fires in response to an external event -- an AlertManager alert, a Kubernetes pod crash, a GitHub PR event. Use it for incident-response automation where latency matters.
Both can load the same skill and run the same investigation prompt. The difference is timing and trigger: scheduled vs event-driven.
Cleanup after observing
kubectl delete -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
hermes webhook unsubscribe alertmanager
Step 3: K8s CronJob -- Agent as a Kubernetes Workload (15 min)
The Hermes cron jobs you built in Step 1 are the production pattern for most agent work. This step demonstrates the SAME agent wrapped in a native K8s CronJob -- and makes explicit WHEN each pattern is the right answer.
Build the minimal hermes-agent image
# Build the minimal hermes-agent container image
# (This takes 5-10 min -- the image is ~700-900MB of Python deps)
docker build -t hermes-lab:cronjob infrastructure/scenarios/k8s/cronjob/
Verify the image exists:
docker images | grep hermes-lab
The infrastructure/scenarios/k8s/cronjob/Dockerfile uses python:3.12-slim as a base, not the official nousresearch/hermes-agent:latest (which is 2-3GB with Playwright/ffmpeg). This minimal image is a teaching artifact about packaging agents for K8s -- keeping the image lean makes KIND loads faster and demonstrates production best practices for agent containers.
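One plausible shape for that Dockerfile -- a sketch only, not the actual course file; the kubectl install step, requirements file, and entrypoint are assumptions:
FROM python:3.12-slim

# kubectl so the agent can query the cluster (version and arch are illustrative)
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && curl -fsSLo /usr/local/bin/kubectl \
       https://dl.k8s.io/release/v1.30.0/bin/linux/amd64/kubectl \
    && chmod +x /usr/local/bin/kubectl \
    && rm -rf /var/lib/apt/lists/*

# Agent + Python deps only -- no Playwright/ffmpeg, which is what keeps this
# image in the ~700-900MB range instead of the 2-3GB official image
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# ANTHROPIC_API_KEY arrives at runtime via a Secret-backed env var (see below)
ENTRYPOINT ["hermes"]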
Load the image into KIND and create the API key secret
# Load into KIND (required -- the CronJob uses imagePullPolicy: IfNotPresent, not a registry)
kind load docker-image hermes-lab:cronjob --name lab
# Create the API key secret (NEVER commit this token -- it lives only in your local KIND cluster)
kubectl create secret generic hermes-secrets \
--from-literal=anthropic-api-key="$ANTHROPIC_API_KEY"
# Verify the secret was created
kubectl get secret hermes-secrets
Apply the CronJob for Track C
kubectl apply -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-c
# Watch jobs spawn (schedule is */5 * * * * -- wait up to 5 min for the first run)
watch kubectl get jobs,pods
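While you wait, here is roughly what the manifest you applied looks like -- a sketch; the names, agent command, and any resource settings are assumptions, and the repo file is authoritative:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: agent-health-check
  labels:
    track: track-c             # matched by the -l track=track-c selector above
spec:
  schedule: "*/5 * * * *"      # every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: agent
              image: hermes-lab:cronjob
              imagePullPolicy: IfNotPresent   # uses the image loaded via kind load
              env:
                - name: ANTHROPIC_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: hermes-secrets
                      key: anthropic-api-key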
View logs from the first completed job
Once a job pod reaches Completed status, read its logs:
kubectl logs -l job-name=$(kubectl get jobs -o jsonpath='{.items[-1].metadata.name}')
Use Hermes cron when:
- The agent benefits from gateway-shared state (loaded skills, audit trail, conversation history)
- You want one-stop CLI management (hermes cron create/list/run/pause/resume)
- You are iterating fast -- tweak a prompt, re-register the cron, done. No image-rebuild cycle.
- You need audit trail context linking cron runs to skill and prompt versions
- You are not (yet) in Kubernetes
Use K8s CronJob when:
- Stateless one-shot diagnostics -- no state needed from previous runs
- GitOps schedule-in-git -- you want the schedule reviewed via PR and deployed via ArgoCD/Flux
- K8s-native observability -- Prometheus kube_job_status_* metrics, kubectl get jobs, Loki logs
- Multi-tenant isolation -- namespace isolation, NetworkPolicies, resource quotas, Secrets
The honest real-world stance: most agent work uses Hermes cron because state matters. K8s CronJob shines for fire-and-forget diagnostic jobs deployed alongside other K8s primitives via the same GitOps pipeline.
Cleanup
kubectl delete -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-c
kubectl delete secret hermes-secrets
Step 4: GitHub Webhook -- PR Review Bot (15 min)
GitHub webhooks need a public HTTPS endpoint to POST to. smee.io is a free public webhook proxy: you get a unique channel URL, GitHub POSTs to it, and a smee-client on your laptop forwards events to your local Hermes gateway.
This step requires a personal GitHub repo and a GitHub PAT. If you do not have these, skip to the Solo Learner fallback section at the bottom of this step -- you can simulate the full GitHub webhook flow without any external service using the bundled sample PR payload.
Sub-step 4a: Get a smee.io channel URL
Visit https://smee.io/ in your browser and click "Start a new channel". Copy the URL -- it looks like https://smee.io/abc123XYZ.
Sub-step 4b: Get a GitHub PAT with repo scope
- Open https://github.com/settings/tokens and click "Generate new token" then "Generate new token (classic)"
- Note: hermes-lab-triggers, Expiration: 30 days
- Scopes: check repo (includes read+write to PRs and comments)
- Click "Generate token" and copy the ghp_... value
Set the required environment variables:
export GITHUB_TOKEN="ghp_..." # Your PAT from step above
export SMEE_URL="https://smee.io/your-channel-id" # Your smee.io channel URL
Authenticate the gh CLI with your PAT:
gh auth login --with-token <<< "$GITHUB_TOKEN"
gh auth status # Should show "Logged in to github.com as <you>"
Sub-step 4c: Start the gateway and smee-client
# Terminal 1: Start the Hermes gateway
hermes gateway run
# Terminal 2: Run the smee setup script (foreground -- leave it running)
./infrastructure/scenarios/k8s/github-webhook/smee-setup.sh
Expected smee output:
================================================================
smee.io -> Hermes webhook gateway forwarder
================================================================
Source: https://smee.io/abc123XYZ
Target: http://localhost:8644/webhooks/github
Client: smee-client@5.0.0
================================================================
Forwarding https://smee.io/abc123XYZ to http://localhost:8644/webhooks/github
Sub-step 4d: Add the webhook to your GitHub repo
In your test GitHub repo: Settings -> Webhooks -> Add webhook
- Payload URL: your smee.io channel URL ($SMEE_URL)
- Content type: application/json
- Secret: (leave blank for the lab)
- Events: "Let me select individual events" then check Pull requests
- Active: checked
Click Add webhook. GitHub will send a test ping -- you should see Forwarding event to localhost:8644/webhooks/github in the smee terminal.
The smee-setup.sh script targets http://localhost:8644/webhooks/github. When you subscribe below, use hermes webhook subscribe github (NOT github-webhook). The route name after /webhooks/ MUST match the subscription name exactly, or events arrive at the gateway but no subscription receives them.
Subscribe the GitHub webhook
# Terminal 3: Subscribe with the agent prompt template and github_comment delivery
hermes webhook subscribe github \
--events "pull_request" \
--prompt "$(cat infrastructure/scenarios/k8s/github-webhook/agent-prompt-template.txt)" \
--deliver github_comment \
--deliver-chat-id "{repository.full_name}:{pull_request.number}"
# Verify the subscription is active
hermes webhook list
Trigger the event
Open a PR on your test repo (or push a commit to a branch that already has an open PR).
Watch the flow:
- smee terminal shows Forwarding event to localhost:8644/webhooks/github
- gateway terminal shows Received github webhook event
- ~10 seconds later: the agent runs, generates a review summary, and posts a comment back to the PR
- GitHub PR -- refresh the PR comments and you should see the Hermes review comment
Verify the comment landed:
gh pr view <PR_NUMBER> --repo <OWNER>/<REPO> --json comments
Solo Learner fallback -- no GitHub repo or smee.io needed
Skip the smee setup above and use the bundled sample payload instead:
# Subscribe with --deliver local instead of --deliver github_comment
hermes webhook subscribe github \
--events "pull_request" \
--prompt "$(cat infrastructure/scenarios/k8s/github-webhook/agent-prompt-template.txt)" \
--deliver local
# Inject the bundled sample payload (PR #42: feat(api): add /health readiness endpoint)
hermes webhook test github \
--payload @infrastructure/scenarios/k8s/github-webhook/sample-pr-payload.json
The agent runs identically -- the review goes to your terminal instead of back to GitHub. The sample payload is a valid GitHub PR webhook structure with all the fields the prompt template references (pull_request.number, pull_request.title, repository.full_name, etc.).
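Abridged, the sample payload follows the standard GitHub pull_request event shape (the PR number and title below come from the bundled sample; the repository name is a placeholder and deeper nested fields are elided):
{
  "action": "opened",
  "pull_request": {
    "number": 42,
    "title": "feat(api): add /health readiness endpoint"
  },
  "repository": {
    "full_name": "your-user/your-repo"
  }
}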
Cleanup
hermes webhook unsubscribe github
Step 5: Telegram Bot -- Chat Interface (15 min)
Telegram is the right primary chat platform for this lab: free, no admin approval, works for every learner globally. Hermes has a built-in interactive setup that handles bot creation, user allowlists, and home channel configuration in one step.
Telegram bots use long polling. Only ONE bot instance can poll at a time. If you previously
ran hermes gateway run in another terminal, stop it first with hermes gateway stop and
wait 30 seconds before starting, or you will get 409 Conflict errors. This is a Telegram
API restriction, not a Hermes bug.
Sub-step 5a: Create your bot via @BotFather (~2 min)
- Open Telegram (mobile app or https://web.telegram.org)
- Search for @BotFather (verified blue checkmark)
- Send /newbot
- Choose a display name (e.g., Hermes Lab Bot)
- Choose a username -- must end in bot (e.g., hermes_lab_yourname_bot)
- Copy the bot token -- it looks like 123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11
- Get your user ID: search for @userinfobot, send /start, copy the numeric ID
Sub-step 5b: Configure Telegram via hermes gateway setup
Instead of manually exporting environment variables, use the interactive setup:
hermes gateway stop # stop any running gateway first
hermes gateway setup
The setup wizard walks you through:
- Select Telegram from the platform list
- Paste your bot token from @BotFather
- Enter your user ID from @userinfobot (this creates the allowlist)
- Set the home channel ID -- use your user ID (your DM with the bot)
- Say yes to restart the gateway
The home channel is where Hermes delivers cron job results and notifications. When you
set your Telegram user ID as the home channel, cron jobs created with --deliver telegram
send their output directly to your Telegram DM. Without a home channel, delivery to
"telegram" fails silently because Hermes doesn't know which chat to target.
After setup completes, verify the gateway is running with Telegram connected:
hermes gateway status
# Expected: Gateway running, telegram connected
Sub-step 5c: Send slash commands from Telegram
- Open Telegram on your phone
- Search for your bot's username and open the chat
- Send:
/help
Expected: bot replies with available commands.
- Send the diagnostic command:
/diagnose k8s-trouble-crashloop
Within ~30 seconds, the bot replies with the agent's diagnosis from running kubectl commands against your KIND cluster.
- Check gateway status:
/status
Sub-step 5d: Conversational chat via Telegram
The Telegram bot is not limited to slash commands -- it is a full conversational interface
to your Hermes agent. Any free-form text you type goes directly to the agent as a prompt.
This is the same experience as hermes chat, but from your phone.
Try these conversational prompts:
Check all pods in the default namespace and tell me if anything looks unhealthy
What nodes are in my cluster and how much CPU is available?
I just deployed a new version of the API. Can you verify the rollout status?
The agent runs kubectl commands against your KIND cluster and replies in the same chat.
Both work identically. Slash commands like /diagnose are just text starting with /.
Free-form text works the same way. Use whichever feels natural.
Sub-step 5e: Set home channel for cron delivery
You can also set the home channel directly from Telegram by typing /sethome in the
bot chat. This tells Hermes "deliver cron job output and notifications to this chat."
/sethome
Expected: bot confirms this chat is now the home channel for Telegram delivery.
Now update your cron job to deliver via Telegram instead of local files:
# Delete old cron job
hermes cron delete daily-k8s-check
# Recreate with Telegram delivery
hermes cron create "0 8 * * *" \
"Run morning pod health check across all namespaces. Report only if pods or nodes show issues." \
--name "daily-k8s-check" \
--skill "sre-k8s-pod-health" \
--deliver telegram
# Trigger it
hermes cron run <job-id>
hermes cron tick
The diagnosis report now arrives in your Telegram DM instead of a local file.
Sub-step 5f: Default agent vs profile agent
When you use hermes -p track-c chat, Hermes loads Kiran's identity (SOUL.md),
config, and skills. The gateway runs the default agent -- it has access to
ALL globally installed skills (everything in ~/.hermes/skills/) but no profile-specific
identity.
| Mode | Identity | Skills Available | Use When |
|---|---|---|---|
| hermes -p track-c chat | Kiran (SOUL.md) | Profile skills only | Interactive investigation with behavioral rules |
| hermes chat (no profile) | Default Hermes | All global skills | General-purpose agent |
| Gateway (Telegram/Slack) | Default Hermes | All global skills | Chat interface from phone |
| Cron jobs | No identity | Global skills only | Scheduled automation |
The gateway currently runs as the default agent. To give it Kiran's identity for all Telegram interactions, you could copy the Track C SOUL.md into the global config:
cp ~/.hermes/profiles/track-c/SOUL.md ~/.hermes/SOUL.md
This makes the gateway agent behave like Kiran (with NEVER rules and K8s expertise) for all chat platforms. Remove it to restore the default agent.
In production, you might want separate Telegram bots for different agents — one for the K8s health agent (Kiran), another for the FinOps agent, and a general-purpose bot.
Each Hermes profile can run its own gateway on a different port:
# Terminal 1: Kiran (Track C) on port 8644
hermes -p track-c gateway run
# Terminal 2: Default agent on port 8645
WEBHOOK_PORT=8645 hermes gateway run
Each gateway needs its own Telegram bot token (create multiple bots via @BotFather).
Set different TELEGRAM_BOT_TOKEN values in each profile's .env file
(~/.hermes/profiles/track-c/.env vs ~/.hermes/.env).
This gives you dedicated bots with different identities, skills, and behavioral rules — all running simultaneously.
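For example, with placeholder token values:
# ~/.hermes/profiles/track-c/.env -- token for Kiran's dedicated bot
TELEGRAM_BOT_TOKEN="123456:ABC-KiranBot..."

# ~/.hermes/.env -- token for the default agent's bot
TELEGRAM_BOT_TOKEN="654321:XYZ-DefaultBot..."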
Each conversation maintains session context within the gateway's lifetime. Follow-up messages like "what about the kube-system namespace?" work because the agent remembers the prior exchange. Restarting the gateway resets all sessions.
Cleanup
hermes gateway stop
Step 6: Slack Integration (Optional -- 15 min)
Slack is the production-grade chat platform for most DevOps teams. This step is optional because it requires Slack workspace admin access to create an app. If you have admin access, this gives you the same conversational agent experience as Telegram but inside your team's Slack workspace.
Skip this step if:
- You don't have Slack workspace admin access
- You're a solo Udemy learner without a team Slack
- You've already demonstrated the chat interface via Telegram
The Telegram bot from Step 5 covers the same learning objectives. Slack adds production realism but is not required for course completion.
Sub-step 6a: Create a Slack App (~5 min)
1. Go to https://api.slack.com/apps → Create New App → From Scratch
2. Name it (e.g., Hermes Lab Bot) and select your workspace
3. Enable Socket Mode: Settings → Socket Mode → Enable → Create App-Level Token with scope connections:write → copy the xapp-... token
4. Add Bot Token Scopes: Features → OAuth & Permissions → Bot Token Scopes: chat:write, app_mentions:read, channels:history, channels:read, groups:history (optional, for private channels), im:history, im:read, im:write, users:read, files:write
5. Subscribe to Events: Features → Event Subscriptions → Enable: message.im, message.channels, app_mention, message.groups (optional)
6. Install to Workspace: Settings → Install App → copy the xoxb-... token
7. Reinstall if you changed scopes or events after the initial install
8. Find your user ID: click your profile → three dots → Copy member ID
9. Invite the bot: in your channel, type /invite @HermesLabBot
Without message.channels, the bot ONLY works in DMs. Add both message.channels and app_mention to enable channel interaction.
Sub-step 6b: Configure Slack via hermes gateway setup
Just like Telegram, use the interactive setup:
hermes gateway stop
hermes gateway setup
Select Slack from the platform list and enter:
- Bot Token (xoxb-...) from step 6
- App Token (xapp-...) from step 3
- Allowed user IDs -- your member ID from step 8
- Home channel ID -- optional, or set later with /sethome in Slack
Say yes to restart the gateway. Both Telegram and Slack can run simultaneously.
Expected output after restart:
✓ telegram connected
✓ slack connected
Sub-step 6c: Talk to the agent in Slack
DM the bot:
Check all pods in the default namespace and tell me if anything looks unhealthy
The agent runs kubectl against your KIND cluster and replies in the DM.
In a channel (@ mention):
@HermesLabBot diagnose pods in k8s-trouble-crashloop namespace
The bot replies in a thread under your message.
Set home channel for delivery:
Type /sethome in the Slack channel where you want cron reports delivered. This works
identically to the Telegram /sethome command.
Sub-step 6d: Cleanup
hermes gateway stop
| Aspect | Telegram | Slack |
|---|---|---|
| Setup | 2 min via @BotFather | 5 min via api.slack.com, needs admin |
| Access control | User ID allowlist | User ID allowlist + workspace boundary |
| Channels | Groups (limited) | Full channel support with threads |
| Threading | Flat conversation | Thread replies under trigger message |
| Best for | Personal/lab use, solo learners | Team environments, on-call workflows |
In production, most teams use Slack because it integrates with their existing incident response workflow (PagerDuty → Slack channel → agent responds in thread). Telegram is excellent for personal agents and labs where Slack admin access is not available.
Verification Checklist (5 min)
Run these commands to confirm all 5 trigger types completed successfully:
# 1. Cron -- scheduler running and daily job registered
hermes cron status
# Expected: Scheduler: running, daily-k8s-check listed
hermes cron list
# Expected: daily-k8s-check 0 8 * * * sre-k8s-pod-health scheduled
# 2. AlertManager -- Prometheus stack and PrometheusRule deployed
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
# Expected: alertmanager-monitoring-kube-prometheus-alertmanager-0 Running
kubectl get prometheusrule -n monitoring hermes-lab-rules
# Expected: row exists
# 3. K8s CronJob -- image built and loaded
docker images hermes-lab:cronjob
# Expected: REPOSITORY: hermes-lab, TAG: cronjob
# 4. GitHub webhook -- smee-setup.sh is executable
test -x infrastructure/scenarios/k8s/github-webhook/smee-setup.sh && echo OK
# Expected: OK
# Verify sample PR payload is valid JSON
jq . infrastructure/scenarios/k8s/github-webhook/sample-pr-payload.json > /dev/null && echo OK
# Expected: OK
# 5. Telegram -- python-telegram-bot installed
python3 -c "from telegram import Bot; print('telegram OK')"
# Expected: telegram OK
- Hermes cron job created, triggered, paused, resumed
- AlertManager fired PodCrashLooping alert and agent diagnosed automatically
- K8s CronJob ran containerized agent on schedule
- GitHub webhook forwarded PR event and agent posted review (or Solo Learner fallback tested)
- Telegram bot responded to /help, /diagnose, AND free-form conversational prompts
- (Optional) Slack bot responded to DMs and channel @ mentions
FREE EXPLORE PHASE -- 20 minutes
Choose challenges based on your available time and experience level.
Challenge 1 (Starter -- 10 min): Cross-namespace cron
Create a cron job that checks pods across both kube-system and default namespaces:
hermes cron create "*/5 * * * *" \
"Run pod health check across kube-system and default namespaces. Report pod counts, restart counts, and any non-Running pods." \
--name "cross-namespace-check" \
--skill "sre-k8s-pod-health" \
--deliver local
Trigger it manually to see the output:
hermes cron run <job-id> # use the ID from hermes cron list
Questions to explore:
- How many pods are running in kube-system vs default?
- Does the agent correctly identify which namespace each pod belongs to?
- What happens if you add a third namespace to the prompt?
Challenge 2 (Intermediate -- 10 min): AlertManager to Telegram notification chain
Wire AlertManager alerts to your Telegram bot instead of the terminal:
# Stop the gateway; wait for Telegram long polling to release (avoids 409 Conflict)
hermes gateway stop
sleep 30
# Only needed if you skipped hermes gateway setup in Step 5:
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_ALLOWED_USERS="987654321"
# Point the alertmanager subscription at Telegram delivery
# (AlertManager subscriptions are edited in the file directly -- see Step 2c)
cat > ~/.hermes/webhook_subscriptions.json << 'EOF'
{
  "alertmanager": {
    "description": "AlertManager PodCrashLooping webhook",
    "events": [],
    "secret": "INSECURE_NO_AUTH",
    "prompt": "AlertManager alert fired. Details: {alerts}. Diagnose the affected pod.",
    "skills": ["sre-k8s-pod-health"],
    "deliver": "telegram"
  }
}
EOF
hermes gateway run
Apply the crashloop scenario and watch for the notification on your phone:
kubectl apply -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
Within ~2 minutes, your Telegram bot should send you the diagnosis automatically.
Cleanup:
kubectl delete -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
hermes webhook unsubscribe alertmanager
Challenge 3 (Advanced -- 10 min): GitHub PR triggers full diagnosis and posts report
Combine the GitHub webhook with a Kubernetes-aware prompt so the agent checks cluster health when a PR touches K8s manifests:
hermes webhook subscribe github \
--events "pull_request" \
--prompt "PR #{pull_request.number} on {repository.full_name} modifies infrastructure. Run a full K8s health check across all namespaces and post the results as a PR comment." \
--skill "sre-k8s-pod-health" \
--deliver github_comment \
--deliver-chat-id "{repository.full_name}:{pull_request.number}"
Open a PR that modifies a YAML file in your test repo. The agent should:
- Receive the PR event via smee.io
- Run a full cluster health check against your KIND cluster
- Post the health report as a comment on the PR
Questions to explore:
- Does the agent correctly correlate the PR content with cluster state?
- What changes in the report if you have a crashloop pod running vs a clean cluster?
Closing
What you built in this lab:
- A cron-scheduled health check that fires at 8 AM daily, loads a domain skill, and delivers findings to your terminal (or Slack/Telegram in production)
- A real Prometheus + AlertManager pipeline on KIND firing on a broken pod, with the agent receiving the alert and diagnosing without manual invocation
- A K8s CronJob running a containerized Hermes agent on a schedule, with explicit "use this when..." framing for Hermes cron vs K8s CronJob
- A GitHub webhook via smee.io routing real PR events to Hermes, with the agent posting review comments back via the built-in github_comment delivery type
- A Telegram bot you can poke from your phone, with slash commands and immediate agent responses
Key commands reference:
# Cron management
hermes cron status # Always run at session start
hermes cron create "<schedule>" "<prompt>" --name ... --skill ... --deliver local
hermes cron list
hermes cron run <job-id> # Manual fire (use ID from cron list, not name)
hermes cron pause <job-id>
hermes cron resume <job-id>
hermes cron delete <name>
# Webhook management
hermes gateway run # Start the webhook gateway
curl http://localhost:8644/health # Verify endpoint is live
hermes webhook list
hermes webhook test <name> --payload '{"key": "value"}'
hermes webhook unsubscribe <name>
# AlertManager: edit ~/.hermes/webhook_subscriptions.json directly
# (set secret=INSECURE_NO_AUTH, events=[], deliver=log)
# Trigger-specific commands
# AlertManager webhook: configure via webhook_subscriptions.json (see Step 2c)
hermes webhook subscribe github --events "pull_request" --prompt "..." --deliver github_comment
kubectl apply -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-c
./infrastructure/scenarios/k8s/github-webhook/smee-setup.sh
hermes gateway run # With TELEGRAM_BOT_TOKEN set, activates Telegram adapter
hermes gateway stop # Stop the gateway before restarting with a new configuration
Next: Module 13 covers governance -- approval workflows, maturity levels, and audit trails. The cron and webhook triggers you built here become the entry points for governed agent actions in Module 13.