Module 11 Lab: Triggers and Scheduling
Duration: 145 minutes (120 min guided + 25 min free explore)
Prerequisites: Module 10 agent working for your track, Hermes gateway available
Outcome: Working examples of all six trigger patterns: Hermes cron, simulated CloudWatch webhook, real AlertManager webhook on KIND, K8s CronJob, GitHub webhook via smee.io, and a Telegram bot — each agent invocation observed end-to-end
This lab moves your Hermes agent from reactive (you type a prompt) to proactive (the agent runs on a schedule or reacts to events). You will build two trigger mechanisms first: a cron schedule for daily health checks and a webhook subscription for alert-driven investigations. Later steps wire up the remaining patterns: a real AlertManager pipeline, a K8s CronJob, a GitHub webhook, and a Telegram bot.
Both exercises are fully hands-on — you run the commands, see the output, and verify the behavior yourself.
GUIDED PHASE — 120 minutes
Step 1: Morning Startup Sequence (5 min)
WARNING: Cron jobs do NOT auto-recover after a laptop sleep or KIND cluster restart. If you closed your laptop overnight or ran kind delete cluster, your scheduled jobs appear registered but will never fire. Always run hermes cron status first — before assuming any cron job is active.
Startup checklist
1. Check scheduler health:
hermes cron status
Expected output when healthy:
Scheduler: running
Jobs registered: 0
Next tick: in ~60s
If you see Scheduler: not running or an error about the gateway not being available:
hermes gateway setup
Then re-run hermes cron status to confirm the scheduler is now running.
2. Verify any existing jobs are still registered:
hermes cron list
If you created jobs in a previous session, they should appear here. If the list is empty after you expected jobs to be present, the cron store was reset — you will need to recreate them (Step 2).
The cron scheduler runs as part of the Hermes gateway process. If the
gateway was stopped (laptop sleep, restart, terminal close), the scheduler stops ticking.
Jobs persist in ~/.hermes/cron/jobs.json, but they will not run until the gateway is
active again. The hermes cron status check confirms the scheduler is alive and watching
for due jobs.
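If you are curious what that persistence looks like, you can peek at the job store directly. The path comes from the note above; the exact JSON schema is an assumption, so treat the output as informational only:
# Peek at the persisted cron jobs (schema may differ from this sketch's expectations)
jq . ~/.hermes/cron/jobs.json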
Step 2: Create a Daily Health Check Cron Job (10 min)
Cron expressions use the standard five-field format: minute hour day-of-month month day-of-week
| Expression | Meaning |
|---|---|
| 0 8 * * * | 8:00 AM every day |
| 30 9 * * 1-5 | 9:30 AM weekdays only |
| */5 * * * * | Every 5 minutes |
| 0 0 * * 0 | Midnight every Sunday |
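If you want to sanity-check an expression before scheduling it, one option (an assumption: the third-party croniter package, installable with pip install croniter, is not part of Hermes) is to preview the next fire times:
# Preview the next three fire times for a cron expression (assumes croniter is installed)
python3 - <<'EOF'
from datetime import datetime
from croniter import croniter

itr = croniter("0 8 * * *", datetime.now())
for _ in range(3):
    print(itr.get_next(datetime))  # prints the next three 8:00 AM occurrences
EOF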
Create your daily health check cron job. Run the command for your track:
Track A — Database Health:
hermes cron create "0 8 * * *" \
"Run daily health check for RDS. Report only if anomalies found." \
--name "daily-db-health" \
--skill "dba-rds-slow-query" \
--deliver local
Track B — Cost Anomaly:
hermes cron create "0 8 * * *" \
"Run daily cost anomaly check. Report only if spending anomalies detected." \
--name "daily-cost-check" \
--skill "cost-anomaly" \
--deliver local
Track C — Kubernetes Health:
hermes cron create "0 8 * * *" \
"Run daily Kubernetes cluster health check. Report only if pods or nodes show issues." \
--name "daily-k8s-check" \
--skill "kubernetes-health" \
--deliver local
What each argument does
| Argument | Purpose |
|---|---|
schedule (1st positional) | Cron expression defining when the job runs (e.g. "0 8 * * *", "*/5 * * * *", or a shorthand like "30m", "every 2h"). |
prompt (2nd positional) | What the agent is asked to do when it fires. Must be self-contained — the cron agent has no chat history. |
--name | Human-readable job name (kebab-case). Used to reference the job in other commands. |
--skill | Skill to load before running the prompt. The agent reads the SKILL.md runbook first. Repeat the flag to attach multiple skills. |
--deliver local | Output goes to your terminal. In production, use --deliver slack or --deliver telegram to route findings to your notification system. |
--repeat N | Optional — limits the job to N executions. Omit for indefinite scheduling. |
schedule must come first, prompt second. You can put the --name, --skill,
and --deliver flags anywhere (before, between, or after the positionals) since
they're keyword arguments.
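For example, this is equivalent to the Track A command above, with the flags moved in front of the positionals:
hermes cron create --name "daily-db-health" --skill "dba-rds-slow-query" --deliver local \
  "0 8 * * *" "Run daily health check for RDS. Report only if anomalies found."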
Verify the job was registered
hermes cron list
Expected output shape:
Name Schedule Next Run Skill State
daily-db-health 0 8 * * * 2026-04-05 08:00:00 dba-rds-slow-query scheduled
Using --deliver local routes the agent's output to your terminal
session. This is the right choice for lab work — you see output immediately without needing
to configure Slack or Telegram. In production, you would use --deliver slack (configured
in ~/.hermes/config.yaml) or --deliver telegram so findings reach your on-call channel
even if you are not at your terminal.
Step 3: Trigger Manually and Verify Output (10 min)
You scheduled the job for 8 AM — but you do not need to wait. Manual trigger fires the job immediately and is your primary verification tool:
hermes cron run <job-id> # use the ID from hermes cron list
(Use the job ID shown by hermes cron list for the job you created: daily-db-health for Track A, daily-cost-check for Track B, daily-k8s-check for Track C.)
Watch the terminal. The agent will:
- Load the skill SKILL.md runbook
- Run the investigation using mock data (if HERMES_LAB_MODE=mock is set)
- Print its findings to the terminal
Expected output shape:
[Cron] Firing job: daily-db-health
[MOCK MODE] Running dba-rds-slow-query investigation...
Daily Health Check — prod-db-01
Status: HEALTHY
No slow query anomalies detected above threshold.
pg_stat_statements: 12 queries sampled, max mean_time_ms = 45ms (threshold: 500ms)
[SILENT] (no anomalies to report)
When the cron agent finds nothing to report, it responds with [SILENT].
This suppresses delivery — you will not receive a Slack or Telegram notification. This is by
design: agents that cry wolf on every run lose their usefulness. The agent only delivers a
full report when it finds something worth reporting.
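Conceptually, the delivery gate behaves like this shell sketch (illustrative pseudocode, not Hermes source):
# Illustrative sketch of [SILENT] suppression (not the Hermes implementation)
agent_output="[SILENT]"                 # whatever the agent returned this run
if [ "$agent_output" = "[SILENT]" ]; then
  :  # suppress remote delivery; --deliver local still prints the marker
else
  send_to_channel "$agent_output"       # hypothetical delivery function
fi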
If you suspect a cron job silently failed overnight (Step 1 warning), trigger it manually to confirm the job still executes correctly. A successful manual trigger proves the skill, prompt, and delivery path are all working — the only missing piece was the scheduler tick.
Confirm the run was recorded:
hermes cron list
Check that Last Run now shows today's timestamp.
Step 4: Pause, Resume, and Status (5 min)
Pause the job without deleting it:
hermes cron pause <job-id>
Check the paused state:
hermes cron status
Expected output shows the job in paused state:
Scheduler: running
Jobs registered: 1
daily-db-health PAUSED
Next tick: in ~43s
Resume the job:
hermes cron resume <job-id>
Verify it is back to scheduled state:
hermes cron status
Expected:
Scheduler: running
Jobs registered: 1
daily-db-health scheduled next: 2026-04-05 08:00:00
Next tick: in ~51s
Pause and resume is how you stop overnight runs without deleting the job configuration. Use pause when you are doing maintenance, running a planned load test, or temporarily silencing a job that is generating too many alerts. Deleting and recreating a job is the wrong approach — you lose the configuration and have to remember all the flags.
Step 5: Start the Webhook Gateway (5 min)
Set up the webhook platform:
hermes gateway setup
Follow the prompts. When asked about webhooks, enable them and accept the default port (8644).
Verify the endpoint is live:
curl http://localhost:8644/health
Expected response:
{"status": "ok"}
If curl returns Connection refused, the gateway did not start on that port. Check for an
existing process:
lsof -i :8644
If a process is listed, either stop it or use a different port when running hermes gateway setup.
If curl returns an error about the connection being reset (not refused), the gateway is
running but the webhook adapter is not enabled. Re-run hermes gateway setup and confirm
that webhooks are enabled in the prompt sequence.
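If you script your lab setup, a small wait loop (a convenience sketch, assuming the default port) avoids racing the gateway startup:
# Poll until the gateway health endpoint responds (assumes default port 8644)
until curl -sf http://localhost:8644/health > /dev/null; do
  echo "waiting for gateway..."
  sleep 2
done
echo "gateway is up"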
Step 6: Subscribe a Webhook for CloudWatch Alerts (10 min)
A webhook subscription tells Hermes: "when a POST arrives at this route, fire an agent run using this prompt."
Subscribe to CloudWatch alarm events:
hermes webhook subscribe cloudwatch-alerts \
--events "cloudwatch-alarm" \
--prompt "CloudWatch alert received: {alarm.name} is {alarm.state}. Investigate." \
--deliver local
What each part means
| Part | Purpose |
|---|---|
cloudwatch-alerts | The route name. Hermes creates an endpoint at /webhooks/cloudwatch-alerts. |
--events "cloudwatch-alarm" | Event type filter. Only payloads matching this event type trigger the agent. |
--prompt "..." | Template string. {alarm.name} and {alarm.state} are replaced with values from the incoming JSON payload. |
--deliver local | Route agent output to the terminal. |
After running the command, Hermes prints the webhook URL and HMAC secret:
Subscription created: cloudwatch-alerts
URL: http://localhost:8644/webhooks/cloudwatch-alerts
Secret: <auto-generated HMAC-SHA256 secret>
Event filter: cloudwatch-alarm
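That secret is what a production sender would use to sign the POST body. Here is a sketch of such a signed request; the signature header name (X-Hermes-Signature) is an assumption, since the lab does not document the exact header Hermes checks:
# Sketch: HMAC-signed POST as an external sender might issue it (header name is an assumption)
BODY='{"alarm": {"name": "rds-cpu-high", "state": "ALARM"}}'
SIG=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$WEBHOOK_SECRET" -hex | awk '{print $2}')
curl -X POST "https://your-public-host.example.com/webhooks/cloudwatch-alerts" \
  -H "Content-Type: application/json" \
  -H "X-Hermes-Signature: sha256=$SIG" \
  -d "$BODY"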
Verify the subscription is listed:
hermes webhook list
Expected output:
Name Route Events Deliver
cloudwatch-alerts /webhooks/cloudwatch-alerts cloudwatch-alarm local
In production, you would configure CloudWatch SNS to POST to your public webhook URL (not localhost) with the HMAC secret for signature verification. For this lab, you simulate the POST locally in the next step.
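Before you test, it helps to see exactly how the {alarm.name} placeholders resolve. The following standalone snippet re-implements dot-notation substitution for illustration (a teaching sketch, not Hermes source code):
# Illustration only: resolve {a.b} placeholders against a JSON payload
python3 - <<'EOF'
import json, re

payload = json.loads('{"alarm": {"name": "rds-cpu-high", "state": "ALARM"}}')
template = "CloudWatch alert received: {alarm.name} is {alarm.state}. Investigate."

def resolve(match):
    obj = payload
    for key in match.group(1).split("."):
        obj = obj[key]  # dot-notation: each segment is a dict key
    return str(obj)

print(re.sub(r"\{([^}]+)\}", resolve, template))
# -> CloudWatch alert received: rds-cpu-high is ALARM. Investigate.
EOF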
Step 7: Test the Webhook with a Simulated Alert (10 min)
Simulate a CloudWatch alarm firing — no real AWS needed:
hermes webhook test cloudwatch-alerts \
--payload '{"alarm": {"name": "rds-cpu-high", "state": "ALARM"}}'
Watch the terminal. The sequence:
- Hermes receives the simulated POST to /webhooks/cloudwatch-alerts
- The payload {"alarm": {"name": "rds-cpu-high", "state": "ALARM"}} is matched against the prompt template
- The resolved prompt becomes: CloudWatch alert received: rds-cpu-high is ALARM. Investigate.
- Hermes fires an agent run with your track skill loaded
- The agent investigates using mock data and prints findings to the terminal
Expected output shape:
[Webhook] Received cloudwatch-alarm event on cloudwatch-alerts
[MOCK MODE] Investigating: CloudWatch alert received: rds-cpu-high is ALARM. Investigate.
Alert Investigation — rds-cpu-high
State: ALARM | CPUUtilization: 78.4%
Finding: Sequential scan detected on users table (created_at column, 12,847 rows).
Recommendation: CREATE INDEX CONCURRENTLY idx_users_created_at ON users (created_at)
Action required: REQUIRES-DBA-APPROVAL
No additional anomalies found.
You now have both trigger mechanisms running:
- Cron asks "is anything wrong?" It fires on a schedule whether or not an alarm has fired. Use it for proactive health checks and daily summaries.
- Webhook reacts to "something IS wrong." It fires in response to an external event — a CloudWatch alarm, a Kubernetes pod eviction, a Stripe payment failure. Use it for incident response automation where latency matters.
Both can load the same skill and run the same investigation prompt. The difference is timing and trigger: scheduled vs event-driven.
Try a different alarm in the payload:
hermes webhook test cloudwatch-alerts \
--payload '{"alarm": {"name": "rds-storage-low", "state": "ALARM"}}'
Observe how the resolved prompt changes: CloudWatch alert received: rds-storage-low is ALARM. Investigate.
Step 8: Slack — What This Looks Like in Production (5 min)
Slack bot configuration requires admin access to your Slack workspace.
If you are following this lab solo, skip the Slack config steps — you have already done the
hands-on equivalent with --deliver local.
In production, replacing --deliver local with --deliver slack routes the agent's findings
to your #devops-alerts channel automatically. Here is what that configuration looks like:
Slack config format (requires Slack admin to add bot)
# In ~/.hermes/config.yaml
notifications:
slack:
webhook_url: "https://hooks.slack.com/services/T.../B.../..."
channel: "#devops-alerts"
What changes
Only the delivery target changes — the skill, prompt, and investigation logic are identical:
# Lab version (what you ran above):
hermes cron create "0 8 * * *" \
"Run daily health check for RDS. Report only if anomalies found." \
--name "daily-db-health" \
--skill "dba-rds-slow-query" \
--deliver local
# Production version (Slack delivery):
hermes cron create "0 8 * * *" \
"Run daily health check for RDS. Report only if anomalies found." \
--name "daily-db-health" \
--skill "dba-rds-slow-query" \
--deliver slack
What the Slack bot message looks like
When a cron job or webhook fires with --deliver slack, the Hermes bot posts to the configured
channel:
Cronjob Response: daily-db-health
-------------
Daily Health Check — prod-db-01
Status: ALERT
Finding: Sequential scan on users table...
Recommendation: CREATE INDEX CONCURRENTLY...
Note: The agent cannot see this message, and therefore cannot respond to it.
--deliver local is the same pipeline — skill
loaded, agent runs, investigation executes — but output goes to your terminal instead of
Slack. Everything you practiced today is the production workflow. Switching to Slack is a
one-flag change once Slack admin has added the bot.
Step 9: AlertManager — Real Prometheus Stack Setup (10 min)
You have used hermes webhook test to fire simulated webhooks (Step 7). Now you wire a REAL alert source: the Prometheus + AlertManager stack on your KIND cluster, firing on a real broken pod from Phase 6.
This is the moment Hermes stops being a chat agent and starts being an incident-response agent — alerts arrive without you typing anything.
This step requires a running KIND cluster with kube-prometheus-stack already installed (from Phase 1 setup). If you skipped Phase 1's KIND setup, jump to the "Skipping AlertManager" callout at the bottom of this step — TRIG-02 (the K8s CronJob) also needs KIND, but you can still complete TRIG-03 and TRIG-04 without TRIG-01.
Set environment variables
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
Enable AlertManager in the helm release
The Phase 1 helm values disabled AlertManager to save resources. Phase 8 flips it back on.
# Verify the helm values now have alertmanager.enabled: true
yq '.alertmanager.enabled' infrastructure/helm/prometheus-lab-values.yaml
# Expected output: true
# Apply the updated values
helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
-f infrastructure/helm/prometheus-lab-values.yaml
# Wait for AlertManager pods to reach Ready
kubectl wait --for=condition=Ready pod \
-l app.kubernetes.io/name=alertmanager \
-n monitoring --timeout=120s
Expected output:
pod/alertmanager-kube-prometheus-alertmanager-0 condition met
Apply the PrometheusRule
kubectl apply -f infrastructure/scenarios/k8s/alertmanager/prometheus-rules.yaml
Expected output:
prometheusrule.monitoring.coreos.com/hermes-lab-rules created
Verify the rule loaded
kubectl get prometheusrule -n monitoring hermes-lab-rules -o yaml | yq '.spec.groups[0].rules[0].alert'
Expected output:
PodCrashLooping
Open the Prometheus UI at http://localhost:30091 → Status → Rules. You should see hermes-lab.k8s-crashloop group with the PodCrashLooping rule listed and state "inactive" (no broken pods YET).
The release label matters
The PrometheusRule manifest includes a labels: release: ... field. The kube-prometheus-stack Helm chart configures Prometheus to ONLY load rules whose release label matches the Helm release name. In this lab the release is named kube-prometheus (from the helm upgrade --install kube-prometheus ... command above), so the label must be release: kube-prometheus. Without it, your rule would be silently ignored — kubectl get prometheusrule would show it, but the Prometheus UI Rules page would not. Verify the expected label with: kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.ruleSelector}'
Skipping AlertManager (no KIND)
If you don't have a KIND cluster running, skip to Step 12 (GitHub webhook). TRIG-02 (K8s CronJob) also requires KIND, so you'll skip Step 11 as well. All four trigger types are independent: missing one doesn't block the others.
Step 10: AlertManager — Fire and Observe (10 min)
Now wire the full end-to-end flow: gateway running, alertmanager subscription active, crashloop2 pod applied — then watch the alert fire and the agent diagnose automatically.
Set environment variables
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
Start the gateway and subscribe
Run the following in separate terminals:
# Terminal 1: Start the gateway in the foreground so you see live POSTs arrive
hermes gateway run
# Terminal 2: Subscribe the alertmanager webhook
hermes webhook subscribe alertmanager \
--events "alertmanager-alert" \
--prompt "AlertManager PodCrashLooping alert fired. Details: {alerts}. Load the sre-k8s-pod-health skill and diagnose the affected pod in the namespace shown in the alert labels." \
--skill "sre-k8s-pod-health" \
--deliver local
# Verify the subscription is active
hermes webhook list
Use {alerts}, not {alerts[0].labels.pod}
Notice the {alerts} placeholder in the subscribe command above — NOT {alerts[0].labels.pod}. The Hermes prompt template only supports dot-notation access to dict keys, NOT array index access. The agent receives the full alerts[] JSON array as a string and parses it to find the pod and namespace. If you accidentally use {alerts[0].labels.pod}, it will render as a literal string — the agent will see that text instead of the pod name.
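For context, the payload that {alerts} expands from follows the standard AlertManager webhook format (version 4). The shape below is representative; the label values are examples, not literal output from your cluster:
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {
      "labels": {
        "alertname": "PodCrashLooping",
        "namespace": "k8s-trouble-crashloop",
        "pod": "crashloop-demo-abc123"
      },
      "startsAt": "2026-04-05T08:02:00Z"
    }
  ]
}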
Apply the Phase 6 crashloop2 scenario
# Terminal 3: Apply the broken pod (reuses Phase 6 manifest — do NOT modify it)
kubectl apply -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
# Watch the pod restart count climb
watch kubectl get pods -n k8s-trouble-crashloop
Expected timeline
- t=0s: pod applied, status ContainerCreating
- t=10s: pod status CrashLoopBackOff, restartCount=1
- t=60s: restartCount=4-6
- t=90s: PromQL increase() exceeds 2 over the 2-min window
- t=120s: AlertManager dispatches; Terminal 1 shows Received alertmanager-alert event
- t=125s: Hermes spawns agent run; sre-k8s-pod-health loads; diagnosis appears
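The alerting expression behind the t=90s step is, in essence, a restart-rate check. The real rule lives in infrastructure/scenarios/k8s/alertmanager/prometheus-rules.yaml; the expression below is a representative sketch using the standard kube-state-metrics restart counter, not a copy of that file:
# Representative PromQL for a PodCrashLooping alert (sketch only)
increase(kube_pod_container_status_restarts_total{namespace="k8s-trouble-crashloop"}[2m]) > 2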
Open AlertManager UI at http://localhost:30093 to see the active alert and its receiver routing.
Cleanup after observing
kubectl delete -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
hermes webhook unsubscribe alertmanager
Step 11: K8s CronJob — Same Agent, Different Trigger Mechanism (10 min)
The Hermes cron jobs you built in Steps 2-4 are the production pattern for most agent work. This step demonstrates the SAME agent wrapped in a native K8s CronJob — and makes explicit WHEN each pattern is the right answer.
Set environment variables
export HERMES_LAB_MODE=mock
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
Build the minimal hermes-agent image
# Build the minimal hermes-agent container image
# (This takes 5-10 min — the image is ~700-900MB of Python deps)
docker build -t hermes-lab:cronjob infrastructure/scenarios/k8s/cronjob/
# Verify the image exists
docker images | grep hermes-lab
The infrastructure/scenarios/k8s/cronjob/Dockerfile uses python:3.12-slim as a base, not the official nousresearch/hermes-agent:latest (which is 2-3GB with Playwright/ffmpeg). This minimal image is a teaching artifact about packaging agents for K8s.
Load the image into KIND and create the API key secret
# Load into KIND (required — the CronJob uses imagePullPolicy: IfNotPresent, not a registry)
kind load docker-image hermes-lab:cronjob --name lab
# Create the API key secret (NEVER commit this token — it lives only in your local KIND cluster)
kubectl create secret generic hermes-secrets \
--from-literal=anthropic-api-key="$ANTHROPIC_API_KEY"
# Verify
kubectl get secret hermes-secrets
Apply the CronJob for your track
# Track A: kubectl apply -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-a
# Track B: kubectl apply -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-b
# Track C:
kubectl apply -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-c
# Watch jobs spawn (schedule is */5 * * * * — wait up to 5 min for the first run)
watch kubectl get jobs,pods
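While you wait for the first run, it helps to see the manifest's rough shape. The sketch below reconstructs what agent-health-check.yaml plausibly contains from the commands in this step (the image, secret name, schedule, label, and imagePullPolicy are taken from above; all other fields are assumptions, so consult the real file for specifics):
# Sketch only — reconstructed from this step's commands, not a copy of agent-health-check.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hermes-track-c-health
  labels:
    track: track-c
spec:
  schedule: "*/5 * * * *"              # matches the schedule noted above
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: agent
              image: hermes-lab:cronjob      # the image you built and loaded into KIND
              imagePullPolicy: IfNotPresent  # why `kind load docker-image` is required
              env:
                - name: ANTHROPIC_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: hermes-secrets   # the secret created above
                      key: anthropic-api-key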
View logs from the first completed job
kubectl logs -l job-name=$(kubectl get jobs -o jsonpath='{.items[-1].metadata.name}')
Use Hermes cron when:
- The agent benefits from gateway-shared state (loaded skills, audit trail, conversation history)
- You want one-stop CLI management (hermes cron create/list/run/pause/resume)
- You're iterating fast — tweak a prompt, re-register the cron, done. No image rebuild cycle.
- You need audit-trail context linking cron runs to skill and prompt versions
- You're not (yet) in Kubernetes
Use K8s CronJob when:
- Stateless one-shot diagnostics — no state needed from previous runs
- GitOps schedule-in-git — you want the schedule reviewed via PR and deployed via ArgoCD/Flux
- K8s-native observability — Prometheus kube_job_status_* metrics, kubectl get jobs, Loki logs
- Multi-tenant resource quotas — namespace isolation, NetworkPolicies, resource quotas, Secrets
The honest real-world stance: most agent work uses Hermes cron because state matters. K8s CronJob shines for fire-and-forget diagnostic jobs deployed alongside other K8s primitives via the same GitOps pipeline.
Cleanup
kubectl delete -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-c
kubectl delete secret hermes-secrets
Step 12: GitHub Webhook — smee.io Setup (10 min)
GitHub webhooks need a public HTTPS endpoint to POST to. smee.io is a free public webhook proxy: you get a unique channel URL, GitHub POSTs to it, and a smee-client on your laptop forwards events to your local Hermes gateway.
This step requires a personal GitHub repo and a GitHub PAT. If you don't have these, skip to Step 13's Solo Learner fallback section — you can simulate the full GitHub webhook flow without any external service using the bundled sample PR payload.
Set environment variables (including new Phase 8 TRIG-03 vars)
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
# Phase 8 NEW (TRIG-03):
export GITHUB_TOKEN="ghp_..." # Your PAT — see Get a GitHub PAT below
export SMEE_URL="https://smee.io/your-channel-id" # Your smee.io channel — see step 1 below
Step 1: Get a smee.io channel URL
Visit https://smee.io/ in your browser and click "Start a new channel". Copy the URL — it looks like https://smee.io/abc123XYZ. Set it in the env var above.
Step 2: Get a GitHub PAT with repo scope
- Open https://github.com/settings/tokens → "Generate new token" → "Generate new token (classic)"
- Note: hermes-lab-trig03, Expiration: 30 days
- Scopes: check repo (includes read+write to PRs and comments)
- Click "Generate token" → copy the ghp_... value
# Authenticate gh CLI with your PAT
gh auth login --with-token <<< "$GITHUB_TOKEN"
gh auth status # Should show "Logged in to github.com as <you>"
Use classic PAT with repo scope — simplest path. Fine-grained PATs work too but require selecting "Pull requests: Read and Write" as a specific permission, which is a common source of 403 errors (see Phase 8 Research Pitfall 7).
Step 3: Start the gateway and smee-client
# Terminal 1: Start the Hermes gateway
hermes gateway run
# Terminal 2: Run the smee setup script (foreground — leave it running)
./infrastructure/scenarios/k8s/github-webhook/smee-setup.sh
Expected smee output:
================================================================
smee.io → Hermes webhook gateway forwarder
================================================================
Source: https://smee.io/abc123XYZ
Target: http://localhost:8644/webhooks/github
Client: smee-client@5.0.0
================================================================
Forwarding https://smee.io/abc123XYZ to http://localhost:8644/webhooks/github
Step 4: Add the webhook to your GitHub repo
In your test GitHub repo: Settings → Webhooks → Add webhook
- Payload URL: $SMEE_URL (the smee.io channel URL)
- Content type: application/json
- Secret: (leave blank for the lab)
- Events: "Let me select individual events" → check Pull requests
- Active: checked
Click Add webhook. GitHub will send a test ping — you should see Forwarding event to localhost:8644/webhooks/github in the smee terminal.
The smee-setup.sh script targets http://localhost:8644/webhooks/github. When you subscribe in Step 13, use hermes webhook subscribe github (NOT github-webhook). The route name after /webhooks/ MUST match the subscription name exactly, or events arrive at the gateway but no subscription receives them (Pitfall 5 from Phase 8 Research).
Step 13: GitHub Agent Comment Back — Full Round-Trip (10 min)
Subscribe the GitHub webhook with --deliver github_comment, open a PR on your test repo, and watch the agent post a review comment automatically.
Set environment variables
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
# Phase 8 NEW (TRIG-03):
export GITHUB_TOKEN="ghp_..."
export SMEE_URL="https://smee.io/your-channel-id"
Subscribe the GitHub webhook
# Terminal 3: Subscribe with the agent prompt template and github_comment delivery
hermes webhook subscribe github \
--events "pull_request" \
--prompt "$(cat infrastructure/scenarios/k8s/github-webhook/agent-prompt-template.txt)" \
--deliver github_comment \
--deliver-chat-id "{repository.full_name}:{pull_request.number}"
# Verify the subscription is active
hermes webhook list
The --deliver github_comment type is built into Hermes (gateway/platforms/webhook.py lines 525-558). Internally it calls gh pr comment {pr_number} --repo {repo} --body "{content}". You wrote zero HTTP code — the only requirement is gh CLI installed and authenticated with your GITHUB_TOKEN.
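You can reproduce that delivery step by hand to see what the agent's comment call amounts to (the repo and PR number below are placeholders for your own test repo):
# Manual equivalent of the github_comment delivery (placeholder repo/PR values)
gh pr comment 42 --repo your-org/your-repo --body "Automated review summary goes here"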
Trigger the event
Open a PR on your test repo (or push a commit to a branch that already has an open PR).
Watch the flow:
- smee terminal → Forwarding event to localhost:8644/webhooks/github
- gateway terminal → Received github webhook event
- ~10 seconds → agent runs, generates review summary, posts comment back to the PR
- GitHub PR → refresh the PR comments — you should see the Hermes review comment
# Verify the comment landed
gh pr view <PR_NUMBER> --repo <OWNER>/<REPO> --json comments
Solo Learner fallback — no GitHub repo or smee.io needed
Skip Steps 12-13 smee setup and use the bundled sample payload instead:
# Subscribe with --deliver local instead of --deliver github_comment
hermes webhook subscribe github \
--events "pull_request" \
--prompt "$(cat infrastructure/scenarios/k8s/github-webhook/agent-prompt-template.txt)" \
--deliver local
# Inject the bundled sample payload (PR #42: feat(api): add /health readiness endpoint)
hermes webhook test github \
--payload @infrastructure/scenarios/k8s/github-webhook/sample-pr-payload.json
The agent runs identically — the review goes to your terminal instead of back to GitHub. The sample payload is a valid GitHub PR webhook structure with all the fields the prompt template references (pull_request.number, pull_request.title, repository.full_name, etc.).
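You can confirm those fields exist in the bundled payload before relying on them in a template:
# Extract the fields the prompt template references from the sample payload
jq '{number: .pull_request.number, title: .pull_request.title, repo: .repository.full_name}' \
  infrastructure/scenarios/k8s/github-webhook/sample-pr-payload.json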
Cleanup
hermes webhook unsubscribe github
Step 14: Telegram Bot Setup — @BotFather to Gateway (10 min)
Telegram is the right primary chat platform for this lab: free, no admin approval, works for every Udemy learner globally. You create a real bot via @BotFather, configure Hermes to activate the Telegram adapter, and start receiving slash commands from your phone.
Telegram bots use long polling. Only ONE bot instance can poll at a time. If you previously ran hermes gateway run in another terminal, stop it first with hermes gateway stop and wait 30 seconds before starting a new instance, or you'll get 409 Conflict errors (telegram_polling_conflict fatal error after 3 retries). This is a Telegram API restriction, not a Hermes bug.
Step 1: Create your bot via @BotFather (~2 min)
The @BotFather setup takes about 2 minutes total. You only do it once — your bot stays registered with Telegram permanently. No payment, no admin approval, no workspace required.
- Open Telegram (mobile app or https://web.telegram.org)
- Search for @BotFather (verified blue checkmark — the official bot creation bot)
- Send /newbot
- Choose a display name (e.g., Hermes Lab Bot)
- Choose a username — must end in bot (e.g., hermes_lab_yourname_bot)
- Copy the bot token from BotFather's response — looks like 123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11
- Treat this token like a password — anyone with it can impersonate your bot
Step 2: Get your Telegram user ID via @userinfobot (~30 sec)
The Telegram adapter restricts the bot to user IDs listed in TELEGRAM_ALLOWED_USERS. To find your own ID:
- In Telegram, search for @userinfobot (a public utility bot)
- Send /start
- Copy your numeric user ID (e.g., 987654321)
Step 3: Configure env vars and start the gateway
# Stop any existing gateway (Telegram polling lock)
hermes gateway stop
sleep 30
# Set the base env vars
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
# Phase 8 NEW (TRIG-04):
export TELEGRAM_BOT_TOKEN="123456:ABC..." # From @BotFather
export TELEGRAM_ALLOWED_USERS="987654321" # From @userinfobot
# Verify the python-telegram-bot package is installed (from [messaging] extra)
python3 -c "from telegram import Bot; print('telegram OK')"
# Start the gateway — look for "Telegram adapter started, polling..." in the output
hermes gateway run
Expected output includes:
Telegram adapter started, polling...
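If the adapter fails to start, you can check the token directly against the Telegram Bot API. getMe is a standard Bot API method, independent of Hermes:
# Verify the bot token works — expect {"ok": true, "result": {...}} with your bot's username
curl -s "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/getMe" | jq .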
See infrastructure/scenarios/k8s/telegram-bot/README.md for additional troubleshooting guidance and the bot-config.example.yaml reference.
Step 15: Telegram /diagnose Command — Slash Commands From Your Phone (5 min)
You've been typing into a terminal for 14 steps. Now pick up your phone, send a slash command, and watch the agent respond in Telegram.
Set environment variables
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
# Phase 8 NEW (TRIG-04):
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_ALLOWED_USERS="987654321"
Send slash commands from Telegram
- Open Telegram on your phone or in the web browser
- Search for your bot's username (the one you chose in Step 14 with @BotFather)
- Open the chat and send:
/help
Expected: bot replies with the list of available commands.
- Send the command for your track:
| Track | Command | Skill Used |
|---|---|---|
| Track A | /diagnose users | sre-dba-rds-slow-query |
| Track B | /diagnose web-tier | sre-ec2-health-check |
| Track C | /diagnose k8s-trouble-crashloop | sre-k8s-pod-health |
Watch the gateway terminal — you should see the message arrive and the agent run. Within ~10 seconds, the bot replies in the same Telegram chat thread with the agent's findings.
The Hermes Telegram adapter (gateway/platforms/telegram.py line 1549 — _handle_command) passes the entire slash command message through to the agent as the prompt. There's no per-command Python registration needed. The adapter handles ALL slash commands automatically.
Experiment: send /status to see gateway health, or send /whatever and the agent will see exactly that text and respond.
- Send /status to see current gateway state:
/status
Expected: bot replies with scheduler status, active webhook subscriptions, and current governance level.
See infrastructure/scenarios/k8s/telegram-bot/slash-command-spec.md for the full three-command spec (/diagnose, /status, /help) with per-track examples.
Step 16: Telegram Governance Escalation — Per-Process Override (5 min)
This step demonstrates that bot governance is per-PROCESS — set when the gateway starts and inherited by every command during that gateway's lifetime. To escalate, restart the gateway with a different governance level.
Set environment variables (escalated to L4)
# Stop the current gateway (releases Telegram polling lock)
hermes gateway stop
sleep 30
# Restart with L4 governance — full export block with L4
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L4
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
# Phase 8 NEW (TRIG-04):
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_ALLOWED_USERS="987654321"
hermes gateway run
Test escalated governance from Telegram
From Telegram, send a write-action command (Track C example):
/diagnose --apply k8s-trouble-crashloop
At L4, the agent CAN attempt kubectl apply (the Phase 7 wrapper allows write actions at L4). At L2 from Step 15, the same command would have triggered the GOVERNANCE REJECTED banner from the wrapper.
Verify the governance level in session logs
hermes sessions list
hermes sessions show <session-id> | grep -i governance
What happens when a non-admin tries L4 from the message text?
If a user sends /diagnose --governance L4 <arg> from Telegram while the gateway is running at L2, the agent runs at L2 — the --governance L4 text is just part of the prompt, not a directive to the wrapper. The wrapper reads HERMES_LAB_GOVERNANCE=L2 from the process env and enforces L2 allowlists.
Telegram bot governance is per-process, not per-message. To change levels, restart the gateway with a different HERMES_LAB_GOVERNANCE env var. The combination of TELEGRAM_ALLOWED_USERS (who can talk to the bot at all) and HERMES_LAB_GOVERNANCE (what they can do) is the per-context governance model — consistent with how AlertManager and K8s CronJob agents work.
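As a mental model, the wrapper's gate behaves like this sketch (illustrative only; the real wrapper in infrastructure/wrappers is more complete, and the path to the real kubectl binary below is an assumption):
#!/usr/bin/env bash
# Illustrative sketch of a governance gate in a kubectl wrapper (not the real wrapper source)
if [ "$HERMES_LAB_GOVERNANCE" != "L4" ] && [ "$1" = "apply" ]; then
  echo "GOVERNANCE REJECTED: write actions require L4 (current: $HERMES_LAB_GOVERNANCE)" >&2
  exit 1
fi
exec /usr/local/bin/kubectl "$@"   # hand off to the real binary (path is an assumption)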
See infrastructure/scenarios/k8s/telegram-bot/admin-allowlist.example.yaml for the team-wide admin roster pattern.
Cleanup
hermes gateway stop
# Restart at L2 for ongoing lab work:
# export HERMES_LAB_GOVERNANCE=L2 && hermes gateway run
FREE EXPLORE PHASE — 25 minutes
Choose challenges based on your available time and experience level.
Challenge 1 (Starter — 10 min): Cross-track cron job
Create a second cron job using a skill from a different track than your primary one:
- Track A participant: try --skill "cost-anomaly" or --skill "kubernetes-health"
- Track B participant: try --skill "dba-rds-slow-query" or --skill "kubernetes-health"
- Track C participant: try --skill "dba-rds-slow-query" or --skill "cost-anomaly"
Use a fast schedule to see it fire during the lab:
hermes cron create "*/5 * * * *" \
"Quick cost anomaly scan. Report only if anomalies detected." \
--name "cross-track-check" \
--skill "cost-anomaly" \
--deliver local
Trigger it manually to confirm it works:
hermes cron run <job-id> # use the ID from hermes cron list
Then run:
hermes cron status
What does the output tell you about both jobs? Does running a cross-track skill produce useful output, or does the agent report it cannot find the expected data source?
Challenge 2 (Intermediate — 15 min): Webhook for a different event type
Unsubscribe the CloudWatch webhook and create one for a different event type:
hermes webhook unsubscribe cloudwatch-alerts
Subscribe to a Kubernetes pod event (Track C) or cost spike event (Track B):
# Track B — cost spike
hermes webhook subscribe cost-spike \
--events "cost-alert" \
--prompt "Cost alert received: {alert.service} exceeded budget by {alert.overage_pct}%. Investigate." \
--deliver local
# Track C — pod eviction
hermes webhook subscribe pod-eviction \
--events "pod-event" \
--prompt "Pod event received: {pod.name} in namespace {pod.namespace} has state {pod.state}. Investigate." \
--deliver local
Test with a matching payload:
# Track B test
hermes webhook test cost-spike \
--payload '{"alert": {"service": "RDS", "overage_pct": "47"}}'
# Track C test
hermes webhook test pod-eviction \
--payload '{"pod": {"name": "api-server-xyz", "namespace": "production", "state": "OOMKilled"}}'
Observe: does the agent adapt its investigation based on the payload values? What changes in the output between a 10% overage and a 47% overage?
Challenge 3 (Advanced — 20 min): Combine cron + webhook on the same skill
Set up both a cron schedule and a webhook subscription pointing at the same skill:
# Cron: fires every 5 min for lab speed
hermes cron create "*/5 * * * *" \
"Quick RDS health snapshot. Use [SILENT] if no anomalies." \
--name "rapid-health" \
--skill "dba-rds-slow-query" \
--deliver local
# Webhook: fires on demand
hermes webhook subscribe rds-alerts \
--events "cloudwatch-alarm" \
--prompt "CloudWatch alert received: {alarm.name} is {alarm.state}. Run full investigation." \
--deliver local
Now fire the webhook while the cron is also scheduled to tick:
hermes webhook test rds-alerts \
--payload '{"alarm": {"name": "rds-cpu-high", "state": "ALARM"}}'
Then check status:
hermes cron status
hermes webhook list
Questions to explore:
- Do both executions complete without interfering with each other?
- The cron prompt asks for a quick snapshot; the webhook prompt asks for a full investigation. Does the output differ?
- What does hermes cron status show while a webhook-triggered run is executing?
- If you trigger the webhook 3 times in quick succession, what happens?
Clean up when done:
hermes cron pause <job-id>
hermes webhook unsubscribe rds-alerts
Closing
What you built in this lab:
- A cron-scheduled health check that fires at 8 AM daily, loads a domain skill, and delivers findings to your terminal (or Slack in production)
- A webhook subscription that reacts to CloudWatch alarm payloads and runs an agent investigation on demand
- Hands-on experience with the full cron lifecycle: create, trigger, pause, resume, delete
- Understanding of when to use scheduled triggers (proactive) vs event-driven webhooks (reactive)
- A real Prometheus + AlertManager pipeline on KIND firing on a Phase 6 broken pod, with the agent receiving the alert and diagnosing without manual invocation
- A K8s CronJob running a containerized Hermes agent on a schedule, with explicit "use this when…" framing for Hermes cron vs K8s CronJob
- A GitHub webhook via smee.io routing real PR events to Hermes, with the agent posting review comments back via the built-in github_comment delivery type
- A Telegram bot you can poke from your phone, with three slash commands and per-process governance inheritance from Phase 7
Key commands reference:
# Cron management
hermes cron status # Always run at session start
hermes cron create "<schedule>" "<prompt>" --name ... --skill ... --deliver local
hermes cron list
hermes cron run <job-id> # Manual fire (use ID from cron list, not name)
hermes cron pause <job-id>
hermes cron resume <job-id>
hermes cron delete <job-id>
# Webhook management
hermes gateway setup # Enable webhook platform
curl http://localhost:8644/health # Verify endpoint is live
hermes webhook subscribe <name> --events ... --prompt ... --deliver local
hermes webhook list
hermes webhook test <name> --payload '{"key": "value"}'
hermes webhook unsubscribe <name>
# Phase 8 trigger commands
hermes webhook subscribe alertmanager --events "alertmanager-alert" --prompt "..." --skill "..." --deliver local
hermes webhook subscribe github --events "pull_request" --prompt "..." --deliver github_comment --deliver-chat-id "{repository.full_name}:{pull_request.number}"
kubectl apply -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-c
./infrastructure/scenarios/k8s/github-webhook/smee-setup.sh
hermes gateway run # With TELEGRAM_BOT_TOKEN set, activates Telegram adapter
hermes gateway stop # Stop gateway before restarting with new governance level
Starter file: course-site/docs/module-11-triggers/lab/starter/cron-job-starter.yaml provides
a parameter reference card for building your own cron jobs.
Next: Module 13 covers governance — approval workflows, maturity levels, and audit trails. The cron and webhook triggers you built here become the entry points for governed agent actions in Module 13.
Verification Checklist
Run these commands to confirm your lab completed successfully:
# 1. Cron scheduler is running
hermes cron status
# Expected: Scheduler: running
# 2. Daily health check job is registered
hermes cron list
# Expected: daily-db-health (or daily-cost-check / daily-k8s-check) in the list
# 3. Job fires successfully on demand
hermes cron run <job-id> # use the ID from hermes cron list
# Expected: agent runs and prints output (or [SILENT] if no anomalies)
# 4. Webhook endpoint is live
curl http://localhost:8644/health
# Expected: {"status": "ok"}
# 5. Webhook subscription is active
hermes webhook list
# Expected: cloudwatch-alerts (or your custom subscription) in the list
# 6. Webhook test fires the agent
hermes webhook test cloudwatch-alerts \
--payload '{"alarm": {"name": "rds-cpu-high", "state": "ALARM"}}'
# Expected: agent runs investigation and prints findings to terminal
# Phase 8 checks
# 7. AlertManager is enabled and reachable
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
# Expected: alertmanager-kube-prometheus-alertmanager-0 Running
# 8. PrometheusRule loaded
kubectl get prometheusrule -n monitoring hermes-lab-rules
# Expected: row exists
# 9. K8s CronJob image built and loaded
docker images hermes-lab:cronjob
# Expected: REPOSITORY: hermes-lab, TAG: cronjob
# 10. K8s CronJob registered (after Step 11)
kubectl get cronjob | grep hermes-track
# Expected: hermes-track-{a,b,c}-health */5 * * * *
# 11. smee-setup.sh is executable
test -x infrastructure/scenarios/k8s/github-webhook/smee-setup.sh && echo OK
# 12. Sample GitHub PR payload is valid JSON
jq . infrastructure/scenarios/k8s/github-webhook/sample-pr-payload.json > /dev/null && echo OK
# 13. Telegram adapter prerequisites
python3 -c "from telegram import Bot; print('telegram OK')"
# Expected: telegram OK