Module 11 Lab: Triggers and Scheduling
Duration: 145 minutes (120 min guided + 25 min free explore)
Prerequisites: Module 10 agent working for your track, Hermes gateway available
Outcome: Working examples of all six trigger patterns: Hermes cron, simulated CloudWatch webhook, real AlertManager webhook on KIND, K8s CronJob, GitHub webhook via smee.io, and a Telegram bot — each agent invocation observed end-to-end
This lab moves your Hermes agent from reactive (you type a prompt) to proactive (the agent runs on a schedule or reacts to events). You will build two trigger mechanisms first: a cron schedule for daily health checks and a webhook subscription for alert-driven investigations. Later steps wire up the remaining patterns: a real AlertManager pipeline, a K8s CronJob, a GitHub webhook, and a Telegram bot.
Both exercises are fully hands-on — you run the commands, see the output, and verify the behavior yourself.
GUIDED PHASE — 120 minutes
Step 1: Morning Startup Sequence (5 min)
WARNING: Cron jobs do NOT auto-recover after a laptop sleep or KIND cluster restart. If you closed your laptop overnight or ran kind delete cluster, your scheduled jobs appear registered but will never fire. Always run hermes cron status first — before assuming any cron job is active.
Startup checklist
1. Check scheduler health:
hermes cron status
Expected output when healthy:
Scheduler: running
Jobs registered: 0
Next tick: in ~60s
If you see Scheduler: not running or an error about the gateway not being available:
hermes gateway setup
Then re-run hermes cron status to confirm the scheduler is now running.
2. Verify any existing jobs are still registered:
hermes cron list
If you created jobs in a previous session, they should appear here. If the list is empty after you expected jobs to be present, the cron store was reset — you will need to recreate them (Step 2).
The cron scheduler runs as part of the Hermes gateway process. If the
gateway was stopped (laptop sleep, restart, terminal close), the scheduler stops ticking.
Jobs persist in ~/.hermes/cron/jobs.json, but they will not run until the gateway is
active again. The hermes cron status check confirms the scheduler is alive and watching
for due jobs.
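If you are curious what that persistence looks like, you can peek at the job store directly. The path comes from the note above; the exact JSON schema is an assumption, so treat the output as informational only:
# Peek at the persisted cron jobs (schema may differ from this sketch's expectations)
jq . ~/.hermes/cron/jobs.json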
Step 2: Create a Daily Health Check Cron Job (10 min)
Cron expressions use the standard five-field format: minute hour day-of-month month day-of-week
| Expression | Meaning |
|---|---|
| 0 8 * * * | 8:00 AM every day |
| 30 9 * * 1-5 | 9:30 AM weekdays only |
| */5 * * * * | Every 5 minutes |
| 0 0 * * 0 | Midnight every Sunday |
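If you want to sanity-check an expression before scheduling it, one option (an assumption: the third-party croniter package, installable with pip install croniter, is not part of Hermes) is to preview the next fire times:
# Preview the next three fire times for a cron expression (assumes croniter is installed)
python3 - <<'EOF'
from datetime import datetime
from croniter import croniter

itr = croniter("0 8 * * *", datetime.now())
for _ in range(3):
    print(itr.get_next(datetime))  # prints the next three 8:00 AM occurrences
EOF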
Create your daily health check cron job. Run the command for your track:
Track A — Database Health:
hermes cron create "0 8 * * *" \
"Run daily health check for RDS. Report only if anomalies found." \
--name "daily-db-health" \
--skill "dba-rds-slow-query" \
--deliver local
Track B — Cost Anomaly:
hermes cron create "0 8 * * *" \
"Run daily cost anomaly check. Report only if spending anomalies detected." \
--name "daily-cost-check" \
--skill "cost-anomaly" \
--deliver local
Track C — Kubernetes Health:
hermes cron create "0 8 * * *" \
"Run daily Kubernetes cluster health check. Report only if pods or nodes show issues." \
--name "daily-k8s-check" \
--skill "kubernetes-health" \
--deliver local
What each argument does
| Argument | Purpose |
|---|---|
schedule (1st positional) | Cron expression defining when the job runs (e.g. "0 8 * * *", "*/5 * * * *", or a shorthand like "30m", "every 2h"). |
prompt (2nd positional) | What the agent is asked to do when it fires. Must be self-contained — the cron agent has no chat history. |
--name | Human-readable job name (kebab-case). Used to reference the job in other commands. |
--skill | Skill to load before running the prompt. The agent reads the SKILL.md runbook first. Repeat the flag to attach multiple skills. |
--deliver local | Output goes to your terminal. In production, use --deliver slack or --deliver telegram to route findings to your notification system. |
--repeat N | Optional — limits the job to N executions. Omit for indefinite scheduling. |
schedule must come first, prompt second. You can put the --name, --skill,
and --deliver flags anywhere (before, between, or after the positionals) since
they're keyword arguments.
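For example, this is equivalent to the Track A command above, with the flags moved in front of the positionals:
hermes cron create --name "daily-db-health" --skill "dba-rds-slow-query" --deliver local \
  "0 8 * * *" "Run daily health check for RDS. Report only if anomalies found."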
Verify the job was registered
hermes cron list
Expected output shape:
Name Schedule Next Run Skill State
daily-db-health 0 8 * * * 2026-04-05 08:00:00 dba-rds-slow-query scheduled
Using --deliver local routes the agent's output to your terminal
session. This is the right choice for lab work — you see output immediately without needing
to configure Slack or Telegram. In production, you would use --deliver slack (configured
in ~/.hermes/config.yaml) or --deliver telegram so findings reach your on-call channel
even if you are not at your terminal.
Step 3: Trigger Manually and Verify Output (10 min)
You scheduled the job for 8 AM — but you do not need to wait. Manual trigger fires the job immediately and is your primary verification tool:
hermes cron run <job-id> # use the ID from hermes cron list
(Use the job ID shown by hermes cron list for the job you created: daily-db-health for Track A, daily-cost-check for Track B, daily-k8s-check for Track C.)
Watch the terminal. The agent will:
- Load the skill SKILL.md runbook
- Run the investigation using mock data (if HERMES_LAB_MODE=mock is set)
- Print its findings to the terminal
Expected output shape:
[Cron] Firing job: daily-db-health
[MOCK MODE] Running dba-rds-slow-query investigation...
Daily Health Check — prod-db-01
Status: HEALTHY
No slow query anomalies detected above threshold.
pg_stat_statements: 12 queries sampled, max mean_time_ms = 45ms (threshold: 500ms)
[SILENT] (no anomalies to report)
When the cron agent finds nothing to report, it responds with [SILENT].
This suppresses delivery — you will not receive a Slack or Telegram notification. This is by
design: agents that cry wolf on every run lose their usefulness. The agent only delivers a
full report when it finds something worth reporting.
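Conceptually, the delivery gate behaves like this shell sketch (illustrative pseudocode, not Hermes source):
# Illustrative sketch of [SILENT] suppression (not the Hermes implementation)
agent_output="[SILENT]"                 # whatever the agent returned this run
if [ "$agent_output" = "[SILENT]" ]; then
  :  # suppress remote delivery; --deliver local still prints the marker
else
  send_to_channel "$agent_output"       # hypothetical delivery function
fi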
If you suspect a cron job silently failed overnight (Step 1 warning), trigger it manually to confirm the job still executes correctly. A successful manual trigger proves the skill, prompt, and delivery path are all working — the only missing piece was the scheduler tick.
Confirm the run was recorded:
hermes cron list
Check that Last Run now shows today's timestamp.
Step 4: Pause, Resume, and Status (5 min)
Pause the job without deleting it:
hermes cron pause <job-id>
Check the paused state:
hermes cron status
Expected output shows the job in paused state:
Scheduler: running
Jobs registered: 1
daily-db-health PAUSED
Next tick: in ~43s
Resume the job:
hermes cron resume <job-id>
Verify it is back to scheduled state:
hermes cron status
Expected:
Scheduler: running
Jobs registered: 1
daily-db-health scheduled next: 2026-04-05 08:00:00
Next tick: in ~51s
Pause and resume is how you stop overnight runs without deleting the job configuration. Use pause when you are doing maintenance, running a planned load test, or temporarily silencing a job that is generating too many alerts. Deleting and recreating a job is the wrong approach — you lose the configuration and have to remember all the flags.
Step 5: Start the Webhook Gateway (5 min)
Set up the webhook platform:
hermes gateway setup
Follow the prompts. When asked about webhooks, enable them and accept the default port (8644).
Verify the endpoint is live:
curl http://localhost:8644/health
Expected response:
{"status": "ok"}
If curl returns Connection refused, the gateway did not start on that port. Check for an
existing process:
lsof -i :8644
If a process is listed, either stop it or use a different port when running hermes gateway setup.
If curl returns an error about the connection being reset (not refused), the gateway is
running but the webhook adapter is not enabled. Re-run hermes gateway setup and confirm
that webhooks are enabled in the prompt sequence.
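If you script your lab setup, a small wait loop (a convenience sketch, assuming the default port) avoids racing the gateway startup:
# Poll until the gateway health endpoint responds (assumes default port 8644)
until curl -sf http://localhost:8644/health > /dev/null; do
  echo "waiting for gateway..."
  sleep 2
done
echo "gateway is up"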
Step 6: Subscribe a Webhook for CloudWatch Alerts (10 min)
A webhook subscription tells Hermes: "when a POST arrives at this route, fire an agent run using this prompt."
Subscribe to CloudWatch alarm events:
hermes webhook subscribe cloudwatch-alerts \
--events "cloudwatch-alarm" \
--prompt "CloudWatch alert received: {alarm.name} is {alarm.state}. Investigate." \
--deliver local
What each part means
| Part | Purpose |
|---|---|
cloudwatch-alerts | The route name. Hermes creates an endpoint at /webhooks/cloudwatch-alerts. |
--events "cloudwatch-alarm" | Event type filter. Only payloads matching this event type trigger the agent. |
--prompt "..." | Template string. {alarm.name} and {alarm.state} are replaced with values from the incoming JSON payload. |
--deliver local | Route agent output to the terminal. |
After running the command, Hermes prints the webhook URL and HMAC secret:
Subscription created: cloudwatch-alerts
URL: http://localhost:8644/webhooks/cloudwatch-alerts
Secret: <auto-generated HMAC-SHA256 secret>
Event filter: cloudwatch-alarm
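That secret is what a production sender would use to sign the POST body. Here is a sketch of such a signed request; the signature header name (X-Hermes-Signature) is an assumption, since the lab does not document the exact header Hermes checks:
# Sketch: HMAC-signed POST as an external sender might issue it (header name is an assumption)
BODY='{"alarm": {"name": "rds-cpu-high", "state": "ALARM"}}'
SIG=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$WEBHOOK_SECRET" -hex | awk '{print $2}')
curl -X POST "https://your-public-host.example.com/webhooks/cloudwatch-alerts" \
  -H "Content-Type: application/json" \
  -H "X-Hermes-Signature: sha256=$SIG" \
  -d "$BODY"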
Verify the subscription is listed:
hermes webhook list
Expected output:
Name Route Events Deliver
cloudwatch-alerts /webhooks/cloudwatch-alerts cloudwatch-alarm local
In production, you would configure CloudWatch SNS to POST to your public webhook URL (not localhost) with the HMAC secret for signature verification. For this lab, you simulate the POST locally in the next step.
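Before you test, it helps to see exactly how the {alarm.name} placeholders resolve. The following standalone snippet re-implements dot-notation substitution for illustration (a teaching sketch, not Hermes source code):
# Illustration only: resolve {a.b} placeholders against a JSON payload
python3 - <<'EOF'
import json, re

payload = json.loads('{"alarm": {"name": "rds-cpu-high", "state": "ALARM"}}')
template = "CloudWatch alert received: {alarm.name} is {alarm.state}. Investigate."

def resolve(match):
    obj = payload
    for key in match.group(1).split("."):
        obj = obj[key]  # dot-notation: each segment is a dict key
    return str(obj)

print(re.sub(r"\{([^}]+)\}", resolve, template))
# -> CloudWatch alert received: rds-cpu-high is ALARM. Investigate.
EOF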
Step 7: Test the Webhook with a Simulated Alert (10 min)
Simulate a CloudWatch alarm firing — no real AWS needed:
hermes webhook test cloudwatch-alerts \
--payload '{"alarm": {"name": "rds-cpu-high", "state": "ALARM"}}'
Watch the terminal. The sequence:
- Hermes receives the simulated POST to /webhooks/cloudwatch-alerts
- The payload {"alarm": {"name": "rds-cpu-high", "state": "ALARM"}} is matched against the prompt template
- The resolved prompt becomes: CloudWatch alert received: rds-cpu-high is ALARM. Investigate.
- Hermes fires an agent run with your track skill loaded
- The agent investigates using mock data and prints findings to the terminal
Expected output shape:
[Webhook] Received cloudwatch-alarm event on cloudwatch-alerts
[MOCK MODE] Investigating: CloudWatch alert received: rds-cpu-high is ALARM. Investigate.
Alert Investigation — rds-cpu-high
State: ALARM | CPUUtilization: 78.4%
Finding: Sequential scan detected on users table (created_at column, 12,847 rows).
Recommendation: CREATE INDEX CONCURRENTLY idx_users_created_at ON users (created_at)
Action required: REQUIRES-DBA-APPROVAL
No additional anomalies found.
You now have both trigger mechanisms running:
- Cron asks "is anything wrong?" It fires on a schedule whether or not an alarm has fired. Use it for proactive health checks and daily summaries.
- Webhook reacts to "something IS wrong." It fires in response to an external event — a CloudWatch alarm, a Kubernetes pod eviction, a Stripe payment failure. Use it for incident response automation where latency matters.
Both can load the same skill and run the same investigation prompt. The difference is timing and trigger: scheduled vs event-driven.
Try a different alarm in the payload:
hermes webhook test cloudwatch-alerts \
--payload '{"alarm": {"name": "rds-storage-low", "state": "ALARM"}}'
Observe how the resolved prompt changes: CloudWatch alert received: rds-storage-low is ALARM. Investigate.
Step 8: Slack — What This Looks Like in Production (5 min)
Slack bot configuration requires admin access to your Slack workspace.
If you are following this lab solo, skip the Slack config steps — you have already done the
hands-on equivalent with --deliver local.
In production, replacing --deliver local with --deliver slack routes the agent's findings
to your #devops-alerts channel automatically. Here is what that configuration looks like:
Slack config format (requires Slack admin to add bot)
# In ~/.hermes/config.yaml
notifications:
slack:
webhook_url: "https://hooks.slack.com/services/T.../B.../..."
channel: "#devops-alerts"
What changes
Only the delivery target changes — the skill, prompt, and investigation logic are identical:
# Lab version (what you ran above):
hermes cron create "0 8 * * *" \
"Run daily health check for RDS. Report only if anomalies found." \
--name "daily-db-health" \
--skill "dba-rds-slow-query" \
--deliver local
# Production version (Slack delivery):
hermes cron create "0 8 * * *" \
"Run daily health check for RDS. Report only if anomalies found." \
--name "daily-db-health" \
--skill "dba-rds-slow-query" \
--deliver slack
What the Slack bot message looks like
When a cron job or webhook fires with --deliver slack, the Hermes bot posts to the configured
channel:
Cronjob Response: daily-db-health
-------------
Daily Health Check — prod-db-01
Status: ALERT
Finding: Sequential scan on users table...
Recommendation: CREATE INDEX CONCURRENTLY...
Note: The agent cannot see this message, and therefore cannot respond to it.
--deliver local is the same pipeline — skill
loaded, agent runs, investigation executes — but output goes to your terminal instead of
Slack. Everything you practiced today is the production workflow. Switching to Slack is a
one-flag change once Slack admin has added the bot.
Step 9: AlertManager — Real Prometheus Stack Setup (10 min)
You have used hermes webhook test to fire simulated webhooks (Step 7). Now you wire a REAL alert source: the Prometheus + AlertManager stack on your KIND cluster, firing on a real broken pod from Phase 6.
This is the moment Hermes stops being a chat agent and starts being an incident-response agent — alerts arrive without you typing anything.
This step requires a running KIND cluster with kube-prometheus-stack already installed (from Phase 1 setup). If you skipped Phase 1's KIND setup, jump to the "Skipping AlertManager" callout at the bottom of this step — TRIG-02 (the K8s CronJob) also needs KIND, but you can still complete TRIG-03 and TRIG-04 without TRIG-01.
Set environment variables
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
Enable AlertManager in the helm release
The Phase 1 helm values disabled AlertManager to save resources. Phase 8 flips it back on.
# Verify the helm values now have alertmanager.enabled: true
yq '.alertmanager.enabled' infrastructure/helm/prometheus-lab-values.yaml
# Expected output: true
# Apply the updated values
helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
-f infrastructure/helm/prometheus-lab-values.yaml
# Wait for AlertManager pods to reach Ready
kubectl wait --for=condition=Ready pod \
-l app.kubernetes.io/name=alertmanager \
-n monitoring --timeout=120s
Expected output:
pod/alertmanager-kube-prometheus-alertmanager-0 condition met
Apply the PrometheusRule
kubectl apply -f infrastructure/scenarios/k8s/alertmanager/prometheus-rules.yaml
Expected output:
prometheusrule.monitoring.coreos.com/hermes-lab-rules created
Verify the rule loaded
kubectl get prometheusrule -n monitoring hermes-lab-rules -o yaml | yq '.spec.groups[0].rules[0].alert'
Expected output:
PodCrashLooping
Open the Prometheus UI at http://localhost:30091 → Status → Rules. You should see hermes-lab.k8s-crashloop group with the PodCrashLooping rule listed and state "inactive" (no broken pods YET).
The release label matters
The PrometheusRule manifest includes a labels: release: ... field. The kube-prometheus-stack Helm chart configures Prometheus to ONLY load rules whose release label matches the Helm release name. In this lab the release is named kube-prometheus (from the helm upgrade --install kube-prometheus ... command above), so the label must be release: kube-prometheus. Without it, your rule would be silently ignored — kubectl get prometheusrule would show it, but the Prometheus UI Rules page would not. Verify the expected label with: kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.ruleSelector}'
Skipping AlertManager (no KIND)
If you don't have a KIND cluster running, skip to Step 12 (GitHub webhook). TRIG-02 (K8s CronJob) also requires KIND, so you'll skip Step 11 as well. All four trigger types are independent: missing one doesn't block the others.
Step 10: AlertManager — Fire and Observe (10 min)
Now wire the full end-to-end flow: gateway running, alertmanager subscription active, crashloop2 pod applied — then watch the alert fire and the agent diagnose automatically.
Set environment variables
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
Start the gateway and subscribe
Run the following in separate terminals:
# Terminal 1: Start the gateway in the foreground so you see live POSTs arrive
hermes gateway run
# Terminal 2: Subscribe the alertmanager webhook
hermes webhook subscribe alertmanager \
--events "alertmanager-alert" \
--prompt "AlertManager PodCrashLooping alert fired. Details: {alerts}. Load the sre-k8s-pod-health skill and diagnose the affected pod in the namespace shown in the alert labels." \
--skill "sre-k8s-pod-health" \
--deliver local
# Verify the subscription is active
hermes webhook list
Use {alerts}, not {alerts[0].labels.pod}
Notice the {alerts} placeholder in the subscribe command above — NOT {alerts[0].labels.pod}. The Hermes prompt template only supports dot-notation access to dict keys, NOT array index access. The agent receives the full alerts[] JSON array as a string and parses it to find the pod and namespace. If you accidentally use {alerts[0].labels.pod}, it will render as a literal string — the agent will see that text instead of the pod name.
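For context, the payload that {alerts} expands from follows the standard AlertManager webhook format (version 4). The shape below is representative; the label values are examples, not literal output from your cluster:
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {
      "labels": {
        "alertname": "PodCrashLooping",
        "namespace": "k8s-trouble-crashloop",
        "pod": "crashloop-demo-abc123"
      },
      "startsAt": "2026-04-05T08:02:00Z"
    }
  ]
}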
Apply the Phase 6 crashloop2 scenario
# Terminal 3: Apply the broken pod (reuses Phase 6 manifest — do NOT modify it)
kubectl apply -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
# Watch the pod restart count climb
watch kubectl get pods -n k8s-trouble-crashloop
Expected timeline
- t=0s: pod applied, status ContainerCreating
- t=10s: pod status CrashLoopBackOff, restartCount=1
- t=60s: restartCount=4-6
- t=90s: PromQL increase() exceeds 2 over the 2-min window
- t=120s: AlertManager dispatches; Terminal 1 shows Received alertmanager-alert event
- t=125s: Hermes spawns agent run; sre-k8s-pod-health loads; diagnosis appears
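The alerting expression behind the t=90s step is, in essence, a restart-rate check. The real rule lives in infrastructure/scenarios/k8s/alertmanager/prometheus-rules.yaml; the expression below is a representative sketch using the standard kube-state-metrics restart counter, not a copy of that file:
# Representative PromQL for a PodCrashLooping alert (sketch only)
increase(kube_pod_container_status_restarts_total{namespace="k8s-trouble-crashloop"}[2m]) > 2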
Open AlertManager UI at http://localhost:30093 to see the active alert and its receiver routing.
Cleanup after observing
kubectl delete -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
hermes webhook unsubscribe alertmanager
Step 11: K8s CronJob — Same Agent, Different Trigger Mechanism (10 min)
The Hermes cron jobs you built in Steps 2-4 are the production pattern for most agent work. This step demonstrates the SAME agent wrapped in a native K8s CronJob — and makes explicit WHEN each pattern is the right answer.
Set environment variables
export HERMES_LAB_MODE=mock
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
Build the minimal hermes-agent image
# Build the minimal hermes-agent container image
# (This takes 5-10 min — the image is ~700-900MB of Python deps)
docker build -t hermes-lab:cronjob infrastructure/scenarios/k8s/cronjob/
# Verify the image exists
docker images | grep hermes-lab
The infrastructure/scenarios/k8s/cronjob/Dockerfile uses python:3.12-slim as a base, not the official nousresearch/hermes-agent:latest (which is 2-3GB with Playwright/ffmpeg). This minimal image is a teaching artifact about packaging agents for K8s.
Load the image into KIND and create the API key secret
# Load into KIND (required — the CronJob uses imagePullPolicy: IfNotPresent, not a registry)
kind load docker-image hermes-lab:cronjob --name lab
# Create the API key secret (NEVER commit this token — it lives only in your local KIND cluster)
kubectl create secret generic hermes-secrets \
--from-literal=anthropic-api-key="$ANTHROPIC_API_KEY"
# Verify
kubectl get secret hermes-secrets
Apply the CronJob for your track
# Track A: kubectl apply -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-a
# Track B: kubectl apply -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-b
# Track C:
kubectl apply -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-c
# Watch jobs spawn (schedule is */5 * * * * — wait up to 5 min for the first run)
watch kubectl get jobs,pods
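While you wait for the first run, it helps to see the manifest's rough shape. The sketch below reconstructs what agent-health-check.yaml plausibly contains from the commands in this step (the image, secret name, schedule, label, and imagePullPolicy are taken from above; all other fields are assumptions, so consult the real file for specifics):
# Sketch only — reconstructed from this step's commands, not a copy of agent-health-check.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hermes-track-c-health
  labels:
    track: track-c
spec:
  schedule: "*/5 * * * *"              # matches the schedule noted above
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: agent
              image: hermes-lab:cronjob      # the image you built and loaded into KIND
              imagePullPolicy: IfNotPresent  # why `kind load docker-image` is required
              env:
                - name: ANTHROPIC_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: hermes-secrets   # the secret created above
                      key: anthropic-api-key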
View logs from the first completed job
kubectl logs -l job-name=$(kubectl get jobs -o jsonpath='{.items[-1].metadata.name}')
Use Hermes cron when:
- The agent benefits from gateway-shared state (loaded skills, audit trail, conversation history)
- You want one-stop CLI management (hermes cron create/list/run/pause/resume)
- You're iterating fast — tweak a prompt, re-register the cron, done. No image rebuild cycle.
- You need audit-trail context linking cron runs to skill and prompt versions
- You're not (yet) in Kubernetes
Use K8s CronJob when:
- Stateless one-shot diagnostics — no state needed from previous runs
- GitOps schedule-in-git — you want the schedule reviewed via PR and deployed via ArgoCD/Flux
- K8s-native observability — Prometheus kube_job_status_* metrics, kubectl get jobs, Loki logs
- Multi-tenant resource quotas — namespace isolation, NetworkPolicies, resource quotas, Secrets
The honest real-world stance: most agent work uses Hermes cron because state matters. K8s CronJob shines for fire-and-forget diagnostic jobs deployed alongside other K8s primitives via the same GitOps pipeline.
Cleanup
kubectl delete -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-c
kubectl delete secret hermes-secrets
Step 12: GitHub Webhook — smee.io Setup (10 min)
GitHub webhooks need a public HTTPS endpoint to POST to. smee.io is a free public webhook proxy: you get a unique channel URL, GitHub POSTs to it, and a smee-client on your laptop forwards events to your local Hermes gateway.
This step requires a personal GitHub repo and a GitHub PAT. If you don't have these, skip to Step 13's Solo Learner fallback section — you can simulate the full GitHub webhook flow without any external service using the bundled sample PR payload.
Set environment variables (including new Phase 8 TRIG-03 vars)
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
# Phase 8 NEW (TRIG-03):
export GITHUB_TOKEN="ghp_..." # Your PAT — see Get a GitHub PAT below
export SMEE_URL="https://smee.io/your-channel-id" # Your smee.io channel — see step 1 below
Step 1: Get a smee.io channel URL
Visit https://smee.io/ in your browser and click "Start a new channel". Copy the URL — it looks like https://smee.io/abc123XYZ. Set it in the env var above.
Step 2: Get a GitHub PAT with repo scope
- Open https://github.com/settings/tokens → "Generate new token" → "Generate new token (classic)"
- Note: hermes-lab-trig03, Expiration: 30 days
- Scopes: check repo (includes read+write to PRs and comments)
- Click "Generate token" → copy the ghp_... value
# Authenticate gh CLI with your PAT
gh auth login --with-token <<< "$GITHUB_TOKEN"
gh auth status # Should show "Logged in to github.com as <you>"
Use classic PAT with repo scope — simplest path. Fine-grained PATs work too but require selecting "Pull requests: Read and Write" as a specific permission, which is a common source of 403 errors (see Phase 8 Research Pitfall 7).
Step 3: Start the gateway and smee-client
# Terminal 1: Start the Hermes gateway
hermes gateway run
# Terminal 2: Run the smee setup script (foreground — leave it running)
./infrastructure/scenarios/k8s/github-webhook/smee-setup.sh
Expected smee output:
================================================================
smee.io → Hermes webhook gateway forwarder
================================================================
Source: https://smee.io/abc123XYZ
Target: http://localhost:8644/webhooks/github
Client: smee-client@5.0.0
================================================================
Forwarding https://smee.io/abc123XYZ to http://localhost:8644/webhooks/github
Step 4: Add the webhook to your GitHub repo
In your test GitHub repo: Settings → Webhooks → Add webhook
- Payload URL: $SMEE_URL (the smee.io channel URL)
- Content type: application/json
- Secret: (leave blank for the lab)
- Events: "Let me select individual events" → check Pull requests
- Active: checked
Click Add webhook. GitHub will send a test ping — you should see Forwarding event to localhost:8644/webhooks/github in the smee terminal.
The smee-setup.sh script targets http://localhost:8644/webhooks/github. When you subscribe in Step 13, use hermes webhook subscribe github (NOT github-webhook). The route name after /webhooks/ MUST match the subscription name exactly, or events arrive at the gateway but no subscription receives them (Pitfall 5 from Phase 8 Research).
Step 13: GitHub Agent Comment Back — Full Round-Trip (10 min)
Subscribe the GitHub webhook with --deliver github_comment, open a PR on your test repo, and watch the agent post a review comment automatically.
Set environment variables
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
# Phase 8 NEW (TRIG-03):
export GITHUB_TOKEN="ghp_..."
export SMEE_URL="https://smee.io/your-channel-id"
Subscribe the GitHub webhook
# Terminal 3: Subscribe with the agent prompt template and github_comment delivery
hermes webhook subscribe github \
--events "pull_request" \
--prompt "$(cat infrastructure/scenarios/k8s/github-webhook/agent-prompt-template.txt)" \
--deliver github_comment \
--deliver-chat-id "{repository.full_name}:{pull_request.number}"
# Verify the subscription is active
hermes webhook list
The --deliver github_comment type is built into Hermes (gateway/platforms/webhook.py lines 525-558). Internally it calls gh pr comment {pr_number} --repo {repo} --body "{content}". You wrote zero HTTP code — the only requirement is gh CLI installed and authenticated with your GITHUB_TOKEN.
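You can reproduce that delivery step by hand to see what the agent's comment call amounts to (the repo and PR number below are placeholders for your own test repo):
# Manual equivalent of the github_comment delivery (placeholder repo/PR values)
gh pr comment 42 --repo your-org/your-repo --body "Automated review summary goes here"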
Trigger the event
Open a PR on your test repo (or push a commit to a branch that already has an open PR).
Watch the flow:
- smee terminal → Forwarding event to localhost:8644/webhooks/github
- gateway terminal → Received github webhook event
- ~10 seconds → agent runs, generates review summary, posts comment back to the PR
- GitHub PR → refresh the PR comments — you should see the Hermes review comment
# Verify the comment landed
gh pr view <PR_NUMBER> --repo <OWNER>/<REPO> --json comments
Solo Learner fallback — no GitHub repo or smee.io needed
Skip Steps 12-13 smee setup and use the bundled sample payload instead:
# Subscribe with --deliver local instead of --deliver github_comment
hermes webhook subscribe github \
--events "pull_request" \
--prompt "$(cat infrastructure/scenarios/k8s/github-webhook/agent-prompt-template.txt)" \
--deliver local
# Inject the bundled sample payload (PR #42: feat(api): add /health readiness endpoint)
hermes webhook test github \
--payload @infrastructure/scenarios/k8s/github-webhook/sample-pr-payload.json
The agent runs identically — the review goes to your terminal instead of back to GitHub. The sample payload is a valid GitHub PR webhook structure with all the fields the prompt template references (pull_request.number, pull_request.title, repository.full_name, etc.).
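You can confirm those fields exist in the bundled payload before relying on them in a template:
# Extract the fields the prompt template references from the sample payload
jq '{number: .pull_request.number, title: .pull_request.title, repo: .repository.full_name}' \
  infrastructure/scenarios/k8s/github-webhook/sample-pr-payload.json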
Cleanup
hermes webhook unsubscribe github
Step 14: Telegram Bot Setup — @BotFather to Gateway (10 min)
Telegram is the right primary chat platform for this lab: free, no admin approval, works for every Udemy learner globally. You create a real bot via @BotFather, configure Hermes to activate the Telegram adapter, and start receiving slash commands from your phone.
Telegram bots use long polling. Only ONE bot instance can poll at a time. If you previously ran hermes gateway run in another terminal, stop it first with hermes gateway stop and wait 30 seconds before starting a new instance, or you'll get 409 Conflict errors (telegram_polling_conflict fatal error after 3 retries). This is a Telegram API restriction, not a Hermes bug.
Step 1: Create your bot via @BotFather (~2 min)
The @BotFather setup takes about 2 minutes total. You only do it once — your bot stays registered with Telegram permanently. No payment, no admin approval, no workspace required.
- Open Telegram (mobile app or https://web.telegram.org)
- Search for @BotFather (verified blue checkmark — the official bot creation bot)
- Send /newbot
- Choose a display name (e.g., Hermes Lab Bot)
- Choose a username — must end in bot (e.g., hermes_lab_yourname_bot)
- Copy the bot token from BotFather's response — looks like 123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11
- Treat this token like a password — anyone with it can impersonate your bot
Step 2: Get your Telegram user ID via @userinfobot (~30 sec)
The Telegram adapter restricts the bot to user IDs listed in TELEGRAM_ALLOWED_USERS. To find your own ID:
- In Telegram, search for @userinfobot (a public utility bot)
- Send /start
- Copy your numeric user ID (e.g., 987654321)
Step 3: Configure env vars and start the gateway
# Stop any existing gateway (Telegram polling lock)
hermes gateway stop
sleep 30
# Set the base env vars
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
# Phase 8 NEW (TRIG-04):
export TELEGRAM_BOT_TOKEN="123456:ABC..." # From @BotFather
export TELEGRAM_ALLOWED_USERS="987654321" # From @userinfobot
# Verify the python-telegram-bot package is installed (from [messaging] extra)
python3 -c "from telegram import Bot; print('telegram OK')"
# Start the gateway — look for "Telegram adapter started, polling..." in the output
hermes gateway run
Expected output includes:
Telegram adapter started, polling...
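If the adapter fails to start, you can check the token directly against the Telegram Bot API. getMe is a standard Bot API method, independent of Hermes:
# Verify the bot token works — expect {"ok": true, "result": {...}} with your bot's username
curl -s "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/getMe" | jq .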
See infrastructure/scenarios/k8s/telegram-bot/README.md for additional troubleshooting guidance and the bot-config.example.yaml reference.
Step 15: Telegram /diagnose Command — Slash Commands From Your Phone (5 min)
You've been typing into a terminal for 14 steps. Now pick up your phone, send a slash command, and watch the agent respond in Telegram.
Set environment variables
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L2
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
# Phase 8 NEW (TRIG-04):
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_ALLOWED_USERS="987654321"
Send slash commands from Telegram
- Open Telegram on your phone or in the web browser
- Search for your bot's username (the one you chose in Step 14 with @BotFather)
- Open the chat and send:
/help
Expected: bot replies with the list of available commands.
- Send the command for your track:
| Track | Command | Skill Used |
|---|---|---|
| Track A | /diagnose users | sre-dba-rds-slow-query |
| Track B | /diagnose web-tier | sre-ec2-health-check |
| Track C | /diagnose k8s-trouble-crashloop | sre-k8s-pod-health |
Watch the gateway terminal — you should see the message arrive and the agent run. Within ~10 seconds, the bot replies in the same Telegram chat thread with the agent's findings.
The Hermes Telegram adapter (gateway/platforms/telegram.py line 1549 — _handle_command) passes the entire slash command message through to the agent as the prompt. There's no per-command Python registration needed. The adapter handles ALL slash commands automatically.
Experiment: send /status to see gateway health, or send /whatever and the agent will see exactly that text and respond.
- Send /status to see current gateway state:
/status
Expected: bot replies with scheduler status, active webhook subscriptions, and current governance level.
See infrastructure/scenarios/k8s/telegram-bot/slash-command-spec.md for the full three-command spec (/diagnose, /status, /help) with per-track examples.
Step 16: Telegram Governance Escalation — Per-Process Override (5 min)
This step demonstrates that bot governance is per-PROCESS — set when the gateway starts and inherited by every command during that gateway's lifetime. To escalate, restart the gateway with a different governance level.
Set environment variables (escalated to L4)
# Stop the current gateway (releases Telegram polling lock)
hermes gateway stop
sleep 30
# Restart with L4 governance — full export block with L4
export HERMES_LAB_MODE=live
export HERMES_LAB_SCENARIO=crashloop2
export HERMES_LAB_GOVERNANCE=L4
export HERMES_LAB_TRACK=track-c
export MOCK_DATA_DIR="$(pwd)/infrastructure/mock-data/kubernetes"
export PATH="$(pwd)/infrastructure/wrappers:$PATH"
# Phase 8 NEW (TRIG-04):
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_ALLOWED_USERS="987654321"
hermes gateway run
Test escalated governance from Telegram
From Telegram, send a write-action command (Track C example):
/diagnose --apply k8s-trouble-crashloop
At L4, the agent CAN attempt kubectl apply (the Phase 7 wrapper allows write actions at L4). At L2 from Step 15, the same command would have triggered the GOVERNANCE REJECTED banner from the wrapper.
Verify the governance level in session logs
hermes sessions list
hermes sessions show <session-id> | grep -i governance
What happens when a non-admin tries L4 from the message text?
If a user sends /diagnose --governance L4 <arg> from Telegram while the gateway is running at L2, the agent runs at L2 — the --governance L4 text is just part of the prompt, not a directive to the wrapper. The wrapper reads HERMES_LAB_GOVERNANCE=L2 from the process env and enforces L2 allowlists.
Telegram bot governance is per-process, not per-message. To change levels, restart the gateway with a different HERMES_LAB_GOVERNANCE env var. The combination of TELEGRAM_ALLOWED_USERS (who can talk to the bot at all) and HERMES_LAB_GOVERNANCE (what they can do) is the per-context governance model — consistent with how AlertManager and K8s CronJob agents work.
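As a mental model, the wrapper's gate behaves like this sketch (illustrative only; the real wrapper in infrastructure/wrappers is more complete, and the path to the real kubectl binary below is an assumption):
#!/usr/bin/env bash
# Illustrative sketch of a governance gate in a kubectl wrapper (not the real wrapper source)
if [ "$HERMES_LAB_GOVERNANCE" != "L4" ] && [ "$1" = "apply" ]; then
  echo "GOVERNANCE REJECTED: write actions require L4 (current: $HERMES_LAB_GOVERNANCE)" >&2
  exit 1
fi
exec /usr/local/bin/kubectl "$@"   # hand off to the real binary (path is an assumption)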
See infrastructure/scenarios/k8s/telegram-bot/admin-allowlist.example.yaml for the team-wide admin roster pattern.
Cleanup
hermes gateway stop
# Restart at L2 for ongoing lab work:
# export HERMES_LAB_GOVERNANCE=L2 && hermes gateway run
FREE EXPLORE PHASE — 25 minutes
Choose challenges based on your available time and experience level.
Challenge 1 (Starter — 10 min): Cross-track cron job
Create a second cron job using a skill from a different track than your primary one:
- Track A participant: try --skill "cost-anomaly" or --skill "kubernetes-health"
- Track B participant: try --skill "dba-rds-slow-query" or --skill "kubernetes-health"
- Track C participant: try --skill "dba-rds-slow-query" or --skill "cost-anomaly"
Use a fast schedule to see it fire during the lab:
hermes cron create "*/5 * * * *" \
"Quick cost anomaly scan. Report only if anomalies detected." \
--name "cross-track-check" \
--skill "cost-anomaly" \
--deliver local
Trigger it manually to confirm it works:
hermes cron run <job-id> # use the ID from hermes cron list
Then run:
hermes cron status
What does the output tell you about both jobs? Does running a cross-track skill produce useful output, or does the agent report it cannot find the expected data source?
Challenge 2 (Intermediate — 15 min): Webhook for a different event type
Unsubscribe the CloudWatch webhook and create one for a different event type:
hermes webhook unsubscribe cloudwatch-alerts
Subscribe to a Kubernetes pod event (Track C) or cost spike event (Track B):
# Track B — cost spike
hermes webhook subscribe cost-spike \
--events "cost-alert" \
--prompt "Cost alert received: {alert.service} exceeded budget by {alert.overage_pct}%. Investigate." \
--deliver local
# Track C — pod eviction
hermes webhook subscribe pod-eviction \
--events "pod-event" \
--prompt "Pod event received: {pod.name} in namespace {pod.namespace} has state {pod.state}. Investigate." \
--deliver local
Test with a matching payload:
# Track B test
hermes webhook test cost-spike \
--payload '{"alert": {"service": "RDS", "overage_pct": "47"}}'
# Track C test
hermes webhook test pod-eviction \
--payload '{"pod": {"name": "api-server-xyz", "namespace": "production", "state": "OOMKilled"}}'
Observe: does the agent adapt its investigation based on the payload values? What changes in the output between a 10% overage and a 47% overage?
Challenge 3 (Advanced — 20 min): Combine cron + webhook on the same skill
Set up both a cron schedule and a webhook subscription pointing at the same skill:
# Cron: fires every 5 min for lab speed
hermes cron create "*/5 * * * *" \
"Quick RDS health snapshot. Use [SILENT] if no anomalies." \
--name "rapid-health" \
--skill "dba-rds-slow-query" \
--deliver local
# Webhook: fires on demand
hermes webhook subscribe rds-alerts \
--events "cloudwatch-alarm" \
--prompt "CloudWatch alert received: {alarm.name} is {alarm.state}. Run full investigation." \
--deliver local
Now fire the webhook while the cron is also scheduled to tick:
hermes webhook test rds-alerts \
--payload '{"alarm": {"name": "rds-cpu-high", "state": "ALARM"}}'
Then check status:
hermes cron status
hermes webhook list
Questions to explore:
- Do both executions complete without interfering with each other?
- The cron prompt asks for a quick snapshot; the webhook prompt asks for a full investigation. Does the output differ?
- What does hermes cron status show while a webhook-triggered run is executing?
- If you trigger the webhook 3 times in quick succession, what happens?
Clean up when done:
hermes cron pause <job-id>
hermes webhook unsubscribe rds-alerts
Closing
What you built in this lab:
- A cron-scheduled health check that fires at 8 AM daily, loads a domain skill, and delivers findings to your terminal (or Slack in production)
- A webhook subscription that reacts to CloudWatch alarm payloads and runs an agent investigation on demand
- Hands-on experience with the full cron lifecycle: create, trigger, pause, resume, delete
- Understanding of when to use scheduled triggers (proactive) vs event-driven webhooks (reactive)
- A real Prometheus + AlertManager pipeline on KIND firing on a Phase 6 broken pod, with the agent receiving the alert and diagnosing without manual invocation
- A K8s CronJob running a containerized Hermes agent on a schedule, with explicit "use this when…" framing for Hermes cron vs K8s CronJob
- A GitHub webhook via smee.io routing real PR events to Hermes, with the agent posting review comments back via the built-in github_comment delivery type
- A Telegram bot you can poke from your phone, with three slash commands and per-process governance inheritance from Phase 7
Key commands reference:
# Cron management
hermes cron status # Always run at session start
hermes cron create "<schedule>" "<prompt>" --name ... --skill ... --deliver local
hermes cron list
hermes cron run <job-id> # Manual fire (use ID from cron list, not name)
hermes cron pause <job-id>
hermes cron resume <job-id>
hermes cron delete <job-id>
# Webhook management
hermes gateway setup # Enable webhook platform
curl http://localhost:8644/health # Verify endpoint is live
hermes webhook subscribe <name> --events ... --prompt ... --deliver local
hermes webhook list
hermes webhook test <name> --payload '{"key": "value"}'
hermes webhook unsubscribe <name>
# Phase 8 trigger commands
hermes webhook subscribe alertmanager --events "alertmanager-alert" --prompt "..." --skill "..." --deliver local
hermes webhook subscribe github --events "pull_request" --prompt "..." --deliver github_comment --deliver-chat-id "{repository.full_name}:{pull_request.number}"
kubectl apply -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-c
./infrastructure/scenarios/k8s/github-webhook/smee-setup.sh
hermes gateway run # With TELEGRAM_BOT_TOKEN set, activates Telegram adapter
hermes gateway stop # Stop gateway before restarting with new governance level
Starter file: course-site/docs/module-11-triggers/lab/starter/cron-job-starter.yaml provides
a parameter reference card for building your own cron jobs.
Next: Module 13 covers governance — approval workflows, maturity levels, and audit trails. The cron and webhook triggers you built here become the entry points for governed agent actions in Module 13.
Verification Checklist
Run these commands to confirm your lab completed successfully:
# 1. Cron scheduler is running
hermes cron status
# Expected: Scheduler: running
# 2. Daily health check job is registered
hermes cron list
# Expected: daily-db-health (or daily-cost-check / daily-k8s-check) in the list
# 3. Job fires successfully on demand
hermes cron run <job-id> # use the ID from hermes cron list
# Expected: agent runs and prints output (or [SILENT] if no anomalies)
# 4. Webhook endpoint is live
curl http://localhost:8644/health
# Expected: {"status": "ok"}
# 5. Webhook subscription is active
hermes webhook list
# Expected: cloudwatch-alerts (or your custom subscription) in the list
# 6. Webhook test fires the agent
hermes webhook test cloudwatch-alerts \
--payload '{"alarm": {"name": "rds-cpu-high", "state": "ALARM"}}'
# Expected: agent runs investigation and prints findings to terminal
# Phase 8 checks
# 7. AlertManager is enabled and reachable
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
# Expected: alertmanager-kube-prometheus-alertmanager-0 Running
# 8. PrometheusRule loaded
kubectl get prometheusrule -n monitoring hermes-lab-rules
# Expected: row exists
# 9. K8s CronJob image built and loaded
docker images hermes-lab:cronjob
# Expected: REPOSITORY: hermes-lab, TAG: cronjob
# 10. K8s CronJob registered (after Step 11)
kubectl get cronjob | grep hermes-track
# Expected: hermes-track-{a,b,c}-health */5 * * * *
# 11. smee-setup.sh is executable
test -x infrastructure/scenarios/k8s/github-webhook/smee-setup.sh && echo OK
# 12. Sample GitHub PR payload is valid JSON
jq . infrastructure/scenarios/k8s/github-webhook/sample-pr-payload.json > /dev/null && echo OK
# 13. Telegram adapter prerequisites
python3 -c "from telegram import Bot; print('telegram OK')"
# Expected: telegram OK