Module 11 Lab: Triggers & Scheduling for Kubernetes (Track C)
Duration: 90 minutes (70 min guided + 20 min free explore)
Track: C -- Kubernetes Health & Self-Healing
Prerequisite: Module 8 Track C lab complete (working track-c profile); KIND cluster running
Outcome: Working examples of all 5 trigger patterns using real KIND infrastructure
This lab moves your Hermes agent from reactive (you type a prompt) to proactive (the agent runs on a schedule or reacts to events). You will build five trigger mechanisms: a cron schedule for daily health checks, an AlertManager webhook for incident response, a K8s CronJob for containerized agent runs, a GitHub webhook for PR review automation, and a Telegram bot for slash-command diagnostics from your phone.
All exercises use real infrastructure on your KIND cluster -- no mock mode, no simulated data.
Prerequisites (5 min)
Verify your environment before starting.
Confirm the KIND cluster is running and reachable:
kubectl get nodes
kubectl get pods -A
Verify the Track C agent responds:
hermes -p track-c chat
# Type "hello" to confirm the agent responds, then exit with Ctrl+C
Verify Hermes gateway capability:
hermes --version
Start the gateway (runs in the background as a service):
hermes gateway start
hermes gateway status
# Expected: Gateway running
Throughout this lab, keep a separate terminal open with:
tail -f ~/.hermes/logs/gateway.log
This is your real-time view into everything the gateway does — cron ticks, webhook receipts, agent runs, delivery results. When something doesn't work, this is the first place to check.
GUIDED PHASE -- 70 minutes
Step 1: Hermes Cron -- Daily K8s Health Check (10 min)
Cron expressions use the standard five-field format: minute hour day-of-month month day-of-week
| Expression | Meaning |
|---|---|
| 0 8 * * * | 8:00 AM every day |
| 30 9 * * 1-5 | 9:30 AM weekdays only |
| */5 * * * * | Every 5 minutes |
| 0 0 * * 0 | Midnight every Sunday |
One-time setup: install skill globally and configure API key
Hermes cron jobs run outside any profile context — the cron scheduler creates its own agent instance with access to globally installed skills only. Before creating your first cron job, install the Track C skill globally and ensure the API key is available:
# Install the sre-k8s-pod-health skill globally
cp -r ~/.hermes/profiles/track-c/skills/sre-k8s-pod-health ~/.hermes/skills/
# Verify it is visible globally
ls ~/.hermes/skills/sre-k8s-pod-health/
# Expected: SKILL.md (and possibly other files)
# Ensure the Anthropic API key is in the global .env
# (Cron jobs don't read profile-level .env files)
grep -q ANTHROPIC ~/.hermes/.env 2>/dev/null || \
grep ANTHROPIC ~/.hermes/profiles/track-c/.env >> ~/.hermes/.env
When you run hermes -p track-c chat, Hermes loads the profile's SOUL.md, config, and
skills. But the cron scheduler runs jobs in its own thread — it builds a fresh agent with
no profile context. It resolves skills from ~/.hermes/skills/ (global) only. The API key
must also be in ~/.hermes/.env since the cron agent does not read profile .env files.
Create the cron job
hermes cron create "0 8 * * *" \
"Run morning pod health check across all namespaces. Report only if pods or nodes show issues." \
--name "daily-k8s-check" \
--skill "sre-k8s-pod-health" \
--deliver local
What each argument does
| Argument | Purpose |
|---|---|
| schedule (1st positional) | Cron expression defining when the job runs (e.g. "0 8 * * *", "*/5 * * * *", or a shorthand like "30m", "every 2h"). |
| prompt (2nd positional) | What the agent is asked to do when it fires. Must be self-contained -- the cron agent has no chat history. |
| --name | Human-readable job name (kebab-case). Used to reference the job in other commands. |
| --skill | Skill to load before running the prompt. The agent reads the SKILL.md runbook first. Repeat the flag to attach multiple skills. |
| --deliver local | Output goes to your terminal. In production, use --deliver slack or --deliver telegram to route findings to your notification system. |
| --repeat N | Optional -- limits the job to N executions. Omit for indefinite scheduling. |
schedule must come first, prompt second. You can put the --name, --skill,
and --deliver flags anywhere (before, between, or after the positionals) since
they are keyword arguments.
Verify the job was registered
hermes cron list
Expected output:
ID Name Schedule Next Run Skill State
f8081a091378 daily-k8s-check 0 8 * * * 2026-04-10 08:00:00 sre-k8s-pod-health scheduled
Using --deliver local routes the agent's output to your terminal
session. This is the right choice for lab work -- you see output immediately without needing
to configure Slack or Telegram. In production, you would use --deliver slack (configured
in ~/.hermes/config.yaml) or --deliver telegram so findings reach your on-call channel
even if you are not at your terminal.
Trigger manually and verify output
You scheduled the job for 8 AM -- but you do not need to wait. Manual trigger fires the job immediately and is your primary verification tool.
Note the Job ID (the hex hash like f8081a091378) from hermes cron list -- hermes cron run
takes the job ID, not the name:
# Use the Job ID from hermes cron list output (first column)
hermes cron run <job-id>
# Example: hermes cron run f8081a091378
How cron execution works:
hermes cron run does NOT execute the job in your terminal. It marks the job as "due"
and waits for the gateway's scheduler to pick it up on the next tick (up to 60 seconds).
The gateway runs the job in its own thread, and with --deliver local, the output is
saved to a file.
If you don't want to wait for the next tick, force it immediately:
# Option 1: Wait for the gateway's scheduler (~60 seconds)
hermes cron run <job-id>
# then wait...
# Option 2: Force the tick immediately (recommended for lab work)
hermes cron run <job-id>
hermes cron tick
hermes cron tick manually triggers the scheduler — it checks for any due jobs and
executes them right now. You will see the agent's tool calls (kubectl commands) stream
to your terminal in real time. This is the fastest way to verify your cron job works.
Tip: use hermes cron tick during lab work whenever you want to see results immediately
after hermes cron run. In production, the gateway runs ticks automatically every
60 seconds — you never need to call tick manually.
Read the saved output file:
ls ~/.hermes/cron/output/
# You should see a directory named with your job ID
# Read the output (replace with your job ID)
cat ~/.hermes/cron/output/<job-id>/*.md
Expected output shape (from a healthy KIND cluster):
# Kubernetes Health Check -- kind-lab
Namespaces scanned: kube-system, default, local-path-storage
All pods running. No restarts detected above threshold.
Node kind-control-plane: Ready, allocatable CPU 4 cores, memory 8Gi.
When the cron agent finds nothing to report, it may respond with [SILENT].
This suppresses delivery to external channels -- you will not receive a Slack or Telegram
notification. This is by design: agents that cry wolf on every run lose their usefulness.
The agent only delivers a full report when it finds something worth reporting.
With --deliver local, output is always saved to ~/.hermes/cron/output/<job-id>/ regardless
of whether the agent reports findings or stays silent. Check the file to see what happened.
Lifecycle demo: pause and resume
Pause the job without deleting it:
hermes cron pause <job-id>
Check the paused state:
hermes cron status
Expected output shows the job in paused state:
Scheduler: running
Jobs registered: 1
daily-k8s-check PAUSED
Next tick: in ~43s
Resume the job:
hermes cron resume <job-id>
Verify it is back to scheduled state:
hermes cron status
Expected:
Scheduler: running
Jobs registered: 1
daily-k8s-check scheduled next: 2026-04-10 08:00:00
Next tick: in ~51s
Step 2: AlertManager -- Prometheus Stack + Live Webhook (20 min)
You used hermes cron run to fire an agent manually. Now you wire a REAL alert source: the Prometheus + AlertManager stack on your KIND cluster, firing on a real broken pod. This is the moment Hermes stops being a chat agent and starts being an incident-response agent -- alerts arrive without you typing anything.
Sub-step 2a: Verify the Prometheus + AlertManager stack
The Prometheus stack was installed during initial course setup (make monitoring in the
reference-app directory, Helm release name monitoring). Verify it is running:
# Verify Prometheus and AlertManager pods are running
kubectl get pods -n monitoring | grep -E "prometheus|alertmanager"
# Expected: prometheus-monitoring-* and alertmanager-monitoring-* pods in Running state
# Verify AlertManager is accessible
curl -s http://localhost:30093/-/healthy
# Expected: OK
If you don't see the monitoring pods, install the stack:
cd reference-app && make monitoring && cd ..
This installs kube-prometheus-stack with Helm release name monitoring.
Sub-step 2b: Apply the PrometheusRule
Apply the PrometheusRule that fires when a pod in the k8s-trouble-crashloop namespace has more than 1 restart:
kubectl apply -f infrastructure/scenarios/k8s/alertmanager/prometheus-rules.yaml
Expected output:
prometheusrule.monitoring.coreos.com/hermes-lab-rules created
Verify the rule loaded:
kubectl get prometheusrule -n monitoring hermes-lab-rules -o yaml | yq '.spec.groups[0].rules[0].alert'
Expected output:
PodCrashLooping
Open the Prometheus UI at http://localhost:30091 and navigate to Status then Rules. You should see the hermes-lab.k8s-crashloop group with the PodCrashLooping rule listed and state "inactive" (no broken pods yet).
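For reference, the rule manifest is shaped roughly like this. This is a sketch reconstructed from what this step observes (group name, alert name, metric, threshold, release label); the severity and annotation values are illustrative assumptions, and the file you applied is authoritative:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hermes-lab-rules
  namespace: monitoring
  labels:
    release: monitoring        # required -- see the note below
spec:
  groups:
    - name: hermes-lab.k8s-crashloop
      rules:
        - alert: PodCrashLooping
          # Fires when any container in the namespace has restarted more than once
          # (no "for:" delay, so PENDING -> FIRING is immediate)
          expr: kube_pod_container_status_restarts_total{namespace="k8s-trouble-crashloop"} > 1
          labels:
            severity: warning  # illustrative assumption
          annotations:
            summary: Pod {{ $labels.pod }} in {{ $labels.namespace }} is crash-looping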
Why the release label matters: the PrometheusRule manifest includes labels: release: monitoring. The kube-prometheus-stack
Helm chart configures Prometheus to ONLY load rules whose release label matches the Helm
release name. In this course the release is named monitoring (from make monitoring), so
the label must match. Without it, your rule would be silently ignored — kubectl get prometheusrule
would show it, but the Prometheus UI Rules page would not. Verify with:
kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.ruleSelector}'
Sub-step 2c: Configure the webhook subscription and restart the gateway
The webhook subscription needs two settings that differ from the defaults:
- No HMAC signature — AlertManager doesn't sign its POST requests, so the subscription must use INSECURE_NO_AUTH to skip signature validation
- Deliver to log — webhook-triggered output uses log delivery (saved to the gateway log), not local (which is cron-only)
Edit the subscription file directly (or create it if it doesn't exist):
cat > ~/.hermes/webhook_subscriptions.json << 'EOF'
{
"alertmanager": {
"description": "AlertManager PodCrashLooping webhook",
"events": [],
"secret": "INSECURE_NO_AUTH",
"prompt": "AlertManager PodCrashLooping alert fired. Details: {alerts}. Load the sre-k8s-pod-health skill and diagnose the affected pod in the namespace shown in the alert labels.",
"skills": [
"sre-k8s-pod-health"
],
"deliver": "log"
}
}
EOF
Why {alerts} and not {alerts[0].labels.pod}? The Hermes prompt template only supports dot-notation access to dict keys, NOT array index
access. The agent receives the full alerts[] JSON array as a string and parses it to find
the pod and namespace. If you use {alerts[0].labels.pod}, it renders as a literal string.
Why INSECURE_NO_AUTH? In production, webhook endpoints should validate HMAC signatures to prevent unauthorized
requests from triggering agent runs. AlertManager does not support HMAC signing natively.
For lab work, INSECURE_NO_AUTH skips validation. In production, you would place
AlertManager behind a reverse proxy that adds HMAC signatures, or use network-level
access control (Kubernetes NetworkPolicy, firewall rules).
"events": []?AlertManager POSTs do not include event-type headers (X-GitHub-Event, X-GitLab-Event).
An empty events list means "accept all POSTs to this route" — no event filtering.
Now restart the gateway to pick up the new subscription:
hermes gateway stop
hermes gateway start
# Verify the subscription loaded
hermes webhook list
Sub-step 2d: Apply the crashloop scenario and observe
Apply the broken pod manifest and watch the restart count climb:
# Terminal 3: Apply the broken pod (reuses the crashloop scenario manifest)
kubectl apply -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
# Watch the pod restart count climb
watch kubectl get pods -n k8s-trouble-crashloop
How the AlertManager → Hermes chain works
┌──────────────────────────────────────────────────────────────────────┐
│ KIND Cluster │
│ │
│ ┌─────────────┐ scrapes ┌──────────────┐ │
│ │ kube-state- │──────────────▶│ Prometheus │ │
│ │ metrics │ every 30s │ :30091 │ │
│ └─────────────┘ └──────┬───────┘ │
│ ▲ │ rule fires │
│ │ watches │ (restarts > 1) │
│ ┌─────┴───────┐ ┌──────▼───────┐ │
│ │ crasher pod │ │ AlertManager │ │
│ │ CrashLoop │ │ :30093 │ │
│ │ BackOff │ └──────┬───────┘ │
│ └─────────────┘ │ POST webhook │
│ │ │
└─────────────────────────────────────────┼────────────────────────────┘
│
▼
┌───────────────────────┐
│ Hermes Gateway │ ← your Mac
│ :8644 │
│ /webhooks/alertmanager│
└───────────┬───────────┘
│ spawns agent
▼
┌───────────────────────┐
│ Hermes Agent │
│ loads sre-k8s-pod- │
│ health skill │
│ runs kubectl get/ │
│ describe/logs │
│ produces diagnosis │
└───────────────────────┘
Expected timeline
- t=0s: pod applied, status ContainerCreating
- t=10s: pod status CrashLoopBackOff, restartCount=1
- t=30s: Prometheus scrapes kube-state-metrics, sees restartCount > 1
- t=60s: rule fires → alert goes PENDING → FIRING (no for: delay)
- t=70s: AlertManager dispatches POST to host.docker.internal:8644
- t=75s: Hermes gateway accepts the webhook, spawns an agent run
- t=110s: agent completes the diagnosis (kubectl calls + LLM reasoning ~35s)
Open the AlertManager UI at http://localhost:30093 to see the active alert and its receiver
routing to hermes-webhook.
How to verify each step
# 1. Pod is crashing?
kubectl get pods -n k8s-trouble-crashloop
# Expected: CrashLoopBackOff with restarts > 1
# 2. Prometheus sees the metric?
# Open http://localhost:30091 → Query tab → run:
# kube_pod_container_status_restarts_total{namespace="k8s-trouble-crashloop"}
# Expected: value > 1
# 3. Alert is firing?
# Open http://localhost:30091 → Alerts tab
# Expected: PodCrashLooping in FIRING state
# 4. AlertManager received the alert?
# Open http://localhost:30093
# Expected: PodCrashLooping alert with receiver hermes-webhook
# 5. Gateway received the POST?
tail -20 ~/.hermes/logs/gateway.log | grep -i "accepted\|alertmanager"
# Expected: {"status": "accepted", "route": "alertmanager", ...}
# 6. Agent ran and produced output?
tail -50 ~/.hermes/logs/gateway.log | grep -i "response ready"
# Expected: response ready: platform=webhook ... response=NNNN chars
Where to see the agent's diagnosis
With --deliver log, the agent's full diagnosis is written to the gateway log:
# View the full agent response
tail -200 ~/.hermes/logs/gateway.log | grep -A100 "Response for webhook:alertmanager"
These use different delivery mechanisms:
- Cron jobs (--deliver local): saved to ~/.hermes/cron/output/<job-id>/*.md
- Webhook triggers (--deliver log): written to ~/.hermes/logs/gateway.log
In production, both can deliver to Telegram or Slack instead.
You now have both trigger mechanisms running:
- Cron asks "is anything wrong?" It fires on a schedule whether or not an alarm has fired. Use it for proactive health checks and daily summaries.
- Webhook reacts to "something IS wrong." It fires in response to an external event -- an AlertManager alert, a Kubernetes pod crash, a GitHub PR event. Use it for incident-response automation where latency matters.
Both can load the same skill and run the same investigation prompt. The difference is timing and trigger: scheduled vs event-driven.
Cleanup after observing
kubectl delete -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
hermes webhook unsubscribe alertmanager
Step 3: K8s CronJob -- Agent as a Kubernetes Workload (15 min)
The Hermes cron jobs you built in Step 1 are the production pattern for most agent work. This step demonstrates the SAME agent wrapped in a native K8s CronJob -- and makes explicit WHEN each pattern is the right answer.
Build the minimal hermes-agent image
# Build the minimal hermes-agent container image
# (This takes 5-10 min -- the image is ~700-900MB of Python deps)
docker build -t hermes-lab:cronjob infrastructure/scenarios/k8s/cronjob/
Verify the image exists:
docker images | grep hermes-lab
The infrastructure/scenarios/k8s/cronjob/Dockerfile uses python:3.12-slim as a base, not the official nousresearch/hermes-agent:latest (which is 2-3GB with Playwright/ffmpeg). This minimal image is a teaching artifact about packaging agents for K8s -- keeping the image lean makes KIND loads faster and demonstrates production best practices for agent containers.
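One plausible shape for that Dockerfile -- a sketch only, not the actual course file; the kubectl install step, requirements file, and entrypoint are assumptions:
FROM python:3.12-slim

# kubectl so the agent can query the cluster (version and arch are illustrative)
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && curl -fsSLo /usr/local/bin/kubectl \
       https://dl.k8s.io/release/v1.30.0/bin/linux/amd64/kubectl \
    && chmod +x /usr/local/bin/kubectl \
    && rm -rf /var/lib/apt/lists/*

# Agent + Python deps only -- no Playwright/ffmpeg, which is what keeps this
# image in the ~700-900MB range instead of the 2-3GB official image
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# ANTHROPIC_API_KEY arrives at runtime via a Secret-backed env var (see below)
ENTRYPOINT ["hermes"]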
Load the image into KIND and create the API key secret
# Load into KIND (required -- the CronJob uses imagePullPolicy: IfNotPresent, not a registry)
kind load docker-image hermes-lab:cronjob --name lab
# Create the API key secret (NEVER commit this token -- it lives only in your local KIND cluster)
kubectl create secret generic hermes-secrets \
--from-literal=anthropic-api-key="$ANTHROPIC_API_KEY"
# Verify the secret was created
kubectl get secret hermes-secrets
Apply the CronJob for Track C
kubectl apply -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-c
# Watch jobs spawn (schedule is */5 * * * * -- wait up to 5 min for the first run)
watch kubectl get jobs,pods
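While you wait, here is roughly what the manifest you applied looks like -- a sketch; the names, agent command, and any resource settings are assumptions, and the repo file is authoritative:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: agent-health-check
  labels:
    track: track-c             # matched by the -l track=track-c selector above
spec:
  schedule: "*/5 * * * *"      # every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: agent
              image: hermes-lab:cronjob
              imagePullPolicy: IfNotPresent   # uses the image loaded via kind load
              env:
                - name: ANTHROPIC_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: hermes-secrets
                      key: anthropic-api-key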
View logs from the first completed job
Once a job pod reaches Completed status, read its logs:
kubectl logs -l job-name=$(kubectl get jobs -o jsonpath='{.items[-1].metadata.name}')
Use Hermes cron when:
- The agent benefits from gateway-shared state (loaded skills, audit trail, conversation history)
- You want one-stop CLI management (hermes cron create/list/run/pause/resume)
- You are iterating fast -- tweak a prompt, re-register the cron, done. No image-rebuild cycle.
- You need audit trail context linking cron runs to skill and prompt versions
- You are not (yet) in Kubernetes
Use K8s CronJob when:
- Stateless one-shot diagnostics -- no state needed from previous runs
- GitOps schedule-in-git -- you want the schedule reviewed via PR and deployed via ArgoCD/Flux
- K8s-native observability -- Prometheus kube_job_status_* metrics, kubectl get jobs, Loki logs
- Multi-tenant isolation -- namespace isolation, NetworkPolicies, resource quotas, Secrets
The honest real-world stance: most agent work uses Hermes cron because state matters. K8s CronJob shines for fire-and-forget diagnostic jobs deployed alongside other K8s primitives via the same GitOps pipeline.
Cleanup
kubectl delete -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-c
kubectl delete secret hermes-secrets
Step 4: GitHub Webhook -- PR Review Bot (15 min)
GitHub webhooks need a public HTTPS endpoint to POST to. smee.io is a free public webhook proxy: you get a unique channel URL, GitHub POSTs to it, and a smee-client on your laptop forwards events to your local Hermes gateway.
This step requires a personal GitHub repo and a GitHub PAT. If you do not have these, skip to the Solo Learner fallback section at the bottom of this step -- you can simulate the full GitHub webhook flow without any external service using the bundled sample PR payload.
Sub-step 4a: Get a smee.io channel URL
Visit https://smee.io/ in your browser and click "Start a new channel". Copy the URL -- it looks like https://smee.io/abc123XYZ.
Sub-step 4b: Get a GitHub PAT with repo scope
- Open https://github.com/settings/tokens and click "Generate new token" then "Generate new token (classic)"
- Note: hermes-lab-triggers, Expiration: 30 days
- Scopes: check repo (includes read+write to PRs and comments)
- Click "Generate token" and copy the ghp_... value
Set the required environment variables:
export GITHUB_TOKEN="ghp_..." # Your PAT from step above
export SMEE_URL="https://smee.io/your-channel-id" # Your smee.io channel URL
Authenticate the gh CLI with your PAT:
gh auth login --with-token <<< "$GITHUB_TOKEN"
gh auth status # Should show "Logged in to github.com as <you>"
Sub-step 4c: Start the gateway and smee-client
# Terminal 1: Start the Hermes gateway
hermes gateway run
# Terminal 2: Run the smee setup script (foreground -- leave it running)
./infrastructure/scenarios/k8s/github-webhook/smee-setup.sh
Expected smee output:
================================================================
smee.io -> Hermes webhook gateway forwarder
================================================================
Source: https://smee.io/abc123XYZ
Target: http://localhost:8644/webhooks/github
Client: smee-client@5.0.0
================================================================
Forwarding https://smee.io/abc123XYZ to http://localhost:8644/webhooks/github
Sub-step 4d: Add the webhook to your GitHub repo
In your test GitHub repo: Settings -> Webhooks -> Add webhook
- Payload URL: your smee.io channel URL ($SMEE_URL)
- Content type: application/json
- Secret: (leave blank for the lab)
- Events: "Let me select individual events" then check Pull requests
- Active: checked
Click Add webhook. GitHub will send a test ping -- you should see Forwarding event to localhost:8644/webhooks/github in the smee terminal.
The smee-setup.sh script targets http://localhost:8644/webhooks/github. When you subscribe below, use hermes webhook subscribe github (NOT github-webhook). The route name after /webhooks/ MUST match the subscription name exactly, or events arrive at the gateway but no subscription receives them.
Subscribe the GitHub webhook
# Terminal 3: Subscribe with the agent prompt template and github_comment delivery
hermes webhook subscribe github \
--events "pull_request" \
--prompt "$(cat infrastructure/scenarios/k8s/github-webhook/agent-prompt-template.txt)" \
--deliver github_comment \
--deliver-chat-id "{repository.full_name}:{pull_request.number}"
# Verify the subscription is active
hermes webhook list
Trigger the event
Open a PR on your test repo (or push a commit to a branch that already has an open PR).
Watch the flow:
- smee terminal shows Forwarding event to localhost:8644/webhooks/github
- gateway terminal shows Received github webhook event
- ~10 seconds later: the agent runs, generates a review summary, and posts a comment back to the PR
- GitHub PR -- refresh the PR comments and you should see the Hermes review comment
Verify the comment landed:
gh pr view <PR_NUMBER> --repo <OWNER>/<REPO> --json comments
Solo Learner fallback -- no GitHub repo or smee.io needed
Skip the smee setup above and use the bundled sample payload instead:
# Subscribe with --deliver local instead of --deliver github_comment
hermes webhook subscribe github \
--events "pull_request" \
--prompt "$(cat infrastructure/scenarios/k8s/github-webhook/agent-prompt-template.txt)" \
--deliver local
# Inject the bundled sample payload (PR #42: feat(api): add /health readiness endpoint)
hermes webhook test github \
--payload @infrastructure/scenarios/k8s/github-webhook/sample-pr-payload.json
The agent runs identically -- the review goes to your terminal instead of back to GitHub. The sample payload is a valid GitHub PR webhook structure with all the fields the prompt template references (pull_request.number, pull_request.title, repository.full_name, etc.).
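Abridged, the sample payload follows the standard GitHub pull_request event shape (the PR number and title below come from the bundled sample; the repository name is a placeholder and deeper nested fields are elided):
{
  "action": "opened",
  "pull_request": {
    "number": 42,
    "title": "feat(api): add /health readiness endpoint"
  },
  "repository": {
    "full_name": "your-user/your-repo"
  }
}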
Cleanup
hermes webhook unsubscribe github
Step 5: Telegram Bot -- Chat Interface (15 min)
Telegram is the right primary chat platform for this lab: free, no admin approval, works for every learner globally. Hermes has a built-in interactive setup that handles bot creation, user allowlists, and home channel configuration in one step.
Telegram bots use long polling. Only ONE bot instance can poll at a time. If you previously
ran hermes gateway run in another terminal, stop it first with hermes gateway stop and
wait 30 seconds before starting, or you will get 409 Conflict errors. This is a Telegram
API restriction, not a Hermes bug.
Sub-step 5a: Create your bot via @BotFather (~2 min)
- Open Telegram (mobile app or https://web.telegram.org)
- Search for @BotFather (verified blue checkmark)
- Send /newbot
- Choose a display name (e.g., Hermes Lab Bot)
- Choose a username -- must end in bot (e.g., hermes_lab_yourname_bot)
- Copy the bot token -- it looks like 123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11
- Get your user ID: search for @userinfobot, send /start, copy the numeric ID
Sub-step 5b: Configure Telegram via hermes gateway setup
Instead of manually exporting environment variables, use the interactive setup:
hermes gateway stop # stop any running gateway first
hermes gateway setup
The setup wizard walks you through:
- Select Telegram from the platform list
- Paste your bot token from @BotFather
- Enter your user ID from @userinfobot (this creates the allowlist)
- Set the home channel ID -- use your user ID (your DM with the bot)
- Say yes to restart the gateway
The home channel is where Hermes delivers cron job results and notifications. When you
set your Telegram user ID as the home channel, cron jobs created with --deliver telegram
send their output directly to your Telegram DM. Without a home channel, delivery to
"telegram" fails silently because Hermes doesn't know which chat to target.
After setup completes, verify the gateway is running with Telegram connected:
hermes gateway status
# Expected: Gateway running, telegram connected
Sub-step 5c: Send slash commands from Telegram
- Open Telegram on your phone
- Search for your bot's username and open the chat
- Send:
/help
Expected: bot replies with available commands.
- Send the diagnostic command:
/diagnose k8s-trouble-crashloop
Within ~30 seconds, the bot replies with the agent's diagnosis from running kubectl commands against your KIND cluster.
- Check gateway status:
/status
Sub-step 5d: Conversational chat via Telegram
The Telegram bot is not limited to slash commands -- it is a full conversational interface
to your Hermes agent. Any free-form text you type goes directly to the agent as a prompt.
This is the same experience as hermes chat, but from your phone.
Try these conversational prompts:
Check all pods in the default namespace and tell me if anything looks unhealthy
What nodes are in my cluster and how much CPU is available?
I just deployed a new version of the API. Can you verify the rollout status?
The agent runs kubectl commands against your KIND cluster and replies in the same chat.
Both work identically. Slash commands like /diagnose are just text starting with /.
Free-form text works the same way. Use whichever feels natural.
Sub-step 5e: Set home channel for cron delivery
You can also set the home channel directly from Telegram by typing /sethome in the
bot chat. This tells Hermes "deliver cron job output and notifications to this chat."
/sethome
Expected: bot confirms this chat is now the home channel for Telegram delivery.
Now update your cron job to deliver via Telegram instead of local files:
# Delete old cron job
hermes cron delete daily-k8s-check
# Recreate with Telegram delivery
hermes cron create "0 8 * * *" \
"Run morning pod health check across all namespaces. Report only if pods or nodes show issues." \
--name "daily-k8s-check" \
--skill "sre-k8s-pod-health" \
--deliver telegram
# Trigger it
hermes cron run <job-id>
hermes cron tick
The diagnosis report now arrives in your Telegram DM instead of a local file.
Sub-step 5f: Default agent vs profile agent
When you use hermes -p track-c chat, Hermes loads Kiran's identity (SOUL.md),
config, and skills. The gateway runs the default agent -- it has access to
ALL globally installed skills (everything in ~/.hermes/skills/) but no profile-specific
identity.
| Mode | Identity | Skills Available | Use When |
|---|---|---|---|
| hermes -p track-c chat | Kiran (SOUL.md) | Profile skills only | Interactive investigation with behavioral rules |
| hermes chat (no profile) | Default Hermes | All global skills | General-purpose agent |
| Gateway (Telegram/Slack) | Default Hermes | All global skills | Chat interface from phone |
| Cron jobs | No identity | Global skills only | Scheduled automation |
The gateway currently runs as the default agent. To give it Kiran's identity for all Telegram interactions, you could copy the Track C SOUL.md into the global config:
cp ~/.hermes/profiles/track-c/SOUL.md ~/.hermes/SOUL.md
This makes the gateway agent behave like Kiran (with NEVER rules and K8s expertise) for all chat platforms. Remove it to restore the default agent.
In production, you might want separate Telegram bots for different agents — one for the K8s health agent (Kiran), another for the FinOps agent, and a general-purpose bot.
Each Hermes profile can run its own gateway on a different port:
# Terminal 1: Kiran (Track C) on port 8644
hermes -p track-c gateway run
# Terminal 2: Default agent on port 8645
WEBHOOK_PORT=8645 hermes gateway run
Each gateway needs its own Telegram bot token (create multiple bots via @BotFather).
Set different TELEGRAM_BOT_TOKEN values in each profile's .env file
(~/.hermes/profiles/track-c/.env vs ~/.hermes/.env).
This gives you dedicated bots with different identities, skills, and behavioral rules — all running simultaneously.
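For example, with placeholder token values:
# ~/.hermes/profiles/track-c/.env -- token for Kiran's dedicated bot
TELEGRAM_BOT_TOKEN="123456:ABC-KiranBot..."

# ~/.hermes/.env -- token for the default agent's bot
TELEGRAM_BOT_TOKEN="654321:XYZ-DefaultBot..."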
Each conversation maintains session context within the gateway's lifetime. Follow-up messages like "what about the kube-system namespace?" work because the agent remembers the prior exchange. Restarting the gateway resets all sessions.
Cleanup
hermes gateway stop
Step 6: Slack Integration (Optional -- 15 min)
Slack is the production-grade chat platform for most DevOps teams. This step is optional because it requires Slack workspace admin access to create an app. If you have admin access, this gives you the same conversational agent experience as Telegram but inside your team's Slack workspace.
Skip this step if:
- You don't have Slack workspace admin access
- You're a solo Udemy learner without a team Slack
- You've already demonstrated the chat interface via Telegram
The Telegram bot from Step 5 covers the same learning objectives. Slack adds production realism but is not required for course completion.
Sub-step 6a: Create a Slack App (~5 min)
1. Go to https://api.slack.com/apps → Create New App → From Scratch
2. Name it (e.g., Hermes Lab Bot) and select your workspace
3. Enable Socket Mode: Settings → Socket Mode → Enable → Create App-Level Token with scope connections:write → copy the xapp-... token
4. Add Bot Token Scopes: Features → OAuth & Permissions → Bot Token Scopes: chat:write, app_mentions:read, channels:history, channels:read, groups:history (optional, for private channels), im:history, im:read, im:write, users:read, files:write
5. Subscribe to Events: Features → Event Subscriptions → Enable: message.im, message.channels, app_mention, message.groups (optional)
6. Install to Workspace: Settings → Install App → copy the xoxb-... token
7. Reinstall if you changed scopes or events after the initial install
8. Find your user ID: click your profile → three dots → Copy member ID
9. Invite the bot: in your channel, type /invite @HermesLabBot
Without message.channels, the bot ONLY works in DMs. Add both message.channels and app_mention to enable channel interaction.
Sub-step 6b: Configure Slack via hermes gateway setup
Just like Telegram, use the interactive setup:
hermes gateway stop
hermes gateway setup
Select Slack from the platform list and enter:
- Bot Token (xoxb-...) from step 6
- App Token (xapp-...) from step 3
- Allowed user IDs -- your member ID from step 8
- Home channel ID -- optional, or set later with /sethome in Slack
Say yes to restart the gateway. Both Telegram and Slack can run simultaneously.
Expected output after restart:
✓ telegram connected
✓ slack connected
Sub-step 6c: Talk to the agent in Slack
DM the bot:
Check all pods in the default namespace and tell me if anything looks unhealthy
The agent runs kubectl against your KIND cluster and replies in the DM.
In a channel (@ mention):
@HermesLabBot diagnose pods in k8s-trouble-crashloop namespace
The bot replies in a thread under your message.
Set home channel for delivery:
Type /sethome in the Slack channel where you want cron reports delivered. This works
identically to the Telegram /sethome command.
Sub-step 6d: Cleanup
hermes gateway stop
| Aspect | Telegram | Slack |
|---|---|---|
| Setup | 2 min via @BotFather | 5 min via api.slack.com, needs admin |
| Access control | User ID allowlist | User ID allowlist + workspace boundary |
| Channels | Groups (limited) | Full channel support with threads |
| Threading | Flat conversation | Thread replies under trigger message |
| Best for | Personal/lab use, solo learners | Team environments, on-call workflows |
In production, most teams use Slack because it integrates with their existing incident response workflow (PagerDuty → Slack channel → agent responds in thread). Telegram is excellent for personal agents and labs where Slack admin access is not available.
Verification Checklist (5 min)
Run these commands to confirm all 5 trigger types completed successfully:
# 1. Cron -- scheduler running and daily job registered
hermes cron status
# Expected: Scheduler: running, daily-k8s-check listed
hermes cron list
# Expected: daily-k8s-check 0 8 * * * sre-k8s-pod-health scheduled
# 2. AlertManager -- Prometheus stack and PrometheusRule deployed
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
# Expected: alertmanager-monitoring-kube-prometheus-alertmanager-0 Running
kubectl get prometheusrule -n monitoring hermes-lab-rules
# Expected: row exists
# 3. K8s CronJob -- image built and loaded
docker images hermes-lab:cronjob
# Expected: REPOSITORY: hermes-lab, TAG: cronjob
# 4. GitHub webhook -- smee-setup.sh is executable
test -x infrastructure/scenarios/k8s/github-webhook/smee-setup.sh && echo OK
# Expected: OK
# Verify sample PR payload is valid JSON
jq . infrastructure/scenarios/k8s/github-webhook/sample-pr-payload.json > /dev/null && echo OK
# Expected: OK
# 5. Telegram -- python-telegram-bot installed
python3 -c "from telegram import Bot; print('telegram OK')"
# Expected: telegram OK
- Hermes cron job created, triggered, paused, resumed
- AlertManager fired PodCrashLooping alert and agent diagnosed automatically
- K8s CronJob ran containerized agent on schedule
- GitHub webhook forwarded PR event and agent posted review (or Solo Learner fallback tested)
- Telegram bot responded to /help, /diagnose, AND free-form conversational prompts
- (Optional) Slack bot responded to DMs and channel @ mentions
FREE EXPLORE PHASE -- 20 minutes
Choose challenges based on your available time and experience level.
Challenge 1 (Starter -- 10 min): Cross-namespace cron
Create a cron job that checks pods across both kube-system and default namespaces:
hermes cron create "*/5 * * * *" \
"Run pod health check across kube-system and default namespaces. Report pod counts, restart counts, and any non-Running pods." \
--name "cross-namespace-check" \
--skill "sre-k8s-pod-health" \
--deliver local
Trigger it manually to see the output:
hermes cron run <job-id> # use the ID from hermes cron list
Questions to explore:
- How many pods are running in kube-system vs default?
- Does the agent correctly identify which namespace each pod belongs to?
- What happens if you add a third namespace to the prompt?
Challenge 2 (Intermediate -- 10 min): AlertManager to Telegram notification chain
Wire AlertManager alerts to your Telegram bot instead of the terminal:
# Stop the gateway; wait for Telegram long polling to release (avoids 409 Conflict)
hermes gateway stop
sleep 30
# Only needed if you skipped hermes gateway setup in Step 5:
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_ALLOWED_USERS="987654321"
# Point the alertmanager subscription at Telegram delivery
# (AlertManager subscriptions are edited in the file directly -- see Step 2c)
cat > ~/.hermes/webhook_subscriptions.json << 'EOF'
{
  "alertmanager": {
    "description": "AlertManager PodCrashLooping webhook",
    "events": [],
    "secret": "INSECURE_NO_AUTH",
    "prompt": "AlertManager alert fired. Details: {alerts}. Diagnose the affected pod.",
    "skills": ["sre-k8s-pod-health"],
    "deliver": "telegram"
  }
}
EOF
hermes gateway run
Apply the crashloop scenario and watch for the notification on your phone:
kubectl apply -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
Within ~2 minutes, your Telegram bot should send you the diagnosis automatically.
Cleanup:
kubectl delete -f infrastructure/scenarios/k8s/02-crashloop-backoff.yaml
hermes webhook unsubscribe alertmanager
Challenge 3 (Advanced -- 10 min): GitHub PR triggers full diagnosis and posts report
Combine the GitHub webhook with a Kubernetes-aware prompt so the agent checks cluster health when a PR touches K8s manifests:
hermes webhook subscribe github \
--events "pull_request" \
--prompt "PR #{pull_request.number} on {repository.full_name} modifies infrastructure. Run a full K8s health check across all namespaces and post the results as a PR comment." \
--skill "sre-k8s-pod-health" \
--deliver github_comment \
--deliver-chat-id "{repository.full_name}:{pull_request.number}"
Open a PR that modifies a YAML file in your test repo. The agent should:
- Receive the PR event via smee.io
- Run a full cluster health check against your KIND cluster
- Post the health report as a comment on the PR
Questions to explore:
- Does the agent correctly correlate the PR content with cluster state?
- What changes in the report if you have a crashloop pod running vs a clean cluster?
Closing
What you built in this lab:
- A cron-scheduled health check that fires at 8 AM daily, loads a domain skill, and delivers findings to your terminal (or Slack/Telegram in production)
- A real Prometheus + AlertManager pipeline on KIND firing on a broken pod, with the agent receiving the alert and diagnosing without manual invocation
- A K8s CronJob running a containerized Hermes agent on a schedule, with explicit "use this when..." framing for Hermes cron vs K8s CronJob
- A GitHub webhook via smee.io routing real PR events to Hermes, with the agent posting review comments back via the built-in github_comment delivery type
- A Telegram bot you can poke from your phone, with slash commands and immediate agent responses
Key commands reference:
# Cron management
hermes cron status # Always run at session start
hermes cron create "<schedule>" "<prompt>" --name ... --skill ... --deliver local
hermes cron list
hermes cron run <job-id> # Manual fire (use ID from cron list, not name)
hermes cron pause <job-id>
hermes cron resume <job-id>
hermes cron delete <name>
# Webhook management
hermes gateway run # Start the webhook gateway
curl http://localhost:8644/health # Verify endpoint is live
hermes webhook list
hermes webhook test <name> --payload '{"key": "value"}'
hermes webhook unsubscribe <name>
# AlertManager: edit ~/.hermes/webhook_subscriptions.json directly
# (set secret=INSECURE_NO_AUTH, events=[], deliver=log)
# Trigger-specific commands
# AlertManager webhook: configure via webhook_subscriptions.json (see Step 2c)
hermes webhook subscribe github --events "pull_request" --prompt "..." --deliver github_comment
kubectl apply -f infrastructure/scenarios/k8s/cronjob/agent-health-check.yaml -l track=track-c
./infrastructure/scenarios/k8s/github-webhook/smee-setup.sh
hermes gateway run # With TELEGRAM_BOT_TOKEN set, activates Telegram adapter
hermes gateway stop # Stop the gateway before restarting with a new configuration
Next: Module 13 covers governance -- approval workflows, maturity levels, and audit trails. The cron and webhook triggers you built here become the entry points for governed agent actions in Module 13.