
Lab: Triaging a Broken Fleet with Platform AI

Duration: 75 minutes
Deliverable: A completed Platform AI Assessment for your environment (starter/platform-ai-assessment.md)
Prerequisites: AWS account with free tier active, AWS CLI configured (aws configure), an EC2 key pair (or we'll create one in Section 0)


Overview

Instead of creating a single instance and generating load by hand, this lab hands you a pre-broken fleet of three EC2 instances inside a VPC that has its own set of network-level misconfigurations. Your job is to triage them using platform AI tools and see what each tool catches and what it misses.

The fleet is deployed by a single CloudFormation stack. The stack is self-contained: VPC, subnet, IGW, security groups, IAM role, three instances, and three CloudWatch alarms. Teardown is one command.

The three instances you're about to meet:

  • alpha-noisy: CPU pinned at 100%. Should be caught by native CloudWatch (CPU is a free metric).
  • beta-leaky: memory grows 80 MB/min until OOM. Should be caught by the CloudWatch Agent (native CW can't see memory).
  • gamma-disk: root filesystem fills at 40 MB/min, plus a broken /etc/fstab line. Should be caught by a custom CW metric (the disk) and Amazon Q (the fstab).

And at the network layer:

  • An overly permissive security group (SSH 0.0.0.0/0) — Amazon Q should flag it on template review.

The four tool passes you'll make:

  1. CloudWatch native metrics — what AWS gives you for free, out of the box.
  2. CloudWatch Agent + custom metrics — memory and disk signals that native CloudWatch can't see.
  3. Amazon Q in the Console — the purple Q icon on the AWS Console. Code/config review and account-aware Q&A, no Builder ID or IDE setup needed.
  4. AWS DevOps Agent (the 2026 agentic service — not DevOps Guru) — observed via the Sample Investigation in CloudWatch.

The answer key lives in starter/fleet-issues.md — don't peek until you've tried each tool pass.


Section 0: One-Time Setup (10 min)

0.1 — Enable Cost Explorer

Free but must be enabled once. First data load can take up to 24 hours.

AWS Console → Billing and Cost Management → Cost Explorer → Launch Cost Explorer.

0.2 — Locate Amazon Q in the AWS Console

No signup or Builder ID needed for the lab. When you sign in to the AWS Console, look at the left edge (or top-right on some layouts) — there's a purple Q icon. Click it; the Amazon Q panel expands with a chat box titled "How can I help you today?". That's the tool you'll use in Section 3.

Optional: If you want Q inside your IDE later (VS Code / JetBrains), that's a separate product called Amazon Q Developer and it uses an AWS Builder ID from https://profile.aws.amazon.com. Not required for this lab.

0.3 — Create an EC2 key pair

If you don't already have one in the region you'll deploy to (default: us-east-1):

AWS Console → EC2 → Key Pairs → Create key pair → name it platform-ai-lab-key, type RSA, format .pem. Download and save it.

chmod 400 ~/Downloads/platform-ai-lab-key.pem
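
If you'd rather stay in the terminal, the same key pair can be created with the CLI (the --key-type and --key-format flags need a reasonably recent AWS CLI v2; adjust the output path to taste):

# Create the key pair and save the private key locally
aws ec2 create-key-pair \
--key-name platform-ai-lab-key \
--key-type rsa \
--key-format pem \
--query 'KeyMaterial' --output text \
--region us-east-1 > ~/Downloads/platform-ai-lab-key.pem
chmod 400 ~/Downloads/platform-ai-lab-key.pem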

0.4 — Launch the Broken Fleet stack

The stack template is hosted at s3://pai342/ec2_broken_fleet.yaml. Use the one-click quick-create URL below — it opens the AWS Console with the template and stack name pre-filled.

One-click launch URL:

https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/quickcreate?templateURL=https://pai342.s3.amazonaws.com/ec2_broken_fleet.yaml&stackName=platform-ai-lab

Then:

  1. Fill in parameters:
    • KeyName: platform-ai-lab-key (from Section 0.3).
    • AllowedSshCidr: narrow it to YOUR_IP/32 (find your IP at https://checkip.amazonaws.com); leaving it at 0.0.0.0/0 works for a short-lived training lab but is not recommended.
    • LatestAmiId: leave default (SSM resolves Amazon Linux 2023 automatically).
    • InstanceType: leave at t2.micro (free tier).
  2. Check "I acknowledge that AWS CloudFormation might create IAM resources" (needed for the CloudWatch Agent role).
  3. Click Create stack.

CLI alternative (one command):

# Note: "aws cloudformation deploy" only accepts a local --template-file; to launch
# straight from the hosted S3 URL, use create-stack instead.
aws cloudformation create-stack \
--stack-name platform-ai-lab \
--template-url https://pai342.s3.amazonaws.com/ec2_broken_fleet.yaml \
--parameters ParameterKey=KeyName,ParameterValue=platform-ai-lab-key \
--capabilities CAPABILITY_IAM \
--region us-east-1

The stack takes 3–5 minutes to create. When it shows CREATE_COMPLETE, check the Outputs tab — you'll see the three instance IDs and console shortcut URLs you'll use through the rest of the lab.
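
You can do the same from the CLI: block until creation finishes, then dump the Outputs (same stack name and region as above):

aws cloudformation wait stack-create-complete \
--stack-name platform-ai-lab --region us-east-1

aws cloudformation describe-stacks \
--stack-name platform-ai-lab --region us-east-1 \
--query 'Stacks[0].Outputs' --output table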

Why CloudFormation instead of run-instances? Because the failure modes live across five resource types (VPC, NACL, SG, EC2, Alarm) and nine resources total. Launching them one by one in the CLI is tedious and error-prone; a declarative stack makes the whole scenario reproducible and disposable.

0.5 — Wait ~10 minutes before starting Section 1

The CloudWatch Agent on beta-leaky and gamma-disk needs a cycle or two to publish its first metrics, and the memory leak / disk fill need time to ramp. Grab coffee. When you come back:

  • alpha-noisy should already be showing high CPU in CloudWatch.
  • beta-leaky memory usage should be climbing through the 50–75% range.
  • gamma-disk should be publishing a PlatformAILab/data_fill_percent custom metric that trends upward each minute.
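
A quick readiness check from the CLI: once both custom metric names show up here, you're good to start Section 1 (list-metrics can lag a few minutes behind the first datapoints):

aws cloudwatch list-metrics \
--namespace PlatformAILab \
--region us-east-1 \
--query 'Metrics[].MetricName' --output text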

Section 1: Tool Pass #1 — CloudWatch native metrics (15 min)

1.1 — See what native EC2 metrics reveal

# Pull CPU for each instance (use the Instance IDs from your stack Outputs tab)
for ID in <alpha-id> <beta-id> <gamma-id>; do
echo "=== $ID ==="
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=$ID \
--start-time $(date -u -v-30M '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -d '30 minutes ago' '+%Y-%m-%dT%H:%M:%SZ') \
--end-time $(date -u '+%Y-%m-%dT%H:%M:%SZ') \
--period 300 \
--statistics Average \
--region us-east-1
done

What you should see:

  • alpha-noisy → CPU at or above 95% (our alarm fires at 70%).
  • beta-leaky → CPU low. Native CloudWatch gives you no hint about the memory issue.
  • gamma-disk → CPU low. Native CloudWatch gives you no hint about the disk issue.

This is the first gap. Native EC2 metrics only show CPU, network, and status checks. Memory and disk require the agent.

1.2 — Review the alarms already firing

aws cloudwatch describe-alarms \
--alarm-name-prefix platform-ai-lab- \
--state-value ALARM \
--region us-east-1 \
--query 'MetricAlarms[].[AlarmName,StateReason]' \
--output table

Record which alarms are in ALARM vs OK vs INSUFFICIENT_DATA in your assessment.
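
The command above filters to ALARM only. To see all three states in one table, drop the --state-value filter and query StateValue instead:

aws cloudwatch describe-alarms \
--alarm-name-prefix platform-ai-lab- \
--region us-east-1 \
--query 'MetricAlarms[].[AlarmName,StateValue,StateReason]' \
--output table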

1.3 — CloudWatch Anomaly Detection (DEMO)

Anomaly detection alarms cost $0.30/alarm beyond the free tier, so this pass is observation only. In the Console: Alarms → Create Alarm → select a metric → Additional configuration → Anomaly detection band. Don't create one. The point is that anomaly detection only helps on metrics you're already collecting.
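
For reference only (don't run it during the lab): attaching an anomaly detection model to a metric from the CLI looks roughly like this. The model itself isn't the billed item; the anomaly detection alarm built on top of it is.

aws cloudwatch put-anomaly-detector \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=<alpha-id> \
--stat Average \
--region us-east-1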


Section 2: Tool Pass #2 — CloudWatch Agent + custom metrics (10 min)

The stack publishes to the PlatformAILab namespace two different ways:

  • CW Agent on beta-leaky → emits mem_used_percent with only InstanceId as a dimension.
  • A cron script on gamma-disk → emits data_fill_percent with only InstanceId as a dimension, computed from df output every minute.

Why two mechanisms instead of just the CW Agent everywhere? The agent's disk plugin publishes four dimensions (InstanceId, path, device, fstype) and alarms need an exact dimension-tuple match. A custom script gives us control over dimensions — a useful pattern when agent defaults don't line up with what you want to alarm on. (This is itself a platform-AI gap: CloudWatch tooling exists but wiring it up correctly is still your job.)
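
As a concrete example of that pattern, here is a minimal sketch of what a per-minute publisher on gamma-disk could look like. This is not the stack's exact script: the mount point, IMDSv2 token handling, and df flags are assumptions based on Amazon Linux 2023 defaults.

#!/usr/bin/env bash
# Sketch of a per-minute custom-metric publisher (assumes AL2023, root filesystem).
# 1. Get this instance's ID via IMDSv2.
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)
# 2. Measure filesystem usage as a bare percentage (strip the % sign).
USED_PCT=$(df --output=pcent / | tail -1 | tr -dc '0-9')
# 3. Publish one datapoint with InstanceId as the only dimension.
aws cloudwatch put-metric-data \
  --namespace PlatformAILab \
  --metric-name data_fill_percent \
  --dimensions InstanceId="$INSTANCE_ID" \
  --unit Percent \
  --value "$USED_PCT" \
  --region us-east-1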

# See all PlatformAILab metrics
aws cloudwatch list-metrics \
--namespace PlatformAILab \
--region us-east-1 \
--output table

# Memory for beta-leaky
aws cloudwatch get-metric-statistics \
--namespace PlatformAILab --metric-name mem_used_percent \
--dimensions Name=InstanceId,Value=<beta-id> \
--start-time $(date -u -v-30M '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -d '30 minutes ago' '+%Y-%m-%dT%H:%M:%SZ') \
--end-time $(date -u '+%Y-%m-%dT%H:%M:%SZ') \
--period 300 --statistics Average --region us-east-1

# Disk fill for gamma-disk
aws cloudwatch get-metric-statistics \
--namespace PlatformAILab --metric-name data_fill_percent \
--dimensions Name=InstanceId,Value=<gamma-id> \
--start-time $(date -u -v-30M '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -d '30 minutes ago' '+%Y-%m-%dT%H:%M:%SZ') \
--end-time $(date -u '+%Y-%m-%dT%H:%M:%SZ') \
--period 300 --statistics Average --region us-east-1

Key observation: With the agent (and a small custom publisher), you now see the issues native CloudWatch missed. Two of the three problems only become visible once you opt in to some form of agent — a real-world reminder that "I have CloudWatch" doesn't mean "I have observability."

What the agent still doesn't tell you:

  • Why memory is growing (is it a leak? a legitimate cache? a spike in traffic?).
  • What's writing to the disk (logs that should be rotated, or a runaway process?).
  • Whether the /etc/fstab on gamma-disk will prevent the next reboot from succeeding.

Section 3: Tool Pass #3 — Amazon Q in the Console (15 min)

Amazon Q is already available in your AWS Console — no install, no Builder ID needed. Click the purple Q icon (left panel or top-right, depending on your layout) to open the chat panel.

3.1 — Ask Q about what's running in your account

Start with a scoped, account-aware question Q can actually answer from the live console:

List the EC2 instances in my account tagged with the CloudFormation stack "platform-ai-lab" and tell me which security groups they use.

What you should see: Q returns the three instances by name, with their SG attachments. Notably, it can already tell you that alpha-noisy is in a security group with 0.0.0.0/0 on port 22 — platform AI catches issue #S-1 for free, just by asking.
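
To double-check Q's answer against the raw API (CloudFormation tags every instance it creates with aws:cloudformation:stack-name; only the first attached SG is shown per instance):

aws ec2 describe-instances \
  --filters "Name=tag:aws:cloudformation:stack-name,Values=platform-ai-lab" \
  --query 'Reservations[].Instances[].[Tags[?Key==`Name`]|[0].Value, InstanceId, SecurityGroups[0].GroupId]' \
  --output table --region us-east-1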

3.2 — Review the CloudFormation template

Open starter/broken-fleet.yaml in any editor and paste its contents into Q with:

Review this CloudFormation template for reliability, security, and operational issues. List every concern you can find.

What Q should surface (roughly in this order):

  • PermissiveSg allows SSH from 0.0.0.0/0 (issue #S-1).
  • cpu-burn.service and mem-leak.service look like deliberate fault injection, not real workloads (issues #E-1, #E-2).
  • The broken fstab UUID in GammaDisk's user-data will fail on reboot (issue #E-4).
  • The disk-fill cron on gamma-disk will eventually exhaust the root filesystem (issue #E-3).

Compare Q's findings against the answer key in fleet-issues.md. Typical result: Q finds 3–4 of the 5 issues reliably; strongest on the SG and code-level issues.
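
Independently of the template text, you can confirm the #S-1 finding against the live account. Any group with a 0.0.0.0/0 rule and a port-22 rule will match this filter:

aws ec2 describe-security-groups \
  --filters Name=ip-permission.cidr,Values=0.0.0.0/0 Name=ip-permission.from-port,Values=22 \
  --query 'SecurityGroups[].[GroupId,GroupName]' \
  --output table --region us-east-1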

3.3 — Test Q's context limits

Paste the alarm output from Section 1 into Q and ask:

Based on this alarm state, what would you investigate first and why?

What you'll see: Q gives generic "check CPU saturation on alpha-noisy" advice. It doesn't know your topology, it doesn't know beta-leaky has a memory leak (CPU is normal there), and it can't follow your team's runbook.

The gap between Q's generic advice and what you now know (from the answer key) is exactly the space custom agents fill in Module 3 onward.


Section 4: Tool Pass #4 — AWS DevOps Agent (DEMO, 10 min)

Important distinction. AWS has two related services with confusingly similar names:

  • Amazon DevOps Guru (2020) — passive, ML-based anomaly detection. Fires "insights" when metrics deviate. No investigation, no action, no agent loop.
  • AWS DevOps Agent (GA March 31, 2026) — an agentic SRE teammate. When a CloudWatch alarm fires, it builds a topology map, pulls logs/traces/code changes, proposes a hypothesis, and suggests a fix. Delivered inside CloudWatch as "Investigations."

This lab uses the newer DevOps Agent, because it's the closest thing AWS sells to what Modules 3+ will teach you to build yourself.

4.1 — Observe a sample investigation

Go to CloudWatch Console → AI Operations → Investigations. If you haven't enabled DevOps Agent, click Sample Investigation — AWS provides a canned DynamoDB throttling scenario you can walk through without enabling anything.

Walk through the sample and note:

  • Feed (left pane): observations, queries, evidence the agent gathered.
  • Hypothesis (right pane): the agent's root-cause guess with a "Show reasoning" expander.
  • Accept / Reject buttons: human-in-the-loop feedback that improves future suggestions.

4.2 — What it would do against our fleet (if enabled)

Given the alpha-cpu-high, beta-mem-high, and gamma-disk-full alarms we pre-created, DevOps Agent would:

  • Start an investigation when any alarm enters ALARM state.
  • Query the corresponding instance's logs, traces, and recent CloudFormation events.
  • Generate a hypothesis like "alpha-noisy CPU is saturated because a systemd unit cpu-burn.service is running stress-ng continuously."
  • Suggest remediation: disable the unit, cordon the instance, or roll back the stack.
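
You can reproduce that hypothesis chain by hand. A rough walkthrough over SSH (public IP from the EC2 console or stack Outputs; ec2-user is the default AL2023 login; the unit name matches what the template installs on alpha-noisy):

ssh -i ~/Downloads/platform-ai-lab-key.pem ec2-user@<alpha-public-ip>

# On the instance: confirm the offending unit and what it's running
systemctl status cpu-burn.service
top -b -n 1 | head -15          # stress-ng should sit at the top of the CPU column

# Remediation the agent might suggest (don't run it yet if you still want the alarm firing):
# sudo systemctl stop cpu-burn.service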

Why we don't enable it in this lab: 2-month free trial for new customers, then metered billing. We observe via the sample investigation instead.

4.3 — Where DevOps Agent still hits the ceiling

Even this purpose-built agentic tool can't:

  • Read your runbook and apply your escalation policy (it uses AWS's generic patterns).
  • File a ticket in your Jira/Linear with your custom fields.
  • Coordinate with teams outside AWS (on-call rotation in PagerDuty, comms in Slack).
  • Make decisions that require business context (e.g. "should we fail over to the other region during business hours?").

This is the punchline of Module 2. AWS has shipped the most capable platform-AI-for-DevOps on the market, and it still stops at a ceiling. The space above that ceiling — your runbook, your context, your policies — is exactly what custom agents fill. That's Module 3 onward.


Section 5: Complete the Platform AI Assessment (10 min)

Open starter/platform-ai-assessment.md and fill it in with what you just observed. In particular, use the coverage matrix in fleet-issues.md (once you've checked your work) to rate tool-vs-issue coverage honestly.


Section 6: Teardown (5 min) — Don't skip

aws cloudformation delete-stack \
--stack-name platform-ai-lab \
--region us-east-1

# Watch it disappear
aws cloudformation describe-stacks \
--stack-name platform-ai-lab \
--region us-east-1 \
--query 'Stacks[0].StackStatus'
# When this command errors with "does not exist", teardown is complete.
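
Or block until deletion finishes instead of polling:

aws cloudformation wait stack-delete-complete \
--stack-name platform-ai-lab \
--region us-east-1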

Everything the stack created — VPC, subnet, IGW, NACL, SGs, IAM role, EC2 instances, EBS volumes, alarms — is deleted in one call. That's the main reason we used CloudFormation: unambiguous cleanup.

What you do NOT need to tear down:

  • Cost Explorer (free, keep it enabled).
  • AWS Builder ID, if you created one for Amazon Q Developer (free; you'll reuse it in Module 7).

Wrap-Up

After completing this lab, you should be able to answer:

  1. Which AWS AI features are available to you at zero cost, and what do they each catch?
  2. Where is the "capability ceiling" of each tool — what does it not see?
  3. For each issue in this fleet, which tool was the best signal, and which tool was silent?

Bring your completed assessment to Module 3, where a custom agent (Hermes) will demonstrate what's possible when you cross that ceiling.


Appendix A — Offline / Mock-Data Fallback

If you cannot launch the stack (locked-down environment, no free tier, etc.):

cat infrastructure/mock-data/cloudwatch/describe-alarms-clean.json
cat infrastructure/mock-data/cloudwatch/describe-alarms-anomaly.json
cat infrastructure/mock-data/cost-explorer/normal-spend.json
cat infrastructure/mock-data/cost-explorer/anomaly-spike.json

The mock data mirrors roughly what the stack would produce. The tool passes in Sections 2–4 won't be live, but you can still complete the assessment using the mock data + the issues catalog as a scenario walkthrough.
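
If the mock files mirror the describe-alarms response shape (an assumption; open the files first to confirm), jq gives you the same alarm-state table you would have produced in Section 1.2:

jq -r '.MetricAlarms[] | [.AlarmName, .StateValue] | @tsv' \
  infrastructure/mock-data/cloudwatch/describe-alarms-anomaly.json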

Appendix B — Optional: Add Datadog to the mix

Datadog integrates cleanly with AWS and adds a second pane of glass so you can compare UX against CloudWatch. It's optional — skip if you don't already have a workspace.

B.1 — If you have a Datadog account

  1. In Datadog: Integrations → AWS → Add AWS Account. Follow the CloudFormation-based role setup (Datadog provides its own template URL).
  2. Wait ~10 min for metrics to flow in. You'll see CloudWatch metrics appear in Datadog under the aws.ec2.* namespace.
  3. Optional: install the Datadog Agent on the three instances via SSM Run Command, then compare host-level metrics (memory, disk, processes) to what the CloudWatch Agent is publishing.

B.2 — What Datadog adds over CloudWatch

  • Unified host, container, APM view (CloudWatch needs Container Insights + X-Ray to match).
  • Better dashboards out of the box.
  • Cross-cloud / cross-vendor — if you also run GCP or on-prem, Datadog sees both.

B.3 — What Datadog doesn't add

  • It still doesn't know your runbook, your ticketing system, or your deployment pipeline.
  • The detection-to-action gap is the same one CloudWatch has; Datadog just draws nicer graphs of the detection half.

The point of this appendix is not to learn Datadog — it's to confirm that the "platform AI ceiling" isn't a CloudWatch-specific limitation. Every observability platform hits the same ceiling; the gap above it is where custom agents live.