Evaluating Automation Candidates
Not Everything Should Be Automated
The instinct to automate everything is understandable: AI agents are powerful and the potential ROI is high. But automation has a cost: design time, testing, maintenance, governance overhead, and the risk that the wrong automation does more harm than no automation at all.
The framework in this module helps you make that trade-off systematically.
The Automation Quadrant
The core framework is a 2×2 matrix with two axes:
- Frequency: How often does the task occur?
- Complexity: How hard is it? (measured by error risk + tool count)
High Complexity
   ^
   |   ASSIST              PRIME
   |   (AI helps)          (Automate first!)
   |
   |------------------+------------------
   |
   |   SKIP                QUICK WINS
   |   (Not worth it)      (Script first)
   |
   +----------------------------------> High Frequency
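The quadrant assignment can be made concrete with a small classifier. This is a sketch: the thresholds (runs per month, combined complexity score) are illustrative assumptions you would tune to your own environment, and the `Task` fields are stand-ins for whatever inventory data you track.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    runs_per_month: int   # frequency proxy
    tool_count: int       # complexity input: how many tools it touches
    error_risk: int       # complexity input: 1 (low) .. 5 (high)

def quadrant(task: Task,
             freq_threshold: int = 8,
             complexity_threshold: int = 5) -> str:
    """Map a task onto the 2x2 automation matrix.

    Complexity is scored as error risk + tool count, matching the
    axis definition above; both thresholds are illustrative.
    """
    high_freq = task.runs_per_month >= freq_threshold
    high_complexity = (task.error_risk + task.tool_count) >= complexity_threshold

    if high_freq and high_complexity:
        return "PRIME"       # automate first
    if high_freq:
        return "QUICK WIN"   # script first
    if high_complexity:
        return "ASSIST"      # AI helps, human stays in the loop
    return "SKIP"            # keep manual

triage = Task("morning alert triage", runs_per_month=22, tool_count=4, error_risk=4)
print(quadrant(triage))  # PRIME
```

Treating frequency and complexity as independent boolean tests keeps the four outcomes mutually exclusive, which mirrors how the quadrant diagram is read.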
Quadrant B: PRIME CANDIDATES (High Frequency + High Complexity)
These are your highest-ROI automation targets. They happen constantly, involve multiple tools, and carry significant error risk under pressure.
Examples: Morning alert triage (daily, 4 dashboards, pager fatigue), pre-deployment checklist (multiple times per week, 5-7 tools, high stakes), incident escalation triage (daily, PagerDuty + Jira + Slack + runbook).
Why these are PRIME: Every instance is an opportunity to reduce mean time to detection (MTTD) and mean time to resolution (MTTR). Automation pays back the investment quickly.
Quadrant C: QUICK WINS (High Frequency + Low Complexity)
These happen constantly but aren't complex. A simple script or a narrow single-tool agent works here.
Examples: Certificate expiry checks (daily, one AWS CLI command), disk usage alerts (daily, one command), restart failed services by name.
Recommendation: Write a script first. If the task needs conditional logic or decision-making, then consider an agent.
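As an example of the "script first" recommendation, a certificate expiry check fits in a few lines of Python (shown here instead of the AWS CLI so it is self-contained; the host list is a placeholder):

```python
import socket
import ssl
from datetime import datetime, timezone

def parse_not_after(not_after: str) -> datetime:
    """Parse the 'notAfter' field as returned by ssl.getpeercert()."""
    dt = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return dt.replace(tzinfo=timezone.utc)

def days_until_expiry(host: str, port: int = 443) -> int:
    """Connect to host and return days until its TLS cert expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return (parse_not_after(not_after) - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host in ("example.com",):  # placeholder host list
        days = days_until_expiry(host)
        flag = "  <-- RENEW SOON" if days < 30 else ""
        print(f"{host}: {days} days{flag}")
```

No conditional branching beyond a threshold, one tool, one deterministic output: exactly the profile where a script beats an agent.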
Quadrant A: ASSIST MODE (Low Frequency + High Complexity)
These are rare but high-stakes. An AI-assisted tool that helps when they occur is valuable — even if it doesn't fully automate them.
Examples: Disaster recovery runbook (quarterly, 20+ steps, high stakes), major version upgrades (annually, cross-team coordination), novel failure modes (unpredictable frequency, complex root cause).
Recommendation: Build an AI assistant or copilot — the agent helps the human, but the human stays in the loop. Full automation is risky here because the task requires judgment that benefits from human oversight.
Quadrant D: SKIP (Low Frequency + Low Complexity)
These don't happen often and are easy when they do. The cost of building, testing, and maintaining automation exceeds the cost of just doing it manually.
Examples: Monthly one-off reports, ad-hoc queries to a single service, one-time configuration tasks.
Recommendation: Do it manually. Document it well. Don't automate it.
When an Agent Is Overkill
An AI agent is overkill when:
- The task has no decision-making (single command, always the same output)
- The task involves a single tool (one CLI call, one API request)
- The task never varies (same inputs → same outputs, no conditional logic)
For these tasks, write a shell script. Agents add value when they need to reason, choose between options, or coordinate across multiple tools. They're not a better cron job — they're a better SRE.
Good agent candidates: Alarm triage (reads alarm, checks deployments, follows runbook, files ticket), deployment validation (runs smoke tests, checks rollback eligibility, compares metrics before/after), cost review (identifies anomalies, correlates with known events, drafts recommendations).
Poor agent candidates: Password reset (one API call, deterministic), DNS record update (one CLI call, no decision needed), SSL cert renewal (single certbot command, no judgment required).
The 5 Selection Criteria (Expanded)
The selection criteria in the lab assess whether a task is ready for agent automation:
1. Decomposable into discrete steps
Agents work through sequential tool use. If you can't write a numbered procedure for the task, the agent can't follow one either. This criterion is a proxy for "can I write a runbook for this?": if you can write a runbook, you can write a SKILL.md.
2. Tools accessible via CLI/API
The agent needs programmatic access to every system it touches. This is often the practical blocker — proprietary monitoring tools, GUI-only dashboards, or data sources behind SSO that can't be accessed via API. Identify these gaps before your capstone build.
3. Clear success/failure criteria
The agent needs to know when it's done. Tasks with binary outcomes (alarm resolved / alarm still firing) are easier to automate correctly than tasks with fuzzy outcomes ("the infrastructure looks healthy"). Define success before you build.
4. Safe with approval gates
Most operational tasks can be made safe with the right governance structure. The question isn't "is this dangerous?" — it's "can I define the safe boundaries?" A rollback command that requires human approval before execution is safe. A rollback command that runs automatically is a different risk profile.
5. Testable with mock data
This is a practical constraint for the Day 3 capstone specifically. You need to build and test your agent without risking production. CloudWatch JSON fixtures work for alarm triage. Mock kubectl output works for cluster health checks. If your task can only be tested against live production data, you can't safely iterate on the agent design.
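The fixture-driven pattern looks like this in practice. The alarm shape below is a simplified CloudWatch-style fixture (not the full DescribeAlarms response), and `triage_alarm` is a toy stub standing in for the real agent logic:

```python
import json

# Simplified CloudWatch-style alarm fixture; field names follow the
# real API but the structure is heavily trimmed for illustration.
ALARM_FIXTURE = json.dumps({
    "AlarmName": "api-5xx-rate-high",
    "StateValue": "ALARM",
    "StateReason": "Threshold Crossed: 5xx rate > 2%",
    "MetricName": "HTTPCode_Target_5XX_Count",
})

def triage_alarm(alarm_json: str) -> dict:
    """Toy triage logic: decide whether the alarm needs escalation.

    A real agent would also check recent deployments and follow the
    runbook; this stub only demonstrates fixture-driven testing.
    """
    alarm = json.loads(alarm_json)
    firing = alarm["StateValue"] == "ALARM"
    return {
        "alarm": alarm["AlarmName"],
        "escalate": firing and "5XX" in alarm["MetricName"],
    }

result = triage_alarm(ALARM_FIXTURE)
assert result == {"alarm": "api-5xx-rate-high", "escalate": True}
print("fixture test passed")
```

Because the fixture is plain JSON, you can iterate on the triage logic as fast as you can edit a file, with production nowhere in the loop.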
Real-World Examples
| Task | Score (approx) | Quadrant | Verdict |
|---|---|---|---|
| Morning CloudWatch alert review | 17/20 | PRIME | Build agent |
| Terraform drift detection | 14/20 | PRIME | Build agent |
| Pre-deploy readiness check | 15/20 | PRIME | Build agent |
| Monthly EC2 right-sizing | 10/20 | ASSIST | Build AI-assisted tool |
| Certificate expiry check | 8/20 | QUICK WIN | Write script first |
| Password reset | 5/20 | SKIP | Keep manual |
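The scores above can be reproduced with a simple rubric. This sketch assumes each of the five criteria is rated 0-4 (for a maximum of 20) and that the verdict cutoffs are as shown; both the example ratings and the thresholds are illustrative, not taken from the lab.

```python
CRITERIA = [
    "decomposable",      # discrete, runbook-able steps
    "api_accessible",    # every tool reachable via CLI/API
    "clear_success",     # binary done/not-done outcome
    "safe_with_gates",   # approval gates bound the risk
    "mock_testable",     # can iterate without production
]

def score(ratings: dict[str, int]) -> int:
    """Sum the five 0-4 ratings into a score out of 20."""
    assert set(ratings) == set(CRITERIA)
    assert all(0 <= v <= 4 for v in ratings.values())
    return sum(ratings.values())

def verdict(total: int) -> str:
    """Map a total score to a recommendation; cutoffs are illustrative."""
    if total >= 14:
        return "Build agent"
    if total >= 10:
        return "Build AI-assisted tool"
    if total >= 7:
        return "Write script first"
    return "Keep manual"

alert_review = {"decomposable": 4, "api_accessible": 4, "clear_success": 3,
                "safe_with_gates": 3, "mock_testable": 3}
total = score(alert_review)
print(total, verdict(total))  # 17 Build agent
```

Run against the table above, the same cutoffs reproduce each verdict: 17 and 15 build agents, 10 gets an AI-assisted tool, 8 gets a script, and 5 stays manual.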