CLOUD ON DEMAND I am not your thought leader™
Module OPR-007
Version 1.0 · May 2026
cloud-ondemand.com
Operator Module · Tier 2

SnowWork Agent Design
Autonomous Workflows with Human-in-the-Loop Governance

Design and deploy autonomous Snowflake workflow agents that handle real production tasks — with human-in-the-loop checkpoints, complete audit trails, and cost containment baked in from the first deployment.

Learning objectives
Level: Operator · Advanced
Duration: 90 minutes plus lab
Prerequisites: OPR-001 complete, RBAC basics

Why this matters

SnowWork agents are powerful, but most failed production deployments fail in the same three ways: tasks handed to agents that should not be autonomous, missing human checkpoints at high-risk decision points, and no cost caps until the credit bill exposes the problem. This module is for the deployment that doesn't repeat those failures.

Module Contents
What this module covers

This module covers the design and operations of SnowWork agents — autonomous Snowflake workflow components that handle scheduled tasks, anomaly response, and data quality monitoring. By the end you will have a production agent design pattern with audit, governance, and human-in-the-loop checkpoints.

Chapter 01
What SnowWork agents are

SnowWork agents are Snowflake's autonomous workflow components — they handle scheduled tasks, anomaly response, data quality monitoring, and operational runbooks without requiring human action for routine work. Unlike Cortex Code (which generates code on demand) or Cortex Analyst (which answers business questions), SnowWork agents operate continuously on long-running responsibilities.

An agent monitoring data quality runs every hour, checks defined rules, flags anomalies, and routes findings. An agent handling warehouse FinOps watches consumption patterns, suspends warehouses idle longer than policy, and reports to a Slack channel. An agent for compliance reviews access logs nightly, identifies anomalous patterns, and produces an evidence package. This kind of work previously required dedicated platform engineers; SnowWork lets a small team automate it.

Three workflows where SnowWork earns its place

Workflow 01 — Scheduled operational tasks. Tasks that run on a schedule with deterministic outputs. Nightly data quality checks, daily FinOps summaries, weekly governance reports. SnowWork makes these reliable without constant engineer maintenance.

Workflow 02 — Anomaly response. Workflows that watch for specific patterns and act. Warehouse exceeding budget threshold triggers suspension. Failed pipeline detected triggers escalation. SnowWork agents excel at threshold-based response patterns.

Workflow 03 — Documentation and reporting. Producing weekly executive summaries, monthly compliance reports, quarterly cost attribution analyses. SnowWork agents can compose reports by pulling from ACCOUNT_USAGE, formatting findings, and delivering output to defined channels.

Where simpler scheduling beats SnowWork
Overkill for SnowWork

Single SQL query scheduled to run nightly. Snowflake Tasks handle this natively. Adding SnowWork orchestration is complexity without benefit.

SnowWork-worthy

Workflow requiring multi-step coordination, conditional branching, external integration (Slack/email), or LLM-based output composition. Snowflake Tasks alone can't handle these elegantly; SnowWork can.

The threshold of justification

If a workflow can be expressed as a single SQL statement, use Snowflake Tasks. If it needs three or more steps, conditional logic, or LLM-based output composition, SnowWork is the right tool. Don't reach for SnowWork because it's the newest thing; reach for it because the workflow's complexity actually justifies it.
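The rule of thumb above can be written down as a small check. This is a minimal sketch, not part of SnowWork: the `Workflow` fields and the `choose_scheduler` function are hypothetical names, and because the module gives no rule for a two-step workflow without branching, that case is returned as a judgment call.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    """Hypothetical description of a candidate workflow."""
    steps: int                      # number of distinct steps
    conditional_logic: bool = False
    llm_composition: bool = False   # LLM-based output composition


def choose_scheduler(wf: Workflow) -> str:
    """Apply the threshold of justification to one workflow."""
    if wf.steps >= 3 or wf.conditional_logic or wf.llm_composition:
        return "SnowWork"
    if wf.steps == 1:
        return "Snowflake Tasks"
    # The module gives no rule for this middle ground.
    return "judgment call"


print(choose_scheduler(Workflow(steps=1)))                          # Snowflake Tasks
print(choose_scheduler(Workflow(steps=4, conditional_logic=True)))  # SnowWork
```

Writing the rule as code forces the team to count steps and name the branching before anyone reaches for the newer tool.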

Chapter 02
Task suitability — autonomy vs supervision

Not every task that can be made autonomous should be. The five-question filter below separates good-fit autonomous tasks from dangerous ones. Run any candidate workflow through these five questions before designing the agent.

The five-question filter

Question 01 — Is the task reversible? If the agent makes a wrong decision, can the consequence be undone in minutes rather than days? Suspending a warehouse is reversible (resume it). Dropping a table is not. Reversible tasks are good candidates for full autonomy; irreversible tasks need human checkpoints.

Question 02 — Is the blast radius bounded? If the agent errors completely, what's the worst case? An agent that suspends warehouses might inconvenience users for an hour. An agent that modifies production data might corrupt downstream systems for weeks. Bounded blast radius means autonomy is appropriate.

Question 03 — Does the task have clear success criteria? Can you write a SQL query that confirms the task completed correctly? "Warehouse suspended" is verifiable. "User happy with the report" is not. Verifiable success criteria allow agents to know when they're done.

Question 04 — Is the input deterministic enough? Does the same input always produce the same correct output? FinOps queries against ACCOUNT_USAGE are deterministic. Decisions about ambiguous data ("is this PII?") are not. Deterministic inputs allow consistent agent behavior.

Question 05 — Would a senior engineer trust this agent? The final filter. If a senior engineer on your team wouldn't sign off on the agent operating without supervision, the agent needs supervision. Trust isn't something the agent earns automatically; it's something humans grant after watching the agent perform.
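One way to make the filter repeatable is to record the five answers and derive a starting autonomy level from them. The sketch below is illustrative only: `FilterAnswers`, `failed_questions`, and `suggested_autonomy` are hypothetical names, and the mapping (five yeses means full autonomy, anything else means recommend only) is an assumption. It does not distinguish the stricter levels, such as "never autonomous", that appear in the suitability matrix.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class FilterAnswers:
    """One yes/no answer per question in the five-question filter."""
    reversible: bool
    bounded_blast_radius: bool
    clear_success_criteria: bool
    deterministic_input: bool
    senior_engineer_trusts: bool


def failed_questions(answers: FilterAnswers) -> List[str]:
    """Names of the questions that were answered 'no'."""
    return [f.name for f in fields(answers) if not getattr(answers, f.name)]


def suggested_autonomy(answers: FilterAnswers) -> str:
    """Assumed mapping: five yeses -> full autonomy, otherwise recommend only."""
    return "full autonomy" if not failed_questions(answers) else "recommend only"


# Suspending an idle warehouse: reversible, bounded, verifiable.
suspend_idle = FilterAnswers(True, True, True, True, True)
# Dropping a table: not reversible, so it fails the first question.
drop_table = FilterAnswers(False, False, True, True, False)

print(suggested_autonomy(suspend_idle))
print(suggested_autonomy(drop_table), failed_questions(drop_table))
```

Keeping the list of failed questions alongside the result gives reviewers the reason for the decision, not just the verdict.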

The suitability matrix
Task type | Autonomy level | Pattern
Warehouse FinOps monitoring | Full autonomy | Reversible, bounded, deterministic
Auto-suspending idle warehouses | Full autonomy | Reversible, well-understood
Data quality alerting | Full autonomy | Read-only, bounded
Storage cleanup recommendations | Recommend only | Irreversible; humans approve
RBAC changes | Recommend only | Security impact; humans approve
PII detection and masking | Flag, never act | Compliance stakes; humans decide
Production schema changes | Never autonomous | Blast radius too broad
Operator Tip
When in doubt, start with "recommend only" rather than full autonomy. The agent produces output that requires a human approval click before executing. After three months of clean recommendations with no errors, consider promoting the agent to full autonomy. Promoting a proven agent is easy; demoting an out-of-control autonomous agent is painful.
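The promotion rule in this tip can be checked mechanically against the review history. The following is a sketch under stated assumptions: `Recommendation` and `eligible_for_promotion` are hypothetical names, "three months" is approximated as 90 days, and "clean" is taken to mean that every reviewed recommendation was judged correct.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Recommendation:
    made_on: date
    was_correct: bool  # outcome of the human review


def eligible_for_promotion(history: List[Recommendation], today: date,
                           min_days: int = 90) -> bool:
    """True when the history spans at least `min_days` and contains no errors."""
    if not history:
        return False
    first = min(r.made_on for r in history)
    long_enough = (today - first) >= timedelta(days=min_days)
    return long_enough and all(r.was_correct for r in history)


history = [Recommendation(date(2026, 1, 5), True),
           Recommendation(date(2026, 3, 2), True)]
print(eligible_for_promotion(history, today=date(2026, 5, 1)))
```

The check only says the agent may be considered for promotion; the decision itself stays with the humans who reviewed it.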
Chapter 03
The five-step design pattern
Step 01 — Scope
Define the agent's task with surgical precision
Write down: what the agent does, what triggers it, what inputs it reads, what outputs it produces, what success looks like, what failure looks like. Scope creep is the #1 cause of agent failures — agents asked to do "anything related to FinOps" produce inconsistent output and unmanageable governance surfaces.
Step 02 — Govern
Design the dedicated role and grants
Each agent gets its own Snowflake role with the narrowest possible privileges. A FinOps monitoring agent needs read access to the ACCOUNT_USAGE views and nothing else. A warehouse-suspension agent needs the OPERATE privilege on specific warehouses (the privilege that permits ALTER WAREHOUSE ... SUSPEND) and nothing else. Never reuse roles across agents.
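    CREATE ROLE AGENT_FINOPS_MONITOR;
    -- Note: IMPORTED PRIVILEGES exposes every schema in the shared SNOWFLAKE database,
    -- not only ACCOUNT_USAGE. A SNOWFLAKE database role such as USAGE_VIEWER is a narrower option.
    GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE AGENT_FINOPS_MONITOR;
    -- That's it. Nothing else.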
Step 03 — Instrument
Build in audit and observability from day one
Every agent action gets a QUERY_TAG identifying it. Every decision produces an audit log entry. Cost caps are set per agent. Build observability first; you cannot add it after the fact when something goes wrong.
Step 04 — Deploy
Roll out in shadow mode first
Deploy the agent in observe-only mode for the first week. It runs, produces output, but takes no action. Compare its decisions to what humans would have done. If alignment is high, promote to action mode. If decisions are off, refine before going live.
Step 05 — Validate
Run weekly review of agent decisions
For the first 90 days of any agent's life, schedule a weekly review of its decisions. Did it act when it should have? Did it skip when it should have skipped? Were costs in budget? Use the review to refine the agent or expand its scope.
Operator Tip
Shadow mode is non-negotiable. Every customer who skipped it had an incident in the first month. Every customer who ran shadow mode for a week caught at least one decision pattern they wanted to adjust before production. The week of shadow mode pays for itself before production deployment.
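Comparing shadow-mode output with human decisions is easier when the comparison is scripted. The sketch below assumes both sets of decisions have been collected into dictionaries keyed by a case identifier; `alignment_report` and the sample warehouse names are hypothetical, and the module does not define what counts as "high" alignment, so no threshold is applied here.

```python
from typing import Dict, List, Tuple


def alignment_report(agent: Dict[str, str],
                     human: Dict[str, str]) -> Tuple[float, List[str]]:
    """Compare shadow-mode decisions with human decisions, keyed by case id.

    Returns the share of shared cases where both chose the same action,
    plus the ids of the cases where they disagreed.
    """
    shared = sorted(agent.keys() & human.keys())
    if not shared:
        raise ValueError("no overlapping cases to compare")
    disagreements = [case for case in shared if agent[case] != human[case]]
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements


agent_log = {"wh_etl": "suspend", "wh_bi": "skip",
             "wh_adhoc": "suspend", "wh_ml": "suspend"}
human_log = {"wh_etl": "suspend", "wh_bi": "skip",
             "wh_adhoc": "skip", "wh_ml": "suspend"}

rate, diffs = alignment_report(agent_log, human_log)
print(f"alignment: {rate:.0%}, review: {diffs}")
```

The list of disagreements is usually more useful than the rate: each entry is a decision pattern to discuss before promoting the agent to action mode.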
Chapter 04
Diagnosing agent failures

SnowWork agents fail in patterns that differ from human failures. Below are the six most common, with the diagnostic and the fix for each.

Symptom | Root cause and fix
Agent takes action that contradicts policy | Scope creep: the agent was given broader responsibilities than its design supports. Narrow the scope; if multiple responsibilities are needed, split into multiple agents.
Agent consumes 10x expected credits | Retry loop or runaway condition. Check QUERY_HISTORY for repeated identical operations. Add explicit retry caps and exponential backoff.
Agent makes inconsistent decisions on same input | Non-deterministic LLM behavior on edge cases. For decisions that must be consistent, use explicit SQL rules rather than LLM judgment. Save the LLM for output composition, not decision logic.
Agent silently stops running | Underlying task suspended (often after multiple failed runs). Check TASK_HISTORY for run status. By default, Snowflake suspends a task after 10 consecutive failed runs (the SUSPEND_TASK_AFTER_NUM_FAILURES parameter).
Agent produces output nobody reads | Output destination wrong, or the right people aren't notified. Verify that the Slack channel, email list, or storage location is monitored. If nobody acts on the output, the agent has no value.
Agent works fine but humans don't trust it | Trust deficit. Add explicit confidence scores to agent output, increase audit log visibility, and schedule a monthly review with stakeholders showing the agent's track record. Trust is earned through transparency.
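The fix for the runaway-credit row, explicit retry caps with exponential backoff, can be sketched as a small wrapper. This is an illustrative helper, not a SnowWork API: `run_with_retry_cap` and its parameters are hypothetical, and the `sleep` argument is injectable only so the behavior can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry_cap(operation: Callable[[], T], max_attempts: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, retrying on failure with exponential backoff.

    Gives up and re-raises after `max_attempts` tries, so a failing
    operation cannot loop forever and keep consuming credits.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # cap reached: surface the error instead of retrying
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

The cap bounds the number of attempts; the backoff spaces them out. Both are needed, because backoff without a cap still retries indefinitely.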

The trust principle

A technically correct agent that humans don't trust is worthless. A slightly imperfect agent that humans trust because they can see what it's doing is invaluable. Design for transparency, not just correctness. The audit trail isn't just for compliance — it's for the engineer who wonders at 11pm whether to trust the agent's last decision.

Chapter 05
Human-in-the-loop checkpoints

Human-in-the-loop is the pattern that makes autonomous agents safe in environments where mistakes are expensive. The agent does most of the work autonomously, but specific high-stakes decisions require explicit human approval before executing.

Three checkpoint design patterns

Pattern 01 — The recommendation queue. Agent produces recommendations and writes them to a table or sends them to a Slack channel. Humans review and approve via a simple UI or by responding to the Slack message. Approved recommendations execute; un-approved ones expire after a defined window. Best for irreversible actions like dropping tables or modifying RBAC.

Pattern 02 — The escalation threshold. Agent acts autonomously below a defined threshold. Above the threshold, it requires approval. A warehouse-suspension agent suspends idle warehouses freely; suspending a warehouse with active sessions requires approval. Threshold-based autonomy preserves speed for routine cases while protecting against exceptional ones.

Pattern 03 — The time-delayed action. Agent announces an intended action and executes after a defined delay if no one objects. "I will suspend warehouse XYZ in 30 minutes unless you reply STOP." Combines autonomy (action happens by default) with safety (humans can intercept). Best for medium-risk reversible actions.
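Patterns 02 and 03 reduce to small decision functions, sketched below. Everything here is illustrative: `route_suspension`, `AnnouncedAction`, and `resolve` are hypothetical names, the threshold (any active session requires approval) and the 30-minute delay come from the examples above, and delivery of the announcement to Slack or email is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


def route_suspension(active_sessions: int) -> str:
    """Pattern 02: act freely on idle warehouses, escalate otherwise."""
    return "suspend" if active_sessions == 0 else "request_approval"


@dataclass
class AnnouncedAction:
    """Pattern 03: an action announced with a delay before execution."""
    description: str
    announced_at: datetime
    delay: timedelta = timedelta(minutes=30)
    reply: Optional[str] = None  # latest human reply, if any


def resolve(action: AnnouncedAction, now: datetime) -> str:
    """Decide what to do with an announced action at time `now`."""
    if action.reply is not None and action.reply.strip().upper() == "STOP":
        return "cancelled"
    if now >= action.announced_at + action.delay:
        return "execute"
    return "waiting"


announced = datetime(2026, 5, 20, 2, 0)
action = AnnouncedAction("suspend warehouse XYZ", announced)
print(resolve(action, announced + timedelta(minutes=10)))
print(resolve(action, announced + timedelta(minutes=30)))
```

Checking for the STOP reply before checking the clock means an objection always wins, even if it is processed after the delay has elapsed.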

UX principles for checkpoints
Operator Tip
Slack is the right approval surface for most cases. A bot post with action buttons (Approve, Reject, More Info) integrates with where engineers already work. Custom web UIs are over-engineering for the first 6 months; switch only when you've outgrown Slack's threading model.
Chapter 06
Audit trails and cost containment

Production agents need two non-negotiable layers: complete audit trails for compliance and incident investigation, and explicit cost caps that prevent runaway consumption. Both are easier to add at deployment than to retrofit later.

The QUERY_TAG convention

Every operation the agent performs should set a QUERY_TAG identifying the agent, the workflow, and the action. The convention below gives auditors a single column to filter on:

    -- QUERY_TAG convention for agent operations
    ALTER SESSION SET QUERY_TAG = '{
      "agent": "finops_monitor",
      "version": "1.2",
      "workflow": "daily_warehouse_check",
      "action": "evaluate_suspension",
      "run_id": "2026-05-20-0234"
    }';
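Hand-writing the JSON for every operation invites typos, so the tag is usually generated. The sketch below builds the same five fields with Python's standard `json` module; `build_query_tag` and `set_query_tag_sql` are hypothetical helper names, and the length guard assumes Snowflake's documented 2000-character limit for QUERY_TAG. Where the client library supports bind parameters, those are preferable to string escaping.

```python
import json

MAX_TAG_LENGTH = 2000  # assumed from Snowflake's documented QUERY_TAG limit


def build_query_tag(agent: str, version: str, workflow: str,
                    action: str, run_id: str) -> str:
    """Serialize the five tag fields as compact JSON."""
    tag = {"agent": agent, "version": version, "workflow": workflow,
           "action": action, "run_id": run_id}
    tag_json = json.dumps(tag, separators=(",", ":"))
    if len(tag_json) > MAX_TAG_LENGTH:
        raise ValueError("query tag exceeds the QUERY_TAG length limit")
    return tag_json


def set_query_tag_sql(tag_json: str) -> str:
    """Wrap the tag in an ALTER SESSION statement.

    Backslashes and single quotes are escaped because both are special
    inside a single-quoted Snowflake string literal.
    """
    escaped = tag_json.replace("\\", "\\\\").replace("'", "''")
    return f"ALTER SESSION SET QUERY_TAG = '{escaped}';"


tag = build_query_tag("finops_monitor", "1.2", "daily_warehouse_check",
                      "evaluate_suspension", "2026-05-20-0234")
print(set_query_tag_sql(tag))
```

Generating the tag in one place also guarantees that every operation carries the same keys, which is what lets a single filter find all of an agent's activity.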
The audit view

Create a dedicated audit view over QUERY_HISTORY, filtered by the agent's QUERY_TAG. This becomes the single source of truth for what the agent did, when, and with what outcome.

    -- Agent audit view: single source of truth
    CREATE OR REPLACE VIEW GOVERNANCE.AGENT_AUDIT AS
    SELECT
        TRY_PARSE_JSON(query_tag):agent::VARCHAR    AS agent_name,
        TRY_PARSE_JSON(query_tag):workflow::VARCHAR AS workflow,
        TRY_PARSE_JSON(query_tag):action::VARCHAR   AS action,
        TRY_PARSE_JSON(query_tag):run_id::VARCHAR   AS run_id,
        start_time,
        execution_status,
        credits_used_cloud_services,
        rows_produced,
        error_message
    FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
    WHERE TRY_PARSE_JSON(query_tag):agent IS NOT NULL
      AND start_time >= DATEADD(day, -90, CURRENT_TIMESTAMP());
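Rows from the audit view can also feed a simple daily credit check per agent. This is a sketch under stated assumptions: `agents_over_daily_cap` is a hypothetical helper, rows are assumed to arrive as dictionaries using the view's column names, and `credits_used_cloud_services` covers cloud services credits only, so warehouse compute would have to be tracked separately (for example with a Snowflake resource monitor on the agent's warehouse).

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def agents_over_daily_cap(rows: Iterable[dict],
                          daily_cap: float) -> List[Tuple[str, str, float]]:
    """Sum credits per (agent, day) and return the pairs above `daily_cap`.

    Each row needs the audit view's `agent_name`, `start_time`
    (a datetime) and `credits_used_cloud_services` columns.
    """
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for row in rows:
        day = row["start_time"].date().isoformat()
        # Treat a NULL credit value as zero.
        totals[(row["agent_name"], day)] += row["credits_used_cloud_services"] or 0.0
    return sorted((agent, day, total)
                  for (agent, day), total in totals.items()
                  if total > daily_cap)


sample_rows = [
    {"agent_name": "finops_monitor", "start_time": datetime(2026, 5, 20, 2, 34),
     "credits_used_cloud_services": 0.5},
    {"agent_name": "finops_monitor", "start_time": datetime(2026, 5, 20, 14, 5),
     "credits_used_cloud_services": 0.75},
    {"agent_name": "dq_alerts", "start_time": datetime(2026, 5, 20, 3, 0),
     "credits_used_cloud_services": 0.25},
]
print(agents_over_daily_cap(sample_rows, daily_cap=1.0))
```

A check like this only reports after the fact; it complements a hard cap that stops consumption and does not replace one.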
Cost containment — three layers
Chapter 07
Hands-on lab worksheet
Task 01
Pick a workflow and run the five-question filter
15 min
Identify a real workflow in your environment. Run it through the five-question filter from Chapter 02. Determine the appropriate autonomy level — full, recommend-only, or never.
Task 02
Design the agent's role and grants
15 min
Write the SQL to create the dedicated agent role with the narrowest possible privileges. Run the SQL. Verify the role has exactly what it needs and nothing more.
Task 03
Build the QUERY_TAG convention and audit view
15 min
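Implement the QUERY_TAG pattern for your agent. Create the audit view that filters QUERY_HISTORY by agent. Confirm the view returns data after the agent runs a test query; ACCOUNT_USAGE views can lag by up to 45 minutes, so allow for that delay before concluding the tag is missing.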
Task 04
Deploy in shadow mode for one week
N/A — async
Deploy the agent with action disabled — it observes and produces output but takes no action. Schedule weekly review. Compare its recommendations to what humans would have done.
Task 05
Build the human-in-the-loop checkpoint
15 min
Design the approval mechanism for high-stakes decisions. Slack message with action buttons, email with approval link, or table-based recommendation queue. Document who approves and the approval timeout.
Chapter 08
Knowledge check
Question 01
When does SnowWork earn its complexity over simple Snowflake Tasks?
Question 02
Which task is appropriate for full agent autonomy?
Question 03
What is shadow mode?
Question 04
An agent making inconsistent decisions on the same input most likely needs:
Question 05
The single source of truth for agent activity is:
Chapter 09
Glossary and reference
Key terminology
SnowWork agent
An autonomous Snowflake workflow component that handles long-running operational responsibilities. Distinct from Cortex Code (interactive) or Cortex Analyst (Q&A).
Five-question filter
The framework for evaluating task suitability for autonomous execution: reversibility, blast radius, success criteria, deterministic input, senior engineer trust.
Shadow mode
Initial deployment phase where the agent observes and produces output but takes no action. Used to validate behavior before production action mode.
Human-in-the-loop checkpoint
A design pattern requiring explicit human approval at specific high-stakes decision points. Preserves autonomy for routine work while protecting against irreversible mistakes.
QUERY_TAG convention
A structured JSON tag set on every agent operation, identifying the agent, workflow, action, and run. Foundation of agent audit and observability.
Recommendation queue
Pattern where agent produces recommendations to a table or channel; humans approve before execution. Used for irreversible actions.
Per-agent credit cap
A daily credit limit set on the agent's service account user. Prevents runaway consumption from a misbehaving agent.
Production deployment checklist
Chapter 10
Module completion
What you learned

The autonomy decision

The five-question filter for task suitability. When SnowWork earns its complexity vs. simple Snowflake Tasks.

The design discipline

Scope, govern, instrument, deploy, validate. Five steps from task definition to production agent.

Human-in-the-loop

Recommendation queues, escalation thresholds, time-delayed actions. Three patterns that preserve safety while maintaining velocity.

Governance

QUERY_TAG conventions, audit views, per-agent credit caps. The infrastructure that makes agents production-grade.

Cloud On Demand · Operator Module Completion
SnowWork Agent Design
Awarded to the practitioner who completed module OPR-007
Recipient signature
Issued by Cloud On Demand · OPR-007 · v1.0
Next in series
OPR-008 · RBAC for Agentic Workflows