Operator Module · Tier 2

SnowWork Agent Design
Autonomous Workflows with Human-in-the-Loop Governance

Design and deploy autonomous Snowflake workflow agents that handle real production tasks — with human-in-the-loop checkpoints, complete audit trails, and cost containment baked in from the first deployment.

Learning objectives

01. Understand what SnowWork agents are and the workflows where they earn their place.
02. Identify which tasks are appropriate for autonomous execution versus human-supervised execution.
03. Apply the five-step design pattern from task scope through deployment.
04. Diagnose the six common patterns where autonomous agents go off the rails.
05. Design human-in-the-loop checkpoints that preserve autonomy while preventing damage.
06. Configure audit trails and cost caps for production-grade governance.

Level: Operator · Advanced

Duration: 90 minutes plus lab

Prerequisites: OPR-001 complete · RBAC basics

Why this matters

SnowWork agents are powerful, and most failed production deployments fail in the same three ways: tasks that should not be autonomous are handed to agents, human checkpoints are missing at high-risk decision points, and no cost caps exist until the credit bill explains the problem. This module is for the deployment that doesn't repeat those failures.

00

Module Contents

What this module covers

This module covers the design and operations of SnowWork agents — autonomous Snowflake workflow components that handle scheduled tasks, anomaly response, and data quality monitoring. By the end you will have a production agent design pattern with audit, governance, and human-in-the-loop checkpoints.

01

What SnowWork agents are
The autonomous-task pattern, where it earns its complexity, and where simpler scheduling beats it.

02

Task suitability — autonomy vs supervision
The five-question filter that separates good-fit autonomous tasks from dangerous ones.

03

The five-step design pattern
Scope, govern, instrument, deploy, validate. From task definition to production agent.

04

Diagnosing agent failures
Six patterns where autonomous agents drift, escalate, or hide their work.

05

Human-in-the-loop checkpoints
Where to require approval, where to allow autonomy, and how to design checkpoint UX that humans actually use.

06

Audit trails and cost containment
QUERY_TAG conventions, audit views, per-agent cost caps, and the compliance evidence package.

07

Hands-on lab worksheet
Design and deploy a SnowWork agent for a real workflow with human checkpoints and audit trail.

08

Knowledge check
Five questions covering design, governance, diagnostics, and the human-in-the-loop pattern.

09

Glossary and reference
Terminology, agent design patterns, and a production deployment checklist.

10

Module completion and next steps
Module summary, certificate, and the next module in the series.


01

Chapter 01

What SnowWork agents are

SnowWork agents are Snowflake's autonomous workflow components — they handle scheduled tasks, anomaly response, data quality monitoring, and operational runbooks without requiring human action for routine work. Unlike Cortex Code (which generates code on demand) or Cortex Analyst (which answers business questions), SnowWork agents operate continuously on long-running responsibilities.

An agent monitoring data quality runs every hour, checks defined rules, flags anomalies, and routes findings. An agent handling warehouse FinOps watches consumption patterns, suspends warehouses idle longer than policy, and reports to a Slack channel. An agent for compliance reviews access logs nightly, identifies anomalous patterns, and produces an evidence package. This is work that previously required dedicated platform engineers; SnowWork lets a small team automate it.

Three workflows where SnowWork earns its place

Workflow 01 — Scheduled operational tasks. Tasks that run on a schedule with deterministic outputs. Nightly data quality checks, daily FinOps summaries, weekly governance reports. SnowWork makes these reliable without constant engineer maintenance.

Workflow 02 — Anomaly response. Workflows that watch for specific patterns and act. Warehouse exceeding budget threshold triggers suspension. Failed pipeline detected triggers escalation. SnowWork agents excel at threshold-based response patterns.

Workflow 03 — Documentation and reporting. Producing weekly executive summaries, monthly compliance reports, quarterly cost attribution analyses. SnowWork agents can compose reports by pulling from ACCOUNT_USAGE, formatting findings, and delivering output to defined channels.

Where simpler scheduling beats SnowWork

Overkill for SnowWork

Single SQL query scheduled to run nightly. Snowflake Tasks handle this natively. Adding SnowWork orchestration is complexity without benefit.

SnowWork-worthy

Workflow requiring multi-step coordination, conditional branching, external integration (Slack/email), or LLM-based output composition. Snowflake Tasks alone can't handle these elegantly; SnowWork can.

The threshold of justification

If a workflow can be expressed as a single SQL statement, use Snowflake Tasks. If it needs three or more steps, conditional logic, or LLM-based output composition, SnowWork is the right tool. Don't reach for SnowWork because it's the newest thing; reach for it because the workflow's complexity actually justifies it.
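For the single-statement case, a native task is usually all that is needed. The following is a minimal sketch, not a prescribed implementation. The warehouse OPS_WH, the tables GOVERNANCE.DQ_RESULTS and SALES.ORDERS, and the task name are hypothetical and do not come from this module.

```sql
-- Hypothetical names throughout: OPS_WH, GOVERNANCE.DQ_RESULTS, SALES.ORDERS.
-- One SQL statement on a schedule: a native task covers this without SnowWork.
CREATE OR REPLACE TASK GOVERNANCE.NIGHTLY_NULL_CHECK
  WAREHOUSE = OPS_WH
  SCHEDULE = 'USING CRON 0 2 * * * UTC'   -- every night at 02:00 UTC
AS
  INSERT INTO GOVERNANCE.DQ_RESULTS (check_name, checked_at, failed_rows)
  SELECT 'orders_null_customer_id', CURRENT_TIMESTAMP(), COUNT(*)
  FROM SALES.ORDERS
  WHERE customer_id IS NULL;

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK GOVERNANCE.NIGHTLY_NULL_CHECK RESUME;
```

Once the workflow needs to branch on the result, notify a channel, or compose a narrative summary, it has crossed the threshold described above.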

02

Chapter 02

Task suitability — autonomy vs supervision

Not every task that can be made autonomous should be. The five-question filter below separates good-fit autonomous tasks from dangerous ones. Run any candidate workflow through these five questions before designing the agent.

The five-question filter

Question 01 — Is the task reversible? If the agent makes a wrong decision, can the consequence be undone in minutes rather than days? Suspending a warehouse is reversible (resume it). Dropping a table is not. Reversible tasks are good candidates for full autonomy; irreversible tasks need human checkpoints.

Question 02 — Is the blast radius bounded? If the agent errors completely, what's the worst case? An agent that suspends warehouses might inconvenience users for an hour. An agent that modifies production data might corrupt downstream systems for weeks. Bounded blast radius means autonomy is appropriate.

Question 03 — Does the task have clear success criteria? Can you write a SQL query that confirms the task completed correctly? "Warehouse suspended" is verifiable. "User happy with the report" is not. Verifiable success criteria allow agents to know when they're done.

Question 04 — Is the input deterministic enough? Does the same input always produce the same correct output? FinOps queries against ACCOUNT_USAGE are deterministic. Decisions about ambiguous data ("is this PII?") are not. Deterministic inputs allow consistent agent behavior.

Question 05 — Would a senior engineer trust this agent? The final filter. If a senior engineer on your team wouldn't sign off on the agent operating without supervision, the agent needs supervision. Trust isn't something the agent earns automatically; it's something humans grant after watching the agent perform.
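Question 03 asks for a SQL query that confirms the task completed. For the "warehouse suspended" example, such a check might look like the sketch below. The warehouse name ANALYTICS_WH is hypothetical.

```sql
-- Hypothetical warehouse name: ANALYTICS_WH.
-- Success check for a suspension action: did the warehouse reach SUSPENDED?
SHOW WAREHOUSES LIKE 'ANALYTICS_WH';

-- SHOW output columns are lowercase, so they must be double-quoted.
SELECT "name", "state"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE UPPER("state") = 'SUSPENDED';
-- One row back means the action succeeded; zero rows means it did not.
```

If no comparable query can be written for a candidate task, that is a signal the task fails Question 03.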

The suitability matrix

Task type	Autonomy level	Rationale
Warehouse FinOps monitoring	Full autonomy	Reversible, bounded, deterministic
Auto-suspending idle warehouses	Full autonomy	Reversible, well-understood
Data quality alerting	Full autonomy	Read-only, bounded
Storage cleanup recommendations	Recommend only	Irreversible; humans approve
RBAC changes	Recommend only	Security impact; humans approve
PII detection and masking	Flag, never act	Compliance stakes; humans decide
Production schema changes	Never autonomous	Blast radius too broad

Operator Tip

When in doubt, start with "recommend only" rather than full autonomy. The agent produces output that requires a human approval click before executing. After three months of clean recommendations with no errors, consider promoting the agent to full autonomy. Promoting a well-behaved agent is easy; demoting an out-of-control autonomous agent is painful.

03

Chapter 03

The five-step design pattern

Step 01 — Scope

Define the agent's task with surgical precision

Write down: what the agent does, what triggers it, what inputs it reads, what outputs it produces, what success looks like, what failure looks like. Scope creep is the #1 cause of agent failures — agents asked to do "anything related to FinOps" produce inconsistent output and unmanageable governance surfaces.

Step 02 — Govern

Design the dedicated role and grants

Each agent gets its own Snowflake role with the narrowest possible privileges. A FinOps monitoring agent needs read access to ACCOUNT_USAGE and nothing else. A warehouse-suspension agent needs the OPERATE privilege on specific warehouses (enough to run ALTER WAREHOUSE ... SUSPEND and RESUME) and nothing else. Never reuse roles across agents.

CREATE ROLE AGENT_FINOPS_MONITOR;
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE AGENT_FINOPS_MONITOR;
-- That's it. Nothing else.
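The same discipline applies to an agent that acts rather than reads. Below is a sketch of a role for a warehouse-suspension agent, assuming a hypothetical warehouse REPORTING_WH and a hypothetical service user SVC_AGENT_WH_SUSPENDER. How the service user authenticates is outside this sketch.

```sql
-- Hypothetical names: AGENT_WH_SUSPENDER, REPORTING_WH, SVC_AGENT_WH_SUSPENDER.
CREATE ROLE IF NOT EXISTS AGENT_WH_SUSPENDER;

-- OPERATE covers changing warehouse state (suspend, resume) on this one warehouse.
-- Resizing or changing properties needs MODIFY, which is deliberately not granted.
GRANT OPERATE ON WAREHOUSE REPORTING_WH TO ROLE AGENT_WH_SUSPENDER;

-- A dedicated service user for the agent.
CREATE USER IF NOT EXISTS SVC_AGENT_WH_SUSPENDER
  TYPE = SERVICE
  DEFAULT_ROLE = AGENT_WH_SUSPENDER;
GRANT ROLE AGENT_WH_SUSPENDER TO USER SVC_AGENT_WH_SUSPENDER;

-- Verify: the output should list exactly the grants above and nothing more.
SHOW GRANTS TO ROLE AGENT_WH_SUSPENDER;
```

An agent that also runs queries needs USAGE on a warehouse to execute them; grant that on a specific warehouse rather than broadly.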

Step 03 — Instrument

Build in audit and observability from day one

Every agent action gets a QUERY_TAG identifying it. Every decision produces an audit log entry. Cost caps are set on the agent's service account. Build observability first; you cannot add it after the fact when something goes wrong.
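One way to give every decision an audit log entry is a small decision table that the agent writes to on each run. This is a sketch under assumptions: the table GOVERNANCE.AGENT_DECISION_LOG and its columns are hypothetical, and the sample values reuse the agent name and run_id format from the QUERY_TAG convention in Chapter 06.

```sql
-- Hypothetical schema and table: GOVERNANCE.AGENT_DECISION_LOG.
CREATE TABLE IF NOT EXISTS GOVERNANCE.AGENT_DECISION_LOG (
  run_id        VARCHAR       NOT NULL,  -- same run_id the agent puts in QUERY_TAG
  agent_name    VARCHAR       NOT NULL,
  decided_at    TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP(),
  target_object VARCHAR,                 -- e.g. the warehouse being evaluated
  decision      VARCHAR       NOT NULL,  -- e.g. 'SUSPEND', 'SKIP'
  reason        VARCHAR,                 -- the rule or threshold that fired
  executed      BOOLEAN       DEFAULT FALSE  -- FALSE when only observed or recommended
);

-- One row per decision, including decisions not to act.
INSERT INTO GOVERNANCE.AGENT_DECISION_LOG
  (run_id, agent_name, target_object, decision, reason, executed)
VALUES
  ('2026-05-20-0234', 'finops_monitor', 'REPORTING_WH', 'SUSPEND',
   'idle longer than policy', TRUE);
```

Logging the decisions to skip as well as the decisions to act is what makes the later review questions ("did it skip when it should have skipped?") answerable.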

Step 04 — Deploy

Roll out in shadow mode first

Deploy the agent in observe-only mode for the first week. It runs, produces output, but takes no action. Compare its decisions to what humans would have done. If alignment is high, promote to action mode. If decisions are off, refine before going live.
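"Alignment" can be made measurable if both the agent's shadow decisions and the humans' actual decisions are recorded. The sketch below assumes two hypothetical tables with matching target_object, decision_date, and decision columns; neither table is defined by this module.

```sql
-- Hypothetical tables, both keyed by (target_object, decision_date):
--   GOVERNANCE.AGENT_SHADOW_DECISIONS: what the agent would have done
--   GOVERNANCE.HUMAN_DECISIONS:        what the engineer actually did
SELECT
  COUNT(*)                          AS compared_decisions,
  COUNT_IF(a.decision = h.decision) AS matching_decisions,
  ROUND(100 * COUNT_IF(a.decision = h.decision) / NULLIF(COUNT(*), 0), 1)
                                    AS agreement_pct
FROM GOVERNANCE.AGENT_SHADOW_DECISIONS a
JOIN GOVERNANCE.HUMAN_DECISIONS h
  ON  a.target_object = h.target_object
  AND a.decision_date = h.decision_date
WHERE a.decision_date >= DATEADD(day, -7, CURRENT_DATE());

-- The disagreements are the rows worth reading one by one.
SELECT a.target_object, a.decision_date,
       a.decision AS agent_decision, h.decision AS human_decision
FROM GOVERNANCE.AGENT_SHADOW_DECISIONS a
JOIN GOVERNANCE.HUMAN_DECISIONS h
  ON  a.target_object = h.target_object
  AND a.decision_date = h.decision_date
WHERE a.decision <> h.decision
  AND a.decision_date >= DATEADD(day, -7, CURRENT_DATE());
```

What counts as "high" agreement is a team decision; the second query usually matters more than the percentage.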

Step 05 — Validate

Run weekly review of agent decisions

For the first 90 days of any agent's life, schedule a weekly review of its decisions. Did it act when it should have? Did it skip when it should have skipped? Were costs in budget? Use the review to refine the agent or expand its scope.

Operator Tip

Shadow mode is non-negotiable. Every customer who skipped it had an incident in the first month. Every customer who ran shadow mode for a week caught at least one decision pattern they wanted to adjust before production. The week of shadow mode pays for itself before production deployment.

04

Chapter 04

Diagnosing agent failures

SnowWork agents fail in patterns that differ from human failures. Below are the six most common, with the diagnostic and the fix for each.

Symptom	Root cause and fix
Agent takes action that contradicts policy	Scope creep — agent was given broader responsibilities than its design supports. Narrow the scope; if multiple responsibilities are needed, split into multiple agents.
Agent consumes 10x expected credits	Retry loop or runaway condition. Check QUERY_HISTORY for repeated identical operations. Add explicit retry caps and exponential backoff.
Agent makes inconsistent decisions on same input	Non-deterministic LLM behavior on edge cases. For decisions that must be consistent, use explicit SQL rules rather than LLM judgment. Save LLM for output composition, not decision logic.
Agent silently stops running	Underlying task suspended (often after multiple failed runs). Check TASK_HISTORY for run status. Snowflake suspends tasks after 10 consecutive failures by default (the SUSPEND_TASK_AFTER_NUM_FAILURES parameter).
Agent produces output nobody reads	Output destination wrong, or the right people aren't notified. Verify the Slack channel, email list, or storage location is monitored. If nobody acts on the output, the agent has no value.
Agent works fine but humans don't trust it	Trust deficit. Add explicit confidence scores to agent output, increase audit log visibility, schedule monthly review with stakeholders showing the agent's track record. Trust is earned through transparency.
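Two of the diagnostics in the table can be sketched as queries. Both assume the agent tags its queries with a JSON QUERY_TAG containing an "agent" key. The agent name, the task name AGENT_FINOPS_DAILY, and the threshold of 20 are illustrative. ACCOUNT_USAGE views lag behind real time, so very recent activity may not appear yet.

```sql
-- 1) Retry loop: the same statement text repeated many times in the last day.
SELECT query_text,
       COUNT(*)        AS executions,
       MIN(start_time) AS first_seen,
       MAX(start_time) AS last_seen
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE TRY_PARSE_JSON(query_tag):agent::VARCHAR = 'finops_monitor'
  AND start_time >= DATEADD(day, -1, CURRENT_TIMESTAMP())
GROUP BY query_text
HAVING COUNT(*) > 20            -- illustrative threshold
ORDER BY executions DESC;

-- 2) Silent stop: recent runs of the underlying task (hypothetical task name).
SELECT name, state, error_message, scheduled_time, completed_time
FROM SNOWFLAKE.ACCOUNT_USAGE.TASK_HISTORY
WHERE name = 'AGENT_FINOPS_DAILY'
  AND scheduled_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY scheduled_time DESC;
```

A run of FAILED rows followed by no rows at all is the typical signature of a task that was suspended automatically.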

The trust principle

A technically correct agent that humans don't trust is worthless. A slightly imperfect agent that humans trust because they can see what it's doing is invaluable. Design for transparency, not just correctness. The audit trail isn't just for compliance — it's for the engineer who wonders at 11pm whether to trust the agent's last decision.

05

Chapter 05

Human-in-the-loop checkpoints

Human-in-the-loop is the pattern that makes autonomous agents safe in environments where mistakes are expensive. The agent does most of the work autonomously, but specific high-stakes decisions require explicit human approval before executing.

Three checkpoint design patterns

Pattern 01 — The recommendation queue. Agent produces recommendations and writes them to a table or sends them to a Slack channel. Humans review and approve via a simple UI or by responding to the Slack message. Approved recommendations execute; un-approved ones expire after a defined window. Best for irreversible actions like dropping tables or modifying RBAC.

Pattern 02 — The escalation threshold. Agent acts autonomously below a defined threshold. Above the threshold, it requires approval. A warehouse-suspension agent suspends idle warehouses freely; suspending a warehouse with active sessions requires approval. Threshold-based autonomy preserves speed for routine cases while protecting against exceptional ones.

Pattern 03 — The time-delayed action. Agent announces an intended action and executes after a defined delay if no one objects. "I will suspend warehouse XYZ in 30 minutes unless you reply STOP." Combines autonomy (action happens by default) with safety (humans can intercept). Best for medium-risk reversible actions.
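Pattern 01 can be backed by a plain table. The following is a sketch under assumptions: the table GOVERNANCE.AGENT_RECOMMENDATIONS, its columns, the example action, and the 48-hour window are all hypothetical choices for illustration.

```sql
-- Hypothetical table: GOVERNANCE.AGENT_RECOMMENDATIONS.
CREATE TABLE IF NOT EXISTS GOVERNANCE.AGENT_RECOMMENDATIONS (
  recommendation_id VARCHAR       DEFAULT UUID_STRING(),
  agent_name        VARCHAR       NOT NULL,
  proposed_action   VARCHAR       NOT NULL,  -- human-readable description
  status            VARCHAR       DEFAULT 'PENDING',  -- PENDING | APPROVED | REJECTED | EXPIRED | EXECUTED
  created_at        TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP(),
  expires_at        TIMESTAMP_LTZ NOT NULL,
  decided_by        VARCHAR,
  decided_at        TIMESTAMP_LTZ
);

-- The agent files a recommendation instead of acting.
INSERT INTO GOVERNANCE.AGENT_RECOMMENDATIONS (agent_name, proposed_action, expires_at)
SELECT 'storage_cleanup', 'DROP TABLE STAGING.TMP_LOAD_2024',
       DATEADD(hour, 48, CURRENT_TIMESTAMP());

-- A human approves; only PENDING, unexpired rows can be approved.
UPDATE GOVERNANCE.AGENT_RECOMMENDATIONS
SET status = 'APPROVED', decided_by = CURRENT_USER(), decided_at = CURRENT_TIMESTAMP()
WHERE recommendation_id = '<id from the review queue>'
  AND status = 'PENDING'
  AND expires_at > CURRENT_TIMESTAMP();

-- Housekeeping: anything still pending past its window expires unexecuted.
UPDATE GOVERNANCE.AGENT_RECOMMENDATIONS
SET status = 'EXPIRED'
WHERE status = 'PENDING' AND expires_at <= CURRENT_TIMESTAMP();

-- The agent executes only what this returns, then marks each row EXECUTED.
SELECT recommendation_id, proposed_action
FROM GOVERNANCE.AGENT_RECOMMENDATIONS
WHERE status = 'APPROVED';
```

The decided_by and decided_at columns double as the approval audit log: who approved, what, and when.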

UX principles for checkpoints

The checkpoint UX must be lower-friction than the alternative. If approving an agent recommendation requires more clicks than doing the task manually, humans bypass the agent.
Each approval request includes context: what the agent will do, why, dollar/risk impact, link to underlying data.
Approvals are time-boxed. Recommendations expire after 48–72 hours; humans who don't respond signal "this isn't important enough to act on."
Approval audit logs are persistent. Who approved, what they approved, when. Compliance evidence built into the workflow.

Operator Tip

Slack is the right approval surface for most cases. A bot post with action buttons (Approve, Reject, More Info) integrates with where engineers already work. Custom web UIs are over-engineering for the first 6 months; switch only when you've outgrown Slack's threading model.

06

Chapter 06

Audit trails and cost containment

Production agents need two non-negotiable layers: complete audit trails for compliance and incident investigation, and explicit cost caps that prevent runaway consumption. Both are easier to add at deployment than to retrofit later.

The QUERY_TAG convention

Every operation the agent performs should set a QUERY_TAG identifying the agent, the workflow, and the action. The convention below gives auditors a single column to filter on:

-- QUERY_TAG convention for agent operations
ALTER SESSION SET QUERY_TAG = '{"agent": "finops_monitor", "version": "1.2", "workflow": "daily_warehouse_check", "action": "evaluate_suspension", "run_id": "2026-05-20-0234"}';
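A small sketch of the statements that might bracket an agent run: confirm the tag took effect, then clear it afterwards. QUERY_TAG values are limited to 2000 characters, so the JSON should stay compact.

```sql
-- Confirm the tag is active in the current session before the agent does any work.
SHOW PARAMETERS LIKE 'QUERY_TAG' IN SESSION;

-- ... agent statements run here and inherit the tag ...

-- Clear the tag at the end of the run so later statements in a reused
-- session are not attributed to this run_id.
ALTER SESSION UNSET QUERY_TAG;
```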

The audit view

Create a dedicated audit view over QUERY_HISTORY, filtered to rows that carry the agent's QUERY_TAG. This becomes the single source of truth for what the agent did, when, and with what outcome.

-- Agent audit view: single source of truth
CREATE OR REPLACE VIEW GOVERNANCE.AGENT_AUDIT AS
SELECT
  TRY_PARSE_JSON(query_tag):agent::VARCHAR    AS agent_name,
  TRY_PARSE_JSON(query_tag):workflow::VARCHAR AS workflow,
  TRY_PARSE_JSON(query_tag):action::VARCHAR   AS action,
  TRY_PARSE_JSON(query_tag):run_id::VARCHAR   AS run_id,
  start_time,
  execution_status,
  credits_used_cloud_services,
  rows_produced,
  error_message
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE TRY_PARSE_JSON(query_tag):agent IS NOT NULL
  AND start_time >= DATEADD(day, -90, CURRENT_TIMESTAMP());
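Two example queries show how the view might be used. The first summarizes failures per agent. The second is a low-latency spot check through the INFORMATION_SCHEMA table function, which helps right after a test run because ACCOUNT_USAGE.QUERY_HISTORY can lag by up to 45 minutes. The second query needs a current database set in the session.

```sql
-- Failures per agent and workflow over the last 7 days, from GOVERNANCE.AGENT_AUDIT.
SELECT agent_name, workflow,
       COUNT(*)                                AS total_queries,
       COUNT_IF(execution_status <> 'SUCCESS') AS failed_queries
FROM GOVERNANCE.AGENT_AUDIT
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY agent_name, workflow
ORDER BY failed_queries DESC;

-- Low-latency spot check for the current session, useful right after a test run.
SELECT query_id, query_tag, execution_status, start_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_SESSION(RESULT_LIMIT => 20))
ORDER BY start_time DESC;
```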

Cost containment — three layers

Per-agent credit cap. Set CORTEX_CODE_SNOWSIGHT_DAILY_EST_CREDIT_LIMIT_PER_USER on the agent's service account.
Per-run timeout. Agent tasks should have explicit STATEMENT_TIMEOUT_IN_SECONDS to prevent runaway operations.
Budget alert. Weekly budget check that flags if the agent exceeds its allocated credits. Alert goes to engineering, not just finance.
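The timeout and budget layers can be sketched in SQL. The resource monitor below is an additional backstop, not one of the three layers listed, and it only works if the agent has a warehouse to itself. All names (SVC_AGENT_FINOPS, AGENT_FINOPS_WH, AGENT_FINOPS_RM) and numbers are hypothetical. Creating a resource monitor requires ACCOUNTADMIN.

```sql
-- Layer 2: cap any single statement from the agent's service user at 5 minutes.
ALTER USER SVC_AGENT_FINOPS SET STATEMENT_TIMEOUT_IN_SECONDS = 300;

-- Optional backstop (an assumption, not one of the three layers above):
-- a resource monitor on a warehouse used only by this agent.
CREATE OR REPLACE RESOURCE MONITOR AGENT_FINOPS_RM
  WITH CREDIT_QUOTA = 20
       FREQUENCY = WEEKLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE AGENT_FINOPS_WH SET RESOURCE_MONITOR = AGENT_FINOPS_RM;

-- Layer 3: weekly budget check against the dedicated warehouse.
SELECT warehouse_name, SUM(credits_used) AS credits_last_7_days
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE warehouse_name = 'AGENT_FINOPS_WH'
  AND start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
HAVING SUM(credits_used) > 20;   -- a row back means the weekly budget was exceeded
```

Routing a non-empty result of the last query to the engineering channel matches the "alert goes to engineering, not just finance" rule above.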

07

Chapter 07

Hands-on lab worksheet

Task 01

Pick a workflow and run the five-question filter

15 min

Identify a real workflow in your environment. Run it through the five-question filter from Chapter 02. Determine the appropriate autonomy level from the suitability matrix: full autonomy, recommend only, flag only, or never autonomous.

Task 02

Design the agent's role and grants

15 min

Write the SQL to create the dedicated agent role with the narrowest possible privileges. Run the SQL. Verify the role has exactly what it needs and nothing more.

Task 03

Build the QUERY_TAG convention and audit view

15 min

Implement the QUERY_TAG pattern for your agent. Create the audit view that filters QUERY_HISTORY by agent. Run a test query as the agent and confirm the view returns it; ACCOUNT_USAGE.QUERY_HISTORY can lag by up to 45 minutes, so allow for that delay.

Task 04

Deploy in shadow mode for one week

N/A — async

Deploy the agent with action disabled — it observes and produces output but takes no action. Schedule weekly review. Compare its recommendations to what humans would have done.

Task 05

Build the human-in-the-loop checkpoint

15 min

Design the approval mechanism for high-stakes decisions. Slack message with action buttons, email with approval link, or table-based recommendation queue. Document who approves and the approval timeout.

08

Chapter 08

Knowledge check

Question 01

When does SnowWork earn its complexity over simple Snowflake Tasks?

A. Always, because it's newer and better
B. When the workflow has three or more steps, conditional logic, or LLM-based composition
C. For single SQL statement schedules
D. Only for governance work

Question 02

Which task is appropriate for full agent autonomy?

A. Dropping abandoned tables
B. RBAC changes
C. Suspending idle warehouses (reversible, bounded, deterministic)
D. PII detection and remediation

Question 03

What is shadow mode?

A. Running the agent in production without telling anyone
B. A deployment where the agent observes and produces output but takes no action; used to validate behavior before production
C. Disabling all audit logging
D. Running multiple copies of the agent in parallel

Question 04

An agent making inconsistent decisions on the same input most likely needs:

A. A larger warehouse
B. To replace LLM judgment with explicit SQL rules for the decision logic
C. More credits allocated
D. Removing the audit logging

Question 05

The single source of truth for agent activity is:

A. Engineer memory
B. Slack message history
C. The QUERY_TAG-based audit view filtering QUERY_HISTORY
D. The agent's source code

09

Chapter 09

Glossary and reference

Key terminology

SnowWork agent

An autonomous Snowflake workflow component that handles long-running operational responsibilities. Distinct from Cortex Code (interactive) or Cortex Analyst (Q&A).

Five-question filter

The framework for evaluating task suitability for autonomous execution: reversibility, blast radius, success criteria, deterministic input, senior engineer trust.

Shadow mode

Initial deployment phase where the agent observes and produces output but takes no action. Used to validate behavior before production action mode.

Human-in-the-loop checkpoint

A design pattern requiring explicit human approval at specific high-stakes decision points. Preserves autonomy for routine work while protecting against irreversible mistakes.

QUERY_TAG convention

A structured JSON tag set on every agent operation, identifying the agent, workflow, action, and run. Foundation of agent audit and observability.

Recommendation queue

Pattern where agent produces recommendations to a table or channel; humans approve before execution. Used for irreversible actions.

Per-agent credit cap

A daily credit limit set on the agent's service account user. Prevents runaway consumption from a misbehaving agent.

Production deployment checklist

Task passes the five-question filter
Dedicated agent role with narrowest possible grants
QUERY_TAG convention implemented on every operation
Audit view created and tested
Per-agent credit cap set on service account
Statement timeout configured
Human-in-the-loop checkpoints designed for high-stakes decisions
Shadow mode validation completed for at least one week
Weekly review scheduled for first 90 days
Budget alert configured

10

Chapter 10

Module completion

What you learned

The autonomy decision

The five-question filter for task suitability. When SnowWork earns its complexity vs. simple Snowflake Tasks.

The design discipline

Scope, govern, instrument, deploy, validate. Five steps from task definition to production agent.

Human-in-the-loop

Recommendation queues, escalation thresholds, time-delayed actions. Three patterns that preserve safety while maintaining velocity.

Governance

QUERY_TAG conventions, audit views, per-agent credit caps. The infrastructure that makes agents production-grade.

Cloud On Demand · Operator Module Completion

SnowWork
Agent Design

Awarded to the practitioner who completed module OPR-007

Recipient signature

Issued by Cloud On Demand · OPR-007 · v1.0

Next in series

OPR-008 · RBAC for Agentic Workflows
