When to trust your AI agent (and when not to)

There's a failure mode that's easy to fall into when you first start thinking about AI agent safety: treating every action as equally dangerous. If your answer to "how much should I trust my agent?" is "not at all, gate everything" — you'll build a system so slow and friction-heavy that nobody will use it. The agent becomes useless.

The opposite failure is worse: "trust everything, it's fine." That's how you end up with a coding agent that decides the fastest path to fixing a bug is deleting the test suite that was catching it.

The right answer is a spectrum. And building that spectrum deliberately — rather than accidentally — is one of the most important architectural decisions you'll make when deploying AI agents.

The four trust levels

Think of agent actions as falling into four categories:

- Free run: no approval needed
- Whitelisted: pre-approved patterns
- Gate: human in the loop
- Block: never allowed
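The four levels can be sketched as an ordered enum; the names here are illustrative, not expacti's API:

```python
from enum import Enum

class TrustLevel(Enum):
    """The four trust levels, from least to most oversight."""
    FREE_RUN = 1      # read-only, reversible: no approval needed
    WHITELISTED = 2   # pre-approved write patterns
    GATE = 3          # human in the loop, reviewed in real time
    BLOCK = 4         # never allowed, no override

# Because the levels are ordered, a policy can compare them directly,
# e.g. "escalate to at least GATE" for anything novel.
assert TrustLevel.GATE.value > TrustLevel.WHITELISTED.value
```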

Free run: read-only, ephemeral, reversible

Actions that observe but don't change the world. Reading files, querying logs, running tests, listing containers — these are inherently low risk. If the agent reads the wrong file, nothing breaks. Let these run without friction.

Whitelisted: known-safe write operations

Actions that modify state but in ways you've already thought through and approved. A staging deploy from a specific branch. Restarting a known service. Creating files in a temp directory. These don't need real-time review — they need a deliberate decision recorded somewhere.

Whitelists are the productivity lever. Once an operation is reviewed and approved once, it shouldn't require human attention every subsequent time. The whitelist is the record of your trust decisions.

Gate: novel, high-impact, or context-dependent

Actions that modify production state, touch sensitive data, or whose safety depends on context you can't encode in a static pattern. These warrant a real human looking at them in real time.

Block: never, under any circumstances

A short list of things that should simply never happen from an automated process, full stop. Not gated — blocked. No approval flow, no override, no "except in emergencies."

The two questions that determine trust level

For any agent action, ask:

1. Is it reversible?

If something goes wrong, can you undo it? Reading a file: trivially reversible (nothing happened). Appending to a log: reversible with effort. Dropping a database: not reversible. Sending an email to 10,000 customers: not reversible.

Irreversibility is the primary driver of required oversight.

2. Is the blast radius bounded?

If the action goes wrong, how many things are affected? Deleting a temp file: blast radius of one file. Running a migration on a table: blast radius of that table. Running a migration without a transaction on all tables in prod: potentially everything.

Unbounded blast radius requires a human to consciously accept that risk.

Map these two dimensions and you get most of the spectrum:

| Action type | Reversible? | Blast radius | Trust level |
| --- | --- | --- | --- |
| Read file / query log | ✓ trivially | None | Free run |
| Run tests (sandboxed) | ✓ trivially | None | Free run |
| Deploy to staging | ✓ with effort | Staging only | Whitelist |
| Restart known service | ✓ quickly | One service | Whitelist |
| Deploy to production | ⚠ with rollback | All users | Gate |
| Add DB column (with migration) | ⚠ usually | One table | Gate |
| Revoke API credentials | ⚠ if you have them | All dependent services | Gate |
| DROP table in production | ✗ | Everything dependent | Block |
| rm -rf on unknown path | ✗ | Unknown | Block |
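The mapping above can be sketched as a tiny decision function. This is a simplified sketch: real policies match on command patterns and context, not on pre-labeled inputs, and the category names are illustrative.

```python
def trust_level(reversible: str, blast_radius: str) -> str:
    """Map the two questions to a trust level.

    reversible: "trivially", "quickly", "with_effort", "with_rollback", "no"
    blast_radius: "none", "bounded", "production", "unbounded"
    """
    if reversible == "no" and blast_radius in ("production", "unbounded"):
        return "block"      # irreversible and wide: never automate
    if blast_radius in ("production", "unbounded"):
        return "gate"       # recoverable but wide: human in the loop
    if reversible == "trivially" and blast_radius == "none":
        return "free_run"   # read-only, nothing to undo
    return "whitelist"      # bounded writes: pre-approve deliberately

assert trust_level("trivially", "none") == "free_run"       # read a file
assert trust_level("with_effort", "bounded") == "whitelist" # staging deploy
assert trust_level("with_rollback", "production") == "gate" # prod deploy
assert trust_level("no", "unbounded") == "block"            # rm -rf unknown path
```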

Trust erodes over time without maintenance

Here's a subtle failure that teams often miss: a whitelist entry that was safe six months ago may not be safe today. The staging environment that used to be truly isolated now has a VPN tunnel to production. The "safe" temp directory is now a symlink to something important. The service you can safely restart now has a 90-second startup time that cascades into a timeout in a dozen dependent services.

Trust decisions need expiry dates. Not because agents become untrustworthy, but because the world they operate in changes. Whitelist rules should have TTLs. High-risk rules should have shorter ones.

A whitelist entry that never expires is a trust decision made once and never revisited. In a system that changes, that's a liability, not an asset.
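One way to sketch a whitelist entry with an expiry date; the field names are illustrative, not expacti's schema. The important property is the fallback: an expired rule reverts to gating, never to free run.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class WhitelistRule:
    pattern: str            # e.g. a regex over the command
    approved_by: str        # the recorded trust decision
    approved_at: datetime
    ttl: timedelta          # shorter for higher-risk rules

    def is_active(self, now: datetime) -> bool:
        """An expired rule falls back to gating, not to free run."""
        return now < self.approved_at + self.ttl

now = datetime.now(timezone.utc)
rule = WhitelistRule(
    pattern=r"^systemctl restart app-worker$",  # hypothetical service
    approved_by="alice",
    approved_at=now - timedelta(days=40),
    ttl=timedelta(days=30),  # production-touching: short TTL
)
assert not rule.is_active(now)  # expired: the action goes back to the gate
```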

The "first time" rule

One of the most reliable heuristics: if the agent has never done this specific thing before in this specific context, gate it — regardless of how it looks statically.

docker pull nginx:latest might be fine. But the first time an agent runs it on a production host, something shifted. Maybe the image introduces a version incompatibility. Maybe this is the first time Docker is being used on this host at all. The novelty itself is a signal worth a human glance.

This is why expacti's anomaly detection tracks first-seen commands per target. It's not that docker pull is dangerous — it's that "I've never seen this agent do this here before" is a meaningful flag.
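The core of first-seen tracking needs nothing more than a set keyed by (command, target) pairs. This is a generic sketch of the idea, not expacti's implementation:

```python
seen: set[tuple[str, str]] = set()  # (command, target) pairs already observed

def is_first_seen(command: str, target: str) -> bool:
    """True if this agent has never run this command on this target.

    A first-seen pair doesn't mean the command is dangerous; it means
    the novelty itself is worth a human glance, so gate it once.
    """
    key = (command, target)
    if key in seen:
        return False
    seen.add(key)
    return True

assert is_first_seen("docker pull nginx:latest", "prod-host-1")      # novel: gate it
assert not is_first_seen("docker pull nginx:latest", "prod-host-1")  # seen before
assert is_first_seen("docker pull nginx:latest", "prod-host-2")      # new target: gate again
```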

Multi-party approval: when one human isn't enough

For a small class of actions, one reviewer isn't sufficient. Consider:

- Revoking credentials that every dependent service relies on
- Running an irreversible migration against production data
- Sending a message to your entire customer base

These warrant requiring N of M approvers — two engineers, or an engineer plus a security lead. Not because any single reviewer is untrustworthy, but because the blast radius exceeds any one person's accountability scope.
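An N-of-M check is just a quorum over distinct approvers. This is a sketch: a role requirement like "must include a security lead" would add a second predicate on top.

```python
def quorum_met(approvals: list[str], required: int) -> bool:
    """Require `required` distinct approvers; duplicates don't count."""
    return len(set(approvals)) >= required

assert not quorum_met(["alice"], 2)           # one reviewer isn't enough
assert not quorum_met(["alice", "alice"], 2)  # re-approving yourself doesn't count
assert quorum_met(["alice", "bob"], 2)        # two engineers: quorum met
```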

Building your trust policy

Here's a practical starting point for a team deploying an AI coding or DevOps agent:

Week 1: Start tight

Gate everything except read-only operations. Let the agent run. Watch what it does. After a week, look at your approval log — you'll see patterns emerge.

Week 2: Build your whitelist from data

Take the most common approved actions from week 1 and whitelist them. These are your "boring and safe" category. Expacti's AI suggestions will propose regex patterns from your approval history automatically.
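Even without tooling, a week-one approval log can be mined with a simple frequency count: the most commonly approved actions are your whitelist candidates. A sketch, assuming the log is a list of approved command strings (the commands and threshold here are made up):

```python
from collections import Counter

approval_log = [
    "npm test", "npm test", "git diff", "npm test",
    "kubectl rollout restart deploy/staging-api",
    "npm test", "git diff",
]

# Actions approved most often are the "boring and safe" whitelist candidates.
candidates = [cmd for cmd, n in Counter(approval_log).most_common() if n >= 3]
assert candidates == ["npm test"]
```

Each candidate still deserves a deliberate review before it becomes a rule; frequency tells you where to look, not what to trust.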

Week 3+: Tune the gates

Add TTLs to whitelist rules for anything production-touching. Set multi-party approval requirements for your highest-stakes operations. Configure anomaly detection sensitivity for your environment's baseline.

The goal isn't to minimize approvals — it's to make approval decisions deliberate. An approval you click through in two seconds because you've seen it a hundred times is noise. An approval that makes you pause and think is signal. Build a system where you mostly see signal.

The practical answer

Trust your AI agent with observations and reversible, bounded actions. Gate novel, high-impact, or irreversible operations. Block a short list of things that should never happen automatically. And build in mechanisms to revisit your trust decisions as your environment changes.

The question isn't "do I trust this agent?" It's "for this specific action, in this specific context, at this specific time — is the cost of a wrong call acceptable without a human in the loop?" Answer that question explicitly, per action, and you'll have a trust policy worth keeping.

Build your trust policy with expacti

Per-command gating, whitelist with TTLs, anomaly detection, and multi-party approval — all in one place.

See the interactive demo
Get started free