When to trust your AI agent (and when not to)

There's a failure mode that's easy to fall into when you first start thinking about AI agent safety: treating every action as equally dangerous. If your answer to "how much should I trust my agent?" is "not at all, gate everything" — you'll build a system so slow and friction-heavy that nobody will use it. The agent becomes useless.

The opposite failure is worse: "trust everything, it's fine." That's how you end up with a coding agent that decides the fastest path to fixing a bug is deleting the test suite that was catching it.

The right answer is a spectrum. And building that spectrum deliberately — rather than accidentally — is one of the most important architectural decisions you'll make when deploying AI agents.

The four trust levels

Think of agent actions as falling into four categories:

- Free run: no approval needed
- Whitelisted: pre-approved patterns
- Gate: human in the loop
- Block: never allowed
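The four levels can be sketched as an ordered enum; the names here are illustrative, not expacti's API:

```python
from enum import Enum

class TrustLevel(Enum):
    """The four trust levels, from least to most oversight."""
    FREE_RUN = 1      # read-only, reversible: no approval needed
    WHITELISTED = 2   # pre-approved write patterns
    GATE = 3          # human in the loop, reviewed in real time
    BLOCK = 4         # never allowed, no override

# Because the levels are ordered, a policy can compare them directly,
# e.g. "escalate to at least GATE" for anything novel.
assert TrustLevel.GATE.value > TrustLevel.WHITELISTED.value
```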

Free run: read-only, ephemeral, reversible

Actions that observe but don't change the world. Reading files, querying logs, running tests, listing containers — these are inherently low risk. If the agent reads the wrong file, nothing breaks. Let these run without friction.

Whitelisted: known-safe write operations

Actions that modify state but in ways you've already thought through and approved. A staging deploy from a specific branch. Restarting a known service. Creating files in a temp directory. These don't need real-time review — they need a deliberate decision recorded somewhere.

Whitelists are the productivity lever. Once an operation is reviewed and approved once, it shouldn't require human attention every subsequent time. The whitelist is the record of your trust decisions.

Gate: novel, high-impact, or context-dependent

Actions that modify production state, touch sensitive data, or whose safety depends on context you can't encode in a static pattern. These warrant a real human looking at them in real time.

Block: never, under any circumstances

A short list of things that should simply never happen from an automated process, full stop. Not gated — blocked. No approval flow, no override, no "except in emergencies."

The two questions that determine trust level

For any agent action, ask:

1. Is it reversible?

If something goes wrong, can you undo it? Reading a file: trivially reversible (nothing happened). Appending to a log: reversible with effort. Dropping a database: not reversible. Sending an email to 10,000 customers: not reversible.

Irreversibility is the primary driver of required oversight.

2. Is the blast radius bounded?

If the action goes wrong, how many things are affected? Deleting a temp file: blast radius of one file. Running a migration on a table: blast radius of that table. Running a migration without a transaction on all tables in prod: potentially everything.

Unbounded blast radius requires a human to consciously accept that risk.

Map these two dimensions and you get most of the spectrum:

| Action type | Reversible? | Blast radius | Trust level |
| --- | --- | --- | --- |
| Read file / query log | ✓ trivially | None | Free run |
| Run tests (sandboxed) | ✓ trivially | None | Free run |
| Deploy to staging | ✓ with effort | Staging only | Whitelist |
| Restart known service | ✓ quickly | One service | Whitelist |
| Deploy to production | ⚠ with rollback | All users | Gate |
| Add DB column (with migration) | ⚠ usually | One table | Gate |
| Revoke API credentials | ⚠ if you have them | All dependent services | Gate |
| DROP table in production | ✗ | Everything dependent | Block |
| rm -rf on unknown path | ✗ | Unknown | Block |
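The mapping above can be sketched as a tiny decision function. This is a simplified sketch: real policies match on command patterns and context, not on pre-labeled inputs, and the category names are illustrative.

```python
def trust_level(reversible: str, blast_radius: str) -> str:
    """Map the two questions to a trust level.

    reversible: "trivially", "quickly", "with_effort", "with_rollback", "no"
    blast_radius: "none", "bounded", "production", "unbounded"
    """
    if reversible == "no" and blast_radius in ("production", "unbounded"):
        return "block"      # irreversible and wide: never automate
    if blast_radius in ("production", "unbounded"):
        return "gate"       # recoverable but wide: human in the loop
    if reversible == "trivially" and blast_radius == "none":
        return "free_run"   # read-only, nothing to undo
    return "whitelist"      # bounded writes: pre-approve deliberately

assert trust_level("trivially", "none") == "free_run"       # read a file
assert trust_level("with_effort", "bounded") == "whitelist" # staging deploy
assert trust_level("with_rollback", "production") == "gate" # prod deploy
assert trust_level("no", "unbounded") == "block"            # rm -rf unknown path
```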

Trust erodes over time without maintenance

Here's a subtle failure that teams often miss: a whitelist entry that was safe six months ago may not be safe today. The staging environment that used to be truly isolated now has a VPN tunnel to production. The "safe" temp directory is now a symlink to something important. The service you can safely restart now has a 90-second startup time that cascades into a timeout in a dozen dependent services.

Trust decisions need expiry dates. Not because agents become untrustworthy, but because the world they operate in changes. Whitelist rules should have TTLs. High-risk rules should have shorter ones.

A whitelist entry that never expires is a trust decision made once and never revisited. In a system that changes, that's a liability, not an asset.
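One way to sketch a whitelist entry with an expiry date; the field names are illustrative, not expacti's schema. The important property is the fallback: an expired rule reverts to gating, never to free run.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class WhitelistRule:
    pattern: str            # e.g. a regex over the command
    approved_by: str        # the recorded trust decision
    approved_at: datetime
    ttl: timedelta          # shorter for higher-risk rules

    def is_active(self, now: datetime) -> bool:
        """An expired rule falls back to gating, not to free run."""
        return now < self.approved_at + self.ttl

now = datetime.now(timezone.utc)
rule = WhitelistRule(
    pattern=r"^systemctl restart app-worker$",  # hypothetical service
    approved_by="alice",
    approved_at=now - timedelta(days=40),
    ttl=timedelta(days=30),  # production-touching: short TTL
)
assert not rule.is_active(now)  # expired: the action goes back to the gate
```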

The "first time" rule

One of the most reliable heuristics: if the agent has never done this specific thing before in this specific context, gate it — regardless of how it looks statically.

docker pull nginx:latest might be fine. But the first time an agent runs it on a production host, something shifted. Maybe the image introduces a version incompatibility. Maybe this is the first time Docker is being used on this host at all. The novelty itself is a signal worth a human glance.

This is why expacti's anomaly detection tracks first-seen commands per target. It's not that docker pull is dangerous — it's that "I've never seen this agent do this here before" is a meaningful flag.
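The core of first-seen tracking needs nothing more than a set keyed by (command, target) pairs. This is a generic sketch of the idea, not expacti's implementation:

```python
seen: set[tuple[str, str]] = set()  # (command, target) pairs already observed

def is_first_seen(command: str, target: str) -> bool:
    """True if this agent has never run this command on this target.

    A first-seen pair doesn't mean the command is dangerous; it means
    the novelty itself is worth a human glance, so gate it once.
    """
    key = (command, target)
    if key in seen:
        return False
    seen.add(key)
    return True

assert is_first_seen("docker pull nginx:latest", "prod-host-1")      # novel: gate it
assert not is_first_seen("docker pull nginx:latest", "prod-host-1")  # seen before
assert is_first_seen("docker pull nginx:latest", "prod-host-2")      # new target: gate again
```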

Multi-party approval: when one human isn't enough

For a small class of actions, one reviewer isn't sufficient. Consider:

- Revoking credentials that every dependent service relies on
- Running an irreversible migration against production data
- Sending a message to your entire customer base

These warrant requiring N of M approvers — two engineers, or an engineer plus a security lead. Not because any single reviewer is untrustworthy, but because the blast radius exceeds any one person's accountability scope.
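An N-of-M check is just a quorum over distinct approvers. This is a sketch: a role requirement like "must include a security lead" would add a second predicate on top.

```python
def quorum_met(approvals: list[str], required: int) -> bool:
    """Require `required` distinct approvers; duplicates don't count."""
    return len(set(approvals)) >= required

assert not quorum_met(["alice"], 2)           # one reviewer isn't enough
assert not quorum_met(["alice", "alice"], 2)  # re-approving yourself doesn't count
assert quorum_met(["alice", "bob"], 2)        # two engineers: quorum met
```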

Building your trust policy

Here's a practical starting point for a team deploying an AI coding or DevOps agent:

Week 1: Start tight

Gate everything except read-only operations. Let the agent run. Watch what it does. After a week, look at your approval log — you'll see patterns emerge.

Week 2: Build your whitelist from data

Take the most common approved actions from week 1 and whitelist them. These are your "boring and safe" category. Expacti's AI suggestions will propose regex patterns from your approval history automatically.
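Even without tooling, a week-one approval log can be mined with a simple frequency count: the most commonly approved actions are your whitelist candidates. A sketch, assuming the log is a list of approved command strings (the commands and threshold here are made up):

```python
from collections import Counter

approval_log = [
    "npm test", "npm test", "git diff", "npm test",
    "kubectl rollout restart deploy/staging-api",
    "npm test", "git diff",
]

# Actions approved most often are the "boring and safe" whitelist candidates.
candidates = [cmd for cmd, n in Counter(approval_log).most_common() if n >= 3]
assert candidates == ["npm test"]
```

Each candidate still deserves a deliberate review before it becomes a rule; frequency tells you where to look, not what to trust.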

Week 3+: Tune the gates

Add TTLs to whitelist rules for anything production-touching. Set multi-party approval requirements for your highest-stakes operations. Configure anomaly detection sensitivity for your environment's baseline.

The goal isn't to minimize approvals — it's to make approval decisions deliberate. An approval you click through in two seconds because you've seen it a hundred times is noise. An approval that makes you pause and think is signal. Build a system where you mostly see signal.

The practical answer

Trust your AI agent with observations and reversible, bounded actions. Gate novel, high-impact, or irreversible operations. Block a short list of things that should never happen automatically. And build in mechanisms to revisit your trust decisions as your environment changes.

The question isn't "do I trust this agent?" It's "for this specific action, in this specific context, at this specific time — is the cost of a wrong call acceptable without a human in the loop?" Answer that question explicitly, per action, and you'll have a trust policy worth keeping.

Build your trust policy with expacti

Per-command gating, whitelist with TTLs, anomaly detection, and multi-party approval — all in one place.

See the interactive demo
Get started free