When to trust your AI agent (and when not to)
8 min read
There's a failure mode that's easy to fall into when you first start thinking about AI agent safety: treating every action as equally dangerous. If your answer to "how much should I trust my agent?" is "not at all, gate everything" — you'll build a system so slow and friction-heavy that nobody will use it. The agent becomes useless.
The opposite failure is worse: "trust everything, it's fine." That's how you end up with a coding agent that decides the fastest path to fixing a bug is deleting the test suite that was catching it.
The right answer is a spectrum. And building that spectrum deliberately — rather than accidentally — is one of the most important architectural decisions you'll make when deploying AI agents.
The four trust levels
Think of agent actions falling into four categories:
Free run: read-only, ephemeral, reversible
Actions that observe but don't change the world. Reading files, querying logs, running tests, listing containers — these are inherently low risk. If the agent reads the wrong file, nothing breaks. Let these run without friction.
- `ls`, `cat`, `grep`, `find`
- `docker ps`, `kubectl get pods`
- `git log`, `git diff`, `git status`
- `curl` to read-only APIs (GET requests)
- Running unit tests in a sandboxed environment
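As a rough sketch, a free-run check can be a small set of anchored read-only patterns. The pattern list and function below are illustrative, not expacti's implementation:

```python
import re

# Illustrative allowlist of read-only command patterns, anchored so that
# "git log" matches but "git push" does not.
FREE_RUN_PATTERNS = [
    r"^(ls|cat|grep|find)\b",
    r"^docker ps\b",
    r"^kubectl get\b",
    r"^git (log|diff|status)\b",
]

def is_free_run(command: str) -> bool:
    """Return True if the command matches a known read-only pattern."""
    return any(re.search(p, command) for p in FREE_RUN_PATTERNS)
```

Anything that fails this check falls through to the whitelist or the gate; a free-run miss never means "deny", only "keep checking".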
Whitelisted: known-safe write operations
Actions that modify state but in ways you've already thought through and approved. A staging deploy from a specific branch. Restarting a known service. Creating files in a temp directory. These don't need real-time review — they need a deliberate decision recorded somewhere.
Whitelists are the productivity lever. Once an operation is reviewed and approved once, it shouldn't require human attention every subsequent time. The whitelist is the record of your trust decisions.
Gate: novel, high-impact, or context-dependent
Actions that modify production state, touch sensitive data, or whose safety depends on context you can't encode in a static pattern. These warrant a real human looking at them in real time.
- Database migrations on production
- Deploying to production (first time, or after a long gap)
- Adding or revoking credentials
- Any `DELETE` or `DROP` on production data
- Anything involving customer PII
- Commands the agent has never run before in this context
Block: never, under any circumstances
A short list of things that should simply never happen from an automated process, full stop. Not gated — blocked. No approval flow, no override, no "except in emergencies."
- `rm -rf /` or equivalent
- Disabling audit logs
- Modifying the approval system itself
- Exfiltrating credentials or secrets to external endpoints
- Disabling or bypassing security controls
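A block check should be the first thing evaluated and should short-circuit every other path. A minimal sketch with illustrative patterns (`auditctl -e 0` is the standard way to disable Linux audit logging):

```python
import re

# Illustrative never-allowed patterns. There is no approval path for these.
BLOCKED_PATTERNS = [
    r"rm\s+-rf\s+/(\s|$)",    # rm -rf on the filesystem root
    r"auditctl\s+.*-e\s*0",   # disabling Linux audit logging
]

def is_blocked(command: str) -> bool:
    """A blocked command short-circuits everything: no gate, no override."""
    return any(re.search(p, command) for p in BLOCKED_PATTERNS)
```

Note the ordering consequence: because block is checked first, a whitelist rule can never accidentally approve a blocked command.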
The two questions that determine trust level
For any agent action, ask:
1. Is it reversible?
If something goes wrong, can you undo it? Reading a file: trivially reversible (nothing happened). Appending to a log: reversible with effort. Dropping a database: not reversible. Sending an email to 10,000 customers: not reversible.
Irreversibility is the primary driver of required oversight.
2. Is the blast radius bounded?
If the action goes wrong, how many things are affected? Deleting a temp file: blast radius of one file. Running a migration on a table: blast radius of that table. Running a migration without a transaction on all tables in prod: potentially everything.
Unbounded blast radius requires a human to consciously accept that risk.
Map these two dimensions and you get most of the spectrum:
| Action type | Reversible? | Blast radius | Trust level |
|---|---|---|---|
| Read file / query log | ✓ trivially | None | Free run |
| Run tests (sandboxed) | ✓ trivially | None | Free run |
| Deploy to staging | ✓ with effort | Staging only | Whitelist |
| Restart known service | ✓ quickly | One service | Whitelist |
| Deploy to production | ⚠ with rollback | All users | Gate |
| Add DB column (with migration) | ⚠ usually | One table | Gate |
| Revoke API credentials | ⚠ if you have them | All dependent services | Gate |
| DROP table in production | ✗ | Everything dependent | Block |
| rm -rf on unknown path | ✗ | Unknown | Block |
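The table above can be collapsed into a small decision function. This is a coarse sketch; real policies layer whitelists and anomaly signals on top, and the `production` flag is an assumption added here to separate staging from prod:

```python
from enum import Enum

class TrustLevel(Enum):
    FREE_RUN = "free run"
    WHITELIST = "whitelist"
    GATE = "gate"
    BLOCK = "block"

def trust_level(read_only: bool, reversible: bool, bounded: bool, production: bool) -> TrustLevel:
    """Coarse mapping from the two questions (plus two context flags) to a trust level."""
    if read_only:
        return TrustLevel.FREE_RUN  # observes, never mutates
    if not reversible and not bounded:
        return TrustLevel.BLOCK     # can't undo, can't contain: never automate
    if not reversible or production:
        return TrustLevel.GATE      # needs a human looking in real time
    return TrustLevel.WHITELIST     # reversible, bounded, non-production write
```

Tracing the table through this function: a staging deploy is a reversible, bounded, non-production write (whitelist); a production deploy is production-touching (gate); a `DROP` of a production table is irreversible with unbounded dependents (block).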
Trust erodes over time without maintenance
Here's a subtle failure that teams often miss: a whitelist entry that was safe six months ago may not be safe today. The staging environment that used to be truly isolated now has a VPN tunnel to production. The "safe" temp directory is now a symlink to something important. The service you can safely restart now has a 90-second startup time that cascades into a timeout in a dozen dependent services.
Trust decisions need expiry dates. Not because agents become untrustworthy, but because the world they operate in changes. Whitelist rules should have TTLs. High-risk rules should have shorter ones.
A whitelist entry that never expires is a trust decision made once and never revisited. In a system that changes, that's a liability, not an asset.
The "first time" rule
One of the most reliable heuristics: if the agent has never done this specific thing before in this specific context, gate it — regardless of how it looks statically.
`docker pull nginx:latest` might be fine. But the first time an agent runs it on a production host, something has shifted. Maybe the image introduces a version incompatibility. Maybe this is the first time Docker is being used on this host at all. The novelty itself is a signal worth a human glance.
This is why expacti's anomaly detection tracks first-seen commands per target. It's not that `docker pull` is dangerous — it's that "I've never seen this agent do this here before" is a meaningful flag.
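First-seen tracking can be as simple as a set of (target, command) pairs. This is a sketch of the idea, not expacti's implementation:

```python
def is_first_seen(seen: set[tuple[str, str]], target: str, command: str) -> bool:
    """True if this exact command has never run on this target before.
    Records the pair so subsequent runs are no longer flagged as novel."""
    key = (target, command)
    novel = key not in seen
    seen.add(key)
    return novel
```

A real system would persist `seen` and probably normalize commands before keying, but the policy hook is the same: `novel == True` escalates the action to the gate regardless of what static rules say.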
Multi-party approval: when one human isn't enough
For a small class of actions, one reviewer isn't sufficient. Consider:
- Any command that modifies the access control system itself
- Database operations affecting billing or financial data
- Bulk data deletion above a threshold
- Infrastructure changes during a production incident
These warrant requiring N of M approvers — two engineers, or an engineer plus a security lead. Not because any single reviewer is untrustworthy, but because the blast radius exceeds any one person's accountability scope.
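An N-of-M check reduces to counting distinct eligible approvers. A sketch, with placeholder approver names:

```python
def is_approved(approvals: set[str], required: int, eligible: set[str]) -> bool:
    """N-of-M approval: the action proceeds only once `required` distinct
    approvers from the eligible set have signed off."""
    return len(approvals & eligible) >= required
```

Intersecting with `eligible` matters: an approval from someone outside the designated set should count for nothing, even if they hold approval rights elsewhere in the system.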
Building your trust policy
Here's a practical starting point for a team deploying an AI coding or DevOps agent:
Week 1: Start tight
Gate everything except read-only operations. Let the agent run. Watch what it does. After a week, look at your approval log — you'll see patterns emerge.
Week 2: Build your whitelist from data
Take the most common approved actions from week 1 and whitelist them. These are your "boring and safe" category. Expacti's AI suggestions will propose regex patterns from your approval history automatically.
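Mining the approval log for whitelist candidates can start as a plain frequency count before graduating to regex generalization. The threshold and commands below are illustrative:

```python
from collections import Counter

def whitelist_candidates(approval_log: list[str], min_approvals: int = 5) -> list[str]:
    """Commands approved at least `min_approvals` times are candidates for
    the whitelist, most frequent first (threshold is illustrative)."""
    counts = Counter(approval_log)
    return [cmd for cmd, n in counts.most_common() if n >= min_approvals]
```

The output is a proposal, not a policy: each candidate still deserves a deliberate review before it becomes a whitelist rule.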
Week 3+: Tune the gates
Add TTLs to whitelist rules for anything production-touching. Set multi-party approval requirements for your highest-stakes operations. Configure anomaly detection sensitivity for your environment's baseline.
The goal isn't to minimize approvals — it's to make approval decisions deliberate. An approval you click through in two seconds because you've seen it a hundred times is noise. An approval that makes you pause and think is signal. Build a system where you mostly see signal.
The practical answer
Trust your AI agent with observations and reversible, bounded actions. Gate novel, high-impact, or irreversible operations. Block a short list of things that should never happen automatically. And build in mechanisms to revisit your trust decisions as your environment changes.
The question isn't "do I trust this agent?" It's "for this specific action, in this specific context, at this specific time — is the cost of a wrong call acceptable without a human in the loop?" Answer that question explicitly, per action, and you'll have a trust policy worth keeping.
Build your trust policy with expacti
Per-command gating, whitelist with TTLs, anomaly detection, and multi-party approval — all in one place.
See the interactive demo Get started free