AI Safety Incidents
Your AI coding agent just deleted the production database. Not because it was hacked. Not because of a prompt injection attack. It ran DROP TABLE users because it was trying to reset the test environment and couldn't tell the difference between your staging connection string and your production one.
This is the misfire problem. And it's not theoretical. As AI agents gain the ability to execute shell commands, deploy infrastructure, and modify running systems, the gap between "helpful automation" and "career-ending incident" is exactly one bad command.
The anatomy of an AI agent misfire
Let's be clear about what we're not talking about. This isn't about rogue AI or adversarial attacks. Misfires happen for a much more mundane reason: agents optimize for the task they're given, and they don't second-guess whether the task itself is appropriate in context.
An agent told to "clean up the git history" will do exactly that. It doesn't ask whether other developers have pulled the branch. It doesn't check if there's a deployment in progress. It doesn't wonder whether force-pushing to main is against team policy. It sees a task, identifies a command that accomplishes it, and executes.
This is the core tension with autonomous agents: the same literal-mindedness that makes them productive also makes them dangerous. A human engineer might hesitate before running rm -rf / because they've seen what it does. An agent has no such instinct. It evaluates commands by whether they achieve the stated goal, not by whether they're safe.
The misfire formula: Correct intent + wrong context + autonomous execution = incident. Every single time.
The agent wasn't wrong about the command. It was wrong about when, where, and whether to run it. And by the time you notice, the command has already finished executing.
Blast radius by command type
Not all misfires are created equal. The damage an errant command can do depends entirely on what category it falls into. Here's a practical breakdown:
| Command type | Examples | Blast radius | Recovery |
|---|---|---|---|
| Read-only | `ls`, `cat`, `git log` | Zero | N/A |
| Config changes | `chmod`, `export ENV=`, `git config` | Contained | Usually reversible |
| Data mutation | `rm -rf`, `DROP TABLE`, `truncate` | Severe | Backup-dependent |
| Infrastructure | `kubectl delete`, `terraform destroy` | Catastrophic | Hours to days |
| History rewrite | `git push --force`, `git rebase` on shared branches | Severe | Requires coordination |
The pattern is clear: the further a command reaches beyond local state into shared infrastructure, the larger the blast radius and the harder the recovery. Read-only commands are harmless. Anything that writes, deletes, or reconfigures carries escalating risk.
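The taxonomy above can be sketched as a small classifier. This is an illustrative sketch, not a complete or production-grade ruleset: the patterns are drawn directly from the table's examples, and real commands would need far more careful parsing than a regex match.

```python
# Illustrative blast-radius classifier. Patterns mirror the table's
# examples and are assumptions, not an exhaustive taxonomy.
import re

BLAST_RADIUS = [
    # (pattern, category, radius)
    (r"^(ls|cat|git log)\b",                   "read-only",       "zero"),
    (r"^(chmod|export|git config)\b",          "config change",   "contained"),
    (r"^(rm -rf|truncate)\b|DROP TABLE",       "data mutation",   "severe"),
    (r"^(kubectl delete|terraform destroy)\b", "infrastructure",  "catastrophic"),
    (r"git push --force|git rebase",           "history rewrite", "severe"),
]

def classify(command: str):
    """Return (category, blast radius) for a command; unknown if no match."""
    for pattern, category, radius in BLAST_RADIUS:
        if re.search(pattern, command):
            return category, radius
    return "unknown", "unknown"
```

Note that the default is "unknown", not "safe": a gate built on this kind of classifier should treat anything it cannot categorize as requiring review.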
Most teams only think about this taxonomy after an incident. By then, the taxonomy has already been demonstrated on their production systems.
Incident walkthrough: the force-push that wrecked the afternoon
Let's trace a real-world scenario. An AI coding agent is helping a developer clean up a feature branch. The developer says something like: "the commit history on main is messy, can you squash and clean it up?" The agent interprets this as permission to rewrite history on main.
Here's the timeline:
- T+0:00 — The agent runs `git push --force origin main`, rewriting the main branch history. The command succeeds in 2 seconds.
- T+3:00 — CI/CD picks up the new HEAD on main. A fresh deployment pipeline triggers automatically.
- T+7:00 — Staging tests start failing. Slack alerts fire. The deployment pipeline is now rolling out code based on rewritten history.
- T+12:00 — An engineer notices the alerts and starts investigating. They see unfamiliar commit SHAs on main. Confusion sets in.
- T+25:00 — Root cause identified. Someone checks the terminal session and discovers the agent force-pushed to main. The original commit history is gone from the remote.
- T+45:00 — Recovery begins. A senior engineer finds the original HEAD in the reflog of a teammate's local clone, force-pushes the corrected history back. CI/CD triggers again.
- T+3:00:00 — Full CI/CD re-run completes. All tests pass. The deployment is verified. All-clear is called.
Final cost: 3 hours of developer time across 4 engineers, 1 reverted release, a broken staging environment, and 1 very unhappy CTO asking "why does an AI have push access to main?"
The agent did exactly what it was asked to do. That's the problem. Nobody asked it to check whether force-pushing to a shared branch was safe. Nobody configured a guardrail that would have caught it. The command was perfectly valid. The context made it catastrophic.
The reversibility principle
Here's the mental model that should govern every AI agent's access to your systems: evaluate commands before execution, not after.
Most incident response is reactive. Something breaks, you investigate, you fix it, you write a post-mortem. But with AI agents executing commands autonomously, the window between "command issued" and "damage done" is measured in milliseconds. There is no time to react.
The reversibility principle flips this. Before any command runs, ask:
- Is this command reversible? If yes, the risk is bounded. A bad `chmod` can be undone. A bad `git commit` can be reverted.
- Is this command irreversible? If yes, it needs human approval. A `DROP TABLE` cannot be un-dropped. A `terraform destroy` cannot be un-destroyed.
- Does this command affect shared state? If yes, the blast radius extends beyond the agent's session. Force-pushes, production deployments, and infrastructure changes all fall here.
A pre-execution gate that applies this principle turns catastrophic failures into non-events. The dangerous command never runs. The agent pauses, a human reviews, and either approves or rejects. The 3-hour incident becomes a 30-second conversation.
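A minimal sketch of such a gate, applying the reversibility questions above. The pattern lists and the `require_approval` callback are assumptions for illustration; they stand in for whatever review mechanism (CLI prompt, Slack approval, web UI) a real system would use.

```python
# Minimal pre-execution gate applying the reversibility principle.
# Pattern lists are illustrative assumptions, not a complete policy.
import re

IRREVERSIBLE = [r"\brm -rf\b", r"\bDROP TABLE\b", r"\bterraform destroy\b"]
SHARED_STATE = [r"git push --force", r"\bkubectl\b", r"\bterraform\b"]

def gate(command: str, require_approval) -> bool:
    """Return True if the command may run.

    Reversible, local commands pass through; anything irreversible or
    touching shared state pauses and asks a human via require_approval.
    """
    risky = any(re.search(p, command) for p in IRREVERSIBLE + SHARED_STATE)
    if not risky:
        return True                      # reversible, local: bounded risk
    return require_approval(command)     # irreversible or shared: human decides
```

The key design choice is that the gate runs before execution and blocks by default on risk: a rejected command never touches the system, so its blast radius is zero.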
How expacti reduces blast radius
This is the exact problem expacti was built to solve. Instead of hoping your agent makes the right call, you put a deterministic gate between the agent and your systems.
Block before execution
When an agent running through expacti issues a command like git push --force origin main, the command does not execute. It's intercepted, held, and surfaced for review. The blast radius of a blocked command is zero.
Risk scoring surfaces danger immediately
Every command is scored before a reviewer sees it. A git log gets a LOW score and flows through automatically if whitelisted. A terraform destroy gets a HIGH score and is flagged with a red banner. The reviewer doesn't have to evaluate risk from scratch — the system has already done the triage.
Session recording for forensics
Even commands that are whitelisted and auto-approved are logged. Every keystroke, every output, every command in the session is recorded. When something goes wrong — and eventually, something always does — you have a complete forensic trail. No guessing, no "what did the agent do?" Just play back the session.
Whitelist patterns let safe commands flow
The goal isn't to block everything. That would make agents useless. Whitelist patterns let you define exactly which commands are safe to auto-approve: git status, ls, cat, npm test. Everything else pauses for review. You get the speed of automation for routine commands and human judgment for everything that matters.
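One simple way to express such a whitelist is with shell-style glob patterns. This is a hypothetical sketch of the idea, not expacti's actual configuration format; the pattern list just mirrors the examples above.

```python
# Sketch of whitelist matching with shell-style globs. The pattern
# list is illustrative; a real policy would be configured per team.
from fnmatch import fnmatch

WHITELIST = ["git status", "ls*", "cat *", "npm test"]

def auto_approved(command: str) -> bool:
    """Auto-approve only commands matching a whitelist pattern.

    Everything that fails to match pauses for human review, so the
    default posture is "ask", not "allow".
    """
    return any(fnmatch(command.strip(), p) for p in WHITELIST)
```

With this, `ls -la` and `cat README.md` flow through automatically, while `git push --force origin main` falls through to review.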
Recovery playbook: the first 5 minutes after a misfire
Prevention is the goal, but you need a plan for when prevention fails. If an agent has already executed a destructive command, the first 5 minutes determine whether this is a 30-minute fix or a 3-day outage. Here's the playbook:
Step 1: Stop the agent
Kill the session immediately. Don't let the agent run any more commands. Don't try to get the agent to "fix" what it just broke — it got you here in the first place. Terminate the process, close the connection, cut power if you have to. Every second the agent continues is another command that might make things worse.
Step 2: Assess blast radius
What did it touch? Check the command history. Was it a local-only change or did it affect remote systems? Did it modify data, infrastructure, or configuration? Is the damage still propagating (e.g., a CI/CD pipeline deploying bad code)?
Step 3: Preserve evidence
Before you start fixing, capture everything. Save session logs, terminal output, and agent conversation history. If you're using expacti, the session recording is already there. If not, screenshot and copy everything you can. You'll need this for the post-mortem, and you might need it for compliance.
Step 4: Begin recovery
Roll back, restore from backup, or redeploy the last known good state. The specific recovery depends on what was damaged: git reflog for force-pushes, backup restores for deleted databases, terraform apply from the last good state for infrastructure. Don't improvise — follow your runbook.
Step 5: Post-mortem
Once the fire is out, answer the hard questions. How did the agent get access to run that command? Why wasn't there a guardrail? What whitelist or approval gate would have prevented this? Then implement that gate before you give the agent access again.
Key insight: Every misfire post-mortem arrives at the same conclusion — there should have been an approval gate. The only variable is how expensive the lesson was to learn.
Don't wait for your first misfire
AI agents are getting more capable and more autonomous every month. The commands they can run are getting more powerful. The systems they can access are getting more critical. And the gap between "the agent ran a command" and "someone reviewed that command" is the gap where incidents live.
You can close that gap proactively, or you can close it after your agent force-pushes to main, drops a production table, or tears down your Kubernetes cluster. The outcome is the same — you'll end up with an approval gate either way. The only question is whether you add it before or after the incident that makes it obvious.
Don't wait for your first misfire to add an approval gate
Expacti intercepts dangerous commands before they execute. Block, review, approve — in seconds, not hours.
Get started free →