CTRL: The 83% Bug Detection Framework for AI Agents
Why testing AI agents is fundamentally different from testing software, and how the CTRL framework delivers an 83% bug detection rate across any language.
Testing AI agents is hard because agents are non-deterministic. Traditional software testing relies on ‘Input A’ always resulting in ‘Output B’. With agents, ‘Input A’ might result in ‘Output B’ on Monday, ‘Output C’ on Tuesday, and a hallucinated error message by Wednesday.
To solve this, we built the CTRL framework (Close The Running Loop). In our internal benchmarks, this methodology achieves an 83% bug detection rate by shifting the focus from output matching to behavior verification.
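To make the difference between output matching and behavior verification concrete, here is a minimal sketch. Everything in it is hypothetical: `flaky_agent` is a stand-in for a real agent, and the refund scenario, tool name, and order ID are invented for illustration. The point is that the check ignores the exact wording and asserts only the invariants that must hold on every run.

```python
import random

def flaky_agent(question, rng):
    """Stand-in for a non-deterministic agent: the wording varies between runs."""
    phrasing = rng.choice([
        "Refund issued.",
        "I have refunded the order.",
        "Done, refund sent.",
    ])
    return {"tool": "issue_refund", "args": {"order_id": "A-100"}, "message": phrasing}

def check_behavior(result):
    """Return the list of violated invariants (an empty list means the run passed)."""
    problems = []
    if result.get("tool") != "issue_refund":
        problems.append("wrong tool selected")
    if result.get("args", {}).get("order_id") != "A-100":
        problems.append("wrong order id")
    if not result.get("message"):
        problems.append("empty user-facing message")
    return problems

# Run the same input many times; the text differs, the behavior must not.
rng = random.Random(0)
reports = [check_behavior(flaky_agent("Refund order A-100", rng)) for _ in range(20)]
failed_runs = [r for r in reports if r]
print(f"{len(failed_runs)} of {len(reports)} runs violated an invariant")
```

An exact-string assertion on `message` would fail on most of these runs even though the agent did the right thing each time.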
The 4-Layer Testing Pyramid
The core of CTRL is a four-layer pyramid that balances speed, cost, and coverage.
1. Unit Layer (Pure Logic)
Speed: < 1s | Frequency: Every commit
This layer tests the deterministic parts of your agent: input validation, prompt template assembly, and tool-call parsing. If your agent can’t correctly parse a JSON response, it doesn’t matter how smart the model is.
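As a rough illustration of what a Unit-layer target might look like, here is a small tool-call parser. The `{"tool": ..., "args": ...}` reply shape, the `parse_tool_call` name, and the `ToolCallError` class are assumptions made for this sketch, not part of CTRL itself.

```python
import json

class ToolCallError(ValueError):
    """Raised when a model reply cannot be turned into a tool call."""

def parse_tool_call(raw):
    """Parse a reply of the assumed form {"tool": str, "args": object}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ToolCallError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ToolCallError("top-level value must be an object")
    tool = data.get("tool")
    args = data.get("args", {})  # a missing "args" is treated as no arguments
    if not isinstance(tool, str) or not tool:
        raise ToolCallError("missing or empty 'tool'")
    if not isinstance(args, dict):
        raise ToolCallError("'args' must be an object")
    return tool, args
```

Because the function is pure, its tests need no network, no model, and no fixtures, which is what keeps this layer under a second.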
2. E2E Layer (System Integration)
Speed: 1-5 minutes | Frequency: Before every push
Here, we test the agent’s interaction with internal systems: API routes, database state, and authentication flows. We use mock LLM responses to ensure the system architecture is sound without burning tokens.
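One way to mock LLM responses is to inject a scripted fake in place of the real client. The sketch below assumes a hypothetical `handle_ticket` route handler and uses a plain dict as the "database"; neither comes from the CTRL codebase.

```python
class FakeLLM:
    """Returns scripted replies in order and records the prompts it received."""

    def __init__(self, replies):
        self._replies = list(replies)
        self.prompts = []

    def complete(self, prompt):
        self.prompts.append(prompt)
        return self._replies.pop(0)

def handle_ticket(llm, db, ticket_id):
    """Hypothetical handler: ask the model to classify a ticket, then persist the status."""
    prompt = f"Classify ticket {ticket_id}: {db[ticket_id]['text']}"
    status = llm.complete(prompt).strip().lower()
    if status not in {"open", "closed"}:
        status = "needs_review"  # never persist a label the system does not know
    db[ticket_id]["status"] = status
    return status

# The test drives the whole handler, but no tokens are spent.
db = {"T1": {"text": "Issue resolved, thanks!"}}
llm = FakeLLM(["Closed\n"])
result = handle_ticket(llm, db, "T1")
```

The assertions then target database state and the number of model calls rather than the model's wording.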
3. Live Layer (Real-World Integration)
Speed: 5-30 minutes | Frequency: Pre-release
This is where it gets real. We run the agent against live instances of third-party APIs like Stripe or OpenAI. We verify that the agent can handle real network latency, rate limits, and evolving API schemas.
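Rate-limit handling is a typical example of behavior that only shows up against a live API. The sketch below shows one common approach, exponential backoff, using a simulated endpoint so that it runs offline. The `RateLimited` exception and `call_with_backoff` helper are hypothetical names; a real Live-layer test would wrap an actual API call instead.

```python
import time

class RateLimited(Exception):
    """Hypothetical error for an HTTP 429 response."""

def call_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)

# Simulated endpoint: rejects the first two calls, then succeeds.
calls = {"count": 0}

def simulated_endpoint():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimited()
    return "ok"

delays = []  # record the delays instead of actually sleeping
outcome = call_with_backoff(simulated_endpoint, sleep=delays.append)
```

Passing `sleep` as a parameter lets the same helper be checked instantly here and exercised with real waits in the Live layer.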
4. Docker Layer (Cold-Start Deployment)
Speed: 5-10 minutes | Frequency: CI/CD pipeline
The final boss. We spin up the entire application in a fresh Docker container to ensure there are no “it works on my machine” issues. This layer tests environment variables, dependency conflicts, and deployment configuration.
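A cold-start check can be as simple as failing fast when required configuration is absent. This is a minimal sketch of that idea; the variable names in `REQUIRED_VARS` are placeholders, and a real project would list its own.

```python
import os

REQUIRED_VARS = ("DATABASE_URL", "OPENAI_API_KEY")  # placeholder names

def missing_env_vars(required, environ=None):
    """Return the required variables that are unset or blank, in the order given."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name, "").strip()]

def assert_environment(required=REQUIRED_VARS, environ=None):
    """Raise at startup rather than failing halfway through a request."""
    missing = missing_env_vars(required, environ)
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
```

Running `assert_environment()` as the container's first step turns a silent misconfiguration into an immediate, readable failure.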
Get Started in 30 Seconds
CTRL works with any language. Pick your install method and you’re up in one command.
Option 1: Universal Installer (Recommended)
Run this in the root of any project — it auto-detects your language (JS/TS, Python, Go, Rust, PHP, Ruby, Java, C#) and sets up everything:
curl -fsSL https://raw.githubusercontent.com/henrino3/ctrl/master/scripts/ctrl-bootstrap.sh | bash
Option 2: npx (Node.js projects)
If you’re in a Node.js project, this is the fastest path:
npx close-the-loop init
Both methods will:
- Detect your project’s language and test framework
- Create an AGENTS.md with the correct test commands for your stack (e.g., npm run test:unit, pytest, cargo test)
- Set up a .github/workflows/ctrl.yml for CI/CD
- Add a .cursorrules/.clauderc file so your AI coding agent knows the rules
Once installed, your agent gets two gates:
# Fast gate — run before every push
npm run ctrl:gate # build + unit + e2e
# Full gate — run before release
npm run ctrl:full # gate + live + docker
For Python projects, the equivalent commands are generated automatically (pytest, tox, etc.).
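Conceptually, a gate is an ordered list of steps that stops at the first failure. The sketch below illustrates that idea only: the step names, test directories, and the `run_gate` helper are assumptions, not the commands the installer actually generates.

```python
import subprocess
import sys

def run_gate(steps, runner=None):
    """Run (name, command) steps in order; return the first failing step's name, or None."""
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd).returncode
    for name, cmd in steps:
        if runner(cmd) != 0:
            return name  # stop here: later, slower layers never start
    return None

# Hypothetical fast gate for a Python project (the paths are placeholders).
FAST_GATE = [
    ("unit", [sys.executable, "-m", "pytest", "tests/unit"]),
    ("e2e", [sys.executable, "-m", "pytest", "tests/e2e"]),
]
```

Ordering the steps from cheapest to most expensive means a broken unit test is reported in seconds instead of after a full end-to-end run.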
Multi-Language Support
The bootstrap system covers eight stacks: Python, JavaScript/TypeScript, Go, Rust, PHP, Ruby, Java, and .NET (C#). For each one, the installer configures the appropriate test runner and GitHub Actions template.
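Language auto-detection of this kind is commonly done by looking for well-known manifest files in the project root. The sketch below shows that general technique; it is not the bootstrap script's actual logic, and the marker table and default commands are assumptions.

```python
# (marker file, language, assumed default test command), checked in order
MARKERS = [
    ("package.json", "javascript", "npm test"),
    ("pyproject.toml", "python", "pytest"),
    ("requirements.txt", "python", "pytest"),
    ("go.mod", "go", "go test ./..."),
    ("Cargo.toml", "rust", "cargo test"),
    ("composer.json", "php", "vendor/bin/phpunit"),
    ("Gemfile", "ruby", "bundle exec rake test"),
    ("pom.xml", "java", "mvn test"),
]

def detect_stack(filenames):
    """Return (language, test_command) for the first marker found, else None."""
    names = set(filenames)
    for marker, language, command in MARKERS:
        if marker in names:
            return language, command
    # .NET projects are identified by extension rather than a fixed file name.
    if any(name.endswith(".csproj") for name in names):
        return "csharp", "dotnet test"
    return None
```

For example, `detect_stack(["README.md", "Cargo.toml"])` returns `("rust", "cargo test")`, and a directory with no recognized manifest returns `None` so the caller can ask the user instead of guessing.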
The 4 Files That Make It Work
After installation, you’ll have these files powering the loop:
| File | Purpose |
|---|---|
| AGENTS.md | Build, test, and dev commands for your AI agent |
| TESTING.md | What to test, what NOT to test, conventions |
| copilot-instructions.md | Anti-redundancy rules, colocated test pattern |
| .github/workflows/ctrl.yml | CI/CD pipeline that enforces the gates |
The key insight: your AI coding agent reads AGENTS.md and TESTING.md to know exactly how to write tests, run them, and fix failures — without asking you.
Why 83%?
Traditional unit testing often catches less than 30% of agent-specific bugs because it misses the ‘reasoning gaps’ between tool calls. By forcing agents through the CTRL pyramid, we close the execution loop and catch the edge cases that only appear during multi-step autonomous tasks.
You can find the framework, bootstrap scripts, and full documentation on GitHub at henrino3/ctrl.