CTRL: The 83% Bug Detection Framework for AI Agents

Why testing AI agents is fundamentally different from testing software, and how the CTRL framework delivers an 83% bug detection rate across any language.


A complex Foundation Vault style cosmic diagram representing the CTRL testing layers.

Testing AI agents is hard because agents are non-deterministic. Traditional software testing relies on ‘Input A’ always producing ‘Output B’. With agents, ‘Input A’ might produce ‘Output B’ on Monday, ‘Output C’ on Tuesday, and a hallucinated error message on Wednesday.

To solve this, we built the CTRL framework (Close The Running Loop). In our internal benchmarks, this methodology achieves an 83% bug detection rate by shifting the focus from output matching to behavior verification.

The 4-Layer Testing Pyramid

The core of CTRL is a four-layer pyramid that balances speed, cost, and coverage.

1. Unit Layer (Pure Logic)

Speed: < 1s | Frequency: Every commit

This layer tests the deterministic parts of your agent: input validation, prompt template assembly, and tool-call parsing. If your agent can’t correctly parse a JSON response, it doesn’t matter how smart the model is.
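As an illustration, a unit test for tool-call parsing might look like this. The `parse_tool_call` helper is hypothetical (not part of CTRL); the point is that the parsing logic is pure and deterministic, so it can run on every commit:

```python
import json

def parse_tool_call(raw: str) -> dict:
    """Validate and parse a model's tool-call JSON (hypothetical helper)."""
    payload = json.loads(raw)
    for key in ("name", "arguments"):
        if key not in payload:
            raise ValueError(f"tool call missing required key: {key!r}")
    return payload

def test_valid_tool_call():
    call = parse_tool_call('{"name": "search", "arguments": {"query": "ctrl"}}')
    assert call["name"] == "search"

def test_rejects_incomplete_tool_call():
    # A tool call without "arguments" must fail loudly, not silently
    try:
        parse_tool_call('{"name": "search"}')
        assert False, "expected ValueError"
    except ValueError:
        pass
```

No model, no network, no tokens — which is what keeps this layer under a second.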

2. E2E Layer (System Integration)

Speed: 1-5 minutes | Frequency: Before every push

Here, we test the agent’s interaction with internal systems: API routes, database state, and authentication flows. We use mock LLM responses to ensure the system architecture is sound without burning tokens.
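One way to mock the model at this layer is a scripted stand-in for the LLM client. Everything here (`ScriptedLLM`, `run_agent`, the dict-as-database) is illustrative, not CTRL's actual API — the idea is that the E2E test verifies the wiring around the model, not the model itself:

```python
class ScriptedLLM:
    """Mock LLM that replays canned responses, so E2E tests burn no tokens."""
    def __init__(self, responses):
        self._responses = iter(responses)

    def complete(self, prompt: str) -> str:
        return next(self._responses)

def run_agent(llm, db: dict, task: str) -> str:
    """Tiny illustrative agent loop: ask the model, persist the answer."""
    answer = llm.complete(f"Task: {task}")
    db[task] = answer  # stand-in for a real database write
    return answer

# E2E-style check: the system plumbing holds regardless of model quality
db = {}
llm = ScriptedLLM(['{"status": "done"}'])
result = run_agent(llm, db, "sync-invoices")
assert db["sync-invoices"] == result
```

Because the responses are scripted, a failure here always points at your routes, schema, or auth — never at model variance.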

3. Live Layer (Real-World Integration)

Speed: 5-30 minutes | Frequency: Pre-release

This is where it gets real. We run the agent against live instances of third-party APIs like Stripe or OpenAI. We verify that the agent can handle real network latency, rate limits, and evolving API schemas.
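Rate limits are the failure mode live tests hit most often. A minimal sketch of exponential backoff — `RateLimitError` and the injectable `sleep` are assumptions for illustration, not part of CTRL:

```python
import time

class RateLimitError(Exception):
    """Raised when a third-party API returns HTTP 429 (illustrative)."""

def call_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry fn() with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Injecting `sleep` lets a pre-release test verify the retry schedule instantly, while production uses real delays.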

4. Docker Layer (Cold-Start Deployment)

Speed: 5-10 minutes | Frequency: CI/CD pipeline

The final boss. We spin up the entire application in a fresh Docker container to ensure there are no “it works on my machine” issues. This layer tests environment variables, dependency conflicts, and deployment configuration.
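For a Python service, the cold-start image might be as minimal as the following sketch — the paths, pinned requirements file, and entrypoint are placeholders for your own project:

```dockerfile
# Fresh base image: no host-machine state can leak in
FROM python:3.12-slim
WORKDIR /app

# Install pinned dependencies first to maximize layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application; runtime config arrives via environment variables
COPY . .
ENV PYTHONUNBUFFERED=1
CMD ["python", "-m", "app"]
```

If this builds and boots from scratch in CI, a missing dependency or unset environment variable surfaces here instead of in production.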

Get Started in 30 Seconds

CTRL works with any language. Pick your install method and you’re up in one command.

Option 1: Universal Installer (Recommended)

Run this in the root of any project — it auto-detects your language (JS/TS, Python, Go, Rust, PHP, Ruby, Java, C#) and sets up everything:

curl -fsSL https://raw.githubusercontent.com/henrino3/ctrl/master/scripts/ctrl-bootstrap.sh | bash

Option 2: npx (Node.js projects)

If you’re in a Node.js project, this is the fastest path:

npx close-the-loop init

Both methods will:

  1. Detect your project’s language and test framework
  2. Create an AGENTS.md with the correct test commands for your stack (e.g., npm run test:unit, pytest, cargo test)
  3. Set up a .github/workflows/ctrl.yml for CI/CD
  4. Add a .cursorrules / .clauderc file so your AI coding agent knows the rules
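The generated workflow roughly follows this shape — a sketch for a Node.js project, not the exact file the installer emits:

```yaml
# .github/workflows/ctrl.yml — enforce the fast gate on every push
name: ctrl
on: [push, pull_request]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run ctrl:gate   # build + unit + e2e
```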

Once installed, your agent gets two gates:

# Fast gate — run before every push
npm run ctrl:gate    # build + unit + e2e

# Full gate — run before release
npm run ctrl:full    # gate + live + docker

For Python projects, the equivalent commands are generated automatically (pytest, tox, etc.).

Multi-Language Support

The bootstrap system supports a massive range of engineering stacks. Whether you are building in Python, JavaScript, Go, Rust, PHP, Ruby, Java, or .NET, the installer configures the appropriate test runners and GitHub Actions templates for your specific language.

The 4 Files That Make It Work

After installation, you’ll have these files powering the loop:

| File | Purpose |
| --- | --- |
| AGENTS.md | Build, test, and dev commands for your AI agent |
| TESTING.md | What to test, what NOT to test, conventions |
| copilot-instructions.md | Anti-redundancy rules, colocated test pattern |
| .github/workflows/ctrl.yml | CI/CD pipeline that enforces the gates |

The key insight: your AI coding agent reads AGENTS.md and TESTING.md to know exactly how to write tests, run them, and fix failures — without asking you.
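A minimal AGENTS.md along these lines — the contents below are illustrative, not the installer's exact output:

```markdown
# AGENTS.md

## Commands
- Build: `npm run build`
- Unit tests: `npm run test:unit`
- Fast gate (pre-push): `npm run ctrl:gate`
- Full gate (pre-release): `npm run ctrl:full`

## Conventions
- Colocate tests next to the source files they cover
- Extend an existing test file before creating a new one
```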

Why 83%?

Traditional unit testing often catches less than 30% of agent-specific bugs because it misses the ‘reasoning gaps’ between tool calls. By forcing agents through the CTRL pyramid, we close the execution loop and catch the edge cases that only appear during multi-step autonomous tasks.

You can find the framework, bootstrap scripts, and full documentation on GitHub at henrino3/ctrl.
