Prompt Rules Are Not Runtime Policy
If an agent must reliably offload long SSH, deploy, or infra-debug work out of the main user thread, prompt text is not enough. The boundary has to live in deterministic runtime code.
We keep pretending prompt text is a control surface.
It is not.
It is preference. Sometimes strong preference. Sometimes very detailed preference. Still preference.
If a behavior actually matters, especially around execution boundaries, the policy has to live in code outside the model that already proved it can ignore the rule.
That is the thesis.
It came from a very ordinary failure. A system had instructions saying long SSH work, deploys, and multi-host debugging should be offloaded out of the main user thread into delegated workers. Then the assistant stayed in the main thread and did the whole thing there anyway.
That is not a prompt bug. That is a runtime-policy bug.
The failure people keep mislabeling
When this happens, teams usually say some version of:
- the model drifted
- the prompt needs tightening
- we need a better checklist
- we should restate the rule more clearly
- maybe a preflight reminder before tool use will fix it
No.
Those things may improve compliance a bit. They do not create enforcement.
The same model context that ignored the first instruction can also ignore the second instruction reminding it not to ignore the first instruction. If everything lives in the same prompt bundle, you did not build a boundary. You built a longer memo.
That matters because the failure mode is not cosmetic.
Long, noisy, multi-host work in the main thread:
- degrades the user experience
- destroys the distinction between conversation and execution
- weakens recoverability
- makes handoff uglier
- turns the user thread into a terminal transcript with opinions
Then teams act surprised when the product feels less like an operator and more like a log file with confidence issues.
Prompt rules are advisory, not deterministic
This should not be controversial, but the industry keeps trying to negotiate with physics.
A model is optimizing across many goals at once:
- satisfy the user
- stay helpful
- use available tools
- follow style rules
- honor policy language
- stay coherent across a giant stack of context
Inside that pile, a rule like “offload long-running infra work to sub-agents” is just one more instruction competing with the immediate temptation to keep going.
Sometimes the model follows it. Sometimes it does not. Sometimes it half-follows it and then rationalizes an exception because the direct path looks faster.
That is normal model behavior.
If the consequence of ignoring the rule is that the model still gets to run the heavy command in the main thread, then the rule is not policy.
It is etiquette.
The preflight trap
One of the most common fake fixes is adding preflight text.
Put a stronger rule earlier. Repeat it in bold. Add a checklist right before tool calls. Tell the model to pause and think before long execution.
Useful? Maybe. Enough? Absolutely not.
Why?
Because preflight text in the same context is still text. It does not intercept execution. It does not classify the workload. It does not deny the tool call. It does not automatically spawn the delegated worker. It does not force the boundary.
It is a sign on the wall. Not a lock on the door.
The litmus test is simple:
If the model ignores the preflight reminder, what happens next?
If the answer is “it still runs the command in the main thread,” then congratulations, you added more prose to a system whose problem was not insufficient prose.
Observability is not control
This confusion shows up everywhere.
A lot of stacks have hooks, callbacks, webhooks, or event listeners. Great. Those are useful for logging, notifications, audits, and side effects.
They are not automatically execution-policy hooks.
If a system can tell you after the fact that the wrong thing happened, that is observability. Not control.
If it can watch a tool call start and later report “this should have been delegated,” you learned the truth too late.
What actually matters is a pre-execution policy point that can inspect the action and decide:
- allow in main thread
- require delegation
- require confirmation
- deny
That is the real boundary.
Everything else is commentary.
The fix has to live outside the failing layer
This is the uncomfortable part people try to dodge.
The model already demonstrated that it will sometimes do the wrong thing while trying to be helpful. So do not hide the fix inside the thing that failed.
If a behavior must hold reliably, the policy has to live outside the prompt stack that already drifted.
We already understand this principle in every other part of software.
We do not enforce database permissions with friendly comments in application code. We do not enforce network isolation with a note in a README. We do not enforce deploy safety by asking production to please be careful.
We put the boundary in deterministic runtime code.
Agent systems need the same maturity.
What a real runtime-policy layer should do
A serious implementation would inspect execution before the heavy work starts.
It would classify whether the action looks like:
- long SSH work
- deploy or restart flow
- multi-host debugging
- log tailing and remediation loops
- cross-machine operations that belong in a worker
- terminal-heavy tasks likely to run beyond the main-thread comfort budget
Then it would apply thread-aware policy.
If the current context is the user-facing main thread, the runtime should default heavy execution to delegation.
Not ask the model nicely. Do it.
Spawn the worker. Pass the task. Attach the context. Keep the main thread at the level of intent, plan, progress, and results.
That is the product behavior people actually wanted all along.
Override should be explicit, not accidental
There will always be exceptions. Sometimes the main thread really is the right place. Fine.
But that path should be explicit. It should carry a reason. It should be observable later.
The dangerous version is silent drift, where the system violates the intended architecture simply because the model got enthusiastic and nobody built a real fence.
That is not flexibility. That is laziness with better branding.
Why this matters more than teams think
This is not just about cleaner transcripts.
Execution boundaries shape:
- user experience
- recoverability
- auditability
- task isolation
- approval clarity
- interruption cost
- whether the system feels governable under stress
If the main thread keeps collapsing into raw execution sludge, the product contract is fake. The distinction between “conversation” and “delegated work” exists only in marketing copy and stern prompt paragraphs.
Users notice that. Operators definitely notice that.
And once they notice it, trust drops fast.
The broader category lesson
A lot of agent products are still etiquette classes wrapped around power tools.
The assistant is told how it ought to behave. The docs explain the preferred pattern. The prompts include lots of moral seriousness. Then a stressful task shows up and the system cuts through the soft rule because there was never a hard boundary in the first place.
That is why I keep coming back to the same split:
- prompt text expresses preference
- runtime policy expresses law
If the behavior matters when the system is rushed, tired, overloaded, or too eager to please, then preference is not enough.
You need law.
That is not a prompt-engineering take. It is a product-engineering take.
If your agent must reliably offload heavy work out of the main user thread, prompt rules are not runtime policy.
Deterministic runtime policy is runtime policy.
Everything else is just asking the model to remember to be good while holding the keys to the factory.