The retry problem: what to do when your AI agent fails

3am, five identical failures

I need to tell you about the night that changed how I think about retries.

It was October 2024. I had nightloop.sh (the bash script that eventually became Zowl) running a pipeline: generate code, run tests, fix if broken. Simple, right? I went to sleep. Woke up at 6am and checked the logs.

The agent had failed at the test step. So the script retried. Same failure. Retried again. Same failure. Five times. Five identical attempts. Five identical stack traces. It burned through tokens doing the exact same wrong thing over and over, like a person pushing a pull door harder each time.

That morning I realized: simple retry is broken for AI agents.

Why "just retry" doesn't work

With a network request, retrying makes sense. The server might have been temporarily down. The timeout might have been a fluke. The conditions change between attempts.

With an AI agent, the conditions usually don't change. If the agent wrote broken code and validation caught it, retrying validation won't fix the code. The code is still broken. You're just asking the validator to check the same broken thing again.

Pipeline: implement → validate → test

What happens with simple retry:

  implement ✓ (writes code)
  validate ✗ (code has type error)
  validate ✗ (same type error)
  validate ✗ (same type error)
  ❌ FAILED after 3 retries

What SHOULD happen:

  implement ✓ (writes code)
  validate ✗ (code has type error)
  implement ✓ (rewrites code WITH error context)
  validate ✓
  test ✓
  ✅ DONE

The difference is direction. Simple retry goes forward into the same wall. Smart retry goes backward to the step that can actually fix the problem.

Go-to-step: the fix

The concept is simple. When a step fails, instead of retrying that step, you jump back to an earlier step and pass the error as context.

If validate fails, go back to implement with the validation error attached. Now the agent isn't just retrying blindly. It knows what went wrong and can write a fix. The validation error becomes input for the next implementation attempt.

This is what "go to step" means:

Step: validate
On failure:
  retry: 0 times (don't bother, same input = same output)
  then: go to step "implement" with error context
  max loops: 3
  final fallback: skip task, flag for review

The retry: 0 part is intentional for validation. There's no point retrying a pure check. Either the code passes or it doesn't. But for an implement step, you might want 1-2 retries because the agent's output is non-deterministic. Sometimes the second attempt produces different (better) code even without new context.

Sometimes step 2 isn't far enough back

Here's where it gets interesting.

I had a pipeline with four steps: plan, implement, validate, test. The agent planned a React component, implemented it, validation passed (no type errors), but the test failed because the component didn't match the design spec. Go back to implement, rewrite, validate passes again, test fails again. Three loops.

The problem wasn't implementation. The problem was the plan. The agent's plan missed a requirement, so every implementation based on that plan was doomed.

plan ✓ (misses a requirement)
implement ✓ (based on flawed plan)
validate ✓ (types are fine)
test ✗ (doesn't match spec)
implement ✓ (rewrites, still based on flawed plan)
validate ✓
test ✗ (still doesn't match spec)
implement ✓ (third attempt, same flawed plan)
validate ✓
test ✗
🔄 implement loop exhausted → go back to PLAN
plan ✓ (now catches the requirement)
implement ✓
validate ✓
test ✓
✅ DONE

You need nested fallbacks. Step fails, go back one. That fails N times, go back further. It's not complicated logic, but you have to configure it per step because every pipeline is different.

My actual configuration strategy

After running hundreds of pipelines, here's what I've settled on:

Pre-check / plan steps: Retry 1-2 times (agent might produce a better plan on retry), then fail the whole task. If planning fails, there's no point going forward.

Implementation steps: Retry 2-3 times. The non-determinism of LLMs works in your favor here. Different attempts genuinely produce different code. If all retries fail, go back to plan with accumulated errors.

Validation steps: Retry 0 times. Go directly back to implement with the error. Retrying a deterministic check is wasting tokens.

Test steps: Retry 1 time (flaky tests exist, even for agents), then go back to implement with test output, max 3 loops, then go back to plan.

step: pre_check
  on_fail: retry 2, then FAIL_TASK

step: implement
  on_fail: retry 3, then GOTO pre_check (max 2)

step: validate
  on_fail: retry 0, then GOTO implement

step: test
  on_fail: retry 1, then GOTO implement (max 3), then GOTO pre_check

"Skip and come back later" is underrated

Unpopular opinion: sometimes the best retry strategy is to give up and move on.

Not permanently. Temporarily. If a task has burned through all its retry loops and still can't pass tests, parking it and moving to the next task is often smarter than throwing more tokens at it. Why?

Because other tasks in the pipeline might change the codebase in ways that make the stuck task easier. Maybe Task 12 can't pass because it depends on a utility function that Task 15 creates. The dependency wasn't explicit, but it's real. If you let the pipeline skip Task 12, complete Task 15, then circle back, Task 12 might pass on the first try.

I learned this by accident. A task kept failing because it imported a module that didn't exist yet. Another task further down the pipeline created that module. The dependency wasn't in my task graph (my fault), but the skip-and-retry strategy accidentally solved it.

Now I configure every pipeline with a "skip budget." A task can be skipped up to 2 times before it's flagged as truly stuck. Most skipped tasks resolve themselves on the second pass.

The token cost argument

Someone's going to say "all these retries and go-backs waste tokens." Sure. Retrying an implement step costs tokens. But you know what costs more tokens? A human waking up at 6am, reading the failure log, manually re-running the pipeline, and waiting another hour. Or worse, the agent sitting there stuck at step 3 of 40 while 37 other tasks could have been running.

Bad retry logic wastes tokens on one task. Good retry logic, even with some wasted attempts, unblocks the pipeline. The pipeline throughput matters more than per-task efficiency.

What I'd do differently in nightloop.sh

Looking back at that bash script, the retry logic was literally:

for i in 1 2 3; do
  run_step "$step" && break
done

No error forwarding. No go-to-step. No skip logic. Just "try again and hope." It worked maybe 60% of the time. The other 40% produced those 3am failure spirals.

Zowl's retry system exists because of those failure spirals. Every configurable option (retry count per step, go-to-step targets, skip budgets, error context forwarding) came from a real pipeline failure I hit while running nightloop.sh overnight. I didn't design it from theory. I watched my pipelines fail in specific ways and built the fix for each one.

The retry problem isn't really about retries. It's about giving the agent a second chance that's actually different from the first one. This is why intelligent failure routing and testing as part of your pipeline matter so much. And it's why Zowl bakes these retry patterns into every pipeline by default.