Why Most AI Agents Are Just Scripted Workflows

The word “agent” has been strip-mined of all meaning.

Every startup has agents now. Every framework is an “agentic framework.” Every chatbot that calls a function is an “autonomous agent.”

Most of them are if-else chains with an LLM in the middle. Let’s be honest about that.

What People Mean by “Agent”

When the industry says “AI agent,” it usually means one of three things:

Level 0: Tool Use
  LLM decides which function to call based on user input.
  "Call the weather API" → calls weather API.
  This is function calling. It's useful. It's not an agent.

Level 1: Orchestrated Workflow
  A framework chains multiple LLM calls together.
  Step 1: Research → Step 2: Draft → Step 3: Review.
  The steps are predefined. The LLM fills in the blanks.
  This is a pipeline with natural language glue.

Level 2: Autonomous Reasoning
  The system decides what to do next based on its own
  observations and reasoning. No predefined workflow.
  It can fail, recover, change strategy, ask for help.
  Almost nobody is building this. Almost nobody can.

99% of what’s shipped today is Level 0 or Level 1. And that’s fine — but let’s stop calling it autonomy.
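The difference between the levels is mostly about who owns the control flow. Level 0 fits in a dozen lines. Here's a minimal sketch with a stubbed tool and a hand-written stand-in for the model's function-call output (`get_weather` and the dict shape are illustrative, not any particular vendor's API):

```python
# Level 0 in miniature: a dispatch table plus the model's one-shot
# choice of which tool to invoke. The tool itself is a stub.

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real API call

TOOLS = {"get_weather": get_weather}

def run_level_0(model_choice: dict) -> str:
    # model_choice stands in for the LLM's function-call output,
    # e.g. {"name": "get_weather", "arguments": {"city": "Oslo"}}
    tool = TOOLS[model_choice["name"]]
    return tool(**model_choice["arguments"])

run_level_0({"name": "get_weather", "arguments": {"city": "Oslo"}})
# → "Sunny in Oslo"
```

The model picks a name and some arguments; everything else is ordinary dispatch. Useful, but nothing here decides *what to do next*.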

The Multi-Agent Illusion

Here’s a typical “multi-agent system”:

from crewai import Agent, Task, Crew

researcher = Agent(
    role="Research Analyst",
    goal="Find relevant information",
    backstory="You are an expert researcher...",  # This is a system prompt
)

writer = Agent(
    role="Technical Writer",
    goal="Write clear documentation",
    backstory="You are a skilled writer...",  # This is also a system prompt
)

# This is not two agents collaborating.
# This is two system prompts taking turns.
crew = Crew(agents=[researcher, writer], tasks=[...])

I’ve built systems like this. They work for structured tasks. But let’s be clear about what’s happening:

  1. Agent A runs with System Prompt A
  2. Output gets passed to Agent B with System Prompt B
  3. Maybe there’s a loop

That’s prompt chaining with role-play. The “agents” don’t have memory across conversations (unless you manually build it). They don’t negotiate. They don’t disagree productively. They don’t decide to involve another agent — the framework does.
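Stripped of framework branding, the flow above reduces to something like this (`call_llm` is a hypothetical stand-in for any chat-completion call; the prompts are abbreviated):

```python
# "Two agents collaborating," unrolled: two system prompts taking turns.

def call_llm(system_prompt: str, user_input: str) -> str:
    # Stub: a real implementation would call a model here.
    return f"[{system_prompt.split(',')[0]}] processed: {user_input}"

def run_crew(task: str) -> str:
    research = call_llm("You are an expert researcher, ...", task)
    draft = call_llm("You are a skilled writer, ...", research)
    return draft  # the "collaboration" is a fixed A -> B pipe
```

No negotiation, no memory, no decision to involve anyone else. The sequence `research -> draft` is hardcoded; only the text inside each step varies.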

Where Agents Actually Fail

I’ve seen agents fail in predictable ways:

1. The Infinite Loop

Agent: "I need to search for more information."
Agent: *searches, finds same results*
Agent: "I need to search for more information."
Agent: *searches, finds same results*
Agent: "I need to search for more information."
...
# $47 in API costs later

Real autonomy requires knowing when to stop. LLMs are famously bad at this. They’re trained to always produce output. “I don’t know, let me try a different approach” is not a natural completion.
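One cheap, non-LLM defense is to make the harness, not the model, decide when to stop: track (action, observation) pairs and bail on an exact repeat. A sketch, where `agent_step` is a hypothetical callable that runs one reasoning step and reports what happened:

```python
# Guard against the infinite loop: refuse to repeat an action that
# already produced the exact same observation.

def run_with_loop_guard(agent_step, max_steps: int = 20):
    seen = set()
    for _ in range(max_steps):
        action, observation, done = agent_step()
        if done:
            return observation
        key = (action, observation)
        if key in seen:
            # Same action, same result: trying again won't help.
            return f"STUCK: repeated '{action}' with identical results"
        seen.add(key)
    return "STUCK: step budget exhausted"
```

This catches only exact repeats; near-duplicate searches need fuzzier matching. But even this crude version caps the $47 failure mode at two search calls.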

2. The Confidence Problem

Agent: "Based on my research, the answer is X."
# The answer is not X.
# The agent has no way to know it's wrong.
# There's no feedback signal until a human checks.

Autonomous systems need to calibrate uncertainty. Current LLMs express confidence through linguistic hedging (“I think,” “it’s likely”), not through actual probabilistic reasoning. Saying “I’m fairly confident” doesn’t mean anything if you’re fairly confident about everything.
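Sampling-based agreement is one crude but common proxy: ask the same question several times and treat disagreement as a warning sign. A sketch (`sample_answer` is a hypothetical closure around a model call):

```python
from collections import Counter

# A crude substitute for calibrated confidence: sample the model
# several times and measure how often the answers agree.

def agreement_confidence(sample_answer, n: int = 5):
    answers = [sample_answer() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n  # e.g. 4 of 5 samples agree -> 0.8
```

High agreement is not proof of correctness; a model can be consistently wrong. But low agreement is a genuinely useful "don't trust me" signal, which is more than linguistic hedging provides.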

3. The Tool Misuse

# Agent has access to: [search, calculator, code_executor]

# User asks: "What's the square root of 144?"
# Agent: *calls search API for "square root of 144"*
# Then: *calls calculator with the search result*
# Then: *calls code_executor to verify*

# Three tool calls for something it could answer directly.
# This is not intelligence. This is cargo-culting tool use.

What Real Autonomy Would Require

If we’re serious about building autonomous agents, here’s the minimum:

Persistent Memory

Not “conversation history.” Actual long-term memory that the agent maintains, updates, and queries independently.

# Not this:
messages = [{"role": "user", "content": "..."}, ...]

# This:
class AgentMemory:
    def __init__(self):
        self.working_memory = {}      # Current task context
        self.episodic_memory = []     # Past experiences and outcomes
        self.semantic_memory = {}     # Learned facts and relationships

    def should_i_try_this_approach(self, task) -> bool:
        # Check: have I tried this before? Did it work?
        # (Exact-match lookup; a real system would use embedding similarity.)
        similar = [e for e in self.episodic_memory if e["task"] == task]
        return not any(e["failed"] for e in similar)

Self-Evaluation

The agent needs to judge its own outputs before presenting them. Not “let me check my answer” as a prompt hack — actual evaluation with external grounding.
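For domains where outputs are checkable, grounding can be literal execution: when the agent writes code, run it against known cases instead of asking the model to grade itself. A sketch, assuming the generated code defines a function named `solution` (that name, and the unsandboxed `exec`, are simplifications):

```python
# External grounding beats "let me double-check" prompts: when the
# output is code, run it against cases the model can't talk its way past.

def evaluate_generated_code(code: str, checks: list) -> bool:
    namespace = {}
    try:
        exec(code, namespace)        # in production: sandbox this
        fn = namespace["solution"]   # assumed entry-point name
        return all(fn(*args) == expected for args, expected in checks)
    except Exception:
        return False
```

The agent presents an answer only after it survives a check that lives outside the model. For non-code outputs the grounding is harder to find, which is exactly why self-evaluation is rarely implemented properly.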

Graceful Failure

When an agent gets stuck, it should:

  1. Recognize it’s stuck (not loop forever)
  2. Identify why (not just retry the same thing)
  3. Either try a different approach or ask for help

That third one is crucial and almost never implemented. The best agents should know when to say “I can’t do this, here’s what I tried, here’s where I’m stuck.”
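Those three steps translate directly into control flow. A sketch, where `try_approach` is a hypothetical callable that returns a result or `None`, and each retry must use a *different* approach:

```python
# Graceful failure as control flow: vary the approach, cap the attempts,
# and escalate with a record of what was tried.

def solve_or_escalate(task, approaches, try_approach, max_attempts=3):
    tried = []
    for approach in approaches[:max_attempts]:
        result = try_approach(task, approach)   # step 2: vary, don't just retry
        tried.append(approach)
        if result is not None:
            return {"status": "done", "result": result}
    return {                                    # step 3: ask for help
        "status": "escalate",
        "message": "I can't do this. Here's what I tried.",
        "tried": tried,
    }
```

The escalation payload is the point: "here's what I tried, here's where I'm stuck" is only possible if the harness keeps that record, because the model won't volunteer it.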

So What Should We Build?

Stop chasing “full autonomy.” Build reliable semi-autonomous systems:

# Honest agent architecture
agent:
  capabilities:
    - search_documents
    - summarize_text
    - draft_response
  constraints:
    - max_tool_calls: 10
    - max_cost: $0.50
    - require_human_approval: true
    # when: confidence < 0.8
    # ^ This would be great if we could actually measure it
  fallback: "escalate_to_human"

The Uncomfortable Truth

The gap between “AI agent” marketing and “AI agent” reality is enormous. We’re selling autonomy and shipping automation.

Automation is valuable. Extremely valuable. But it’s not the same thing.