The word “agent” has been strip-mined of all meaning.
Every startup has agents now. Every framework is an “agentic framework.” Every chatbot that calls a function is an “autonomous agent.”
Most of them are if-else chains with an LLM in the middle. Let’s be honest about that.
What People Mean by “Agent”
When the industry says “AI agent,” they usually mean one of three things:
Level 0: Tool Use
LLM decides which function to call based on user input.
"Call the weather API" → calls weather API.
This is function calling. It's useful. It's not an agent.
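To make the distinction concrete, here is roughly what Level 0 looks like stripped of framework branding. Everything here is illustrative: the tool, the dispatch table, and `fake_llm_choose_tool` (a stand-in for a real function-calling model) are all made up for the sketch.

```python
# Level 0 "tool use": the model picks a function name and arguments,
# the host code does the actual call. The "agent" is a dispatch table.

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub for a real weather API


TOOLS = {"get_weather": get_weather}


def fake_llm_choose_tool(user_input: str) -> tuple[str, dict]:
    # Stand-in for a real function-calling model's structured response.
    if "weather" in user_input.lower():
        return "get_weather", {"city": "Paris"}
    return "", {}


def handle(user_input: str) -> str:
    name, args = fake_llm_choose_tool(user_input)
    if name in TOOLS:
        return TOOLS[name](**args)  # host code executes; the model only chose
    return "no tool needed"
```

The model never executes anything. It emits a name; deterministic code does the rest.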
Level 1: Orchestrated Workflow
A framework chains multiple LLM calls together.
Step 1: Research → Step 2: Draft → Step 3: Review.
The steps are predefined. The LLM fills in the blanks.
This is a pipeline with natural language glue.
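Level 1, reduced to its skeleton, is three function calls in a fixed order. The `llm()` stub below stands in for a real API call; the point is that the step sequence is hard-coded regardless of what the model returns.

```python
# A Level 1 "agentic workflow" is a fixed pipeline; the model fills in
# each step. llm() is a stub for a real LLM API call.

def llm(prompt: str) -> str:
    return f"[output for: {prompt[:30]}]"  # fake completion


def run_pipeline(topic: str) -> str:
    research = llm(f"Research: {topic}")
    draft = llm(f"Draft using: {research}")
    review = llm(f"Review and fix: {draft}")
    return review  # the step order never changes, no matter what the model says
```

No step can be skipped, reordered, or added at runtime. That is the difference between a workflow and autonomy.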
Level 2: Autonomous Reasoning
The system decides what to do next based on its own
observations and reasoning. No predefined workflow.
It can fail, recover, change strategy, ask for help.
Almost nobody is building this. Almost nobody can.
99% of what’s shipped today is Level 0 or Level 1. And that’s fine — but let’s stop calling it autonomy.
The Multi-Agent Illusion
Here’s a typical “multi-agent system”:
```python
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Research Analyst",
    goal="Find relevant information",
    backstory="You are an expert researcher...",  # This is a system prompt
)

writer = Agent(
    role="Technical Writer",
    goal="Write clear documentation",
    backstory="You are a skilled writer...",  # This is also a system prompt
)

# This is not two agents collaborating.
# This is two system prompts taking turns.
crew = Crew(agents=[researcher, writer], tasks=[...])
```
I’ve built systems like this. They work for structured tasks. But let’s be clear about what’s happening:
- Agent A runs with System Prompt A
- Output gets passed to Agent B with System Prompt B
- Maybe there’s a loop
That’s prompt chaining with role-play. The “agents” don’t have memory across conversations (unless you manually build it). They don’t negotiate. They don’t disagree productively. They don’t decide to involve another agent — the framework does.
Where Agents Actually Fail
I’ve seen agents fail in predictable ways:
1. The Infinite Loop
```
Agent: "I need to search for more information."
Agent: *searches, finds same results*
Agent: "I need to search for more information."
Agent: *searches, finds same results*
Agent: "I need to search for more information."
...
# $47 in API costs later
```
Real autonomy requires knowing when to stop. LLMs are famously bad at this. They’re trained to always produce output. “I don’t know, let me try a different approach” is not a natural completion.
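One partial mitigation is to put the stop condition outside the model. Here is a minimal sketch of a repeat-action guard; it is not a feature of any particular framework, and "action" is just the stringified tool call.

```python
# A guard that refuses to let the agent repeat the same action forever.
# The loop budget lives in host code, not in the model's judgment.

class LoopGuard:
    def __init__(self, max_repeats: int = 2):
        self.counts: dict[str, int] = {}
        self.max_repeats = max_repeats

    def allow(self, action: str) -> bool:
        # Count identical actions; block once the budget is spent.
        self.counts[action] = self.counts.get(action, 0) + 1
        return self.counts[action] <= self.max_repeats


guard = LoopGuard()
first = guard.allow("search('more information')")
second = guard.allow("search('more information')")
third = guard.allow("search('more information')")  # blocked
```

This doesn't make the agent smarter; it caps the cost of the agent being dumb, which in practice is what you need first.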
2. The Confidence Problem
```
Agent: "Based on my research, the answer is X."
# The answer is not X.
# The agent has no way to know it's wrong.
# There's no feedback signal until a human checks.
```
Autonomous systems need to calibrate uncertainty. Current LLMs express confidence through linguistic hedging (“I think,” “it’s likely”), not through actual probabilistic reasoning. Saying “I’m fairly confident” doesn’t mean anything if you’re fairly confident about everything.
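One cheap proxy that is at least measurable is self-consistency: sample the model several times at nonzero temperature and use the agreement rate instead of the model's verbal hedging. The sketch below stubs the model with a noisy random choice; the agreement rate is a crude score, not a calibrated probability.

```python
import random
from collections import Counter

random.seed(0)  # deterministic for the sketch


def sample_answer(question: str) -> str:
    # Stub for repeated, temperature > 0 LLM calls on the same question.
    return random.choice(["X", "X", "X", "Y"])


def consistency_confidence(question: str, n: int = 20) -> tuple[str, float]:
    votes = Counter(sample_answer(question) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n  # agreement rate as a crude confidence score


answer, confidence = consistency_confidence("Is the answer X?")
```

An agreement rate of 0.55 and one of 0.95 at least mean different things, which is more than "I'm fairly confident" can claim.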
3. The Tool Misuse
```
# Agent has access to: [search, calculator, code_executor]
# User asks: "What's the square root of 144?"
# Agent: *calls search API for "square root of 144"*
# Then: *calls calculator with the search result*
# Then: *calls code_executor to verify*

# Three tool calls for something it could answer directly.
# This is not intelligence. This is cargo-culting tool use.
```
What Real Autonomy Would Require
If we’re serious about building autonomous agents, here’s the minimum:
Persistent Memory
Not “conversation history.” Actual long-term memory that the agent maintains, updates, and queries independently.
```python
# Not this:
messages = [{"role": "user", "content": "..."}, ...]

# This:
class AgentMemory:
    working_memory: dict   # Current task context
    episodic_memory: list  # Past experiences and outcomes
    semantic_memory: dict  # Learned facts and relationships

    def should_i_try_this_approach(self, task) -> bool:
        # Check: have I tried this before? Did it work?
        similar = [e for e in self.episodic_memory if e.matches(task)]
        return not any(e.failed for e in similar)
```
Self-Evaluation
The agent needs to judge its own outputs before presenting them. Not “let me check my answer” as a prompt hack — actual evaluation with external grounding.
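"External grounding" means checking against something other than the model's own opinion. For generated code, one concrete check is executing the candidate on known cases before presenting it. A minimal sketch; the function name `double` and the test cases are made up for the example.

```python
def passes_checks(candidate_src: str, cases: list[tuple[int, int]]) -> bool:
    # Ground the evaluation in execution, not in asking the model "are you sure?"
    ns: dict = {}
    try:
        exec(candidate_src, ns)  # run the candidate definition
        return all(ns["double"](x) == y for x, y in cases)
    except Exception:
        return False  # a crashing candidate fails evaluation


good = "def double(x):\n    return x * 2"
bad = "def double(x):\n    return x + 2"
```

Note that `bad` even passes the case `(2, 4)`; grounding only works if the checks themselves are adequate, which is its own hard problem.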
Graceful Failure
When an agent gets stuck, it should:
- Recognize it’s stuck (not loop forever)
- Identify why (not just retry the same thing)
- Either try a different approach or ask for help
That third one is crucial and almost never implemented. The best agents should know when to say “I can’t do this, here’s what I tried, here’s where I’m stuck.”
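The recognize/diagnose/escalate loop can be sketched in a few lines: try each distinct strategy once, record what was tried, and surface the trace instead of retrying forever. Strategy names here are illustrative assumptions.

```python
def solve(task: str, strategies: list) -> dict:
    tried = []
    for strategy in strategies:
        result = strategy(task)
        if result is not None:
            return {"status": "ok", "result": result, "tried": tried}
        tried.append(strategy.__name__)  # remember what already failed
    # Graceful failure: "I can't do this, here's what I tried."
    return {"status": "escalate", "tried": tried, "task": task}


def search_docs(task):    # always fails in this sketch
    return None


def ask_code_tool(task):  # also fails
    return None
```

The escalation payload is the point: the human who picks up the task gets the attempt history, not a blank slate.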
So What Should We Build?
Stop chasing “full autonomy.” Build reliable semi-autonomous systems:
- Clear boundaries — Define exactly what the agent can and cannot do. No open-ended “figure it out.”
- Human-in-the-loop — For high-stakes decisions, the agent proposes, the human approves.
- Explicit workflows with flexible execution — The steps are defined, but the agent has freedom in how it executes each step.
- Observable behavior — Log every decision, every tool call, every reasoning step. If you can’t debug it, you can’t trust it.
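The last point is the cheapest to implement and the most often skipped. A minimal structured trace: every decision and tool call becomes one JSON event. The event schema below is an assumption, not any standard.

```python
import json
import time


class Trace:
    def __init__(self):
        self.events: list[dict] = []

    def log(self, kind: str, **data) -> None:
        # One structured event per decision, tool call, or reasoning step.
        self.events.append({"t": time.time(), "kind": kind, **data})

    def dump(self) -> str:
        # JSON lines: trivially greppable, trivially diffable.
        return "\n".join(json.dumps(e) for e in self.events)


trace = Trace()
trace.log("tool_call", tool="search_documents", query="...")
trace.log("decision", reason="results sufficient; drafting response")
```

JSON-lines logs like this are what make the "agent did something weird at 2am" conversation debuggable at all.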
```yaml
# Honest agent architecture
agent:
  capabilities:
    - search_documents
    - summarize_text
    - draft_response
  constraints:
    max_tool_calls: 10
    max_cost: $0.50
    require_human_approval: true
    # when: confidence < 0.8
    # ^ This would be great if we could actually measure it
  fallback: "escalate_to_human"
```
The Uncomfortable Truth
The gap between “AI agent” marketing and “AI agent” reality is enormous. We’re selling autonomy and shipping automation.
Automation is valuable. Extremely valuable. But it’s not the same thing.