Agentic AI
I avoided this topic for months. Not because I didn’t find it interesting — I did, painfully so — but because the space was moving so fast that anything I wrote on Monday felt obsolete by Thursday. Every week brought a new framework, a new benchmark, a new claim that agents had “solved” some task. I kept waiting for things to settle. They haven’t. But I’ve realized something: the patterns have stabilized even as the implementations churn. The way we wire an LLM to tools, the way we structure planning loops, the way we manage memory — those ideas are converging. So that’s what we’re going to build here. Not a tour of this week’s hottest framework, but the architectural bones underneath all of them.
Here is what this post covers: we start with a single LLM call and ask why it’s not enough. We give it tools. We give it a thinking loop. We give it memory. We give it colleagues. By the end, you’ll understand how an autonomous agent is built from the ground up, what goes wrong in practice, and how to measure whether any of it works.
Before we start: this post assumes you’re comfortable with LLM basics — you know what a prompt is, what tokens are, and roughly what happens when you call an API like OpenAI’s or Anthropic’s. If you’ve done any prompt engineering, you’re in good shape. If not, the earlier sections on prompt engineering and LLM families will give you what you need.
We’re going to build something together. A trip planning assistant. It starts tiny — a single tool that can look up the weather for one city — and by the end of this post, it grows into a multi-tool, multi-agent vacation planner that books flights, finds restaurants, and argues with itself about whether you can afford that beachfront hotel. Every new concept we learn gets wired into this running example, so the abstractions stay grounded in something real.
Let’s go.
What We’ll Cover
The Perceive-Reason-Act Loop
Tool Use & Function Calling
The ReAct Pattern
Planning & Task Decomposition
Memory Systems
Rest Stop: Single-Agent Architecture
Multi-Agent Systems
Agent Frameworks in Practice
Code Generation Agents
Self-Reflection & Critique
Safety & Sandboxing
Evaluating Agents
Wrap-Up
Resources
The Perceive-Reason-Act Loop
Imagine you hire a summer intern. Brilliant kid — top of their class, knows a staggering amount of trivia, can write beautifully, can reason about complex problems. You sit them at a desk and say: “Plan me a vacation to Lisbon next month.” The intern thinks hard, then writes you a lovely three-page essay about the best neighborhoods in Lisbon, typical weather in June, estimated costs, and restaurant recommendations. Impressive. But none of those prices are current. The weather is a guess from training data two years old. And they didn’t actually book anything.
That intern is your LLM. Brilliant at reasoning. Terrible at doing things in the real world.
Now imagine you hand that same intern a phone, a browser, and a company credit card. You say: “Plan me a vacation to Lisbon. Actually book the flights and hotel. Check the real weather forecast. Find restaurants with reservations available.” Now the intern can perceive (check the weather API, search for flights), reason (decide which options are best given your budget and dates), and act (make the bookings). Same brain, wildly different capability.
That’s the difference between a chatbot and an agent. The chatbot tells you what to do. The agent does it.
Every agent, regardless of framework, runs the same fundamental loop:
# The universal agent loop (pseudocode)
while not done:
observation = perceive(environment) # read new info
thought = reason(observation) # LLM decides what to do
result = act(thought) # execute a tool or respond
environment.update(result) # world changes
The perceive step is where the agent takes in new information: a user message, the result of a tool call, an error from a failed API request. The reason step is the LLM doing what it does best — understanding context and deciding on a next action. The act step is where something actually happens in the world: an API is called, code is executed, a file is written, a message is sent.
A plain LLM call is one pass through this loop with no tools available in the “act” step. The model perceives your prompt, reasons about it, and acts by generating text. That’s it. One shot. An agent runs this loop multiple times, accumulating observations and refining its plan. The loop is what makes it agentic.
Let me make this concrete with our trip planner. Right now, it has no tools. Someone asks: “What’s the weather in Lisbon next week?”
# No tools: the LLM guesses
User: What's the weather in Lisbon next week?
Assistant: Lisbon typically has warm, sunny weather in summer,
with temperatures around 25-30°C. However, I don't
have access to real-time weather data.
Honest, but useless for trip planning. The model knows it can’t help. It doesn’t have hands. Let’s give it some.
Tool Use & Function Calling
Here is the first idea that confused me when I encountered it, and I want to save you that confusion: the LLM does not actually call functions. It outputs structured text — specifically, JSON — that describes which function it wants to call and with what arguments. Then your code on the server reads that JSON, calls the real function, and feeds the result back to the model. The model never touches your code, your database, or the internet. It writes a wish list; you fulfill it.
Think of it like our intern analogy again. The intern can’t actually make phone calls on your behalf. But they can write you a note that says: “Please call the weather service and ask about Lisbon, June 15-22.” You make the call, get the answer, and hand the note back. The intern reads the answer and continues planning.
Here’s how function calling works in practice. You define your tools as JSON schemas and pass them alongside the user’s message:
# Step 1: Define tools as JSON schemas
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the weather forecast for a city on specific dates",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "City name, e.g. 'Lisbon'"
},
"start_date": {
"type": "string",
"description": "Start date in YYYY-MM-DD format"
},
"end_date": {
"type": "string",
"description": "End date in YYYY-MM-DD format"
}
},
"required": ["city", "start_date", "end_date"]
}
}
}
]
# Step 2: Send the user message WITH the tool definitions
response = openai.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What's the weather in Lisbon June 15-22?"}],
tools=tools
)
The model doesn’t respond with a text answer. Instead, it responds with a tool call — a structured JSON object saying “I want to call get_weather with these arguments”:
# Step 3: The model's response contains a tool call, not text
# response.choices[0].message.tool_calls[0] looks like:
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"city\": \"Lisbon\", \"start_date\": \"2025-06-15\", \"end_date\": \"2025-06-22\"}"
}
}
Now here is the crucial part that I want to make very explicit. Your code receives this JSON. Your code parses the function name and arguments. Your code actually calls the weather API. The LLM did nothing but write structured text.
# Step 4: YOUR code dispatches the function call
import json
tool_call = response.choices[0].message.tool_calls[0]
func_name = tool_call.function.name
func_args = json.loads(tool_call.function.arguments)
# You execute the real function
if func_name == "get_weather":
weather_result = get_weather(**func_args)
# Returns: {"temp_high": 28, "temp_low": 18, "condition": "sunny",
# "precipitation_chance": 5}
# Step 5: Feed the result back to the model
messages = [
{"role": "user", "content": "What's the weather in Lisbon June 15-22?"},
response.choices[0].message, # the assistant's tool call
{
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(weather_result)
}
]
# Step 6: Model now generates a final answer using the real data
final_response = openai.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools
)
# "Lisbon will be sunny June 15-22 with highs around 28°C and
# lows around 18°C. Only a 5% chance of rain. Great beach weather!"
That’s the full round trip. User asks a question. Model decides it needs a tool. Model outputs JSON describing the call. Your code executes it. Result goes back to the model. Model generates a human-readable answer grounded in real data.
I want to be honest: when I first saw this pattern, my reaction was “wait, that’s it?” I expected some deep magic — some special neural pathway for tool use. There isn’t one. The model was fine-tuned to output valid JSON in a specific format when given tool schemas. That’s the entire trick. The engineering is in the loop around the model, not in the model itself.
Now let’s give our trip planner a second tool: flight search.
tools = [
{ ... get_weather ... },
{
"type": "function",
"function": {
"name": "search_flights",
"description": "Search for flights between two cities on a given date",
"parameters": {
"type": "object",
"properties": {
"origin": {"type": "string", "description": "Departure city"},
"destination": {"type": "string", "description": "Arrival city"},
"date": {"type": "string", "description": "Date in YYYY-MM-DD"},
"max_price": {"type": "number", "description": "Max price in USD"}
},
"required": ["origin", "destination", "date"]
}
}
}
]
With two tools available, something interesting happens. When the user says “Plan me a trip to Lisbon from New York, June 15-22, budget $2000,” the model has to decide which tool to call first. And once it gets the first result, it needs to decide whether to call another tool or respond.
But a single function-call round trip is only enough for simple questions. For complex requests, the model needs to interleave thinking and acting, calling tools multiple times and examining each result before deciding what to do next. It needs to think out loud. It needs a pattern.
The ReAct Pattern
In 2023, Yao et al. published a paper called “ReAct: Synergizing Reasoning and Acting in Language Models.” The core idea is disarmingly simple: let the model think out loud before each action. Instead of going straight from observation to tool call, the model writes an explicit Thought explaining its reasoning, then specifies an Action, then receives an Observation (the result). This Thought-Action-Observation cycle repeats until the model decides it has enough information to answer.
I resisted this idea for a while. “Why does the model need to think out loud?” I thought. “It’s already reasoning internally.” But the internal reasoning of an LLM is opaque — it happens in hidden states we can’t inspect. When you force the model to write down its reasoning as literal text tokens, two things happen. First, the reasoning gets better because each thought token conditions the generation of subsequent tokens. Second, you get a trace you can debug, log, and audit. That second point turns out to matter enormously in production.
Here is what a ReAct trace looks like for our trip planner. The user says: “Plan a 7-day trip to Lisbon from NYC in June, budget $2000.”
Thought: The user wants a trip to Lisbon from NYC in June with a $2000 budget.
I need to figure out: (1) weather, (2) flight costs, (3) whether the
budget is realistic. Let me start with flights since that's the
biggest expense.
Action: search_flights(origin="New York", destination="Lisbon",
date="2025-06-15", max_price=800)
Observation: Found 3 flights. Cheapest: TAP Portugal, $487 round trip.
Mid-range: United, $612. Premium: Delta, $789.
Thought: Great, the cheapest flight is $487 round trip. That leaves about
$1513 for 7 nights of accommodation, food, and activities. Let me
check the weather to know what kind of activities to recommend.
Action: get_weather(city="Lisbon", start_date="2025-06-15",
end_date="2025-06-22")
Observation: Sunny, highs 27-30°C, lows 17-19°C, precipitation 5%.
Thought: Perfect weather for outdoor activities and beaches. With $1513
remaining after flights, that's about $216/day for hotel and
expenses. That's reasonable for Lisbon. I have enough information
to give a solid recommendation.
Action: respond_to_user(...)
Final Answer: "Here's your Lisbon trip plan! Flights: TAP Portugal at $487
round trip departing June 15. Weather: gorgeous — sunny skies, 27-30°C.
Budget breakdown: $487 flights, ~$700 hotel (7 nights at ~$100/night in
Alfama or Baixa), ~$350 food, ~$463 for activities and transport..."
Notice the rhythm. Thought, Action, Observation. Thought, Action, Observation. Each thought explicitly reasons about what the agent knows so far and what it needs next. Each action is a concrete tool call. Each observation provides new data that feeds the next thought. The model is having a conversation with the world, one careful step at a time.
Let me show you the code that implements this loop. It’s shorter than you might expect:
def react_agent(user_query, tools, max_steps=10):
messages = [
{"role": "system", "content": """You are a trip planning assistant.
Think step-by-step. For each step, write a Thought explaining your
reasoning, then call a tool if needed. When you have enough information,
respond to the user directly."""},
{"role": "user", "content": user_query}
]
for step in range(max_steps):
response = openai.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools
)
msg = response.choices[0].message
messages.append(msg)
# If the model wants to call tools
if msg.tool_calls:
for tool_call in msg.tool_calls:
func_name = tool_call.function.name
func_args = json.loads(tool_call.function.arguments)
# Execute the real function
result = execute_tool(func_name, func_args)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result)
})
else:
# No tool calls means the model is done — return the answer
return msg.content
return "Agent exceeded maximum steps without completing the task."
That max_steps=10 parameter is doing more work than it appears. Without it, a confused agent could loop forever, burning tokens and money. I learned this the hard way when an early prototype got stuck in a loop trying to find a “direct flight from Omaha to Reykjavik” that didn’t exist, calling the search API over and over with slightly different date permutations. Seventeen API calls and $0.80 in tokens later, I added the guard.
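The loop above leans on an execute_tool helper I didn’t show. It can be as simple as a dispatch table. Here is a minimal sketch, assuming get_weather and search_flights exist as plain Python callables:
# A minimal tool dispatcher for the loop above. A sketch: assumes
# get_weather and search_flights are plain Python functions.
TOOL_REGISTRY = {
    "get_weather": get_weather,
    "search_flights": search_flights,
}

def execute_tool(func_name, func_args):
    func = TOOL_REGISTRY.get(func_name)
    if func is None:
        # Return errors as data so the model sees them in its next observation
        return {"error": f"Unknown tool: {func_name}"}
    try:
        return func(**func_args)
    except Exception as e:
        return {"error": str(e)}
Returning errors as data instead of raising is deliberate: the failure becomes the next observation, and the model gets a chance to route around it.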
The ReAct pattern is beautiful for tasks where each step depends on the previous result. Check the weather, then decide on activities. Search for flights, then adjust the budget. But it has a limitation that becomes apparent on complex tasks: it’s myopic. The model only decides one step at a time. It doesn’t lay out the full plan upfront, which means it can wander, double back, or miss things entirely. For our simple trip planner, that’s fine. For a complex multi-day research task, we need something more structured.
Planning & Task Decomposition
The single hardest problem in agentic AI is not tool calling or memory. It’s planning. I’ve spent more time debugging agent plans than any other component. A model that calls the wrong tool once can recover. A model that starts with a bad plan will confidently execute a series of perfectly formed but fundamentally misguided steps, arriving at a detailed, well-formatted, completely wrong answer.
There are three main approaches to planning, and they sit on a spectrum from reactive to deliberate.
ReAct: One Step at a Time
We already covered this. The model doesn’t plan ahead. It thinks about the immediate next action, takes it, observes the result, and decides the next step. It’s greedy — in the algorithmic sense. Good for short tasks (2-5 steps). Gets lost on longer ones because there’s no global view of what’s been done and what remains.
Plan-and-Execute: Think First, Do Second
The alternative is to make the plan explicit. Before calling any tools, the model writes out a complete plan — a numbered list of steps. Then a separate execution phase works through the plan step by step, potentially revising it as new information comes in.
Think of two chefs in a kitchen: a ReAct chef starts cooking immediately, deciding each step as they go. A Plan-and-Execute chef reads the full recipe first, gathers all ingredients, then starts cooking. The second chef is slower to start but less likely to realize halfway through that they forgot to buy eggs.
Here is what Plan-and-Execute looks like for our trip planner:
# Phase 1: Planning (LLM generates the plan)
User: "Plan a 7-day trip to Lisbon from NYC, June 15-22, budget $2000.
I love seafood and history."
Plan:
1. Search for round-trip flights NYC → Lisbon, June 15-22
2. Get weather forecast for Lisbon, June 15-22
3. Search for hotels in Lisbon, 7 nights, budget ~$100/night
4. Find top-rated seafood restaurants in Lisbon
5. Find historical sites and museums in Lisbon
6. Estimate daily budget (transport, food, activities)
7. Compile everything into a day-by-day itinerary
# Phase 2: Execution (work through each step)
Executing step 1... search_flights("NYC", "Lisbon", "2025-06-15") → $487
Executing step 2... get_weather("Lisbon", "2025-06-15", "2025-06-22") → sunny, 28°C
Executing step 3... search_hotels("Lisbon", "2025-06-15", 7, 100) → Hotel Alfama $89/night
...
# Phase 3: Synthesis (combine results into final answer)
Final itinerary: [detailed day-by-day plan]
The advantage is clarity. You can log the plan, show it to the user for approval, and track progress against it. The disadvantage is rigidity. What if step 1 reveals flights cost $1500 and the entire plan needs restructuring? A rigid planner will keep executing steps 2-7 as if nothing happened. A good Plan-and-Execute system includes a replanning step: after each execution, check whether the plan still makes sense given what you’ve learned.
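Here is a minimal sketch of that loop with the replanning check built in. The llm and execute_step helpers are hypothetical stand-ins for a raw LLM call and a single-step executor:
# Plan-and-Execute with replanning. A sketch: llm() and execute_step()
# are hypothetical helpers for an LLM call and a one-step executor.
def plan_and_execute(user_request, max_replans=3):
    plan = llm(f"Write a numbered plan for: {user_request}").splitlines()
    results = []
    while plan:
        step = plan.pop(0)
        results.append(execute_step(step))
        # After each step, check whether the remaining plan still makes sense
        verdict = llm(f"Request: {user_request}\nDone so far: {results}\n"
                      f"Remaining plan: {plan}\nAnswer KEEP or REVISE.")
        if "REVISE" in verdict and max_replans > 0:
            max_replans -= 1
            plan = llm(f"Revise the remaining plan for: {user_request}\n"
                       f"given these results: {results}").splitlines()
    return llm(f"Synthesize a final answer from: {results}")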
Reflexion: Learn from Mistakes
The third approach is the most ambitious. Reflexion (Shinn et al., 2023) adds a self-critique loop. The agent attempts a task, evaluates its own output, identifies what went wrong, and tries again with the benefit of that reflection. It’s inspired by how humans actually learn: try, fail, think about why you failed, try again differently.
# Reflexion loop (simplified)
for attempt in range(max_attempts):
result = agent.execute(task)
evaluation = agent.evaluate(result, task)
if evaluation.success:
return result
# Self-reflection: what went wrong?
reflection = agent.reflect(task, result, evaluation)
# e.g., "I booked a hotel outside the city center because I didn't
# check the distance to the historical district. Next time,
# I should filter hotels by proximity to Alfama."
agent.memory.add(reflection) # remember for next attempt
Reflexion is powerful but expensive. Each retry costs a full execution pass. For our trip planner, it might mean planning the entire trip, realizing the budget doesn’t work, reflecting on why (“I should have checked flight prices before allocating the hotel budget”), and replanning from scratch. Three attempts means three times the API calls and tokens.
Here is a practical way to think about when to use which pattern:
ReAct works when tasks are short (under 5 steps), each step has low cost, and the model needs to adapt quickly to new information. Good for: answering a question that requires 1-3 tool calls, real-time chat with tool access.
Plan-and-Execute works when tasks are complex (5-15 steps), the user benefits from seeing the plan upfront, and steps are somewhat predictable. Good for: trip planning, report generation, multi-step research.
Reflexion works when there’s a clear success criterion the agent can evaluate, retries are acceptable, and the task is hard enough that the first attempt often fails. Good for: code generation, math problems, any task with a verifiable answer.
Memory Systems
Here is an embarrassing confession. The first agent I built worked beautifully for about four exchanges. Then it started repeating itself, forgot the user’s budget, and recommended a hotel in Barcelona when we were planning a trip to Lisbon. The problem was memory — or rather, the lack of it. I had been naively stuffing everything into the conversation history, and after a dozen tool calls the context window was full of JSON noise and the model had lost the plot.
Memory is where the intern analogy gets really useful. Think about what a real intern has access to:
They have a desk with limited space — that’s the context window. Only so many papers fit. This is short-term memory: the current conversation, the last few tool results, the immediate task. For current models, this is typically 128K tokens, which sounds like a lot until you start packing in tool results. A single API response from a flight search might be 2,000 tokens. Ten tool calls and you’ve eaten 20,000 tokens of context with raw data.
They have a notepad where they jot down key findings and running calculations — that’s the scratchpad or working memory. It’s a condensed summary: “Cheapest flight: $487 TAP. Weather: sunny 28°C. Remaining budget: $1513.” Instead of keeping all the raw tool outputs in context, you can periodically summarize them into a scratchpad that takes a fraction of the tokens.
# Working memory: a scratchpad maintained by the agent
scratchpad = """
## Trip Planning Progress
- Destination: Lisbon, June 15-22
- Budget: $2000
- Flights: TAP Portugal $487 RT (BOOKED)
- Weather: Sunny, 27-30°C, low rain chance
- Hotel: Searching... budget ~$100/night
- Remaining budget: $1513
- User preferences: seafood, history
"""
They have a filing cabinet (or access to the company’s shared drive) where past projects live — that’s long-term memory, typically implemented as a vector database. When the agent needs to recall something from a previous conversation or a large knowledge base, it performs a similarity search. “What did this user prefer on their last trip?” triggers a vector lookup that retrieves the relevant past context.
# Long-term memory: vector DB retrieval
def retrieve_user_preferences(user_id, query):
# Embed the query
query_embedding = embed(query)
# Search the vector DB for relevant past interactions
results = vector_db.search(
collection="user_memories",
query_vector=query_embedding,
filter={"user_id": user_id},
top_k=5
)
return results
# Returns things like:
# "User prefers window seats"
# "User is allergic to shellfish (contradicts seafood preference — clarify!)"
# "Last trip was Rome, enjoyed walking tours"
And finally, there’s episodic memory: specific memories of past experiences. “Last time I tried to book through this API, it was down on weekends” or “The user got frustrated when I gave too many options — limit to three choices.” Episodic memory is what Reflexion creates: memories born from specific successes and failures that guide future behavior.
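In practice, episodic memories can live in the same vector store as long-term facts, tagged by type so that lessons learned can be retrieved on their own. A sketch, reusing the hypothetical embed and vector_db from the retrieval example above:
# Storing a reflection as an episodic memory. Reuses the hypothetical
# embed() and vector_db from the retrieval example above.
def remember_episode(user_id, reflection, outcome):
    vector_db.insert(
        collection="user_memories",
        vector=embed(reflection),
        payload={
            "user_id": user_id,
            "type": "episodic",   # distinguishes lessons from plain facts
            "text": reflection,
            "outcome": outcome,   # "success" or "failure"
        }
    )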
The art of agent memory design is knowing what to put where. Raw tool outputs go into short-term memory and get summarized into the scratchpad. Key facts and user preferences go into long-term storage. Lessons learned from failures go into episodic memory. And every few steps, you garbage-collect: remove stale data from the context, keep only what the model needs for the current step.
# Memory management in practice
def manage_memory(messages, scratchpad, step_count):
if step_count % 5 == 0:
# Every 5 steps, compress old tool results
summary = llm("Summarize these tool results into key facts: "
+ format_recent_tools(messages))
scratchpad.update(summary)
# Remove verbose tool outputs from message history
messages = compress_messages(messages, keep_last=3)
# Always inject the scratchpad into the system prompt
messages[0]["content"] += f"\n\nCurrent progress:\n{scratchpad}"
return messages
I cannot overstate how much difference good memory management makes. The same model, with the same tools, will produce dramatically better results when its context window contains a clean scratchpad of key facts instead of a wall of raw JSON from six different API calls. It’s the difference between our intern working at a clean desk with organized notes versus drowning in a pile of unsorted printouts.
Rest Stop: Single-Agent Architecture
Take a breath. Let’s recap where we are with a snapshot of our trip planner. It’s a single agent with two tools (weather and flights), a ReAct loop for step-by-step reasoning, and a scratchpad that tracks progress. It can handle straightforward trip queries reasonably well. But when we asked it to plan a complex two-week multi-city tour — Lisbon, then Porto, then a day trip to Sintra, with restaurant reservations and museum tickets — it started to struggle. Too many tools, too many steps, too much context. It’s like asking one intern to do the work of an entire travel agency.
So let’s give it colleagues.
Multi-Agent Systems
I’ll be honest: when I first heard “multi-agent systems,” my reaction was skepticism. It sounded like over-engineering. Why use three LLM calls when one could work? But after watching a single agent collapse under the weight of a complex planning task — juggling flights, hotels, restaurants, weather, budget constraints, and user preferences all at once — I understood. The same reason companies have teams instead of one person doing everything applies to agents.
Back to the kitchen. A single chef (our LLM) handling one dish at a time works fine for a dinner party. But a restaurant kitchen has a brigade system: an executive chef directing the operation, a saucier for sauces, a poissonnier for fish, a pâtissier for desserts. Each specialist is excellent at their narrow domain. The executive chef coordinates. That’s multi-agent architecture.
There are three dominant patterns for connecting multiple agents, and each one maps to a real organizational structure.
The Supervisor Pattern (One Boss Delegates)
One agent acts as the manager. It receives the user’s request, breaks it into subtasks, and delegates each subtask to a specialist agent. The specialist agents each have their own tool set and system prompt. They complete their work and report back to the supervisor, who synthesizes the results.
# Supervisor pattern for our trip planner
class TripPlannerSupervisor:
def __init__(self):
self.flight_agent = Agent(
name="FlightAgent",
tools=[search_flights, compare_airlines],
system_prompt="You are a flight search specialist..."
)
self.hotel_agent = Agent(
name="HotelAgent",
tools=[search_hotels, check_reviews],
system_prompt="You are an accommodation specialist..."
)
self.activity_agent = Agent(
name="ActivityAgent",
tools=[search_restaurants, find_attractions, check_tickets],
system_prompt="You are a local activities specialist..."
)
self.supervisor = Agent(
name="Supervisor",
tools=[delegate_to_agent],
system_prompt="""You are a trip planning supervisor.
Break the user's request into subtasks and delegate to:
- FlightAgent for flights
- HotelAgent for accommodation
- ActivityAgent for restaurants and attractions
Synthesize their results into a cohesive plan."""
)
def plan_trip(self, user_request):
return self.supervisor.run(user_request)
The supervisor pattern is the most common in production. It’s predictable, easy to debug (you can inspect each specialist’s output independently), and fails gracefully (if the hotel agent errors, the flight results are still good).
The Debate Pattern (Agents Argue)
Multiple agents look at the same problem from different perspectives and argue until they converge. This is powerful for decisions where there’s no single right answer. For our trip planner, imagine a “Budget Agent” that prioritizes saving money and a “Luxury Agent” that prioritizes experience. They debate each choice:
# Debate pattern (simplified)
budget_agent_says = "The hostel in Alfama is $35/night and has great reviews."
luxury_agent_says = "A boutique hotel in Chiado at $120/night puts you walking
distance from everything. The time saved on transit is
worth the extra cost."
budget_agent_replies = "At $120/night for 7 nights, that's $840 just for the
hotel — 42% of the total budget. The hostel leaves
more for experiences."
# A judge agent or the supervisor decides
judge_says = "Given the user's preference for history and walking tours,
the Chiado location is worth the premium. Recommend Chiado
but offer the hostel as a budget alternative."
The debate pattern produces more nuanced outputs but costs significantly more in tokens (every exchange is an LLM call). Use it when the quality of the decision matters more than the cost of deliberation.
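Orchestrating a debate is a short loop: alternate between the advocates for a fixed number of rounds, then hand the transcript to a judge. A sketch, where run_agent is a hypothetical helper that wraps an LLM call with that agent’s persona as the system prompt:
# A minimal debate loop. run_agent(name, prompt) is a hypothetical helper
# that calls the LLM with that agent's persona as the system prompt.
def debate(question, rounds=2):
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        for advocate in ["BudgetAgent", "LuxuryAgent"]:
            reply = run_agent(advocate, "\n".join(transcript))
            transcript.append(f"{advocate}: {reply}")
    # The judge sees the full exchange and makes the final call
    return run_agent("JudgeAgent",
                     "\n".join(transcript) + "\nDecide and justify briefly.")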
The Assembly Line Pattern (Sequential Handoff)
Agents are arranged in a pipeline. The output of one becomes the input of the next, like an assembly line in a factory. For our trip planner: a Research Agent gathers raw data, a Planning Agent structures it into an itinerary, a Budget Agent verifies costs and adjusts, and a Writing Agent formats the final recommendation.
# Assembly line pattern
raw_data = research_agent.run("Gather flight, hotel, weather data for Lisbon trip")
structured_plan = planning_agent.run(f"Create day-by-day itinerary from: {raw_data}")
verified_plan = budget_agent.run(f"Verify budget and adjust: {structured_plan}")
final_output = writing_agent.run(f"Format this into a polished plan: {verified_plan}")
The assembly line is conceptually the cleanest but also the most fragile. Each handoff is a potential point of information loss. If the research agent forgets to include the user’s seafood preference, no downstream agent can recover that information. In practice, you need to pass the original user request alongside each handoff, not only the previous agent’s output.
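In code, that just means threading the original request through every stage. A sketch of the same pipeline with that fix:
# Same pipeline, but the original request rides along with every handoff
request = "7 days in Lisbon from NYC, June 15-22, $2000, loves seafood and history"
raw_data = research_agent.run(f"Request: {request}\nGather flight, hotel, weather data")
structured_plan = planning_agent.run(f"Request: {request}\nCreate itinerary from: {raw_data}")
verified_plan = budget_agent.run(f"Request: {request}\nVerify budget: {structured_plan}")
final_output = writing_agent.run(f"Request: {request}\nFormat into a polished plan: {verified_plan}")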
So when is multi-agent worth the added complexity? I use a rough heuristic: if a single agent needs more than 6-8 tools, or if the task naturally decomposes into independent subtasks that can run in parallel, multi-agent starts to pay off. For anything simpler, a single agent with good tools and memory will outperform a poorly designed multi-agent system every time.
Agent Frameworks in Practice
I have a confession that might be unpopular: I built my first three agents without any framework at all. Raw API calls, a while loop, and a dictionary for state. And it worked. I mention this because the framework landscape is so crowded and so loud that it’s easy to believe you need a framework to build agents. You don’t. But frameworks genuinely help once your agents get complex enough, particularly for managing state, handling errors, and orchestrating multi-agent flows.
Here is an honest comparison of the three frameworks I’ve actually used in production-adjacent work.
LangGraph
LangGraph is from the LangChain team. It models agent workflows as graphs — nodes are processing steps (LLM calls, tool executions, conditional checks) and edges define the flow between them. It excels at complex, stateful workflows where the agent might loop, branch, or need human approval at certain steps.
# LangGraph: agent as a state machine
from langgraph.graph import StateGraph, END
workflow = StateGraph(TripPlannerState)
# Define nodes
workflow.add_node("research", research_node)
workflow.add_node("plan", planning_node)
workflow.add_node("verify_budget", budget_node)
workflow.add_node("human_review", human_review_node)
# Define edges (flow)
workflow.add_edge("research", "plan")
workflow.add_edge("plan", "verify_budget")
workflow.add_conditional_edges(
"verify_budget",
budget_check, # function that returns "approved" or "over_budget"
{"approved": "human_review", "over_budget": "plan"} # loop back if over budget
)
workflow.add_edge("human_review", END)
app = workflow.compile()
LangGraph’s strength is explicitness. The graph is the architecture diagram. You can visualize it, reason about loops, and add checkpoints for persistence. Its weakness is verbosity — a simple ReAct agent takes more code in LangGraph than a raw while loop.
CrewAI
CrewAI is designed around the multi-agent metaphor. You define “agents” with roles, “tasks” they perform, and a “crew” that orchestrates them. It’s the most intuitive framework if you think in terms of teams.
# CrewAI: agents as team members
from crewai import Agent, Task, Crew
flight_researcher = Agent(
role="Flight Researcher",
goal="Find the best flights within budget",
tools=[search_flights_tool],
verbose=True
)
itinerary_planner = Agent(
role="Itinerary Planner",
goal="Create a detailed day-by-day vacation plan",
tools=[search_attractions_tool, get_weather_tool],
verbose=True
)
# Define tasks
research_task = Task(
description="Find round-trip flights NYC to Lisbon, June 15-22, under $600",
agent=flight_researcher
)
planning_task = Task(
description="Create a 7-day Lisbon itinerary for a seafood and history lover",
agent=itinerary_planner,
context=[research_task] # gets the output of research_task
)
# Run the crew
crew = Crew(agents=[flight_researcher, itinerary_planner],
tasks=[research_task, planning_task])
result = crew.kickoff()
CrewAI is fast to prototype with. Define roles, give them tools, describe tasks in natural language, run. It’s less flexible than LangGraph for complex conditional workflows, but for straightforward multi-agent pipelines, it’s hard to beat for speed of development.
AutoGen
AutoGen (from Microsoft) is built around conversations between agents. Agents are chat participants that talk to each other. A “GroupChat” manages turn-taking. This makes it the natural choice for the debate pattern, or any workflow where agents need back-and-forth dialogue.
# AutoGen: agents as chat participants
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager
flight_agent = AssistantAgent(
name="FlightExpert",
system_message="You are a flight booking expert...",
llm_config=llm_config
)
budget_agent = AssistantAgent(
name="BudgetAdvisor",
system_message="You scrutinize travel plans for cost efficiency...",
llm_config=llm_config
)
user_proxy = UserProxyAgent(
name="TripPlanner",
human_input_mode="NEVER",
code_execution_config={"work_dir": "planning"}
)
group_chat = GroupChat(
    agents=[user_proxy, flight_agent, budget_agent],
    messages=[],
    max_round=12
)
# A GroupChatManager coordinates turn-taking and runs the conversation
manager = GroupChatManager(groupchat=group_chat, llm_config=llm_config)
user_proxy.initiate_chat(manager, message="Plan a 7-day Lisbon trip under $2000.")
The trade-off with all three: LangGraph gives you the most control and is best for production state machines. CrewAI gives you the fastest prototype and is best for multi-agent pipelines. AutoGen gives you the most natural multi-agent communication and is best for debate-style workflows. None of them is the “right” choice — they’re different tools for different shapes of problems.
Code Generation Agents
If tool-using agents are the general case, code generation agents are the most spectacular specific case. They don’t call pre-defined functions — they write the functions. They read codebases, understand issues, write patches, and run tests. The first time I watched an agent open a GitHub issue, navigate a codebase, write a fix, run the test suite, discover its fix broke something else, fix that, and submit a pull request — all without human intervention — I had to sit down for a minute.
SWE-agent (Princeton, 2024) was one of the first systems to demonstrate this systematically. It gives an LLM access to a shell, a code editor, and a test runner, then points it at GitHub issues and says “fix this.” The architecture is a ReAct loop with specialized tools for code navigation (search files, open file, scroll, edit) and code execution (run tests, run linter).
The workflow looks like this:
# Simplified SWE-agent workflow
1. Agent reads the GitHub issue description
2. Thought: "This is a bug in the date parsing module. Let me find the
relevant file."
3. Action: search_codebase("date parsing")
4. Observation: Found src/utils/date_parser.py (3 matches)
5. Thought: "Let me open that file and look at the parsing logic."
6. Action: open_file("src/utils/date_parser.py")
7. Observation: [file contents shown]
8. Thought: "Line 47 uses strptime with format '%Y-%m-%d' but the input
dates are in DD/MM/YYYY format. That's the bug."
9. Action: edit_file("src/utils/date_parser.py", line=47,
old="strptime(date_str, '%Y-%m-%d')",
new="strptime(date_str, '%d/%m/%Y')")
10. Action: run_tests()
11. Observation: 47 passed, 1 failed — test_american_dates now fails
12. Thought: "My fix broke American date format support. I need to detect
the format first..."
13. [continues iterating...]
Notice step 11. The agent’s first fix introduced a regression. This happens constantly with code agents, and it’s why the test-running step is so critical. Without automated tests, the agent has no way to know if its fix works. The agent’s ability is bounded by the quality of the test suite it has access to.
Devin (Cognition Labs, 2024) took this further, adding a full development environment: browser, terminal, code editor, and the ability to look up documentation, install packages, and debug interactively. It operates more like a junior developer than a code completion tool — it takes high-level tasks (“add dark mode to the settings page”) and works through them end-to-end.
The key insight behind all code agents is that code is verifiable. Unlike a trip plan or a written report, code can be tested. Does it compile? Do the tests pass? Does it produce the right output? This tight feedback loop — write code, run it, see if it works, fix it if not — maps perfectly onto the ReAct + Reflexion pattern. It’s why code generation is the area where agents have improved fastest.
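That feedback loop fits in surprisingly few lines. A sketch, assuming a hypothetical generate_patch LLM helper, a hypothetical apply_patch utility, and a pytest-based test suite:
# The write-run-fix loop behind code agents. A sketch: generate_patch()
# and apply_patch() are hypothetical helpers; tests run via pytest.
import subprocess

def code_fix_loop(issue, max_attempts=5):
    feedback = ""
    for _ in range(max_attempts):
        patch = generate_patch(issue, feedback)  # LLM writes or edits code
        apply_patch(patch)
        run = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if run.returncode == 0:
            return patch                  # all tests pass, we're done
        feedback = run.stdout[-2000:]     # feed the failures back to the model
    return None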
Self-Reflection & Critique
I want to zoom in on the Reflexion pattern because it represents something genuinely new: an LLM that gets better at a task within a single session without any weight updates. No fine-tuning. No gradient descent. The model improves purely through natural language self-critique stored in its context.
The paper by Shinn et al. (2023) calls this verbal reinforcement learning. Instead of updating neural network weights based on a reward signal (as in traditional RL), the agent writes a natural-language reflection on what went wrong and adds it to its prompt for the next attempt. The “weights” being updated are just words in the context window.
Here is what this looks like concretely. Our trip planner attempts to plan a trip, and the result is evaluated:
# Attempt 1
Plan: Fly to Lisbon June 15, hotel in Belem, seafood at Time Out Market...
Evaluation: Budget exceeded by $340. Hotel in Belem is far from historical
center. No museum or walking tour recommendations despite user
requesting history focus.
# Self-reflection (generated by the LLM)
Reflection: "I made three mistakes: (1) I chose the hotel before checking
the total budget — I should calculate flight + hotel first and work backward
to get the activities budget. (2) I picked Belem for the hotel because it
has famous attractions, but it's not the best base for exploring the
historical center — Alfama or Baixa would be better. (3) I completely
forgot the user's interest in history and focused only on food. I need to
re-read the original request more carefully."
# Attempt 2 (reflection added to context)
Plan: [now budget-aware, hotel in Alfama, includes Castelo de São Jorge,
National Tile Museum, walking tour of Alfama, plus seafood at
Cervejaria Ramiro]
Evaluation: Within budget. Excellent location. Addresses all user preferences.
Success!
The beautiful thing is that the reflection in attempt 1 genuinely improves attempt 2. The model reads its own critique and adjusts. No magic — it’s in-context learning, which we already know LLMs are good at. Reflexion formalizes it into a loop.
But there is a subtle trap I fell into. The model can write a reflection that sounds insightful but doesn’t actually address the root cause. “I should have been more careful” is a useless reflection. “I should calculate total fixed costs (flights + hotel) before allocating the activities budget” is useful because it specifies a concrete behavioral change. When implementing Reflexion, prompt the model to produce actionable reflections, not vague self-criticism.
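The fix is mostly prompting. Something like this nudges the model toward concrete, checkable reflections. A sketch of the kind of template I use, with the llm helper from earlier:
# A reflection prompt that demands actionable output (illustrative template)
REFLECTION_PROMPT = """You attempted a task and failed. Write a reflection that:
1. Names the specific decision that caused the failure
2. States the concrete rule you will follow next time (not "be more careful")
3. Is checkable: a reader could verify whether you followed it

Task: {task}
Your output: {result}
Evaluation: {evaluation}"""

reflection = llm(REFLECTION_PROMPT.format(task=task, result=result,
                                          evaluation=evaluation))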
Safety & Sandboxing
Let me tell you a story that still makes me nervous. During early testing of a code agent, I gave it access to a shell and asked it to clean up some temporary files. It ran rm -rf on a directory it had hallucinated was temporary but was actually a subdirectory of the project. Nothing critical was lost (version control saved us), but the agent had deleted production configuration files with complete confidence. It even wrote in its reasoning trace: “These look like auto-generated files, safe to remove.”
Agents are dangerous in a way that plain LLMs are not. A plain LLM can generate harmful text, but it can’t do anything. An agent with tools can delete files, send emails, make API calls, spend money, and modify databases. The autonomy that makes agents useful is the same autonomy that makes them risky.
Here are the safety layers that I’ve found essential in practice:
Sandboxed Execution
Every tool call should run in a sandbox. Code execution happens in a Docker container, not on the host machine. File access is restricted to a specific directory. Network access is limited to approved domains. The agent cannot reach your production database, your cloud credentials, or your email.
# Sandboxing tool execution
def execute_tool_safely(func_name, func_args):
# Allowlist of safe functions
ALLOWED_TOOLS = {"get_weather", "search_flights", "search_hotels"}
if func_name not in ALLOWED_TOOLS:
return {"error": f"Tool '{func_name}' is not allowed"}
# Rate limiting
if rate_limiter.is_exceeded(func_name):
return {"error": "Rate limit exceeded for this tool"}
# Budget guard
if token_budget.remaining() < 1000:
return {"error": "Token budget exhausted"}
# Execute in sandbox
result = sandbox.run(func_name, func_args, timeout=30)
return result
Human-in-the-Loop
For high-stakes actions, require human approval before execution. The agent proposes an action, a human reviews it, and only then does it execute. This is the difference between a “tool-using assistant” and a “fully autonomous” agent on the autonomy spectrum. Most production agents today sit at the tool-using-assistant end: they propose, humans approve.
# Human-in-the-loop gate
def act_with_approval(action, requires_approval=False):
if requires_approval or action.cost > COST_THRESHOLD:
print(f"Agent wants to: {action.description}")
print(f"Estimated cost: ${action.cost}")
approval = input("Approve? (y/n): ")
if approval != "y":
return {"status": "rejected", "reason": "Human declined"}
return execute_action(action)
Token and Cost Budgets
An agent in a loop can consume unlimited tokens if you let it. Set hard budgets. Every tool call costs tokens (the request and the response). Every LLM reasoning step costs tokens. A five-step agent might cost $0.10. A confused agent in a loop can cost $5 before you notice. I set three limits: max steps (how many loop iterations), max tokens (total token consumption), and max cost (dollar ceiling).
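Here is the shape of that triple guard, as a sketch. The per-token prices are illustrative, not current rates:
# Three hard limits on one loop: steps, tokens, dollars.
# Per-token prices below are illustrative; check your provider's rates.
class BudgetGuard:
    def __init__(self, max_steps=15, max_tokens=100_000, max_cost_usd=1.00):
        self.max_steps, self.max_tokens, self.max_cost = max_steps, max_tokens, max_cost_usd
        self.steps, self.tokens, self.cost = 0, 0, 0.0

    def charge(self, prompt_tokens, completion_tokens,
               in_price=2.5e-6, out_price=1e-5):  # dollars per token
        self.steps += 1
        self.tokens += prompt_tokens + completion_tokens
        self.cost += prompt_tokens * in_price + completion_tokens * out_price
        if (self.steps > self.max_steps or self.tokens > self.max_tokens
                or self.cost > self.max_cost):
            raise RuntimeError(f"Budget exceeded: {self.steps} steps, "
                               f"{self.tokens} tokens, ${self.cost:.2f}")
Call guard.charge(response.usage.prompt_tokens, response.usage.completion_tokens) after every model call, and catch the exception at the top of the loop so the agent fails gracefully instead of silently overspending.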
Comprehensive Logging
Log everything. Every thought, every tool call, every observation, every token count. When an agent goes wrong — and it will — the trace is your only way to understand what happened. This is where the ReAct pattern pays dividends: those explicit Thought steps aren’t overhead, they’re your debugging lifeline.
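A sketch of what that looks like in practice: one JSON line per event, appended to a trace file you can grep or load into a dataframe later.
# Structured trace logging: one JSON line per agent event
import json, time

def log_event(trace_file, event_type, payload):
    record = {"ts": time.time(), "type": event_type, **payload}
    with open(trace_file, "a") as f:
        f.write(json.dumps(record) + "\n")

# Inside the agent loop:
# log_event("trace.jsonl", "thought", {"step": step, "text": msg.content})
# log_event("trace.jsonl", "tool_call", {"name": func_name, "args": func_args})
# log_event("trace.jsonl", "observation", {"result": result})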
Evaluating Agents
How do you know if your agent is any good? This question tormented me for weeks. With a classifier, you compute accuracy. With a language model, you compute perplexity. But an agent does multiple things across multiple steps, and “correct” is much harder to define. Did it complete the task? Did it complete it efficiently? Did it avoid errors along the way? Did the final output actually help the user?
Let me start with the mathematical reality that keeps agent builders up at night: compounding errors.
Suppose your agent is 95% reliable at each individual step. That sounds great. But a 5-step task means the overall probability of success is 0.95 × 0.95 × 0.95 × 0.95 × 0.95 ≈ 0.77. A 10-step task drops to about 0.60. A 20-step task: 0.36. This is the fundamental challenge of agentic systems. Each step is individually reliable, but errors compound multiplicatively across steps.
# Compounding error rates
per_step_accuracy = 0.95
for n_steps in [1, 3, 5, 10, 15, 20]:
overall = per_step_accuracy ** n_steps
print(f"{n_steps:2d} steps: {overall:.1%} overall accuracy")
# 1 steps: 95.0% overall accuracy
# 3 steps: 85.7% overall accuracy
# 5 steps: 77.4% overall accuracy
# 10 steps: 59.9% overall accuracy
# 15 steps: 46.3% overall accuracy
# 20 steps: 35.8% overall accuracy
This is why Reflexion and self-correction matter so much. If the agent can detect and recover from errors, the effective per-step accuracy goes up. If the agent has good tests (for code) or verification tools (for data), it can catch mistakes before they cascade. The math changes from p^n to something more favorable when you add error recovery.
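A back-of-envelope way to see how much: if the agent catches and fixes a fraction r of its own mistakes, the effective per-step accuracy becomes p + (1 - p) * r. That formula is my simplification, not a result from the papers, but it shows the leverage:
# Error recovery changes the compounding math (simplified model)
p, r = 0.95, 0.60                 # raw per-step accuracy, recovery rate
p_eff = p + (1 - p) * r           # 0.95 + 0.05 * 0.6 = 0.98
for n in [5, 10, 20]:
    print(f"{n:2d} steps: raw {p**n:.1%} vs with recovery {p_eff**n:.1%}")
#  5 steps: raw 77.4% vs with recovery 90.4%
# 10 steps: raw 59.9% vs with recovery 81.7%
# 20 steps: raw 35.8% vs with recovery 66.8%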
SWE-bench
SWE-bench is the most widely cited benchmark for code agents. It consists of real GitHub issues from popular Python repositories (Django, Flask, scikit-learn, etc.) paired with the human-written pull requests that resolved them. The agent gets the issue description and the codebase, and must produce a patch that makes the failing tests pass.
SWE-bench has been a remarkable forcing function for progress. In early 2024, the best systems resolved about 20% of issues. By mid-2025, top models are resolving roughly 70-80% of the “Verified” subset (a curated, cleaner version of the benchmark). That’s an extraordinary rate of improvement. But the numbers need context: SWE-bench Verified is an easier subset, the issues are well-specified, and the test suites are provided. Real-world code tasks are messier.
GAIA
GAIA (General AI Assistants) is a benchmark for general-purpose agents. Tasks range from “What was the population of the city where Einstein was born, in the year he published special relativity?” (requires multi-step lookup) to complex multi-tool challenges. GAIA measures an agent’s ability to reason, use tools, and synthesize information across multiple sources.
Multi-model combinations and sophisticated agent architectures have pushed GAIA scores above 90% on some subsets, though harder levels remain challenging. GAIA is particularly useful because it tests the full agent stack: reasoning, tool use, multi-step planning, and information synthesis.
Evaluating Your Own Agents
Benchmarks are useful for comparing approaches, but ultimately you need to evaluate on your tasks. For our trip planner, I built a simple evaluation suite:
# Custom evaluation for the trip planner
test_cases = [
{
"query": "Plan a weekend trip to Lisbon from NYC, budget $1500",
"checks": [
("mentions_flight", lambda r: "flight" in r.lower()),
("within_budget", lambda r: extract_total(r) <= 1500),
("includes_hotel", lambda r: "hotel" in r.lower()),
("includes_weather", lambda r: any(w in r.lower()
for w in ["sunny", "rain", "temperature", "°"])),
]
},
# ... more test cases
]
def evaluate_agent(agent, test_cases):
results = []
for case in test_cases:
response = agent.run(case["query"])
checks_passed = sum(1 for name, check in case["checks"]
if check(response))
results.append({
"query": case["query"],
"pass_rate": checks_passed / len(case["checks"]),
"steps_taken": agent.step_count,
"tokens_used": agent.token_count
})
return results
I track three dimensions: correctness (did it complete the task?), efficiency (how many steps and tokens?), and safety (did it stay within bounds?). A perfect agent that takes 20 steps and costs $2 per query isn’t necessarily better than a good-enough agent that takes 4 steps and costs $0.05.
Wrap-Up
Thank you for sticking with me through this. We’ve covered a lot of ground, and I want to trace the path we took because it mirrors the actual architecture of the systems we built.
We started with a frustrated LLM — brilliant at reasoning, but unable to interact with the world. We gave it tools through function calling: the model outputs JSON, our code executes it, and the result feeds back. That single idea — LLM as decision-maker, code as executor — is the foundation of everything else.
We gave it a thinking loop — ReAct — so it could interleave reasoning and action across multiple steps. We explored how different planning strategies (Plan-and-Execute, Reflexion) trade off between reactivity and deliberation. We gave it memory — a scratchpad for working state, a vector database for long-term recall, and episodic memory for learning from past mistakes.
We paused to note that this single-agent architecture is enough for most production use cases. Then we kept going.
We gave it colleagues — multi-agent systems where specialists handle different parts of a complex task, coordinated by a supervisor, debating in a group chat, or passing work down an assembly line. We looked at the frameworks that make this manageable in code (LangGraph, CrewAI, AutoGen), and at the spectacular special case of code generation agents that write and test their own patches.
We confronted the hard parts: safety (agents can do real damage), the compounding error problem (95% per step still means 77% over five steps), and the difficulty of evaluation (benchmarks help, but you have to evaluate on your own tasks).
Our trip planner grew from a single weather lookup to a multi-agent vacation planner. Like any good intern that’s been on the job for a while, it got more capable as we gave it more tools, more structure, and more guardrails. The intern analogy held throughout: how much autonomy you give depends on how much you trust, and trust is built through observation, logging, and incremental delegation.
I still feel like this space is early. Every month brings meaningful improvements in model reliability, tool-use accuracy, and planning capability. The patterns we covered — perceive-reason-act, ReAct, memory management, multi-agent coordination — will evolve in implementation but I believe the core ideas are durable. Agents are how LLMs become useful in the real world, and understanding their architecture from the ground up is one of the most valuable investments you can make right now.
Resources
The papers that matter most:
ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2023) — The paper that formalized Thought-Action-Observation loops. Short, readable, and foundational. If you read one paper from this list, make it this one.
Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023) — Self-critique for agents. The idea that natural-language reflection can substitute for gradient-based learning is provocative and surprisingly effective.
SWE-agent: Agent-Computer Interfaces for Software Engineering (Yang et al., 2024) — How to build a coding agent from scratch. The details about the agent-computer interface design are more valuable than the benchmark results.
Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023) — An earlier paper showing that LLMs can learn to insert API calls into their own text generation. Important conceptual precursor to modern function calling.
Practical guides:
Building Effective Agents by Anthropic — A practical blog post from the Anthropic team with concrete patterns and anti-patterns for production agents. Refreshingly honest about failure modes.
LangGraph documentation (langchain-ai.github.io/langgraph) — The best documentation of any agent framework I’ve used. The tutorials build complexity gradually and the conceptual guides explain why, not only how.
OpenAI Function Calling Guide (platform.openai.com/docs) — The definitive reference for the function calling API that most agents build on. Clear examples, good edge case coverage.
Benchmarks to watch:
SWE-bench (swebench.com) — The standard benchmark for code agents. Track the leaderboard to see where the field stands. Pay attention to the “Verified” vs. full benchmark distinction.
GAIA (huggingface.co/gaia-benchmark) — The best benchmark for general-purpose agent capability. Multi-level difficulty lets you see where systems break down.
Frameworks to try (start with whichever matches your mental model):
LangGraph — if you think in graphs and state machines.
CrewAI — if you think in teams and roles.
AutoGen — if you think in conversations.
Or none — a while loop and an API key will take you surprisingly far.