By Yusuf Elborey

Budget-Aware AI Agents: Keeping Cost, Tokens, and Latency Under Control

Tags: ai-agents, budget, cost-control, tokens, latency, python, production, monitoring

Your agent works. It answers questions. It calls tools. But your bills spike. Latency jumps around. Users see timeouts.

The agent feels smart. But it calls the model too many times. It pulls huge contexts. It chains tools without limit.

This article shows you how to add budgets. Not complex systems. Simple limits that actually work.

The Hidden Problem: Silent Overuse

Agents look smart. They reason. They plan. They adapt. But they also:

  • Call the model many times per request
  • Pull large contexts into every call
  • Chain tools without stopping
  • Loop on the same problem

You don’t notice at first. Then bills spike. Latency becomes unstable. Users see intermittent timeouts.

Symptoms

Bills spike with no clear reason:

You check your usage. One agent run used 50,000 tokens. Another used 200,000. You can’t predict which requests will be expensive.

Latency becomes unstable:

Some requests finish in 2 seconds. Others take 30 seconds. Users can’t tell whether something is broken or just slow.

Users see intermittent timeouts:

The agent runs too long. Your API times out. The user gets nothing. They retry. It happens again.

Basics of Budgeting for Agents

You need three budgets:

  1. Token budget - Max tokens per run, per user, or per tenant
  2. Time budget - Max wall-clock time per run
  3. Money budget - Approximate cost per run or per day

These budgets relate to each other. More tools and thoughts mean more calls. More calls mean more tokens and time. More tokens mean more cost.

You don’t need perfect precision. Rough estimates work. The goal is to stop runaway costs, not to track every token.
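A money budget can be derived from token counts with a rough price table. The sketch below is one way to do that. The model names, the prices, and the helper `estimate_cost` are all placeholders invented for illustration, not real pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    # Prices in USD per 1,000 tokens. Placeholder values, not real pricing.
    input_per_1k: float
    output_per_1k: float


# Hypothetical price table; replace with your provider's current rates.
PRICES = {
    "large-model": ModelPrice(input_per_1k=0.01, output_per_1k=0.03),
    "small-model": ModelPrice(input_per_1k=0.001, output_per_1k=0.002),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Approximate cost of one call in USD."""
    price = PRICES[model]
    return (
        (input_tokens / 1000) * price.input_per_1k
        + (output_tokens / 1000) * price.output_per_1k
    )


cost = estimate_cost("large-model", input_tokens=4000, output_tokens=1000)
print(f"${cost:.2f}")
```

Summing these estimates per run or per day gives a figure you can compare against a money limit.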

How Budgets Relate

Think about a simple agent loop:

  1. Get user input
  2. Call model
  3. Decide to respond or call a tool
  4. Repeat

Each step costs tokens. Each model call takes time. Each tool call might trigger more model calls.

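If you cap steps, you cap the number of model calls. That bounds tokens only as long as each call’s context stays bounded. If you cap time, you limit how many calls can happen at all. If you cap tokens, you cap cost most directly.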
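A step cap and a token cap are not interchangeable when the context grows on every step. The sketch below works through the arithmetic for a loop that resends the whole context each time. All the numbers and the helper name `worst_case_tokens` are assumptions chosen for illustration.

```python
def worst_case_tokens(max_steps: int, base_tokens: int,
                      growth_per_step: int, output_per_step: int) -> int:
    """Upper bound on tokens when the whole context is resent on every step."""
    total = 0
    for step in range(max_steps):
        # Input grows by a fixed amount each step (e.g. one appended tool result)
        input_tokens = base_tokens + step * growth_per_step
        total += input_tokens + output_per_step
    return total


five_steps = worst_case_tokens(5, base_tokens=500, growth_per_step=300, output_per_step=200)
twenty_steps = worst_case_tokens(20, base_tokens=500, growth_per_step=300, output_per_step=200)
print(five_steps, twenty_steps)
```

With these assumed numbers, four times the steps costs roughly eleven times the tokens, which is why it helps to cap tokens as well as steps.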

Designing a “Run Contract” for Each Agent Call

Before an agent starts, define a run contract. This is a simple set of limits:

  • Max steps
  • Max tokens (input + output)
  • Max wall-clock seconds
  • Max nested tool calls

Different use cases need different contracts.

Quick Q&A (Small Budget)

A simple question-answer agent needs:

RunBudget(
    max_steps=5,
    max_tokens=2000,
    max_seconds=10
)

It should answer fast. It shouldn’t call many tools. It shouldn’t think too long.

Deep Analysis (Bigger Budget)

A research agent that synthesizes documents needs:

RunBudget(
    max_steps=20,
    max_tokens=50000,
    max_seconds=60
)

It can take more time. It can use more tokens. It can call more tools.

Back-Office Batch Work (Tight Cost, More Time)

A batch processing agent needs:

RunBudget(
    max_steps=50,
    max_tokens=100000,
    max_seconds=300
)

It can run longer. But you still cap tokens to control cost. Time limits prevent hanging.
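One way to pick a contract per request is a lookup keyed by use case, with a safe default. This is a sketch: the names `CONTRACTS` and `contract_for` are hypothetical, and `RunBudget` here is a minimal stand-in that carries only the three fields used in the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunBudget:
    max_steps: int
    max_tokens: int
    max_seconds: float


# Values copied from the three contracts described above
CONTRACTS = {
    "quick_qa": RunBudget(max_steps=5, max_tokens=2000, max_seconds=10),
    "deep_analysis": RunBudget(max_steps=20, max_tokens=50000, max_seconds=60),
    "batch": RunBudget(max_steps=50, max_tokens=100000, max_seconds=300),
}


def contract_for(use_case: str) -> RunBudget:
    """Return the contract for a use case, falling back to the smallest one."""
    return CONTRACTS.get(use_case, CONTRACTS["quick_qa"])
```

Falling back to the smallest contract means an unrecognized request type can never run with a large budget by accident.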

Implementing a Budget Manager Around the Agent Loop

The common agent loop looks like this:

def agent_loop(user_input):
    context = []
    
    while True:
        # Call model
        response = call_model(user_input, context)
        
        # Decide what to do
        if should_respond(response):
            return response
        elif should_call_tool(response):
            tool_result = call_tool(response.tool_name, response.args)
            context.append(tool_result)
        else:
            # Keep thinking
            context.append(response)

Wrap this loop with a BudgetManager that:

  • Tracks steps taken
  • Tracks tokens used (rough estimates work)
  • Tracks elapsed time
  • Checks limits before each new call
  • Decides what to do when limits are near

BudgetManager Implementation

import time

class BudgetManager:
    def __init__(self, budget: RunBudget):
        self.budget = budget
        self.steps_used = 0
        self.tokens_used = 0
        self.start_time = time.time()
    
    def can_make_call(self, estimated_tokens: int) -> bool:
        if self.steps_used >= self.budget.max_steps:
            return False
        
        if self.tokens_used + estimated_tokens > self.budget.max_tokens:
            return False
        
        elapsed = time.time() - self.start_time
        if elapsed >= self.budget.max_seconds:
            return False
        
        return True
    
    def record_call(self, tokens_used: int):
        self.steps_used += 1
        self.tokens_used += tokens_used
    
    def remaining(self) -> dict:
        elapsed = time.time() - self.start_time
        return {
            "steps": self.budget.max_steps - self.steps_used,
            "tokens": self.budget.max_tokens - self.tokens_used,
            "seconds": self.budget.max_seconds - elapsed
        }
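The list above mentions deciding what to do when limits are near, which the class does not cover. A minimal sketch of that check follows. The function name `budget_status` and the 20% warning threshold are assumptions. It takes plain dicts shaped like the output of `remaining()`.

```python
def budget_status(remaining: dict, limits: dict, warn_fraction: float = 0.2) -> str:
    """Classify a run as 'ok', 'near_limit', or 'exhausted'.

    `remaining` and `limits` use the same keys, e.g. steps, tokens, seconds.
    """
    if any(remaining[key] <= 0 for key in limits):
        return "exhausted"
    if any(remaining[key] <= limits[key] * warn_fraction for key in limits):
        return "near_limit"
    return "ok"


limits = {"steps": 10, "tokens": 5000, "seconds": 30.0}
status_ok = budget_status({"steps": 6, "tokens": 3000, "seconds": 20.0}, limits)
status_near = budget_status({"steps": 2, "tokens": 3000, "seconds": 20.0}, limits)
status_done = budget_status({"steps": 4, "tokens": 0, "seconds": 20.0}, limits)
```

A loop could react to `near_limit` by trimming context or asking the model to wrap up, instead of waiting for a hard stop.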

Now wrap your agent loop:

def agent_loop_with_budget(user_input, budget: RunBudget):
    budget_manager = BudgetManager(budget)
    context = []
    
    while True:
        # Check budget before each call
        estimated_tokens = estimate_tokens(user_input, context)
        
        if not budget_manager.can_make_call(estimated_tokens):
            # Budget exhausted - graceful exit
            return graceful_exit(budget_manager, context)
        
        # Call model
        response = call_model(user_input, context)
        # Count the prompt we sent as well as the response we got back
        tokens_used = estimated_tokens + count_tokens(response)
        budget_manager.record_call(tokens_used)
        
        # Decide what to do
        if should_respond(response):
            return response
        elif should_call_tool(response):
            tool_result = call_tool(response.tool_name, response.args)
            context.append(tool_result)
        else:
            context.append(response)

Graceful Exit

When the budget is almost exhausted, don’t just fail. Ask the model to summarize and finish. This costs one extra model call, so leave a little headroom in the budget for it:

def graceful_exit(budget_manager: BudgetManager, context: list) -> str:
    remaining = budget_manager.remaining()
    
    # Ask model to summarize with remaining budget (clamped so negatives never show)
    summary_prompt = f"""
    You are running out of budget. You have:
    - {max(remaining['steps'], 0)} steps left
    - {max(remaining['tokens'], 0)} tokens left
    - {max(remaining['seconds'], 0):.1f} seconds left
    
    Provide a brief summary of what you've found so far.
    """
    
    summary = call_model(summary_prompt, context)
    return f"Summary (budget limit reached): {summary}"

Or ask the user to confirm a larger run:

def graceful_exit(budget_manager: BudgetManager, context: list) -> str:
    # context is unused here; the signature matches the version above
    return """
    This request requires more resources than your current budget allows.
    Would you like to:
    1. Get a summary of what I found so far
    2. Continue with a larger budget (may take longer and cost more)
    """

Strategies for Staying Within Budget

Here are practical tactics that work:

Aggressive Context Trimming

Keep only relevant context. Don’t pass the entire conversation history. Keep only:

  • The last few messages
  • Relevant tool results
  • Key facts extracted from earlier context

def trim_context(context: list, max_items: int = 5) -> list:
    # Keep only the most recent items
    return context[-max_items:]
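Trimming by item count ignores how large each item is. A variant that trims by estimated tokens is sketched below. The helper name `trim_context_by_tokens` is hypothetical, and it uses the rough heuristic of about 4 characters per token.

```python
def trim_context_by_tokens(context: list[str], max_tokens: int) -> list[str]:
    """Keep the newest items whose combined estimated size fits in max_tokens."""
    kept = []
    used = 0
    for item in reversed(context):            # walk from newest to oldest
        item_tokens = max(1, len(item) // 4)  # rough: 1 token per 4 characters
        if used + item_tokens > max_tokens:
            break
        kept.append(item)
        used += item_tokens
    kept.reverse()                            # restore chronological order
    return kept


history = ["a" * 400, "b" * 200, "c" * 100]
trimmed = trim_context_by_tokens(history, max_tokens=80)
```

Here the oldest item is dropped because including it would push the estimate past the limit, while the two newest items stay in their original order.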

Progressive Summarization

For long histories, summarize older parts:

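def summarize_old_context(old_context: list) -> str:
    # Summarize everything except the last 3 items
    to_summarize = old_context[:-3]
    if not to_summarize:
        # Three items or fewer: nothing old enough to summarize, so skip the model call
        return ""
    summary = call_model(f"Summarize this conversation: {to_summarize}")
    return summary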

Switching to Cheaper Models

Use expensive models for reasoning. Use cheaper models for simple tasks, or when the remaining budget runs low:

def call_model_with_budget(prompt: str, budget_remaining: int):
    if budget_remaining > 10000:
        # Plenty of budget left: use the expensive model
        return call_gpt4(prompt)
    else:
        # Budget is running low: fall back to the cheaper model
        return call_gpt35(prompt)

Capping Retrieved Context

When retrieving documents, limit the size:

def retrieve_documents(query: str, max_tokens: int = 2000):
    results = vector_search(query)
    
    # Cap total context size
    total_tokens = 0
    selected = []
    
    for doc in results:
        doc_tokens = estimate_tokens(doc)
        if total_tokens + doc_tokens > max_tokens:
            break
        selected.append(doc)
        total_tokens += doc_tokens
    
    return selected

Two Patterns

High precision, low depth:

Use expensive models. But limit steps. Get accurate answers quickly. Stop before going deep.

Deeper reasoning, limited scope:

Use cheaper models. Allow more steps. But limit the scope of what you’re reasoning about. Narrow the context.
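The two patterns can be written down as configuration profiles. This is only a sketch: `AgentProfile`, the tier names, and every number below are assumptions chosen to illustrate the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    model_tier: str         # "expensive" or "cheap"; tier names are placeholders
    max_steps: int
    max_context_items: int  # how much context each call may carry


# High precision, low depth: strong model, few steps.
HIGH_PRECISION = AgentProfile(model_tier="expensive", max_steps=3, max_context_items=10)

# Deeper reasoning, limited scope: cheaper model, more steps, narrow context.
DEEP_NARROW = AgentProfile(model_tier="cheap", max_steps=15, max_context_items=3)


def pick_profile(needs_multistep_reasoning: bool) -> AgentProfile:
    """Choose a profile based on whether the task needs many reasoning steps."""
    return DEEP_NARROW if needs_multistep_reasoning else HIGH_PRECISION
```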

Multi-User and Multi-Agent Scenarios

When you have many users or agents, you need shared budgets:

  • Per tenant
  • Per project
  • Per user per day

Simple Approach: Central Quota Service

Each agent checks with a central service before making calls:

class QuotaService:
    def __init__(self):
        self.quotas = {}  # tenant_id -> remaining tokens
    
    def check_quota(self, tenant_id: str, tokens_needed: int) -> bool:
        if tenant_id not in self.quotas:
            self.quotas[tenant_id] = 100000  # Daily quota
        
        return self.quotas[tenant_id] >= tokens_needed
    
    def consume_quota(self, tenant_id: str, tokens_used: int):
        self.quotas[tenant_id] -= tokens_used

Each agent checks before each call:

def agent_loop_with_quota(user_input, tenant_id: str, quota_service: QuotaService):
    # Pass in one shared QuotaService; creating it here would reset the quota on every run
    context = []
    
    while True:
        estimated_tokens = estimate_tokens(user_input, context)
        
        if not quota_service.check_quota(tenant_id, estimated_tokens):
            return "Daily quota exceeded. Please try again tomorrow."
        
        response = call_model(user_input, context)
        tokens_used = count_tokens(response)
        quota_service.consume_quota(tenant_id, tokens_used)
        
        # ... rest of loop
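Checking and consuming in two separate steps leaves a gap in which two concurrent runs can both pass the check. The sketch below combines them into one locked operation and resets the counters when the date changes. `DailyQuota`, `try_reserve`, and `refund` are hypothetical names. The class only works within a single process; a multi-process deployment would need a shared store instead.

```python
import threading
from datetime import date


class DailyQuota:
    """In-memory per-tenant token quota that resets when the date changes."""

    def __init__(self, daily_limit: int = 100_000):
        self.daily_limit = daily_limit
        self._used = {}             # tenant_id -> tokens used today
        self._day = date.today()
        self._lock = threading.Lock()

    def try_reserve(self, tenant_id: str, tokens: int) -> bool:
        """Atomically check and deduct; returns False if the quota is too low."""
        with self._lock:
            today = date.today()
            if today != self._day:  # new day: start counting again
                self._used.clear()
                self._day = today
            used = self._used.get(tenant_id, 0)
            if used + tokens > self.daily_limit:
                return False
            self._used[tenant_id] = used + tokens
            return True

    def refund(self, tenant_id: str, tokens: int) -> None:
        """Give back tokens that were reserved but not actually used."""
        with self._lock:
            self._used[tenant_id] = max(0, self._used.get(tenant_id, 0) - tokens)


quota = DailyQuota(daily_limit=1000)
```

A run would reserve its estimate before the model call and refund the difference once the real token count is known.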

Priorities

Some workloads are critical. Others can wait:

class QuotaService:
    def check_quota(self, tenant_id: str, tokens_needed: int, priority: str = "normal") -> bool:
        if priority == "critical":
            # Always allow critical workloads
            return True
        
        # Check normal quota (unknown tenants start with the daily quota)
        remaining = self.quotas.setdefault(tenant_id, 100000)
        return remaining >= tokens_needed

Graceful Degradation

When quotas are low:

  1. Queue non-urgent tasks
  2. Return partial results with clear messaging
  3. Suggest when to retry

def handle_quota_exhausted(tenant_id: str, request):
    # Queue the request
    queue_task(tenant_id, request)
    
    return {
        "status": "queued",
        "message": "Your request has been queued. We'll process it when quota is available.",
        "estimated_wait": "2 hours"  # Placeholder; compute this from your reset schedule
    }
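The hard-coded wait can be replaced with a computed one. The sketch below assumes the quota resets at midnight UTC, which is an assumption about your setup and not something every system does. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_reset(now: datetime) -> int:
    """Seconds from `now` until the next UTC midnight."""
    now_utc = now.astimezone(timezone.utc)
    next_midnight = (now_utc + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return int((next_midnight - now_utc).total_seconds())


def quota_exhausted_response(now: datetime) -> dict:
    """Build a response that tells the caller when to retry."""
    wait = seconds_until_reset(now)
    hours, minutes = wait // 3600, wait % 3600 // 60
    return {
        "status": "quota_exhausted",
        "retry_after_seconds": wait,
        "message": f"Daily quota reached. Try again in about {hours} h {minutes} min.",
    }


example = quota_exhausted_response(datetime(2024, 1, 1, 21, 30, tzinfo=timezone.utc))
```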

Observability: See Where Tokens and Time Go

You need to track:

  • Tokens per request, per tool, per agent type
  • Latency per step and per run
  • Frequency of budget limit hits

Basic Metrics

Add logging to your budget manager:

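import logging

logger = logging.getLogger(__name__)

class BudgetManager:
    # Only record_call changes; __init__ and the other methods stay as shown earlier
    def record_call(self, tokens_used: int, step_type: str = "model"):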
        self.steps_used += 1
        self.tokens_used += tokens_used
        
        # Log metrics
        logger.info({
            "event": "agent_step",
            "step_type": step_type,
            "tokens_used": tokens_used,
            "total_tokens": self.tokens_used,
            "steps_used": self.steps_used
        })
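The logging above covers tokens and steps but not latency per step. One way to capture it is a small context manager around each call. This is a sketch: `timed_step` and the in-memory `step_timings` list are hypothetical stand-ins for whatever metrics client you use.

```python
import time
from contextlib import contextmanager

step_timings = []  # list of (step_type, seconds); swap for your metrics client


@contextmanager
def timed_step(step_type: str):
    """Record the wall-clock duration of one agent step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the step raises, so failed calls are measured too
        step_timings.append((step_type, time.perf_counter() - start))


with timed_step("model_call"):
    time.sleep(0.01)  # stand-in for the real model call
```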

Simple Spend Table

Build a simple table showing spend by agent:

def generate_spend_report(metrics: list) -> str:
    report = "Agent Spend Report\n"
    report += "=" * 50 + "\n"
    
    by_agent = {}
    for metric in metrics:
        agent = metric["agent_type"]
        if agent not in by_agent:
            by_agent[agent] = {"tokens": 0, "runs": 0}
        
        by_agent[agent]["tokens"] += metric["tokens_used"]
        by_agent[agent]["runs"] += 1
    
    for agent, data in by_agent.items():
        avg_tokens = data["tokens"] / data["runs"]
        report += f"{agent}: {data['tokens']} tokens, {data['runs']} runs, {avg_tokens:.0f} avg\n"
    
    return report

Use this to guide refactors. If one agent type uses too many tokens, optimize it. If budget limits hit too often, adjust budgets.
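To see how often budget limits are hit, you can compute a hit rate per agent type. The sketch below assumes each run is logged as a record with `agent_type` and `hit_limit` fields. That record shape is an assumption, not something the code above produces.

```python
from collections import defaultdict


def budget_hit_rate(runs: list[dict]) -> dict[str, float]:
    """Fraction of runs per agent type that ended by hitting a budget limit."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for run in runs:
        totals[run["agent_type"]] += 1
        if run["hit_limit"]:
            hits[run["agent_type"]] += 1
    return {agent: hits[agent] / totals[agent] for agent in totals}


runs = [
    {"agent_type": "quick_qa", "hit_limit": False},
    {"agent_type": "quick_qa", "hit_limit": True},
    {"agent_type": "research", "hit_limit": True},
    {"agent_type": "research", "hit_limit": True},
]
rates = budget_hit_rate(runs)
```

A rate near 1.0 suggests the budget is too small for that agent type, or that the agent needs optimizing.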

Example: Putting It All Together in Code

Let’s build a small research assistant agent that searches documents and synthesizes an answer. We’ll add budget management.

Run Budget Data Structure

from dataclasses import dataclass
from typing import Optional

@dataclass
class RunBudget:
    max_steps: int
    max_tokens: int
    max_seconds: float
    max_nested_tool_calls: Optional[int] = None

Budget Manager

import time
from typing import Dict

class BudgetManager:
    def __init__(self, budget: RunBudget):
        self.budget = budget
        self.steps_used = 0
        self.tokens_used = 0
        self.start_time = time.time()
        self.nested_tool_calls = 0
    
    def can_make_call(self, estimated_tokens: int) -> bool:
        """Check if we can make another call within budget."""
        if self.steps_used >= self.budget.max_steps:
            return False
        
        if self.tokens_used + estimated_tokens > self.budget.max_tokens:
            return False
        
        elapsed = time.time() - self.start_time
        if elapsed >= self.budget.max_seconds:
            return False
        
        # Compare against None so that a limit of 0 still counts as a limit
        if self.budget.max_nested_tool_calls is not None:
            if self.nested_tool_calls >= self.budget.max_nested_tool_calls:
                return False
        
        return True
    
    def record_call(self, tokens_used: int, is_tool_call: bool = False):
        """Record a model or tool call."""
        self.steps_used += 1
        self.tokens_used += tokens_used
        if is_tool_call:
            self.nested_tool_calls += 1
    
    def remaining(self) -> Dict[str, float]:
        """Get remaining budget (nested_tool_calls is None when uncapped)."""
        elapsed = time.time() - self.start_time
        return {
            "steps": self.budget.max_steps - self.steps_used,
            "tokens": self.budget.max_tokens - self.tokens_used,
            "seconds": self.budget.max_seconds - elapsed,
            "nested_tool_calls": (
                (self.budget.max_nested_tool_calls - self.nested_tool_calls)
                if self.budget.max_nested_tool_calls is not None
                else None
            )
        }
    
    def is_exhausted(self) -> bool:
        """Check if budget is exhausted."""
        remaining = self.remaining()
        return (
            remaining["steps"] <= 0 or
            remaining["tokens"] <= 0 or
            remaining["seconds"] <= 0
        )

Token Estimation Helper

def estimate_tokens(text: str) -> int:
    """
    Rough token estimation.
    In production, use tiktoken or similar for accuracy.
    """
    # Rough estimate: 1 token ≈ 4 characters
    return len(text) // 4

def count_tokens(text: str) -> int:
    """Count tokens in text."""
    return estimate_tokens(text)
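Earlier snippets estimate tokens for a user input plus a context list, which the single-string helper above does not accept. A sketch of a helper for that case follows. The name `estimate_call_tokens` and the fixed per-message overhead of 4 tokens are assumptions.

```python
def estimate_call_tokens(user_input: str, context: list,
                         per_message_overhead: int = 4) -> int:
    """Rough input size of one model call: the user input plus every context item."""
    parts = [user_input] + [str(item) for item in context]
    # About 4 characters per token, plus a small allowance per message
    return sum(len(part) // 4 + per_message_overhead for part in parts)


estimate = estimate_call_tokens("What is RAG?", ["tool result " * 10])
```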

Research Agent with Budget

class ResearchAgent:
    def __init__(self, budget: RunBudget):
        self.budget = budget
    
    def search_documents(self, query: str, max_results: int = 5) -> list:
        """Simulate document search."""
        # In production, this would call a vector database
        return [
            f"Document {i}: Information about {query}"
            for i in range(max_results)
        ]
    
    def synthesize_answer(self, query: str, documents: list, budget_manager: BudgetManager) -> str:
        """Synthesize answer from documents with budget checks."""
        context = "\n".join(documents)
        prompt = f"Question: {query}\n\nContext:\n{context}\n\nAnswer:"
        
        estimated_tokens = estimate_tokens(prompt) + 500  # Estimate response
        
        if not budget_manager.can_make_call(estimated_tokens):
            # Budget exhausted - return summary
            return self._graceful_exit(query, documents, budget_manager)
        
        # Simulate model call
        response = f"Based on the documents, here's what I found about {query}: [synthesized answer]"
        tokens_used = estimate_tokens(prompt) + estimate_tokens(response)
        budget_manager.record_call(tokens_used)
        
        return response
    
    def _graceful_exit(self, query: str, documents: list, budget_manager: BudgetManager) -> str:
        """Handle budget exhaustion gracefully."""
        remaining = budget_manager.remaining()
        
        # Try to provide a brief summary
        if remaining["tokens"] > 100:
            summary_prompt = f"Briefly summarize findings about: {query}"
            summary = f"Summary: Found {len(documents)} relevant documents about {query}"
            tokens_used = estimate_tokens(summary_prompt) + estimate_tokens(summary)
            budget_manager.record_call(tokens_used)
            return f"Budget limit reached. {summary}"
        
        return f"Budget limit reached. Found {len(documents)} documents about {query}."
    
    def answer(self, query: str) -> Dict:
        """Answer a question with budget management."""
        budget_manager = BudgetManager(self.budget)
        
        # Step 1: Search documents
        estimated_search_tokens = estimate_tokens(query) + 1000
        if not budget_manager.can_make_call(estimated_search_tokens):
            return {
                "answer": "Budget exhausted before search.",
                "budget_remaining": budget_manager.remaining()
            }
        
        documents = self.search_documents(query)
        budget_manager.record_call(estimated_search_tokens, is_tool_call=True)
        
        # Step 2: Synthesize answer
        answer = self.synthesize_answer(query, documents, budget_manager)
        
        return {
            "answer": answer,
            "documents_found": len(documents),
            "budget_used": {
                "steps": budget_manager.steps_used,
                "tokens": budget_manager.tokens_used,
                "seconds": time.time() - budget_manager.start_time
            },
            "budget_remaining": budget_manager.remaining()
        }

Configuration Example

Create a YAML config file:

# budgets.yaml
budgets:
  quick_answer:
    max_steps: 5
    max_tokens: 2000
    max_seconds: 10
    max_nested_tool_calls: 2
  
  deep_research:
    max_steps: 20
    max_tokens: 50000
    max_seconds: 60
    max_nested_tool_calls: 10
  
  batch_processing:
    max_steps: 50
    max_tokens: 100000
    max_seconds: 300
    max_nested_tool_calls: 20

Load it in your code:

import yaml

def load_budget_config(config_path: str) -> Dict[str, RunBudget]:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    
    budgets = {}
    for name, params in config["budgets"].items():
        budgets[name] = RunBudget(**params)
    
    return budgets
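A config file can contain typos or zero values that silently disable a limit. A small validation step, sketched below, could run on each entry before `RunBudget` is constructed. The function `validate_budget_params` is hypothetical.

```python
def validate_budget_params(name: str, params: dict) -> None:
    """Raise ValueError if a budget entry lacks a field or has a non-positive limit."""
    required = ("max_steps", "max_tokens", "max_seconds")
    for field in required:
        if field not in params:
            raise ValueError(f"budget '{name}' is missing '{field}'")
        value = params[field]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
            raise ValueError(
                f"budget '{name}': '{field}' must be a positive number, got {value!r}"
            )


# A valid entry passes silently
validate_budget_params("quick_answer", {"max_steps": 5, "max_tokens": 2000, "max_seconds": 10})
```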

Basic Metrics and Logging

Add logging to track usage:

import logging
import json

logger = logging.getLogger(__name__)

class LoggingBudgetManager(BudgetManager):
    """BudgetManager that logs every recorded call. Use it in place of BudgetManager."""
    
    def record_call(self, tokens_used: int, is_tool_call: bool = False, step_type: str = "model"):
        """Record a call with logging."""
        # Reuse the counting logic from the base class
        super().record_call(tokens_used, is_tool_call)
        
        # Log metrics
        logger.info(json.dumps({
            "event": "agent_step",
            "step_type": step_type,
            "tokens_used": tokens_used,
            "total_tokens": self.tokens_used,
            "steps_used": self.steps_used,
            "budget_exhausted": self.is_exhausted()
        }))

Usage Example

# Load budget config
budgets = load_budget_config("budgets.yaml")

# Create agent with quick answer budget
agent = ResearchAgent(budgets["quick_answer"])

# Answer a question
result = agent.answer("What is machine learning?")

print(result["answer"])
print(f"Budget used: {result['budget_used']}")
print(f"Budget remaining: {result['budget_remaining']}")

Summary

Budget management for AI agents doesn’t need to be complex. Start with three simple limits:

  1. Max steps per run
  2. Max tokens per run
  3. Max time per run

Wrap your agent loop with a budget manager. Check limits before each call. Handle exhaustion gracefully.

Track where tokens and time go. Use that data to optimize. Adjust budgets based on what you learn.

The code examples above give you a working foundation. Adapt them to your needs. Start simple. Add complexity only when you need it.
