Budget-Aware AI Agents: Keeping Cost, Tokens, and Latency Under Control
Your agent works. It answers questions. It calls tools. But your bills spike. Latency jumps around. Users see timeouts.
The agent feels smart. But it calls the model too many times. It pulls huge contexts. It chains tools without limit.
This article shows you how to add budgets. Not complex systems. Simple limits that actually work.
The Hidden Problem: Silent Overuse
Agents look smart. They reason. They plan. They adapt. But they also:
- Call the model many times per request
- Pull large contexts into every call
- Chain tools without stopping
- Loop on the same problem
You don’t notice at first. Then bills spike. Latency becomes unstable. Users see intermittent timeouts.
Symptoms
Bills spike with no clear reason:
You check your usage. One agent run used 50,000 tokens. Another used 200,000. You can’t predict which requests will be expensive.
Latency becomes unstable:
Some requests finish in 2 seconds. Others take 30 seconds. Users can’t tell when something is broken or just slow.
Users see intermittent timeouts:
The agent runs too long. Your API times out. The user gets nothing. They retry. It happens again.
Basics of Budgeting for Agents
You need three budgets:
- Token budget - Max tokens per run, per user, or per tenant
- Time budget - Max wall-clock time per run
- Money budget - Approximate cost per run or per day
These budgets relate to each other. More tools and thoughts mean more calls. More calls mean more tokens and time. More tokens mean more cost.
You don’t need perfect precision. Rough estimates work. The goal is to stop runaway costs, not to track every token.
How Budgets Relate
Think about a simple agent loop:
- Get user input
- Call model
- Decide to respond or call a tool
- Repeat
Each step costs tokens. Each model call takes time. Each tool call might trigger more model calls.
If you cap steps, you bound the number of model calls. If you cap tokens, you bound cost most directly, because providers bill per token. If you cap time, you bound latency, and cost only indirectly.
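To make the relationship concrete, here is a rough worst-case estimate for one run. It is a minimal sketch: the `worst_case_cost` helper and the per-1k-token prices are hypothetical, so plug in your provider's real rates.

```python
def worst_case_cost(max_steps: int,
                    input_tokens_per_call: int,
                    output_tokens_per_call: int,
                    usd_per_1k_input: float,
                    usd_per_1k_output: float) -> dict:
    """Upper bound on tokens and dollars if every step uses its full allowance."""
    input_tokens = max_steps * input_tokens_per_call
    output_tokens = max_steps * output_tokens_per_call
    cost = ((input_tokens / 1000) * usd_per_1k_input
            + (output_tokens / 1000) * usd_per_1k_output)
    return {"tokens": input_tokens + output_tokens, "usd": round(cost, 4)}


# Illustrative numbers only: 5 steps, 1500 input and 500 output tokens per call.
estimate = worst_case_cost(
    max_steps=5,
    input_tokens_per_call=1500,
    output_tokens_per_call=500,
    usd_per_1k_input=0.01,   # hypothetical price
    usd_per_1k_output=0.03,  # hypothetical price
)
print(estimate)
```

A step cap times a per-call token allowance gives you a token ceiling. A token ceiling times a price gives you a cost ceiling. That is usually precise enough to pick a first budget.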
Designing a “Run Contract” for Each Agent Call
Before an agent starts, define a run contract. This is a simple set of limits:
- Max steps
- Max tokens (input + output)
- Max seconds (wall-clock time)
- Max nested tool calls
Different use cases need different contracts.
Quick Q&A (Small Budget)
A simple question-answer agent needs:
RunBudget(
    max_steps=5,
    max_tokens=2000,
    max_seconds=10
)
It should answer fast. It shouldn’t call many tools. It shouldn’t think too long.
Deep Analysis (Bigger Budget)
A research agent that synthesizes documents needs:
RunBudget(
    max_steps=20,
    max_tokens=50000,
    max_seconds=60
)
It can take more time. It can use more tokens. It can call more tools.
Back-Office Batch Work (Tight Cost, More Time)
A batch processing agent needs:
RunBudget(
    max_steps=50,
    max_tokens=100000,
    max_seconds=300
)
It can run longer. But you still cap tokens to control cost. Time limits prevent hanging.
Implementing a Budget Manager Around the Agent Loop
The common agent loop looks like this:
def agent_loop(user_input):
    context = []
    while True:
        # Call model
        response = call_model(user_input, context)
        # Decide what to do
        if should_respond(response):
            return response
        elif should_call_tool(response):
            tool_result = call_tool(response.tool_name, response.args)
            context.append(tool_result)
        else:
            # Keep thinking
            context.append(response)
Wrap this loop with a BudgetManager that:
- Tracks steps taken
- Tracks tokens used (rough estimates work)
- Tracks elapsed time
- Checks limits before each new call
- Decides what to do when limits are near
BudgetManager Implementation
import time

class BudgetManager:
    def __init__(self, budget: RunBudget):
        self.budget = budget
        self.steps_used = 0
        self.tokens_used = 0
        self.start_time = time.time()

    def can_make_call(self, estimated_tokens: int) -> bool:
        if self.steps_used >= self.budget.max_steps:
            return False
        if self.tokens_used + estimated_tokens > self.budget.max_tokens:
            return False
        elapsed = time.time() - self.start_time
        if elapsed >= self.budget.max_seconds:
            return False
        return True

    def record_call(self, tokens_used: int):
        self.steps_used += 1
        self.tokens_used += tokens_used

    def remaining(self) -> dict:
        elapsed = time.time() - self.start_time
        return {
            "steps": self.budget.max_steps - self.steps_used,
            "tokens": self.budget.max_tokens - self.tokens_used,
            "seconds": self.budget.max_seconds - elapsed
        }
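The list above mentions acting when limits are near, but `can_make_call` only reports a hard stop. One option is to hold back a small reserve so a final wrap-up call still fits. This is a sketch under assumptions: `TokenLedger` and `reserve_tokens` are hypothetical names, and it tracks tokens only.

```python
from dataclasses import dataclass


@dataclass
class TokenLedger:
    """Hypothetical helper: tracks tokens and holds back a reserve for the wrap-up call."""
    max_tokens: int
    reserve_tokens: int  # held back for the final summary
    tokens_used: int = 0

    def can_continue(self, estimated_tokens: int) -> bool:
        # Normal steps may only spend the non-reserved part of the budget.
        return self.tokens_used + estimated_tokens <= self.max_tokens - self.reserve_tokens

    def can_summarize(self, estimated_tokens: int) -> bool:
        # The wrap-up call is allowed to dip into the reserve.
        return self.tokens_used + estimated_tokens <= self.max_tokens

    def record(self, tokens: int) -> None:
        self.tokens_used += tokens


ledger = TokenLedger(max_tokens=2000, reserve_tokens=300)
ledger.record(1600)
print(ledger.can_continue(200))   # False: 1800 exceeds the 1700 usable tokens
print(ledger.can_summarize(200))  # True: 1800 still fits under the 2000 ceiling
```

The same idea extends to steps and seconds. Pick the reserve from the size of your typical summary call.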
Now wrap your agent loop:
def agent_loop_with_budget(user_input, budget: RunBudget):
    budget_manager = BudgetManager(budget)
    context = []
    while True:
        # Check budget before each call
        estimated_tokens = estimate_tokens(user_input, context)
        if not budget_manager.can_make_call(estimated_tokens):
            # Budget exhausted - graceful exit
            return graceful_exit(budget_manager, context)
        # Call model
        response = call_model(user_input, context)
        # Count the prompt as well as the response; input tokens are billed too
        tokens_used = estimated_tokens + count_tokens(response)
        budget_manager.record_call(tokens_used)
        # Decide what to do
        if should_respond(response):
            return response
        elif should_call_tool(response):
            tool_result = call_tool(response.tool_name, response.args)
            context.append(tool_result)
        else:
            context.append(response)
Graceful Exit
When the budget is almost exhausted, don’t just fail. Ask the model to summarize and finish:
def graceful_exit(budget_manager: BudgetManager, context: list) -> str:
    remaining = budget_manager.remaining()
    # Ask model to summarize with remaining budget
    summary_prompt = f"""
You are running out of budget. You have:
- {remaining['steps']} steps left
- {remaining['tokens']} tokens left
- {remaining['seconds']:.1f} seconds left
Provide a brief summary of what you've found so far.
"""
    # Pass the accumulated context so the model has something to summarize
    summary = call_model(summary_prompt, context)
    return f"Summary (budget limit reached): {summary}"
Or ask the user to confirm a larger run:
def graceful_exit(budget_manager: BudgetManager, context: list) -> str:
    return """
This request requires more resources than your current budget allows.
Would you like to:
1. Get a summary of what I found so far
2. Continue with a larger budget (may take longer and cost more)
"""
Strategies for Staying Within Budget
Here are practical tactics that work:
Aggressive Context Trimming
Keep only relevant context. Don’t pass the entire conversation history. Keep only:
- The last few messages
- Relevant tool results
- Key facts extracted from earlier context
def trim_context(context: list, max_items: int = 5) -> list:
    # Keep only the most recent items
    return context[-max_items:]
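Trimming by item count ignores how large each item is. A variant that trims by estimated tokens might look like this. It is a sketch: `trim_context_to_tokens` is a hypothetical name, it assumes context items are strings, and it uses the rough 4-characters-per-token estimate.

```python
from typing import List


def trim_context_to_tokens(context: List[str], max_tokens: int) -> List[str]:
    """Keep the newest items whose combined estimated size fits in max_tokens."""
    kept: List[str] = []
    total = 0
    for item in reversed(context):        # walk from newest to oldest
        item_tokens = len(item) // 4      # rough estimate: ~4 characters per token
        if total + item_tokens > max_tokens:
            break                         # stop at the first item that does not fit
        kept.append(item)
        total += item_tokens
    kept.reverse()                        # restore chronological order
    return kept


history = ["a" * 400, "b" * 200, "c" * 100]   # roughly 100, 50, and 25 tokens
trimmed = trim_context_to_tokens(history, max_tokens=80)
print([len(item) for item in trimmed])        # [200, 100]
```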
Progressive Summarization
For long histories, summarize older parts:
def summarize_old_context(old_context: list) -> str:
    # Summarize everything except the last 3 items
    to_summarize = old_context[:-3]
    summary = call_model(f"Summarize this conversation: {to_summarize}")
    return summary
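The summary only helps if it replaces the old items in the context. A minimal sketch of that step follows. The `compact_context` name and the injected `summarize` callable are assumptions, and the stand-in summarizer below does not call a real model.

```python
from typing import Callable, List


def compact_context(context: List[str],
                    summarize: Callable[[str], str],
                    keep_recent: int = 3) -> List[str]:
    """Replace everything except the newest keep_recent items with one summary entry."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(context) <= keep_recent:
        return list(context)              # nothing old enough to summarize
    old, recent = context[:-keep_recent], context[-keep_recent:]
    summary = summarize("\n".join(old))
    return [f"Summary of earlier context: {summary}"] + recent


def fake_summarize(text: str) -> str:
    # Stand-in for illustration; a real summarizer would call a (cheap) model.
    return f"{len(text.splitlines())} earlier items"


compacted = compact_context(["m1", "m2", "m3", "m4", "m5"], fake_summarize)
print(compacted)
```

Run this whenever the context grows past a threshold. The summary call costs tokens too, so record it against the budget.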
Switching to Cheaper Models
Use expensive models for reasoning. Use cheaper models for simple tasks:
def call_model_with_budget(prompt: str, budget_remaining: int):
    if budget_remaining > 10000:
        # Use expensive model for complex reasoning
        return call_gpt4(prompt)
    else:
        # Use cheaper model for simple tasks
        return call_gpt35(prompt)
Capping Retrieved Context
When retrieving documents, limit the size:
def retrieve_documents(query: str, max_tokens: int = 2000):
    results = vector_search(query)
    # Cap total context size
    total_tokens = 0
    selected = []
    for doc in results:
        doc_tokens = estimate_tokens(doc)
        if total_tokens + doc_tokens > max_tokens:
            break
        selected.append(doc)
        total_tokens += doc_tokens
    return selected
Two Patterns
High precision, low depth:
Use expensive models. But limit steps. Get accurate answers quickly. Stop before going deep.
Deeper reasoning, limited scope:
Use cheaper models. Allow more steps. But limit the scope of what you’re reasoning about. Narrow the context.
Multi-User and Multi-Agent Scenarios
When you have many users or agents, you need shared budgets:
- Per tenant
- Per project
- Per user per day
Simple Approach: Central Quota Service
Each agent checks with a central service before making calls:
class QuotaService:
    def __init__(self):
        self.quotas = {}  # tenant_id -> remaining tokens

    def check_quota(self, tenant_id: str, tokens_needed: int) -> bool:
        if tenant_id not in self.quotas:
            self.quotas[tenant_id] = 100000  # Daily quota
        return self.quotas[tenant_id] >= tokens_needed

    def consume_quota(self, tenant_id: str, tokens_used: int):
        self.quotas[tenant_id] -= tokens_used
Each agent checks before each call:
def agent_loop_with_quota(user_input, tenant_id: str, quota_service: QuotaService):
    # Share one QuotaService across runs; a fresh instance would reset every quota
    context = []
    while True:
        estimated_tokens = estimate_tokens(user_input, context)
        if not quota_service.check_quota(tenant_id, estimated_tokens):
            return "Daily quota exceeded. Please try again tomorrow."
        response = call_model(user_input, context)
        # Charge the prompt as well as the response
        tokens_used = estimated_tokens + count_tokens(response)
        quota_service.consume_quota(tenant_id, tokens_used)
        # ... rest of loop
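Because `check_quota` and `consume_quota` are separate steps, two agents running at the same time can both pass the check and overspend together. One way around that is to check and deduct in a single locked step, then settle the difference afterwards. This is a sketch: `LockedQuota`, `try_reserve`, and `settle` are hypothetical names, and the lock only protects a single process. A multi-process deployment would need a shared store instead.

```python
import threading


class LockedQuota:
    """Hypothetical in-process quota with an atomic check-and-reserve step."""

    def __init__(self, daily_tokens: int = 100_000):
        self._daily_tokens = daily_tokens
        self._remaining = {}              # tenant_id -> remaining tokens
        self._lock = threading.Lock()

    def try_reserve(self, tenant_id: str, tokens: int) -> bool:
        # Check and deduct under one lock so two callers cannot both pass the check.
        with self._lock:
            remaining = self._remaining.setdefault(tenant_id, self._daily_tokens)
            if remaining < tokens:
                return False
            self._remaining[tenant_id] = remaining - tokens
            return True

    def settle(self, tenant_id: str, reserved: int, actual: int) -> None:
        # Call after a successful try_reserve: refund the unused part, or charge the overrun.
        with self._lock:
            self._remaining[tenant_id] += reserved - actual

    def remaining(self, tenant_id: str) -> int:
        with self._lock:
            return self._remaining.get(tenant_id, self._daily_tokens)


quota = LockedQuota(daily_tokens=1000)
first = quota.try_reserve("acme", 800)    # True: 200 tokens remain
second = quota.try_reserve("acme", 800)   # False: only 200 left
quota.settle("acme", reserved=800, actual=500)
print(first, second, quota.remaining("acme"))  # True False 500
```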
Priorities
Some workloads are critical. Others can wait:
class QuotaService:
    def check_quota(self, tenant_id: str, tokens_needed: int, priority: str = "normal") -> bool:
        if priority == "critical":
            # Always allow critical workloads
            return True
        # Check normal quota (unknown tenants start with the daily quota)
        return self.quotas.setdefault(tenant_id, 100000) >= tokens_needed
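Always allowing critical workloads leaves that path with no cap at all. If that worries you, one alternative is a bounded overdraft. This is a sketch: `allow_request` and `critical_overdraft` are hypothetical, and the 20,000-token default is only an illustration.

```python
def allow_request(remaining: int,
                  tokens_needed: int,
                  priority: str = "normal",
                  critical_overdraft: int = 20_000) -> bool:
    """Normal work stops at zero; critical work may overdraw by a bounded amount."""
    floor = -critical_overdraft if priority == "critical" else 0
    return remaining - tokens_needed >= floor


print(allow_request(remaining=500, tokens_needed=1000))                          # False
print(allow_request(remaining=500, tokens_needed=1000, priority="critical"))     # True
print(allow_request(remaining=500, tokens_needed=50_000, priority="critical"))   # False
```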
Graceful Degradation
When quotas are low:
- Queue non-urgent tasks
- Return partial results with clear messaging
- Suggest when to retry
def handle_quota_exhausted(tenant_id: str, request):
    # Queue the request
    queue_task(tenant_id, request)
    return {
        "status": "queued",
        "message": "Your request has been queued. We'll process it when quota is available.",
        "estimated_wait": "2 hours"
    }
Observability: See Where Tokens and Time Go
You need to track:
- Tokens per request, per tool, per agent type
- Latency per step and per run
- Frequency of budget limit hits
Basic Metrics
Add logging to your budget manager:
class BudgetManager:
    def record_call(self, tokens_used: int, step_type: str):
        self.steps_used += 1
        self.tokens_used += tokens_used
        # Log metrics
        logger.info({
            "event": "agent_step",
            "step_type": step_type,
            "tokens_used": tokens_used,
            "total_tokens": self.tokens_used,
            "steps_used": self.steps_used
        })
Simple Spend Table
Build a simple table showing spend by agent:
def generate_spend_report(metrics: list) -> str:
    report = "Agent Spend Report\n"
    report += "=" * 50 + "\n"
    by_agent = {}
    for metric in metrics:
        agent = metric["agent_type"]
        if agent not in by_agent:
            by_agent[agent] = {"tokens": 0, "runs": 0}
        by_agent[agent]["tokens"] += metric["tokens_used"]
        by_agent[agent]["runs"] += 1
    for agent, data in by_agent.items():
        avg_tokens = data["tokens"] / data["runs"]
        report += f"{agent}: {data['tokens']} tokens, {data['runs']} runs, {avg_tokens:.0f} avg\n"
    return report
Use this to guide refactors. If one agent type uses too many tokens, optimize it. If budget limits hit too often, adjust budgets.
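To see how often limits are hit, you can compute a hit rate per agent type from run-level records. This is a sketch that assumes each record carries an `agent_type` and a boolean `hit_limit` field. Neither field is logged by the code above, so treat both as hypothetical.

```python
from collections import Counter
from typing import Dict, List


def limit_hit_rates(runs: List[dict]) -> Dict[str, float]:
    """Fraction of runs per agent type that ended by hitting a budget limit."""
    total: Counter = Counter()
    hits: Counter = Counter()
    for run in runs:
        total[run["agent_type"]] += 1
        if run["hit_limit"]:
            hits[run["agent_type"]] += 1
    return {agent: hits[agent] / total[agent] for agent in total}


runs = [
    {"agent_type": "quick_answer", "hit_limit": False},
    {"agent_type": "quick_answer", "hit_limit": True},
    {"agent_type": "deep_research", "hit_limit": False},
    {"agent_type": "quick_answer", "hit_limit": False},
]
rates = limit_hit_rates(runs)
print(rates)
```

A high rate suggests the budget is too tight or the agent is wasteful. A rate near zero may mean the budget is looser than it needs to be.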
Example: Putting It All Together in Code
Let’s build a small research assistant agent that searches documents and synthesizes an answer. We’ll add budget management.
Run Budget Data Structure
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunBudget:
    max_steps: int
    max_tokens: int
    max_seconds: float
    max_nested_tool_calls: Optional[int] = None
Budget Manager
import time
from typing import Dict, Optional

class BudgetManager:
    def __init__(self, budget: RunBudget):
        self.budget = budget
        self.steps_used = 0
        self.tokens_used = 0
        self.start_time = time.time()
        self.nested_tool_calls = 0

    def can_make_call(self, estimated_tokens: int) -> bool:
        """Check if we can make another call within budget."""
        if self.steps_used >= self.budget.max_steps:
            return False
        if self.tokens_used + estimated_tokens > self.budget.max_tokens:
            return False
        elapsed = time.time() - self.start_time
        if elapsed >= self.budget.max_seconds:
            return False
        # Compare against None so that a limit of 0 still counts as a limit
        if self.budget.max_nested_tool_calls is not None:
            if self.nested_tool_calls >= self.budget.max_nested_tool_calls:
                return False
        return True

    def record_call(self, tokens_used: int, is_tool_call: bool = False):
        """Record a model or tool call."""
        self.steps_used += 1
        self.tokens_used += tokens_used
        if is_tool_call:
            self.nested_tool_calls += 1

    def remaining(self) -> Dict[str, Optional[float]]:
        """Get remaining budget."""
        elapsed = time.time() - self.start_time
        return {
            "steps": self.budget.max_steps - self.steps_used,
            "tokens": self.budget.max_tokens - self.tokens_used,
            "seconds": self.budget.max_seconds - elapsed,
            "nested_tool_calls": (
                (self.budget.max_nested_tool_calls - self.nested_tool_calls)
                if self.budget.max_nested_tool_calls is not None
                else None
            )
        }

    def is_exhausted(self) -> bool:
        """Check if budget is exhausted."""
        remaining = self.remaining()
        return (
            remaining["steps"] <= 0 or
            remaining["tokens"] <= 0 or
            remaining["seconds"] <= 0
        )
Token Estimation Helper
def estimate_tokens(text: str) -> int:
    """
    Rough token estimation.
    In production, use tiktoken or similar for accuracy.
    """
    # Rough estimate: 1 token ≈ 4 characters
    return len(text) // 4

def count_tokens(text: str) -> int:
    """Count tokens in text."""
    return estimate_tokens(text)
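If you want the tiktoken route mentioned in the docstring, a version with a fallback could look like this. It is a sketch under assumptions: `count_tokens_accurate` is a hypothetical name, and `cl100k_base` is only one of tiktoken's encodings, so pick the one that matches your model. It falls back to the 4-character estimate if tiktoken is missing or its encoding data can't be loaded.

```python
def count_tokens_accurate(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken when available, else use the 4-character heuristic."""
    try:
        import tiktoken  # third-party package: pip install tiktoken
        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # tiktoken is not installed, the encoding data could not be loaded,
        # or the text could not be encoded: fall back to the rough estimate.
        return len(text) // 4


print(count_tokens_accurate("Budget-aware agents check limits before each call."))
```

The exact count depends on the encoding, so no expected output is shown here.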
Research Agent with Budget
class ResearchAgent:
    def __init__(self, budget: RunBudget):
        self.budget = budget

    def search_documents(self, query: str, max_results: int = 5) -> list:
        """Simulate document search."""
        # In production, this would call a vector database
        return [
            f"Document {i}: Information about {query}"
            for i in range(max_results)
        ]

    def synthesize_answer(self, query: str, documents: list, budget_manager: BudgetManager) -> str:
        """Synthesize answer from documents with budget checks."""
        context = "\n".join(documents)
        prompt = f"Question: {query}\n\nContext:\n{context}\n\nAnswer:"
        estimated_tokens = estimate_tokens(prompt) + 500  # Estimate response
        if not budget_manager.can_make_call(estimated_tokens):
            # Budget exhausted - return summary
            return self._graceful_exit(query, documents, budget_manager)
        # Simulate model call
        response = f"Based on the documents, here's what I found about {query}: [synthesized answer]"
        tokens_used = estimate_tokens(prompt) + estimate_tokens(response)
        budget_manager.record_call(tokens_used)
        return response

    def _graceful_exit(self, query: str, documents: list, budget_manager: BudgetManager) -> str:
        """Handle budget exhaustion gracefully."""
        remaining = budget_manager.remaining()
        # Try to provide a brief summary
        if remaining["tokens"] > 100:
            summary_prompt = f"Briefly summarize findings about: {query}"
            summary = f"Summary: Found {len(documents)} relevant documents about {query}"
            tokens_used = estimate_tokens(summary_prompt) + estimate_tokens(summary)
            budget_manager.record_call(tokens_used)
            return f"Budget limit reached. {summary}"
        return f"Budget limit reached. Found {len(documents)} documents about {query}."

    def answer(self, query: str) -> Dict:
        """Answer a question with budget management."""
        budget_manager = BudgetManager(self.budget)
        # Step 1: Search documents
        estimated_search_tokens = estimate_tokens(query) + 1000
        if not budget_manager.can_make_call(estimated_search_tokens):
            return {
                "answer": "Budget exhausted before search.",
                "budget_remaining": budget_manager.remaining()
            }
        documents = self.search_documents(query)
        budget_manager.record_call(estimated_search_tokens, is_tool_call=True)
        # Step 2: Synthesize answer
        answer = self.synthesize_answer(query, documents, budget_manager)
        return {
            "answer": answer,
            "documents_found": len(documents),
            "budget_used": {
                "steps": budget_manager.steps_used,
                "tokens": budget_manager.tokens_used,
                "seconds": time.time() - budget_manager.start_time
            },
            "budget_remaining": budget_manager.remaining()
        }
Configuration Example
Create a YAML config file:
# budgets.yaml
budgets:
  quick_answer:
    max_steps: 5
    max_tokens: 2000
    max_seconds: 10
    max_nested_tool_calls: 2
  deep_research:
    max_steps: 20
    max_tokens: 50000
    max_seconds: 60
    max_nested_tool_calls: 10
  batch_processing:
    max_steps: 50
    max_tokens: 100000
    max_seconds: 300
    max_nested_tool_calls: 20
Load it in your code:
import yaml

def load_budget_config(config_path: str) -> Dict[str, RunBudget]:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    budgets = {}
    for name, params in config["budgets"].items():
        budgets[name] = RunBudget(**params)
    return budgets
Basic Metrics and Logging
Add logging to track usage:
import logging
import json

logger = logging.getLogger(__name__)

class BudgetManager:
    def record_call(self, tokens_used: int, is_tool_call: bool = False, step_type: str = "model"):
        """Record a call with logging."""
        self.steps_used += 1
        self.tokens_used += tokens_used
        if is_tool_call:
            self.nested_tool_calls += 1
        # Log metrics
        logger.info(json.dumps({
            "event": "agent_step",
            "step_type": step_type,
            "tokens_used": tokens_used,
            "total_tokens": self.tokens_used,
            "steps_used": self.steps_used,
            "budget_exhausted": self.is_exhausted()
        }))
Usage Example
# Load budget config
budgets = load_budget_config("budgets.yaml")
# Create agent with quick answer budget
agent = ResearchAgent(budgets["quick_answer"])
# Answer a question
result = agent.answer("What is machine learning?")
print(result["answer"])
print(f"Budget used: {result['budget_used']}")
print(f"Budget remaining: {result['budget_remaining']}")
Summary
Budget management for AI agents doesn’t need to be complex. Start with three simple limits:
- Max steps per run
- Max tokens per run
- Max time per run
Wrap your agent loop with a budget manager. Check limits before each call. Handle exhaustion gracefully.
Track where tokens and time go. Use that data to optimize. Adjust budgets based on what you learn.
The code examples above give you a working foundation. Adapt them to your needs. Start simple. Add complexity only when you need it.