Attention Routing in Multi-Agent Systems: The Next Step in Context-Aware AI Agents
Most multi-agent systems work like a crowded room where everyone’s shouting at once. Agents broadcast messages to everyone, or they follow rigid routing rules that don’t adapt. It’s noisy, inefficient, and agents miss important context.
There’s a better way. Attention routing lets agents focus on the messages that actually matter, similar to how transformer models use attention to prioritize relevant information. Instead of broadcasting everything, agents learn to route messages based on relevance and context.
This article shows you how attention routing works, why it matters, and how to implement it in your own multi-agent systems.
Introduction: From Local Context to Global Awareness
Traditional agent communication has two main problems.
First, there’s broadcasting. Every agent sends every message to every other agent. It’s simple, but it doesn’t scale: the number of message paths grows quadratically. With 10 agents you already have 90 directed sender-receiver pairs; with 100 agents, nearly 10,000. Most of those messages are irrelevant.
Second, there’s fixed routing. Agents follow predefined paths — maybe agent A always talks to agent B, and agent B always talks to agent C. This works for simple workflows, but it breaks when the system needs to adapt. If agent C becomes irrelevant to the current task, agent B still sends messages there.
Attention routing solves both problems. Agents dynamically decide which peers to communicate with based on the current context. They score each potential connection, and only send messages to agents that are likely to be relevant.
The idea comes from transformer models. In a transformer, attention weights determine how much each token should focus on every other token. Attention routing applies the same concept to agent communication. Instead of tokens, we have agents. Instead of token relationships, we have message relevance.
What Is Attention Routing?
Attention routing is a message-passing mechanism where agents prioritize communication with peers based on relevance scores. These scores change over time as the system’s context evolves.
Here’s how it works conceptually:
- An agent needs to send a message
- It computes relevance scores for all potential recipients
- It routes the message only to agents with scores above a threshold
- The scores update based on feedback and context changes
This is different from static routing, where communication paths are fixed. It’s also different from probabilistic routing, where messages are randomly distributed. Attention routing is deterministic but adaptive — it makes smart choices based on current state.
The relevance scoring can use several factors (a small scoring sketch follows the list):
- Semantic similarity between the message and each agent’s current focus
- Historical communication patterns
- Current task context
- Agent capabilities and specializations
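One simple way to combine these signals is a weighted sum on top of cosine similarity. The weights and helper inputs below are illustrative, not taken from any particular framework:

import numpy as np

def relevance_score(message_emb: np.ndarray, context_emb: np.ndarray,
                    history_score: float, task_overlap: float,
                    capability_match: float) -> float:
    """Blend several relevance signals into one score (illustrative weights)."""
    # Semantic similarity between the message and the agent's current context
    semantic = float(np.dot(message_emb, context_emb) /
                     (np.linalg.norm(message_emb) * np.linalg.norm(context_emb) + 1e-9))
    # The other signals are assumed to be pre-computed values in [0, 1]
    return (0.5 * semantic + 0.2 * history_score +
            0.2 * task_overlap + 0.1 * capability_match)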
Hero Diagram: Attention Routing Architecture
Here’s how attention routing works in a multi-agent system:
Attention Routing System

 Agent A (Data Analyst)
 Context: "user behavior analysis"
 ┌─────────────────────────────────────┐
 │ Message: "Need ML help with stats"  │
 └──────────────────┬──────────────────┘
                    │
                    ▼
            ┌───────────────┐
            │  Message Bus  │
            │   Attention   │
            │    Scoring    │
            └───────┬───────┘
                    │
       ┌────────────┼────────────────┐
       │            │                │
       ▼            ▼                ▼
  Score: 0.85  Score: 0.12      Score: 0.03
       │            │                │
       ▼            ✕                ✕
┌────────────┐ ┌────────────┐ ┌────────────┐
│  Agent B   │ │  Agent C   │ │  Agent D   │
│  (ML Eng)  │ │ (Backend)  │ │ (Frontend) │
│ Context:   │ │ Context:   │ │ Context:   │
│ "model     │ │ "API dev"  │ │ "UI work"  │
│  training" │ │            │ │            │
└────────────┘ └────────────┘ └────────────┘

Only Agent B receives the message (score > threshold)
Key Points:
- Agent A sends message to Message Bus
- Bus computes attention scores for all agents
- Only agents with scores above threshold receive message
- Scores based on context similarity and relevance
This diagram shows how a message from the Data Analyst agent gets routed only to the ML Engineer, because their contexts are similar. The Backend and Frontend agents don’t receive it because their attention scores are below the threshold.
Architectural Patterns
There are a few ways to implement attention routing. Here are the most common patterns.
Attention-Gated Message Bus
The message bus acts as a central router. Agents send messages to the bus, and the bus uses attention weights to decide where to forward them.
Agent A → Message Bus → [Attention Scoring] → Agent B, Agent C
The bus maintains a relevance matrix that maps (sender, receiver, context) tuples to attention scores. When a message arrives, the bus:
- Extracts context from the message
- Looks up or computes attention scores for all potential receivers
- Forwards the message only to agents with scores above a threshold
This pattern is centralized, which makes it easier to manage and debug. But it can become a bottleneck if you have many agents.
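Here is a minimal sketch of that pattern, assuming a score_fn callback supplied elsewhere and a simple in-memory relevance cache keyed by (sender, receiver, context); none of these names come from a specific library:

from collections import defaultdict
from typing import Callable, Dict, List, Tuple

class AttentionGatedBus:
    """Centralized bus that scores and forwards messages (sketch)."""
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        # (sender, receiver, context_key) -> cached attention score
        self.relevance: Dict[Tuple[str, str, str], float] = {}
        self.inboxes: Dict[str, List] = defaultdict(list)

    def register(self, agent_id: str):
        self.inboxes[agent_id] = []

    def send(self, sender: str, message: str, context_key: str,
             score_fn: Callable[[str, str, str], float]) -> List[str]:
        """Forward the message only to receivers whose score clears the threshold."""
        recipients = []
        for receiver in list(self.inboxes):
            if receiver == sender:
                continue
            key = (sender, receiver, context_key)
            if key not in self.relevance:
                self.relevance[key] = score_fn(sender, receiver, message)
            if self.relevance[key] >= self.threshold:
                self.inboxes[receiver].append((sender, message))
                recipients.append(receiver)
        return recipients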
Relevance Scoring and Temporal Context Embeddings
Each agent maintains embeddings of its current context. These embeddings capture what the agent is working on, what it knows, and what it needs.
When agent A wants to send a message, it:
- Creates an embedding of the message content
- Compares this embedding to all other agents’ context embeddings
- Computes similarity scores (using cosine similarity or dot product)
- Routes to agents with high similarity
The embeddings update over time. As agents process messages and complete tasks, their context embeddings shift. This means routing decisions adapt automatically.
You can also add temporal context. Recent interactions get higher weights. An agent that just sent you a relevant message is more likely to receive your next message than an agent you haven’t talked to in a while.
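A common way to express that recency effect is an exponentially decaying bonus. This sketch assumes a 30-minute half-life, which is an arbitrary choice you would tune per system:

from datetime import datetime, timedelta

def temporal_weight(last_interaction: datetime, now: datetime,
                    half_life_minutes: float = 30.0) -> float:
    """Bonus that halves every half_life_minutes since the last interaction."""
    age_minutes = (now - last_interaction).total_seconds() / 60.0
    return 0.5 ** (age_minutes / half_life_minutes)

now = datetime.now()
print(temporal_weight(now - timedelta(minutes=10), now))  # ~0.79, recent peer
print(temporal_weight(now - timedelta(hours=2), now))     # ~0.06, stale peer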
Implementation Deep-Dive
Let’s build a simple multi-agent system with attention routing. I’ll show you the core components. One simplification compared to the diagram above: here the sending agent computes the attention scores itself, and the bus handles delivery and bookkeeping. You could just as easily move the scoring into the bus.
Agent Definition
First, we need agents that can maintain context and compute relevance:
import asyncio
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
from datetime import datetime
import numpy as np
from collections import deque
@dataclass
class AgentContext:
"""Represents an agent's current context"""
agent_id: str
current_task: str
capabilities: List[str]
recent_messages: deque = field(default_factory=lambda: deque(maxlen=10))
    context_embedding: Optional[np.ndarray] = None
def update_context(self, task: str, embedding: np.ndarray):
"""Update the agent's context"""
self.current_task = task
self.context_embedding = embedding
class Agent:
def __init__(self, agent_id: str, capabilities: List[str], embedding_dim: int = 128):
self.agent_id = agent_id
self.capabilities = capabilities
self.context = AgentContext(agent_id=agent_id, current_task="", capabilities=capabilities)
self.embedding_dim = embedding_dim
self.message_queue = asyncio.Queue()
self.received_messages = []
# Initialize context embedding randomly (in practice, you'd use a proper embedding model)
self.context.context_embedding = np.random.normal(0, 0.1, embedding_dim)
    def create_message_embedding(self, message: str) -> np.ndarray:
        """Create a deterministic embedding for a message"""
        # In practice, use a proper embedding model like sentence-transformers.
        # For this example, we use a hashed bag-of-words: each word is hashed
        # into a bucket, so messages that share words get similar vectors and
        # the same message always maps to the same embedding.
        import hashlib
        embedding = np.zeros(self.embedding_dim, dtype=np.float32)
        for word in message.lower().split():
            digest = hashlib.sha256(word.encode()).digest()
            bucket = int.from_bytes(digest[:4], "little") % self.embedding_dim
            embedding[bucket] += 1.0
        # Normalize to unit length so dot products behave like cosine similarity
        norm = np.linalg.norm(embedding)
        if norm > 0:
            embedding = embedding / norm
        return embedding
def compute_relevance_scores(
self,
message_embedding: np.ndarray,
other_agents: Dict[str, 'Agent'],
temperature: float = 1.0
) -> Dict[str, float]:
"""Compute attention scores for all other agents"""
scores = {}
for agent_id, agent in other_agents.items():
if agent_id == self.agent_id:
continue
# Get agent's context embedding
agent_embedding = agent.context.context_embedding
# Compute cosine similarity
dot_product = np.dot(message_embedding, agent_embedding)
norm_product = np.linalg.norm(message_embedding) * np.linalg.norm(agent_embedding)
if norm_product > 0:
similarity = dot_product / norm_product
else:
similarity = 0.0
# Add temporal bonus for recent interactions
temporal_bonus = 0.0
if agent_id in [msg['from'] for msg in self.context.recent_messages]:
temporal_bonus = 0.1
scores[agent_id] = similarity + temporal_bonus
# Apply softmax to get attention weights
if scores:
score_values = np.array(list(scores.values()))
# Apply temperature scaling
score_values = score_values / temperature
# Softmax
exp_scores = np.exp(score_values - np.max(score_values))
attention_weights = exp_scores / np.sum(exp_scores)
# Map back to agent IDs
attention_dict = {}
for i, agent_id in enumerate(scores.keys()):
attention_dict[agent_id] = float(attention_weights[i])
return attention_dict
return {}
async def send_message(
self,
message: str,
message_bus: 'MessageBus',
threshold: float = 0.1
):
"""Send a message using attention routing"""
message_embedding = self.create_message_embedding(message)
# Get all other agents from the message bus
other_agents = {
aid: agent for aid, agent in message_bus.agents.items()
if aid != self.agent_id
}
# Compute relevance scores
attention_scores = self.compute_relevance_scores(message_embedding, other_agents)
# Route to agents above threshold
recipients = [
agent_id for agent_id, score in attention_scores.items()
if score >= threshold
]
# Send message through bus
await message_bus.route_message(
sender_id=self.agent_id,
message=message,
recipients=recipients,
attention_scores=attention_scores
)
async def process_messages(self):
"""Process incoming messages"""
while True:
try:
message = await asyncio.wait_for(self.message_queue.get(), timeout=1.0)
self.received_messages.append(message)
# Update context based on received message
self.context.recent_messages.append({
'from': message['sender_id'],
'content': message['message'],
'timestamp': datetime.now()
})
# Update context embedding (simplified - in practice, use proper model)
message_embedding = self.create_message_embedding(message['message'])
# Moving average update
alpha = 0.1
self.context.context_embedding = (
(1 - alpha) * self.context.context_embedding +
alpha * message_embedding
)
print(f"Agent {self.agent_id} received: {message['message']} from {message['sender_id']}")
except asyncio.TimeoutError:
continue
Message Bus with Attention-Weighted Routing
The message bus handles routing and maintains the communication graph:
@dataclass
class Message:
sender_id: str
message: str
recipients: List[str]
attention_scores: Dict[str, float]
timestamp: datetime = field(default_factory=datetime.now)
class MessageBus:
def __init__(self):
self.agents: Dict[str, Agent] = {}
self.message_history: List[Message] = []
self.communication_graph: Dict[str, Dict[str, float]] = {}
def register_agent(self, agent: Agent):
"""Register an agent with the message bus"""
self.agents[agent.agent_id] = agent
self.communication_graph[agent.agent_id] = {}
async def route_message(
self,
sender_id: str,
message: str,
recipients: List[str],
attention_scores: Dict[str, float]
):
"""Route a message to recipients based on attention scores"""
msg = Message(
sender_id=sender_id,
message=message,
recipients=recipients,
attention_scores=attention_scores
)
self.message_history.append(msg)
# Update communication graph
if sender_id not in self.communication_graph:
self.communication_graph[sender_id] = {}
for recipient_id in recipients:
# Update edge weight (cumulative attention)
if recipient_id not in self.communication_graph[sender_id]:
self.communication_graph[sender_id][recipient_id] = 0.0
self.communication_graph[sender_id][recipient_id] += attention_scores.get(recipient_id, 0.0)
# Deliver message to recipient
if recipient_id in self.agents:
await self.agents[recipient_id].message_queue.put({
'sender_id': sender_id,
'message': message,
'attention_score': attention_scores.get(recipient_id, 0.0),
'timestamp': datetime.now()
})
print(f"Message from {sender_id} routed to {len(recipients)} agents: {recipients}")
def get_communication_graph(self) -> Dict[str, Dict[str, float]]:
"""Get the current communication graph"""
return self.communication_graph.copy()
Pseudocode for Attention Scoring
Here’s the core attention scoring algorithm:
function compute_attention_scores(message, agents):
message_embedding = embed(message)
scores = {}
for each agent in agents:
agent_embedding = agent.context_embedding
similarity = cosine_similarity(message_embedding, agent_embedding)
// Add temporal bonus
if agent in recent_interactions:
similarity += temporal_bonus
scores[agent.id] = similarity
// Apply softmax with temperature
attention_weights = softmax(scores / temperature)
return attention_weights
The softmax ensures that attention weights sum to 1, making them interpretable as probabilities. Temperature controls how sharp the distribution is — lower temperature means more focused routing, higher temperature means more uniform distribution.
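Here’s a quick way to see the temperature effect on a toy score vector (outputs are approximate):

import numpy as np

def softmax_with_temperature(scores: np.ndarray, temperature: float) -> np.ndarray:
    scaled = scores / temperature
    exp_scores = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    return exp_scores / exp_scores.sum()

raw = np.array([0.8, 0.1, 0.0])
print(softmax_with_temperature(raw, 1.0))  # ~[0.51, 0.26, 0.23], fairly spread out
print(softmax_with_temperature(raw, 0.5))  # ~[0.69, 0.17, 0.14], sharper and more focused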
Practical Example
Let’s build a simple simulation with multiple agents working on different tasks:
async def run_simulation():
"""Run a multi-agent simulation with attention routing"""
# Create message bus
bus = MessageBus()
# Create agents with different specializations
agents = [
Agent("data_analyst", ["data_analysis", "statistics"]),
Agent("ml_engineer", ["machine_learning", "model_training"]),
Agent("backend_dev", ["api", "database"]),
Agent("frontend_dev", ["ui", "react"]),
Agent("devops", ["deployment", "monitoring"]),
]
# Register agents
for agent in agents:
bus.register_agent(agent)
# Start message processing
asyncio.create_task(agent.process_messages())
# Set initial contexts
agents[0].context.update_context(
"Analyzing user behavior data",
agents[0].create_message_embedding("user behavior data analysis statistics")
)
agents[1].context.update_context(
"Training recommendation model",
agents[1].create_message_embedding("machine learning model training recommendation")
)
agents[2].context.update_context(
"Optimizing database queries",
agents[2].create_message_embedding("database query optimization api")
)
agents[3].context.update_context(
"Building dashboard UI",
agents[3].create_message_embedding("react dashboard ui frontend")
)
agents[4].context.update_context(
"Setting up monitoring",
agents[4].create_message_embedding("deployment monitoring infrastructure")
)
# Simulate communication
await asyncio.sleep(0.5)
# Data analyst needs ML help
await agents[0].send_message(
"I need help with statistical modeling for user behavior",
bus,
threshold=0.15
)
await asyncio.sleep(0.5)
# ML engineer responds and asks for data
await agents[1].send_message(
"I can help with that. Can you share the dataset?",
bus,
threshold=0.15
)
await asyncio.sleep(0.5)
# Backend dev asks about API requirements
await agents[2].send_message(
"What API endpoints do we need for the dashboard?",
bus,
threshold=0.15
)
await asyncio.sleep(0.5)
# Frontend dev responds
await agents[3].send_message(
"We need user stats and recommendation endpoints",
bus,
threshold=0.15
)
await asyncio.sleep(1.0)
# Print communication graph
print("\n=== Communication Graph ===")
graph = bus.get_communication_graph()
for sender, recipients in graph.items():
for recipient, weight in recipients.items():
if weight > 0:
print(f"{sender} -> {recipient}: {weight:.3f}")
return bus, agents
# Run the simulation
if __name__ == "__main__":
bus, agents = asyncio.run(run_simulation())
In this simulation, agents only communicate with peers whose contexts score above the threshold for the message. With the toy hashed embeddings the routing is only a rough approximation; with a real embedding model, the data analyst’s message about statistical modeling would be routed to the ML engineer, not the frontend developer. Because context embeddings update as messages flow, the attention mechanism picks up these relationships automatically.
Visualization of Communication Graph
Here’s a simple visualization function:
import matplotlib.pyplot as plt
import networkx as nx
def visualize_communication_graph(message_bus: MessageBus, threshold: float = 0.1):
"""Visualize the communication graph"""
G = nx.DiGraph()
graph = message_bus.get_communication_graph()
# Add nodes
for agent_id in message_bus.agents.keys():
G.add_node(agent_id)
# Add edges with weights
for sender, recipients in graph.items():
for recipient, weight in recipients.items():
if weight >= threshold:
G.add_edge(sender, recipient, weight=weight)
# Layout
pos = nx.spring_layout(G, k=1, iterations=50)
# Draw nodes
nx.draw_networkx_nodes(G, pos, node_color='lightblue',
node_size=2000, alpha=0.9)
# Draw edges with width proportional to weight
edges = G.edges()
weights = [G[u][v]['weight'] for u, v in edges]
nx.draw_networkx_edges(G, pos, width=[w*5 for w in weights],
alpha=0.6, edge_color='gray', arrows=True)
# Draw labels
nx.draw_networkx_labels(G, pos, font_size=10, font_weight='bold')
# Draw edge labels
edge_labels = {(u, v): f"{G[u][v]['weight']:.2f}"
for u, v in edges}
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=8)
plt.title("Agent Communication Graph (Attention-Weighted)")
plt.axis('off')
plt.tight_layout()
plt.show()
# Use it
# visualize_communication_graph(bus, threshold=0.1)
This creates a directed graph where edge thickness represents attention weights. You can see which agents communicate frequently and how strong those connections are.
Performance and Scaling
Attention routing has clear performance benefits, but it also introduces overhead. Let’s look at the tradeoffs.
Reduced Message Volume
In a broadcast system with N agents, each message creates N-1 deliveries. With attention routing, messages only go to relevant agents. If the average relevance threshold filters out 70% of agents, you reduce message volume by 70%.
This matters for network bandwidth, CPU usage, and agent processing time. Agents spend less time filtering irrelevant messages and more time on actual work.
But there’s a catch: computing attention scores takes time. For each message, you need to:
- Create a message embedding
- Compare it to all agent context embeddings
- Compute similarity scores
- Apply softmax
With N agents, that’s O(N) operations per message. If you’re sending many messages, this overhead can add up.
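One easy win is to batch that work: stack all context embeddings into a matrix and compute every similarity with a single matrix-vector product instead of a Python loop. A sketch, assuming the embeddings are collected into context_matrix:

import numpy as np

def batched_similarities(message_embedding: np.ndarray,
                         context_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity of one message against all agent contexts at once.

    context_matrix has shape (num_agents, embedding_dim).
    """
    msg = message_embedding / (np.linalg.norm(message_embedding) + 1e-9)
    ctx = context_matrix / (np.linalg.norm(context_matrix, axis=1, keepdims=True) + 1e-9)
    return ctx @ msg  # shape: (num_agents,)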
Improved Coordination
The real benefit isn’t just reduced volume — it’s better coordination. Agents focus on relevant peers, which means:
- Faster problem-solving (agents find the right collaborators quickly)
- Better context sharing (agents receive information they can actually use)
- Reduced noise (agents aren’t distracted by irrelevant messages)
You can measure this with context coherence metrics. Track how often agents receive messages that are actually relevant to their current task. With attention routing, this should be higher than with broadcasting.
Scaling Considerations
As the number of agents grows, attention computation becomes expensive. Here are some strategies:
Hierarchical Routing: Group agents into clusters. First route between clusters, then within clusters. With roughly sqrt(N) clusters of sqrt(N) agents each, a message is scored against sqrt(N) clusters plus sqrt(N) agents inside the chosen cluster, reducing per-message computation from O(N) to O(sqrt(N)).
Caching: Cache attention scores for similar messages. If two messages have similar embeddings, reuse the routing decision.
Approximate Similarity: Use approximate nearest neighbor search (like LSH or FAISS) instead of computing exact similarities for all agents.
Sparse Attention: Only compute attention for a subset of agents, then expand if needed. Start with the top-K most relevant agents from a quick approximation.
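As a concrete example of the last two ideas, here’s a sketch that keeps only the top-K candidates and caches routing decisions by message text; a production system would more likely key the cache on a bucketed embedding, but the shape is the same:

import numpy as np
from typing import Dict, List

def top_k_recipients(similarities: np.ndarray, agent_ids: List[str], k: int = 3) -> List[str]:
    """Consider only the k most relevant agents before applying the threshold."""
    top = np.argsort(similarities)[::-1][:k]
    return [agent_ids[i] for i in top]

routing_cache: Dict[str, List[str]] = {}

def cached_route(message: str, compute_route) -> List[str]:
    """Reuse the routing decision for a message we've already scored."""
    if message not in routing_cache:
        routing_cache[message] = compute_route(message)
    return routing_cache[message]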
Best Practices and Pitfalls
Here are some things to watch out for when implementing attention routing.
Overfitting Relevance Models
If your relevance scoring is too specific, agents might miss important connections. For example, if the data analyst’s embedding is too narrow, it might never route to the backend developer, even when they need to coordinate on data access.
Solution: Use broader context embeddings. Include not just the current task, but also the agent’s capabilities, recent work, and general domain knowledge.
Also, add some randomness or exploration. Even if an agent has low relevance, occasionally route to them anyway. This helps discover new connections.
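A small epsilon of exploration is enough. This sketch routes to everyone above the threshold, plus the occasional low-scoring agent (the 5% exploration rate is an arbitrary starting point):

import random
from typing import Dict, List

def recipients_with_exploration(attention_scores: Dict[str, float],
                                threshold: float, epsilon: float = 0.05) -> List[str]:
    recipients = [aid for aid, score in attention_scores.items() if score >= threshold]
    for aid, score in attention_scores.items():
        if score < threshold and random.random() < epsilon:
            recipients.append(aid)  # exploratory delivery to discover new connections
    return recipients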
Handling Sparse Communication Graphs
In some systems, agents might not communicate for long periods. Their context embeddings drift, and attention scores become stale.
Solution: Implement decay. Reduce attention scores over time if there’s no communication. This makes the system forget old connections and focus on current ones.
You can also use periodic “heartbeat” messages that update context embeddings even when there’s no active task. This keeps the communication graph fresh.
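Decay is easy to bolt onto the communication graph from the implementation above; run something like this once per routing round (the 0.95 factor is illustrative):

from typing import Dict

def decay_graph(communication_graph: Dict[str, Dict[str, float]],
                factor: float = 0.95) -> None:
    """Shrink every edge weight so unreinforced connections fade toward zero."""
    for sender, edges in communication_graph.items():
        for receiver in edges:
            edges[receiver] *= factor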
Threshold Tuning
The attention threshold determines how selective routing is. Too high, and agents become isolated. Too low, and you’re back to broadcasting.
Solution: Make thresholds adaptive. Start with a moderate threshold, then adjust based on system performance. If agents are missing important messages, lower the threshold. If there’s too much noise, raise it.
You can also use different thresholds for different message types. Urgent messages might use a lower threshold to ensure delivery.
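Here’s one way to sketch that feedback loop, assuming you can measure what fraction of delivered messages recipients actually found useful (that feedback signal is the assumption here):

def adjust_threshold(current: float, useful_fraction: float,
                     target: float = 0.8, step: float = 0.01,
                     low: float = 0.05, high: float = 0.5) -> float:
    """Nudge the routing threshold based on how useful delivered messages were."""
    if useful_fraction < target:
        return min(high, current + step)  # too much noise: be more selective
    return max(low, current - step)       # plenty of signal: allow more through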
Embedding Quality
The quality of your embeddings directly affects routing quality. Bad embeddings mean bad routing decisions.
Solution: Use proper embedding models. For text messages, use sentence transformers or similar. For structured data, use domain-specific encoders. Don’t rely on simple hash-based embeddings in production.
Also, fine-tune embeddings on your specific domain. Pre-trained models are good starting points, but they might not capture your system’s specific context.
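If you adopt the sentence-transformers package, swapping it in is a small change. The model name below is just a common default, and its 384-dimensional output means embedding_dim must be set to match:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

class TextEmbedder:
    """Wraps a pretrained sentence-embedding model for messages and contexts."""
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)  # produces 384-dim vectors

    def embed(self, text: str):
        # Normalized vectors make the dot product equal to cosine similarity
        return self.model.encode(text, normalize_embeddings=True)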
Future Trends
Attention routing is still evolving. Here are some directions it’s heading.
Integration with Retrieval-Augmented Agents
Retrieval-augmented generation (RAG) lets agents access external knowledge bases. Attention routing can help agents find the right knowledge sources.
Instead of routing to other agents, route to knowledge bases or document stores. The attention mechanism determines which documents are most relevant to the current task.
This combines the benefits of RAG with the efficiency of attention routing. Agents get relevant information faster, and knowledge bases receive fewer irrelevant queries.
Federated Attention for Distributed Intelligence
In distributed systems, agents might run on different machines or networks. Attention routing can work across these boundaries.
The key is federated attention computation. Each node computes attention scores for its local agents, then shares summaries with other nodes. This lets the system route messages efficiently even when agents are geographically distributed.
This is especially useful for edge computing scenarios, where agents run on different devices but need to coordinate.
Learned Routing Policies
Instead of computing attention from scratch each time, you can train routing policies. Use reinforcement learning to learn which routing decisions lead to better outcomes.
The policy takes the current state (message, agent contexts, history) and outputs routing decisions. Over time, it learns patterns like “data analyst messages usually go to ML engineer” or “urgent messages should use lower thresholds.”
This reduces computation (no need to compute embeddings and similarities) and can improve routing quality as the policy learns.
Multi-Modal Attention
Most current systems focus on text messages. But agents might communicate with images, structured data, or other modalities.
Multi-modal attention routing uses embeddings that work across different data types. A text message about “user dashboard” might route to an agent working on a dashboard image, even though the modalities are different.
This requires multi-modal embedding models, but it opens up new possibilities for agent communication.
Conclusion
Attention routing makes multi-agent systems smarter. Instead of broadcasting everything or following rigid paths, agents focus on relevant communication. This reduces noise, improves coordination, and scales better.
The core idea is simple: use attention mechanisms (like in transformers) to determine which agents should communicate. But the implementation details matter. Good embeddings, appropriate thresholds, and proper scaling strategies are all important.
If you’re building a multi-agent system, consider attention routing. Start simple — compute relevance scores based on context embeddings, route to agents above a threshold. Then iterate based on what you learn.
The field is still evolving. Integration with RAG, federated systems, and learned policies are all active areas of research. But the foundation is solid, and the benefits are real.
Give it a try. You might find that your agents work better when they can focus on what matters.