By Ali Elborey

Safe Rollouts for AI Agents: Shadow Mode, Progressive Delivery, and Kill Switches

AI Agents, DevOps, Feature Flags, MLOps, Production, Rollouts

You deploy a new agent version. It works in testing. But production is different. Real users. Real edge cases. Real consequences.

The agent starts making weird decisions. It calls tools it shouldn’t. It costs more than expected. It violates policies. You need to roll back, but you can’t. You deployed everything at once.

Agents change behavior without code changes. Models update. Prompts drift. Tools evolve. Small changes can break user experience, compliance, or cost controls.

This article shows how to roll out agent versions safely. Shadow mode. Canary traffic. Quality gates. Kill switches. These patterns come from DevOps and feature flags, adapted for agents.

The problem: agents change behavior without code changes

Traditional services are predictable. Same code, same behavior. Deploy new code, get new behavior. It’s straightforward.

Agents are different. They depend on:

  • Models: The underlying LLM might update. Same prompt, different output.
  • Prompts: Small wording changes can change decisions.
  • Tools: New tools or tool changes affect what the agent can do.
  • Policies: Safety rules and constraints evolve.

A small change can have a big impact.

User experience breaks

You update a prompt to be more helpful. The agent starts being too verbose. Users get frustrated. Support tickets spike.

Or the opposite: you make it more concise. The agent stops including important context. Users get confused.

Compliance violations

The agent starts accessing data it shouldn’t. It calls tools that violate privacy rules. It logs sensitive information.

These aren’t code bugs. They’re behavior changes from prompt or model updates.

Cost blow-ups

A new model version is more thorough. It makes more tool calls. Each call costs money. Your bill doubles overnight.

Or the agent gets stuck in loops. It keeps calling the same expensive API. One request costs hundreds of dollars.

Why “big bang” deployments are unsafe

Deploying everything at once means:

  • All users see the new version immediately
  • If it breaks, everyone is affected
  • Rollback is slow and risky
  • You can’t compare old vs new behavior

You’re flying blind. You don’t know if the new version is better or worse until it’s too late.

Mindset: treat agents like risky features, not simple services

Agents are more like feature flags than microservices. They have behavior that can change unpredictably. They need gradual rollouts and quick kill switches.

Think of each agent version as a feature. Test it in shadow mode. Roll it out gradually. Monitor metrics. Have a kill switch ready.

Versioning agents as first-class artifacts

Before you can roll out safely, you need to version agents properly. An agent version isn’t just code. It’s a combination of:

  • Prompt: The system prompt and instructions
  • Tools: Which tools are available and their configurations
  • Config: Model settings, temperature, max tokens
  • Model: Which model version to use

All of these together define an agent version. If any of them changes, it's a new version.

What is an “agent version”?

An agent version is a snapshot of:

version: "v1.3.2"
model:
  provider: "openai"
  name: "gpt-4"
  temperature: 0.7
  max_tokens: 2000
prompt:
  system: "You are a helpful customer support agent..."
  instructions: "Always be polite and concise."
tools:
  - name: "search_database"
    enabled: true
  - name: "send_email"
    enabled: true
  - name: "delete_user"
    enabled: false
policies:
  max_steps: 10
  max_cost_per_request: 1.00
  allowed_tools: ["search_database", "send_email"]

This is version v1.3.2. Change the prompt? That’s v1.3.3. Change the model? That’s v1.4.0. Change tools? That’s v2.0.0 (if breaking).

How to store versions

You have a few options:

Option 1: Git repo + config files

Store agent configs in your repo. Each version is a YAML or JSON file. Tag releases in git.

agents/
  support-agent/
    v1.3.2.yaml
    v1.3.3.yaml
    v1.4.0.yaml

Pros: Simple, versioned with code, easy to review in PRs.

Cons: Requires code deployment to change configs.

Option 2: Agent registry / metadata store

Store agent metadata in a database or config service. Version numbers are keys. Configs are values.

agent_registry:
  support-agent:
    v1.3.2: { config... }
    v1.3.3: { config... }

Pros: Can change configs without code deployment. Easy to query and compare versions.

Cons: More infrastructure. Need to keep registry in sync with code.

For most teams, start with git + config files. It’s simple and works well. Move to a registry later if you need dynamic config changes.

Naming and tagging versions

Use semantic versioning:

  • Major (v2.0.0): Breaking changes. Removed or incompatible tools, major prompt rewrites.
  • Minor (v1.4.0): New features. Added tools, prompt improvements, model upgrades.
  • Patch (v1.3.3): Bug fixes. Small prompt tweaks, config adjustments.

Add suffixes for special versions:

  • v1.3.2-shadow: Running in shadow mode
  • v1.3.2-canary: Running in canary mode
  • v1.3.2-rollback: Previous version, ready for rollback

This makes it clear what each version is doing.
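
If you want to enforce this convention in code, a small parser helps keep tags consistent. This is a minimal sketch; the parse_version helper and the suffix list are assumptions for illustration, not part of any library.

# src/version_tags.py (illustrative helper)
import re
from typing import NamedTuple, Optional

TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:-(shadow|canary|rollback))?$")

class AgentVersionTag(NamedTuple):
    major: int
    minor: int
    patch: int
    suffix: Optional[str]  # "shadow", "canary", "rollback", or None

def parse_version(tag: str) -> AgentVersionTag:
    """Parse a tag like 'v1.3.2-shadow' into its components."""
    match = TAG_PATTERN.match(tag)
    if not match:
        raise ValueError(f"Invalid agent version tag: {tag}")
    major, minor, patch, suffix = match.groups()
    return AgentVersionTag(int(major), int(minor), int(patch), suffix)

# parse_version("v1.3.2-shadow") -> AgentVersionTag(major=1, minor=3, patch=2, suffix="shadow")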

Why this matters for rollbacks and incident analysis

When something breaks, you need to know:

  • Which version was running?
  • What changed from the previous version?
  • Can you roll back quickly?

Proper versioning makes this easy. You can see exactly what changed. You can roll back to a known good version. You can analyze incidents with full context.

Code example: Agent config and version loader

Here’s a simple implementation:

# src/agent_config.py
import yaml
from pathlib import Path
from typing import Dict, Any, Optional

class AgentConfig:
    """Agent configuration with versioning."""
    
    def __init__(self, config: Dict[str, Any]):
        self.version = config["version"]
        self.model = config["model"]
        self.prompt = config["prompt"]
        self.tools = config["tools"]
        self.policies = config.get("policies", {})
    
    @classmethod
    def load(cls, agent_name: str, version: str) -> "AgentConfig":
        """Load agent config from file."""
        config_path = Path(f"agents/{agent_name}/{version}.yaml")
        with open(config_path, "r") as f:
            config = yaml.safe_load(f)
        return cls(config)
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert config to dictionary."""
        return {
            "version": self.version,
            "model": self.model,
            "prompt": self.prompt,
            "tools": self.tools,
            "policies": self.policies
        }

# Example config file: agents/support-agent/v1.3.2.yaml
# version: "v1.3.2"
# model:
#   provider: "openai"
#   name: "gpt-4"
#   temperature: 0.7
# prompt:
#   system: "You are a helpful customer support agent."
# tools:
#   - name: "search_database"
#     enabled: true
# policies:
#   max_steps: 10

# src/agent_factory.py
from .agent_config import AgentConfig
from .agent import Agent

def create_agent(agent_name: str, version: str) -> Agent:
    """Create an agent instance from a versioned config."""
    config = AgentConfig.load(agent_name, version)
    
    # Initialize agent with config
    agent = Agent(
        model_provider=config.model["provider"],
        model_name=config.model["name"],
        system_prompt=config.prompt["system"],
        tools=[t["name"] for t in config.tools if t["enabled"]],
        max_steps=config.policies.get("max_steps", 10)
    )
    
    return agent

# Usage
agent_v1 = create_agent("support-agent", "v1.3.2")
agent_v2 = create_agent("support-agent", "v1.3.3")

This gives you versioned agents you can load and compare.
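
Because each version loads as a plain dictionary, comparing two versions takes only a few lines. The diff_versions helper below is illustrative, assuming the AgentConfig class from the example above.

# src/config_diff.py (illustrative; builds on the AgentConfig example above)
from typing import Any, Dict
from .agent_config import AgentConfig

def diff_versions(agent_name: str, old: str, new: str) -> Dict[str, Dict[str, Any]]:
    """Return the top-level fields that differ between two agent versions."""
    old_cfg = AgentConfig.load(agent_name, old).to_dict()
    new_cfg = AgentConfig.load(agent_name, new).to_dict()
    
    changed = {}
    for key in set(old_cfg) | set(new_cfg):
        if key == "version":
            continue  # the version string always differs
        if old_cfg.get(key) != new_cfg.get(key):
            changed[key] = {"old": old_cfg.get(key), "new": new_cfg.get(key)}
    return changed

# Example: see exactly what changed between current and candidate
# diff_versions("support-agent", "v1.3.2", "v1.3.3")

This is the view you want during an incident: what exactly changed between the version that worked and the one that didn't.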

Shadow mode: learning without risk

Shadow mode runs a new agent version alongside the current one. The new version processes the same requests, but its responses aren’t shown to users. You compare outputs and learn about differences.

It’s like A/B testing, but the B version is invisible to users.

Concept

Here’s how it works:

  1. User sends a request
  2. Current agent processes it and returns response to user
  3. New agent processes the same request in the background
  4. Both responses are logged and compared
  5. User only sees the current agent’s response

The new agent learns from real traffic without affecting users.

Use cases

Compare decisions

See how the new agent makes different decisions. Does it call different tools? Does it handle edge cases better? Does it make mistakes the current agent doesn’t?

Check for policy violations

Does the new agent violate safety rules? Does it call forbidden tools? Does it access data it shouldn’t?

Cost analysis

How much does the new agent cost? More tool calls? Longer responses? Higher token usage?

Quality assessment

Is the new agent’s output better? More accurate? More helpful? You can score both responses and compare.

What to log

Log everything you need to compare:

  • Inputs: User message, context, available tools
  • Old agent decision: Tool calls, response, metadata
  • New agent decision: Tool calls, response, metadata
  • Scores: Quality scores, cost, latency, policy compliance

Store this in a database or log service. Query it later to analyze differences.

How long to run shadow mode

Run shadow mode for at least a few days. You need enough data to see patterns:

  • Edge cases that only happen occasionally
  • Different user types and behaviors
  • Peak traffic vs normal traffic
  • Different times of day

A week is usually enough. Longer if you have low traffic or want more confidence.

What to watch

Monitor these metrics during shadow mode:

  • Decision differences: How often do agents make different decisions?
  • Policy violations: Does the new agent violate rules?
  • Cost differences: Is the new agent more expensive?
  • Quality scores: Is the new agent better or worse?
  • Error rates: Does the new agent fail more often?

If you see red flags, fix them before rolling out.
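
To turn those observations into numbers, aggregate the logged comparisons. The sketch below assumes the comparison records produced by the shadow router in the next section; the summarize_shadow_run name is illustrative.

# Aggregate logged shadow-mode comparisons (illustrative)
from typing import Any, Dict, List

def summarize_shadow_run(comparisons: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compute headline shadow-mode metrics from logged comparison records."""
    total = len(comparisons)
    if total == 0:
        return {}
    
    decision_diffs = sum(
        1 for c in comparisons if c["differences"]["tool_calls_different"]
    )
    avg_cost_delta = sum(c["differences"]["cost_difference"] for c in comparisons) / total
    avg_latency_delta = sum(c["differences"]["latency_difference_ms"] for c in comparisons) / total
    
    return {
        "decision_difference_rate": decision_diffs / total,
        "avg_cost_difference": avg_cost_delta,
        "avg_latency_difference_ms": avg_latency_delta,
    }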

Code example: Shadow mode routing

Here’s a simple implementation:

# src/shadow_mode.py
import logging
from datetime import datetime
from typing import Dict, Any, Optional
from .agent_factory import create_agent

logger = logging.getLogger(__name__)

class ShadowModeRouter:
    """Routes requests to current agent and runs candidate in shadow mode."""
    
    def __init__(
        self,
        agent_name: str,
        current_version: str,
        candidate_version: Optional[str] = None
    ):
        self.agent_name = agent_name
        self.current_version = current_version
        self.candidate_version = candidate_version
        
        # Load agents
        self.current_agent = create_agent(agent_name, current_version)
        self.candidate_agent = None
        if candidate_version:
            self.candidate_agent = create_agent(agent_name, candidate_version)
    
    def process(self, user_message: str, context: Dict[str, Any]) -> Dict[str, Any]:
        """
        Process request with current agent, run candidate in shadow mode.
        
        Returns current agent's response.
        """
        # Process with current agent (synchronous, user-facing)
        current_response = self.current_agent.process(user_message, context)
        
        # Process with candidate agent (shadow mode; synchronous here for simplicity)
        if self.candidate_agent:
            try:
                candidate_response = self.candidate_agent.process(user_message, context)
                
                # Log both responses for comparison
                self._log_comparison(
                    user_message=user_message,
                    context=context,
                    current_response=current_response,
                    candidate_response=candidate_response
                )
            except Exception as e:
                logger.error(f"Shadow mode error: {e}", exc_info=True)
                # Don't fail the request if shadow mode fails
        
        return current_response
    
    def _log_comparison(
        self,
        user_message: str,
        context: Dict[str, Any],
        current_response: Dict[str, Any],
        candidate_response: Dict[str, Any]
    ):
        """Log both responses for later analysis."""
        comparison = {
            "timestamp": datetime.utcnow().isoformat(),
            "user_message": user_message,
            "context": context,
            "current_version": self.current_version,
            "candidate_version": self.candidate_version,
            "current_response": {
                "tool_calls": current_response.get("tool_calls", []),
                "response": current_response.get("response", ""),
                "cost": current_response.get("cost", 0),
                "latency_ms": current_response.get("latency_ms", 0)
            },
            "candidate_response": {
                "tool_calls": candidate_response.get("tool_calls", []),
                "response": candidate_response.get("response", ""),
                "cost": candidate_response.get("cost", 0),
                "latency_ms": candidate_response.get("latency_ms", 0)
            },
            "differences": self._compute_differences(current_response, candidate_response)
        }
        
        # Log to database or logging service
        logger.info("Shadow mode comparison", extra=comparison)
    
    def _compute_differences(
        self,
        current: Dict[str, Any],
        candidate: Dict[str, Any]
    ) -> Dict[str, Any]:
        """Compute differences between responses."""
        return {
            "tool_calls_different": (
                current.get("tool_calls", []) != candidate.get("tool_calls", [])
            ),
            "cost_difference": (
                candidate.get("cost", 0) - current.get("cost", 0)
            ),
            "latency_difference_ms": (
                candidate.get("latency_ms", 0) - current.get("latency_ms", 0)
            )
        }

# Usage in API endpoint
router = ShadowModeRouter(
    agent_name="support-agent",
    current_version="v1.3.2",
    candidate_version="v1.3.3"  # Running in shadow mode
)

@app.post("/chat")
def chat(request: ChatRequest):
    response = router.process(
        user_message=request.message,
        context=request.context
    )
    return response

This runs the candidate agent against the same requests and logs comparisons without affecting what users see.
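
One caveat: this sketch runs the candidate synchronously, so its latency is added to every request. In production you would move the shadow call off the request path. A minimal way to do that, reusing the same ShadowModeRouter, is a background thread pool (the helper below is illustrative):

# Run the candidate off the request path (illustrative sketch)
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger(__name__)
shadow_executor = ThreadPoolExecutor(max_workers=4)

def process_with_background_shadow(router, user_message, context):
    """Serve the current agent immediately; run the candidate in a worker thread."""
    current_response = router.current_agent.process(user_message, context)
    
    if router.candidate_agent:
        def run_shadow():
            try:
                candidate_response = router.candidate_agent.process(user_message, context)
                router._log_comparison(
                    user_message=user_message,
                    context=context,
                    current_response=current_response,
                    candidate_response=candidate_response,
                )
            except Exception:
                logger.exception("Shadow mode error")
        
        shadow_executor.submit(run_shadow)
    
    return current_response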

Progressive delivery: from 1% to 100%

Once shadow mode looks good, start sending real traffic to the new version. Start small. Increase gradually. Monitor at each step.

This is progressive delivery. Also called canary rollouts or gradual rollouts.

Canary rollout stages

A typical rollout looks like this:

  1. 1%: Test with a small fraction of traffic
  2. 5%: If 1% looks good, increase to 5%
  3. 25%: If 5% looks good, increase to 25%
  4. 50%: If 25% looks good, increase to 50%
  5. 100%: If 50% looks good, roll out to everyone

Each stage gives you time to catch issues before they affect too many users.

You can adjust the stages based on your risk tolerance. More cautious? Use 0.1%, 1%, 5%, 25%, 100%. Less cautious? Use 5%, 25%, 100%.

What metrics to monitor

At each stage, watch these metrics:

Error rate

Is the new agent failing more? HTTP errors, exceptions, timeouts. If error rate spikes, roll back.

Tool failures

Are tool calls failing? Wrong parameters? Rate limits? API errors? Tool failures are a red flag.

Latency

Is the new agent slower? Higher latency means worse user experience. If latency increases significantly, investigate.

Safety violations / policy flags

Is the new agent violating policies? Calling forbidden tools? Accessing restricted data? These are critical.

Business KPIs

Is the new agent affecting business metrics? Conversion rates, customer satisfaction, revenue. These matter most.

Monitor all of these. Set thresholds. If any threshold is exceeded, roll back automatically.

Using feature flags or traffic rules

You can control rollouts with:

Feature flags

Use a feature flag service (LaunchDarkly, Split, etc.). Route traffic based on flag values.

if feature_flag.is_enabled("new-agent-v1.3.3", user_id):
    agent = create_agent("support-agent", "v1.3.3")
else:
    agent = create_agent("support-agent", "v1.3.2")

Traffic rules

Route based on user attributes. Percentage of users, user IDs, geographic regions, etc.

if should_use_new_agent(user_id, percentage=5):
    agent = create_agent("support-agent", "v1.3.3")
else:
    agent = create_agent("support-agent", "v1.3.2")
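
The should_use_new_agent helper above isn't a library function. A minimal sketch buckets users with a stable hash, so the same user always lands on the same side:

import hashlib

def should_use_new_agent(user_id: str, percentage: int) -> bool:
    """Stable percentage bucketing: the same user always gets the same answer."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percentage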

Load balancer rules

Route at the infrastructure level. Use load balancer rules to send X% of traffic to new version.

Choose what works for your infrastructure. Feature flags are flexible. Traffic rules are simple. Load balancer rules are infrastructure-native.

How to design thresholds and automatic rollback rules

Set clear thresholds for each metric:

rollout_thresholds:
  error_rate:
    max_increase: 0.05  # 5% increase is acceptable
    absolute_max: 0.10  # 10% error rate is unacceptable
  latency:
    max_increase_ms: 200  # 200ms increase is acceptable
    absolute_max_ms: 2000  # 2s latency is unacceptable
  cost:
    max_increase_percent: 20  # 20% cost increase is acceptable
  policy_violations:
    max_count: 0  # Zero tolerance for policy violations

If any threshold is exceeded, roll back automatically. Don’t wait for manual intervention.

Code example: Progressive rollout controller

Here’s a simple rollout controller:

# src/rollout_controller.py
import hashlib
import time
from typing import Dict, Any, Optional
from dataclasses import dataclass
from enum import Enum

class RolloutStage(Enum):
    SHADOW = "shadow"
    CANARY_1_PCT = "canary_1pct"
    CANARY_5_PCT = "canary_5pct"
    CANARY_25_PCT = "canary_25pct"
    CANARY_50_PCT = "canary_50pct"
    FULL = "full"
    ROLLED_BACK = "rolled_back"

@dataclass
class RolloutThresholds:
    """Thresholds for automatic rollback."""
    max_error_rate_increase: float = 0.05
    max_latency_increase_ms: int = 200
    max_cost_increase_percent: float = 20
    max_policy_violations: int = 0

class RolloutController:
    """Manages progressive rollout of agent versions."""
    
    def __init__(
        self,
        agent_name: str,
        current_version: str,
        candidate_version: str,
        thresholds: RolloutThresholds
    ):
        self.agent_name = agent_name
        self.current_version = current_version
        self.candidate_version = candidate_version
        self.thresholds = thresholds
        self.stage = RolloutStage.SHADOW
        self.stage_start_time = time.time()
        self.metrics_history = []
    
    def should_use_candidate(self, user_id: str) -> bool:
        """Determine if request should use candidate version."""
        if self.stage == RolloutStage.SHADOW:
            return False  # Shadow mode doesn't serve candidate
        
        if self.stage == RolloutStage.ROLLED_BACK:
            return False  # Rolled back, use current
        
        # Bucket by a stable hash so a user stays on the same side across restarts
        # (Python's built-in hash() is randomized per process)
        user_hash = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
        percentage = self._get_stage_percentage()
        
        return user_hash < percentage
    
    def _get_stage_percentage(self) -> int:
        """Get traffic percentage for current stage."""
        stage_percentages = {
            RolloutStage.CANARY_1_PCT: 1,
            RolloutStage.CANARY_5_PCT: 5,
            RolloutStage.CANARY_25_PCT: 25,
            RolloutStage.CANARY_50_PCT: 50,
            RolloutStage.FULL: 100
        }
        return stage_percentages.get(self.stage, 0)
    
    def record_metrics(
        self,
        version: str,
        error: bool,
        latency_ms: float,
        cost: float,
        policy_violations: int
    ):
        """Record metrics for a request."""
        self.metrics_history.append({
            "timestamp": time.time(),
            "version": version,
            "error": error,
            "latency_ms": latency_ms,
            "cost": cost,
            "policy_violations": policy_violations
        })
        
        # Keep only recent metrics (last hour)
        cutoff = time.time() - 3600
        self.metrics_history = [
            m for m in self.metrics_history
            if m["timestamp"] > cutoff
        ]
    
    def evaluate_and_advance(self) -> bool:
        """
        Evaluate metrics and advance to next stage if safe.
        
        Returns True if advanced, False if rolled back.
        """
        if self.stage == RolloutStage.SHADOW:
            # Shadow mode: just collect metrics, don't advance automatically
            return False
        
        if self.stage == RolloutStage.FULL:
            # Already at 100%, nothing to do
            return False
        
        # Check if we should roll back
        if self._should_rollback():
            self.stage = RolloutStage.ROLLED_BACK
            return False
        
        # Check if we've been in this stage long enough
        stage_duration = time.time() - self.stage_start_time
        min_stage_duration = 3600  # 1 hour minimum per stage
        
        if stage_duration < min_stage_duration:
            return False  # Not enough time in this stage
        
        # Advance to next stage
        self._advance_stage()
        return True
    
    def _should_rollback(self) -> bool:
        """Check if metrics indicate we should roll back."""
        if not self.metrics_history:
            return False
        
        # Calculate metrics for candidate version
        candidate_metrics = [
            m for m in self.metrics_history
            if m["version"] == self.candidate_version
        ]
        
        if not candidate_metrics:
            return False
        
        current_metrics = [
            m for m in self.metrics_history
            if m["version"] == self.current_version
        ]
        
        if not current_metrics:
            return True  # No baseline, roll back
        
        # Calculate averages
        candidate_error_rate = sum(m["error"] for m in candidate_metrics) / len(candidate_metrics)
        current_error_rate = sum(m["error"] for m in current_metrics) / len(current_metrics)
        
        candidate_avg_latency = sum(m["latency_ms"] for m in candidate_metrics) / len(candidate_metrics)
        current_avg_latency = sum(m["latency_ms"] for m in current_metrics) / len(current_metrics)
        
        candidate_avg_cost = sum(m["cost"] for m in candidate_metrics) / len(candidate_metrics)
        current_avg_cost = sum(m["cost"] for m in current_metrics) / len(current_metrics)
        
        candidate_violations = sum(m["policy_violations"] for m in candidate_metrics)
        
        # Check thresholds
        if candidate_error_rate - current_error_rate > self.thresholds.max_error_rate_increase:
            return True
        
        if candidate_avg_latency - current_avg_latency > self.thresholds.max_latency_increase_ms:
            return True
        
        if current_avg_cost > 0 and (
            (candidate_avg_cost - current_avg_cost) / current_avg_cost
            > self.thresholds.max_cost_increase_percent / 100
        ):
            return True
        
        if candidate_violations > self.thresholds.max_policy_violations:
            return True
        
        return False
    
    def _advance_stage(self):
        """Advance to next rollout stage."""
        stage_order = [
            RolloutStage.SHADOW,
            RolloutStage.CANARY_1_PCT,
            RolloutStage.CANARY_5_PCT,
            RolloutStage.CANARY_25_PCT,
            RolloutStage.CANARY_50_PCT,
            RolloutStage.FULL
        ]
        
        current_index = stage_order.index(self.stage)
        if current_index < len(stage_order) - 1:
            self.stage = stage_order[current_index + 1]
            self.stage_start_time = time.time()

# Usage
controller = RolloutController(
    agent_name="support-agent",
    current_version="v1.3.2",
    candidate_version="v1.3.3",
    thresholds=RolloutThresholds()
)

@app.post("/chat")
def chat(request: ChatRequest):
    # Determine which version to use
    use_candidate = controller.should_use_candidate(request.user_id)
    version = controller.candidate_version if use_candidate else controller.current_version
    
    agent = create_agent("support-agent", version)
    response = agent.process(request.message, request.context)
    
    # Record metrics
    controller.record_metrics(
        version=version,
        error=response.get("error", False),
        latency_ms=response.get("latency_ms", 0),
        cost=response.get("cost", 0),
        policy_violations=response.get("policy_violations", 0)
    )
    
    return response

# Background task to evaluate and advance rollout
def evaluate_rollout():
    controller.evaluate_and_advance()

This controller manages the rollout stages and automatically rolls back if thresholds are exceeded.

Automated quality gates and checks

Don’t rely only on charts and dashboards. Add automated checks into your pipeline. These checks block bad versions from reaching production.

Quality gates

Quality gates are automated checks that must pass before a version can advance:

Quality score from offline evals

Run the new version against a test suite. Calculate quality scores. If scores drop, block the rollout.

Safety checker score

Run safety checks. Check for policy violations, dangerous tool calls, PII leaks. If safety score is too low, block the rollout.

Regression checks on key scenarios

Run regression tests on critical scenarios. If key scenarios fail, block the rollout.

Performance benchmarks

Check latency, cost, token usage. If performance degrades significantly, block the rollout.

Pipeline view

Your rollout pipeline looks like this:

Build → Test → Shadow → Canary → Full Rollout
         ↓       ↓        ↓          ↓
      Quality  Safety  Metrics   Metrics
      Gates    Checks  Checks    Checks

Each stage has gates. If a gate fails, the rollout stops.

How to represent gates as code

Define gates as config files or code:

# quality_gates.yaml
gates:
  - name: "quality_score"
    type: "threshold"
    metric: "offline_quality_score"
    threshold: 0.85
    operator: ">="
  
  - name: "safety_score"
    type: "threshold"
    metric: "safety_checker_score"
    threshold: 0.95
    operator: ">="
  
  - name: "regression_tests"
    type: "test_suite"
    test_suite: "critical_scenarios"
    min_pass_rate: 1.0  # 100% must pass
  
  - name: "latency"
    type: "threshold"
    metric: "p95_latency_ms"
    threshold: 2000
    operator: "<="
  
  - name: "cost"
    type: "threshold"
    metric: "avg_cost_per_request"
    threshold: 0.50
    operator: "<="

Load these gates and check them at each stage.

Code example: Quality gate checker

Here’s a simple quality gate implementation:

# src/quality_gates.py
import yaml
from typing import Dict, Any, List
from pathlib import Path

class QualityGate:
    """Represents a single quality gate."""
    
    def __init__(self, config: Dict[str, Any]):
        self.name = config["name"]
        self.type = config["type"]
        self.metric = config.get("metric")
        self.threshold = config.get("threshold")
        self.operator = config.get("operator", ">=")
        self.test_suite = config.get("test_suite")
        self.min_pass_rate = config.get("min_pass_rate", 1.0)
    
    def check(self, metrics: Dict[str, Any]) -> tuple[bool, str]:
        """
        Check if gate passes.
        
        Returns (passed, message)
        """
        if self.type == "threshold":
            return self._check_threshold(metrics)
        elif self.type == "test_suite":
            return self._check_test_suite(metrics)
        else:
            return False, f"Unknown gate type: {self.type}"
    
    def _check_threshold(self, metrics: Dict[str, Any]) -> tuple[bool, str]:
        """Check threshold gate."""
        if self.metric not in metrics:
            return False, f"Metric {self.metric} not found"
        
        value = metrics[self.metric]
        threshold = self.threshold
        
        if self.operator == ">=":
            passed = value >= threshold
        elif self.operator == "<=":
            passed = value <= threshold
        elif self.operator == ">":
            passed = value > threshold
        elif self.operator == "<":
            passed = value < threshold
        elif self.operator == "==":
            passed = value == threshold
        else:
            return False, f"Unknown operator: {self.operator}"
        
        if passed:
            return True, f"{self.name}: {value} {self.operator} {threshold}"
        else:
            return False, f"{self.name}: {value} {self.operator} {threshold} (FAILED)"
    
    def _check_test_suite(self, metrics: Dict[str, Any]) -> tuple[bool, str]:
        """Check test suite gate."""
        if self.test_suite not in metrics:
            return False, f"Test suite {self.test_suite} not found"
        
        test_results = metrics[self.test_suite]
        total = len(test_results)
        passed = sum(1 for r in test_results if r["passed"])
        pass_rate = passed / total if total > 0 else 0
        
        if pass_rate >= self.min_pass_rate:
            return True, f"{self.name}: {passed}/{total} tests passed ({pass_rate:.1%})"
        else:
            return False, f"{self.name}: {passed}/{total} tests passed ({pass_rate:.1%}) < {self.min_pass_rate:.1%} (FAILED)"

class QualityGateChecker:
    """Checks quality gates for agent versions."""
    
    def __init__(self, gates_config_path: str):
        with open(gates_config_path, "r") as f:
            config = yaml.safe_load(f)
        
        self.gates = [QualityGate(g) for g in config["gates"]]
    
    def check_all(self, metrics: Dict[str, Any]) -> tuple[bool, List[str]]:
        """
        Check all gates.
        
        Returns (all_passed, messages)
        """
        results = []
        all_passed = True
        
        for gate in self.gates:
            passed, message = gate.check(metrics)
            results.append(message)
            if not passed:
                all_passed = False
        
        return all_passed, results

# Usage
checker = QualityGateChecker("quality_gates.yaml")

# After running evals, check gates
metrics = {
    "offline_quality_score": 0.87,
    "safety_checker_score": 0.96,
    "critical_scenarios": [
        {"name": "test_1", "passed": True},
        {"name": "test_2", "passed": True},
        {"name": "test_3", "passed": True}
    ],
    "p95_latency_ms": 1500,
    "avg_cost_per_request": 0.45
}

all_passed, messages = checker.check_all(metrics)
for message in messages:
    print(message)

if not all_passed:
    print("Quality gates failed. Blocking rollout.")
    exit(1)
else:
    print("All quality gates passed. Proceeding with rollout.")

This checks quality gates and blocks rollouts if they fail.

Kill switches and rollback patterns

Kill switches let you turn off a version immediately. No deployment needed. No waiting. Just flip a switch and traffic goes back to the previous version.

Why kill switches are essential

Agents can break in weird ways. They might:

  • Start making expensive API calls
  • Violate policies repeatedly
  • Generate inappropriate content
  • Get stuck in loops

You need to stop this immediately. Kill switches give you that control.

Types of kill switches

Global flag to turn off the new version

A single flag that disables the new version for everyone. All traffic goes to the current version.

Per-feature or per-tenant switches

More granular control. Turn off the new version for specific features or tenants. Useful if only some users are affected.

Percentage-based rollback

Gradually reduce traffic to the new version. 100% → 50% → 25% → 0%. Gives you a controlled rollback.

Rollback patterns

When you kill a version, you need to roll back. Options:

Roll back to previous agent version

Switch all traffic back to the previous version. Simple and fast.

Fall back to simple workflow or FAQ bot

If the agent is completely broken, fall back to a simpler system. A rule-based bot or FAQ lookup. Better than nothing.

Circuit breaker pattern

If the agent fails too many times, stop using it automatically. Fall back to a safe alternative.
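
A circuit breaker can be as simple as counting recent failures and switching to the fallback once a threshold is crossed. Here's a minimal sketch; the class name and defaults are illustrative.

# src/circuit_breaker.py (illustrative sketch)
import time

class AgentCircuitBreaker:
    """Stops routing to the agent after too many recent failures."""
    
    def __init__(self, max_failures: int = 5, window_seconds: int = 300, cooldown_seconds: int = 600):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self.failure_times = []
        self.opened_at = None  # set when the breaker trips
    
    def record_failure(self):
        """Record a failed agent request and trip the breaker if needed."""
        now = time.time()
        self.failure_times = [t for t in self.failure_times if now - t < self.window_seconds]
        self.failure_times.append(now)
        if len(self.failure_times) >= self.max_failures:
            self.opened_at = now
    
    def allow_agent(self) -> bool:
        """True if requests may go to the agent, False if the fallback should handle them."""
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at > self.cooldown_seconds:
            # Cooldown expired: reset and try the agent again
            self.opened_at = None
            self.failure_times = []
            return True
        return False

# Usage: if breaker.allow_agent() is False, route to the FAQ bot or rule-based fallback.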

Where to put controls

Edge router / gateway

Put kill switches in your API gateway or edge router. Fast and centralized.

Feature-flag system

Use a feature flag service. Easy to manage and monitor.

Internal admin panel

Build a simple admin panel. Let on-call engineers flip switches quickly.

Configuration service

Store kill switch state in a config service. Update it via API or UI.

Choose what fits your infrastructure. Feature flags are usually easiest.

Code example: Kill switch implementation

Here’s a simple kill switch:

# src/kill_switch.py
import json
from typing import Dict, Any, Optional
from pathlib import Path
from datetime import datetime

class KillSwitch:
    """Manages kill switches for agent versions."""
    
    def __init__(self, config_path: str = "kill_switches.json"):
        self.config_path = Path(config_path)
        self._load_config()
    
    def _load_config(self):
        """Load kill switch config from file."""
        if self.config_path.exists():
            with open(self.config_path, "r") as f:
                self.config = json.load(f)
        else:
            self.config = {
                "global_switches": {},
                "feature_switches": {},
                "tenant_switches": {}
            }
            self._save_config()
    
    def _save_config(self):
        """Save kill switch config to file."""
        with open(self.config_path, "w") as f:
            json.dump(self.config, f, indent=2)
    
    def is_killed(self, agent_name: str, version: str, feature: Optional[str] = None, tenant: Optional[str] = None) -> bool:
        """Check if a version is killed."""
        # Check global switch
        global_key = f"{agent_name}:{version}"
        if global_key in self.config["global_switches"]:
            if self.config["global_switches"][global_key].get("enabled", False):
                return True
        
        # Check feature switch
        if feature:
            feature_key = f"{agent_name}:{version}:{feature}"
            if feature_key in self.config["feature_switches"]:
                if self.config["feature_switches"][feature_key].get("enabled", False):
                    return True
        
        # Check tenant switch
        if tenant:
            tenant_key = f"{agent_name}:{version}:{tenant}"
            if tenant_key in self.config["tenant_switches"]:
                if self.config["tenant_switches"][tenant_key].get("enabled", False):
                    return True
        
        return False
    
    def kill_version(
        self,
        agent_name: str,
        version: str,
        reason: str,
        feature: Optional[str] = None,
        tenant: Optional[str] = None
    ):
        """Kill a version (enable kill switch)."""
        if feature:
            key = f"{agent_name}:{version}:{feature}"
            self.config["feature_switches"][key] = {
                "enabled": True,
                "reason": reason,
                "killed_at": datetime.utcnow().isoformat()
            }
        elif tenant:
            key = f"{agent_name}:{version}:{tenant}"
            self.config["tenant_switches"][key] = {
                "enabled": True,
                "reason": reason,
                "killed_at": datetime.utcnow().isoformat()
            }
        else:
            key = f"{agent_name}:{version}"
            self.config["global_switches"][key] = {
                "enabled": True,
                "reason": reason,
                "killed_at": datetime.utcnow().isoformat()
            }
        
        self._save_config()
    
    def unkill_version(
        self,
        agent_name: str,
        version: str,
        feature: Optional[str] = None,
        tenant: Optional[str] = None
    ):
        """Unkill a version (disable kill switch)."""
        if feature:
            key = f"{agent_name}:{version}:{feature}"
            if key in self.config["feature_switches"]:
                del self.config["feature_switches"][key]
        elif tenant:
            key = f"{agent_name}:{version}:{tenant}"
            if key in self.config["tenant_switches"]:
                del self.config["tenant_switches"][key]
        else:
            key = f"{agent_name}:{version}"
            if key in self.config["global_switches"]:
                del self.config["global_switches"][key]
        
        self._save_config()

# Usage in request handler
kill_switch = KillSwitch()

@app.post("/chat")
def chat(request: ChatRequest):
    # Check kill switch
    if kill_switch.is_killed(
        agent_name="support-agent",
        version="v1.3.3",
        feature=request.feature,
        tenant=request.tenant_id
    ):
        # Use previous version
        version = "v1.3.2"
    else:
        version = "v1.3.3"
    
    agent = create_agent("support-agent", version)
    return agent.process(request.message, request.context)

# Admin endpoint to kill a version
@app.post("/admin/kill")
def kill_version(request: KillSwitchRequest):
    kill_switch.kill_version(
        agent_name=request.agent_name,
        version=request.version,
        reason=request.reason,
        feature=request.feature,
        tenant=request.tenant
    )
    return {"status": "killed"}

# Admin endpoint to unkill a version
@app.post("/admin/unkill")
def unkill_version(request: KillSwitchRequest):
    kill_switch.unkill_version(
        agent_name=request.agent_name,
        version=request.version,
        feature=request.feature,
        tenant=request.tenant
    )
    return {"status": "unkilled"}

This gives you kill switches you can flip instantly without code changes.

Putting it together: a rollout playbook

Here’s a practical playbook for rolling out agent versions safely.

Checklist before enabling shadow mode

  • Agent version is tagged and stored in registry
  • Config file is reviewed and approved
  • Offline evals pass quality gates
  • Shadow mode logging is configured
  • Monitoring dashboards are set up
  • Team is notified about shadow mode start

Checklist before raising canary traffic

  • Shadow mode ran for at least 3-7 days
  • No policy violations detected
  • Cost is within acceptable range
  • Quality scores meet thresholds
  • Key scenarios pass regression tests
  • Rollout controller is configured
  • Kill switch is tested and ready
  • On-call engineer is available

Checklist during incident (who does what)

On-call engineer:

  1. Check monitoring dashboards
  2. Identify which version is affected
  3. Check kill switch status
  4. Kill the version if needed
  5. Notify team

Team lead:

  1. Review incident details
  2. Coordinate investigation
  3. Decide on fix or rollback
  4. Update playbook if needed

Metrics to check:

  • Error rate dashboard
  • Cost dashboard
  • Policy violations log
  • User complaints / support tickets
  • Business KPI dashboard

Runbook template

Here’s a simple runbook template teams can adapt:

# Agent Rollout Runbook

## Pre-Rollout

1. **Version Preparation**
   - Tag version in git
   - Store config in registry
   - Run offline evals
   - Review quality gate results

2. **Shadow Mode Setup**
   - Enable shadow mode routing
   - Configure logging
   - Set up monitoring
   - Notify team

3. **Shadow Mode Duration**
   - Run for 3-7 days minimum
   - Monitor metrics daily
   - Review differences weekly

## Rollout

1. **Canary 1%**
   - Enable 1% traffic
   - Monitor for 1 hour minimum
   - Check all metrics
   - Advance if safe

2. **Canary 5%**
   - Enable 5% traffic
   - Monitor for 2 hours minimum
   - Check all metrics
   - Advance if safe

3. **Canary 25%**
   - Enable 25% traffic
   - Monitor for 4 hours minimum
   - Check all metrics
   - Advance if safe

4. **Canary 50%**
   - Enable 50% traffic
   - Monitor for 8 hours minimum
   - Check all metrics
   - Advance if safe

5. **Full Rollout**
   - Enable 100% traffic
   - Monitor for 24 hours
   - Mark rollout complete

## Rollback

1. **Trigger Rollback If:**
   - Error rate increases >5%
   - Latency increases >200ms
   - Cost increases >20%
   - Policy violations detected
   - Business KPIs degrade

2. **Rollback Steps:**
   - Activate kill switch
   - Verify traffic switched
   - Monitor previous version
   - Investigate root cause
   - Document incident

## Post-Rollout

1. **Review**
   - Compare metrics (old vs new)
   - Review user feedback
   - Update playbook if needed

2. **Cleanup**
   - Remove shadow mode config
   - Archive old version
   - Update documentation

Customize this for your team and infrastructure.

Conclusion

Rolling out agent versions safely requires:

  • Versioning: Treat agents as versioned artifacts
  • Shadow mode: Learn from real traffic without risk
  • Progressive delivery: Roll out gradually, monitor at each step
  • Quality gates: Automated checks that block bad versions
  • Kill switches: Instant rollback when things go wrong

These patterns come from DevOps and feature flags. They work for agents too.

Start simple. Add shadow mode first. Then add progressive delivery. Then add quality gates and kill switches. Build up your safety net over time.

The goal is to roll out new versions confidently. You should know they’re safe before users see them. And if something goes wrong, you should be able to fix it quickly.

Agents are different from traditional services. But the rollout principles are the same: test thoroughly, roll out gradually, monitor closely, and have a kill switch ready.
