Shift-Left DevOps for AI Agents: Testing, Sandboxing, and Tool Mocks in CI/CD
AI agents break in ways normal microservices don’t. They hallucinate tool calls. They loop forever. They make subtle mistakes that only show up when real users hit them.
The problem is we’re catching these bugs in production. That’s expensive. It’s risky. And it’s unnecessary.
Shift-left DevOps means testing earlier in the pipeline. For agents, that means testing tools, prompts, and workflows before deployment. This article shows you how.
Why AI agents need shift-left DevOps
Agents fail differently than traditional services. Here’s what goes wrong:
Hallucinated tool calls
An agent might call a tool that doesn’t exist. Or pass parameters that don’t match the schema. The LLM generates plausible-looking code, but it’s wrong.
Example: An agent tries to call send_email(to="user@example.com", subject="Hello") but the real API requires recipient, not to. The call fails silently or returns an error the agent doesn’t handle.
Infinite loops and runaway actions
Agents can get stuck. They call a tool, get a result, decide they need more information, call another tool, and repeat. Without limits, they’ll run until they hit a timeout or rate limit.
A customer support agent might keep querying a database, trying to find information that doesn’t exist. Each query costs money and time.
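A hard step budget turns this failure mode into a fast, testable error. Here's a minimal sketch; the start and step methods on Agent are hypothetical, so map them onto whatever loop your framework exposes:

def run_with_budget(agent, user_message, max_steps=10):
    """Run the agent, but stop it once the step budget is spent."""
    state = agent.start(user_message)   # hypothetical API: one state object per turn
    for _ in range(max_steps):
        if state.done:
            return state.final_response
        state = agent.step(state)       # one reasoning / tool-call step
    if state.done:
        return state.final_response
    raise RuntimeError(f"Agent exceeded step budget of {max_steps} steps")

The same limit shows up again later as a CI policy check, so a looping agent fails the pipeline instead of burning through API credits.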
Subtle prompt regressions
You update a prompt to fix one issue. It breaks something else. The agent starts making different decisions, and you don’t notice until users complain.
Maybe you change “be concise” to “be very concise” and suddenly the agent stops including important context. Or you add a safety check that makes the agent too cautious.
Why production testing is risky
Catching these issues in production means:
- Real users see failures
- Real API calls cost money
- Real data might get corrupted
- Real services might get overwhelmed
A single agent making bad tool calls can trigger rate limits, create duplicate records, or send wrong notifications.
The goal of shift-left
Fail fast in CI. Keep production simple and boring.
If an agent breaks, the PR should fail. The deployment should be blocked. You should know before it reaches users.
Defining testable contracts for agents
Treat the agent as a black box with a contract. It takes inputs and produces outputs. You can test those inputs and outputs without understanding the internal reasoning.
The contract
Inputs:
- User message
- Available tools
- Context (conversation history, user data, etc.)
Allowed actions:
- Tool calls (which tools, with what parameters)
- Messages (what the agent says)
- State updates (what changes in the system)
Expected constraints:
- No PII in logs
- Maximum number of steps
- Never call dangerous tools in test environments
- Tool parameters match schemas
Writing contracts as code
Contracts live in your repo. They’re JSON schemas, YAML files, or Python classes. They define what’s allowed and what’s not.
Here’s a simple example using JSON Schema:
{
  "type": "object",
  "properties": {
    "tool_name": {
      "type": "string",
      "enum": ["search_database", "send_notification", "get_user_info"]
    },
    "parameters": {
      "type": "object",
      "properties": {
        "query": {"type": "string", "maxLength": 500},
        "user_id": {"type": "string", "pattern": "^user_\\d+$"}
      },
      "required": ["query"]
    }
  },
  "required": ["tool_name", "parameters"]
}
This schema says: the agent can only call these three tools, parameters must match these types, and query is required.
Policy rules
Beyond schemas, you need policy rules. These are checks that run after the agent makes a decision.
Example rules:
- Never call delete_user in test environments
- Maximum 10 tool calls per conversation
- All database queries must include a limit parameter
- No tool calls that modify production data during testing
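Rules like these are easy to express as a plain function that runs over the agent's recorded tool calls. A minimal sketch, with illustrative tool names and call structure:

def check_run_policies(tool_calls, environment="test", max_calls=10):
    """Return a list of policy violations for a finished agent run."""
    violations = []
    if len(tool_calls) > max_calls:
        violations.append(f"Too many tool calls: {len(tool_calls)} > {max_calls}")
    for call in tool_calls:
        if environment == "test" and call["tool_name"] == "delete_user":
            violations.append("delete_user called in a test environment")
        if call["tool_name"] == "search_database" and "limit" not in call["parameters"]:
            violations.append("Database query without a limit parameter")
    return violations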
Where contracts live
Contracts can live in:
- The same repo as the agent code
- A separate config service
- A policy repository (like OPA policies)
For most teams, keeping them in the same repo works best. They’re versioned with the code, and changes are visible in PRs.
Code example: Contract validation
Here’s a Python example that validates an agent action:
from jsonschema import validate, ValidationError
from typing import Dict, Any

TOOL_CONTRACT = {
    "type": "object",
    "properties": {
        "tool_name": {
            "type": "string",
            "enum": ["search_database", "send_notification"]
        },
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "maxLength": 500}
            },
            "required": ["query"]
        }
    },
    "required": ["tool_name", "parameters"]
}

def validate_agent_action(action: Dict[str, Any]) -> bool:
    """Validate an agent action against the contract."""
    try:
        validate(instance=action, schema=TOOL_CONTRACT)
        return True
    except ValidationError as e:
        print(f"Contract violation: {e.message}")
        return False

# Test
action = {
    "tool_name": "search_database",
    "parameters": {"query": "find users"}
}
assert validate_agent_action(action), "Action should be valid"
This test fails if the agent tries to call a tool that’s not in the allowed list, or if parameters don’t match the schema.
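It's worth pairing that with a negative case, so a hallucinated tool name is provably rejected:

# Negative case: a tool name outside the contract's enum must be rejected
bad_action = {
    "tool_name": "delete_all_users",  # not in the allowed list
    "parameters": {"query": "everything"}
}
assert not validate_agent_action(bad_action), "Action should be rejected"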
Unit tests for tools and prompt modules
Tools are just functions. Test them like normal code.
Prompt “modules” are trickier. They’re sub-prompts for planning, tool selection, or response formatting. You can test them with fixed inputs and expected outputs.
Testing tools
A tool is a function that takes parameters and returns a result. Test it with known inputs and check the outputs.
def search_database(query: str, limit: int = 10) -> list:
    """Search the database for records matching the query."""
    # Implementation here
    pass

def test_search_database():
    results = search_database("user@example.com", limit=5)
    assert len(results) <= 5
    assert all("user@example.com" in str(r) for r in results)
This is standard unit testing. Nothing special about agents here.
Testing prompt modules
Prompt modules are harder. The LLM generates text, and text is variable. You can’t test for exact matches.
Instead, test for structure and key fields. Use “good enough” matchers.
Example: You have a prompt that extracts structured data from user messages. Test that the output has the right structure, even if the exact wording varies.
def test_extract_user_intent():
    prompt = "Extract the user's intent from: 'I want to cancel my subscription'"
    response = llm.generate(prompt)

    # Check structure, not exact text
    assert "intent" in response
    assert response["intent"] in ["cancel", "unsubscribe", "cancel_subscription"]
    assert "confidence" in response
    assert 0 <= response["confidence"] <= 1
You’re checking that the response has the right shape and reasonable values, not that it matches a specific string.
Snapshot tests
Snapshot tests store a “golden” output and compare new outputs against it. They’re useful for prompt modules that should be relatively stable.
import json
import os

def test_planning_prompt_snapshot():
    prompt = create_planning_prompt(user_message="Book a flight")
    response = llm.generate(prompt)

    # Store snapshot on first run
    snapshot_file = "snapshots/planning_prompt.json"
    if not os.path.exists(snapshot_file):
        with open(snapshot_file, "w") as f:
            json.dump(response, f, indent=2)
        return

    # Compare with snapshot
    with open(snapshot_file, "r") as f:
        expected = json.load(f)
    assert response["steps"] == expected["steps"]
    assert response["tools"] == expected["tools"]
If the prompt changes in an unexpected way, the snapshot test fails. You review the diff and either update the snapshot or fix the prompt.
Avoiding brittle tests
Prompt tests break easily. The LLM might rephrase something, and your test fails even though the behavior is correct.
To avoid brittleness:
- Check structure and key fields, not every word
- Use matchers that allow variation (e.g., “contains ‘cancel’ or ‘unsubscribe’”)
- Test behavior, not implementation (does it work, not how it’s worded)
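A small helper keeps these lenient checks readable. This is a sketch; agent_reply is a hypothetical stand-in for however your test gets the agent's final message:

def contains_any(text: str, *phrases: str) -> bool:
    """True if the text mentions any of the given phrases, case-insensitively."""
    lowered = text.lower()
    return any(p.lower() in lowered for p in phrases)

def test_cancellation_reply_is_flexible():
    reply = agent_reply("I want to cancel my subscription")  # hypothetical helper
    assert contains_any(reply, "cancel", "unsubscribe")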
Code example: Mocking the LLM
Here’s a pytest example that mocks the LLM and asserts the agent sends the right tool call:
import pytest
from unittest.mock import Mock, patch

def test_agent_tool_selection():
    """Test that the agent selects the correct tool for a user request."""
    # Mock the LLM response
    mock_llm_response = {
        "tool_calls": [
            {
                "tool_name": "search_database",
                "parameters": {"query": "user@example.com"}
            }
        ]
    }

    with patch('agent.llm.generate') as mock_llm:
        mock_llm.return_value = mock_llm_response
        agent = Agent()
        response = agent.process("Find the user with email user@example.com")

        # Assert the agent called the right tool
        assert len(response.tool_calls) == 1
        assert response.tool_calls[0]["tool_name"] == "search_database"
        assert "user@example.com" in response.tool_calls[0]["parameters"]["query"]
This test doesn’t call a real LLM. It mocks the response and checks that the agent handles it correctly.
Tool mocks and fake environments in CI
Never call real APIs from CI. They cost money, have rate limits, and create side effects.
Instead, use mocks. Create fake implementations of your tools that behave like the real ones but don’t make external calls.
Why mock tools
Real API calls in CI mean:
- Tests are slow (network latency)
- Tests are flaky (network failures, rate limits)
- Tests cost money (API usage)
- Tests create side effects (real emails sent, real data created)
A single test run might make a hundred tool calls. If each call costs $0.001, that's $0.10 per run. With 100 runs per day, that's $10 a day, or roughly $300 a month, for one test suite.
Layered mocks
Use different levels of mocking depending on what you’re testing:
Level 1: In-memory stubs
Simple functions that return hardcoded responses. Fast, no dependencies.
class MockEmailService:
    def send_email(self, to: str, subject: str, body: str):
        return {"status": "sent", "message_id": "mock_123"}
Level 2: Local fake services
Docker containers that run simplified versions of real services. More realistic, but still isolated.
# docker-compose.yml
services:
  fake-database:
    image: postgres:15
    environment:
      POSTGRES_DB: test_db
Tool adapter interface
Create an interface for each tool. The real implementation and the mock implementation both implement the same interface.
from abc import ABC, abstractmethod

class DatabaseTool(ABC):
    @abstractmethod
    def search(self, query: str, limit: int = 10) -> list:
        pass

class RealDatabaseTool(DatabaseTool):
    def __init__(self, connection_string: str):
        self.conn = connect(connection_string)

    def search(self, query: str, limit: int = 10) -> list:
        # Real database query (illustrative; use parameterized queries in real code)
        return self.conn.execute(f"SELECT * FROM users WHERE {query} LIMIT {limit}")

class MockDatabaseTool(DatabaseTool):
    def __init__(self):
        self.data = [
            {"id": 1, "email": "user1@example.com"},
            {"id": 2, "email": "user2@example.com"}
        ]

    def search(self, query: str, limit: int = 10) -> list:
        # Fake search over in-memory records
        return [r for r in self.data if query in str(r)][:limit]
The agent code uses DatabaseTool, not RealDatabaseTool or MockDatabaseTool. In tests, you inject the mock. In production, you inject the real implementation.
Dependency injection
Wire up mocks using environment variables or dependency injection.
import os

def create_agent():
    if os.getenv("USE_MOCKS") == "true":
        db_tool = MockDatabaseTool()
        email_tool = MockEmailService()
    else:
        db_tool = RealDatabaseTool(os.getenv("DB_CONNECTION_STRING"))
        email_tool = RealEmailService(os.getenv("EMAIL_API_KEY"))
    return Agent(tools=[db_tool, email_tool])
In CI, set USE_MOCKS=true. In production, don’t set it (or set it to false).
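In tests you can also skip the environment variable and inject mocks directly. Here's a sketch using a pytest fixture with the mock classes from above; the exact tool_name the agent reports depends on how your tools are registered, so treat the assertion as a placeholder:

import pytest

@pytest.fixture
def agent_with_mocks():
    """An agent wired up with mock tools only; safe to run anywhere."""
    return Agent(tools=[MockDatabaseTool(), MockEmailService()])

def test_lookup_uses_the_database_tool(agent_with_mocks):
    response = agent_with_mocks.process("Find the user with email user1@example.com")
    assert any(tc["tool_name"] == "search" for tc in response.tool_calls)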
Code example: Tool interface and mock
Here’s a complete example:
from abc import ABC, abstractmethod
from typing import List, Dict

import requests

class SearchTool(ABC):
    @abstractmethod
    def search(self, query: str) -> List[Dict]:
        pass

class RealSearchTool(SearchTool):
    def __init__(self, api_key: str):
        self.api_key = api_key

    def search(self, query: str) -> List[Dict]:
        # Real API call
        response = requests.get(
            "https://api.example.com/search",
            params={"q": query},
            headers={"Authorization": f"Bearer {self.api_key}"}
        )
        return response.json()["results"]

class MockSearchTool(SearchTool):
    def search(self, query: str) -> List[Dict]:
        # Return fake data
        return [
            {"id": 1, "title": "Result 1", "url": "https://example.com/1"},
            {"id": 2, "title": "Result 2", "url": "https://example.com/2"}
        ]

# Test
def test_agent_with_mock_tool():
    mock_tool = MockSearchTool()
    agent = Agent(tools=[mock_tool])
    response = agent.process("Search for Python tutorials")

    # Verify the tool was called
    assert len(response.tool_calls) == 1
    assert response.tool_calls[0]["tool_name"] == "search"
The test uses the mock, so it’s fast and doesn’t make real API calls.
Agent sandbox runs on every pull request
Every PR should spin up a sandbox run of the agent. Run it against fixed scenarios with mocked tools. Fail the PR if the agent misbehaves.
What is a sandbox run?
A sandbox run is a full execution of the agent in a controlled environment:
- Scripted scenarios (e.g., “user forgot password”, “failed payment”)
- Mocked tools (no real API calls)
- Policy checks (max steps, no dangerous calls)
- Structured logging
The agent runs through the scenario, and you check:
- Did it complete successfully?
- Did it call the right tools?
- Did it stay within limits?
- Did it follow policies?
Fixed scenarios
Create a set of scenarios that cover common cases:
# scenarios/forgot_password.yaml
name: "User forgot password"
user_message: "I forgot my password"
expected_tools:
  - name: "get_user_by_email"
  - name: "send_password_reset_email"
max_steps: 5
policies:
  no_pii_in_logs: true
  no_delete_operations: true
Each scenario defines:
- The user’s message
- Expected tool calls (or tool call patterns)
- Maximum number of steps
- Policy constraints
Running scenarios
A script runs the agent against each scenario:
def run_scenario(scenario: dict, agent: Agent) -> dict:
    """Run an agent against a scenario and return results."""
    result = {
        "scenario": scenario["name"],
        "passed": False,
        "errors": [],
        "tool_calls": [],
        "steps": 0
    }

    try:
        response = agent.process(scenario["user_message"])
        result["tool_calls"] = response.tool_calls
        result["steps"] = response.step_count

        # Check policies
        if result["steps"] > scenario["max_steps"]:
            result["errors"].append(
                f"Exceeded max steps: {result['steps']} > {scenario['max_steps']}"
            )

        # Check expected tools
        expected_tools = {t["name"] for t in scenario["expected_tools"]}
        actual_tools = {tc["tool_name"] for tc in response.tool_calls}
        if not expected_tools.issubset(actual_tools):
            missing = expected_tools - actual_tools
            result["errors"].append(f"Missing tool calls: {missing}")

        result["passed"] = len(result["errors"]) == 0
    except Exception as e:
        result["errors"].append(str(e))

    return result
The script returns a pass/fail result. If it fails, the CI job fails.
Failing the PR
Fail the PR if:
- Too many steps (agent is looping)
- Wrong tool sequences (agent is making bad decisions)
- Policy violations (agent is doing something dangerous)
- Exceptions (agent crashed)
Practical tips
Start small:
- 3-5 scenarios
- 2-3 critical tools mocked
- Basic policy checks
Expand over time. Add more scenarios as you find edge cases. Add more policy checks as you discover risks.
Keep logs short but structured. You want enough information to debug failures, but not so much that it’s overwhelming.
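One compact JSON line per agent step is usually enough. Here's a sketch, with illustrative field names:

import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent.sandbox")

def log_step(scenario: str, step: int, tool_name: str, ok: bool) -> None:
    """Emit one JSON line per agent step; easy to grep in CI artifacts."""
    logger.info(json.dumps({
        "ts": round(time.time(), 3),
        "scenario": scenario,
        "step": step,
        "tool": tool_name,
        "ok": ok,
    }))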
Code example: Scenario runner
Here’s a complete scenario runner:
import json
import sys
from pathlib import Path

import yaml

def load_scenario(path: str) -> dict:
    """Load a scenario from a YAML file."""
    with open(path, "r") as f:
        return yaml.safe_load(f)

def run_scenario(scenario: dict, agent: Agent) -> dict:
    """Run agent against scenario and return results."""
    result = {
        "scenario": scenario["name"],
        "passed": False,
        "errors": [],
        "tool_calls": []
    }

    try:
        response = agent.process(scenario["user_message"])
        result["tool_calls"] = response.tool_calls

        # Check max steps
        max_steps = scenario.get("max_steps", 10)
        if len(response.tool_calls) > max_steps:
            result["errors"].append(
                f"Too many steps: {len(response.tool_calls)} > {max_steps}"
            )

        # Check expected tools
        expected = {t["name"] for t in scenario.get("expected_tools", [])}
        actual = {tc["tool_name"] for tc in response.tool_calls}
        if expected and not expected.issubset(actual):
            result["errors"].append(f"Missing tools: {expected - actual}")

        result["passed"] = len(result["errors"]) == 0
    except Exception as e:
        result["errors"].append(str(e))
        result["passed"] = False

    return result

def main():
    """Run all scenarios and print results."""
    scenario_dir = Path("scenarios")
    scenarios = list(scenario_dir.glob("*.yaml"))
    agent = create_agent(use_mocks=True)

    results = []
    for scenario_path in scenarios:
        scenario = load_scenario(str(scenario_path))
        result = run_scenario(scenario, agent)
        results.append(result)

        status = "PASS" if result["passed"] else "FAIL"
        print(f"{status}: {result['scenario']}")
        if result["errors"]:
            for error in result["errors"]:
                print(f"  ERROR: {error}")

    # Exit with non-zero if any scenario failed
    failed = [r for r in results if not r["passed"]]
    if failed:
        print(f"\n{len(failed)} scenario(s) failed")
        sys.exit(1)
    else:
        print("\nAll scenarios passed")
        sys.exit(0)

if __name__ == "__main__":
    main()
Run this script in CI. If any scenario fails, the job fails and the PR is blocked.
Integrating into CI/CD
Here’s a GitHub Actions pipeline that runs all the checks:
name: Agent CI

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install ruff
      - run: ruff check .

  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: pip install pytest pytest-cov
      - run: pytest tests/unit/ --cov=src --cov-report=xml
      - uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml

  sandbox-scenarios:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - name: Run sandbox scenarios
        run: python scripts/run_scenarios.py
        env:
          USE_MOCKS: "true"
      - name: Upload scenario logs
        if: failure()
        uses: actions/upload-artifact@v3
        with:
          name: scenario-logs
          path: logs/scenarios/

  policy-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - name: Run policy checks
        run: python scripts/check_policies.py
This pipeline has four stages:
- Lint + static checks: Formatting, type checking, basic validation
- Unit tests: Tool tests, prompt module tests, contract validation
- Sandbox scenario runs: Full agent runs against fixed scenarios
- Policy checks: OPA rules or custom Python checks
If any stage fails, the PR is blocked.
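The workflow calls scripts/check_policies.py, which isn't shown above. Here's a minimal sketch that assumes custom Python checks over the scenario files rather than OPA; the forbidden-tool list is illustrative:

# scripts/check_policies.py -- a sketch; adapt to your own policy source
import sys
from pathlib import Path

import yaml

FORBIDDEN_TOOLS = {"delete_user", "drop_table"}  # example policy

def main() -> int:
    errors = []
    for path in Path("scenarios").glob("*.yaml"):
        scenario = yaml.safe_load(path.read_text())
        if "max_steps" not in scenario:
            errors.append(f"{path}: missing max_steps")
        for tool in scenario.get("expected_tools", []):
            if tool["name"] in FORBIDDEN_TOOLS:
                errors.append(f"{path}: forbidden tool {tool['name']}")
    for error in errors:
        print(f"POLICY VIOLATION: {error}")
    return 1 if errors else 0

if __name__ == "__main__":
    sys.exit(main())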
Exposing results
Show results in the PR:
- Status checks (green checkmark or red X)
- Comments with summaries
- Artifacts with detailed logs
GitHub Actions automatically shows status checks. For more detail, add a comment:
def post_pr_comment(results: list):
    """Post scenario results as a PR comment."""
    body = "## Sandbox Scenario Results\n\n"

    for result in results:
        status = "✅" if result["passed"] else "❌"
        body += f"{status} {result['scenario']}\n"
        if result["errors"]:
            body += "```\n"
            for error in result["errors"]:
                body += f"{error}\n"
            body += "```\n"

    # Post to GitHub PR using GitHub API
    # (implementation omitted)
Failing the job
The scenario runner exits with a non-zero code if any scenario fails. GitHub Actions treats this as a job failure, which blocks the PR from merging.
import sys

if __name__ == "__main__":
    results = run_all_scenarios()
    failed = [r for r in results if not r["passed"]]
    if failed:
        print(f"{len(failed)} scenario(s) failed")
        sys.exit(1)  # This fails the CI job
    else:
        print("All scenarios passed")
        sys.exit(0)
Practical rollout checklist
Start small. Don’t try to test everything at once.
Phase 1: Basics (Week 1)
- 3-5 sandbox scenarios
- 2-3 critical tools mocked
- Basic contract validation (tool names, parameter types)
- Simple policy: max steps = 10
Phase 2: Expand (Weeks 2-3)
- Add more scenarios (edge cases, error paths)
- Mock all external tools
- Add policy checks (no PII, no dangerous operations)
- Snapshot tests for prompt modules
Phase 3: Mature (Month 2+)
- Contract versioning
- Performance benchmarks (max latency)
- Cost tracking (token usage per scenario)
- Automated scenario generation
How to expand without blocking the team
- Make tests optional at first (warn, don’t fail)
- Gradually make them required
- Add tests incrementally (one scenario per PR)
- Document failures and fixes
The goal is to catch issues early, not to create friction. If tests are too strict, developers will disable them or work around them.
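For the warn-only phase, GitHub Actions' continue-on-error flag is one simple option: the job still runs and reports failures, but it doesn't block the merge. Applied to the sandbox job from the pipeline above:

sandbox-scenarios:
  runs-on: ubuntu-latest
  continue-on-error: true  # warn-only during rollout; remove this line to make it blocking
  steps:
    - uses: actions/checkout@v3
    # ...rest of the job unchanged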
Conclusion
Shift-left DevOps for AI agents means testing earlier. Test tools, prompts, and workflows in CI. Use mocks, sandboxes, and contracts.
Start with a few scenarios and basic checks. Expand over time. The key is to fail fast and keep production simple.
Agents are different from traditional services, but the testing principles are the same: catch bugs early, use mocks, test contracts, and automate everything.