By Ali Elborey

Designing Reliable Sagas: A Practical Guide to Distributed Transactions in Microservices

microservices, distributed-systems, saga-pattern, transaction-management, event-driven-architecture

You’re building an e-commerce system. A customer places an order. Your system needs to:

  1. Create the order
  2. Charge the payment card
  3. Reserve inventory
  4. Schedule shipment

Simple, right? Until the payment service times out after the inventory is already reserved. Now you have stock locked up but no payment. Or the inventory service fails after the payment is charged. Now you have money but no product to ship.

This is the distributed transaction problem. In a monolith, you’d wrap everything in a single database transaction. In microservices, that doesn’t work: each service has its own database, and two-phase commit across services is too slow and too fragile to scale.

Sagas solve this. They’re sequences of local transactions with compensating actions. If something fails, you run compensations in reverse order to undo what you’ve done.

But designing sagas well is harder than it sounds. This guide shows you how.

The Real Problem: Multi-Service Workflows That Fail in Production

Let’s look at what actually breaks when you try to coordinate multiple services.

The Naive Approach: Synchronous REST Chain

Here’s what many teams do first:

async function placeOrder(orderData: OrderData) {
  // Step 1: Create order
  const order = await orderService.create(orderData);
  
  // Step 2: Reserve inventory
  await inventoryService.reserve(order.productId, order.quantity);
  
  // Step 3: Charge payment
  await paymentService.charge(order.customerId, order.total);
  
  // Step 4: Schedule shipment
  await shipmentService.schedule(order.id);
  
  return order;
}

This breaks in several ways:

One service down = everything breaks. If the payment service is slow or down, the whole flow blocks. The inventory reservation times out. The customer waits. Nothing completes.

Partial failures create “ghost” orders. If the payment fails after inventory is reserved, you have stock locked up. If inventory fails after payment, you’ve charged the customer but can’t fulfill the order.

Retries without idempotency cause duplicates. Network timeouts trigger retries. Without idempotency checks, you might charge the customer twice or reserve inventory multiple times.

No way to roll back. Once a step completes, you can’t undo it. You’re stuck with inconsistent state.

Why ACID and 2PC Don’t Work

Traditional database transactions use ACID properties: Atomicity, Consistency, Isolation, Durability. They work great within a single database.

Two-phase commit (2PC) tries to extend this across databases. It works like this:

  1. Prepare phase: Ask all services if they can commit
  2. Commit phase: If everyone says yes, tell them all to commit

The problem: 2PC is slow. It requires multiple round trips. It blocks resources while waiting for all services to respond. In a distributed system, one slow service blocks everything.
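
To make the blocking concrete, here’s a rough sketch of a 2PC coordinator. The Participant interface here is hypothetical, not a real protocol library:

interface Participant {
  prepare(txId: string): Promise<boolean>;
  commit(txId: string): Promise<void>;
  abort(txId: string): Promise<void>;
}

async function twoPhaseCommit(txId: string, participants: Participant[]) {
  // Phase 1: every participant must vote yes - one slow service stalls them all
  const votes = await Promise.all(participants.map(p => p.prepare(txId)));

  if (votes.every(v => v)) {
    // Phase 2: commit everywhere; resources stay locked until this completes
    await Promise.all(participants.map(p => p.commit(txId)));
  } else {
    await Promise.all(participants.map(p => p.abort(txId)));
  }
}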

More importantly, 2PC doesn’t handle network partitions well. If services can’t communicate, transactions hang. This is why modern microservices avoid 2PC.

What Is a Saga, Really?

A saga is a sequence of local transactions. Each transaction updates data in one service. If a transaction fails, you run compensating transactions to undo previous steps.

Think of it like this:

  • Local transaction: Update data in one service’s database
  • Compensating transaction: Undo what a local transaction did
  • Retryable step: A step that can safely be retried if it fails

Example: Order Checkout Saga

Here’s how an order checkout saga works:

  1. Reserve inventory (local transaction)

    • Compensation: Release inventory
  2. Charge payment (local transaction)

    • Compensation: Refund payment
  3. Confirm order (local transaction)

    • Compensation: Cancel order

If payment fails, you release the inventory. If inventory fails, there’s nothing to compensate yet. If order confirmation fails, you refund the payment and release inventory.
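
One useful habit is to capture the steps and their compensations as data rather than burying them in code. A minimal sketch, mirroring the names above:

interface SagaStepDefinition {
  name: string;
  compensation: string | null; // null when there is nothing to undo
}

const orderCheckoutSaga: SagaStepDefinition[] = [
  { name: 'ReserveInventory', compensation: 'ReleaseInventory' },
  { name: 'ChargePayment', compensation: 'RefundPayment' },
  { name: 'ConfirmOrder', compensation: 'CancelOrder' }
];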

Pros and Cons vs Traditional Distributed Transactions

Pros:

  • Works across service boundaries
  • No blocking locks
  • Services stay decoupled
  • Can handle long-running workflows

Cons:

  • Eventual consistency (not immediate)
  • Need to design compensations carefully
  • Harder to reason about than ACID transactions
  • Requires idempotency to handle retries

The key insight: You trade immediate consistency for availability and scalability. That’s usually the right trade-off for microservices.

Orchestration vs Choreography

Sagas come in two flavors. Each has trade-offs.

Orchestrated Saga: Central Coordinator

An orchestrated saga has a central “Saga Orchestrator” that coordinates everything. It calls services, tracks state, and triggers compensations.

class OrderSagaOrchestrator {
  async execute(orderData: OrderData): Promise<OrderResult> {
    const sagaId = generateId();
    const sagaState = { id: sagaId, status: 'PENDING', steps: [] };
    
    try {
      // Step 1: Reserve inventory
      const inventory = await this.inventoryService.reserve(
        orderData.productId, 
        orderData.quantity,
        sagaId
      );
      sagaState.steps.push({ name: 'RESERVE_INVENTORY', status: 'COMPLETED' });
      
      // Step 2: Charge payment
      const payment = await this.paymentService.charge(
        orderData.customerId,
        orderData.total,
        sagaId
      );
      sagaState.steps.push({ name: 'CHARGE_PAYMENT', status: 'COMPLETED' });
      
      // Step 3: Confirm order
      const order = await this.orderService.confirm(orderData, sagaId);
      sagaState.steps.push({ name: 'CONFIRM_ORDER', status: 'COMPLETED' });
      
      sagaState.status = 'COMPLETED';
      return { success: true, orderId: order.id };
      
    } catch (error) {
      // Compensate in reverse order
      await this.compensate(sagaState);
      throw error;
    }
  }
  
  private async compensate(sagaState: SagaState) {
    // Run compensations in reverse order
    for (const step of sagaState.steps.reverse()) {
      if (step.status === 'COMPLETED') {
        switch (step.name) {
          case 'CONFIRM_ORDER':
            await this.orderService.cancel(sagaState.id);
            break;
          case 'CHARGE_PAYMENT':
            await this.paymentService.refund(sagaState.id);
            break;
          case 'RESERVE_INVENTORY':
            await this.inventoryService.release(sagaState.id);
            break;
        }
      }
    }
    sagaState.status = 'COMPENSATED';
  }
}

Advantages:

  • Easier to reason about—all logic in one place
  • Central place for workflow logic
  • Easier to test and debug
  • Can handle complex branching logic

Risks:

  • Can become a “God service” if you’re not careful
  • Single point of failure (though you can make it resilient)
  • Tighter coupling to business logic

Choreographed Saga: Event-Driven

A choreographed saga uses events. Services publish domain events and listen to others. There’s no central coordinator.

// Order Service
class OrderService {
  async handleOrderCreated(event: OrderCreatedEvent) {
    const order = await this.createOrder(event.orderData);
    
    // Publish next event
    await this.eventBus.publish({
      type: 'InventoryReservationRequested',
      sagaId: event.sagaId,
      orderId: order.id,
      productId: order.productId,
      quantity: order.quantity
    });
  }
  
  async handlePaymentProcessed(event: PaymentProcessedEvent) {
    await this.confirmOrder(event.orderId);
    
    await this.eventBus.publish({
      type: 'OrderConfirmed',
      sagaId: event.sagaId,
      orderId: event.orderId
    });
  }
}

// Inventory Service
class InventoryService {
  async handleInventoryReservationRequested(event: InventoryReservationRequested) {
    const reservation = await this.reserveInventory(
      event.productId,
      event.quantity,
      event.sagaId
    );
    
    await this.eventBus.publish({
      type: 'InventoryReserved',
      sagaId: event.sagaId,
      reservationId: reservation.id
    });
  }
  
  async handlePaymentFailed(event: PaymentFailed) {
    // Compensation: release inventory
    await this.releaseInventory(event.sagaId);
  }
}

Advantages:

  • More decoupled—services don’t know about each other
  • Scales well—no central bottleneck
  • Natural fit for event-driven architectures
  • Services can evolve independently

Disadvantages:

  • Harder to trace—flow is distributed
  • Harder to debug—no single place to see the workflow
  • Can become hard to understand as it grows
  • Compensations are scattered across services

Decision Guide

Use orchestration when:

  • The workflow is complex with branching logic
  • You need strong guarantees about execution order
  • The workflow is business-critical and needs central oversight
  • You’re starting out and want easier debugging

Use choreography when:

  • The workflow is simple and linear
  • Services are naturally event-driven
  • You want maximum decoupling
  • The domain fits an event-centric model

Hybrid approach: Use orchestrated “high-level” sagas over choreographed sub-flows. The orchestrator handles the main flow, but delegates to event-driven sub-processes.

Designing a Saga Step by Step

Let’s design a complete saga using the Order Checkout example. We’ll use orchestration for clarity, but the principles apply to choreography too.

Step 1: Map the Business Process as a State Machine

Start by modeling your saga as a state machine:

States:
- PENDING → RESERVED → CHARGED → SHIPPED → COMPLETED
- Any state → FAILED → COMPENSATED

Events:
- OrderCreated → triggers ReserveInventory
- InventoryReserved → triggers ChargePayment
- PaymentCaptured → triggers ConfirmOrder
- ShipmentScheduled → triggers CompleteOrder
- Any failure → triggers Compensation

Here’s the TypeScript type:

type SagaState = 
  | 'PENDING'
  | 'RESERVED'
  | 'CHARGED'
  | 'SHIPPED'
  | 'COMPLETED'
  | 'FAILED'
  | 'COMPENSATED';

interface SagaStep {
  name: string;
  status: 'PENDING' | 'COMPLETED' | 'FAILED' | 'COMPENSATED';
  completedAt?: Date;
  failedAt?: Date;
  error?: string;
  result?: any;
}

interface Saga {
  id: string;
  state: SagaState;
  steps: SagaStep[];
  startedAt: Date;
  completedAt?: Date;
  compensatedAt?: Date;
  payload: any;
}

Step 2: Decide Service Boundaries

Which service owns which step?

  • Order Service: Creates and confirms orders
  • Inventory Service: Reserves and releases inventory
  • Payment Service: Charges and refunds payments
  • Shipment Service: Schedules shipments

Each service owns its data. The orchestrator coordinates but doesn’t own the data.

Step 3: Define Compensations

For each step, define what happens if you need to undo it:

  • Reserve inventory → Release inventory (return stock to available)
  • Charge payment → Refund payment (return money to customer)
  • Confirm order → Cancel order (mark as cancelled)
  • Schedule shipment → Cancel shipment (remove from schedule)

Some steps can’t be compensated. For example, if you’ve already shipped the product, you can’t “un-ship” it. In that case, you might need manual intervention or a different compensation strategy.

Step 4: Define Retry Rules and Idempotency

Every step must be idempotent. If you retry a step, it should produce the same result.

Use unique saga IDs and step sequence numbers:

interface Command {
  sagaId: string;
  stepSequence: number;
  idempotencyKey: string;
  payload: any;
}

class InventoryService {
  private processedKeys = new Set<string>();
  
  async reserveInventory(command: Command): Promise<Reservation> {
    // Check idempotency
    if (this.processedKeys.has(command.idempotencyKey)) {
      // Return the existing result
      return await this.getExistingReservation(command.sagaId);
    }
    
    // Process the reservation
    const reservation = await this.db.transaction(async (tx) => {
      // Check stock availability
      const product = await tx.products.findOne({ 
        id: command.payload.productId 
      });
      
      if (product.stock < command.payload.quantity) {
        throw new Error('Insufficient stock');
      }
      
      // Reserve inventory
      await tx.products.update(
        { id: command.payload.productId },
        { stock: product.stock - command.payload.quantity }
      );
      
      // Create reservation record
      const reservation = await tx.reservations.create({
        sagaId: command.sagaId,
        productId: command.payload.productId,
        quantity: command.payload.quantity,
        status: 'RESERVED'
      });
      
      return reservation;
    });
    
    // Mark as processed
    this.processedKeys.add(command.idempotencyKey);
    
    return reservation;
  }
}
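
One caveat: the in-memory processedKeys set above is only illustrative. It disappears on restart and isn’t shared across instances. In practice you’d persist idempotency keys in the service’s database, for example behind a primary-key constraint. A sketch, assuming a hypothetical processed_commands table and node-postgres style queries as in the later sections:

import { Pool } from 'pg';

// Returns true if this command hasn't been processed before;
// the PRIMARY KEY on idempotency_key makes duplicates a no-op.
async function markProcessed(db: Pool, command: Command): Promise<boolean> {
  const result = await db.query(
    `INSERT INTO processed_commands (idempotency_key, saga_id)
     VALUES ($1, $2)
     ON CONFLICT (idempotency_key) DO NOTHING`,
    [command.idempotencyKey, command.sagaId]
  );
  return result.rowCount === 1;
}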

Step 5: Handle Timeouts and Dead Letters

What if a service never replies?

class SagaOrchestrator {
  private readonly STEP_TIMEOUT = 30000; // 30 seconds
  
  async executeStep(
    stepName: string,
    action: () => Promise<any>,
    sagaId: string
  ): Promise<any> {
    const timeoutPromise = new Promise((_, reject) => {
      setTimeout(() => reject(new Error('Step timeout')), this.STEP_TIMEOUT);
    });
    
    try {
      const result = await Promise.race([action(), timeoutPromise]);
      return result;
    } catch (error) {
      // Check if this is a retryable error
      if (this.isRetryable(error)) {
        // Retry with exponential backoff
        return await this.retryWithBackoff(stepName, action, sagaId);
      }
      
      // Non-retryable error - trigger compensation
      throw error;
    }
  }
  
  private isRetryable(error: Error): boolean {
    // Network errors, timeouts are retryable
    // Business logic errors (e.g., insufficient funds) are not
    return error.message.includes('timeout') || 
           error.message.includes('network') ||
           error.message.includes('ECONNREFUSED');
  }
}
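
The retryWithBackoff method referenced above isn’t shown; one possible sketch, inside the same class, retries a few times with exponential delays before giving up:

  private async retryWithBackoff(
    stepName: string,
    action: () => Promise<any>,
    sagaId: string,
    maxAttempts = 3
  ): Promise<any> {
    let lastError: Error = new Error(`Step ${stepName} failed for saga ${sagaId}`);
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      // Wait 1s, 2s, 4s... before each attempt
      await new Promise(resolve => setTimeout(resolve, 1000 * 2 ** (attempt - 1)));
      try {
        return await action();
      } catch (error) {
        lastError = error as Error;
        if (!this.isRetryable(lastError)) break;
      }
    }
    // Retries exhausted - rethrow so the caller triggers compensation
    throw lastError;
  }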

For dead letters (messages that can’t be processed), use a dead letter queue:

class DeadLetterHandler {
  async handleDeadLetter(message: Command, error: Error) {
    // Log for manual review
    await this.logger.error({
      sagaId: message.sagaId,
      step: message.stepSequence,
      error: error.message,
      payload: message.payload,
      timestamp: new Date()
    });
    
    // Notify operations team
    await this.notifyOps({
      type: 'DEAD_LETTER',
      sagaId: message.sagaId,
      requiresManualIntervention: true
    });
    
    // Mark saga as failed
    await this.sagaStore.updateState(message.sagaId, 'FAILED');
  }
}

Implementation Patterns

Let’s build a complete implementation using Node.js, TypeScript, and PostgreSQL. We’ll use Kafka for messaging, but the patterns work with RabbitMQ or other brokers.

Architecture Overview

┌──────────────────┐
│    API Gateway   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐     ┌──────────────┐
│ Saga Orchestrator│────▶│    Kafka     │
└────────┬─────────┘     └──────┬───────┘
         │                      │
         ▼                      ▼
┌──────────────────┐     ┌──────────────┐
│    PostgreSQL    │     │   Services   │
│   (Saga State)   │     │  (Order,     │
└──────────────────┘     │   Payment,   │
                         │   Inventory) │
                         └──────────────┘

Saga State Model

First, define the database schema:

CREATE TABLE sagas (
  id VARCHAR(255) PRIMARY KEY,
  state VARCHAR(50) NOT NULL,
  type VARCHAR(100) NOT NULL,
  payload JSONB NOT NULL,
  started_at TIMESTAMP NOT NULL,
  completed_at TIMESTAMP,
  compensated_at TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE saga_steps (
  id SERIAL PRIMARY KEY,
  saga_id VARCHAR(255) NOT NULL REFERENCES sagas(id),
  step_name VARCHAR(100) NOT NULL,
  step_sequence INTEGER NOT NULL,
  status VARCHAR(50) NOT NULL,
  result JSONB,
  error TEXT,
  completed_at TIMESTAMP,
  failed_at TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_sagas_state ON sagas(state);
CREATE INDEX idx_saga_steps_saga_id ON saga_steps(saga_id);

TypeScript types:

interface SagaRecord {
  id: string;
  state: SagaState;
  type: string;
  payload: any;
  startedAt: Date;
  completedAt?: Date;
  compensatedAt?: Date;
}

interface SagaStepRecord {
  id: number;
  sagaId: string;
  stepName: string;
  stepSequence: number;
  status: 'PENDING' | 'COMPLETED' | 'FAILED' | 'COMPENSATED';
  result?: any;
  error?: string;
  completedAt?: Date;
  failedAt?: Date;
}

Orchestrator Implementation

Here’s the core orchestrator:

import { Pool } from 'pg';
import { Kafka } from 'kafkajs';

class SagaOrchestrator {
  constructor(
    private db: Pool,
    private kafka: Kafka
  ) {}
  
  async startSaga(
    sagaType: string,
    payload: any
  ): Promise<string> {
    const sagaId = generateId();
    const producer = this.kafka.producer();
    
    await producer.connect();
    
    // Create saga record
    await this.db.query(
      `INSERT INTO sagas (id, state, type, payload, started_at)
       VALUES ($1, $2, $3, $4, NOW())`,
      [sagaId, 'PENDING', sagaType, JSON.stringify(payload)]
    );
    
    // Record the first step so later status updates and MAX(step_sequence)
    // lookups have a row to work with
    await this.db.query(
      `INSERT INTO saga_steps (saga_id, step_name, step_sequence, status)
       VALUES ($1, 'ReserveInventory', 1, 'PENDING')`,
      [sagaId]
    );
    
    // Publish the initial command on the topic the inventory service consumes
    await producer.send({
      topic: 'reserveinventory.commands',
      messages: [{
        key: sagaId,
        value: JSON.stringify({
          sagaId,
          stepSequence: 1,
          command: 'ReserveInventory',
          payload,
          idempotencyKey: `${sagaId}-1`
        })
      }]
    });
    
    await producer.disconnect();
    
    return sagaId;
  }
  
  async handleEvent(event: SagaEvent) {
    const saga = await this.getSaga(event.sagaId);
    if (!saga) {
      throw new Error(`Saga ${event.sagaId} not found`);
    }
    
    // Update step status
    await this.updateStepStatus(
      event.sagaId,
      event.stepSequence,
      'COMPLETED',
      event.result
    );
    
    // Determine next step
    const nextStep = this.getNextStep(saga.type, event.stepSequence);
    
    if (nextStep) {
      await this.executeNextStep(saga.id, nextStep, saga.payload);
    } else {
      // Saga complete
      await this.completeSaga(saga.id);
    }
  }
  
  async handleFailure(event: SagaFailureEvent) {
    const saga = await this.getSaga(event.sagaId);
    if (!saga) return;
    
    // Mark step as failed
    await this.updateStepStatus(
      event.sagaId,
      event.stepSequence,
      'FAILED',
      null,
      event.error
    );
    
    // Trigger compensation
    await this.compensate(saga.id);
  }
  
  private async compensate(sagaId: string) {
    const steps = await this.getCompletedSteps(sagaId);
    
    // Compensate in reverse order
    for (const step of steps.reverse()) {
      const producer = this.kafka.producer();
      await producer.connect();
      
      await producer.send({
        topic: 'saga.compensations',
        messages: [{
          key: sagaId,
          value: JSON.stringify({
            sagaId,
            stepName: step.stepName,
            compensation: this.getCompensationCommand(step.stepName),
            payload: step.result
          })
        }]
      });
      
      await producer.disconnect();
      
      // Mark step as compensated
      await this.updateStepStatus(
        sagaId,
        step.stepSequence,
        'COMPENSATED'
      );
    }
    
    // Mark saga as compensated
    await this.db.query(
      `UPDATE sagas SET state = 'COMPENSATED', compensated_at = NOW()
       WHERE id = $1`,
      [sagaId]
    );
  }
  
  private async getSaga(sagaId: string): Promise<SagaRecord | null> {
    const result = await this.db.query(
      `SELECT * FROM sagas WHERE id = $1`,
      [sagaId]
    );
    return result.rows[0] || null;
  }
  
  private async updateStepStatus(
    sagaId: string,
    stepSequence: number,
    status: string,
    result?: any,
    error?: string
  ) {
    if (status === 'COMPLETED') {
      await this.db.query(
        `UPDATE saga_steps 
         SET status = $1, result = $2, completed_at = NOW()
         WHERE saga_id = $3 AND step_sequence = $4`,
        [status, JSON.stringify(result), sagaId, stepSequence]
      );
    } else if (status === 'FAILED') {
      await this.db.query(
        `UPDATE saga_steps 
         SET status = $1, error = $2, failed_at = NOW()
         WHERE saga_id = $3 AND step_sequence = $4`,
        [status, error, sagaId, stepSequence]
      );
    } else {
      // e.g. 'COMPENSATED' - update the status only
      await this.db.query(
        `UPDATE saga_steps 
         SET status = $1
         WHERE saga_id = $2 AND step_sequence = $3`,
        [status, sagaId, stepSequence]
      );
    }
  }
  
  private getNextStep(
    sagaType: string,
    currentStep: number
  ): string | null {
    const steps: Record<string, string[]> = {
      'OrderCheckout': [
        'ReserveInventory',
        'ChargePayment',
        'ConfirmOrder'
      ]
    };
    
    const sagaSteps = steps[sagaType] || [];
    const nextIndex = currentStep;
    
    if (nextIndex >= sagaSteps.length) {
      return null;
    }
    
    return sagaSteps[nextIndex];
  }
  
  private async executeNextStep(
    sagaId: string,
    stepName: string,
    payload: any
  ) {
    const producer = this.kafka.producer();
    await producer.connect();
    
    // Get current step sequence
    const stepResult = await this.db.query(
      `SELECT MAX(step_sequence) as max_seq FROM saga_steps WHERE saga_id = $1`,
      [sagaId]
    );
    const nextSequence = (stepResult.rows[0].max_seq || 0) + 1;
    
    // Create step record
    await this.db.query(
      `INSERT INTO saga_steps (saga_id, step_name, step_sequence, status)
       VALUES ($1, $2, $3, 'PENDING')`,
      [sagaId, stepName, nextSequence]
    );
    
    // Publish command
    await producer.send({
      topic: `${stepName.toLowerCase()}.commands`,
      messages: [{
        key: sagaId,
        value: JSON.stringify({
          sagaId,
          stepSequence: nextSequence,
          command: stepName,
          payload,
          idempotencyKey: `${sagaId}-${nextSequence}`
        })
      }]
    });
    
    await producer.disconnect();
  }
  
  private getCompensationCommand(stepName: string): string {
    const compensations: Record<string, string> = {
      'ReserveInventory': 'ReleaseInventory',
      'ChargePayment': 'RefundPayment',
      'ConfirmOrder': 'CancelOrder'
    };
    return compensations[stepName] || 'Unknown';
  }
}
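
Nothing above actually subscribes the orchestrator to the event and failure topics the services publish to. A minimal wiring sketch, using the topic names from this section:

async function runOrchestratorConsumers(
  orchestrator: SagaOrchestrator,
  kafka: Kafka
) {
  const consumer = kafka.consumer({ groupId: 'saga-orchestrator' });
  await consumer.connect();
  await consumer.subscribe({ topic: 'saga.events' });
  await consumer.subscribe({ topic: 'saga.failures' });
  
  await consumer.run({
    eachMessage: async ({ topic, message }) => {
      const payload = JSON.parse(message.value!.toString());
      if (topic === 'saga.events') {
        await orchestrator.handleEvent(payload);
      } else {
        await orchestrator.handleFailure(payload);
      }
    }
  });
}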

Service Event Handler

Here’s how a service (Inventory Service) handles commands:

class InventoryService {
  constructor(
    private db: Pool,
    private kafka: Kafka,
    private processedKeys: Set<string>
  ) {}
  
  async start() {
    const consumer = this.kafka.consumer({ groupId: 'inventory-service' });
    await consumer.connect();
    await consumer.subscribe({ topic: 'reserveinventory.commands' });
    
    await consumer.run({
      eachMessage: async ({ message }) => {
        const command = JSON.parse(message.value!.toString());
        await this.handleReserveInventory(command);
      }
    });
  }
  
  async handleReserveInventory(command: Command) {
    const idempotencyKey = command.idempotencyKey;
    
    // Check idempotency
    if (this.processedKeys.has(idempotencyKey)) {
      console.log(`Skipping duplicate command: ${idempotencyKey}`);
      return;
    }
    
    const producer = this.kafka.producer();
    await producer.connect();
    
    try {
      // Reserve inventory in an explicit transaction. node-postgres doesn't
      // accept parameters in multi-statement queries, so use a dedicated
      // client and run each statement separately.
      const client = await this.db.connect();
      let reservationId: number | undefined;
      try {
        await client.query('BEGIN');
        
        // Decrement stock only if enough is available
        const updated = await client.query(
          `UPDATE products
           SET stock = stock - $2
           WHERE id = $1 AND stock >= $2`,
          [command.payload.productId, command.payload.quantity]
        );
        if (updated.rowCount === 0) {
          throw new Error('Insufficient stock');
        }
        
        const inserted = await client.query(
          `INSERT INTO reservations (saga_id, product_id, quantity, status)
           VALUES ($1, $2, $3, 'RESERVED')
           RETURNING id`,
          [command.sagaId, command.payload.productId, command.payload.quantity]
        );
        reservationId = inserted.rows[0].id;
        
        await client.query('COMMIT');
      } catch (txError) {
        await client.query('ROLLBACK');
        throw txError;
      } finally {
        client.release();
      }
      
      // Mark as processed
      this.processedKeys.add(idempotencyKey);
      
      // Publish success event
      await producer.send({
        topic: 'saga.events',
        messages: [{
          key: command.sagaId,
          value: JSON.stringify({
            sagaId: command.sagaId,
            stepSequence: command.stepSequence,
            event: 'InventoryReserved',
            result: { reservationId }
          })
        }]
      });
      
    } catch (error) {
      // Publish failure event
      await producer.send({
        topic: 'saga.failures',
        messages: [{
          key: command.sagaId,
          value: JSON.stringify({
            sagaId: command.sagaId,
            stepSequence: command.stepSequence,
            error: error.message
          })
        }]
      });
    } finally {
      await producer.disconnect();
    }
  }
  
  async handleReleaseInventory(compensation: CompensationCommand) {
    // Compensation: return the reserved quantity to stock, then mark the
    // reservation as released (two separate queries - node-postgres rejects
    // parameterized multi-statement queries)
    await this.db.query(`
      UPDATE products p
      SET stock = p.stock + r.quantity
      FROM reservations r
      WHERE r.saga_id = $1 AND r.product_id = p.id AND r.status = 'RESERVED'
    `, [compensation.sagaId]);
    
    await this.db.query(`
      UPDATE reservations
      SET status = 'RELEASED'
      WHERE saga_id = $1 AND status = 'RESERVED'
    `, [compensation.sagaId]);
  }
}

Compensation Handler

Compensations run in reverse order:

class CompensationHandler {
  constructor(
    private db: Pool,
    private kafka: Kafka,
    private services: Map<string, Service>
  ) {}
  
  async start() {
    const consumer = this.kafka.consumer({ groupId: 'compensation-handler' });
    await consumer.connect();
    await consumer.subscribe({ topic: 'saga.compensations' });
    
    await consumer.run({
      eachMessage: async ({ message }) => {
        const compensation = JSON.parse(message.value!.toString());
        await this.handleCompensation(compensation);
      }
    });
  }
  
  async handleCompensation(compensation: CompensationCommand) {
    const service = this.services.get(compensation.stepName);
    if (!service) {
      console.error(`No service found for ${compensation.stepName}`);
      return;
    }
    
    try {
      await service.compensate(compensation);
      
      // Publish compensation completed event
      const producer = this.kafka.producer();
      await producer.connect();
      
      await producer.send({
        topic: 'saga.compensations.completed',
        messages: [{
          key: compensation.sagaId,
          value: JSON.stringify({
            sagaId: compensation.sagaId,
            stepName: compensation.stepName,
            status: 'COMPENSATED'
          })
        }]
      });
      
      await producer.disconnect();
    } catch (error) {
      console.error(`Compensation failed for ${compensation.stepName}:`, error);
      // Log for manual intervention
      await this.logCompensationFailure(compensation, error);
    }
  }
}

Observability and Debugging Sagas

Sagas are hard to debug. You need to see what’s happening across services.

Correlation IDs and Saga IDs

Use correlation IDs in all logs, traces, and metrics:

import { trace, SpanStatusCode } from '@opentelemetry/api';

class TracedSagaOrchestrator extends SagaOrchestrator {
  async startSaga(sagaType: string, payload: any): Promise<string> {
    const tracer = trace.getTracer('saga-orchestrator');
    const span = tracer.startSpan('start_saga');
    
    span.setAttribute('saga.type', sagaType);
    
    try {
      const sagaId = await super.startSaga(sagaType, payload);
      span.setAttribute('saga.id', sagaId);
      span.setStatus({ code: SpanStatusCode.OK });
      return sagaId;
    } catch (error) {
      span.setStatus({ 
        code: SpanStatusCode.ERROR, 
        message: error.message 
      });
      throw error;
    } finally {
      span.end();
    }
  }
}

Distributed Tracing

Use OpenTelemetry to trace the full saga:

import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { Resource } from '@opentelemetry/resources';

const tracerProvider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'saga-orchestrator' })
});

tracerProvider.addSpanProcessor(
  new BatchSpanProcessor(new JaegerExporter())
);

tracerProvider.register();

Metrics to Track

Track these metrics:

  • Saga success/failure rate
  • Average completion time
  • Compensation frequency
  • Step failure rates
  • Timeout rates

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('saga-orchestrator');

const sagaCounter = meter.createCounter('sagas_total', {
  description: 'Total number of sagas'
});

const sagaDuration = meter.createHistogram('saga_duration_seconds', {
  description: 'Saga execution duration'
});

const compensationCounter = meter.createCounter('compensations_total', {
  description: 'Total number of compensations'
});

// In your orchestrator
sagaCounter.add(1, { type: sagaType, status: 'started' });

const startTime = Date.now();
// ... execute saga ...
const duration = (Date.now() - startTime) / 1000;
sagaDuration.record(duration, { type: sagaType });

Saga Inspector View

Build a simple admin UI to inspect sagas:

// API endpoint to get saga state
app.get('/api/sagas/:id', async (req, res) => {
  const saga = await db.query(
    `SELECT s.*, 
            json_agg(
              json_build_object(
                'stepName', st.step_name,
                'sequence', st.step_sequence,
                'status', st.status,
                'completedAt', st.completed_at,
                'error', st.error
              ) ORDER BY st.step_sequence
            ) as steps
     FROM sagas s
     LEFT JOIN saga_steps st ON s.id = st.saga_id
     WHERE s.id = $1
     GROUP BY s.id`,
    [req.params.id]
  );
  
  res.json(saga.rows[0]);
});
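
It also helps to list sagas that look stuck, meaning sagas sitting in a non-terminal state for too long. A sketch (the one-hour threshold is arbitrary):

// API endpoint to list sagas stuck in a non-terminal state
app.get('/api/sagas/stuck', async (req, res) => {
  const stuck = await db.query(
    `SELECT id, type, state, started_at
     FROM sagas
     WHERE state NOT IN ('COMPLETED', 'COMPENSATED')
       AND started_at < NOW() - INTERVAL '1 hour'
     ORDER BY started_at`
  );
  res.json(stuck.rows);
});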

Testing Sagas

Testing distributed workflows is hard. You need to test happy paths, failures, and edge cases.

Unit Tests for Orchestrator Logic

describe('SagaOrchestrator', () => {
  it('should execute steps in order', async () => {
    const mockDb = createMockDb();
    const mockKafka = createMockKafka();
    const orchestrator = new SagaOrchestrator(mockDb, mockKafka);
    
    const sagaId = await orchestrator.startSaga('OrderCheckout', {
      productId: 'prod-1',
      quantity: 2,
      customerId: 'cust-1',
      total: 100
    });
    
    // Verify saga was created
    const saga = await mockDb.query('SELECT * FROM sagas WHERE id = $1', [sagaId]);
    expect(saga.rows[0].state).toBe('PENDING');
    
    // Verify first command was published
    expect(mockKafka.producedMessages).toContainEqual(
      expect.objectContaining({
        topic: 'reserveinventory.commands',
        key: sagaId
      })
    );
  });
  
  it('should compensate on failure', async () => {
    // Mock payment service to fail
    const mockPaymentService = {
      charge: jest.fn().mockRejectedValue(new Error('Payment failed'))
    };
    
    // ... test compensation logic
  });
});

Component Tests

Test the orchestrator with one service and a fake broker:

describe('Saga Component Tests', () => {
  it('should complete full order flow', async () => {
    // Use testcontainers for real PostgreSQL and Kafka
    const postgres = await startPostgresContainer();
    const kafka = await startKafkaContainer();
    
    const db = new Pool({ connectionString: postgres.connectionString });
    const kafkaClient = new Kafka({ brokers: [kafka.broker] });
    
    const orchestrator = new SagaOrchestrator(db, kafkaClient);
    const inventoryService = new InventoryService(db, kafkaClient);
    
    // Start services
    await inventoryService.start();
    
    // Start saga
    const sagaId = await orchestrator.startSaga('OrderCheckout', {
      productId: 'prod-1',
      quantity: 1
    });
    
    // Wait for completion
    await waitForSagaCompletion(sagaId, db, 5000);
    
    // Verify final state
    const saga = await db.query('SELECT * FROM sagas WHERE id = $1', [sagaId]);
    expect(saga.rows[0].state).toBe('COMPLETED');
  });
});
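
The waitForSagaCompletion helper used above is assumed rather than provided by any library; a simple polling version might look like this:

async function waitForSagaCompletion(
  sagaId: string,
  db: Pool,
  timeoutMs: number
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = await db.query(
      `SELECT state FROM sagas WHERE id = $1`,
      [sagaId]
    );
    const state = result.rows[0]?.state;
    if (state === 'COMPLETED' || state === 'COMPENSATED') return;
    // Poll every 200ms until the saga reaches a terminal state
    await new Promise(resolve => setTimeout(resolve, 200));
  }
  throw new Error(`Saga ${sagaId} did not finish within ${timeoutMs}ms`);
}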

Chaos Tests

Test what happens when things go wrong:

describe('Saga Chaos Tests', () => {
  it('should handle service crashes during execution', async () => {
    const sagaId = await orchestrator.startSaga('OrderCheckout', orderData);
    
    // Kill inventory service mid-execution
    await killService('inventory-service');
    
    // Wait and verify compensation
    await wait(5000);
    
    const saga = await db.query('SELECT * FROM sagas WHERE id = $1', [sagaId]);
    expect(saga.rows[0].state).toBe('COMPENSATED');
  });
  
  it('should handle message loss', async () => {
    // Drop messages from Kafka
    await dropKafkaMessages('reserveinventory.commands');
    
    // Verify saga times out and compensates
    // ...
  });
});

Common Anti-Patterns

Here’s what not to do:

Treating Sagas as ACID Transactions

Sagas are not ACID. They provide eventual consistency. Don’t expect immediate consistency across services.

Orchestrator That Knows Everything

Don’t put all business logic in the orchestrator. Services should own their logic. The orchestrator should only coordinate.

Synchronous REST for Every Step

Don’t use synchronous REST calls for saga steps. Use async messaging. It’s more resilient and scalable.

No Compensations, Only Retries

Retries don’t fix everything. Some failures need compensation. Design compensations for every step.

Using Sagas When You Don’t Need Them

If everything fits in one service, use a database transaction. Don’t over-engineer.

Checklist and Summary

Before deploying a saga to production, check:

  • Business process modeled as a state machine
  • Clear orchestration/choreography choice made
  • Compensations defined and tested for all steps
  • Idempotency and retries handled
  • Timeouts and dead-letter handling implemented
  • Tracing, logs, and metrics wired up
  • Correlation IDs in all logs
  • Chaos tests run and passing
  • Saga inspector view built for debugging
  • Documentation for operations team

The Bottom Line

Sagas are the standard way to handle distributed transactions in microservices. They trade immediate consistency for availability and scalability.

Design them carefully. Model your workflow as a state machine. Choose orchestration or choreography based on your needs. Implement idempotency and compensations. Add observability from day one.

Start simple. Pick one workflow. Implement it. Learn from it. Then expand.

Your microservices will be more reliable. Your users will see fewer partial failures. Your operations team will thank you when they can actually debug what’s happening.


Ali Elborey is a software architect who helps teams build reliable distributed systems. He writes about microservices, event-driven architecture, and system design patterns.
