Designing Reliable Sagas: A Practical Guide to Distributed Transactions in Microservices
You’re building an e-commerce system. A customer places an order. Your system needs to:
- Create the order
- Reserve inventory
- Charge the payment card
- Schedule shipment
Simple, right? Until the payment service times out after the inventory is already reserved. Now you have stock locked up but no payment. Or the shipment service fails after the payment is charged. Now you've taken the customer's money but nothing is scheduled to ship.
This is the distributed transaction problem. In a monolith, you’d wrap everything in a database transaction. In microservices, that doesn’t work. Each service has its own database. You can’t use two-phase commit across services—it’s too slow and doesn’t scale.
Sagas solve this. They’re sequences of local transactions with compensating actions. If something fails, you run compensations in reverse order to undo what you’ve done.
But designing sagas well is harder than it sounds. This guide shows you how.
The Real Problem: Multi-Service Workflows That Fail in Production
Let’s look at what actually breaks when you try to coordinate multiple services.
The Naive Approach: Synchronous REST Chain
Here’s what many teams do first:
async function placeOrder(orderData: OrderData) {
// Step 1: Create order
const order = await orderService.create(orderData);
// Step 2: Reserve inventory
await inventoryService.reserve(order.productId, order.quantity);
// Step 3: Charge payment
await paymentService.charge(order.customerId, order.total);
// Step 4: Schedule shipment
await shipmentService.schedule(order.id);
return order;
}
This breaks in several ways:
One service down = everything breaks. If the payment service is slow or down, the whole flow blocks. The inventory reservation times out. The customer waits. Nothing completes.
Partial failures create “ghost” orders. If the payment fails after inventory is reserved, you have stock locked up. If inventory fails after payment, you’ve charged the customer but can’t fulfill the order.
Retries without idempotency cause duplicates. Network timeouts trigger retries. Without idempotency checks, you might charge the customer twice or reserve inventory multiple times.
No way to roll back. Once a step completes, you can't undo it. You're stuck with inconsistent state.
Why ACID and 2PC Don’t Work
Traditional database transactions use ACID properties: Atomicity, Consistency, Isolation, Durability. They work great within a single database.
Two-phase commit (2PC) tries to extend this across databases. It works like this:
- Prepare phase: Ask all services if they can commit
- Commit phase: If everyone says yes, tell them all to commit
The problem: 2PC is slow. It requires multiple round trips. It blocks resources while waiting for all services to respond. In a distributed system, one slow service blocks everything.
More importantly, 2PC doesn’t handle network partitions well. If services can’t communicate, transactions hang. This is why modern microservices avoid 2PC.
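To make the blocking problem concrete, here's a minimal sketch of what a 2PC coordinator does (the `Participant` interface and its methods are hypothetical, not a real library):
// Hypothetical participant interface; real 2PC implementations (XA, etc.)
// are considerably more involved.
interface Participant {
  prepare(txId: string): Promise<boolean>; // "can you commit?"
  commit(txId: string): Promise<void>;
  rollback(txId: string): Promise<void>;
}

async function twoPhaseCommit(txId: string, participants: Participant[]): Promise<void> {
  // Phase 1: every participant must vote yes, and each one holds locks
  // while the coordinator waits for the slowest response.
  const votes = await Promise.all(participants.map((p) => p.prepare(txId)));

  if (votes.every((vote) => vote)) {
    // Phase 2: if the coordinator crashes right here, participants are
    // stuck holding their locks until it comes back.
    await Promise.all(participants.map((p) => p.commit(txId)));
  } else {
    await Promise.all(participants.map((p) => p.rollback(txId)));
  }
}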
What Is a Saga, Really?
A saga is a sequence of local transactions. Each transaction updates data in one service. If a transaction fails, you run compensating transactions to undo previous steps.
Think of it like this:
- Local transaction: Update data in one service’s database
- Compensating transaction: Undo what a local transaction did
- Retryable step: A step that can safely be retried if it fails
Example: Order Checkout Saga
Here’s how an order checkout saga works:
1. Reserve inventory (local transaction)
   - Compensation: Release inventory
2. Charge payment (local transaction)
   - Compensation: Refund payment
3. Confirm order (local transaction)
   - Compensation: Cancel order
If payment fails, you release the inventory. If inventory fails, there’s nothing to compensate yet. If order confirmation fails, you refund the payment and release inventory.
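One way to keep that pairing honest is to define the saga as data, so every step declares its own compensation up front. This is a sketch: the service clients are placeholders for the real calls shown later in this guide.
interface OrderContext {
  sagaId: string;
  productId: string;
  quantity: number;
  customerId: string;
  total: number;
}

interface SagaStepDefinition {
  name: string;
  invoke: (ctx: OrderContext) => Promise<unknown>;     // the local transaction
  compensate: (ctx: OrderContext) => Promise<unknown>; // how to undo it
}

// Placeholder clients so the sketch stands alone; swap in real service calls.
type Client = Record<string, (...args: any[]) => Promise<unknown>>;
declare const inventoryService: Client, paymentService: Client, orderService: Client;

const checkoutSaga: SagaStepDefinition[] = [
  {
    name: 'RESERVE_INVENTORY',
    invoke: (ctx) => inventoryService.reserve(ctx.productId, ctx.quantity, ctx.sagaId),
    compensate: (ctx) => inventoryService.release(ctx.sagaId),
  },
  {
    name: 'CHARGE_PAYMENT',
    invoke: (ctx) => paymentService.charge(ctx.customerId, ctx.total, ctx.sagaId),
    compensate: (ctx) => paymentService.refund(ctx.sagaId),
  },
  {
    name: 'CONFIRM_ORDER',
    invoke: (ctx) => orderService.confirm(ctx, ctx.sagaId),
    compensate: (ctx) => orderService.cancel(ctx.sagaId),
  },
];
A generic runner can then walk this array forward and, on failure, walk the completed prefix backwards calling compensate, which is exactly what the orchestrator later in this guide does with explicit code.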
Pros and Cons vs Traditional Distributed Transactions
Pros:
- Works across service boundaries
- No blocking locks
- Services stay decoupled
- Can handle long-running workflows
Cons:
- Eventual consistency (not immediate)
- Need to design compensations carefully
- Harder to reason about than ACID transactions
- Requires idempotency to handle retries
The key insight: You trade immediate consistency for availability and scalability. That’s usually the right trade-off for microservices.
Orchestration vs Choreography
Sagas come in two flavors. Each has trade-offs.
Orchestrated Saga: Central Coordinator
An orchestrated saga has a central “Saga Orchestrator” that coordinates everything. It calls services, tracks state, and triggers compensations.
class OrderSagaOrchestrator {
async execute(orderData: OrderData): Promise<OrderResult> {
const sagaId = generateId();
const sagaState = { id: sagaId, status: 'PENDING', steps: [] };
try {
// Step 1: Reserve inventory
const inventory = await this.inventoryService.reserve(
orderData.productId,
orderData.quantity,
sagaId
);
sagaState.steps.push({ name: 'RESERVE_INVENTORY', status: 'COMPLETED' });
// Step 2: Charge payment
const payment = await this.paymentService.charge(
orderData.customerId,
orderData.total,
sagaId
);
sagaState.steps.push({ name: 'CHARGE_PAYMENT', status: 'COMPLETED' });
// Step 3: Confirm order
const order = await this.orderService.confirm(orderData, sagaId);
sagaState.steps.push({ name: 'CONFIRM_ORDER', status: 'COMPLETED' });
sagaState.status = 'COMPLETED';
return { success: true, orderId: order.id };
} catch (error) {
// Compensate in reverse order
await this.compensate(sagaState);
throw error;
}
}
private async compensate(sagaState: SagaState) {
// Run compensations in reverse order
for (const step of sagaState.steps.reverse()) {
if (step.status === 'COMPLETED') {
switch (step.name) {
case 'CONFIRM_ORDER':
await this.orderService.cancel(sagaState.id);
break;
case 'CHARGE_PAYMENT':
await this.paymentService.refund(sagaState.id);
break;
case 'RESERVE_INVENTORY':
await this.inventoryService.release(sagaState.id);
break;
}
}
}
sagaState.status = 'COMPENSATED';
}
}
Advantages:
- Easier to reason about—all logic in one place
- Central place for workflow logic
- Easier to test and debug
- Can handle complex branching logic
Risks:
- Can become a “God service” if you’re not careful
- Single point of failure (though you can make it resilient)
- Tighter coupling to business logic
Choreographed Saga: Event-Driven
A choreographed saga uses events. Services publish domain events and listen to others. There’s no central coordinator.
// Order Service
class OrderService {
async handleOrderCreated(event: OrderCreatedEvent) {
const order = await this.createOrder(event.orderData);
// Publish next event
await this.eventBus.publish({
type: 'InventoryReservationRequested',
sagaId: event.sagaId,
orderId: order.id,
productId: order.productId,
quantity: order.quantity
});
}
async handlePaymentProcessed(event: PaymentProcessedEvent) {
await this.confirmOrder(event.orderId);
await this.eventBus.publish({
type: 'OrderConfirmed',
sagaId: event.sagaId,
orderId: event.orderId
});
}
}
// Inventory Service
class InventoryService {
async handleInventoryReservationRequested(event: InventoryReservationRequested) {
const reservation = await this.reserveInventory(
event.productId,
event.quantity,
event.sagaId
);
await this.eventBus.publish({
type: 'InventoryReserved',
sagaId: event.sagaId,
reservationId: reservation.id
});
}
async handlePaymentFailed(event: PaymentFailed) {
// Compensation: release inventory
await this.releaseInventory(event.sagaId);
}
}
Advantages:
- More decoupled—services don’t know about each other
- Scales well—no central bottleneck
- Natural fit for event-driven architectures
- Services can evolve independently
Disadvantages:
- Harder to trace—flow is distributed
- Harder to debug—no single place to see the workflow
- Can become hard to understand as it grows
- Compensations are scattered across services
Decision Guide
Use orchestration when:
- The workflow is complex with branching logic
- You need strong guarantees about execution order
- The workflow is business-critical and needs central oversight
- You’re starting out and want easier debugging
Use choreography when:
- The workflow is simple and linear
- Services are naturally event-driven
- You want maximum decoupling
- The domain fits an event-centric model
Hybrid approach: Use orchestrated “high-level” sagas over choreographed sub-flows. The orchestrator handles the main flow, but delegates to event-driven sub-processes.
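A rough sketch of that shape, with illustrative names (`FulfillmentRequested`, `waitFor`) rather than any real framework API: the orchestrator treats a whole choreographed sub-flow as one saga step that completes when a summary event comes back.
interface EventBus {
  publish(event: { type: string; sagaId: string; payload?: unknown }): Promise<void>;
  // Resolves when a matching event arrives, or rejects on timeout.
  waitFor(type: string, sagaId: string, timeoutMs: number): Promise<unknown>;
}

// One "step" in the orchestrated saga that hides a choreographed sub-flow.
async function runFulfillmentSubFlow(eventBus: EventBus, sagaId: string, orderId: string) {
  // Warehouse, carrier, and notification services react to this event among
  // themselves; the orchestrator doesn't know or care about the details.
  await eventBus.publish({ type: 'FulfillmentRequested', sagaId, payload: { orderId } });

  // The sub-flow reports back with a single summary event.
  return eventBus.waitFor('FulfillmentCompleted', sagaId, 60_000);
}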
Designing a Saga Step by Step
Let’s design a complete saga using the Order Checkout example. We’ll use orchestration for clarity, but the principles apply to choreography too.
Step 1: Map the Business Process as a State Machine
Start by modeling your saga as a state machine:
States:
- PENDING → RESERVED → CHARGED → SHIPPED → COMPLETED
- Any state → FAILED → COMPENSATED
Events:
- OrderCreated → triggers ReserveInventory
- InventoryReserved → triggers ChargePayment
- PaymentCaptured → triggers ConfirmOrder
- ShipmentScheduled → triggers CompleteOrder
- Any failure → triggers Compensation
Here’s the TypeScript type:
type SagaState =
| 'PENDING'
| 'RESERVED'
| 'CHARGED'
| 'SHIPPED'
| 'COMPLETED'
| 'FAILED'
| 'COMPENSATED';
interface SagaStep {
name: string;
status: 'PENDING' | 'COMPLETED' | 'FAILED' | 'COMPENSATED';
completedAt?: Date;
failedAt?: Date;
error?: string;
result?: any;
}
interface Saga {
id: string;
state: SagaState;
steps: SagaStep[];
startedAt: Date;
completedAt?: Date;
compensatedAt?: Date;
payload: any;
}
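The transitions themselves can live in a lookup table next to these types, so the orchestrator rejects anything the state machine doesn't allow. A sketch (the event names are adapted from the list above; rename them to match your own events):
// Which event moves the saga from which state to which state.
// Anything not listed here is an illegal transition.
const transitions: Record<SagaState, Partial<Record<string, SagaState>>> = {
  PENDING:     { InventoryReserved: 'RESERVED', StepFailed: 'FAILED' },
  RESERVED:    { PaymentCaptured: 'CHARGED', StepFailed: 'FAILED' },
  CHARGED:     { ShipmentScheduled: 'SHIPPED', StepFailed: 'FAILED' },
  SHIPPED:     { OrderCompleted: 'COMPLETED', StepFailed: 'FAILED' },
  COMPLETED:   {},
  FAILED:      { CompensationFinished: 'COMPENSATED' },
  COMPENSATED: {},
};

function nextState(current: SagaState, event: string): SagaState {
  const next = transitions[current][event];
  if (!next) {
    throw new Error(`Illegal transition: ${event} in state ${current}`);
  }
  return next;
}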
Step 2: Decide Service Boundaries
Which service owns which step?
- Order Service: Creates and confirms orders
- Inventory Service: Reserves and releases inventory
- Payment Service: Charges and refunds payments
- Shipment Service: Schedules shipments
Each service owns its data. The orchestrator coordinates but doesn’t own the data.
Step 3: Define Compensations
For each step, define what happens if you need to undo it:
- Reserve inventory → Release inventory (return stock to available)
- Charge payment → Refund payment (return money to customer)
- Confirm order → Cancel order (mark as cancelled)
- Schedule shipment → Cancel shipment (remove from schedule)
Some steps can’t be compensated. For example, if you’ve already shipped the product, you can’t “un-ship” it. In that case, you might need manual intervention or a different compensation strategy.
Step 4: Define Retry Rules and Idempotency
Every step must be idempotent. If you retry a step, it should produce the same result.
Use unique saga IDs and step sequence numbers:
interface Command {
sagaId: string;
stepSequence: number;
idempotencyKey: string;
payload: any;
}
class InventoryService {
private processedKeys = new Set<string>();
async reserveInventory(command: Command): Promise<Reservation> {
// Check idempotency
if (this.processedKeys.has(command.idempotencyKey)) {
// Return the existing result
return await this.getExistingReservation(command.sagaId);
}
// Process the reservation
const reservation = await this.db.transaction(async (tx) => {
// Check stock availability
const product = await tx.products.findOne({
id: command.payload.productId
});
if (product.stock < command.payload.quantity) {
throw new Error('Insufficient stock');
}
// Reserve inventory
await tx.products.update(
{ id: command.payload.productId },
{ stock: product.stock - command.payload.quantity }
);
// Create reservation record
const reservation = await tx.reservations.create({
sagaId: command.sagaId,
productId: command.payload.productId,
quantity: command.payload.quantity,
status: 'RESERVED'
});
return reservation;
});
// Mark as processed
this.processedKeys.add(command.idempotencyKey);
return reservation;
}
}
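One caveat with the sketch above: an in-memory Set of processed keys only works for a single process and disappears on restart. A more durable option, assuming PostgreSQL as in the rest of this article, is a small table with the idempotency key as its primary key (the table name `processed_commands` is illustrative):
import { Pool } from 'pg';

// Assumes a table like:
//   CREATE TABLE processed_commands (
//     idempotency_key VARCHAR(255) PRIMARY KEY,
//     processed_at    TIMESTAMP DEFAULT NOW()
//   );
async function markProcessed(db: Pool, idempotencyKey: string): Promise<boolean> {
  // ON CONFLICT DO NOTHING: a duplicate key inserts zero rows, so rowCount
  // tells us whether this command is being seen for the first time.
  const result = await db.query(
    `INSERT INTO processed_commands (idempotency_key)
     VALUES ($1)
     ON CONFLICT (idempotency_key) DO NOTHING`,
    [idempotencyKey]
  );
  return result.rowCount === 1; // true: first time, safe to process
}
Ideally the insert runs in the same database transaction as the reservation itself, so the key and its side effect commit or roll back together.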
Step 5: Handle Timeouts and Dead Letters
What if a service never replies?
class SagaOrchestrator {
private readonly STEP_TIMEOUT = 30000; // 30 seconds
async executeStep(
stepName: string,
action: () => Promise<any>,
sagaId: string
): Promise<any> {
const timeoutPromise = new Promise((_, reject) => {
setTimeout(() => reject(new Error('Step timeout')), this.STEP_TIMEOUT);
});
try {
const result = await Promise.race([action(), timeoutPromise]);
return result;
} catch (error) {
// Check if this is a retryable error
if (this.isRetryable(error)) {
// Retry with exponential backoff
return await this.retryWithBackoff(stepName, action, sagaId);
}
// Non-retryable error - trigger compensation
throw error;
}
}
private isRetryable(error: Error): boolean {
// Network errors, timeouts are retryable
// Business logic errors (e.g., insufficient funds) are not
return error.message.includes('timeout') ||
error.message.includes('network') ||
error.message.includes('ECONNREFUSED');
}
}
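`retryWithBackoff` is referenced above but not shown. One possible implementation, added as a method on the same orchestrator class (the attempt count and delays are arbitrary; tune them per service):
private async retryWithBackoff(
  stepName: string,
  action: () => Promise<any>,
  sagaId: string,
  maxAttempts = 3
): Promise<any> {
  let lastError: Error | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await action();
    } catch (error) {
      lastError = error as Error;
      if (!this.isRetryable(lastError)) {
        throw lastError;
      }
      // Exponential backoff: 1s, 2s, 4s, ...
      const delayMs = 1000 * 2 ** (attempt - 1);
      console.warn(`Retry ${attempt}/${maxAttempts} for ${stepName} (saga ${sagaId}) in ${delayMs}ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}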
For dead letters (messages that can’t be processed), use a dead letter queue:
class DeadLetterHandler {
async handleDeadLetter(message: Command, error: Error) {
// Log for manual review
await this.logger.error({
sagaId: message.sagaId,
step: message.stepSequence,
error: error.message,
payload: message.payload,
timestamp: new Date()
});
// Notify operations team
await this.notifyOps({
type: 'DEAD_LETTER',
sagaId: message.sagaId,
requiresManualIntervention: true
});
// Mark saga as failed
await this.sagaStore.updateState(message.sagaId, 'FAILED');
}
}
Implementation Patterns
Let’s build a complete implementation using Node.js, TypeScript, and PostgreSQL. We’ll use Kafka for messaging, but the patterns work with RabbitMQ or other brokers.
Architecture Overview
┌─────────────────┐
│   API Gateway   │
└────────┬────────┘
         │
         ▼
┌──────────────────┐      ┌──────────────┐
│ Saga Orchestrator│─────▶│    Kafka     │
└────────┬─────────┘      └──────┬───────┘
         │                       │
         ▼                       ▼
┌─────────────────┐       ┌──────────────┐
│   PostgreSQL    │       │   Services   │
│  (Saga State)   │       │   (Order,    │
└─────────────────┘       │   Payment,   │
                          │  Inventory)  │
                          └──────────────┘
Saga State Model
First, define the database schema:
CREATE TABLE sagas (
id VARCHAR(255) PRIMARY KEY,
state VARCHAR(50) NOT NULL,
type VARCHAR(100) NOT NULL,
payload JSONB NOT NULL,
started_at TIMESTAMP NOT NULL,
completed_at TIMESTAMP,
compensated_at TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE saga_steps (
id SERIAL PRIMARY KEY,
saga_id VARCHAR(255) NOT NULL REFERENCES sagas(id),
step_name VARCHAR(100) NOT NULL,
step_sequence INTEGER NOT NULL,
status VARCHAR(50) NOT NULL,
result JSONB,
error TEXT,
completed_at TIMESTAMP,
failed_at TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_sagas_state ON sagas(state);
CREATE INDEX idx_saga_steps_saga_id ON saga_steps(saga_id);
TypeScript types:
interface SagaRecord {
id: string;
state: SagaState;
type: string;
payload: any;
startedAt: Date;
completedAt?: Date;
compensatedAt?: Date;
}
interface SagaStepRecord {
id: number;
sagaId: string;
stepName: string;
stepSequence: number;
status: 'PENDING' | 'COMPLETED' | 'FAILED' | 'COMPENSATED';
result?: any;
error?: string;
completedAt?: Date;
failedAt?: Date;
}
Orchestrator Implementation
Here’s the core orchestrator:
import { Pool } from 'pg';
import { Kafka } from 'kafkajs';
class SagaOrchestrator {
constructor(
private db: Pool,
private kafka: Kafka
) {}
async startSaga(
sagaType: string,
payload: any
): Promise<string> {
const sagaId = generateId();
const producer = this.kafka.producer();
await producer.connect();
// Create saga record
await this.db.query(
`INSERT INTO sagas (id, state, type, payload, started_at)
VALUES ($1, $2, $3, $4, NOW())`,
[sagaId, 'PENDING', sagaType, JSON.stringify(payload)]
);
// Record the first step so later status updates have a row to hit
await this.db.query(
`INSERT INTO saga_steps (saga_id, step_name, step_sequence, status)
VALUES ($1, 'ReserveInventory', 1, 'PENDING')`,
[sagaId]
);
// Publish initial command (same topic the inventory service subscribes to)
await producer.send({
topic: 'reserveinventory.commands',
messages: [{
key: sagaId,
value: JSON.stringify({
sagaId,
stepSequence: 1,
command: 'ReserveInventory',
payload,
idempotencyKey: `${sagaId}-1`
})
}]
});
await producer.disconnect();
return sagaId;
}
async handleEvent(event: SagaEvent) {
const saga = await this.getSaga(event.sagaId);
if (!saga) {
throw new Error(`Saga ${event.sagaId} not found`);
}
// Update step status
await this.updateStepStatus(
event.sagaId,
event.stepSequence,
'COMPLETED',
event.result
);
// Determine next step
const nextStep = this.getNextStep(saga.type, event.stepSequence);
if (nextStep) {
await this.executeNextStep(saga.id, nextStep, saga.payload);
} else {
// Saga complete
await this.completeSaga(saga.id);
}
}
async handleFailure(event: SagaFailureEvent) {
const saga = await this.getSaga(event.sagaId);
if (!saga) return;
// Mark step as failed
await this.updateStepStatus(
event.sagaId,
event.stepSequence,
'FAILED',
null,
event.error
);
// Trigger compensation
await this.compensate(saga.id);
}
private async compensate(sagaId: string) {
const steps = await this.getCompletedSteps(sagaId);
// Compensate in reverse order
for (const step of steps.reverse()) {
const producer = this.kafka.producer();
await producer.connect();
await producer.send({
topic: 'saga.compensations',
messages: [{
key: sagaId,
value: JSON.stringify({
sagaId,
stepName: step.stepName,
compensation: this.getCompensationCommand(step.stepName),
payload: step.result
})
}]
});
await producer.disconnect();
// Mark step as compensated
await this.updateStepStatus(
sagaId,
step.stepSequence,
'COMPENSATED'
);
}
// Mark saga as compensated
await this.db.query(
`UPDATE sagas SET state = 'COMPENSATED', compensated_at = NOW()
WHERE id = $1`,
[sagaId]
);
}
private async getSaga(sagaId: string): Promise<SagaRecord | null> {
const result = await this.db.query(
`SELECT * FROM sagas WHERE id = $1`,
[sagaId]
);
return result.rows[0] || null;
}
private async updateStepStatus(
sagaId: string,
stepSequence: number,
status: string,
result?: any,
error?: string
) {
if (status === 'COMPLETED') {
await this.db.query(
`UPDATE saga_steps
SET status = $1, result = $2, completed_at = NOW()
WHERE saga_id = $3 AND step_sequence = $4`,
[status, JSON.stringify(result), sagaId, stepSequence]
);
} else if (status === 'FAILED') {
await this.db.query(
`UPDATE saga_steps
SET status = $1, error = $2, failed_at = NOW()
WHERE saga_id = $3 AND step_sequence = $4`,
[status, error, sagaId, stepSequence]
);
} else {
// Other statuses (e.g. COMPENSATED) only need the status column updated
await this.db.query(
`UPDATE saga_steps
SET status = $1
WHERE saga_id = $2 AND step_sequence = $3`,
[status, sagaId, stepSequence]
);
}
}
private getNextStep(
sagaType: string,
currentStep: number
): string | null {
const steps: Record<string, string[]> = {
'OrderCheckout': [
'ReserveInventory',
'ChargePayment',
'ConfirmOrder'
]
};
const sagaSteps = steps[sagaType] || [];
const nextIndex = currentStep;
if (nextIndex >= sagaSteps.length) {
return null;
}
return sagaSteps[nextIndex];
}
private async executeNextStep(
sagaId: string,
stepName: string,
payload: any
) {
const producer = this.kafka.producer();
await producer.connect();
// Get current step sequence
const stepResult = await this.db.query(
`SELECT MAX(step_sequence) as max_seq FROM saga_steps WHERE saga_id = $1`,
[sagaId]
);
const nextSequence = (stepResult.rows[0].max_seq || 0) + 1;
// Create step record
await this.db.query(
`INSERT INTO saga_steps (saga_id, step_name, step_sequence, status)
VALUES ($1, $2, $3, 'PENDING')`,
[sagaId, stepName, nextSequence]
);
// Publish command
await producer.send({
topic: `${stepName.toLowerCase()}.commands`,
messages: [{
key: sagaId,
value: JSON.stringify({
sagaId,
stepSequence: nextSequence,
command: stepName,
payload,
idempotencyKey: `${sagaId}-${nextSequence}`
})
}]
});
await producer.disconnect();
}
private getCompensationCommand(stepName: string): string {
const compensations: Record<string, string> = {
'ReserveInventory': 'ReleaseInventory',
'ChargePayment': 'RefundPayment',
'ConfirmOrder': 'CancelOrder'
};
return compensations[stepName] || 'Unknown';
}
}
Service Event Handler
Here’s how a service (Inventory Service) handles commands:
class InventoryService {
constructor(
private db: Pool,
private kafka: Kafka,
private processedKeys: Set<string>
) {}
async start() {
const consumer = this.kafka.consumer({ groupId: 'inventory-service' });
await consumer.connect();
await consumer.subscribe({ topic: 'reserveinventory.commands' });
await consumer.run({
eachMessage: async ({ message }) => {
const command = JSON.parse(message.value!.toString());
await this.handleReserveInventory(command);
}
});
}
async handleReserveInventory(command: Command) {
const idempotencyKey = command.idempotencyKey;
// Check idempotency
if (this.processedKeys.has(idempotencyKey)) {
console.log(`Skipping duplicate command: ${idempotencyKey}`);
return;
}
const producer = this.kafka.producer();
await producer.connect();
try {
// Reserve inventory in a transaction. node-postgres can't run a
// multi-statement parameterized query, so use a dedicated client
// with an explicit BEGIN/COMMIT.
const client = await this.db.connect();
let reservation;
try {
await client.query('BEGIN');
// The conditional UPDATE both locks the row and guards against overselling
const updated = await client.query(
`UPDATE products SET stock = stock - $2 WHERE id = $1 AND stock >= $2`,
[command.payload.productId, command.payload.quantity]
);
if (updated.rowCount === 0) {
throw new Error('Insufficient stock');
}
const inserted = await client.query(
`INSERT INTO reservations (saga_id, product_id, quantity, status)
VALUES ($1, $2, $3, 'RESERVED') RETURNING id`,
[command.sagaId, command.payload.productId, command.payload.quantity]
);
reservation = inserted.rows[0];
await client.query('COMMIT');
} catch (txError) {
await client.query('ROLLBACK');
throw txError;
} finally {
client.release();
}
// Mark as processed
this.processedKeys.add(idempotencyKey);
// Publish success event
await producer.send({
topic: 'saga.events',
messages: [{
key: command.sagaId,
value: JSON.stringify({
sagaId: command.sagaId,
stepSequence: command.stepSequence,
event: 'InventoryReserved',
result: { reservationId: reservation.id }
})
}]
});
} catch (error) {
// Publish failure event
await producer.send({
topic: 'saga.failures',
messages: [{
key: command.sagaId,
value: JSON.stringify({
sagaId: command.sagaId,
stepSequence: command.stepSequence,
error: error.message
})
}]
});
} finally {
await producer.disconnect();
}
}
async handleReleaseInventory(compensation: CompensationCommand) {
// Release inventory (compensation). Run both statements in one transaction
// on a dedicated client, since node-postgres can't execute a multi-statement
// parameterized query.
const client = await this.db.connect();
try {
await client.query('BEGIN');
await client.query(
`UPDATE products p
SET stock = p.stock + r.quantity
FROM reservations r
WHERE r.saga_id = $1 AND r.product_id = p.id AND r.status = 'RESERVED'`,
[compensation.sagaId]
);
await client.query(
`UPDATE reservations SET status = 'RELEASED' WHERE saga_id = $1`,
[compensation.sagaId]
);
await client.query('COMMIT');
} catch (error) {
await client.query('ROLLBACK');
throw error;
} finally {
client.release();
}
}
}
Compensation Handler
Compensations run in reverse order:
class CompensationHandler {
constructor(
private db: Pool,
private kafka: Kafka,
private services: Map<string, Service>
) {}
async start() {
const consumer = this.kafka.consumer({ groupId: 'compensation-handler' });
await consumer.connect();
await consumer.subscribe({ topic: 'saga.compensations' });
await consumer.run({
eachMessage: async ({ message }) => {
const compensation = JSON.parse(message.value!.toString());
await this.handleCompensation(compensation);
}
});
}
async handleCompensation(compensation: CompensationCommand) {
const service = this.services.get(compensation.stepName);
if (!service) {
console.error(`No service found for ${compensation.stepName}`);
return;
}
try {
await service.compensate(compensation);
// Publish compensation completed event
const producer = this.kafka.producer();
await producer.connect();
await producer.send({
topic: 'saga.compensations.completed',
messages: [{
key: compensation.sagaId,
value: JSON.stringify({
sagaId: compensation.sagaId,
stepName: compensation.stepName,
status: 'COMPENSATED'
})
}]
});
await producer.disconnect();
} catch (error) {
console.error(`Compensation failed for ${compensation.stepName}:`, error);
// Log for manual intervention
await this.logCompensationFailure(compensation, error);
}
}
}
Observability and Debugging Sagas
Sagas are hard to debug. You need to see what’s happening across services.
Correlation IDs and Saga IDs
Use correlation IDs in all logs, traces, and metrics:
import { trace, SpanStatusCode } from '@opentelemetry/api';
class TracedSagaOrchestrator extends SagaOrchestrator {
async startSaga(sagaType: string, payload: any): Promise<string> {
const tracer = trace.getTracer('saga-orchestrator');
const span = tracer.startSpan('start_saga');
span.setAttributes({
'saga.type': sagaType,
'saga.id': payload.sagaId || generateId()
});
try {
const sagaId = await super.startSaga(sagaType, payload);
span.setAttribute('saga.id', sagaId);
span.setStatus({ code: SpanStatusCode.OK });
return sagaId;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message
});
throw error;
} finally {
span.end();
}
}
}
Distributed Tracing
Use OpenTelemetry to trace the full saga:
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { Resource } from '@opentelemetry/resources';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
const tracerProvider = new NodeTracerProvider({
resource: new Resource({ 'service.name': 'saga-orchestrator' })
});
tracerProvider.addSpanProcessor(
new BatchSpanProcessor(new JaegerExporter())
);
tracerProvider.register();
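Exporter setup alone won't stitch spans from different services into one trace; the trace context has to travel with each Kafka message. A sketch of one common approach using the @opentelemetry/api propagation helpers (kafkajs delivers header values as Buffers, so decode them to strings before extracting):
import { context, propagation } from '@opentelemetry/api';

// Producer side: copy the active trace context into the message headers.
function withTraceHeaders<T extends { key: string; value: string }>(message: T) {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  return { ...message, headers };
}

// Consumer side: rebuild the context from incoming headers and run the
// handler inside it, so its spans become children of the producer's span.
async function handleWithTraceContext(
  headers: Record<string, string>,
  handler: () => Promise<void>
): Promise<void> {
  const extracted = propagation.extract(context.active(), headers);
  return context.with(extracted, handler);
}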
Metrics to Track
Track these metrics:
- Saga success/failure rate
- Average completion time
- Compensation frequency
- Step failure rates
- Timeout rates
import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('saga-orchestrator');
const sagaCounter = meter.createCounter('sagas_total', {
description: 'Total number of sagas'
});
const sagaDuration = meter.createHistogram('saga_duration_seconds', {
description: 'Saga execution duration'
});
const compensationCounter = meter.createCounter('compensations_total', {
description: 'Total number of compensations'
});
// In your orchestrator
sagaCounter.add(1, { type: sagaType, status: 'started' });
const startTime = Date.now();
// ... execute saga ...
const duration = (Date.now() - startTime) / 1000;
sagaDuration.record(duration, { type: sagaType });
Saga Inspector View
Build a simple admin UI to inspect sagas:
// API endpoint to get saga state
app.get('/api/sagas/:id', async (req, res) => {
const saga = await db.query(
`SELECT s.*,
json_agg(
json_build_object(
'stepName', st.step_name,
'sequence', st.step_sequence,
'status', st.status,
'completedAt', st.completed_at,
'error', st.error
) ORDER BY st.step_sequence
) as steps
FROM sagas s
LEFT JOIN saga_steps st ON s.id = st.saga_id
WHERE s.id = $1
GROUP BY s.id`,
[req.params.id]
);
res.json(saga.rows[0]);
});
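A companion endpoint that lists sagas stuck in a non-terminal state is just as useful for the operations team (the 15-minute threshold is arbitrary; adjust it to your workflow's expected duration):
// List sagas that have been in flight for suspiciously long.
app.get('/api/sagas/stuck', async (req, res) => {
  const result = await db.query(
    `SELECT id, type, state, started_at
     FROM sagas
     WHERE state NOT IN ('COMPLETED', 'COMPENSATED')
       AND started_at < NOW() - INTERVAL '15 minutes'
     ORDER BY started_at ASC`
  );
  res.json(result.rows);
});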
Testing Sagas
Testing distributed workflows is hard. You need to test happy paths, failures, and edge cases.
Unit Tests for Orchestrator Logic
describe('SagaOrchestrator', () => {
it('should execute steps in order', async () => {
const mockDb = createMockDb();
const mockKafka = createMockKafka();
const orchestrator = new SagaOrchestrator(mockDb, mockKafka);
const sagaId = await orchestrator.startSaga('OrderCheckout', {
productId: 'prod-1',
quantity: 2,
customerId: 'cust-1',
total: 100
});
// Verify saga was created
const saga = await mockDb.query('SELECT * FROM sagas WHERE id = $1', [sagaId]);
expect(saga.rows[0].state).toBe('PENDING');
// Verify first command was published
expect(mockKafka.producedMessages).toContainEqual(
expect.objectContaining({
topic: 'reserveinventory.commands',
key: sagaId
})
);
});
it('should compensate on failure', async () => {
// Mock payment service to fail
const mockPaymentService = {
charge: jest.fn().mockRejectedValue(new Error('Payment failed'))
};
// ... test compensation logic
});
});
Component Tests
Test the orchestrator with one service and a fake broker:
describe('Saga Component Tests', () => {
it('should complete full order flow', async () => {
// Use testcontainers for real PostgreSQL and Kafka
const postgres = await startPostgresContainer();
const kafka = await startKafkaContainer();
const db = new Pool({ connectionString: postgres.connectionString });
const kafkaClient = new Kafka({ brokers: [kafka.broker] });
const orchestrator = new SagaOrchestrator(db, kafkaClient);
const inventoryService = new InventoryService(db, kafkaClient);
// Start services
await inventoryService.start();
// Start saga
const sagaId = await orchestrator.startSaga('OrderCheckout', {
productId: 'prod-1',
quantity: 1
});
// Wait for completion
await waitForSagaCompletion(sagaId, db, 5000);
// Verify final state
const saga = await db.query('SELECT * FROM sagas WHERE id = $1', [sagaId]);
expect(saga.rows[0].state).toBe('COMPLETED');
});
});
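`waitForSagaCompletion` above is just a polling helper. One possible implementation (Pool comes from pg, as in the test setup):
async function waitForSagaCompletion(sagaId: string, db: Pool, timeoutMs: number): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = await db.query(`SELECT state FROM sagas WHERE id = $1`, [sagaId]);
    const state = result.rows[0]?.state;
    // Stop as soon as the saga reaches any terminal state.
    if (state === 'COMPLETED' || state === 'COMPENSATED' || state === 'FAILED') {
      return state;
    }
    await new Promise((resolve) => setTimeout(resolve, 200)); // poll every 200ms
  }
  throw new Error(`Saga ${sagaId} did not reach a terminal state within ${timeoutMs}ms`);
}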
Chaos Tests
Test what happens when things go wrong:
describe('Saga Chaos Tests', () => {
it('should handle service crashes during execution', async () => {
const sagaId = await orchestrator.startSaga('OrderCheckout', orderData);
// Kill inventory service mid-execution
await killService('inventory-service');
// Wait and verify compensation
await wait(5000);
const saga = await db.query('SELECT * FROM sagas WHERE id = $1', [sagaId]);
expect(saga.rows[0].state).toBe('COMPENSATED');
});
it('should handle message loss', async () => {
// Drop messages from Kafka
await dropKafkaMessages('reserveinventory.commands');
// Verify saga times out and compensates
// ...
});
});
Common Anti-Patterns
Here’s what not to do:
Treating Sagas as ACID Transactions
Sagas are not ACID. They provide eventual consistency. Don’t expect immediate consistency across services.
Orchestrator That Knows Everything
Don’t put all business logic in the orchestrator. Services should own their logic. The orchestrator should only coordinate.
Synchronous REST for Every Step
Don’t use synchronous REST calls for saga steps. Use async messaging. It’s more resilient and scalable.
No Compensations, Only Retries
Retries don’t fix everything. Some failures need compensation. Design compensations for every step.
Using Sagas When You Don’t Need Them
If everything fits in one service, use a database transaction. Don’t over-engineer.
Checklist and Summary
Before deploying a saga to production, check:
- Business process modeled as a state machine
- Clear orchestration/choreography choice made
- Compensations defined and tested for all steps
- Idempotency and retries handled
- Timeouts and dead-letter handling implemented
- Tracing, logs, and metrics wired up
- Correlation IDs in all logs
- Chaos tests run and passing
- Saga inspector view built for debugging
- Documentation for operations team
The Bottom Line
Sagas are the standard way to handle distributed transactions in microservices. They trade immediate consistency for availability and scalability.
Design them carefully. Model your workflow as a state machine. Choose orchestration or choreography based on your needs. Implement idempotency and compensations. Add observability from day one.
Start simple. Pick one workflow. Implement it. Learn from it. Then expand.
Your microservices will be more reliable. Your users will see fewer partial failures. Your operations team will thank you when they can actually debug what’s happening.
Ali Elborey is a software architect who helps teams build reliable distributed systems. He writes about microservices, event-driven architecture, and system design patterns.