By Ali Elborey

Zero-Downtime Schema Migrations: Expand-Contract with Dual-Write Cutovers

database, migrations, schema, postgresql, zero-downtime, expand-contract, dual-write, backfill, system-design, devops


In mature systems, a surprising share of outages trace back to schema migrations. You can avoid most of them with small, safe steps and the right guardrails.

This guide covers the expand-contract pattern. It’s a way to change your database schema without downtime. You add new columns, switch reads and writes, then remove old columns. Each step is reversible.

Why This Matters

Schema changes are risky. They can lock tables, break queries, or cause data loss. The expand-contract pattern reduces that risk.

Here’s what can go wrong:

Locks: Adding a column with a default value can lock the table. In PostgreSQL before version 11 (or with a volatile default on newer versions), ALTER TABLE ... ADD COLUMN ... DEFAULT rewrites the whole table under an ACCESS EXCLUSIVE lock. That blocks everything.

Long scans: Backfilling data can scan millions of rows. That slows down queries.

Hot tables: Popular tables get hit hard during migrations. One slow query can cascade.

ORMs issuing dangerous queries: ORMs generate queries from your models, so they can reference columns that don’t exist yet, or keep selecting columns you just dropped.

Rolling deploys vs. schema drift: You deploy code that expects the new schema. But some servers still have the old code. That mismatch causes errors.

The expand-contract pattern fixes these problems. You make changes in safe steps. Each step can be rolled back.

The Failure Modes

Before we get into the solution, let’s look at what breaks.

Table Locks

PostgreSQL takes heavy locks during certain operations. Before version 11 (or with a volatile default on newer versions), ALTER TABLE ... ADD COLUMN ... DEFAULT rewrites the table while holding an ACCESS EXCLUSIVE lock. That means:

  • All writes block
  • Reads might block
  • The operation can take hours on large tables

Example of a dangerous migration:

-- DON'T DO THIS
ALTER TABLE orders ADD COLUMN status VARCHAR(50) DEFAULT 'pending';

On older PostgreSQL versions (or with a volatile default) this rewrites and locks the table. With 100 million rows, that’s hours of blocked writes.

Long Scans

Backfilling data often means scanning the entire table:

-- This scans every row
UPDATE orders SET status = 'pending' WHERE status IS NULL;

If you have 100 million rows, that’s slow. A single giant UPDATE also generates a huge burst of WAL (Write-Ahead Log), which can fill the disk and push up replication lag.

Hot Tables

Popular tables get hit hard. During a migration:

  • Reads slow down
  • Writes queue up
  • Replication lag increases

One slow query can cascade. Users see timeouts. The system degrades.

ORM Mismatches

ORMs generate queries based on your models. If you deploy new code before the schema is ready:

# New code expects 'status' column
order = Order.objects.get(id=123)
print(order.status)  # Column doesn't exist yet!

That breaks. You need to coordinate code and schema changes.

Rolling Deploys

You deploy code in stages. Some servers have new code. Some have old code. If the schema is in between states, you get errors.

Expand-Contract in Practice

The expand-contract pattern has three phases:

  1. Expand: Add new columns, tables, indexes. Write adapters. Plan backfill.
  2. Dual-write: Write to old and new shapes.
  3. Contract: Switch reads, freeze old path, drop old columns.

Let’s walk through each phase.

Phase 1: Expand

Add new columns as nullable. Don’t add defaults yet.

-- Safe: nullable column, no default
ALTER TABLE orders ADD COLUMN status VARCHAR(50);

-- Create index concurrently (doesn't lock)
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);

Why nullable? Because existing rows don’t have values yet. You’ll backfill later.

Why no default? Because adding one can rewrite and lock the table (on PostgreSQL before 11, or with a volatile default). We’ll set values during backfill instead.

Write adapters: Your code should handle both old and new shapes. Write a function that reads from both:

// Adapter that reads from old or new column
function getOrderStatus(order) {
  // New column takes precedence
  if (order.status !== null && order.status !== undefined) {
    return order.status;
  }
  // Fall back to old column
  return order.old_status || 'pending';
}

Backfill plan: Decide how to populate the new column. We’ll cover this later.

Phase 2: Dual-Write

Write to both old and new shapes. This ensures both are in sync.

async function createOrder(orderData) {
  const client = await pool.connect();
  
  try {
    await client.query('BEGIN');
    
    // Write to new shape
    await client.query(
      `INSERT INTO orders (id, user_id, amount, status, created_at)
       VALUES ($1, $2, $3, $4, $5)`,
      [orderData.id, orderData.userId, orderData.amount, orderData.status, new Date()]
    );
    
    // Also write to old shape (if it exists)
    // This ensures backward compatibility
    await client.query(
      `UPDATE orders SET old_status = $1 WHERE id = $2`,
      [orderData.status, orderData.id]
    );
    
    await client.query('COMMIT');
  } catch (error) {
    await client.query('ROLLBACK');
    throw error;
  } finally {
    client.release();
  }
}

Dual-write means:

  • New code writes to both
  • Old code still works
  • Both shapes stay in sync

Read-path fan-out: Read both shapes and compare them. This helps catch backfill and dual-write bugs:

async function getOrder(id) {
  const result = await pool.query(
    'SELECT id, user_id, amount, status, old_status FROM orders WHERE id = $1',
    [id]
  );
  const order = result.rows[0];

  // Compare the new column against the old one for debugging
  if (order && order.status !== (order.old_status || 'pending')) {
    logger.warn('Data mismatch', { id, new: order.status, old: order.old_status });
  }

  return order;
}

Read-preference flag: Use a feature flag to switch reads:

const READ_FROM_NEW = process.env.READ_FROM_NEW === 'true';

async function getOrder(id) {
  if (READ_FROM_NEW) {
    return await pool.query('SELECT * FROM orders WHERE id = $1', [id]);
  } else {
    return await pool.query('SELECT * FROM orders_old WHERE id = $1', [id]);
  }
}

Start with the flag off. Turn it on gradually. Monitor for errors.
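
If you want to move more gradually than a single boolean, you can route a percentage of reads by id. Here’s a minimal sketch, assuming a numeric primary key and a hypothetical READ_FROM_NEW_PERCENT environment variable:

// Sketch: percentage-based cutover instead of an all-or-nothing flag.
// READ_FROM_NEW_PERCENT is a hypothetical env var between 0 and 100.
const READ_FROM_NEW_PERCENT = parseInt(process.env.READ_FROM_NEW_PERCENT || '0', 10);

function shouldReadFromNew(id) {
  // Deterministic per id: the same order always takes the same path
  return id % 100 < READ_FROM_NEW_PERCENT;
}

async function getOrderGradual(id) {
  if (shouldReadFromNew(id)) {
    return await pool.query('SELECT * FROM orders WHERE id = $1', [id]);
  }
  return await pool.query('SELECT * FROM orders_old WHERE id = $1', [id]);
}

Bump the percentage in small steps (1%, 10%, 50%, 100%) and watch error rates at each step.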

Phase 3: Contract

Switch reads to the new path. Freeze the old path. Drop old columns.

Switch reads: Turn on the read-preference flag. All reads go to the new schema.

Freeze old path: Stop writing to old columns. Keep them for a while in case you need to roll back.

// Stop writing to old columns
async function createOrder(orderData) {
  // Only write to new shape now
  await pool.query(
    `INSERT INTO orders (id, user_id, amount, status, created_at)
     VALUES ($1, $2, $3, $4, $5)`,
    [orderData.id, orderData.userId, orderData.amount, orderData.status, new Date()]
  );
  // No more old_status writes
}

Drop old columns: After a few days, if everything looks good, drop old columns:

-- Wait a few days, then:
ALTER TABLE orders DROP COLUMN old_status;

This is the point of no return. Make sure you’re confident.
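
Before dropping anything, a cheap consistency check buys confidence. For the column-level example above, something like this should return zero:

-- Should be 0 before dropping old_status
SELECT COUNT(*) AS mismatches
FROM orders
WHERE status IS DISTINCT FROM COALESCE(old_status, 'pending');

If it isn’t zero, re-run the backfill before you contract.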

Backfills Without Pain

Backfilling data is tricky. You need to update millions of rows without locking the table.

Chunking

Process rows in chunks. Use primary key ranges or time windows.

Primary key ranges:

async function backfillOrders() {
  const BATCH_SIZE = 1000;
  let lastId = 0;
  
  while (true) {
    const rows = await pool.query(
      `SELECT id, old_status FROM orders
       WHERE id > $1 AND status IS NULL
       ORDER BY id
       LIMIT $2`,
      [lastId, BATCH_SIZE]
    );
    
    if (rows.rows.length === 0) {
      break; // Done
    }
    
    for (const row of rows.rows) {
      await pool.query(
        'UPDATE orders SET status = $1 WHERE id = $2',
        [row.old_status || 'pending', row.id]
      );
    }
    
    lastId = rows.rows[rows.rows.length - 1].id;
    
    // Throttle to avoid overwhelming the database
    await sleep(100); // 100ms between batches
  }
}

Time windows:

async function backfillOrdersByTime() {
  const BATCH_SIZE = 1000;
  let startTime = new Date('2024-01-01');
  const endTime = new Date();
  
  while (startTime < endTime) {
    const windowEnd = new Date(startTime.getTime() + 24 * 60 * 60 * 1000); // 1 day
    
    const rows = await pool.query(
      `SELECT id, old_status FROM orders
       WHERE created_at >= $1 AND created_at < $2 AND status IS NULL
       LIMIT $3`,
      [startTime, windowEnd, BATCH_SIZE]
    );
    
    for (const row of rows.rows) {
      await pool.query(
        'UPDATE orders SET status = $1 WHERE id = $2',
        [row.old_status || 'pending', row.id]
      );
    }
    
    startTime = windowEnd;
    await sleep(100);
  }
}

Throttling and Vacuums

Throttle updates to avoid overwhelming the database. Also run VACUUM periodically to reclaim space.

// Throttle updates
await sleep(100); // Wait 100ms between batches

// Run VACUUM periodically (in a separate job)
await pool.query('VACUUM ANALYZE orders');

Idempotent Backfill Jobs

Make backfills idempotent. If they fail, you can resume.

async function backfillOrdersWithResume() {
  const RESUME_TOKEN_KEY = 'backfill_orders_resume_token';
  let lastId = await getResumeToken(RESUME_TOKEN_KEY) || 0;
  const BATCH_SIZE = 1000;
  
  try {
    while (true) {
      const rows = await pool.query(
        `SELECT id, old_status FROM orders
         WHERE id > $1 AND status IS NULL
         ORDER BY id
         LIMIT $2`,
        [lastId, BATCH_SIZE]
      );
      
      if (rows.rows.length === 0) {
        break;
      }
      
      for (const row of rows.rows) {
        // Idempotent: only update if status is still NULL
        await pool.query(
          `UPDATE orders SET status = $1
           WHERE id = $2 AND status IS NULL`,
          [row.old_status || 'pending', row.id]
        );
      }
      
      lastId = rows.rows[rows.rows.length - 1].id;
      await saveResumeToken(RESUME_TOKEN_KEY, lastId);
      
      await sleep(100);
    }
    
    // Clear resume token when done
    await clearResumeToken(RESUME_TOKEN_KEY);
  } catch (error) {
    logger.error('Backfill failed', { error, lastId });
    // Resume token is saved, can resume later
    throw error;
  }
}

Resume markers: Store progress in a separate table:

CREATE TABLE backfill_progress (
  job_name VARCHAR(255) PRIMARY KEY,
  last_processed_id BIGINT,
  updated_at TIMESTAMP DEFAULT NOW()
);
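
The getResumeToken, saveResumeToken, and clearResumeToken helpers used above can be thin wrappers around that table. A sketch:

// Sketch of the resume-token helpers, backed by backfill_progress
async function getResumeToken(jobName) {
  const result = await pool.query(
    'SELECT last_processed_id FROM backfill_progress WHERE job_name = $1',
    [jobName]
  );
  return result.rows[0]?.last_processed_id || 0;
}

async function saveResumeToken(jobName, lastId) {
  await pool.query(
    `INSERT INTO backfill_progress (job_name, last_processed_id, updated_at)
     VALUES ($1, $2, NOW())
     ON CONFLICT (job_name)
     DO UPDATE SET last_processed_id = $2, updated_at = NOW()`,
    [jobName, lastId]
  );
}

async function clearResumeToken(jobName) {
  await pool.query('DELETE FROM backfill_progress WHERE job_name = $1', [jobName]);
}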

Indexing Safely

Creating indexes can lock tables. Use concurrent indexes when possible.

PostgreSQL: Concurrent Indexes

PostgreSQL supports CREATE INDEX CONCURRENTLY. It builds the index without blocking reads or writes:

-- Safe: doesn't block writes
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);

-- Dangerous: blocks writes for the whole build
CREATE INDEX idx_orders_status ON orders(status);

Concurrent indexes take longer, but they don’t block writes.
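
Two caveats worth knowing: CREATE INDEX CONCURRENTLY can’t run inside a transaction block, and if the build fails it leaves an INVALID index behind that you have to drop before retrying. A quick way to spot and clean those up:

-- Find indexes left invalid by a failed concurrent build
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop the broken index and retry the build
DROP INDEX CONCURRENTLY idx_orders_status;
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);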

MySQL: Online DDL

MySQL 5.6+ supports online DDL for some operations:

-- Online DDL (doesn't lock)
ALTER TABLE orders ADD INDEX idx_status (status), ALGORITHM=INPLACE, LOCK=NONE;

Check your MySQL version. Older versions lock the table.

Strong vs. Weak Indexes

Strong indexes: Unique constraints, foreign keys. These validate data. They can slow down writes.

Weak indexes: Regular indexes for queries. These are safer to add.

Add weak indexes first. Add strong indexes later if needed.
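
When you do need a strong index, PostgreSQL lets you build the unique index concurrently first and then attach it as a constraint, which avoids the long scan under a heavy lock. A sketch (external_id is a hypothetical column used for illustration):

-- Build the unique index without blocking writes
CREATE UNIQUE INDEX CONCURRENTLY idx_orders_external_id ON orders (external_id);

-- Attach it as a constraint; this step is brief because the index already exists
ALTER TABLE orders
  ADD CONSTRAINT orders_external_id_key UNIQUE USING INDEX idx_orders_external_id;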

Avoiding Full Table Locks

Some operations always lock:

  • Adding a column with a default (in PostgreSQL before 11, or with a volatile default)
  • Changing column types
  • Adding foreign keys (can lock)

Avoid these during peak hours. Use the expand-contract pattern instead.
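
When one of these operations is unavoidable, two guardrails help: set a lock_timeout so the DDL gives up instead of queueing behind traffic, and add constraints as NOT VALID first, then validate them in a second step that takes a weaker lock. A sketch:

-- Fail fast instead of blocking behind long-running queries
SET lock_timeout = '2s';

-- Add the foreign key without scanning existing rows...
ALTER TABLE order_items
  ADD CONSTRAINT order_items_order_id_fkey
  FOREIGN KEY (order_id) REFERENCES orders (id) NOT VALID;

-- ...then validate in a separate step (weaker lock, can be retried)
ALTER TABLE order_items VALIDATE CONSTRAINT order_items_order_id_fkey;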

Operational Controls

Use feature flags and toggles to control the migration.

Feature Flags

Use feature flags to switch between old and new paths:

const flags = {
  readFromNew: process.env.READ_FROM_NEW === 'true',
  writeToNew: process.env.WRITE_TO_NEW === 'true',
  enableBackfill: process.env.ENABLE_BACKFILL === 'true'
};

async function getOrder(id) {
  if (flags.readFromNew) {
    return await getOrderFromNew(id);
  } else {
    return await getOrderFromOld(id);
  }
}

Turn flags on gradually. Monitor for errors.

Safe Rollback Plan

Have a rollback plan:

  1. Turn off read-from-new flag
  2. Turn off write-to-new flag
  3. Keep old columns for a few days
  4. Monitor for issues

If something breaks, roll back immediately.

Blast Radius Limits

Limit the impact of failures:

  • Use circuit breakers
  • Set timeouts
  • Monitor error rates
  • Have alerts

If error rates spike, roll back automatically.
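
Timeouts can live at the connection-pool level too. With node-postgres, for example, you can cap statement time and keep the migration’s connection footprint small (the numbers here are illustrative):

const { Pool } = require('pg');

// Illustrative limits for the migration window
const migrationPool = new Pool({
  connectionString: process.env.DATABASE_URL,
  statement_timeout: 5000, // server-side cap per statement, in ms
  query_timeout: 7000,     // client-side cap per query, in ms
  max: 10                  // small pool so the backfill can't hog connections
});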

Testing and Observability

Test migrations before running them in production.

Shadow Reads and Diffing

Read from both old and new paths. Compare results:

async function shadowRead(id) {
  const [oldResult, newResult] = await Promise.all([
    getOrderFromOld(id),
    getOrderFromNew(id)
  ]);
  
  if (JSON.stringify(oldResult) !== JSON.stringify(newResult)) {
    logger.warn('Shadow read mismatch', { id, old: oldResult, new: newResult });
    metrics.increment('shadow_read.mismatch');
  }
  
  return oldResult; // Return old result to user
}

Run shadow reads for a few days. If there are no mismatches, switch to the new path.
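
Shadow reads double your query volume, so in practice you usually sample them. A sketch that checks a small fraction of requests and never lets the shadow path affect the user (SHADOW_SAMPLE_RATE is a hypothetical env var):

const SHADOW_SAMPLE_RATE = parseFloat(process.env.SHADOW_SAMPLE_RATE || '0.05');

async function getOrderWithSampledShadow(id) {
  const primary = await getOrderFromOld(id);

  if (Math.random() < SHADOW_SAMPLE_RATE) {
    // Fire-and-forget: the shadow read never blocks or fails the request
    getOrderFromNew(id)
      .then((candidate) => {
        if (JSON.stringify(candidate) !== JSON.stringify(primary)) {
          logger.warn('Shadow read mismatch', { id });
          metrics.increment('shadow_read.mismatch');
        }
      })
      .catch(() => metrics.increment('shadow_read.error'));
  }

  return primary;
}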

Query Plans and Regressions

Check query plans before and after:

-- Before migration
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending';

-- After migration
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending';

Compare execution times. If queries slow down, investigate.

Error Budgets for Migration Speed

Set error budgets. If errors exceed the budget, slow down or roll back:

const ERROR_BUDGET = 0.01; // 1% error rate
let errorCount = 0;
let totalRequests = 0;

function checkErrorBudget() {
  if (totalRequests === 0) return;
  const errorRate = errorCount / totalRequests;
  if (errorRate > ERROR_BUDGET) {
    logger.error('Error budget exceeded', { errorRate, budget: ERROR_BUDGET });
    // Roll back by flipping the in-memory flag; reassigning process.env here
    // wouldn't affect code that read the flag at startup
    flags.readFromNew = false;
  }
}

Real Example: Splitting a Wide Table

Let’s walk through a real example: splitting orders into orders + order_items.

The Problem

You have a wide orders table:

CREATE TABLE orders (
  id BIGSERIAL PRIMARY KEY,
  user_id BIGINT,
  item_1_name VARCHAR(255),
  item_1_price DECIMAL(10,2),
  item_2_name VARCHAR(255),
  item_2_price DECIMAL(10,2),
  -- ... more items
  created_at TIMESTAMP
);

This is hard to query. You want to normalize it:

CREATE TABLE orders (
  id BIGSERIAL PRIMARY KEY,
  user_id BIGINT,
  created_at TIMESTAMP
);

CREATE TABLE order_items (
  id BIGSERIAL PRIMARY KEY,
  order_id BIGINT REFERENCES orders(id),
  name VARCHAR(255),
  price DECIMAL(10,2)
);

The Solution

Step 1: Expand

Create the new tables. Keep the old table:

-- Create new tables
CREATE TABLE orders_new (
  id BIGSERIAL PRIMARY KEY,
  user_id BIGINT,
  created_at TIMESTAMP
);

CREATE TABLE order_items_new (
  id BIGSERIAL PRIMARY KEY,
  order_id BIGINT REFERENCES orders_new(id),
  name VARCHAR(255),
  price DECIMAL(10,2)
);

-- Add foreign key index
CREATE INDEX CONCURRENTLY idx_order_items_order_id ON order_items_new(order_id);

Step 2: Dual-Write

Write to both old and new:

async function createOrder(orderData) {
  const client = await pool.connect();
  
  try {
    await client.query('BEGIN');
    
    // Write to old table
    await client.query(
      `INSERT INTO orders (user_id, item_1_name, item_1_price, item_2_name, item_2_price, created_at)
       VALUES ($1, $2, $3, $4, $5, $6)`,
      [orderData.userId, orderData.items[0].name, orderData.items[0].price,
       orderData.items[1]?.name ?? null, orderData.items[1]?.price ?? null, new Date()]
    );
    
    // Write to new tables
    const orderResult = await client.query(
      `INSERT INTO orders_new (user_id, created_at) VALUES ($1, $2) RETURNING id`,
      [orderData.userId, new Date()]
    );
    const orderId = orderResult.rows[0].id;
    
    for (const item of orderData.items) {
      await client.query(
        `INSERT INTO order_items_new (order_id, name, price) VALUES ($1, $2, $3)`,
        [orderId, item.name, item.price]
      );
    }
    
    await client.query('COMMIT');
  } catch (error) {
    await client.query('ROLLBACK');
    throw error;
  } finally {
    client.release();
  }
}

Step 3: Backfill

Backfill existing orders:

async function backfillOrders() {
  const BATCH_SIZE = 100;
  let lastId = 0;
  
  while (true) {
    const orders = await pool.query(
      `SELECT * FROM orders WHERE id > $1 ORDER BY id LIMIT $2`,
      [lastId, BATCH_SIZE]
    );
    
    if (orders.rows.length === 0) {
      break;
    }
    
    for (const order of orders.rows) {
      // Idempotent: skip rows a previous (or partial) run already copied
      const orderResult = await pool.query(
        `INSERT INTO orders_new (id, user_id, created_at)
         VALUES ($1, $2, $3)
         ON CONFLICT (id) DO NOTHING
         RETURNING id`,
        [order.id, order.user_id, order.created_at]
      );
      if (orderResult.rows.length === 0) {
        continue; // Already migrated
      }
      
      // Insert items (the old table has up to 10 item_N_* column pairs)
      for (let i = 1; i <= 10; i++) {
        const itemName = order[`item_${i}_name`];
        const itemPrice = order[`item_${i}_price`];
        
        if (itemName) {
          await pool.query(
            `INSERT INTO order_items_new (order_id, name, price) VALUES ($1, $2, $3)`,
            [order.id, itemName, itemPrice]
          );
        }
      }
    }
    
    lastId = orders.rows[orders.rows.length - 1].id;
    await sleep(100);
  }
}
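
One gap in this example worth closing: Step 2’s dual-writes let orders_new generate ids from its own sequence, while the backfill copies the old ids verbatim, so the two can collide. A simple guard is to advance the new table’s sequence past the old table’s id range before dual-writes begin, and re-check it once the backfill finishes:

-- Advance the orders_new id sequence past any id it might otherwise reuse
SELECT setval(
  pg_get_serial_sequence('orders_new', 'id'),
  GREATEST(
    (SELECT COALESCE(MAX(id), 1) FROM orders),
    (SELECT COALESCE(MAX(id), 1) FROM orders_new)
  )
);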

Step 4: Contract

Switch reads to new tables. Drop old table:

-- Rename tables in one transaction so readers never see a gap
BEGIN;
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
ALTER TABLE order_items_new RENAME TO order_items;
COMMIT;

-- After a few days, drop old table
DROP TABLE orders_old;

Contract phase with cascade deletes: If you have foreign keys, handle cascade deletes:

-- Add cascade delete (re-adding the constraint re-validates every row and
-- locks both tables; on large tables use the NOT VALID + VALIDATE approach shown earlier)
ALTER TABLE order_items 
  DROP CONSTRAINT order_items_order_id_fkey,
  ADD CONSTRAINT order_items_order_id_fkey 
    FOREIGN KEY (order_id) REFERENCES orders(id) ON DELETE CASCADE;

This ensures referential integrity.

Code Requirements

Here are the concrete code snippets you need.

Safe Migration SQL (PostgreSQL)

-- Step 1: Add nullable column (no default, no lock)
ALTER TABLE orders ADD COLUMN status VARCHAR(50);

-- Step 2: Create index concurrently (no lock)
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);

-- Step 3: Backfill data (in application code, not SQL)

-- Step 4: After backfill, add NOT NULL (brief ACCESS EXCLUSIVE lock plus a
-- validating scan; see the CHECK-constraint variant below for large tables)
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;

-- Step 5: After contract phase, drop old column
ALTER TABLE orders DROP COLUMN old_status;
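
On very large tables, Step 4’s SET NOT NULL is the risky part because it scans the table while holding an ACCESS EXCLUSIVE lock. A gentler route is to prove non-nullness with a CHECK constraint first; on PostgreSQL 12+ the final SET NOT NULL can then skip the scan entirely:

-- Add the check without validating existing rows (brief lock)
ALTER TABLE orders
  ADD CONSTRAINT orders_status_not_null CHECK (status IS NOT NULL) NOT VALID;

-- Validate separately (weaker lock; the full scan happens here)
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;

-- On PostgreSQL 12+, SET NOT NULL uses the validated constraint and skips the scan
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_status_not_null;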

Dual-Write in Service Code (Node.js)

const { Pool } = require('pg');

const pool = new Pool({
  connectionString: process.env.DATABASE_URL
});

async function createOrder(orderData) {
  const client = await pool.connect();
  
  try {
    await client.query('BEGIN');
    
    // Write to new shape
    const result = await client.query(
      `INSERT INTO orders (user_id, amount, status, created_at)
       VALUES ($1, $2, $3, $4)
       RETURNING id`,
      [orderData.userId, orderData.amount, orderData.status || 'pending', new Date()]
    );
    
    const orderId = result.rows[0].id;
    
    // Also write to old shape for backward compatibility
    if (orderData.oldStatus) {
      await client.query(
        `UPDATE orders SET old_status = $1 WHERE id = $2`,
        [orderData.oldStatus, orderId]
      );
    }
    
    await client.query('COMMIT');
    return { id: orderId };
  } catch (error) {
    await client.query('ROLLBACK');
    throw error;
  } finally {
    client.release();
  }
}

// With retries
async function createOrderWithRetry(orderData, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await createOrder(orderData);
    } catch (error) {
      if (i === maxRetries - 1) {
        throw error;
      }
      await sleep(1000 * 2 ** i); // Exponential backoff: 1s, 2s, 4s, ...
    }
  }
}

Backfill Worker

const { Pool } = require('pg');

const pool = new Pool({
  connectionString: process.env.DATABASE_URL
});

class BackfillWorker {
  constructor(config) {
    this.batchSize = config.batchSize || 1000;
    this.throttleMs = config.throttleMs || 100;
    this.resumeTokenKey = config.resumeTokenKey;
    this.metrics = {
      rowsProcessed: 0,
      rowsPerSecond: 0,
      lag: 0,
      startTime: Date.now()
    };
  }
  
  async getResumeToken() {
    const result = await pool.query(
      'SELECT last_processed_id FROM backfill_progress WHERE job_name = $1',
      [this.resumeTokenKey]
    );
    return result.rows[0]?.last_processed_id || 0;
  }
  
  async saveResumeToken(lastId) {
    await pool.query(
      `INSERT INTO backfill_progress (job_name, last_processed_id, updated_at)
       VALUES ($1, $2, NOW())
       ON CONFLICT (job_name) 
       DO UPDATE SET last_processed_id = $2, updated_at = NOW()`,
      [this.resumeTokenKey, lastId]
    );
  }
  
  async processBatch() {
    const lastId = await this.getResumeToken();
    
    const rows = await pool.query(
      `SELECT id, old_status FROM orders
       WHERE id > $1 AND status IS NULL
       ORDER BY id
       LIMIT $2`,
      [lastId, this.batchSize]
    );
    
    if (rows.rows.length === 0) {
      return false; // Done
    }
    
    for (const row of rows.rows) {
      // Idempotent update
      await pool.query(
        `UPDATE orders SET status = $1 WHERE id = $2 AND status IS NULL`,
        [row.old_status || 'pending', row.id]
      );
    }
    
    const newLastId = rows.rows[rows.rows.length - 1].id;
    await this.saveResumeToken(newLastId);
    
    // Update metrics
    this.metrics.rowsProcessed += rows.rows.length;
    const elapsed = (Date.now() - this.metrics.startTime) / 1000;
    this.metrics.rowsPerSecond = this.metrics.rowsProcessed / elapsed;
    this.metrics.lag = newLastId;
    
    // Log progress
    console.log(`Processed ${this.metrics.rowsProcessed} rows, ${this.metrics.rowsPerSecond.toFixed(2)} rows/sec, lag: ${this.metrics.lag}`);
    
    return true; // More to process
  }
  
  async shadowReadCompare(orderId) {
    const [oldResult, newResult] = await Promise.all([
      pool.query('SELECT * FROM orders_old WHERE id = $1', [orderId]),
      pool.query('SELECT * FROM orders WHERE id = $1', [orderId])
    ]);
    
    if (JSON.stringify(oldResult.rows[0]) !== JSON.stringify(newResult.rows[0])) {
      console.warn('Shadow read mismatch', { orderId, old: oldResult.rows[0], new: newResult.rows[0] });
      return false;
    }
    
    return true;
  }
  
  async run() {
    console.log('Starting backfill...');
    
    try {
      while (await this.processBatch()) {
        await sleep(this.throttleMs);
      }
      
      console.log('Backfill complete!', this.metrics);
    } catch (error) {
      console.error('Backfill failed', { error, metrics: this.metrics });
      throw error;
    }
  }
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Usage
const worker = new BackfillWorker({
  batchSize: 1000,
  throttleMs: 100,
  resumeTokenKey: 'backfill_orders_status'
});

worker.run().catch(console.error);

Load Test Script (k6)

import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

const errorRate = new Rate('errors');

export const options = {
  stages: [
    { duration: '2m', target: 100 }, // Ramp up
    { duration: '5m', target: 100 }, // Stay at 100 users
    { duration: '2m', target: 0 },   // Ramp down
  ],
  thresholds: {
    'http_req_duration': ['p(95)<500', 'p(99)<1000'],
    'errors': ['rate<0.01'],
  },
};

export default function () {
  const idempotencyKey = `test-${__VU}-${__ITER}`;
  const payload = JSON.stringify({
    userId: `user-${Math.floor(Math.random() * 1000)}`,
    amount: Math.floor(Math.random() * 1000),
    status: 'pending'
  });
  
  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Idempotency-Key': idempotencyKey,
    },
  };
  
  const res = http.post('http://localhost:3000/orders', payload, params);
  
  const success = check(res, {
    'status is 200 or 201': (r) => r.status === 200 || r.status === 201,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  
  errorRate.add(!success);
  
  // Retry with same idempotency key
  if (Math.random() < 0.1) { // 10% retry rate
    sleep(1);
    const retryRes = http.post('http://localhost:3000/orders', payload, params);
    check(retryRes, {
      'retry returns same result': (r) => {
        const body = JSON.parse(r.body);
        return body.id === JSON.parse(res.body).id;
      },
    });
  }
  
  sleep(1);
}

Metrics to Watch (Grafana/Prometheus)

Monitor these metrics during migrations:

Latency:

  • http_request_duration_seconds{p95, p99}
  • database_query_duration_seconds{p95, p99}

Errors:

  • http_requests_total{status=~"5.."}
  • database_errors_total

Locks:

  • postgres_locks_total
  • postgres_lock_wait_time_seconds

Replication lag:

  • postgres_replication_lag_seconds

Backfill progress:

  • backfill_rows_processed_total
  • backfill_rows_per_second
  • backfill_lag

Deadlocks:

  • postgres_deadlocks_total

Set up alerts for:

  • P95 latency > 500ms
  • Error rate > 1%
  • Lock wait time > 1s
  • Replication lag > 10s
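
If you export the backfill counters listed above yourself, a prom-client sketch might look like this (metric names match the list; wiring it into the BackfillWorker and serving /metrics is left out):

const client = require('prom-client');

const rowsProcessed = new client.Counter({
  name: 'backfill_rows_processed_total',
  help: 'Rows backfilled so far'
});

const rowsPerSecond = new client.Gauge({
  name: 'backfill_rows_per_second',
  help: 'Current backfill throughput'
});

const backfillLag = new client.Gauge({
  name: 'backfill_lag',
  help: 'Highest primary key the backfill has reached'
});

// Inside BackfillWorker.processBatch():
//   rowsProcessed.inc(rows.rows.length);
//   rowsPerSecond.set(this.metrics.rowsPerSecond);
//   backfillLag.set(newLastId);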

Conclusion

Zero-downtime schema migrations are possible. The expand-contract pattern makes them safe.

Remember:

  1. Expand: Add nullable columns, create indexes concurrently
  2. Dual-write: Write to both old and new shapes
  3. Contract: Switch reads, freeze old path, drop old columns

Use feature flags. Monitor metrics. Have a rollback plan. Test in staging first.

Migrations cause a large share of outages. With the right approach, you can avoid most of them.
