By Yusuf Elborey

Cell-Based Architectures for SaaS: Designing for Blast Radius, Not Just Scale


Most SaaS systems start simple. One database. Horizontally scaled services. Everything shared. It works fine until it doesn’t.

Then you hit the real problems. One tenant’s bad query slows everyone down. Your database hits hard limits. An incident takes down all customers at once.

This article shows you how to move to a cell-based architecture. Multiple self-contained cells that each serve a subset of tenants. The goal isn’t just scale. It’s limiting blast radius.

The Real Problem: One Big Cluster Hurts

You probably started here:

  • One shared database
  • Horizontally scaled stateless services
  • All tenants in the same pool

This works for a while. Then you notice problems.

Noisy Neighbors

One tenant runs a report that scans millions of rows. Your database CPU spikes. Every other tenant’s queries slow down. Response times jump from 50ms to 2 seconds.

You can’t tell that tenant to stop. They’re paying customers. You can’t predict which tenant will cause problems next.

Hard Limits on Database Scaling

Your database can only get so big. Vertical scaling hits hardware limits. Read replicas help, but writes still bottleneck. Sharding helps, but it’s complex and risky.

At some point, you can’t scale the database anymore. You’re stuck.

Incidents Affect Everyone

A bug in your code affects all tenants. Connection pool exhaustion on the database takes down the whole system. A deployment issue impacts every customer.

One bad tenant can take down the whole system. I’ve seen it happen. A tenant’s integration started making 10,000 requests per minute. Our API gateway couldn’t handle it. The whole platform went down.

The Shared Everything Trap

Everything is shared. Databases. Caches. Queues. Services. When one thing breaks, everything breaks. When one tenant misbehaves, everyone suffers.

You need isolation. Not just logical isolation. Physical isolation.

What Is a Cell-Based Architecture?

A cell is a full stack slice. Services. Storage. Infrastructure. Everything needed to serve a subset of tenants.

Each cell is self-contained. It has its own database. Its own caches. Its own queues. Its own services.

Tenants are mapped to cells. Tenant A goes to Cell 1. Tenant B goes to Cell 2. Each cell operates independently.

Compare to Other Approaches

Classic sharding:

Sharding splits data across databases. But services are still shared. A service failure still affects everyone. Cell-based architecture isolates everything.

Multi-region architectures:

Multi-region gives you geographic distribution. But regions are still shared within themselves. Cells give you isolation within a region.

The core idea:

Limit blast radius more than you chase infinite scale. When Cell 1 has a problem, only tenants in Cell 1 are affected. The rest keep running.

Key Design Questions

Before you build, answer these questions.

How Do You Map Tenants to Cells?

You have two main options:

Hash-based mapping:

Hash the tenant ID. Use modulo to pick a cell. Simple. Predictable. But hard to move specific tenants, and changing the cell count remaps most of them.

function getCellForTenant(tenantId: string, totalCells: number): string {
  const hash = hashString(tenantId);
  const cellIndex = hash % totalCells;
  return `cell-${cellIndex}`;
}
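To see what modulo mapping does when the cell count changes, here is a small illustration. It is a sketch, not part of the original design: `fnv1a` is one possible stand-in for `hashString`, and the tenant IDs are made up.

```typescript
// Hypothetical stand-in for hashString: 32-bit FNV-1a.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

function cellIndexFor(tenantId: string, totalCells: number): number {
  return fnv1a(tenantId) % totalCells;
}

// Fraction of tenants whose cell changes when the cell count changes.
function fractionMoved(tenantIds: string[], before: number, after: number): number {
  const moved = tenantIds.filter(
    (id) => cellIndexFor(id, before) !== cellIndexFor(id, after)
  ).length;
  return moved / tenantIds.length;
}

const sampleTenants = Array.from({ length: 1000 }, (_, i) => `tenant-${i}`);
console.log(fractionMoved(sampleTenants, 4, 5));
```

With a uniform hash, going from 4 to 5 cells leaves a tenant in place only when `h % 4 === h % 5`, which holds for about one hash value in five. That is why hash-only mapping usually ends up paired with a directory of explicit assignments once cells are added.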

Rules-based mapping:

Use rules. VIP tenants get their own cell. Enterprise customers go to dedicated cells. Free tier goes to shared cells.

function getCellForTenant(tenantId: string, tenantTier: string): string {
  if (tenantTier === 'enterprise') {
    return `cell-enterprise-${hashString(tenantId) % 10}`;
  }
  if (tenantTier === 'vip') {
    return `cell-vip-${tenantId}`;
  }
  return `cell-shared-${hashString(tenantId) % 100}`;
}

Handling outliers:

Some tenants are huge. They need their own cell. Some tenants are tiny. They can share cells.

Plan for growth. When a tenant outgrows a shared cell, move them to a dedicated cell.
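One way to handle outliers without giving up simple hashing is an explicit override table that is checked before the hash. This is a sketch under assumptions: the `pinnedCells` map, the `cell-dedicated-...` naming, and the `simpleHash` helper are illustrative names, not taken from a specific system.

```typescript
// Illustrative hash; any stable string hash works here.
function simpleHash(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = (Math.imul(hash, 31) + input.charCodeAt(i)) | 0;
  }
  return Math.abs(hash);
}

class TenantPlacement {
  // Explicit assignments win over the hash (hypothetical structure).
  private pinnedCells = new Map<string, string>();

  constructor(private sharedCellCount: number) {}

  // Pin a tenant to a specific cell, for example after it outgrows a shared one.
  pin(tenantId: string, cellId: string): void {
    this.pinnedCells.set(tenantId, cellId);
  }

  cellFor(tenantId: string): string {
    const pinned = this.pinnedCells.get(tenantId);
    if (pinned !== undefined) {
      return pinned;
    }
    return `cell-shared-${simpleHash(tenantId) % this.sharedCellCount}`;
  }
}

const placement = new TenantPlacement(100);
placement.pin("tenant-huge", "cell-dedicated-tenant-huge");
console.log(placement.cellFor("tenant-huge"));  // cell-dedicated-tenant-huge
console.log(placement.cellFor("tenant-small")); // one of the shared cells
```

Small tenants keep the cheap hash path. Only the outliers need a row in the override table.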

What Lives Inside a Cell?

Each cell needs:

  • App services: Your business logic. API servers. Workers. Everything that processes requests.
  • Databases: Primary database. Read replicas if needed. Each cell owns its data.
  • Caches: Redis or Memcached. Isolated per cell.
  • Messaging/queues: Message queues. Task queues. Isolated per cell.

Everything is self-contained. A cell can run independently of other cells.

What Is Shared Globally?

Some things must be shared:

  • Authentication: User identity. JWT validation. OAuth flows.
  • Identity and tenant directory: Which tenant exists. Which cell serves them.
  • Observability: Metrics. Logs. Tracing. Aggregated across cells.
  • Control plane: The system that manages cells. Routing. Provisioning.

Keep the shared layer small. Make it highly available. It’s a single point of failure if you’re not careful.

Control Plane vs Data Plane

Split your system into two planes.

Data Plane

The data plane handles actual requests:

  • Request routing: Which cell gets this request?
  • Business services: Your actual application logic.
  • Data storage: Databases. Caches. Queues.

The data plane is where cells live. Each cell is part of the data plane.

Control Plane

The control plane manages the system:

  • Tenant directory: Where to send each tenant’s traffic.
  • Config store: Tenant to cell mappings. Cell health. Capacity.
  • Orchestration: Provisioning new cells. Rebalancing tenants. Health checks.

The control plane is shared. But it’s read-heavy. You can make it highly available with caching and replication.

Keeping Them Loosely Coupled

The control plane tells the data plane where to route. The data plane doesn’t need to know about other cells.

Use events for coordination. When a tenant moves cells, emit an event. Data plane components react to the event.

Keep dependencies minimal. The data plane should work even if the control plane is slow.
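A minimal sketch of that event flow, assuming an in-process publish/subscribe bus for illustration. In production this would more likely be a message broker. `TenantMovedEvent`, `RoutingBus`, and `RouteCache` are hypothetical names.

```typescript
interface TenantMovedEvent {
  tenantId: string;
  fromCell: string;
  toCell: string;
}

type Listener = (event: TenantMovedEvent) => void;

// Tiny stand-in for a real event bus.
class RoutingBus {
  private listeners: Listener[] = [];

  subscribe(listener: Listener): void {
    this.listeners.push(listener);
  }

  publish(event: TenantMovedEvent): void {
    for (const listener of this.listeners) {
      listener(event);
    }
  }
}

// Data-plane routing cache that reacts to moves instead of polling.
class RouteCache {
  private routes = new Map<string, string>();

  constructor(bus: RoutingBus) {
    bus.subscribe((event) => {
      this.routes.set(event.tenantId, event.toCell);
    });
  }

  set(tenantId: string, cellId: string): void {
    this.routes.set(tenantId, cellId);
  }

  get(tenantId: string): string | undefined {
    return this.routes.get(tenantId);
  }
}

const bus = new RoutingBus();
const gatewayCache = new RouteCache(bus);
gatewayCache.set("tenant-123", "cell-1");

// The control plane announces the move; the gateway updates its own cache.
bus.publish({ tenantId: "tenant-123", fromCell: "cell-1", toCell: "cell-2" });
console.log(gatewayCache.get("tenant-123")); // cell-2
```

The gateway never calls the control plane in this path. It only listens, so a slow control plane does not slow down routing.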

Routing Strategies

How do requests find the right cell?

Edge / API Gateway

Your API gateway does the routing:

  1. Tenant identification: Extract tenant ID from JWT, subdomain, or header.
  2. Cell lookup: Query the control plane (with caching).
  3. Route request: Forward to the correct cell.
  4. Handle errors: If cell is down, fall back gracefully.

async function routeRequest(request: Request): Promise<Response> {
  // Extract tenant ID
  const tenantId = extractTenantId(request);
  
  // Look up cell (with caching)
  const cellUrl = await getCellForTenant(tenantId);
  
  if (!cellUrl) {
    return new Response('Tenant not found', { status: 404 });
  }
  
  // Forward request
  try {
    return await forwardToCell(request, cellUrl);
  } catch (error) {
    // Handle cell unavailable
    return handleCellUnavailable(tenantId, error);
  }
}

Cell Lookup with Caching

The cell directory service maps tenants to cells:

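class CellDirectory {
  private cache = new Map<string, { cell: string; expires: number }>();
  private ttl = 300000; // 5 minutes

  // controlPlane is any client that exposes getCell(tenantId)
  constructor(private controlPlane: { getCell(tenantId: string): Promise<string> }) {}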
  
  async getCellForTenant(tenantId: string): Promise<string> {
    // Check cache
    const cached = this.cache.get(tenantId);
    if (cached && cached.expires > Date.now()) {
      return cached.cell;
    }
    
    // Query control plane
    const cell = await this.controlPlane.getCell(tenantId);
    
    // Cache result
    this.cache.set(tenantId, {
      cell,
      expires: Date.now() + this.ttl
    });
    
    return cell;
  }
}

Fallback Behavior

What if the cell is down?

Fail-closed:

Return an error. Better to fail fast than route to the wrong place.

async function handleCellUnavailable(tenantId: string, error: Error): Promise<Response> {
  // Log the error
  logger.error('Cell unavailable', { tenantId, error });
  
  // Return 503 Service Unavailable
  return new Response('Service temporarily unavailable', {
    status: 503,
    headers: { 'Retry-After': '60' }
  });
}

Read-only fallback:

For read operations, route to a read replica or another cell with cached data.

async function handleCellUnavailable(tenantId: string, error: Error, method: string): Promise<Response> {
  if (method === 'GET') {
    // Try read replica or cached data
    const fallbackData = await getCachedData(tenantId);
    if (fallbackData) {
      return new Response(JSON.stringify(fallbackData), {
        headers: { 'X-Data-Source': 'cache' }
      });
    }
  }
  
  return new Response('Service temporarily unavailable', { status: 503 });
}

Maintenance pages:

If a cell is in maintenance, show a maintenance page instead of an error.
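A sketch of how the gateway might choose a reply from the cell status. The status names mirror the ones in the control plane example in the Code Examples section; the reply shape, messages, and retry values are assumptions.

```typescript
type CellStatus = "active" | "provisioning" | "maintenance" | "degraded";

interface GatewayReply {
  status: number;
  retryAfterSeconds: number;
  body: string;
}

// Returns null when the request should be forwarded to the cell as usual.
function replyForCellStatus(cellStatus: CellStatus): GatewayReply | null {
  if (cellStatus === "maintenance") {
    return {
      status: 503,
      retryAfterSeconds: 600,
      body: "Scheduled maintenance in progress. Please try again shortly.",
    };
  }
  if (cellStatus === "provisioning") {
    return {
      status: 503,
      retryAfterSeconds: 60,
      body: "Your workspace is still being set up.",
    };
  }
  // "active" and "degraded" cells still receive traffic.
  return null;
}

console.log(replyForCellStatus("maintenance"));
console.log(replyForCellStatus("active")); // null
```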

Data and Schema Concerns

Each cell owns its data. This creates new challenges.

Independent Schema Migrations

Each cell can migrate independently. But you need coordination.

Strategy 1: Forward-compatible migrations

Make migrations forward-compatible. Old code works with new schema. New code works with old schema. Migrate cells one at a time.

-- Forward-compatible: add nullable column
ALTER TABLE users ADD COLUMN new_field VARCHAR(255) NULL;

-- Deploy new code that uses new_field

-- Backfill data
UPDATE users SET new_field = compute_value(id) WHERE new_field IS NULL;

-- Make column non-nullable (after all cells migrated)
ALTER TABLE users ALTER COLUMN new_field SET NOT NULL;

Strategy 2: Feature flags

Use feature flags to control which cells use new schema. Gradually roll out.

Strategy 3: Blue-green per cell

Each cell has blue and green databases. Migrate green. Switch traffic. Migrate blue.
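Whichever strategy you pick, the "one cell at a time" rollout can be sketched as a small runner that stops at the first failure, so a bad migration reaches one cell and not all of them. `applyMigration` and `verifyCell` are hypothetical hooks you would supply.

```typescript
interface MigrationResult {
  migrated: string[];
  failedCell: string | null;
}

async function migrateCellsInOrder(
  cellIds: string[],
  applyMigration: (cellId: string) => Promise<void>,
  verifyCell: (cellId: string) => Promise<boolean>
): Promise<MigrationResult> {
  const migrated: string[] = [];
  for (const cellId of cellIds) {
    try {
      await applyMigration(cellId);
      // Verify before moving on; a failed check halts the rollout.
      if (!(await verifyCell(cellId))) {
        return { migrated, failedCell: cellId };
      }
    } catch {
      return { migrated, failedCell: cellId };
    }
    migrated.push(cellId);
  }
  return { migrated, failedCell: null };
}

// Example run with fake hooks: verification fails on cell-1.
migrateCellsInOrder(
  ["cell-canary", "cell-1", "cell-2"],
  async () => {},
  async (cellId) => cellId !== "cell-1"
).then((result) => console.log(result));
```

Putting a canary or low-risk cell first in the list means the riskiest step happens where the blast radius is smallest.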

Independent Performance Tuning

Each cell can tune independently. Cell 1 might need more read replicas. Cell 2 might need different indexes.

You can optimize per tenant pattern. High-read tenants get more replicas. High-write tenants get optimized write paths.

Global Reporting

But you still need global reports. How do you aggregate across cells?

Event streaming:

Each cell publishes events to a global stream:

// CELL_ID identifies the cell this service runs in (for example, from config)
const CELL_ID = 'cell-1';

async function publishEvent(event: { type: string; tenantId: string; data: unknown }) {
  await globalEventStream.publish({
    ...event,
    cellId: CELL_ID,
    timestamp: Date.now()
  });
}

// Example: Order created
await publishEvent({
  type: 'OrderCreated',
  tenantId: 'tenant-123',
  data: { orderId: 'order-456', amount: 99.99 }
});

Global analytics store:

Consume events into a global analytics database:

async function consumeEvents() {
  for await (const event of globalEventStream.consume()) {
    await analyticsDb.insert({
      event_type: event.type,
      tenant_id: event.tenantId,
      cell_id: event.cellId,
      data: event.data,
      timestamp: event.timestamp
    });
  }
}

Trade-offs:

  • Real-time: Fresher reports. More complex. More infrastructure to run.
  • Delayed aggregation: Simpler. Batch processing. Reports lag by a few minutes.

Choose based on your needs. Most SaaS can tolerate a few minutes of delay in analytics.
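For the delayed-aggregation option, a batch job can fold a window of events into per-tenant daily counts. A minimal sketch, assuming events carry `type`, `tenantId`, and a millisecond `timestamp` as in the snippets above. `aggregateDailyCounts` and the key format are illustrative.

```typescript
interface StoredEvent {
  type: string;
  tenantId: string;
  timestamp: number; // milliseconds since epoch
}

// Key format: "<tenantId>|<YYYY-MM-DD>|<type>", using the UTC day.
function aggregateDailyCounts(events: StoredEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const event of events) {
    const day = new Date(event.timestamp).toISOString().slice(0, 10);
    const key = `${event.tenantId}|${day}|${event.type}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

const batch: StoredEvent[] = [
  { type: "OrderCreated", tenantId: "tenant-123", timestamp: Date.UTC(2024, 0, 15, 9) },
  { type: "OrderCreated", tenantId: "tenant-123", timestamp: Date.UTC(2024, 0, 15, 17) },
  { type: "OrderCreated", tenantId: "tenant-456", timestamp: Date.UTC(2024, 0, 16, 1) },
];

const dailyCounts = aggregateDailyCounts(batch);
console.log(dailyCounts.get("tenant-123|2024-01-15|OrderCreated")); // 2
```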

Failure Modes and Blast Radius

The whole point is limiting blast radius. Let’s see how it works.

How Incidents Stay Contained

Single cell offline:

Cell 1 goes down. Tenants in Cell 1 are affected. Tenants in Cell 2, 3, 4 keep running.

You’ve isolated the failure. With four equally sized cells, about 25% of tenants are down instead of 100%.

Database issues:

Cell 1’s database has problems. Other cells aren’t affected. You can fix Cell 1 without touching other cells.

Deployment issues:

You deploy bad code to Cell 1. Only Cell 1 breaks. Other cells keep running. You can roll back Cell 1 independently.

Common Failure Patterns

Mis-routed traffic:

A bug routes Tenant A to the wrong cell. Tenant A’s data isn’t there. Requests fail.

Fix: Validate routing. Have each cell confirm it owns the tenant before serving a request. Monitor and alert on routing errors.
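One concrete form of routing validation is a guard inside each cell that refuses tenants it does not own, so a mis-routed request fails loudly instead of reading an empty dataset. A sketch, assuming each cell holds a local set of its tenant IDs. The class name and the choice of HTTP 421 (Misdirected Request) are illustrative.

```typescript
interface GuardResult {
  status: number;
  message: string;
}

class CellOwnershipGuard {
  private ownedTenants: Set<string>;
  private misroutedCount = 0;

  constructor(private cellId: string, ownedTenantIds: string[]) {
    this.ownedTenants = new Set(ownedTenantIds);
  }

  // 200 means proceed; 421 means the tenant belongs to another cell.
  check(tenantId: string): GuardResult {
    if (this.ownedTenants.has(tenantId)) {
      return { status: 200, message: "ok" };
    }
    this.misroutedCount++;
    return {
      status: 421,
      message: `Tenant ${tenantId} is not served by ${this.cellId}`,
    };
  }

  // Expose the counter so monitoring can alert on routing errors.
  get misrouted(): number {
    return this.misroutedCount;
  }
}

const guard = new CellOwnershipGuard("cell-1", ["tenant-a"]);
console.log(guard.check("tenant-a").status); // 200
console.log(guard.check("tenant-b").status); // 421
```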

Control plane outages:

The control plane goes down. New routing lookups fail. But cached routes still work.

Fix: Long TTLs on routing cache. Fall back to last known good routing when a lookup fails.
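The cache shown earlier ignores an entry once its TTL expires, so it cannot serve a last known good route. Here is a sketch of a variant that keeps expired entries and falls back to them when the control plane lookup fails. `lookup` stands in for the control plane call and `now` is injectable for testing; both are assumptions.

```typescript
type CellLookup = (tenantId: string) => Promise<string>;

class StaleTolerantDirectory {
  private cache = new Map<string, { cell: string; expires: number }>();

  constructor(
    private lookup: CellLookup,
    private ttlMs: number = 300000,
    private now: () => number = Date.now
  ) {}

  async getCellForTenant(tenantId: string): Promise<string> {
    const cached = this.cache.get(tenantId);
    if (cached && cached.expires > this.now()) {
      return cached.cell; // fresh hit
    }
    try {
      const cell = await this.lookup(tenantId);
      this.cache.set(tenantId, { cell, expires: this.now() + this.ttlMs });
      return cell;
    } catch (error) {
      if (cached) {
        // Control plane unavailable: serve the last known good route.
        return cached.cell;
      }
      // Nothing cached for this tenant: fail closed.
      throw error;
    }
  }
}
```

The trade-off is that a tenant moved during a control plane outage keeps its old route until the lookup works again, so pair this with the drain step in the rebalancing flow.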

Health checks:

Cells report health to the control plane. Unhealthy cells are marked. Routing avoids them.

class CellHealthChecker {
  async checkHealth(cellUrl: string): Promise<boolean> {
    try {
      // fetch has no timeout option, so abort the request after 5 seconds
      const response = await fetch(`${cellUrl}/health`, {
        signal: AbortSignal.timeout(5000)
      });
      return response.ok;
    } catch (error) {
      // Network errors and timeouts both count as unhealthy
      return false;
    }
  }
  
  async updateRouting() {
    const cells = await this.getAllCells();
    for (const cell of cells) {
      // Assumes each cell record carries its base URL
      const healthy = await this.checkHealth(cell.url);
      await this.controlPlane.updateCellHealth(cell.id, healthy);
    }
  }
}
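A single failed probe is often a blip, and flipping routing on every blip causes flapping. A common refinement, sketched here as an assumption and not part of the checker above, is to require several consecutive failures before marking a cell unhealthy. The threshold of 3 is arbitrary.

```typescript
class HealthTracker {
  private consecutiveFailures = new Map<string, number>();

  constructor(private failureThreshold: number = 3) {}

  // Record one probe result and return whether the cell still counts as healthy.
  record(cellId: string, probeOk: boolean): boolean {
    const failures = probeOk
      ? 0
      : (this.consecutiveFailures.get(cellId) ?? 0) + 1;
    this.consecutiveFailures.set(cellId, failures);
    return failures < this.failureThreshold;
  }
}

const tracker = new HealthTracker(3);
console.log(tracker.record("cell-1", false)); // true  (1 failure)
console.log(tracker.record("cell-1", false)); // true  (2 failures)
console.log(tracker.record("cell-1", false)); // false (3 failures)
console.log(tracker.record("cell-1", true));  // true  (counter reset)
```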

Fail-closed vs fail-open:

  • Fail-closed: If routing is uncertain, return an error. Safer. But more false failures.
  • Fail-open: If routing is uncertain, try best guess. Riskier. But fewer false failures.

A common split is fail-closed for writes and fail-open for reads.
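That split can be written down as a small policy function. A sketch, assuming HTTP semantics where GET and HEAD are the read methods. `decideOnUncertainRoute` and its return values are illustrative.

```typescript
type RouteDecision = "reject" | "try-best-guess";

// Called only when the router is not sure which cell owns the tenant.
function decideOnUncertainRoute(
  method: string,
  bestGuessCell: string | null
): RouteDecision {
  const isRead = method === "GET" || method === "HEAD";
  // Reads may try the best guess; writes never do, to avoid writing to the wrong cell.
  if (isRead && bestGuessCell !== null) {
    return "try-best-guess";
  }
  return "reject";
}

console.log(decideOnUncertainRoute("GET", "cell-1"));  // try-best-guess
console.log(decideOnUncertainRoute("POST", "cell-1")); // reject
```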

Migration Path from Shared Architecture

You can’t flip a switch. You need a gradual migration.

Phase 1: Introduce Tenant Directory and Routing (Still 1 Cell)

Start with routing infrastructure. But route everything to one cell.

// All tenants go to cell-1 for now
async function getCellForTenant(tenantId: string): Promise<string> {
  return 'cell-1';
}

This gives you:

  • Routing code in place
  • Tenant identification working
  • Control plane ready

No behavior change. But infrastructure is ready.

Phase 2: Carve Out 2-3 Cells

Pick a subset of tenants. Move them to a new cell.

Good candidates:

  • New tenants (no migration needed)
  • Low-traffic tenants (easier to move)
  • Specific region or tier

async function getCellForTenant(tenantId: string): Promise<string> {
  // New tenants go to cell-2
  if (isNewTenant(tenantId)) {
    return 'cell-2';
  }
  
  // Enterprise tenants go to cell-3
  if (getTenantTier(tenantId) === 'enterprise') {
    return 'cell-3';
  }
  
  // Everyone else stays in cell-1
  return 'cell-1';
}

Migrate data. Update routing. Monitor closely.

Phase 3: Automate Cell Creation and Rebalancing

Once you have multiple cells working, automate:

  • Auto-create cells: When a cell gets full, create a new one.
  • Auto-rebalance: Move tenants to balance load.
  • Auto-healing: If a cell has problems, move tenants away.

class CellOrchestrator {
  async rebalanceCells() {
    const cells = await this.getAllCells();
    const load = await this.getCellLoads();
    
    // Find overloaded cells
    const overloaded = cells.filter(cell => load[cell.id] > 0.8);
    
    for (const cell of overloaded) {
      // Move some tenants to underloaded cells
      const tenants = await this.getTenantsInCell(cell.id);
      const toMove = tenants.slice(0, Math.floor(tenants.length * 0.2));
      
      for (const tenant of toMove) {
        await this.moveTenant(tenant, findUnderloadedCell(cells, load));
      }
    }
  }
}
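The orchestrator above calls `findUnderloadedCell` without defining it. One plausible sketch, assuming `load` maps a cell ID to a utilization between 0 and 1, and that cells at or above a target threshold are never chosen. The 0.6 default is an arbitrary example.

```typescript
interface CellRef {
  id: string;
}

// Picks the least-loaded cell below maxLoad, or null if none qualifies.
function findUnderloadedCell(
  cells: CellRef[],
  load: Record<string, number>,
  maxLoad: number = 0.6
): CellRef | null {
  let best: CellRef | null = null;
  let bestLoad = Infinity;
  for (const cell of cells) {
    const cellLoad = load[cell.id];
    // Skip cells with unknown load or too little headroom.
    if (cellLoad === undefined || cellLoad >= maxLoad) {
      continue;
    }
    if (cellLoad < bestLoad) {
      best = cell;
      bestLoad = cellLoad;
    }
  }
  return best;
}

const candidateCells = [{ id: "cell-1" }, { id: "cell-2" }, { id: "cell-3" }];
const currentLoad = { "cell-1": 0.9, "cell-2": 0.3, "cell-3": 0.5 };
console.log(findUnderloadedCell(candidateCells, currentLoad)); // { id: 'cell-2' }
```

The caller has to handle `null`, which means no cell has headroom and a new one should be provisioned. It should also refresh `load` between moves so one target cell does not absorb every tenant.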

Practical Tips

Shadow-routing and mirroring:

Before moving a tenant, shadow their traffic to the new cell. Compare results. Verify correctness.

async function shadowRoute(tenantId: string, newCell: string) {
  const request = await getRequest(tenantId);
  
  // Send to both cells
  const [oldResult, newResult] = await Promise.all([
    forwardToCell(request, 'cell-1'),
    forwardToCell(request, newCell)
  ]);
  
  // Compare results
  if (resultsMatch(oldResult, newResult)) {
    logger.info('Shadow routing successful', { tenantId });
  } else {
    logger.error('Shadow routing mismatch', { tenantId, oldResult, newResult });
  }
}

Canary cell:

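Create a canary cell. Assign a small, fixed set of tenants to it. Test new infrastructure there. Gradually add more tenants.

Pick canary tenants deterministically. Random per-request routing would send a tenant to a cell that doesn’t hold its data.

async function routeWithCanary(tenantId: string): Promise<string> {
  // Roughly 5% of tenants (not requests) fall into the canary bucket.
  // Their data must already live in cell-canary.
  if (hashString(tenantId) % 100 < 5) {
    return 'cell-canary';
  }
  
  return await getCellForTenant(tenantId);
}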

Cost and Team Structure

Cell-based architecture has costs. But also benefits.

Cost Model

More infrastructure overhead:

Each cell needs its own infrastructure. Databases. Caches. Services. More servers. More cost.

But:

  • You can use smaller, cheaper instances per cell
  • You can scale cells independently
  • You can use different instance types per cell

Fewer large outages:

When something breaks, only one cell breaks. You don’t lose all customers. Revenue protection often outweighs infrastructure cost.

Better capacity planning:

You know exactly how many tenants per cell. You can plan capacity better. Less over-provisioning.
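The arithmetic is simple enough to sketch. It assumes a fixed tenant capacity per cell and a headroom percentage kept free for growth and rebalancing. All numbers below are made up.

```typescript
// headroomPercent is the share of each cell deliberately left unused.
function cellsNeeded(
  tenantCount: number,
  capacityPerCell: number,
  headroomPercent: number = 20
): number {
  const usablePerCell = Math.floor((capacityPerCell * (100 - headroomPercent)) / 100);
  if (usablePerCell <= 0) {
    throw new Error("Headroom leaves no usable capacity");
  }
  return Math.ceil(tenantCount / usablePerCell);
}

// 200 tenants per cell with 20% headroom leaves 160 usable slots per cell.
console.log(cellsNeeded(1200, 200)); // 8
```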

Team Ownership

Teams own specific cells:

Team A owns Cell 1 and Cell 2. Team B owns Cell 3 and Cell 4. Clear ownership. Clear responsibility.

Teams can:

  • Deploy independently
  • Tune independently
  • Debug independently

Platform team owns control plane:

One team owns routing. Orchestration. Cell provisioning. They make sure the system works as a whole.

Clear boundaries:

Data plane teams focus on their cells. Platform team focuses on the system. Clear separation of concerns.

Checklist and Guardrails

Before you build, check these.

When Is a Cell-Based Design Worth It?

You need it if:

  • You have noisy neighbor problems
  • Database scaling is hitting limits
  • Incidents affect too many customers
  • You have tenants with very different needs

You don’t need it if:

  • You’re early stage (under 100 tenants)
  • All tenants have similar patterns
  • You don’t have scaling problems yet
  • Your team is too small to operate it

Start simple. Add cells when you need them.

Minimal Viable Cell Design

Start with the minimum:

  • 2-3 cells
  • Simple hash-based routing
  • Manual cell provisioning
  • Basic health checks

Don’t over-engineer. Get it working. Then add complexity.

Pitfalls to Avoid

Too many cells too fast:

Start with 2-3 cells. Learn. Then add more. Don’t create 50 cells on day one.

Complex routing logic:

Keep routing simple. Hash-based or simple rules. Don’t build a complex routing engine until you need it.

Shared state between cells:

Cells should be independent. Don’t share databases. Don’t share caches. If you need shared state, use events.

Ignoring the control plane:

The control plane is critical. Make it highly available. Monitor it closely. Cache aggressively.

Moving tenants too often:

Moving tenants is expensive. Do it carefully. Batch moves. Verify after each move.

Code Examples

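Here are fuller examples you can adapt. They assume supporting pieces such as ControlPlaneClient, EventStream, DataMigrator, and a logger exist in your codebase.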

Tenant to Cell Lookup

interface TenantInfo {
  id: string;
  tier: string;
  createdAt: Date;
}

class CellDirectory {
  private cache: Map<string, { cell: string; expires: number }>;
  private controlPlane: ControlPlaneClient;
  private ttl = 300000; // 5 minutes
  
  constructor(controlPlane: ControlPlaneClient) {
    this.cache = new Map();
    this.controlPlane = controlPlane;
  }
  
  async getCellForTenant(tenantId: string): Promise<string> {
    // Check cache first
    const cached = this.cache.get(tenantId);
    if (cached && cached.expires > Date.now()) {
      return cached.cell;
    }
    
    // Get tenant info
    const tenant = await this.controlPlane.getTenant(tenantId);
    
    // Determine cell
    const cell = this.determineCell(tenant);
    
    // Cache result
    this.cache.set(tenantId, {
      cell,
      expires: Date.now() + this.ttl
    });
    
    return cell;
  }
  
  private determineCell(tenant: TenantInfo): string {
    // VIP tenants get dedicated cells
    if (tenant.tier === 'vip') {
      return `cell-vip-${tenant.id}`;
    }
    
    // Enterprise tenants get shared enterprise cells
    if (tenant.tier === 'enterprise') {
      return `cell-enterprise-${this.hashString(tenant.id) % 20}`;
    }
    
    // Regular tenants use hash-based routing
    return `cell-shared-${this.hashString(tenant.id) % 100}`;
  }
  
  private hashString(str: string): number {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      const char = str.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash = hash & hash; // Convert to 32-bit integer
    }
    return Math.abs(hash);
  }
  
  invalidateCache(tenantId: string): void {
    this.cache.delete(tenantId);
  }
}

Cell Routing Middleware

interface Request {
  headers: Record<string, string>;
  method: string;
  url: string;
  body?: any;
}

class CellRouter {
  private directory: CellDirectory;
  private cellBaseUrls: Map<string, string>;
  
  constructor(directory: CellDirectory, cellBaseUrls: Map<string, string>) {
    this.directory = directory;
    this.cellBaseUrls = cellBaseUrls;
  }
  
  async route(request: Request): Promise<Response> {
    // Extract tenant ID
    const tenantId = this.extractTenantId(request);
    if (!tenantId) {
      return new Response('Tenant ID required', { status: 400 });
    }
    
    // Get cell for tenant
    let cellId: string;
    try {
      cellId = await this.directory.getCellForTenant(tenantId);
    } catch (error) {
      logger.error('Failed to get cell for tenant', { tenantId, error });
      return new Response('Service unavailable', { status: 503 });
    }
    
    // Get cell base URL
    const cellBaseUrl = this.cellBaseUrls.get(cellId);
    if (!cellBaseUrl) {
      logger.error('Cell base URL not found', { cellId });
      return new Response('Service unavailable', { status: 503 });
    }
    
    // Forward request to cell
    try {
      return await this.forwardToCell(request, cellBaseUrl);
    } catch (error) {
      return this.handleCellError(tenantId, cellId, error, request.method);
    }
  }
  
  private extractTenantId(request: Request): string | null {
    // Try JWT token
    const authHeader = request.headers['authorization'];
    if (authHeader) {
      const token = authHeader.replace('Bearer ', '');
      const payload = this.decodeJWT(token);
      if (payload?.tenantId) {
        return payload.tenantId;
      }
    }
    
    // Try header
    if (request.headers['x-tenant-id']) {
      return request.headers['x-tenant-id'];
    }
    
    // Try subdomain
    const host = request.headers['host'];
    if (host) {
      const parts = host.split('.');
      if (parts.length > 2) {
        return parts[0]; // tenant.example.com
      }
    }
    
    return null;
  }
  
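  // Decodes the payload WITHOUT verifying the signature.
  // Only call this after the token has been verified (for example, by the auth layer).
  private decodeJWT(token: string): any {
    try {
      const parts = token.split('.');
      // JWT segments are base64url-encoded
      const payload = JSON.parse(Buffer.from(parts[1], 'base64url').toString());
      return payload;
    } catch (error) {
      return null;
    }
  }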
  
  private async forwardToCell(request: Request, cellBaseUrl: string): Promise<Response> {
    const url = new URL(request.url);
    const targetUrl = `${cellBaseUrl}${url.pathname}${url.search}`;
    
    const response = await fetch(targetUrl, {
      method: request.method,
      headers: request.headers,
      body: request.body ? JSON.stringify(request.body) : undefined
    });
    
    return response;
  }
  
  private handleCellError(
    tenantId: string,
    cellId: string,
    error: Error,
    method: string
  ): Response {
    logger.error('Cell request failed', { tenantId, cellId, error, method });
    
    // For read operations, try cache
    if (method === 'GET') {
      // Could try cached data here
    }
    
    return new Response('Service temporarily unavailable', {
      status: 503,
      headers: { 'Retry-After': '60' }
    });
  }
}

Control Plane API

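import { EventEmitter } from 'node:events';

// Shape of a cell creation request (fields inferred from how createCell uses them)
interface CreateCellRequest {
  id: string;
  capacity: number;
}

interface Cell {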
  id: string;
  status: 'active' | 'provisioning' | 'maintenance' | 'degraded';
  capacity: number;
  currentTenants: number;
  createdAt: Date;
}

class ControlPlaneAPI {
  private cells: Map<string, Cell>;
  private eventEmitter: EventEmitter;
  
  constructor() {
    this.cells = new Map();
    this.eventEmitter = new EventEmitter();
  }
  
  async createCell(request: CreateCellRequest): Promise<Cell> {
    // Validate
    if (!request.id || !request.id.match(/^[a-z0-9-]+$/)) {
      throw new Error('Invalid cell ID');
    }
    
    if (this.cells.has(request.id)) {
      throw new Error('Cell already exists');
    }
    
    // Check capacity
    const totalCapacity = Array.from(this.cells.values())
      .reduce((sum, cell) => sum + cell.capacity, 0);
    
    if (totalCapacity + request.capacity > 10000) {
      throw new Error('Total capacity limit exceeded');
    }
    
    // Create cell
    const cell: Cell = {
      id: request.id,
      status: 'provisioning',
      capacity: request.capacity,
      currentTenants: 0,
      createdAt: new Date()
    };
    
    this.cells.set(cell.id, cell);
    
    // Emit event for infrastructure provisioning
    this.eventEmitter.emit('cell:provision', {
      cellId: cell.id,
      capacity: cell.capacity
    });
    
    // In real implementation, this would trigger:
    // - Database provisioning
    // - Cache provisioning
    // - Service deployment
    // - Health check setup
    
    return cell;
  }
  
  async getCell(cellId: string): Promise<Cell | null> {
    return this.cells.get(cellId) || null;
  }
  
  async getAllCells(): Promise<Cell[]> {
    return Array.from(this.cells.values());
  }
  
  async updateCellStatus(cellId: string, status: Cell['status']): Promise<void> {
    const cell = this.cells.get(cellId);
    if (!cell) {
      throw new Error('Cell not found');
    }
    
    cell.status = status;
    this.eventEmitter.emit('cell:status-change', { cellId, status });
  }
  
  async assignTenantToCell(tenantId: string, cellId: string): Promise<void> {
    const cell = this.cells.get(cellId);
    if (!cell) {
      throw new Error('Cell not found');
    }
    
    if (cell.status !== 'active') {
      throw new Error('Cell is not active');
    }
    
    if (cell.currentTenants >= cell.capacity) {
      throw new Error('Cell is at capacity');
    }
    
    // In real implementation, this would:
    // - Update tenant directory
    // - Trigger data migration
    // - Update routing
    
    cell.currentTenants++;
    this.eventEmitter.emit('tenant:assigned', { tenantId, cellId });
  }
}

Event Streaming to Global Analytics

interface TenantEvent {
  type: string;
  tenantId: string;
  cellId: string;
  data: any;
  timestamp: number;
}

class EventPublisher {
  private cellId: string;
  private globalStream: EventStream;
  
  constructor(cellId: string, globalStream: EventStream) {
    this.cellId = cellId;
    this.globalStream = globalStream;
  }
  
  async publishEvent(type: string, tenantId: string, data: any): Promise<void> {
    const event: TenantEvent = {
      type,
      tenantId,
      cellId: this.cellId,
      data,
      timestamp: Date.now()
    };
    
    await this.globalStream.publish('tenant-events', event);
  }
}

// Usage in your application
class OrderService {
  // Database is the cell-local data access layer (assumed type)
  constructor(private db: Database, private eventPublisher: EventPublisher) {}
  
  async createOrder(tenantId: string, orderData: any): Promise<Order> {
    // Create order in local database
    const order = await this.db.orders.create({
      ...orderData,
      tenantId
    });
    
    // Publish event to global stream
    await this.eventPublisher.publishEvent('OrderCreated', tenantId, {
      orderId: order.id,
      amount: order.amount,
      items: order.items
    });
    
    return order;
  }
}

// Consumer for global analytics
class AnalyticsConsumer {
  private analyticsDb: AnalyticsDatabase;
  
  constructor(analyticsDb: AnalyticsDatabase) {
    this.analyticsDb = analyticsDb;
  }
  
  async start(): Promise<void> {
    const stream = new EventStream('tenant-events');
    
    for await (const event of stream.consume()) {
      await this.processEvent(event);
    }
  }
  
  private async processEvent(event: TenantEvent): Promise<void> {
    // Store in analytics database
    await this.analyticsDb.insert({
      event_type: event.type,
      tenant_id: event.tenantId,
      cell_id: event.cellId,
      data: event.data,
      timestamp: new Date(event.timestamp)
    });
    
    // Update aggregates
    if (event.type === 'OrderCreated') {
      await this.analyticsDb.increment('daily_orders', {
        tenant_id: event.tenantId,
        date: new Date(event.timestamp).toISOString().split('T')[0]
      });
    }
  }
}

Simple Rebalancing Flow

class TenantRebalancer {
  constructor(
    private controlPlane: ControlPlaneAPI,
    private directory: CellDirectory,
    private dataMigrator: DataMigrator
  ) {}
  
  async rebalanceTenant(tenantId: string, targetCellId: string): Promise<void> {
    // Step 1: Get current cell and bail out before changing any state
    const currentCellId = await this.directory.getCellForTenant(tenantId);
    if (currentCellId === targetCellId) {
      throw new Error('Tenant already in target cell');
    }
    
    // Step 2: Mark tenant as moving
    // (markTenantMoving is an assumed control plane method, not shown above)
    await this.controlPlane.markTenantMoving(tenantId);
    
    // Step 3: Backfill data to target cell
    await this.dataMigrator.migrateTenantData(tenantId, currentCellId, targetCellId);
    
    // Step 4: Verify data integrity
    const verified = await this.verifyDataIntegrity(tenantId, currentCellId, targetCellId);
    if (!verified) {
      throw new Error('Data integrity check failed');
    }
    
    // Step 5: Flip routing
    await this.controlPlane.assignTenantToCell(tenantId, targetCellId);
    this.directory.invalidateCache(tenantId);
    
    // Step 6: Wait for traffic to drain from old cell
    await this.waitForTrafficDrain(tenantId, currentCellId);
    
    // Step 7: Clean up old cell data (optional, can be delayed)
    // await this.dataMigrator.cleanupOldData(tenantId, currentCellId);
  }
  
  private async verifyDataIntegrity(
    tenantId: string,
    sourceCell: string,
    targetCell: string
  ): Promise<boolean> {
    // Compare record counts
    const sourceCount = await this.getRecordCount(tenantId, sourceCell);
    const targetCount = await this.getRecordCount(tenantId, targetCell);
    
    if (sourceCount !== targetCount) {
      return false;
    }
    
    // Sample records and compare
    const sample = await this.getSampleRecords(tenantId, sourceCell);
    for (const record of sample) {
      const targetRecord = await this.getRecord(record.id, targetCell);
      if (!this.recordsMatch(record, targetRecord)) {
        return false;
      }
    }
    
    return true;
  }
  
  private async waitForTrafficDrain(tenantId: string, cellId: string): Promise<void> {
    // Wait until no active requests for this tenant in the old cell
    let attempts = 0;
    while (attempts < 60) {
      const activeRequests = await this.getActiveRequestCount(tenantId, cellId);
      if (activeRequests === 0) {
        return;
      }
      // Pause for one second between checks
      await new Promise((resolve) => setTimeout(resolve, 1000));
      attempts++;
    }
    
    throw new Error('Traffic drain timeout');
  }
}

Summary

Cell-based architecture isn’t about infinite scale. It’s about limiting blast radius.

When one cell has problems, other cells keep running. When one tenant misbehaves, other tenants aren’t affected. When you need to scale, you add cells instead of scaling one giant system.

Start simple. Two or three cells. Hash-based routing. Manual provisioning. Get it working.

Then add complexity as you need it. Rules-based routing. Automated rebalancing. More cells.

The code examples above give you a foundation. Adapt them to your needs. Keep it simple. Add complexity only when you need it.

Most importantly: cells are about isolation. Keep them independent. Keep the shared layer small. Make the control plane highly available.

When done right, cell-based architecture gives you the isolation you need without the complexity you don’t.
