Multi-Region 'Strong Enough' Consistency: Designing Around Reality, Not Theory
Most teams get stuck between “CP vs AP” and end up with either slow global transactions or fragile eventual consistency.
This article focuses on practical patterns for “strong enough” consistency in multi-region systems. Not perfect consistency. Not eventual consistency. Something in between that works in production.
Why Consistency Feels Hard in Multi-Region Systems
Three things make consistency hard across regions.
Latency of Cross-Region Calls
A round trip between US East and Asia Pacific takes 200-300ms. Sometimes more. If you need to coordinate writes across regions, every operation becomes slow.
Users notice 200ms. They really notice 500ms. At 1 second, they think your app is broken.
You can’t make light travel faster. You can design around it.
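A rough way to see the cost is to count sequential cross-region round trips. The sketch below is back-of-the-envelope arithmetic, not a measurement: the function name, the 20ms local processing time, and the 250ms round trip (picked from the range above) are all assumptions.

```typescript
// Rough latency estimate for a write that must coordinate across regions.
// All names and numbers here are illustrative assumptions.
function estimateWriteLatencyMs(
  localProcessingMs: number,
  crossRegionRttMs: number,
  sequentialRoundTrips: number
): number {
  return localProcessingMs + crossRegionRttMs * sequentialRoundTrips;
}

// A purely local write: no cross-region coordination
const localOnly = estimateWriteLatencyMs(20, 250, 0); // 20
// One synchronous cross-region acknowledgement
const oneRoundTrip = estimateWriteLatencyMs(20, 250, 1); // 270
// A two-step protocol (prepare, then commit) pays the round trip twice
const twoRoundTrips = estimateWriteLatencyMs(20, 250, 2); // 520
```

Two coordinated round trips already push a write past the 500ms mark. That is why the rest of this article tries to keep cross-region coordination off the hot path.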
Network Partitions Are Normal, Not Rare
The CAP theorem says that during a network partition you have to give up either consistency or availability. But here’s the thing: partitions happen all the time.
Not just big outages. Small ones. Network hiccups. DNS issues. Load balancer problems. They happen daily.
If you design for “partitions never happen,” you’ll break when they do.
Business Expectations: “Data Should Just Be Correct”
Users don’t care about CAP theorem. They care that their balance is right. That their order went through. That they didn’t get charged twice.
You need to explain why sometimes data is stale. But you also need to make sure critical data is never wrong.
“Strong Enough” Consistency as a Design Target
Instead of one global answer, define what must be strongly consistent and what can be eventually consistent.
Per-Entity and Per-Operation Guarantees
Not all data needs the same guarantees.
Payments:
- Must be strongly consistent
- Read-after-write required
- No double spending
- Can accept higher latency
Analytics:
- Can be eventually consistent
- Delayed is fine
- Must be correct eventually
- Low latency not critical
Notifications:
- Can be eventually consistent
- Delayed is fine
- Duplicates are annoying but not critical
- Low latency helps but not required
User profiles:
- Depends on the field:
  - Email: strongly consistent
  - Display name: eventually consistent is fine
  - Preferences: eventually consistent is fine
Define guarantees per entity. Per operation. Not globally.
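One way to make these guarantees explicit is a policy table that code can look up. This is a minimal sketch: the type names, the `policyFor` helper, and the staleness numbers are assumptions for illustration, not values from a real system.

```typescript
type ConsistencyLevel = 'strong' | 'eventual';

interface ConsistencyPolicy {
  level: ConsistencyLevel;
  readAfterWrite: boolean;
  // Upper bound on acceptable staleness; 0 means no stale reads allowed
  maxStalenessMs: number;
}

// Hypothetical policy table; keys are "entity" or "entity.field"
const policies: Record<string, ConsistencyPolicy> = {
  'payment': { level: 'strong', readAfterWrite: true, maxStalenessMs: 0 },
  'analytics': { level: 'eventual', readAfterWrite: false, maxStalenessMs: 3600000 },
  'notification': { level: 'eventual', readAfterWrite: false, maxStalenessMs: 60000 },
  'profile.email': { level: 'strong', readAfterWrite: true, maxStalenessMs: 0 },
  'profile.displayName': { level: 'eventual', readAfterWrite: false, maxStalenessMs: 5000 },
  'profile.preferences': { level: 'eventual', readAfterWrite: false, maxStalenessMs: 5000 },
};

// Anything not listed falls back to the strictest policy
const STRICTEST: ConsistencyPolicy = {
  level: 'strong',
  readAfterWrite: true,
  maxStalenessMs: 0,
};

function policyFor(entity: string, field?: string): ConsistencyPolicy {
  if (field !== undefined) {
    const fieldPolicy = policies[`${entity}.${field}`];
    if (fieldPolicy !== undefined) return fieldPolicy;
  }
  return policies[entity] ?? STRICTEST;
}
```

Defaulting unknown entities to the strictest policy is a deliberate choice in this sketch: forgetting to classify something costs latency, not correctness.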
Consistency Profiles by Use Case
Profile A: Critical Financial State
Needs: read-after-write, no double spending, no lost updates.
Examples: account balances, payment transactions, inventory counts.
How to achieve:
- Regional primary with synchronous replication for writes
- Read-your-own-write via sticky sessions
- Quorum writes for selected entities
- Version numbers to prevent lost updates
Profile B: Collaborative or Social Features
Can accept short-lived conflicts. Users can resolve them.
Examples: document editing, comments, likes, follows.
How to achieve:
- Last-write-wins for simple cases
- Operational transforms for complex cases
- Conflict markers for manual resolution
- Eventual consistency with short delay (seconds)
Profile C: Analytics and Reporting
Can be delayed but must be correct eventually.
Examples: dashboards, reports, metrics, logs.
How to achieve:
- Event streaming to global analytics store
- Batch processing
- Eventual consistency with longer delay (minutes to hours)
- Idempotent aggregation
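Idempotent aggregation matters because event streams usually redeliver. A minimal sketch, assuming every event carries a unique `eventId` assigned by the producer (the interface and class names are hypothetical):

```typescript
interface MetricEvent {
  eventId: string; // unique per event, assigned by the producer (assumption)
  metric: string;
  value: number;
}

class IdempotentAggregator {
  private seen = new Set<string>();
  private totals = new Map<string, number>();

  // Returns true if the event was applied, false if it was a duplicate
  apply(event: MetricEvent): boolean {
    if (this.seen.has(event.eventId)) return false;
    this.seen.add(event.eventId);
    const current = this.totals.get(event.metric) ?? 0;
    this.totals.set(event.metric, current + event.value);
    return true;
  }

  total(metric: string): number {
    return this.totals.get(metric) ?? 0;
  }
}

const aggregator = new IdempotentAggregator();
const signup: MetricEvent = { eventId: 'evt-1', metric: 'signups', value: 1 };
aggregator.apply(signup);
aggregator.apply(signup); // redelivery: ignored
aggregator.apply({ eventId: 'evt-2', metric: 'signups', value: 1 });
// aggregator.total('signups') is 2, not 3
```

An in-memory set grows without bound. A real pipeline would keep the seen IDs in a store with a retention window at least as long as the redelivery window.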
Multi-Region Deployment Patterns
You have three main options. Each has trade-offs.
Active-Passive
One region is primary. Others are replicas.
Pros:
- Simple to understand
- Strong consistency easy
- No conflict resolution needed
Cons:
- Higher RTO/RPO (recovery time/point objectives)
- Failover takes time
- Passive regions waste resources
- All traffic goes to one region (latency)
When to use:
- Disaster recovery only
- Low write volume
- Can accept failover time
- Budget constraints
Active-Active with Regional Primaries
Each region is primary for its local users.
Pros:
- Low latency for local users
- Better resource utilization
- Natural load distribution
- Can handle region failures
Cons:
- Need conflict resolution
- Cross-region reads might be stale
- More complex to operate
- Need to handle user movement
When to use:
- Users are regionally distributed
- Low cross-region interaction
- Can accept eventual consistency for some data
- Need low latency
How it works:
- User in US East → writes go to US East primary
- User in Asia Pacific → writes go to Asia Pacific primary
- Reads from local region are fresh
- Reads from other regions might be stale
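The routing rule above can be sketched as a small function. The directory of home regions, the user IDs, and the `routeRequest` name are hypothetical. The point is only that writes follow the data owner while reads stay local and carry a staleness flag.

```typescript
type Region = 'us-east' | 'eu-west' | 'ap-southeast';

// Hypothetical directory mapping each user to a home (primary) region
const homeRegion = new Map<string, Region>([
  ['user-1', 'us-east'],
  ['user-2', 'ap-southeast'],
]);

interface RouteDecision {
  target: Region;
  mayBeStale: boolean;
}

function routeRequest(
  ownerId: string,
  servingRegion: Region,
  op: 'read' | 'write'
): RouteDecision {
  // Unknown users are treated as local to the region that received the request
  const home = homeRegion.get(ownerId) ?? servingRegion;
  if (op === 'write') {
    // Writes always go to the owner's regional primary
    return { target: home, mayBeStale: false };
  }
  // Reads are served locally; they are fresh only in the owner's home region
  return { target: servingRegion, mayBeStale: servingRegion !== home };
}
```

For example, `routeRequest('user-2', 'us-east', 'write')` targets `ap-southeast`, while a read of the same user's data served in `us-east` stays local and is flagged as possibly stale.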
Global Services + Regional Caches
One source of truth with edge acceleration.
Pros:
- Strong consistency
- Simple mental model
- No conflict resolution
- Easy to reason about
Cons:
- Higher latency for remote users
- Single point of failure (mitigated with replication)
- More expensive (cross-region traffic)
When to use:
- Need strong consistency everywhere
- Can accept higher latency
- Budget for cross-region traffic
- Simple operations preferred
How it works:
- All writes go to global database
- Regional caches for reads
- Cache invalidation on writes
- Stale reads possible but bounded
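“Bounded” staleness usually comes from pairing invalidation with a maximum age, so a lost invalidation message cannot leave a stale entry in place forever. A minimal in-memory sketch, where `RegionalCache`, `globalStore`, and the read/write helpers are all hypothetical names:

```typescript
// Minimal regional read cache in front of a global source of truth
class RegionalCache<V> {
  private entries = new Map<string, { value: V; cachedAt: number }>();
  private maxAgeMs: number;
  private now: () => number;

  constructor(maxAgeMs: number, now: () => number = () => Date.now()) {
    this.maxAgeMs = maxAgeMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() - entry.cachedAt > this.maxAgeMs) {
      // Too old: the age limit is what bounds staleness
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, cachedAt: this.now() });
  }

  invalidate(key: string): void {
    this.entries.delete(key);
  }
}

// Stand-in for the global database
const globalStore = new Map<string, number>();

function writeBalance(cache: RegionalCache<number>, id: string, balance: number): void {
  globalStore.set(id, balance); // all writes go to the global store
  cache.invalidate(id);         // then the cached copy is dropped
}

function readBalance(cache: RegionalCache<number>, id: string): number | undefined {
  const cached = cache.get(id);
  if (cached !== undefined) return cached;
  const fresh = globalStore.get(id);
  if (fresh !== undefined) cache.set(id, fresh);
  return fresh;
}
```

In a real deployment the invalidation has to reach every region's cache, and that message can be delayed or dropped. The `maxAgeMs` limit is the fallback that keeps the staleness bounded when it is.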
Techniques for “Strong Enough” Behavior
Here are practical techniques you can use.
Idempotent Operations with Request IDs
Every write operation should be idempotent. Same request ID = same result.
// Client sends request with idempotency key
const response = await fetch('/api/payments', {
method: 'POST',
headers: {
'Idempotency-Key': 'payment-123-abc',
'Content-Type': 'application/json'
},
body: JSON.stringify({
amount: 100,
currency: 'USD'
})
});
// Server processes once, returns same result on retries
How it works:
- Client generates unique idempotency key
- Sends with request
- Server checks if key exists
- If exists, return cached result
- If not, process and store result
- Return result
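The server side of those steps can be sketched like this. It is a single-process, in-memory illustration: `IdempotentHandler` and the payment types are hypothetical, and the `Map` stands in for whatever store you use.

```typescript
interface PaymentRequest {
  amount: number;
  currency: string;
}

interface PaymentResult {
  paymentId: string;
  status: 'succeeded';
}

class IdempotentHandler<Req, Res> {
  private results = new Map<string, Res>();
  private process: (req: Req) => Res;

  constructor(process: (req: Req) => Res) {
    this.process = process;
  }

  handle(idempotencyKey: string, req: Req): Res {
    // Key already seen: return the stored result without reprocessing
    if (this.results.has(idempotencyKey)) {
      return this.results.get(idempotencyKey) as Res;
    }
    // First time: process, then remember the result under the key
    const result = this.process(req);
    this.results.set(idempotencyKey, result);
    return result;
  }
}

let chargeCount = 0;
const payments = new IdempotentHandler<PaymentRequest, PaymentResult>(() => {
  chargeCount += 1; // the side effect we must not repeat
  return { paymentId: `pay-${chargeCount}`, status: 'succeeded' };
});

const first = payments.handle('payment-123-abc', { amount: 100, currency: 'USD' });
const retry = payments.handle('payment-123-abc', { amount: 100, currency: 'USD' });
// chargeCount is 1 and retry.paymentId equals first.paymentId
```

With concurrent requests across processes, the check and the store must be one atomic step (for example a unique constraint on the key). Otherwise two retries can both pass the check.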
Storage:
- Redis for fast lookups (TTL: 24 hours)
- Database for durability (TTL: 7 days)
- Hybrid: check Redis first, fall back to database
Key format:
- Include operation type: payment-{userId}-{timestamp}
- Or use UUID: {uuid}
- Document your format
Versioning with Optimistic Locking
Use version numbers or ETags to prevent lost updates.
// Entity has version field
interface Account {
id: string;
balance: number;
version: number;
updatedAt: Date;
}
// Read with version
const account = await db.accounts.findOne({ id: 'acc-123' });
// account.version = 5
// Update with version check
const result = await db.accounts.updateOne(
{ id: 'acc-123', version: 5 },
{
$set: { balance: 200, version: 6 },
$currentDate: { updatedAt: true }
}
);
if (result.matchedCount === 0) {
// Version mismatch - someone else updated
throw new VersionConflictError('Account was modified');
}
ETags in HTTP:
// GET returns ETag
const response = await fetch('/api/accounts/123');
const etag = response.headers.get('ETag');
const account = await response.json();
// PUT includes ETag
const updateResponse = await fetch('/api/accounts/123', {
method: 'PUT',
headers: {
'If-Match': etag,
'Content-Type': 'application/json'
},
body: JSON.stringify({ balance: 200 })
});
if (updateResponse.status === 412) {
// Precondition failed - version mismatch
// Read again and retry
}
When to use:
- Entities that change frequently
- Need to prevent lost updates
- Can accept retries on conflict
- Not too high contention
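“Can accept retries on conflict” means the caller re-reads and reapplies its change. Here is a minimal sketch of that loop against an in-memory store. `VersionedStore` and `updateWithRetry` are hypothetical names, and the store stands in for the versioned database update shown above.

```typescript
interface Versioned<T> {
  value: T;
  version: number;
}

class VersionedStore<T> {
  private rows = new Map<string, Versioned<T>>();

  put(id: string, value: T): void {
    this.rows.set(id, { value, version: 1 });
  }

  read(id: string): Versioned<T> | undefined {
    return this.rows.get(id);
  }

  // Writes only if the caller saw the latest version
  compareAndSet(id: string, expectedVersion: number, value: T): boolean {
    const row = this.rows.get(id);
    if (row === undefined || row.version !== expectedVersion) return false;
    this.rows.set(id, { value, version: expectedVersion + 1 });
    return true;
  }
}

function updateWithRetry<T>(
  store: VersionedStore<T>,
  id: string,
  mutate: (current: T) => T,
  maxAttempts: number = 3
): T {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const row = store.read(id);
    if (row === undefined) throw new Error(`Entity ${id} not found`);
    const next = mutate(row.value);
    if (store.compareAndSet(id, row.version, next)) return next;
    // Someone else wrote first: loop to re-read and reapply the change
  }
  throw new Error(`Gave up on ${id} after ${maxAttempts} conflicting attempts`);
}

const accounts = new VersionedStore<number>();
accounts.put('acc-123', 100);
const newBalance = updateWithRetry(accounts, 'acc-123', balance => balance + 50);
// newBalance is 150 and the stored version is now 2
```

The mutation is passed as a function rather than a final value, so each retry recomputes from the latest state. The attempt cap keeps a hot entity from retrying forever, which is where “not too high contention” bites.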
Quorum Writes for Selected Entities
For critical entities, write to a quorum of regions before returning success.
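async function quorumWrite(
  entityId: string,
  data: any,
  regions: string[]
): Promise<void> {
  const quorumSize = Math.floor(regions.length / 2) + 1;
  const maxFailures = regions.length - quorumSize;
  const writePromises = regions.map(region =>
    writeToRegion(region, entityId, data)
  );
  // Return as soon as a quorum acknowledges, instead of waiting for the
  // slowest region (Promise.allSettled would wait for every region).
  await new Promise<void>((resolve, reject) => {
    let successes = 0;
    let failures = 0;
    for (const write of writePromises) {
      write.then(
        () => {
          if (++successes === quorumSize) resolve();
        },
        err => {
          logger.error('Region write failed', err);
          // Too many failures: a quorum can no longer be reached
          if (++failures > maxFailures) {
            reject(new QuorumWriteFailedError('Failed to write to quorum'));
          }
        }
      );
    }
  });
  // Writes to the remaining regions keep running in the background;
  // their failures are logged by the handler above.
}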
Trade-offs:
- Higher latency (wait for quorum)
- Better durability
- Can handle single region failure
- More complex
When to use:
- Critical financial data
- Can accept higher latency
- Need durability guarantees
- Low write volume
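The “single region failure” claim follows from majority arithmetic. A short sketch using the same formula as `quorumSize` in the function above (the helper names are hypothetical):

```typescript
// Majority quorum: more than half of the regions
function quorumSizeFor(regionCount: number): number {
  return Math.floor(regionCount / 2) + 1;
}

// How many regions can be down while writes still succeed
function toleratedFailures(regionCount: number): number {
  return regionCount - quorumSizeFor(regionCount);
}

// 3 regions: quorum of 2, tolerates 1 failure
// 4 regions: quorum of 3, still tolerates only 1 failure
// 5 regions: quorum of 3, tolerates 2 failures
const summary = [3, 4, 5].map(n => ({
  regions: n,
  quorum: quorumSizeFor(n),
  tolerated: toleratedFailures(n),
}));
```

Going from three regions to four adds latency and cost without adding fault tolerance, which is why odd region counts are the usual choice for quorums.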
Read-Your-Own-Write via Sticky Sessions or Request Routing
Users should see their own writes immediately, even in multi-region.
Sticky sessions:
// Route user to same region for session
function getRegionForUser(userId: string): string {
// Hash user ID to region
const hash = hashString(userId);
const regions = ['us-east', 'eu-west', 'ap-southeast'];
return regions[hash % regions.length];
}
// All requests from this user go to same region
// Writes are local, reads are local
// Consistent view for user
Request routing:
// Include region hint in request
const response = await fetch('/api/orders', {
method: 'POST',
headers: {
'X-User-Region': 'us-east',
'Content-Type': 'application/json'
},
body: JSON.stringify(orderData)
});
// Server routes to user's primary region
// Ensures read-your-own-write
When to use:
- Users interact with their own data
- Low cross-user interaction
- Can accept stale data for other users
- Need low latency
Conflict Handling Strategies
Conflicts happen. Here’s how to handle them.
Last Write Wins and When It Is Actually Okay
Last write wins is simple. But it’s not always safe.
When it’s okay:
- Non-critical data
- Last-updated timestamps (not counters: concurrent increments get lost)
- User preferences
- Display names
- Settings
When it’s not okay:
- Financial transactions
- Inventory counts
- Critical state changes
- Anything that can cause data loss
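To see why inventory counts are on the “not okay” list, consider two regions that each sell one unit at nearly the same time. The sketch below uses made-up timestamps and a hypothetical `StockWrite` shape:

```typescript
interface StockWrite {
  stock: number;
  writtenAt: number; // wall-clock milliseconds, illustrative values only
}

function lastWriteWinsMerge(a: StockWrite, b: StockWrite): StockWrite {
  return b.writtenAt > a.writtenAt ? b : a;
}

// Both regions read stock = 10, and each sells one unit
const startingStock = 10;
const usEastWrite: StockWrite = { stock: startingStock - 1, writtenAt: 1000 };
const apSoutheastWrite: StockWrite = { stock: startingStock - 1, writtenAt: 1005 };

const merged = lastWriteWinsMerge(usEastWrite, apSoutheastWrite);
// merged.stock is 9, but two units were sold, so the correct value is 8
```

Last write wins keeps one whole value and discards the other. That is harmless when the value is a display name and wrong when the value is the result of a read-modify-write.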
Implementation:
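async function lastWriteWins(
  entityId: string,
  newData: any,
  timestamp: Date
): Promise<void> {
  // Create the entity if it does not exist yet. $setOnInsert is a no-op
  // when the document is already there (assumes a unique index on id).
  await db.entities.updateOne(
    { id: entityId },
    { $setOnInsert: { ...newData, updatedAt: timestamp } },
    { upsert: true }
  );
  // Apply the write only if it is newer than what is stored. Putting the
  // timestamp check in the filter makes compare-and-write atomic, unlike
  // a separate read followed by an update.
  await db.entities.updateOne(
    { id: entityId, updatedAt: { $lt: timestamp } },
    { $set: { ...newData, updatedAt: timestamp } }
  );
  // If nothing matched, a newer write already won and this one is ignored.
  // Note: comparing timestamps across regions relies on synchronized clocks.
}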
Merge Functions
For complex conflicts, use merge functions.
Example: Counters
async function mergeCounters(
entityId: string,
increment: number,
region: string
): Promise<void> {
// Use atomic increment
await db.counters.updateOne(
{ id: entityId },
{
$inc: { value: increment },
$set: { [`regions.${region}`]: Date.now() }
},
{ upsert: true }
);
}
Example: Sets
async function mergeSets(
entityId: string,
newItems: string[],
region: string
): Promise<void> {
// Union operation - add all items
await db.sets.updateOne(
{ id: entityId },
{
$addToSet: { items: { $each: newItems } },
$set: { [`regions.${region}`]: Date.now() }
},
{ upsert: true }
);
}
Example: Preferences
interface Preferences {
theme: string;
language: string;
notifications: boolean;
}
async function mergePreferences(
userId: string,
newPrefs: Partial<Preferences>,
region: string
): Promise<void> {
// Field-level merge
const update: any = {
[`regions.${region}`]: Date.now()
};
// Only update provided fields
if (newPrefs.theme !== undefined) {
update['prefs.theme'] = newPrefs.theme;
}
if (newPrefs.language !== undefined) {
update['prefs.language'] = newPrefs.language;
}
if (newPrefs.notifications !== undefined) {
update['prefs.notifications'] = newPrefs.notifications;
}
await db.users.updateOne(
{ id: userId },
{ $set: update },
{ upsert: true }
);
}
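The functions above merge a new write into the stored record. When two regions have already diverged, the same field-level idea works as a pure function over both copies. This sketch assumes each field carries its own update timestamp, which the examples above do not store; `Stamped` and `mergePrefsRecords` are hypothetical names.

```typescript
interface Stamped<T> {
  value: T;
  updatedAt: number; // per-field timestamp (assumption for this sketch)
}

interface PrefsRecord {
  theme: Stamped<string>;
  language: Stamped<string>;
  notifications: Stamped<boolean>;
}

// Keep whichever copy of a single field was written later
function newer<T>(a: Stamped<T>, b: Stamped<T>): Stamped<T> {
  return b.updatedAt > a.updatedAt ? b : a;
}

function mergePrefsRecords(a: PrefsRecord, b: PrefsRecord): PrefsRecord {
  return {
    theme: newer(a.theme, b.theme),
    language: newer(a.language, b.language),
    notifications: newer(a.notifications, b.notifications),
  };
}

// The user changed the theme in one region and the language in another
const usCopy: PrefsRecord = {
  theme: { value: 'dark', updatedAt: 200 },
  language: { value: 'en', updatedAt: 100 },
  notifications: { value: true, updatedAt: 100 },
};
const apCopy: PrefsRecord = {
  theme: { value: 'light', updatedAt: 100 },
  language: { value: 'ja', updatedAt: 300 },
  notifications: { value: true, updatedAt: 100 },
};
const mergedPrefs = mergePrefsRecords(usCopy, apCopy);
// mergedPrefs keeps theme 'dark' and language 'ja': neither edit is lost
```

A record-level last-write-wins would have dropped the theme change here. Merging per field keeps both edits, as long as they touch different fields.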
Human Resolution Flows
For conflicts that can’t be automatically resolved, mark them for human review.
Marking conflicts:
interface ConflictRecord {
entityId: string;
entityType: string;
conflictType: 'version' | 'merge' | 'data';
versions: any[];
detectedAt: Date;
status: 'pending' | 'resolved';
resolution?: string;
resolvedAt?: Date;
resolvedBy?: string;
}
async function markConflict(
entityId: string,
entityType: string,
versions: any[]
): Promise<void> {
await db.conflicts.insertOne({
entityId,
entityType,
conflictType: 'version',
versions,
detectedAt: new Date(),
status: 'pending'
});
// Notify support team
await notifySupport({
type: 'conflict_detected',
entityId,
entityType
});
}
Simple dashboard for support:
// GET /api/admin/conflicts
async function getConflicts(req: Request, res: Response) {
const conflicts = await db.conflicts.find({
status: 'pending'
}).sort({ detectedAt: -1 }).limit(100);
res.json(conflicts);
}
// POST /api/admin/conflicts/:id/resolve
async function resolveConflict(req: Request, res: Response) {
const { id } = req.params;
const { resolution, version } = req.body;
await db.conflicts.updateOne(
{ id },
{
$set: {
status: 'resolved',
resolution,
resolvedAt: new Date(),
resolvedBy: req.user.id
}
}
);
// Apply resolution
await applyResolution(id, version);
res.json({ success: true });
}
Designing APIs with Consistency in Mind
Your API design affects consistency. Here’s what to include.
Version Fields
Always include version fields in responses.
interface APIResponse<T> {
data: T;
version: number;
etag: string;
lastModified: Date;
}
// GET /api/accounts/123
{
"data": {
"id": "123",
"balance": 1000
},
"version": 5,
"etag": "\"abc123\"",
"lastModified": "2025-11-22T10:00:00Z"
}
Timestamps
Include timestamps for all entities.
interface Entity {
id: string;
// ... other fields
createdAt: Date;
updatedAt: Date;
// Optional: region-specific timestamps
regions?: {
[region: string]: Date;
};
}
Consistency Hints
Tell clients about data freshness.
interface APIResponse<T> {
data: T;
stale?: boolean;
staleAfter?: Date;
region?: string;
consistencyLevel?: 'strong' | 'eventual';
}
// Response headers
{
"X-Data-Stale": "false",
"X-Stale-After": "2025-11-22T10:05:00Z",
"X-Data-Region": "us-east",
"X-Consistency-Level": "strong"
}
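On the client, these hints only help if something acts on them. Here is a minimal sketch of reading the headers above and deciding whether to refetch. The `FreshnessHints` shape and both function names are hypothetical. The header getter has the same shape as `Headers.get`, so in a browser you could pass `name => response.headers.get(name)`.

```typescript
interface FreshnessHints {
  stale: boolean;
  staleAfter: Date | null;
  region: string | null;
}

function parseFreshnessHints(
  getHeader: (name: string) => string | null
): FreshnessHints {
  const raw = getHeader('X-Stale-After');
  const parsed = raw === null ? null : new Date(raw);
  // Ignore a malformed timestamp instead of failing the whole response
  const staleAfter =
    parsed !== null && !Number.isNaN(parsed.getTime()) ? parsed : null;
  return {
    stale: getHeader('X-Data-Stale') === 'true',
    staleAfter,
    region: getHeader('X-Data-Region'),
  };
}

function shouldRefetch(hints: FreshnessHints, now: Date): boolean {
  if (hints.stale) return true;
  return hints.staleAfter !== null && now.getTime() > hints.staleAfter.getTime();
}

// Example using the header values shown above
const exampleHeaders = new Map<string, string>([
  ['X-Data-Stale', 'false'],
  ['X-Stale-After', '2025-11-22T10:05:00Z'],
  ['X-Data-Region', 'us-east'],
]);
const hints = parseFreshnessHints(name => exampleHeaders.get(name) ?? null);
const refetchEarly = shouldRefetch(hints, new Date('2025-11-22T10:04:00Z')); // false
const refetchLate = shouldRefetch(hints, new Date('2025-11-22T10:06:00Z'));  // true
```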
Return Clear Error Types
Use specific error types for consistency issues.
class VersionConflictError extends Error {
constructor(
public entityId: string,
public currentVersion: number,
public providedVersion: number
) {
super(`Version conflict: entity ${entityId} has version ${currentVersion}, but ${providedVersion} was provided`);
this.name = 'VersionConflictError';
}
}
class RegionUnavailableError extends Error {
constructor(public region: string) {
super(`Region ${region} is currently unavailable`);
this.name = 'RegionUnavailableError';
}
}
// In your API
try {
await updateEntity(id, data, version);
} catch (error) {
if (error instanceof VersionConflictError) {
return res.status(409).json({
error: 'VersionConflict',
message: error.message,
currentVersion: error.currentVersion,
providedVersion: error.providedVersion
});
}
if (error instanceof RegionUnavailableError) {
return res.status(503).json({
error: 'RegionUnavailable',
message: error.message,
region: error.region,
retryAfter: 60
});
}
throw error;
}
Document Behavior
Document what clients can expect.
/**
* GET /api/accounts/:id
*
* Returns account balance.
*
* Consistency:
* - Strong consistency within region
* - You might see stale data for up to 2 seconds when reading from other regions
* - Your own writes are always visible immediately
*
* Headers:
* - X-Data-Stale: true if data might be stale
* - X-Stale-After: timestamp after which the data should be treated as stale
* - X-Data-Region: region where data was read from
*
* Errors:
* - 503 RegionUnavailable: primary region is down, try again later
*/
Observability and SLOs
You can’t manage what you don’t measure.
Metrics
Track these metrics:
Stale-read rate:
// Emit metric when read is stale
if (isStale) {
metrics.increment('reads.stale', {
entity_type: 'account',
region: currentRegion,
source_region: dataRegion
});
}
Cross-region latency:
const startTime = Date.now();
const result = await crossRegionRead(entityId, region);
const latency = Date.now() - startTime;
metrics.histogram('reads.cross_region_latency', latency, {
source_region: currentRegion,
target_region: region
});
Conflict rate:
try {
await updateWithVersion(entityId, data, version);
} catch (error) {
if (error instanceof VersionConflictError) {
metrics.increment('conflicts.version', {
entity_type: getEntityType(entityId),
region: currentRegion
});
}
}
Region availability:
async function checkRegionHealth(region: string): Promise<boolean> {
try {
// fetch has no `timeout` option; abort with a signal after 5 seconds instead
const response = await fetch(`https://${region}.api.example.com/health`, {
signal: AbortSignal.timeout(5000)
});
const healthy = response.ok;
metrics.gauge('regions.health', healthy ? 1 : 0, {
region
});
return healthy;
} catch (error) {
metrics.gauge('regions.health', 0, {
region
});
return false;
}
}
SLO Examples
Define clear SLOs.
“95% of reads are fresh within 2 seconds”
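// Track freshness as replication lag, not as the age of the data:
// a record last updated yesterday is still fresh if no newer write exists.
// Assumes the replica stamps `replicatedAt` when it applies a write.
const lagMs = entity.replicatedAt.getTime() - entity.updatedAt.getTime();
if (lagMs > 2000) {
// Stale: this write took more than 2 seconds to reach the replica
metrics.increment('slo.reads_fresh.violation');
} else {
metrics.increment('slo.reads_fresh.success');
}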
// Alert if violation rate > 5%
if (violationRate > 0.05) {
alert.send('SLO violation: reads freshness', {
violationRate,
threshold: 0.05
});
}
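The snippet above uses `violationRate` without showing where it comes from. One way to derive it is from the two counters over an evaluation window. This is a sketch, and the function names are hypothetical:

```typescript
// Share of reads in the window that violated the freshness target
function computeViolationRate(successes: number, violations: number): number {
  const total = successes + violations;
  // No traffic in the window: report no violations rather than dividing by zero
  return total === 0 ? 0 : violations / total;
}

function breachesSlo(
  successes: number,
  violations: number,
  maxViolationRate: number
): boolean {
  return computeViolationRate(successes, violations) > maxViolationRate;
}

// 960 fresh reads and 40 stale reads: 4% violations, within the 5% budget
const withinBudget = breachesSlo(960, 40, 0.05); // false
// 940 fresh reads and 60 stale reads: 6% violations, over budget
const overBudget = breachesSlo(940, 60, 0.05); // true
```

Evaluate the rate over a fixed window (say, the last five minutes) rather than over all time. Otherwise a long healthy history hides a fresh regression.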
“99.9% of balance reads are strongly consistent”
// Track consistency level
if (consistencyLevel === 'strong') {
metrics.increment('slo.balance_consistency.success');
} else {
metrics.increment('slo.balance_consistency.violation');
}
// Alert if violation rate > 0.1%
if (violationRate > 0.001) {
alert.send('SLO violation: balance consistency', {
violationRate,
threshold: 0.001
});
}
“99.95% region availability”
// Track region uptime
const uptime = await getRegionUptime(region);
const availability = uptime / totalTime;
if (availability < 0.9995) {
alert.send('SLO violation: region availability', {
region,
availability,
threshold: 0.9995
});
}
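It helps to translate an availability target into a downtime budget that on-call engineers can reason about. A small worked calculation (the function name and the 30-day window are assumptions):

```typescript
// Minutes of downtime allowed by an availability target over a window
function allowedDowntimeMinutes(
  availabilityTarget: number,
  windowDays: number
): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - availabilityTarget);
}

// 30 days = 43,200 minutes
const budget9995 = allowedDowntimeMinutes(0.9995, 30); // about 21.6 minutes
const budget999 = allowedDowntimeMinutes(0.999, 30);   // about 43.2 minutes
```

At 99.95%, a single slow manual failover can spend the whole month's budget. That is a useful check on whether the target matches your recovery process.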
Case Study: Moving from Single-Region to Multi-Region
Here’s how one team did it.
Start: Single-Region App
They had a single-region app in US East. Everything worked fine. Until it didn’t.
Problems:
- Users in Asia Pacific had 300ms+ latency
- Single point of failure
- Disaster recovery was manual and slow
- Couldn’t scale beyond one region
Need: Lower Latency + Better DR
They needed:
- Lower latency for Asian users
- Better disaster recovery
- Ability to handle region failures
Design: Regional Primaries
They chose active-active with regional primaries.
Architecture:
- US East: primary for US users
- Asia Pacific: primary for Asian users
- Each region has its own database
- Global identity and payments service (strongly consistent)
Data partitioning:
- Users assigned to region based on signup location
- Can move users between regions (with data migration)
- Most data is region-local
- Some data is global (identity, payments)
Consistency model:
- Account balances: strongly consistent within region, eventually consistent across regions (with short delay)
- Payments: strongly consistent globally (via global service)
- User profiles: eventually consistent (short delay acceptable)
- Analytics: eventually consistent (longer delay acceptable)
Result
Latency improvements:
- US users: 50ms → 50ms (no change, already good)
- Asian users: 300ms → 50ms (6x improvement)
New consistency trade-offs:
- Cross-region reads might be stale for 1-2 seconds
- Need conflict resolution for some operations
- More complex operations
- Need to handle user movement between regions
What they learned:
- Start with read replicas first (simpler)
- Move to regional primaries only when needed
- Not all data needs strong consistency
- Clear SLOs help set expectations
- Monitoring is critical
Practical Checklist
Before going multi-region, ask these questions.
Questions to Ask
- Do you actually need multi-region?
  - What’s your current latency?
  - How many users are affected?
  - Can you solve it with CDN/caching?
- What data needs strong consistency?
  - Financial data? Yes.
  - User profiles? Maybe.
  - Analytics? No.
- What’s your RTO/RPO?
  - How long can you be down?
  - How much data can you lose?
  - This affects your architecture choice.
- What’s your budget?
  - Multi-region costs more
  - Cross-region traffic costs money
  - More infrastructure to manage
- What’s your team size?
  - Multi-region is more complex
  - Need people who understand it
  - More operational overhead
Simple Decision Table
If data = financial transactions → pattern = global service with strong consistency
Use a global service for payments. Regional caches only for reads that can tolerate bounded staleness. Writes and balance-critical reads stay strongly consistent.
If data = user profiles → pattern = regional primaries with eventual consistency
Each region is primary for local users. Cross-region reads might be stale. Short delay acceptable.
If data = analytics → pattern = event streaming with batch processing
Events stream to global analytics store. Batch processing. Longer delay acceptable.
If data = collaborative documents → pattern = operational transforms or conflict resolution
Use operational transforms for real-time. Or conflict markers for manual resolution.
Summary
Multi-region consistency isn’t about choosing CP or AP. It’s about choosing the right consistency for each piece of data.
Start with what must be strongly consistent. Make everything else eventually consistent. Set clear SLOs. Monitor everything.
Use these techniques:
- Idempotency keys for safe retries
- Version numbers for conflict prevention
- Regional primaries for low latency
- Merge functions for conflict resolution
- Clear APIs with consistency hints
The code examples in the repository show working implementations. Use them as a starting point. Adapt them to your needs.
Remember: “Strong enough” is better than perfect. Perfect is the enemy of shipped.