Designing Memory-Centric Distributed Systems: Rethinking Data Locality in Modern Architectures
Introduction
Memory is the new disk. Not because disks disappeared, but because the gap between memory speed and everything else keeps growing. CPUs got faster. Networks got faster. NVMe got faster. But when systems fall over in production, the root cause is often the same: data had to move too far, too often, across layers and across machines.
And data movement is expensive. Copying bytes between NUMA nodes adds latency. Crossing a rack adds jitter. Shuffling over the network burns tail latency and CPU cycles. You pay for it in cache misses, queueing, deserialization, and retries. At scale, that cost becomes your architecture.
Memory-centric design flips the default. Instead of pushing data to compute or compute to data without thinking, we start with locality. We place data where it will be used. We align scheduling with placement. We pick protocols and layouts that keep hot paths in memory, on the right socket, and as close as possible to the caller.
Why now? Three trends:
- CXL (Compute Express Link): shared, coherent memory pools across hosts are moving from slides to practice. It won’t solve locality, but it changes what “remote memory” means.
- In-memory databases and grids: Redis, Memcached, Aerospike, Hazelcast, and new shared memory fabrics push sub-millisecond access when designed right.
- RDMA and kernel bypass: frameworks like DPDK, io_uring, and RDMA libraries let you skip the kernel on hot paths. Less overhead. More predictability.
This article is for senior engineers who design and run low-latency systems. We’ll look at patterns, trade-offs, and working code. The goal isn’t to be clever. It’s to reduce hops and avoid moving data you don’t have to.
The Shift Toward Memory-Centric Design
Traditional systems were storage-centric. You designed tables for disk. You batched writes. You cached reads. You scaled by adding replicas and storage nodes. That model still works, but it leaves performance on the table when the bottleneck is data motion, not raw compute.
Memory-centric systems focus on where bytes live in the fast path:
- Keep hot state in memory, near the CPU that will touch it.
- Treat storage as a persistence tier, not a request-time dependency for hot paths.
- Align scheduling with data placement so you don’t fetch on every call.
Two realities shape this design:
- CPU–memory distance matters. NUMA means your memory access time is not uniform. A core reaching into remote memory can cost roughly 1.3–1.8× as much as a local access. That variance compounds under load. When you add cross-socket contention, locks stretch, queues grow, and tail latency spikes.
- Data shuffling dominates cost. Moving 10 KB across the network sounds cheap. Do it a million times per second, with serialization and per-hop queueing, and you’ve built a latency machine. This gets worse when your system shuffles intermediate results (think map-reduce-style joins or chatty microservices).
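A quick napkin calculation makes the shuffle cost concrete. The sketch below just multiplies out the numbers from the second point; the 100 Gbps link figure is an assumption for illustration.
package main

import "fmt"

func main() {
    const payloadBytes = 10 * 1024.0 // 10 KB per request
    const requestsPerSec = 1e6       // one million requests per second
    const linkBytesPerSec = 12.5e9   // ~100 Gbps NIC, in bytes/sec (illustrative)

    volume := payloadBytes * requestsPerSec // bytes moved per second
    fmt.Printf("shuffle volume: %.1f GB/s\n", volume/1e9)
    fmt.Printf("share of a 100 Gbps link: %.0f%%\n", 100*volume/linkBytesPerSec)
}
That is roughly 10 GB/s of payload, before protocol overhead, serialization, and retries.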
Memory-centric design uses a few principles:
- Partition by access pattern, not only by key range. Co-locate keys that are used together.
- Pin work to the partition owner. Prefer idempotent, single-writer flows.
- Avoid unnecessary copies. Use zero-copy paths, direct byte buffers, and cache-friendly structures.
- Prefer a single source of truth in memory with a durable tail (append to NVMe, replicate in background).
It’s less about one tool and more about orchestration. The exact stack—Redis with a write-ahead log on NVMe, a Rust service with RDMA reads, a Go gateway doing affinity routing—depends on your constraints. The patterns repeat.
Key Architectural Patterns
NUMA-aware partitioning
NUMA changes scaling. If a process hops sockets to fetch state, your latency budget melts. A better plan:
- Split data into partitions that fit L3 or main memory budgets per socket.
- Run one worker group per socket. Pin threads to cores. Bind allocations to the local NUMA node.
- Use a lock-minimized design inside each partition (queues per core, single-writer logs, RCU where it helps).
This setup avoids cross-socket thrash. When you must cross sockets, batch and offload. Consider a message-passing hop to hand off work to the owner socket rather than touching its memory directly.
Practical tips:
- Pre-size memory arenas per NUMA node.
- Keep pointer-chasing low. Flat arrays beat linked structures.
- Measure LLC miss rate and remote memory accesses. If they trend up with QPS, your partitions are too big or too chatty.
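A minimal sketch of the pinning step above (Linux-only, and it assumes the golang.org/x/sys/unix package; binding memory to the local node is usually handled with numactl or libnuma at process start):
package main

import (
    "fmt"
    "runtime"

    "golang.org/x/sys/unix"
)

// pinToCPU locks the calling goroutine to its OS thread and restricts that
// thread to a single CPU, so its work and cache lines stay on one core (and,
// by extension, one socket).
func pinToCPU(cpu int) error {
    runtime.LockOSThread() // keep this goroutine on one OS thread
    var set unix.CPUSet
    set.Zero()
    set.Set(cpu)
    return unix.SchedSetaffinity(0, &set) // 0 means "the calling thread"
}

func main() {
    // Hypothetical layout: cores 0-7 on socket 0, cores 8-15 on socket 1.
    if err := pinToCPU(0); err != nil {
        fmt.Println("pin failed:", err)
        return
    }
    fmt.Println("worker pinned to CPU 0; serve this partition from here")
}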
Data locality-first scheduling (affinity)
Schedulers often ignore where data sits. A basic hash ring can help (more below), but go further:
- Route reads to the primary owner when possible. Use replicas only for overflow or failover.
- Stickiness beats balance for hot keys. Slightly uneven load is cheaper than extra hops.
- If you cache at the edge, add soft TTL and background refresh so you don’t stampede the origin on expiry.
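A minimal sketch of that soft-TTL idea, with assumed names and no hard expiry or eviction:
package cache

import (
    "sync"
    "sync/atomic"
    "time"
)

type entry struct {
    value      []byte
    softExpiry time.Time
    refreshing atomic.Bool // ensures at most one background refresh per entry
}

type SoftCache struct {
    mu      sync.RWMutex
    entries map[string]*entry
    ttl     time.Duration
    fetch   func(key string) ([]byte, error) // origin lookup: warm tier or owner node
}

// Get serves whatever is cached, even if stale, and kicks off at most one
// background refresh once the soft TTL has passed. Callers never block on the
// origin, so an expiry wave cannot stampede it.
func (c *SoftCache) Get(key string) ([]byte, bool) {
    c.mu.RLock()
    e, ok := c.entries[key]
    c.mu.RUnlock()
    if !ok {
        return nil, false
    }
    if time.Now().After(e.softExpiry) && e.refreshing.CompareAndSwap(false, true) {
        go func() {
            defer e.refreshing.Store(false)
            if v, err := c.fetch(key); err == nil {
                c.mu.Lock()
                c.entries[key] = &entry{value: v, softExpiry: time.Now().Add(c.ttl)}
                c.mu.Unlock()
            }
        }()
    }
    return e.value, true
}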
Affinity is not just for services. Apply it to job scheduling and stream processing. If a stream key maps to a worker, keep it there. Push events through the same path to keep state warm.
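A sketch of sticky key-to-worker assignment for a stream consumer; the event shape and worker count are assumptions, and a real pipeline would add backpressure and shutdown handling:
package main

import (
    "fmt"
    "hash/fnv"
    "sync"
)

type Event struct {
    Key     string
    Payload string
}

// Each worker owns its keys' state for the life of the process, so state stays
// warm in that goroutine's memory and no locks are shared across keys.
func main() {
    const workers = 4
    chans := make([]chan Event, workers)
    var wg sync.WaitGroup
    for i := range chans {
        chans[i] = make(chan Event, 128)
        wg.Add(1)
        go func(id int, in <-chan Event) {
            defer wg.Done()
            state := map[string]int{} // per-worker state, touched by one goroutine only
            for ev := range in {
                state[ev.Key]++
                fmt.Printf("worker %d key=%s count=%d\n", id, ev.Key, state[ev.Key])
            }
        }(i, chans[i])
    }
    route := func(key string) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32() % workers)
    }
    for _, ev := range []Event{{"user:1", "a"}, {"user:2", "b"}, {"user:1", "c"}} {
        chans[route(ev.Key)] <- ev // same key always lands on the same worker
    }
    for _, ch := range chans {
        close(ch)
    }
    wg.Wait()
}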
Shared memory clusters with persistent tiers
Pure in-memory is fast, but you still need durability. Most real systems mix:
- Hot tier: Redis, in-memory maps, lock-free queues, or custom stores.
- Warm tier: NVMe append logs or LSM stores (RocksDB) on the same host.
- Cold tier: object storage for snapshots and backfills.
Write flow example:
- Apply mutation in memory (partition owner).
- Append to a local NVMe log (fsync in batches).
- Replicate to another node asynchronously.
- A background job compacts or snapshots.
Read flow stays in memory unless there’s a miss, then promotes from warm tier. The key is to avoid synchronous network I/O on the hot path. If you need strict consistency, use a lightweight quorum on commit, but keep it small and localized.
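A condensed sketch of that write path, with hypothetical names; real systems also batch fsyncs on a timer and checksum log records:
package store

import (
    "os"
    "sync"
)

// Partition owns its keys: mutations are applied in memory first, then
// appended to a local log. Replication and fsync happen off the hot path.
type Partition struct {
    mu        sync.Mutex
    data      map[string][]byte
    log       *os.File      // append-only log on local NVMe
    unsynced  int           // appends since the last fsync
    replicate chan<- []byte // async replication stream (assumed to exist)
}

const fsyncEvery = 64 // batch fsyncs; tune for your durability budget

func (p *Partition) Put(key string, value []byte) error {
    p.mu.Lock()
    defer p.mu.Unlock()

    // 1. Apply the mutation in memory: readers see it immediately.
    p.data[key] = value

    // 2. Append to the local log; fsync only every N records.
    record := append(append([]byte(key), '='), value...)
    record = append(record, '\n')
    if _, err := p.log.Write(record); err != nil {
        return err
    }
    p.unsynced++
    if p.unsynced >= fsyncEvery {
        if err := p.log.Sync(); err != nil {
            return err
        }
        p.unsynced = 0
    }

    // 3. Hand the record to a background replicator; never block the caller.
    select {
    case p.replicate <- record:
    default: // replication lag is absorbed, not paid on the hot path
    }
    return nil
}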
RDMA and kernel bypass
RDMA helps when you must go remote. It removes syscalls and copies from the hot path. You still pay for distance, but the slope is lower and tails are tighter. If RDMA is not an option, io_uring and userland networking (DPDK) reduce overhead on commodity stacks.
CXL and pooled memory
CXL makes “remote memory” feel closer, but it doesn’t remove architecture discipline. Treat pooled memory as a slower class than local DRAM. Use it for larger working sets and cold-but-not-frozen data. Keep hot keys local, and spill the rest.
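One way to sketch that tiering; the pooled tier here is just an interface, which in practice might wrap a CXL-backed allocator or a remote-memory client:
package tiered

// FarMemory stands in for a slower, larger pool (CXL-attached or remote).
type FarMemory interface {
    Get(key string) ([]byte, bool)
    Put(key string, value []byte)
}

// TieredStore keeps hot keys in local DRAM and spills everything else.
type TieredStore struct {
    hot      map[string][]byte // local DRAM: small, fast, per-socket
    far      FarMemory         // pooled memory: bigger, slower
    hotLimit int
}

// Get tries local DRAM first; on a far-tier hit it promotes the key so the
// next access is local. Eviction policy is omitted for brevity.
func (t *TieredStore) Get(key string) ([]byte, bool) {
    if v, ok := t.hot[key]; ok {
        return v, true
    }
    if v, ok := t.far.Get(key); ok {
        if len(t.hot) < t.hotLimit {
            t.hot[key] = v // promote while there is room
        }
        return v, true
    }
    return nil, false
}

func (t *TieredStore) Put(key string, value []byte) {
    if len(t.hot) < t.hotLimit {
        t.hot[key] = value
        return
    }
    t.far.Put(key, value) // spill colder writes to the pool
}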
Implementation Blueprint (Code + Diagram)
Below are focused examples. They are small on purpose. You can graft them into a larger system.
Locality-aware request routing in Go
We’ll implement:
- A consistent hash ring for node selection
- Node affinity routing with simple health and weights
- A per-node cache hint to bias towards local access
package main

import (
    "crypto/sha1"
    "encoding/binary"
    "fmt"
    "sort"
    "sync"
)

type Node struct {
    ID     string
    Weight int
    Zone   string // optional: rack/zone for topology-aware fallback
    Alive  bool
}

type ringPoint struct {
    hash uint32
    node *Node
}

type HashRing struct {
    mu     sync.RWMutex
    points []ringPoint
}

func NewHashRing(nodes []Node) *HashRing {
    hr := &HashRing{}
    hr.Rebuild(nodes)
    return hr
}

func (hr *HashRing) Rebuild(nodes []Node) {
    hr.mu.Lock()
    defer hr.mu.Unlock()
    var pts []ringPoint
    for i := range nodes {
        n := nodes[i]
        if !n.Alive || n.Weight <= 0 {
            continue
        }
        // virtual nodes for weight
        replicas := n.Weight * 100
        for r := 0; r < replicas; r++ {
            key := fmt.Sprintf("%s#%d", n.ID, r)
            h := sha1.Sum([]byte(key))
            pts = append(pts, ringPoint{hash: binary.BigEndian.Uint32(h[:4]), node: &n})
        }
    }
    sort.Slice(pts, func(i, j int) bool { return pts[i].hash < pts[j].hash })
    hr.points = pts
}

func (hr *HashRing) GetNode(key string) *Node {
    hr.mu.RLock()
    defer hr.mu.RUnlock()
    if len(hr.points) == 0 {
        return nil
    }
    h := sha1.Sum([]byte(key))
    kh := binary.BigEndian.Uint32(h[:4])
    // binary search first point >= kh
    i := sort.Search(len(hr.points), func(i int) bool { return hr.points[i].hash >= kh })
    if i == len(hr.points) {
        i = 0
    }
    return hr.points[i].node
}

// Locality-aware route: prefer owner; if caller is on owner, mark local=true
type RouteResult struct {
    OwnerID string
    Local   bool
}

func Route(hr *HashRing, key string, callerNodeID string) RouteResult {
    owner := hr.GetNode(key)
    if owner == nil {
        return RouteResult{}
    }
    return RouteResult{OwnerID: owner.ID, Local: owner.ID == callerNodeID}
}

func main() {
    nodes := []Node{
        {ID: "node-a", Weight: 1, Zone: "rack-1", Alive: true},
        {ID: "node-b", Weight: 1, Zone: "rack-1", Alive: true},
        {ID: "node-c", Weight: 1, Zone: "rack-2", Alive: true},
    }
    ring := NewHashRing(nodes)
    for _, key := range []string{"user:1", "user:2", "user:42"} {
        r := Route(ring, key, "node-b")
        fmt.Printf("key=%s owner=%s local=%v\n", key, r.OwnerID, r.Local)
    }
}
How to use it:
- Run the ring inside your gateway and in each shard. The gateway uses it to route to the owner. The shard uses it to check if a request is local and to read directly from its in-memory map without a network hop.
- For failover, rebuild the ring when a node dies and retry. Prefer rack-aware fallback before cross-zone.
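One way to sketch rack-aware fallback on top of the ring above (a hypothetical method added alongside HashRing, reusing its imports; it walks the ring clockwise and lists same-zone candidates first):
// GetNodesPreferZone walks the ring clockwise from the key's position and
// returns up to max distinct nodes, listing nodes in preferZone first so
// callers try same-rack candidates before crossing zones.
func (hr *HashRing) GetNodesPreferZone(key, preferZone string, max int) []*Node {
    hr.mu.RLock()
    defer hr.mu.RUnlock()
    if len(hr.points) == 0 || max <= 0 {
        return nil
    }
    h := sha1.Sum([]byte(key))
    kh := binary.BigEndian.Uint32(h[:4])
    start := sort.Search(len(hr.points), func(i int) bool { return hr.points[i].hash >= kh })
    seen := map[string]bool{}
    var same, other []*Node
    for i := 0; i < len(hr.points) && len(seen) < max; i++ {
        n := hr.points[(start+i)%len(hr.points)].node
        if seen[n.ID] {
            continue
        }
        seen[n.ID] = true
        if n.Zone == preferZone {
            same = append(same, n)
        } else {
            other = append(other, n)
        }
    }
    return append(same, other...)
}
The gateway can then try candidates in order, which keeps failover inside the rack whenever a same-zone replica exists.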
Node-affinity caching with a consistent hash ring in Rust
Below is a minimal Rust cache shim that biases towards local hits and falls back to an owner node. It sketches the shape, not a full client.
// Crate dependencies assumed: sha1 and dashmap.
use sha1::{Digest, Sha1};
use std::sync::{Arc, RwLock};

#[derive(Clone)]
struct Node {
    id: String,
    weight: u32,
    alive: bool,
}

#[derive(Clone)]
struct RingPoint {
    hash: u32,
    node: Node,
}

#[derive(Clone, Default)]
struct HashRing {
    points: Arc<RwLock<Vec<RingPoint>>>,
}

impl HashRing {
    fn rebuild(&self, nodes: Vec<Node>) {
        let mut pts: Vec<RingPoint> = Vec::new();
        for n in nodes.into_iter() {
            if !n.alive || n.weight == 0 { continue; }
            let replicas = n.weight * 100;
            for r in 0..replicas {
                let key = format!("{}#{}", n.id, r);
                let mut hasher = Sha1::new();
                hasher.update(key.as_bytes());
                let digest = hasher.finalize();
                let h = u32::from_be_bytes([digest[0], digest[1], digest[2], digest[3]]);
                pts.push(RingPoint { hash: h, node: n.clone() });
            }
        }
        pts.sort_by_key(|p| p.hash);
        *self.points.write().unwrap() = pts;
    }

    fn get_node(&self, key: &str) -> Option<Node> {
        let points = self.points.read().unwrap();
        if points.is_empty() { return None; }
        let mut hasher = Sha1::new();
        hasher.update(key.as_bytes());
        let digest = hasher.finalize();
        let kh = u32::from_be_bytes([digest[0], digest[1], digest[2], digest[3]]);
        // First point with hash >= kh; wrap to the start of the ring past the end.
        let i = points.partition_point(|p| p.hash < kh);
        let i = if i == points.len() { 0 } else { i };
        Some(points[i].node.clone())
    }
}
struct LocalCache {
    data: dashmap::DashMap<String, Vec<u8>>,
}

impl LocalCache {
    fn get(&self, k: &str) -> Option<Vec<u8>> { self.data.get(k).map(|v| v.to_vec()) }
    fn set(&self, k: &str, v: Vec<u8>) { self.data.insert(k.to_string(), v); }
}

struct Client {
    node_id: String,
    ring: HashRing,
    cache: Arc<LocalCache>,
}

impl Client {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        if let Some(owner) = self.ring.get_node(key) {
            if owner.id == self.node_id {
                // serve local
                return self.cache.get(key);
            } else {
                // fallback to remote owner (sketch: network call)
                // fetch_remote(owner.id, key)
                return None;
            }
        }
        None
    }
}

fn main() {
    let ring = HashRing::default();
    ring.rebuild(vec![
        Node { id: "node-a".into(), weight: 1, alive: true },
        Node { id: "node-b".into(), weight: 1, alive: true },
    ]);
    let cache = Arc::new(LocalCache { data: dashmap::DashMap::new() });
    cache.set("user:1", b"local_value".to_vec());
    let client = Client { node_id: "node-a".into(), ring: ring.clone(), cache };
    if let Some(v) = client.get("user:1") { println!("{}", String::from_utf8_lossy(&v)); }
}
You would replace the remote fetch with RDMA, gRPC, or a custom protocol. The point is the decision: try local first if you’re the owner. Don’t do an extra hop by default.
Benchmarking: local vs. remote access
You need proof. Measure on your hardware. Below is a Go microbenchmark that compares local memory access to simulated remote access (network hop). Replace the remote call with your actual path.
package bench

import (
    "net"
    "testing"
)

var store = make(map[string][]byte)

func localGet(key string) []byte {
    return store[key]
}

// Simulate remote by sending a tiny request to a loopback server.
// Each call dials a fresh connection, so connection setup is part of the measured cost.
func remoteGet(key string) []byte {
    conn, err := net.Dial("tcp", "127.0.0.1:9099")
    if err != nil {
        return nil
    }
    defer conn.Close()
    if _, err := conn.Write([]byte(key)); err != nil {
        return nil
    }
    buf := make([]byte, 16)
    n, _ := conn.Read(buf)
    return buf[:n]
}

func BenchmarkLocal(b *testing.B) {
    store["k"] = make([]byte, 64)
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        _ = localGet("k")
    }
}

func BenchmarkRemote(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        _ = remoteGet("k")
    }
}
And a tiny server to back the remote benchmark:
package main

import (
    "io"
    "log"
    "net"
)

func main() {
    ln, err := net.Listen("tcp", ":9099")
    if err != nil {
        log.Fatal(err)
    }
    for {
        c, err := ln.Accept()
        if err != nil {
            continue
        }
        go func(conn net.Conn) {
            defer conn.Close()
            io.Copy(conn, conn) // echo whatever the client sends until it closes
        }(c)
    }
}
Expect the local benchmark to be orders of magnitude faster and more predictable. On real stacks, prioritize measuring:
- p50, p95, p99 for local vs. remote
- cache hit rate and promotion time from warm tier
- cross-socket remote memory reads if NUMA is in play
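go test benchmarks report a mean, so the tail percentiles in the list above need the raw samples. A small harness sketch that could sit next to the benchmark package:
package bench

import (
    "sort"
    "time"
)

// measure runs fn n times and returns p50, p95, and p99 latencies from the
// recorded samples.
func measure(n int, fn func()) (p50, p95, p99 time.Duration) {
    samples := make([]time.Duration, 0, n)
    for i := 0; i < n; i++ {
        start := time.Now()
        fn()
        samples = append(samples, time.Since(start))
    }
    sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
    pick := func(q float64) time.Duration { return samples[int(q*float64(n-1))] }
    return pick(0.50), pick(0.95), pick(0.99)
}
Run it once against localGet and once against your remote path, then compare the three numbers rather than the averages.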
Minimal hero diagram
The diagram maps the flow: client -> gateway (hash ring) -> shard owner (hot in-memory partition) -> async durable log (NVMe) -> replica. You’ll find it at the top of this post.
Observability and Debugging Memory Locality
Locality bugs are sneaky. They hide in averages. You need the right signals.
Metrics to track:
- Cache hit rate by partition and by caller node
- Percentage of requests served locally vs. remotely
- Remote memory access latency (NUMA) and cross-socket ratios
- Queue lengths per partition and per core
- RocksDB or NVMe tail durability lag (if you have a warm tier)
- Hash ring churn rate and rebalance time
Eventing:
- Emit a structured event when a request that should be local goes remote. Include caller node, owner, ring version, and reason.
- Tag timeouts and retries with locality data. You’ll see patterns.
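A sketch of that event using the standard library’s log/slog (Go 1.21+); the field names and reason strings are assumptions:
package main

import "log/slog"

// reportRemoteFallback emits one structured event whenever a request that
// should have been served locally went remote instead.
func reportRemoteFallback(caller, owner string, ringVersion uint64, reason string) {
    slog.Warn("locality.remote_fallback",
        "caller_node", caller,
        "owner_node", owner,
        "ring_version", ringVersion,
        "reason", reason, // e.g. "owner_unhealthy", "ring_stale", "cache_miss"
    )
}

func main() {
    reportRemoteFallback("node-b", "node-a", 42, "ring_stale")
}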
Tracing with OpenTelemetry:
- Add attributes like locality.owner_id, locality.caller_id, locality.is_local, partition.id, and numa.node on spans.
- Create a view that breaks down latency by locality.is_local to visualize the gap.
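A sketch with the OpenTelemetry Go API, assuming go.opentelemetry.io/otel is already wired up elsewhere; the attribute keys mirror the list above:
package gateway

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

// handleGet tags the request span with locality attributes so latency can be
// broken down by locality.is_local in the tracing backend.
func handleGet(ctx context.Context, callerID, ownerID, partition string, numaNode int) {
    _, span := otel.Tracer("kv-gateway").Start(ctx, "kv.get")
    defer span.End()

    span.SetAttributes(
        attribute.String("locality.owner_id", ownerID),
        attribute.String("locality.caller_id", callerID),
        attribute.Bool("locality.is_local", callerID == ownerID),
        attribute.String("partition.id", partition),
        attribute.Int("numa.node", numaNode),
    )
    // ... serve the request ...
}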
Prometheus setup (sketch):
# Counters
requests_total{is_local="true"}
requests_total{is_local="false"}
# Histograms
request_latency_seconds_bucket{is_local="true", partition="p1"}
request_latency_seconds_bucket{is_local="false", partition="p1"}
# Gauges
partition_queue_depth{partition="p1"}
remote_memory_reads_total{numa="1->0"}
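Wiring those series up with the Go client could look roughly like this; it assumes github.com/prometheus/client_golang, and the bucket bounds are illustrative:
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "requests_total", Help: "Requests by locality."},
        []string{"is_local"},
    )
    requestLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "request_latency_seconds",
            Help:    "Request latency by locality and partition.",
            Buckets: prometheus.ExponentialBuckets(0.0001, 2, 14), // 100µs up to ~0.8s
        },
        []string{"is_local", "partition"},
    )
    partitionQueueDepth = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{Name: "partition_queue_depth", Help: "In-flight work per partition."},
        []string{"partition"},
    )
)

func init() {
    prometheus.MustRegister(requestsTotal, requestLatency, partitionQueueDepth)
}

// Observe records one request outcome; call it on every response path.
func Observe(isLocal bool, partition string, seconds float64) {
    local := "false"
    if isLocal {
        local = "true"
    }
    requestsTotal.WithLabelValues(local).Inc()
    requestLatency.WithLabelValues(local, partition).Observe(seconds)
}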
Alerting ideas:
- If local ratio drops below a threshold for a hot keyspace, page the on-call. Something drifted—ring rebuild, health check, or cache invalidation loop.
- If cross-socket reads jump after a deploy, roll back or rebalance partitions.
Debug checklist:
- Validate the ring version on client and server match.
- Check pinning: threads on the right cores, memory on the right NUMA node.
- Inspect recent leader changes or health flaps.
- Review snapshot/compaction schedules for interference with hot paths.
Conclusion
Future systems—especially AI inference, graph traversal, and streaming joins—lean on memory locality. Models and embeddings get larger. Graphs don’t fit in cache. The systems that win will move less and reuse more. They’ll align data, threads, and partitions so hot paths stay short.
The playbook isn’t exotic:
- Partition with NUMA in mind
- Route by affinity
- Keep hot data in memory with a durable tail
- Use RDMA or at least low-overhead networking when you must go remote
- Observe locality explicitly and budget for it
Do this, and the “mystery” latency goes away. You’ll still have incidents, but they’ll be about real bottlenecks, not hidden hops. That’s how you design systems that feel fast because they are close.