Clean Code for Distributed Systems: Make Retries, Timeouts, and Idempotency Readable
Modern codebases talk to queues, APIs, databases, and third-party services. Reliability logic is everywhere now: retries, timeouts, circuit breakers, idempotency keys. Adopting these practices is a good trend, but the code that implements them often becomes unreadable.
This article focuses on one thing: keep reliability code clean and obvious.
The problem: reliability code that hides intent
You’ve seen it. A function that calls an API. Somewhere inside, there’s retry logic. Maybe a timeout. Maybe idempotency handling. But you can’t tell from reading the function name. You can’t tell from the function signature. You have to read the implementation to know what happens when the API is slow or fails.
Retries and timeouts added “quickly”
Someone adds retry logic. They wrap the HTTP call. They add exponential backoff. It works. But now the function does two things: business logic and reliability handling. The next person adds a timeout. Now it does three things. Then someone adds idempotency. Now it does four things.
The function still has the same name. It still takes the same parameters. But its behavior changed. You can’t tell from the outside.
Now nobody knows what the real behavior is
Six months later, someone needs to change the business logic. They read the function. They see retry logic mixed with business logic. They see timeout handling mixed with error handling. They can’t tell what’s essential and what’s infrastructure.
They make a change. It breaks the retry logic. Or it breaks the timeout. Or it breaks the idempotency. They didn’t know those things existed.
Rule: reliability belongs at boundaries
Reliability is infrastructure. It’s not business logic. Keep them separate.
Wrap outbound calls, not business logic
Wrap HTTP calls. Wrap database calls. Wrap queue operations. Don’t wrap business functions. Business functions should be as close to pure as you can make them: given the same input, they make the same decisions. Handling failures and retries is a different job. That’s infrastructure.
One place to define policies
Define retry policies in one place. Define timeout policies in one place. Define idempotency policies in one place. Don’t scatter them across the codebase. When you need to change a policy, you change it in one place.
Three clean primitives
You need three things: timeouts, retries, and idempotency. Make them explicit. Make them visible.
CallWithTimeout(…)
Wrap outbound calls with timeouts. Make the timeout explicit in the function name or signature.
const result = await callWithTimeout(
  () => paymentService.charge(amount),
  { timeoutMs: 5000 }
);
You can read this and know: there’s a timeout. It’s 5 seconds. If it takes longer, it fails.
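The article uses callWithTimeout without showing its body. One possible shape is sketched below, assuming a Promise.race against a timer. The TimeoutError class and the TimeoutOptions interface are hypothetical names introduced for illustration, not something the original defines.

```typescript
// Hypothetical error type so callers can tell timeouts apart from other failures.
class TimeoutError extends Error {
  readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    super(`Operation timed out after ${timeoutMs} ms`);
    this.name = "TimeoutError";
    this.timeoutMs = timeoutMs;
  }
}

// Assumed options shape, matching the { timeoutMs } object used in the article.
interface TimeoutOptions {
  timeoutMs: number;
}

// Races the call against a timer. This stops *waiting* for the call;
// it does not cancel the underlying work.
async function callWithTimeout<T>(
  call: () => Promise<T>,
  options: TimeoutOptions
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new TimeoutError(options.timeoutMs)),
      options.timeoutMs
    );
  });
  try {
    return await Promise.race([call(), timeout]);
  } finally {
    // Always clear the timer so it cannot fire late or keep the process alive.
    clearTimeout(timer);
  }
}
```

A dedicated error type keeps the failure readable too: a caller or a retry predicate can check `error instanceof TimeoutError` instead of parsing a message.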
Retry(…) with explicit rules
Wrap calls with retry logic. Make the retry rules explicit.
const result = await retry(
  () => callWithTimeout(() => paymentService.charge(amount), { timeoutMs: 5000 }),
  {
    maxAttempts: 3,
    shouldRetry: (error) => error.statusCode === 503 || error.statusCode === 429,
    backoffMs: 1000
  }
);
You can read this and know: it makes up to 3 attempts (the first call plus up to 2 retries). It only retries on 503 or 429 errors. It waits 1 second between attempts.
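The example uses a fixed delay, while the earlier section mentions exponential backoff. If you want the delay to grow, a small pure function keeps the rule just as readable. This is a sketch only: backoffDelayMs and its parameters are hypothetical names, and the doubling-with-a-cap rule is one common choice, not the article’s prescribed policy.

```typescript
// Hypothetical helper: delay before the next attempt, doubling each time and capped.
// `attempt` is 1-based: the delay after the first failed attempt is `baseMs`.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  const exponential = baseMs * 2 ** (attempt - 1);
  return Math.min(exponential, maxMs);
}
```

With baseMs = 1000 and maxMs = 10000, the delays after attempts 1, 2, and 3 are 1000, 2000, and 4000 ms, and from attempt 5 on the delay stays at the 10000 ms cap.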
EnsureIdempotent(…) at command boundaries
Wrap commands (operations that change state) with idempotency checks. Make the idempotency key explicit.
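const order = await ensureIdempotent(
  () => createOrder(userId, items),
  {
    // The key must be stable across retries of the same request. A timestamp such as
    // Date.now() would produce a new key on every call and defeat deduplication.
    // requestId is assumed to be supplied by the caller (for example, a request header).
    idempotencyKey: `order-${userId}-${requestId}`,
    idempotencyStore: redisClient
  }
);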
You can read this and know: this operation is idempotent. It uses a key. It stores results in Redis. If you call it twice with the same key, you get the same result.
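The article does not show what ensureIdempotent does internally. A minimal sketch follows, assuming a store with get and set and results that can be serialized as JSON. The IdempotencyStore interface and the InMemoryIdempotencyStore class are hypothetical. A real Redis client would need a thin adapter to fit this interface.

```typescript
// Hypothetical store interface; a Redis-backed store would implement the same two methods.
interface IdempotencyStore {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
}

// In-memory store for tests and local runs only.
class InMemoryIdempotencyStore implements IdempotencyStore {
  private readonly entries = new Map<string, string>();

  async get(key: string): Promise<string | undefined> {
    return this.entries.get(key);
  }

  async set(key: string, value: string): Promise<void> {
    this.entries.set(key, value);
  }
}

// Assumed options shape, matching the field names used in the article.
interface IdempotencyOptions {
  idempotencyKey: string;
  idempotencyStore: IdempotencyStore;
}

// Rule on duplicates: return the stored result instead of running the command again.
// Assumes results are JSON-serializable. The get-then-set sequence is not atomic, so
// two concurrent calls with the same key can both run the command; a production
// store needs an atomic "set if absent" operation or a lock.
async function ensureIdempotent<T>(
  command: () => Promise<T>,
  options: IdempotencyOptions
): Promise<T> {
  const { idempotencyKey, idempotencyStore } = options;
  const stored = await idempotencyStore.get(idempotencyKey);
  if (stored !== undefined) {
    return JSON.parse(stored) as T;
  }
  const result = await command();
  await idempotencyStore.set(idempotencyKey, JSON.stringify(result));
  return result;
}
```

The key, the store, and the duplicate rule are all visible here: the key and store arrive as arguments, and the rule is the early return.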
Make behavior visible in names
If a function retries, say it in the name. If it has a timeout, say it in the name. If it’s idempotent, say it in the name.
chargeCustomerWithRetry is better than chargeCustomer
If chargeCustomer retries internally, you can’t tell. If chargeCustomerWithRetry retries, you know from the name.
But don’t go overboard. chargeCustomerWithRetryAndTimeoutAndIdempotency is too long. Use composition instead.
But don’t mix business and infra in one function
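Don’t create chargeCustomerWithRetryAndTimeout with the retry loop and the timer written inline. That mixes concerns. Instead, compose the wrappers so the reliability behavior is readable in the function body: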
async function chargeCustomer(amount: number, customerId: string): Promise<ChargeResult> {
  return await retry(
    () => callWithTimeout(
      () => paymentService.charge(amount, customerId),
      { timeoutMs: 5000 }
    ),
    { maxAttempts: 3, shouldRetry: isRetryableError }
  );
}
The business logic is in paymentService.charge. The reliability logic is in the wrappers. They’re separate. You can test them separately. You can change them separately.
A clean pattern for outbound HTTP
Here’s how to structure HTTP calls cleanly.
Policy definition (timeouts/retries)
Define policies in one place:
const HTTP_POLICIES = {
  payment: {
    timeoutMs: 5000,
    retry: {
      maxAttempts: 3,
      // 429 (rate limited) or any 5xx, which already includes 503.
      shouldRetry: (error: HttpError) =>
        error.statusCode === 429 ||
        (error.statusCode >= 500 && error.statusCode < 600),
      backoffMs: 1000
    }
  },
  notification: {
    timeoutMs: 2000,
    retry: {
      maxAttempts: 2,
      shouldRetry: (error: HttpError) => error.statusCode >= 500,
      backoffMs: 500
    }
  }
};
Policies are data. They’re easy to read. They’re easy to change.
Call wrapper
Create a wrapper that applies policies:
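async function callHttpWithPolicy<T>(
  // Restricting the name to known policy keys lets the compiler catch typos.
  service: keyof typeof HTTP_POLICIES,
  call: () => Promise<T>
): Promise<T> {
  const policy = HTTP_POLICIES[service];
  if (!policy) {
    // Runtime guard for callers that bypass type checking (for example, plain JavaScript).
    throw new Error(`No policy defined for service: ${service}`);
  }
  return await retry(
    () => callWithTimeout(call, { timeoutMs: policy.timeoutMs }),
    policy.retry
  );
}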
The wrapper is generic. It works for any service. It applies the right policy.
Business function stays small
The business function is just the HTTP call:
async function chargeCustomer(amount: number, customerId: string): Promise<ChargeResult> {
  return await callHttpWithPolicy('payment', () =>
    httpClient.post('/api/charge', { amount, customerId })
  );
}
You can read this and know: it calls the payment API. It uses the payment policy (5 second timeout, up to 3 attempts). The business logic is clear. The reliability logic is handled by the wrapper.
A clean pattern for message handlers
Message handlers need idempotency. Here’s how to structure them.
Validate message
First, validate the message:
function validateOrderMessage(message: OrderMessage): void {
  if (!message.orderId) {
    throw new ValidationError('orderId is required');
  }
  if (!message.userId) {
    throw new ValidationError('userId is required');
  }
  if (!message.items || message.items.length === 0) {
    throw new ValidationError('items are required');
  }
}
Validation happens first. If it fails, you don’t process the message. You don’t check idempotency. You just fail fast.
Dedup/idempotency check
Then check if you’ve already processed this message:
async function ensureNotDuplicate(
  messageId: string,
  store: IdempotencyStore
): Promise<void> {
  const existing = await store.get(messageId);
  if (existing) {
    throw new DuplicateMessageError(`Message ${messageId} already processed`);
  }
}
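If it’s a duplicate, this check throws DuplicateMessageError, so the message is not processed again. If you want to return the stored result instead of throwing, use ensureIdempotent, as the handler in the last step does.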
Apply business action
Then apply the business logic:
async function processOrder(message: OrderMessage): Promise<Order> {
  const order = await orderRepository.create({
    id: message.orderId,
    userId: message.userId,
    items: message.items,
    status: 'pending'
  });
  await paymentService.charge(order.total, message.userId);
  await inventoryService.reserveItems(message.items);
  return order;
}
The business logic is focused. It has side effects (it writes an order, charges a payment, reserves inventory), but it doesn’t know about idempotency. It doesn’t know about message queues. It just processes an order.
Store result and ack
Finally, store the result and acknowledge the message:
async function handleOrderMessage(message: OrderMessage): Promise<void> {
  validateOrderMessage(message);
  const result = await ensureIdempotent(
    () => processOrder(message),
    {
      idempotencyKey: message.messageId,
      idempotencyStore: redisClient
    }
  );
  await messageQueue.ack(message.messageId);
}
The handler orchestrates. It validates. It ensures idempotency. It processes. It acks. Each step is clear. Each step is testable.
What to log (and what not to)
Logging is part of reliability. But log the right things.
Log decisions: “retrying because 503”
Log when you retry. Log why you retry. Log when you time out. Log when you skip a duplicate.
async function retry<T>(
  fn: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      // Check retryability first, so a non-retryable error on the last attempt
      // is not reported as exhausted retries.
      if (!options.shouldRetry(error)) {
        logger.debug('Not retrying - error not retryable', { error, attempt });
        throw error;
      }
      if (attempt === options.maxAttempts) {
        logger.error('Max attempts reached', { error, attempts: attempt });
        throw error;
      }
      logger.info('Retrying after error', {
        error: error.message,
        statusCode: error.statusCode,
        attempt,
        nextAttemptIn: options.backoffMs
      });
      await sleep(options.backoffMs);
    }
  }
  // Only reachable when maxAttempts < 1; also gives the function a definite exit.
  throw new Error('retry: maxAttempts must be at least 1');
}
You can read the logs and know: it retried. It retried because of a 503. It waited 1 second before the next attempt. If no further error is logged, that attempt succeeded.
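The retry function above relies on three names the article never defines: RetryOptions, sleep, and logger. The definitions below are assumptions that fit how those names are used. They are a sketch, not the author’s code.

```typescript
// Assumed shape of the options passed to retry(); field names follow the article's usage.
// `error` is typed loosely so predicates written against a specific error type
// (such as the HttpError used in the policies) remain assignable.
interface RetryOptions {
  maxAttempts: number;
  shouldRetry: (error: any) => boolean;
  backoffMs: number;
}

// Resolves after `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Minimal console-backed logger with the three levels used by retry().
// A real service would plug in its structured logger here.
const logger = {
  debug: (message: string, context?: object): void => console.debug(message, context),
  info: (message: string, context?: object): void => console.info(message, context),
  error: (message: string, context?: object): void => console.error(message, context)
};
```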
Avoid logging secrets and full payloads
Don’t log API keys. Don’t log passwords. Don’t log full request/response payloads. Log enough to debug, but not so much that you leak secrets.
logger.info('HTTP call', {
  method: 'POST',
  url: '/api/charge',
  statusCode: 200,
  durationMs: 150,
  // Don't log: request body, response body, headers with secrets
});
Log the shape. Log the timing. Log the status. Don’t log the data.
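If some structured context does have to be logged, a redaction step can act as a safety net. The sketch below is a hypothetical helper. The list of sensitive field names is an assumption and will not match every payload, so treat it as a complement to the rule above, not a replacement.

```typescript
// Hypothetical list of field names treated as sensitive; adjust to your own payloads.
const SENSITIVE_KEYS = new Set([
  "authorization",
  "password",
  "apikey",
  "api_key",
  "token",
  "secret"
]);

type LogValue =
  | string
  | number
  | boolean
  | null
  | undefined
  | LogValue[]
  | { [key: string]: LogValue };

// Returns a copy of `value` with sensitive fields replaced, leaving the input untouched.
function redact(value: LogValue): LogValue {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === "object") {
    const copy: { [key: string]: LogValue } = {};
    for (const [key, inner] of Object.entries(value)) {
      // Field names are compared case-insensitively.
      copy[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner);
    }
    return copy;
  }
  return value;
}
```

Matching on field names only catches secrets stored under known keys. A token embedded in a URL or a free-text field still gets through, which is why the default should stay: don’t log the data.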
Close
Reliability is part of correctness. If your code doesn’t handle failures, it’s not correct. If your code doesn’t handle timeouts, it’s not correct. If your code doesn’t handle duplicates, it’s not correct.
Reliability is part of correctness
Reliability isn’t optional. It’s not “nice to have.” It’s required. Your code will fail. Networks will be slow. Services will be down. Messages will be duplicated. Handle it. Make it obvious.
Clean code makes reliability reviewable
If reliability code is mixed with business logic, you can’t review it. You can’t test it. You can’t change it. Keep it separate. Keep it visible. Keep it simple.
When someone reads your code, they should know: this retries. This has a timeout. This is idempotent. They shouldn’t have to read the implementation to find out.
Make reliability code as readable as business code. Make it as testable as business code. Make it as changeable as business code.
Points to discuss
Retries without timeouts can make outages worse
If you retry without a timeout, a slow service can tie up your threads. You retry. It’s still slow. You retry again. It’s still slow. Now you have 10 threads waiting on a slow service. Your service becomes slow too.
Always use timeouts with retries. A timeout turns a slow call into a fast, explicit failure. Then the retry policy decides what happens next: try again, or give up and move on.
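Timeouts also make the worst case calculable. The helper below is a hypothetical sketch that assumes every attempt times out, timeouts are retried, and a fixed backoff separates attempts.

```typescript
// Hypothetical helper: upper bound on how long a caller can wait when every attempt
// times out and a fixed backoff separates attempts.
function worstCaseDurationMs(
  maxAttempts: number,
  timeoutMs: number,
  backoffMs: number
): number {
  const waits = Math.max(maxAttempts - 1, 0); // no backoff after the final attempt
  return maxAttempts * timeoutMs + waits * backoffMs;
}

// Payment policy from earlier: 3 attempts, 5000 ms timeout, 1000 ms backoff.
// 3 * 5000 + 2 * 1000 = 17000
const paymentWorstCaseMs = worstCaseDurationMs(3, 5000, 1000);
```

Without a timeout there is no such bound: a single attempt can hang for as long as the remote service does.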
Retries must be selective (don’t retry validation errors)
Don’t retry 400 errors. Don’t retry 401 errors. Don’t retry 403 errors. These are client errors. Retrying won’t help. You’ll just waste time and resources.
Only retry server errors (5xx) and rate limit errors (429). These might be transient. These might succeed on retry.
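The chargeCustomer example earlier passes isRetryableError as its shouldRetry predicate without defining it. One way to write that predicate from the rule above is sketched here. The HttpLikeError shape is an assumption, and the choice not to retry errors that carry no status code is a policy decision made for this sketch.

```typescript
// Hypothetical error shape: HTTP failures carry a numeric statusCode; other errors do not.
interface HttpLikeError {
  statusCode?: number;
}

function isRetryableStatus(statusCode: number): boolean {
  // 429 (rate limited) and 5xx (server errors) may be transient.
  // Other 4xx codes such as 400, 401, and 403 will fail the same way again.
  return statusCode === 429 || (statusCode >= 500 && statusCode < 600);
}

// Predicate in the shape expected by `shouldRetry`. Errors without a status code
// (for example network failures or timeouts) are NOT retried here. If your policy
// retries timeouts, extend this check to recognize your timeout error type.
function isRetryableError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) {
    return false;
  }
  const { statusCode } = error as HttpLikeError;
  return typeof statusCode === "number" && isRetryableStatus(statusCode);
}
```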
Idempotency must be explicit (a key, a store, and a rule)
Idempotency needs three things:
- A key (how you identify duplicate requests)
- A store (where you store results)
- A rule (what to do when you see a duplicate)
Make all three explicit. Don’t hide them in the implementation. Don’t make them magic. Make them visible.
The clean code goal: “I can read the method and know the runtime behavior”
When you read a method, you should know:
- Does it retry? How many times?
- Does it timeout? How long?
- Is it idempotent? What’s the key?
You shouldn’t have to read the implementation. You shouldn’t have to trace through helper functions. You should know from the method itself.
That’s the goal. That’s what clean reliability code looks like.