Stop Breaking Your APIs - How to Implement Proper Retry and Exponential Backoff in NestJS
Introduction
In modern web applications, we often rely on external data sources, most commonly APIs. Sometimes these APIs are called synchronously while a user is waiting for a response. In other cases, we consume them asynchronously through scheduled jobs, for example, every Monday at 03:00 a.m.
In a perfect world, we would send a request and always receive a valid response on time. That assumption might work for tutorials or hypothetical examples, but it rarely holds in real-world systems. Networks are unreliable, external services get overloaded, deployments happen, and rate limits are enforced. Failures are not the exception; they are part of normal operation.
The tricky part is that not all failures are equal. Some problems are permanent, and retrying will never help. Others are transient: a short network glitch, a temporarily unavailable service, or a brief overload. In those cases, retrying in the right way can make the difference between a fragile system and a resilient one.
This is where retry mechanisms come into play.
In this post, we’ll look at the most common retry strategies, explain when and why they are useful, and implement them step by step in a NestJS application. You’ll learn how different retry approaches behave, why blindly retrying can be dangerous, and how a well-designed retry mechanism can significantly improve the stability of your system with surprisingly little effort.
That was exactly my own experience: once I understood these patterns better, I applied them to an existing production system at work, and the impact was immediate. Fewer errors, more stable jobs, and much calmer nights.
Setup
External API
First of all, we need an external API that we can connect to our Nest application. I've prepared a simple Hono.js API that randomly sends back different retryable HTTP status codes.
Just go to https://github.com/Jean-Marc-Dev-Blog/retry-external-api and clone it. You'll find the instructions for starting it locally in the README.
Nest.js API
Since this tutorial targets Nest.js, we also need to create a new Nest.js application. Let's do it:
nest new retry-internal-api
With that in place, we're ready to start exploring retry mechanisms!
Retry Mechanisms
We’ve already talked about retry mechanisms in general in the introduction of this post. Now let’s dive a bit deeper and see what they actually have to offer.
Imagine we have a service that connects to an external service via a REST API. That API might be maintained by a third party or by us. The important part is that communication happens over the network, and once a network is involved, a whole range of potential failures can occur.
For example, the external service might be temporarily unavailable and not respond at all. It could be overloaded, or there might be networking issues that interfere with our communication.
We can’t reliably predict these problems, but we can design our service so that it doesn’t crash and remains healthy. This is where retry mechanisms come into play, but only when a failure is actually worth retrying.
Retryable HTTP Codes
- 408 Request Timeout: https://developer.mozilla.org/de/docs/Web/HTTP/Reference/Status/408
- 429 Too Many Requests: https://developer.mozilla.org/de/docs/Web/HTTP/Reference/Status/429
- 502 Bad Gateway: https://developer.mozilla.org/de/docs/Web/HTTP/Reference/Status/502
- 503 Service Unavailable: https://developer.mozilla.org/de/docs/Web/HTTP/Reference/Status/503
- 504 Gateway Timeout: https://developer.mozilla.org/de/docs/Web/HTTP/Reference/Status/504
These HTTP status codes usually signal temporary problems, where retrying the request can actually be a valid solution. The request itself was not wrong, unlike cases such as 401 or 403. The request simply arrived at the wrong time.
That’s why you shouldn’t blindly retry every HTTP error. A 401 won’t suddenly be fixed by retrying the same request again.
Exponential Backoff
One of the most common retry mechanisms is exponential backoff. When a retryable HTTP error code is received, the request is retried a limited number of times until a defined threshold is reached.
With exponential backoff, the delay between retries grows exponentially with each failed attempt. For example, every time a request fails again, the delay before the next attempt doubles.
The reason for this exponential growth is to give the external service enough time to recover. This is especially important when the service responds with 429 – Too Many Requests, where firing another request immediately would only make the situation worse.
Here is a simple example to make this clearer:
Request 1 --> 503
... wait 2s
Request 2 --> 503
... wait 4s
Request 3 --> 503
... wait 8s
Request 4 --> 200
This approach works fine as long as only a single service is calling the external API. Things look different when hundreds or even thousands of services are doing the same thing.
Imagine the following scenario:
Service A, Service B, Service C, ... --> External Service --> 429 (Too Many Requests)
... wait 2s
Service A, Service B, Service C, ... --> External Service --> 429 (Too Many Requests)
... wait 4s
Service A, Service B, Service C, ... --> External Service --> 429 (Too Many Requests)
... wait 8s
When the external service is already overwhelmed, it doesn’t really matter how long all clients wait. If they all send their requests again at the same time, the service will simply be overwhelmed again. In that case, waiting alone just postpones the problem.
This is where jitter comes into play. Jitter introduces randomness into the delay, which helps distribute retry attempts over time instead of synchronizing them.
Service A: ...wait 2.3s
Service B: ...wait 3.5s
Service C: ...wait 8.7s
...
Now the requests no longer arrive all at once, but are spread across time, which significantly reduces pressure on the external service.
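In code, the difference between plain exponential backoff and backoff with full jitter is essentially one call to Math.random(). Here is a minimal sketch (the base and maximum delays are illustrative values; we’ll reuse the same idea later in our service):

```typescript
const baseDelayMs = 1_000;
const maxDelayMs = 30_000;

// Plain exponential backoff: every client waits exactly the same amount of time.
function exponentialDelay(attempt: number): number {
  return Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
}

// Full jitter: pick a random delay between 0 and the exponential cap, so that
// retries from many clients spread out instead of arriving at the same moment.
function jitteredDelay(attempt: number): number {
  return Math.floor(Math.random() * exponentialDelay(attempt));
}
```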
With that in mind, an exponential backoff retry mechanism typically consists of the following building blocks (sketched as a small options type right after this list):
- a maximum delay (e.g., 30s)
- a maximum number of retry attempts (e.g., 5)
- clear rules defining when a retry should happen (e.g., 502) and when it should not (e.g., 401)
- when retrying 429 and 503 errors, take the `Retry-After` header into account
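These building blocks map naturally onto a small configuration shape. Here is a minimal sketch (the names are illustrative; in the implementation below we simply hard-code them as constants inside the service):

```typescript
// Illustrative shape of an exponential backoff retry policy.
interface RetryOptions {
  baseDelayMs: number;                // starting delay, e.g. 1_000
  maxDelayMs: number;                 // never wait longer than this, e.g. 30_000
  maxRetries: number;                 // give up after this many attempts, e.g. 5
  retryStatusCodes: Set<number>;      // which responses are worth retrying
  retryAfterStatusCodes: Set<number>; // responses whose Retry-After header we honor
}

const defaultRetryOptions: RetryOptions = {
  baseDelayMs: 1_000,
  maxDelayMs: 30_000,
  maxRetries: 5,
  retryStatusCodes: new Set([408, 429, 502, 503, 504]),
  retryAfterStatusCodes: new Set([429, 503]),
};
```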
Nest.js implementation
For the sake of this tutorial, we’ll keep our NestJS API as simple as possible and focus entirely on the exponential backoff logic. We won’t rely on any external Node.js libraries and will implement everything ourselves. This is the best way to really understand what’s going on under the hood.
Before we start, let’s outline roughly how our retry mechanism will work:
- We send a request to our `retry-external-api` (the Hono.js API you cloned earlier)
- That API randomly returns different HTTP responses
- When we receive a non-retryable response, such as `200` or `401`, we return the response as-is
- If we receive a retryable response, we retry the request using an exponential backoff strategy
  - we define a maximum delay, a maximum number of retry attempts, and add jitter to introduce more randomness into our delays
  - when we receive a `429` or `503` response, we also take the `Retry-After` header into account
- The mechanism keeps running until one of two stopping conditions is met
  - we receive a non-retryable response
  - we reach our maximum number of retry attempts
Let’s start scaffolding our ExponentialBackoffService:
import { HttpService } from '@nestjs/axios';
import { HttpStatus, Injectable, Logger } from '@nestjs/common';
import axios, { AxiosError } from 'axios';
export interface MainResponse {
status: HttpStatus;
data?: string;
}
export interface RetryResponse {
response: MainResponse;
retryable: boolean;
retryAfterMs?: number;
}
@Injectable()
export class ExponentialBackoffService {
private readonly MAX_DELAY_MS = 30_000;
private readonly MAX_RETRIES = 5;
private readonly BASE_DELAY_MS = 1_000;
private readonly RETRY_STATUS_CODES = new Set<number>([
HttpStatus.REQUEST_TIMEOUT, // 408
HttpStatus.TOO_MANY_REQUESTS, // 429
HttpStatus.BAD_GATEWAY, // 502
HttpStatus.SERVICE_UNAVAILABLE, // 503
HttpStatus.GATEWAY_TIMEOUT, // 504
]);
private readonly RETRY_AFTER_STATUS_CODES = new Set<number>([
HttpStatus.TOO_MANY_REQUESTS, // 429
HttpStatus.SERVICE_UNAVAILABLE, // 503
]);
private readonly logger = new Logger(ExponentialBackoffService.name);
constructor(private readonly httpService: HttpService) {}
async execute(path: string): Promise<MainResponse> {
    // TODO: orchestrate the retries here (implemented step by step below)
    throw new Error('Not implemented yet');
  }
}
Here’s the basic outline of our service. We start by defining a set of private properties such as MAX_DELAY_MS, MAX_RETRIES, and BASE_DELAY_MS. These values define the core thresholds of our retry mechanism. For convenience, we also introduce the RETRY_STATUS_CODES and RETRY_AFTER_STATUS_CODES sets. They allow us to quickly check whether a specific HTTP status code is retryable and make it easy to adjust or extend the list of supported codes later on.
The private logger will be useful once we start executing requests. It allows us to expose meaningful information about retries and failures, which makes debugging and observing the behavior of the service much easier.
We also inject Nest’s HttpService as a dependency. This gives us access to Axios and allows us to perform HTTP requests in a consistent and testable way.
The core logic of the exponential backoff mechanism is orchestrated inside the execute method. This is the only public method of the service and therefore its external API. Any class that needs retry behavior with exponential backoff simply calls this method with the appropriate parameters.
Finally, we introduce a private method called request, which is responsible for performing the actual HTTP request against the external API:
private async request(path: string): Promise<RetryResponse> {
try {
const result = await this.httpService.axiosRef.get(path);
const status = result.status;
const data = result.data as string;
const response: MainResponse = { status, data };
return { response, retryable: false };
} catch (e) {
if (axios.isAxiosError(e)) {
// TODO: Handle Axios Errors (the actual retry will be made here)
}
this.logger.error(e);
// TODO: Handle Default Errors
}
}
This method simply calls our external API using the axiosRef property from the injected HttpService. We do this to avoid dealing with observables in this example. If you prefer working with observables, you can of course do that as well. At the moment, we always perform a GET request by using Axios’ get method. To make the implementation more generic, a nice follow-up exercise would be to support other HTTP methods too.
If the request succeeds, for example, when the response status is in the 2xx range, we simply return the status together with the corresponding data. By setting retryable: false in the returned object, we can later detect this case in our retry logic and exit early. This tells our service that a non-retryable response was received. That’s all for the happy path.
Things get more interesting once we land in the catch block. Here, we first handle all AxiosError instances, which should cover the vast majority of failures when dealing with HTTP requests. For the sake of robustness, we also handle any other unexpected errors that might occur, ensuring that no error slips through unhandled.
To bring some structure into our error-handling logic, let’s add the following three private methods to our class:
private isRetryableStatus(status: number): boolean {
return this.RETRY_STATUS_CODES.has(status);
}
private handleAxiosError(e: AxiosError): RetryResponse {
const status = e.response?.status || HttpStatus.INTERNAL_SERVER_ERROR;
const data = (e.response?.data as string) || 'No Data';
// TODO: Add evaluation if there was a network error
const retryable = this.isRetryableStatus(status);
const response: MainResponse = { status, data };
// TODO: Add evaluation of the retry-after header
return { response, retryable };
}
private handleDefaultError(): RetryResponse {
const response: MainResponse = {
status: HttpStatus.INTERNAL_SERVER_ERROR,
data: 'No data',
};
return { response, retryable: false };
}
Pretty straightforward, right? First, we check whether the AxiosError is retryable at all by calling isRetryableStatus. Inside that method, we simply look up the received HTTP status code in our RETRY_STATUS_CODES set. The result of this evaluation is then included in the returned object. As before, this information will later control the behavior of our execute method and determine whether a retry should happen or not.
The handleDefaultError method is intentionally very simple. In this case, we just return a 500 response when something went wrong outside of Axios. Strictly speaking, this step is optional, because NestJS will eventually catch unhandled errors and convert them into a 500 response anyway. However, handling it explicitly here gives us more control and makes the behavior easier to reason about. Personally, I prefer being explicit in error handling instead of relying too much on the black magic that frameworks abstract away.
With these two methods in place, we can now call them from within the request method:
private async request(path: string): Promise<RetryResponse> {
try {
const result = await this.httpService.axiosRef.get(path);
const status = result.status;
const data = result.data as string;
const response: MainResponse = { status, data };
return { response, retryable: false };
} catch (e) {
if (axios.isAxiosError(e)) {
// 👇 Add this method
return this.handleAxiosError(e);
}
this.logger.error(e);
// 👇 Add this method
return this.handleDefaultError();
}
}
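Remember the follow-up exercise about supporting other HTTP methods? Now that the error handling is wired up, a minimal sketch of a more generic request method could look like this (the extra parameters are illustrative, you would extend the axios import to also pull in the `Method` type, and we’ll stick to the GET-only version for the rest of the tutorial):

```typescript
// Sketch: the same request helper, generalized to arbitrary HTTP methods.
// `Method` is the HTTP method union type exported by axios.
private async request(path: string, method: Method = 'GET', body?: unknown): Promise<RetryResponse> {
  try {
    const result = await this.httpService.axiosRef.request({ url: path, method, data: body });
    const response: MainResponse = { status: result.status, data: result.data as string };
    return { response, retryable: false };
  } catch (e) {
    // error handling stays exactly the same as in the GET-only version
    if (axios.isAxiosError(e)) {
      return this.handleAxiosError(e);
    }
    this.logger.error(e);
    return this.handleDefaultError();
  }
}
```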
What we’ve left out so far is the detection of proper network errors and the evaluation of the Retry-After header, if it is present. To cover these cases, we need to introduce two additional private methods in our class:
private isNetworkError(e: AxiosError): boolean {
const code = e.code;
return (
code === 'ECONNABORTED' ||
code === 'ECONNRESET' ||
code === 'EAI_AGAIN' ||
code === 'ENOTFOUND' ||
code === 'ETIMEDOUT' ||
code === 'ENETUNREACH' ||
code === 'EHOSTUNREACH'
);
}
private parseRetryAfter(e: AxiosError): number | undefined {
const status = e.response?.status || 0;
if (!this.RETRY_AFTER_STATUS_CODES.has(status)) {
return;
}
const headers = e.response?.headers || {};
const retryAfterHeader = headers['retry-after'];
// Headers in Axios can sometimes be parsed as an Array
const retryAfterValue = Array.isArray(retryAfterHeader) ? retryAfterHeader[0] : retryAfterHeader;
if (retryAfterValue == null) {
return;
}
// When it's a number of seconds (checked first: Date.parse can interpret a
// bare number like "120" as a year, which would turn seconds into a bogus date)
const seconds = Number(retryAfterValue);
if (Number.isFinite(seconds) && seconds >= 0) {
  const milliseconds = seconds * 1000;
  this.logger.warn(`Received "Retry-After" header (seconds). status=${status}. delay=${milliseconds}ms`);
  return milliseconds;
}
// When it's an HTTP-date
if (typeof retryAfterValue === 'string') {
  const dateMs = Date.parse(retryAfterValue);
  if (!Number.isNaN(dateMs)) {
    const delayMs = Math.max(0, dateMs - Date.now());
    this.logger.warn(`Received "Retry-After" header (date). status=${status}. delay=${delayMs}ms`);
    return delayMs;
  }
}
return;
}
Axios conveniently exposes additional information on each AxiosError, including a specific error code. We can take advantage of that to gain more granular control when detecting network-related failures inside the isNetworkError method.
The parseRetryAfter method requires a bit more logic. First, we need to determine whether the received response status can even include a Retry-After header. We do this by checking the status code against our RETRY_AFTER_STATUS_CODES set defined earlier. According to the RFC 9110 specification, the Retry-After header can either contain an HTTP-date or a number of seconds, so we need to handle both formats. We check the numeric form first, because JavaScript's lenient Date.parse may interpret a bare number such as "120" as a year rather than a delay in seconds.
Our retry-external-api will only return numeric values (seconds) for 429 and 503 responses. However, it’s important to also support the HTTP-date format, as another external API we might integrate with in the future could rely on it.
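To make the two formats tangible, here are the conversions in isolation (the header values are made-up examples):

```typescript
// Retry-After as a number of seconds, e.g. "Retry-After: 120"
const fromSeconds = Number('120') * 1000; // 120_000 ms

// Retry-After as an HTTP-date, e.g. "Retry-After: Wed, 21 Oct 2026 07:28:00 GMT"
const targetMs = Date.parse('Wed, 21 Oct 2026 07:28:00 GMT');
const fromDate = Math.max(0, targetMs - Date.now()); // remaining milliseconds until that date
```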
With that in place, we can now wire these two new methods into our handleAxiosError method:
private handleAxiosError(e: AxiosError): RetryResponse {
const status = e.response?.status || HttpStatus.INTERNAL_SERVER_ERROR;
const data = (e.response?.data as string) || 'No Data';
// 👇 Add "isNetworkError" method
const retryable = this.isRetryableStatus(status) || this.isNetworkError(e);
const response: MainResponse = { status, data };
// 👇 Add "parseRetryAfter" method
const retryAfterMs = this.parseRetryAfter(e);
// 👇 Add "retryAfterMs" to response
return { response, retryable, retryAfterMs };
}
Wow! That was fun so far. We now have the full error-handling logic in place, which allows our mechanism to decide when a request should be retried and when it should not. This forms the critical first half of the retry mechanism.
The second half is responsible for orchestrating the actual retries and ensuring that we always stay within the boundaries of our defined thresholds.
Let’s bring the execute method to life next, because this is where the entire retry orchestration comes together:
async execute(path: string): Promise<MainResponse> {
let result: RetryResponse | null = null;
for (let attempt = 1; attempt <= this.MAX_RETRIES; attempt++) {
result = await this.request(path);
const status = result.response.status;
if (!result.retryable) {
this.logger.warn(`Not retryable. status=${status} attempt=${attempt}/${this.MAX_RETRIES}`);
return result.response;
}
if (attempt === this.MAX_RETRIES) {
this.logger.warn(`Retry attempts exceeded. status=${status} attempt=${attempt}/${this.MAX_RETRIES}`);
return result.response;
}
const retryAfterMs = result.retryAfterMs;
const delay = retryAfterMs ?? this.computeDelay(attempt);
if (delay > this.MAX_DELAY_MS) {
this.logger.warn(`Retry delay exceeded maximum delay. delay=${delay}ms max=${this.MAX_DELAY_MS}ms`);
return result.response;
}
this.logger.warn(`Retryable failure. status=${status} attempt=${attempt}/${this.MAX_RETRIES} waiting=${delay}ms`);
await this.sleep(delay);
  }
  return (
    result?.response ?? {
      status: HttpStatus.INTERNAL_SERVER_ERROR,
      data: 'No data',
    }
  );
}
private computeDelay(attempt: number): number {
// base * 2^attempt
const exponent = this.BASE_DELAY_MS * Math.pow(2, attempt);
// ensure that MAX_DELAY_MS is not exceeded
const capped = Math.min(exponent, this.MAX_DELAY_MS);
// add jitter to produce more randomness (full jitter)
const jittered = Math.floor(Math.random() * capped);
return Math.max(50, jittered);
}
private sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
Now you can see how the puzzle comes together step by step. We use a simple loop to make sure we don’t exceed our defined retry threshold, MAX_RETRIES. Inside that loop, we perform the actual HTTP request by calling this.request.
There are three situations in which we exit the loop. The first one is straightforward: if we receive a non-retryable response (for example, a 200), we return it immediately. The second case is when the current iteration is already the last allowed attempt and we still receive a retryable response. In that situation, we also exit because we’ve exhausted all the retry attempts we were willing to make.
If neither of these cases applies, we know that another retry should happen. To schedule it, we first check whether a Retry-After header was provided. If not, we compute the delay using our exponential backoff strategy with jitter applied. This is where the third exit comes in: if the calculated delay, whether it comes from Retry-After or from our backoff logic, exceeds our MAX_DELAY_MS threshold, we give up and return the last response instead of waiting that long. Otherwise, we wait for the calculated delay using the sleep method and start the next iteration of the loop, triggering the same request again. The small fallback return after the loop exists purely to satisfy TypeScript; in practice, one of the exits above always triggers first.
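To give you a feel for the runtime behavior, here is an illustrative (shortened) log sequence from a run where the external API failed twice before responding with a 200. The exact delays will differ on every run because of the jitter, and the 429 delay comes from the Retry-After header:
Retryable failure. status=502 attempt=1/5 waiting=1432ms
Received "Retry-After" header (seconds). status=429. delay=3000ms
Retryable failure. status=429 attempt=2/5 waiting=3000ms
Not retryable. status=200 attempt=3/5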
That’s it for the ExponentialBackoffService for now! At this point, it should look something like this:
import { HttpService } from '@nestjs/axios';
import { HttpStatus, Injectable, Logger } from '@nestjs/common';
import axios, { AxiosError } from 'axios';
export interface MainResponse {
status: HttpStatus;
data?: string;
}
export interface RetryResponse {
response: MainResponse;
retryable: boolean;
retryAfterMs?: number;
}
@Injectable()
export class ExponentialBackoffService {
private readonly MAX_DELAY_MS = 30_000;
private readonly MAX_RETRIES = 5;
private readonly BASE_DELAY_MS = 1_000;
private readonly RETRY_STATUS_CODES = new Set<number>([
HttpStatus.REQUEST_TIMEOUT, // 408
HttpStatus.TOO_MANY_REQUESTS, // 429
HttpStatus.BAD_GATEWAY, // 502
HttpStatus.SERVICE_UNAVAILABLE, // 503
HttpStatus.GATEWAY_TIMEOUT, // 504
]);
private readonly RETRY_AFTER_STATUS_CODES = new Set<number>([
HttpStatus.TOO_MANY_REQUESTS, // 429
HttpStatus.SERVICE_UNAVAILABLE, // 503
]);
private readonly logger = new Logger(ExponentialBackoffService.name);
constructor(private readonly httpService: HttpService) {}
async execute(path: string): Promise<MainResponse> {
let result: RetryResponse | null = null;
for (let attempt = 1; attempt <= this.MAX_RETRIES; attempt++) {
result = await this.request(path);
const status = result.response.status;
if (!result.retryable) {
this.logger.warn(`Not retryable. status=${status} attempt=${attempt}/${this.MAX_RETRIES}`);
return result.response;
}
if (attempt === this.MAX_RETRIES) {
this.logger.warn(`Retry attempts exceeded. status=${status} attempt=${attempt}/${this.MAX_RETRIES}`);
return result.response;
}
const retryAfterMs = result.retryAfterMs;
const delay = retryAfterMs ?? this.computeDelay(attempt);
if (delay > this.MAX_DELAY_MS) {
this.logger.warn(`Retry delay exceeded maximum delay. delay=${delay}ms max=${this.MAX_DELAY_MS}ms`);
return result.response;
}
this.logger.warn(`Retryable failure. status=${status} attempt=${attempt}/${this.MAX_RETRIES} waiting=${delay}ms`);
await this.sleep(delay);
}
return (
result?.response ?? {
status: HttpStatus.INTERNAL_SERVER_ERROR,
data: 'No data',
}
);
}
private async request(path: string): Promise<RetryResponse> {
try {
const result = await this.httpService.axiosRef.get(path);
const status = result.status;
const data = result.data as string;
const response: MainResponse = { status, data };
return { response, retryable: false };
} catch (e) {
if (axios.isAxiosError(e)) {
return this.handleAxiosError(e);
}
this.logger.error(e);
return this.handleDefaultError();
}
}
private computeDelay(attempt: number): number {
// base * 2^attempt
const exponent = this.BASE_DELAY_MS * Math.pow(2, attempt);
// ensure that MAX_DELAY_MS is not exceeded
const capped = Math.min(exponent, this.MAX_DELAY_MS);
// add jitter to produce more randomness (full jitter)
const jittered = Math.floor(Math.random() * capped);
return Math.max(50, jittered);
}
private sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
private isRetryableStatus(status: number): boolean {
return this.RETRY_STATUS_CODES.has(status);
}
private isNetworkError(e: AxiosError): boolean {
const code = e.code;
return (
code === 'ECONNABORTED' ||
code === 'ECONNRESET' ||
code === 'EAI_AGAIN' ||
code === 'ENOTFOUND' ||
code === 'ETIMEDOUT' ||
code === 'ENETUNREACH' ||
code === 'EHOSTUNREACH'
);
}
// See: https://httpwg.org/specs/rfc9110.html#field.retry-after
// See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Retry-After
private parseRetryAfter(e: AxiosError): number | undefined {
const status = e.response?.status || 0;
if (!this.RETRY_AFTER_STATUS_CODES.has(status)) {
return;
}
const headers = e.response?.headers || {};
const retryAfterHeader = headers['retry-after'];
// Headers in Axios can sometimes be parsed as an Array
const retryAfterValue = Array.isArray(retryAfterHeader) ? retryAfterHeader[0] : retryAfterHeader;
if (retryAfterValue == null) {
return;
}
// When it's a number of seconds (checked first: Date.parse can interpret a
// bare number like "120" as a year, which would turn seconds into a bogus date)
const seconds = Number(retryAfterValue);
if (Number.isFinite(seconds) && seconds >= 0) {
  const milliseconds = seconds * 1000;
  this.logger.warn(`Received "Retry-After" header (seconds). status=${status}. delay=${milliseconds}ms`);
  return milliseconds;
}
// When it's an HTTP-date
if (typeof retryAfterValue === 'string') {
  const dateMs = Date.parse(retryAfterValue);
  if (!Number.isNaN(dateMs)) {
    const delayMs = Math.max(0, dateMs - Date.now());
    this.logger.warn(`Received "Retry-After" header (date). status=${status}. delay=${delayMs}ms`);
    return delayMs;
  }
}
return;
}
private handleAxiosError(e: AxiosError): RetryResponse {
const status = e.response?.status || HttpStatus.INTERNAL_SERVER_ERROR;
const data = (e.response?.data as string) || 'No Data';
const retryable = this.isRetryableStatus(status) || this.isNetworkError(e);
const response: MainResponse = { status, data };
const retryAfterMs = this.parseRetryAfter(e);
return { response, retryable, retryAfterMs };
}
private handleDefaultError(): RetryResponse {
const response: MainResponse = {
status: HttpStatus.INTERNAL_SERVER_ERROR,
data: 'No data',
};
return { response, retryable: false };
}
}
Now, let's use it inside our Nest.js application. We just need a simple controller and a simple service for that:
// app.controller.ts
import { Controller, Get } from '@nestjs/common';
import { AppService } from './app.service';
import { MainResponse } from './exponential-backoff.service';
@Controller()
export class AppController {
constructor(private readonly appService: AppService) {}
@Get()
getResponse(): Promise<MainResponse> {
return this.appService.getResponse();
}
}
// app.service.ts
import { Injectable } from '@nestjs/common';
import { ExponentialBackoffService, MainResponse } from './exponential-backoff.service';
@Injectable()
export class AppService {
constructor(private readonly backoffService: ExponentialBackoffService) {}
async getResponse(): Promise<MainResponse> {
return this.backoffService.execute('/');
}
}
To make dependency injection work, we also need to add the ExponentialBackoffService to our AppModule. That's also where we hook up the HttpModule:
// app.module.ts
import { Module } from '@nestjs/common';
import { AppController } from './app.controller';
import { AppService } from './app.service';
import { HttpModule } from '@nestjs/axios';
import { ExponentialBackoffService } from './exponential-backoff.service';
@Module({
imports: [
HttpModule.register({
baseURL: 'http://localhost:3000',
}),
],
controllers: [AppController],
providers: [AppService, ExponentialBackoffService],
})
export class AppModule {}
Give it a try! Start your retry-external-api (the cloned Hono.js API) and the retry-internal-api (the NestJS API), then make a GET request against the / route of your NestJS application, either in your browser or using an HTTP client of your choice. Make sure the baseURL configured in the HttpModule matches the port your external API is actually listening on, and that the two apps don't end up competing for the same port.
Run this a few times and watch the logs in your terminal to see the retry mechanism in action. It’s important to get a feel for how it behaves in different situations. Feel free to experiment and tweak some of the values to your liking to better understand their impact.
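If you'd like to verify the behavior without hammering the real API, here is a minimal Jest sketch. The setup is an assumption on my side, not part of the tutorial's code: it stubs HttpService so no network calls happen, and it uses fake timers so the backoff delays don't slow the test down (jest.runAllTimersAsync requires a reasonably recent Jest, 29.5 or later):

```typescript
import { HttpService } from '@nestjs/axios';
import { ExponentialBackoffService } from './exponential-backoff.service';

describe('ExponentialBackoffService', () => {
  it('retries a 503 and returns the eventual 200', async () => {
    const get = jest
      .fn()
      // first attempt: a retryable 503, shaped like an AxiosError
      .mockRejectedValueOnce(
        Object.assign(new Error('Service Unavailable'), {
          isAxiosError: true,
          response: { status: 503, data: 'try later', headers: {} },
        }),
      )
      // second attempt: success
      .mockResolvedValueOnce({ status: 200, data: 'ok' });

    const httpService = { axiosRef: { get } } as unknown as HttpService;
    const service = new ExponentialBackoffService(httpService);

    jest.useFakeTimers();
    const pending = service.execute('/');
    await jest.runAllTimersAsync(); // fast-forward the backoff delay
    const result = await pending;
    jest.useRealTimers();

    expect(result).toEqual({ status: 200, data: 'ok' });
    expect(get).toHaveBeenCalledTimes(2);
  });
});
```

With the default NestJS scaffold, save it as exponential-backoff.service.spec.ts next to the service and run it with npm run test.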
Conclusion
Retry mechanisms are one of those topics that often look simple at first glance, but turn out to be incredibly important once you start building real-world systems. External APIs will fail. Networks will be unreliable. Rate limits will be enforced. The question is not if these things happen, but how well your application handles them.
In this post, we’ve seen that retrying is not just about “trying again”. A good retry mechanism needs clear rules about when a retry makes sense, how often it should happen, and how long the system should wait between attempts. Exponential backoff, combined with jitter and proper error classification, is a proven strategy to make systems more resilient without putting unnecessary pressure on external services.
We also looked at how important it is to respect server-side signals like the Retry-After header, especially when dealing with rate limits. Being rate-limit-aware is not only polite, it often makes the difference between a system that recovers gracefully and one that keeps failing under load.
The NestJS implementation we built is intentionally simple, but it already covers many real-world concerns: retryable vs. non-retryable errors, network failures, exponential backoff with jitter, and hard limits to keep behavior predictable. From here, the mechanism can be extended further, for example by adding circuit breakers, retry budgets, or different retry policies per external API.
If you haven’t implemented a retry mechanism like this in your own projects yet, I highly encourage you to try it. Chances are high that you’ll see immediate benefits in stability and reliability, just like I did when applying these ideas to an existing production system.
Retries won’t fix every problem, but when used thoughtfully, they turn temporary failures into minor hiccups instead of outages.