Request Retrial

In this pattern, we'll explore how API services can define a retry policy that determines both which requests are eligible to be retried and the timing algorithm for how long to wait before retrying.

Implementation

Retry eligibility

GENERALLY RETRIABLE

Code	Name	Description
408	Request Timeout	The client didn't produce a request fast enough.
421	Misdirected Request	The request was sent to a server that couldn't handle it.
425	Too Early	The server doesn't want to try handling a request that might be replayed.
429	Too Many Requests	The client has sent too many requests in a given period of time.
503	Service Unavailable	The server cannot handle the request because it's overloaded.

DEFINITELY NOT RETRIABLE

Code	Name	Description
403	Forbidden	The request was fine, but the server is refusing to handle it.
405	Method Not Allowed	The method specified is not allowed.
412	Precondition Failed	The server does not meet the conditions of the request.
501	Not Implemented	The server cannot recognize or handle the request.

MAYBE RETRIABLE

Code	Name	Description
500	Internal Server Error	An unexpected failure occurred on the server.
502	Bad Gateway	The request was passed to a downstream server that sent an invalid response.
504	Gateway Timeout	The request was passed to a downstream server that never replied.
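The three tables above can be encoded as a small lookup. The following is a minimal sketch, not a definitive policy: the names `classifyStatus` and `shouldRetry` are hypothetical, and the choice to retry "maybe retriable" codes only when the caller says the request is safe to repeat (the `isSafeToRepeat` flag) is an assumption made for illustration.

```typescript
type Retriability =
  | 'retriable' | 'not-retriable' | 'maybe-retriable' | 'unclassified';

// Status codes copied from the three tables above.
const GENERALLY_RETRIABLE = new Set<number>([408, 421, 425, 429, 503]);
const DEFINITELY_NOT_RETRIABLE = new Set<number>([403, 405, 412, 501]);
const MAYBE_RETRIABLE = new Set<number>([500, 502, 504]);

// Hypothetical helper: maps an HTTP status code to its table.
function classifyStatus(status: number): Retriability {
  if (GENERALLY_RETRIABLE.has(status)) return 'retriable';
  if (DEFINITELY_NOT_RETRIABLE.has(status)) return 'not-retriable';
  if (MAYBE_RETRIABLE.has(status)) return 'maybe-retriable';
  return 'unclassified'; // Codes the tables do not mention.
}

// Hypothetical helper. Assumption: "maybe retriable" codes are retried
// only when the caller declares that repeating the request is safe.
function shouldRetry(status: number, isSafeToRepeat: boolean): boolean {
  const kind = classifyStatus(status);
  if (kind === 'retriable') return true;
  if (kind === 'maybe-retriable') return isSafeToRepeat;
  return false;
}
```

Codes that appear in none of the tables are reported as `unclassified` and are not retried, which is a conservative default rather than something the tables prescribe.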

API definition

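async function getChatRoomWithRetries(
  id: string, maxDelayMs = 32000, maxRetries = 10): Promise<ChatRoom> {
  let retryCount = 0;
  let delayMs = 1000;  // Initial back-off delay; doubles after each failure.
  while (true) {
    try {
      // Await here so that a rejected promise is caught by the catch block.
      return await GetChatRoom({ id });
    } catch (e: any) {
      // Give up once all of the allowed retries have been used.
      if (retryCount++ >= maxRetries) throw e;
      // Prefer the server's Retry-After value (in seconds) if one was sent.
      const retryAfter = e?.response?.headers?.['Retry-After'];
      let actualDelayMs: number;
      if (retryAfter != null && !isNaN(Number(retryAfter))) {
        actualDelayMs = Number(retryAfter) * 1000;
      } else {
        // Otherwise use exponential back-off plus up to 1 second of jitter.
        actualDelayMs = delayMs + (Math.random() * 1000);
      }
      await new Promise((resolve) => setTimeout(resolve, actualDelayMs));
      // Double the delay for the next attempt, capped at maxDelayMs.
      delayMs = Math.min(delayMs * 2, maxDelayMs);
    }
  }
}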

Exercises

Why isn't there a simple rule for deciding which failed requests can safely be retried?

Some failures are due to client-side errors, whereas others are due to server-side errors, so no single rule covers them all.

What is the underlying reason for relying on exponential back-off? What is the purpose for the random jitter between retries?

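Exponential back-off is a reasonable default when we know nothing else about the system.

The random jitter prevents a stampeding herd, where many clients retry at the same moment.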
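To make the interaction between back-off and jitter concrete, here is a minimal sketch of the delay calculation on its own. The function name `backoffDelayMs` and its parameters are hypothetical; the defaults (1 second base, 32 second cap, up to 1 second of jitter) mirror the values used in the API definition above. The random source is injectable only so that the schedule can be inspected deterministically.

```typescript
// Hypothetical helper: delay before retry number `attempt` (0-based).
function backoffDelayMs(
  attempt: number,
  baseDelayMs = 1000,
  maxDelayMs = 32000,
  maxJitterMs = 1000,
  random: () => number = Math.random,
): number {
  // Double the base delay for each previous attempt, up to the cap.
  const exponential = Math.min(baseDelayMs * Math.pow(2, attempt), maxDelayMs);
  // Add random jitter so clients that failed together do not retry together.
  return exponential + random() * maxJitterMs;
}

// With jitter disabled, the schedule is purely exponential and then capped.
const schedule = [0, 1, 2, 3, 4, 5, 6].map(
  (attempt) => backoffDelayMs(attempt, 1000, 32000, 1000, () => 0));
// schedule is [1000, 2000, 4000, 8000, 16000, 32000, 32000]
```

Without the jitter term, every client that failed at the same moment would compute the same schedule and hit the service again in lockstep.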

When does it make sense to use the Retry-After header?

It makes sense when the service itself is in control of when the next request will be allowed.
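On the client side, a Retry-After value can arrive either as a number of seconds or as an HTTP date, so a client that honors it needs to handle both forms. The following is a minimal sketch under that assumption; the function name `parseRetryAfterMs` is hypothetical, and `nowMs` is a parameter only so the date form can be checked against a fixed clock.

```typescript
// Hypothetical helper: converts a Retry-After header value into a delay in
// milliseconds, or undefined if the value cannot be interpreted.
function parseRetryAfterMs(
  value: string, nowMs: number = Date.now()): number | undefined {
  const trimmed = value.trim();
  // Form 1: a non-negative integer number of seconds, e.g. "120".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (isNaN(dateMs)) return undefined;
  // A date in the past means the client may retry immediately.
  return Math.max(0, dateMs - nowMs);
}
```

A caller would typically fall back to its own exponential back-off when this returns `undefined`.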

Summary

Errors that are in some way transient or time-related (e.g., HTTP 429 Too Many Requests) are likely to be retriable, whereas those that are related to some permanent state (e.g., HTTP 403 Forbidden) are mostly unsafe to retry.
Whenever code automatically retries requests, it should rely on some form of exponential back-off with limits on the number of retries and the delay between requests. Ideally it should also introduce some jitter to avoid the stampeding herd problem where all requests are retried according to the same rules and therefore always arrive at the same time.
If the API service knows something about when a request is likely to be successful if retried, it should indicate this using a Retry-After HTTP header.