Gunicorn Timeout Is Not What You Think It Is

How you can get there

So, you’re running your Python web application with Gunicorn in production, and, being a good citizen, you know that requests without a timeout are a bad idea. Imagine a long query loading your DB while the client who asked for it has already left your beautiful SaaS, and the whole UI becomes sluggish, yada-yada-yada.

You Google stuff, likely miss the documentation entirely, settle on something like --timeout 30, and call it a day. Or maybe you’ve been running a production Gunicorn set-up with one type of worker, switched to another, and assumed it would keep working the way it did before.

How it really works

The timeout setting does exactly what the documentation says. What it does not do is the thing the documentation never states but which is easy to read into it. The documentation reads:

Workers who are silent for more than this many seconds are killed and restarted.

Value is a positive number or 0. Setting it to 0 has the effect of infinite timeouts by disabling timeouts for all workers entirely.

Generally, the default of thirty seconds should suffice. Only set this noticeably higher if you’re sure of the repercussions for sync workers. For the non-sync workers, it just means that the worker process is still communicating and is not tied to the length of time required to handle a single request.

It only means that if the worker, for some reason, has not signalled that it’s alive within the period specified in the settings, the master process will terminate it and spawn a new one. It is not a per-request timeout.
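For reference, here is a minimal sketch of how these settings might look in a Gunicorn config file. The values are illustrative assumptions, not recommendations; the setting names (worker_class, threads, timeout) are the config-file counterparts of the --worker-class, --threads and --timeout flags.

```python
# gunicorn.conf.py (example values, assumed for illustration)
worker_class = "gthread"  # threaded worker, the one examined in this post
threads = 4               # size of each worker's internal thread pool
timeout = 30              # seconds a worker may stay silent before it is killed
```

Note that nothing here bounds how long a single request may run; timeout only bounds how long the worker process may go without a heartbeat.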

Gunicorn’s architecture is surprisingly easy to explore: it has simple abstractions and a straightforward codebase. There’s the arbiter, which manages the worker processes, and there’s a temporary file, accessible to both the worker and the arbiter, that is used to communicate the last time the worker was alive. The worker (gthread in our case) manages an internal thread pool to serve requests.

The temporary file wrapper is dead simple: it has a notify method that sets the file’s modification time and a last_update accessor that reads it back.

class WorkerTmp:

    # Stripped nasty file management details

    def notify(self):
        new_time = time.monotonic()
        os.utime(self._tmp.fileno(), (new_time, new_time))

    def last_update(self):
        return os.fstat(self._tmp.fileno()).st_mtime
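To make the heartbeat mechanism concrete, here is a self-contained sketch of the same idea. HeartbeatFile and is_stale are hypothetical names of my own, and this version addresses the file by path rather than by file descriptor for portability; it is not Gunicorn’s actual class.

```python
import os
import tempfile
import time


class HeartbeatFile:
    """Hypothetical stand-in for WorkerTmp: the file's mtime is the heartbeat."""

    def __init__(self):
        fd, self.path = tempfile.mkstemp(prefix="heartbeat-")
        os.close(fd)

    def notify(self, now=None):
        # Store a monotonic clock reading as the file's access/modification time.
        now = time.monotonic() if now is None else now
        os.utime(self.path, (now, now))

    def last_update(self):
        return os.stat(self.path).st_mtime

    def is_stale(self, timeout, now=None):
        # The watcher's side: has the owner been silent for longer than timeout?
        now = time.monotonic() if now is None else now
        return now - self.last_update() > timeout

    def close(self):
        os.unlink(self.path)
```

The optional now parameter exists only so the behaviour can be checked without waiting for real time to pass.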

The worker code is also quite simple. We’re interested in the run method:

    def run(self):
        # init listeners, add them to the event loop
        for sock in self.sockets:
            sock.setblocking(False)
            server = sock.getsockname()
            # Here's where the connections are registered and processed
            acceptor = partial(self.accept, server)
            self.poller.register(sock, selectors.EVENT_READ, acceptor)

        while self.alive:  # <-- Worker loop
            # notify the arbiter we are alive
            self.notify()  # <-- temp file updates

            # can we accept more connections?
            if self.nr_conns < self.worker_connections:
                # wait for an event
                events = self.poller.select(1.0)
                for key, _ in events:
                    callback = key.data
                    callback(key.fileobj)

                # check (but do not wait) for finished requests
                result = futures.wait(self.futures, timeout=0,
                                      return_when=futures.FIRST_COMPLETED)
            else:
                # wait for a request to finish
                result = futures.wait(self.futures, timeout=1.0,
                                      return_when=futures.FIRST_COMPLETED)
            # Stripped
        
        # Actual clean-up
        self.tpool.shutdown(False)
        self.poller.close()
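The key property of this loop is that request handling happens in the thread pool, so the main loop keeps ticking no matter how long a request takes. Here is a toy model of that behaviour; run_worker_loop is a hypothetical function, with a counter standing in for self.notify() and a short sleep standing in for the one-second select().

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_worker_loop(request_seconds, tick=0.05):
    """Count the heartbeats emitted while one slow request is in flight."""
    heartbeats = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The "request" blocks a pool thread, not the main loop.
        future = pool.submit(time.sleep, request_seconds)
        while not future.done():
            heartbeats += 1   # stands in for self.notify()
            time.sleep(tick)  # stands in for self.poller.select(1.0)
    return heartbeats
```

A request that blocks for 0.3 seconds still lets the loop emit several heartbeats, which is why a slow handler alone does not make the worker look dead to the arbiter.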

The acceptor accepts new connections and registers them with the poller; once a client socket becomes readable, the connection is submitted to the thread pool for processing.

    def accept(self, server, listener):
        try:
            sock, client = listener.accept()
            conn = TConn(self.cfg, sock, client, server)

            self.nr_conns += 1
            with self._lock:
                self.poller.register(conn.sock, selectors.EVENT_READ,
                                     partial(self.on_client_socket_readable, conn))
        except OSError as e:
            if e.errno not in (errno.EAGAIN, errno.ECONNABORTED,
                               errno.EWOULDBLOCK):
                raise

    def on_client_socket_readable(self, conn, client):
        ...
        self.enqueue_req(conn)

    def enqueue_req(self, conn):
        conn.init()
        # submit the connection to a worker
        fs = self.tpool.submit(self.handle, conn)
        self._wrap_future(fs, conn)
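The pattern used above, storing a partial callback as the selector key’s data and calling it when the socket becomes readable, can be tried in isolation. The sketch below uses a local socket pair instead of a real listener; dispatch_once, demo and the "client-1" label are hypothetical names for illustration.

```python
import selectors
import socket
from functools import partial


def dispatch_once(poller, timeout=1.0):
    """Call the callback stored in key.data for every ready socket."""
    handled = 0
    for key, _ in poller.select(timeout):
        key.data(key.fileobj)
        handled += 1
    return handled


def demo():
    received = []

    def on_readable(label, sock):
        received.append((label, sock.recv(1024)))

    left, right = socket.socketpair()
    poller = selectors.DefaultSelector()
    try:
        # Bind extra context into the callback, as the worker does with partial().
        poller.register(right, selectors.EVENT_READ,
                        partial(on_readable, "client-1"))
        left.sendall(b"ping")
        dispatch_once(poller)
    finally:
        poller.close()
        left.close()
        right.close()
    return received
```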

The arbiter loop checks when each worker last updated its tmp file:

    def murder_workers(self):
        ...
        workers = list(self.WORKERS.items())
        for (pid, worker) in workers:
            try:
                # Check if worker hasn't updated its timestamp within timeout period
                if time.monotonic() - worker.tmp.last_update() <= self.timeout:
                    continue  # Worker is still alive
            except (OSError, ValueError):
                continue

            if not worker.aborted:
                self.log.critical("WORKER TIMEOUT (pid:%s)", pid)
                worker.aborted = True
                self.kill_worker(pid, signal.SIGABRT)  # First attempt: graceful
            else:
                self.kill_worker(pid, signal.SIGKILL)  # Force kill if still alive
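The decision the arbiter makes for each worker can be boiled down to a small pure function. This is a sketch of the logic shown above, not Gunicorn code: decide_signal is a hypothetical name, signals are returned as plain strings, and the early return for a timeout of 0 is my reading of the documented "0 disables timeouts" behaviour.

```python
def decide_signal(now, last_update, timeout, aborted):
    """Return the signal name to send to a worker, or None to leave it alone."""
    if not timeout:
        # A timeout of 0 disables the check entirely (per the documentation).
        return None
    if now - last_update <= timeout:
        return None  # heartbeat is recent enough
    # First offence gets SIGABRT; a worker already marked aborted gets SIGKILL.
    return "SIGKILL" if aborted else "SIGABRT"
```

Note that the only input describing the worker’s health is the heartbeat timestamp; nothing about individual requests is consulted.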

The abort handler in the worker is also simple:

    def handle_abort(self, sig, frame):
        self.alive = False  # <-- ends the endless loop of the worker
        self.cfg.worker_abort(self)
        sys.exit(1)

All of the process can be summarized in the following sequence diagram:

    sequenceDiagram
        participant Config as Configuration
        participant Arbiter as Arbiter Process
        participant Worker as gthread Worker
        participant TmpFile as WorkerTmp File
        participant ThreadPool as Thread Pool
        participant Request as HTTP Request

        Config->>Arbiter: --timeout=30 (default)
        Arbiter->>Worker: spawn_worker(timeout=30)
        Worker->>TmpFile: create temp file for heartbeat

        loop Worker Main Loop
            Worker->>TmpFile: notify() - update timestamp
            Worker->>Worker: select() for events (1.0s timeout)
            alt New Connection
                Worker->>ThreadPool: submit(handle_request)
                ThreadPool->>Request: process in background thread
                Request-->>ThreadPool: response (may take time)
                ThreadPool-->>Worker: future.done()
            end
            Worker->>TmpFile: notify() - heartbeat
        end

        loop Arbiter Monitoring Loop
            Arbiter->>Arbiter: sleep() - wait for signals/events
            Arbiter->>Worker: murder_workers() check
            Arbiter->>TmpFile: check last_update() timestamp
            alt Worker Responsive (< timeout)
                TmpFile-->>Arbiter: timestamp within 30s
                Arbiter->>Arbiter: continue monitoring
            else Worker Silent (>= timeout)
                TmpFile-->>Arbiter: timestamp > 30s old
                Arbiter->>Worker: SIGABRT (graceful kill)
                alt Worker Still Alive
                    Arbiter->>Worker: SIGKILL (force kill)
                end
                Arbiter->>Arbiter: spawn new worker
            end
        end

What’s the point of the timeout?

Here’s the bonus section: when does --timeout actually kick in and kill the worker? My initial feeling was that a single blocked thread (time.sleep(1000), welcome) would get the worker killed, but that was not the case. Because the worker serves requests from an internal thread pool, it’s not easy to block the worker process from the inside to the point where its main loop can no longer update the temporary file. The only way I managed to trigger the timeout was to run kill -STOP <worker PID> and force the process to stop.
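If you want to repeat the time.sleep experiment yourself, a minimal WSGI app along these lines should do. This is a sketch: make_app, the slow_app.py file name and the flag values in the comment are assumptions for illustration.

```python
import time


def make_app(sleep_seconds):
    """Build a WSGI app whose handler blocks for sleep_seconds."""

    def app(environ, start_response):
        # Blocks only the pool thread serving this request; the worker's
        # main loop keeps running and keeps touching the heartbeat file.
        time.sleep(sleep_seconds)
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"finally done\n"]

    return app


# Far longer than any sensible --timeout value.
app = make_app(1000)

# Hypothetical invocation, assuming this file is saved as slow_app.py:
#   gunicorn --worker-class gthread --threads 4 --timeout 5 slow_app:app
```

With the gthread worker, a request to this app hangs for the full sleep while the worker carries on; as described above, no WORKER TIMEOUT showed up in my runs.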

To my understanding, there are not many ways to accidentally block the worker’s main thread on your own:

Hitting a segfault in C library code (well, by then you’re likely already dead as a worker)
Making a long call into a C extension that does not release the GIL, blocking the whole interpreter
A long garbage collector pause exceeding the --timeout value (or expensive GC callbacks)
Doing something very, very blocking in a signal handler, which is quite difficult to achieve, as signal handlers have to be set in the main thread (the Gunicorn worker in this case)
The kernel being so busy that you don’t get your precious CPU time (renice 19 -p <worker pid>, IDK)

If you want a request-level timeout, you’d need to either configure a request timeout on your load balancer, or patch the Gunicorn worker with your own timer code, properly set up the keep-alive connection handling, and make sure your app is aware of cancellation. I don’t have a one-size-fits-all solution for such a set-up; refer to other blog posts :)
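To give a flavour of what "write the timer code yourself" could mean at the application level, here is a rough WSGI middleware sketch. RequestTimeoutMiddleware is a hypothetical class of mine, not a Gunicorn feature, and it has a serious limitation: Python threads cannot be cancelled from outside, so the slow handler keeps running in the background and only the response is bounded.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class RequestTimeoutMiddleware:
    """Hypothetical WSGI middleware: answer 504 if the wrapped app is too slow."""

    def __init__(self, app, timeout, max_workers=8):
        self.app = app
        self.timeout = timeout
        self.pool = ThreadPoolExecutor(max_workers=max_workers)

    def _call_app(self, environ):
        captured = {}

        def start_response(status, headers, exc_info=None):
            captured["status"] = status
            captured["headers"] = headers

        # Buffer the whole body so the deadline covers the full response.
        body = b"".join(self.app(environ, start_response))
        return captured["status"], captured["headers"], body

    def __call__(self, environ, start_response):
        future = self.pool.submit(self._call_app, environ)
        try:
            status, headers, body = future.result(timeout=self.timeout)
        except FutureTimeout:
            # The handler thread is NOT stopped here; it runs to completion.
            start_response("504 Gateway Timeout",
                           [("Content-Type", "text/plain")])
            return [b"request timed out\n"]
        start_response(status, headers)
        return [body]
```

It also buffers responses, which breaks streaming, so treat it as an illustration of the problem rather than something to deploy.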
