How you can get there
So, you’re running your Python web application with Gunicorn in production, and being a responsible engineer, you know that requests without a timeout are a bad idea. Imagine a long query loading your DB after the client has already left, your beautiful SaaS still waiting for the results, the whole UI becoming sluggish, yada-yada-yada.
You Google stuff, likely miss the documentation entirely, settle on a setting like --timeout 30, and call it a day. Or maybe you’ve been running a production Gunicorn setup with one type of worker, switched to another, and assumed it would keep working the way you expected.
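For reference, the same setting can live in a config file instead of on the command line. The snippet below is a minimal sketch of a gunicorn.conf.py; the bind address, worker count, and thread count are placeholder values picked for illustration, not recommendations.

```python
# gunicorn.conf.py: a minimal sketch, all values are illustrative assumptions
bind = "127.0.0.1:8000"    # assumed address, adjust to your deployment
workers = 2                # assumed number of worker processes
worker_class = "gthread"   # the threaded worker discussed in this post
threads = 4                # assumed thread pool size per worker
timeout = 30               # equivalent to --timeout 30 on the command line
```

It would be picked up with something like gunicorn -c gunicorn.conf.py myapp:app, where myapp:app is a hypothetical application module.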
How it really works
The timeout setting does exactly what the documentation says. What it does not do is the part the documentation never states but a reader may assume. Quoting the documentation:
Workers who are silent for more than this many seconds are killed and restarted.
Value is a positive number or 0. Setting it to 0 has the effect of infinite timeouts by disabling timeouts for all workers entirely.
Generally, the default of thirty seconds should suffice. Only set this noticeably higher if you’re sure of the repercussions for sync workers. For the non-sync workers, it just means that the worker process is still communicating and is not tied to the length of time required to handle a single request.
It only means that if the worker, for some reason, has not signaled that it is alive within the configured period, the master process terminates it and spawns a new one.
Gunicorn’s architecture is surprisingly easy to explore: it has simple abstractions and a straightforward codebase. There is the arbiter, which manages the worker processes, and a temporary file, accessible to both the worker and the arbiter, that is used to communicate the last-alive time. The worker (gthread in our case) manages an internal thread pool to serve requests.
The temporary file wrapper is dead simple: it has an accessor method, last_update, that reads the file’s last modification time, and a notify method that sets it.
class WorkerTmp:
    # Stripped nasty file management details

    def notify(self):
        new_time = time.monotonic()
        os.utime(self._tmp.fileno(), (new_time, new_time))

    def last_update(self):
        return os.fstat(self._tmp.fileno()).st_mtime
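To make the mechanism concrete, here is a small self-contained sketch of the same idea: a file whose modification time is used as a heartbeat. HeartbeatFile, age, and close are hypothetical names of mine, not Gunicorn’s API, and the sketch assumes a platform where os.utime and os.fstat behave as on Linux.

```python
import os
import tempfile
import time


class HeartbeatFile:
    """Hypothetical stand-in for WorkerTmp: the file's mtime is the heartbeat."""

    def __init__(self):
        self._fd, self._path = tempfile.mkstemp(prefix="heartbeat-")
        # Prefer the descriptor (as in the excerpt); fall back to the path
        # on platforms where os.utime does not accept a file descriptor.
        self._target = self._fd if os.utime in os.supports_fd else self._path

    def notify(self):
        # Store the monotonic clock reading as the modification time.
        now = time.monotonic()
        os.utime(self._target, (now, now))

    def last_update(self):
        return os.fstat(self._fd).st_mtime

    def age(self):
        # Seconds since the last notify(), measured on the same clock.
        return time.monotonic() - self.last_update()

    def close(self):
        os.close(self._fd)
        os.unlink(self._path)


hb = HeartbeatFile()
hb.notify()
fresh_age = hb.age()
time.sleep(0.3)          # nobody calls notify() here, so the heartbeat goes stale
stale_age = hb.age()
hb.close()
print(f"age right after notify: {fresh_age:.3f}s, after sleeping: {stale_age:.3f}s")
```

The point of the design is that the reader only needs the file descriptor and a clock: whoever stops calling notify() simply looks older and older to the other side.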
The worker code is also quite simple. We’re interested in the run method:
def run(self):
    # init listeners, add them to the event loop
    for sock in self.sockets:
        sock.setblocking(False)
        server = sock.getsockname()
        # Here's where the connections are registered and processed
        acceptor = partial(self.accept, server)
        self.poller.register(sock, selectors.EVENT_READ, acceptor)

    while self.alive:  # <-- Worker loop
        # notify the arbiter we are alive
        self.notify()  # <-- temp file updates
        # can we accept more connections?
        if self.nr_conns < self.worker_connections:
            # wait for an event
            events = self.poller.select(1.0)
            for key, _ in events:
                callback = key.data
                callback(key.fileobj)
            # check (but do not wait) for finished requests
            result = futures.wait(self.futures, timeout=0,
                                  return_when=futures.FIRST_COMPLETED)
        else:
            # wait for a request to finish
            result = futures.wait(self.futures, timeout=1.0,
                                  return_when=futures.FIRST_COMPLETED)
        # Stripped

    # Actual clean-up
    self.tpool.shutdown(False)
    self.poller.close()
The acceptor code just accepts connections on the listening sockets, waits for each client socket to become readable, and then submits the connection to the thread pool for processing.
def accept(self, server, listener):
    try:
        sock, client = listener.accept()
        conn = TConn(self.cfg, sock, client, server)
        self.nr_conns += 1
        with self._lock:
            self.poller.register(conn.sock, selectors.EVENT_READ,
                                 partial(self.on_client_socket_readable, conn))
    except OSError as e:
        if e.errno not in (errno.EAGAIN, errno.ECONNABORTED,
                           errno.EWOULDBLOCK):
            raise

def on_client_socket_readable(self, conn, client):
    ...
    self.enqueue_req(conn)

def enqueue_req(self, conn):
    conn.init()
    # submit the connection to a worker
    fs = self.tpool.submit(self.handle, conn)
    self._wrap_future(fs, conn)
The arbiter loop checks whether the worker tmp files have been updated recently enough:
def murder_workers(self):
    ...
    workers = list(self.WORKERS.items())
    for (pid, worker) in workers:
        try:
            # Check if worker hasn't updated its timestamp within timeout period
            if time.monotonic() - worker.tmp.last_update() <= self.timeout:
                continue  # Worker is still alive
        except (OSError, ValueError):
            continue

        if not worker.aborted:
            self.log.critical("WORKER TIMEOUT (pid:%s)", pid)
            worker.aborted = True
            self.kill_worker(pid, signal.SIGABRT)  # First attempt: graceful
        else:
            self.kill_worker(pid, signal.SIGKILL)  # Force kill if still alive
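The escalation logic is easier to see when pulled out of the arbiter. The sketch below is my own simplification, not Gunicorn code: pick_signal is a hypothetical pure function that mirrors the branches above, and the early return for a timeout of 0 is based only on the documentation quote saying that 0 disables timeouts. It assumes a POSIX platform where signal.SIGKILL exists.

```python
import signal


def pick_signal(now, last_update, timeout, already_aborted):
    """Return the signal a supervisor would send, or None if the worker is fine.

    Hypothetical helper mirroring the branch structure of murder_workers.
    """
    if not timeout:
        return None  # per the documentation quote, 0 disables the timeout
    if now - last_update <= timeout:
        return None  # heartbeat is recent enough
    if not already_aborted:
        return signal.SIGABRT  # first strike: let the abort handler run
    return signal.SIGKILL  # the worker was already aborted and is still stale


timeout = 30
decisions = [
    # heartbeat 10 seconds old: nothing to do
    pick_signal(now=100.0, last_update=90.0, timeout=timeout, already_aborted=False),
    # heartbeat 31 seconds old, first detection
    pick_signal(now=131.0, last_update=100.0, timeout=timeout, already_aborted=False),
    # still stale on the next check
    pick_signal(now=132.0, last_update=100.0, timeout=timeout, already_aborted=True),
]
print(decisions)
```

Note that nothing in this decision looks at requests at all: the only inputs are the clock, the heartbeat timestamp, and whether an abort was already sent.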
The abort handler in the worker is also simple:
def handle_abort(self, sig, frame):
    self.alive = False  # <-- ends the endless loop of the worker
    self.cfg.worker_abort(self)
    sys.exit(1)
The whole process can be summarized in the following sequence diagram:
What’s the point of the timeout?
Here’s the bonus section: when does --timeout actually kick in and kill the worker? My initial feeling was that a single blocked thread would get the worker killed (time.sleep(1000), welcome), but that was not the case. As the worker has a thread pool internally, it’s not that easy to block the worker’s Python process to the point where it cannot update the temporary file. The only way I managed to reproduce it was to use kill -STOP <worker PID> to force the process to stop.
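A toy model of that observation, assuming nothing about Gunicorn internals beyond the loop shown earlier: the main thread keeps ticking a heartbeat while a pool thread is stuck in time.sleep. slow_request and run_worker_loop are hypothetical names, and the durations are shortened so the sketch finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_request(seconds):
    # Stand-in for a request handler that blocks for a long time.
    time.sleep(seconds)
    return "done"


def run_worker_loop(duration, tick):
    """Tick a heartbeat in the main thread while the pool handles a slow request."""
    heartbeats = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        future = pool.submit(slow_request, duration)
        while not future.done():
            heartbeats.append(time.monotonic())  # plays the role of self.notify()
            time.sleep(tick)                     # plays the role of the poller wait
    return heartbeats


beats = run_worker_loop(duration=1.0, tick=0.1)
gaps = [later - earlier for earlier, later in zip(beats, beats[1:])]
print(f"{len(beats)} heartbeats, largest gap {max(gaps):.2f}s")
```

The blocked handler never delays the heartbeat, because time.sleep releases the GIL and the loop runs in a different thread. That is why a slow request alone does not trip the timeout in this kind of worker.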
To my understanding, there are not many ways you can accidentally block the worker on your own:
- Get a segfault in C-library code (well, at that point the worker is likely already dead)
- Hit an expensive call in a C extension that does not release the GIL, blocking the whole interpreter
- A long garbage collector pause exceeding the --timeout value (or some expensive callbacks)
- Doing something very, very blocking in a signal handler, but this is quite difficult to achieve, as you need to set signal handlers in the main thread (the gunicorn worker in this case)
- The kernel being so busy that you don’t get your precious CPU time (renice 19 -p <worker pid>, IDK)
If you want a request-level timeout, you’d need to either set a request timeout on your load balancer, or patch the gunicorn worker and write the timer code yourself, properly set up the keep-alive connection infrastructure, and make sure your app is aware of cancellation policies. I don’t have a one-size-fits-all solution for such a setup; refer to other blog posts :).
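As one illustration of what "aware of cancellation policies" could mean on the application side, here is a minimal sketch of a cooperative deadline. Deadline, DeadlineExceeded, and handle_request are hypothetical names, not part of Gunicorn or any framework, and the approach only stops work between steps: it cannot interrupt a step that is already blocked.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when a request has used up its time budget."""


class Deadline:
    def __init__(self, seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + seconds

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def check(self):
        if self._clock() >= self._expires_at:
            raise DeadlineExceeded("request time budget exhausted")


def handle_request(steps, deadline):
    """Run the steps in order, giving up between steps once the deadline passes."""
    results = []
    for step in steps:
        deadline.check()
        results.append(step())
    return results


# Demo with a fake clock so the outcome does not depend on real timing.
fake_now = [0.0]


def fake_clock():
    return fake_now[0]


def slow_step():
    fake_now[0] += 2.0  # pretend this step took two seconds
    return "ok"


deadline = Deadline(seconds=3.0, clock=fake_clock)
try:
    handle_request([slow_step, slow_step, slow_step], deadline)
    outcome = "finished"
except DeadlineExceeded:
    outcome = "timed out"
print(outcome)
```

In a real app the remaining budget would typically also be passed down as the timeout of each DB query or HTTP call, so that a single step cannot overshoot the deadline by much.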