[Disclaimer]
I struggled a lot with writing this article, as K8S itself is way too big to grasp quickly, and I had to make a lot of compromises on structure and detail for it all to make some sense. The article may feel drawn out in places, jumping from topic to topic, yet I had to finish it, so sorry not sorry.
GenAI has been used to generate the diagrams from the Kubernetes, containerd, and runc codebases.
Specific commit hashes used for analysis:
- Kubernetes: 19aa01e61cf
- containerd: cd0f2cc23
- runc: b04031d7
Introduction
So, we’ve been through quite a journey. We’ve been able to track down how signals help us clean up resources before imminent death. We’ve also peeked through the keyhole at how layered today’s containerization software is, and traced how signal passing works across those layers!
Now let’s approach the arch-nemesis, Kubernetes, with the same question: how does graceful shutdown work within it?
Kubernetes itself is quite a difficult beast to get an overview of, with its abundance of abstractions, interfaces, and configurations.
At its core, Kubernetes provides a way to:
- Deploy containerized applications across a cluster of machines (nodes)
- Scale applications up and down horizontally and vertically
- Ensure high availability through automatic failover
- Handle networking between applications
- Manage storage and configuration
- Provide an abstraction layer on top of cloud providers or raw hardware infrastructure
When you deploy an application to Kubernetes, you’re not directly interacting with containers or individual machines. Instead, you declare the desired state of your application (number of replicas, resource requirements, etc.) and Kubernetes handles the nitty-gritty details of scheduling and maintaining that state.
Let’s dive into the key abstractions of Kubernetes responsible for the graceful shutdown.
Understanding Pods: What We’re Actually Shutting Down
Before diving into the shutdown mechanics, let’s clarify what exactly we’re terminating. A pod in Kubernetes isn’t just a single container - it’s actually a collection of containers working together as a single unit.
Pod Components:
- Pause container: The infrastructure container that creates shared namespaces
- Application containers: Your actual workload (web server, API, etc.)
- Sidecar containers: Supporting services (logging, monitoring, service mesh)
Shared Resources: All containers in a pod share the same network IP, storage volumes, and inter-process communication channels. This means when we “shut down a pod,” we’re coordinating the termination of multiple processes that are dependent on one another.
Kubernetes Architecture for Graceful Shutdown
Kubernetes has lots of components, but let’s focus on the ones that matter for pod termination:
Control Plane Components:
- API Server: The central communication hub that coordinates all shutdown activities
- Controller Manager: Manages higher-level constructs like Deployments and Services
- Endpoint Controller: Updates network routing when pods terminate
Worker Node Components:
- kubelet: The node agent that manages pod lifecycles and executes termination
- kube-proxy: Updates network rules to redirect traffic away from terminating pods
- Container Runtime: The actual executor of container start/stop operations
The graceful shutdown process is orchestrated across all these components to ensure both proper process termination and correct network traffic management.
[Diagram: cluster architecture for graceful shutdown. Control Plane: API Server (central management hub), Scheduler (pod placement decisions), Controller Manager (maintains desired state), etcd (cluster data store). Worker Node: kubelet (node agent), kube-proxy (network proxy), Container Runtime (containerd/runc). Each pod has a pause container providing the shared network namespace, IPC namespace, and storage volumes that the application and sidecar containers join; the container runtime manages all of these containers. The API Server sends pod lifecycle commands to kubelet, kubelet reports node and pod status back, and kube-proxy handles service discovery and load balancing toward the application containers.]
Network Traffic Management During Shutdown
I work mostly on backend systems, so web applications are of the utmost interest to me. For them, graceful shutdown involves two parallel concerns: actually terminating the application process cleanly AND making sure network traffic is properly managed.
When a pod shuts down, we need to:
- Stop new traffic from reaching the shutting down pod
- Allow existing connections to complete gracefully (at least as much as we can)
- Coordinate timing between network updates and process termination
How Kubernetes Solves This:
EndpointSlices and Traffic Redirection: Kubernetes uses EndpointSlice objects to track which pods can receive traffic. When a pod begins terminating, the Endpoint Controller immediately updates the EndpointSlice with:
endpoints:
- addresses: ["10.244.1.5"]
  conditions:
    ready: false        # Stop new traffic
    serving: true       # Allow existing connections
    terminating: true   # Pod is shutting down
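If you want to observe these conditions yourself, here is a minimal client-go sketch. It assumes in-cluster credentials and a Service named web-app in the default namespace (both are assumptions for illustration); it lists the Service’s EndpointSlices and prints each endpoint’s ready/serving/terminating flags.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// boolOf unwraps the *bool condition fields used by EndpointSlice.
func boolOf(b *bool) bool { return b != nil && *b }

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// EndpointSlices carry a well-known label pointing at the owning Service.
	slices, err := clientset.DiscoveryV1().EndpointSlices("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "kubernetes.io/service-name=web-app"})
	if err != nil {
		panic(err)
	}

	for _, slice := range slices.Items {
		for _, ep := range slice.Endpoints {
			fmt.Printf("%v ready=%t serving=%t terminating=%t\n",
				ep.Addresses,
				boolOf(ep.Conditions.Ready),
				boolOf(ep.Conditions.Serving),
				boolOf(ep.Conditions.Terminating))
		}
	}
}
```

Running this during a rolling update shows terminating endpoints flip to ready=false while serving stays true until the process actually exits.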
Load Balancer Coordination: Different components handle the transition differently:
- kube-proxy: Updates iptables rules to redirect new connections
- Ingress Controllers: May implement custom connection draining logic
- External Load Balancers: Need time to detect changes (hence PreStop hooks)
- Service Mesh: Advanced traffic management during termination
This networking orchestration happens in parallel with process termination, ensuring zero-downtime deployments when configured properly.
Now that we understand the components and networking coordination involved, let’s examine how to configure graceful shutdown properly.
Configuring Graceful Shutdown in Kubernetes
The magic happens through specific configuration settings in your Kubernetes manifests. Here’s how to configure the key graceful shutdown aspects:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # CRITICAL: Prevents traffic disruption during deployments
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      # CORE GRACEFUL SHUTDOWN SETTING: Time allowed for complete shutdown
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: my-web-app:latest
          ports:
            - containerPort: 8080
          # PRESTOP HOOK: Coordinates with load balancer updates
          lifecycle:
            preStop:
              exec:
                # We can run some custom logic before the SIGTERM comes
                command: ["/bin/sh", "-c", "sleep 15"]
          # READINESS PROBE: Controls when pod receives traffic
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 1  # Fast removal from endpoints when unhealthy
Graceful Shutdown Configuration Breakdown:
- terminationGracePeriodSeconds: 60: Total time Kubernetes waits before sending SIGKILL. This is your shutdown budget - choose it based on your longest-running requests plus buffer time.
- preStop hook with 15s delay: Critical for web apps. This delay ensures load balancers detect that the pod is terminating and stop sending new traffic BEFORE your application receives SIGTERM. It is also the place to put custom shutdown logic when the application has no SIGTERM handling of its own: wait for in-flight work to finish, signal the cluster about the upcoming rebalance, persist some in-memory state, etc. Important: if the preStop hook runs longer than the grace period, kubelet grants a small 2-second extension, and if the hook still doesn’t complete, emergency termination occurs (see the sketch after this list).
- maxUnavailable: 0: Forces rolling updates to start new pods before terminating old ones. Combined with readiness probes, this ensures zero-downtime deployments.
- readinessProbe with failureThreshold: 1: Fast endpoint removal when the app becomes unhealthy. During shutdown, your app should fail readiness checks immediately after receiving SIGTERM so that new traffic stops.
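The interplay between the preStop hook and the grace period is easiest to see as arithmetic. Below is a minimal sketch of that budgeting; it is illustrative only, not kubelet’s actual code, and the 2-second constant is the extension mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed constant for the sketch: the small extension kubelet grants when the
// preStop hook consumes the entire grace period.
const minimumBudgetAfterHook = 2 * time.Second

// signalBudget returns how much time is left between SIGTERM and SIGKILL once
// the preStop hook has eaten into terminationGracePeriodSeconds.
func signalBudget(gracePeriod, preStopDuration time.Duration) time.Duration {
	remaining := gracePeriod - preStopDuration
	if remaining <= 0 {
		return minimumBudgetAfterHook
	}
	return remaining
}

func main() {
	// With the manifest above: 60s grace period, 15s preStop sleep.
	fmt.Println(signalBudget(60*time.Second, 15*time.Second)) // 45s for the app to drain
	fmt.Println(signalBudget(30*time.Second, 35*time.Second)) // hook overran: only ~2s left
}
```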
Timing Relationship:
t=0s: preStop hook starts (sleep 15)
t=15s: SIGTERM sent to application
t=60s: SIGKILL sent if still running
Your app has 45 seconds (t=15s to t=60s) to drain connections and shut down gracefully.
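Here is a minimal sketch of an application that spends that budget well, assuming a Go HTTP service serving the /health/ready probe from the manifest above; the handlers and timings are illustrative, and your framework will have its own equivalents.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown atomic.Bool

	mux := http.NewServeMux()
	// Readiness probe: starts failing as soon as SIGTERM arrives, so the
	// endpoint is removed quickly and new traffic stops.
	mux.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go srv.ListenAndServe() // error handling omitted in this sketch

	// Wait for the SIGTERM that kubelet delivers after the preStop hook.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	shuttingDown.Store(true) // readiness now fails

	// Drain within the ~45s left before SIGKILL, leaving a little slack.
	ctx, cancel := context.WithTimeout(context.Background(), 40*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx) // stops accepting new connections, waits for in-flight requests
}
```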
With the configuration basics covered, let’s dive into the technical implementation details. We’ll trace the complete journey from command execution to signal delivery.
The Graceful Shutdown Flow: From kubectl to kill()
The following breakdown shows the complete pod termination flow from kubectl delete to process termination:
Phase 1 - Initial Request: User executes kubectl delete pod my-app. API Server marks the pod for deletion in etcd and triggers parallel workflows: Controller Manager removes the pod from its ReplicaSet while Endpoint Controller updates networking.
Phase 2 - Networking Updates: Endpoint Controller marks the pod as ready: false, serving: true, terminating: true. Load balancers immediately stop routing new traffic while existing connections continue.
Phase 3 - kubelet Coordination: kubelet receives deletion event, calculates grace period (30s default), and executes PreStop hooks if defined. This provides critical time for load balancer updates.
Phase 4 - Signal Delivery: kubelet initiates the CRI chain: kubelet → containerd → containerd-shim → runc → SIGTERM to process 1 inside each container. For pods with sidecar containers, main containers receive SIGTERM first, then sidecars are terminated in reverse order.
Phase 5 - Connection Draining: Application stops accepting new requests but completes existing ones. If graceful exit succeeds, cleanup begins. If grace period expires, SIGKILL forces termination.
Phase 6 - Cleanup: Process termination status propagates back to kubelet and API Server. kubelet transitions the pod to a terminal phase (Failed or Succeeded), then forcibly removes the pod object from the API server by setting grace period to 0. Endpoint is removed from EndpointSlice and load balancers complete updates.
This orchestrated shutdown maintains web application availability during deployments and scaling operations.
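To make Phase 4’s first hop more concrete, here is a hedged sketch of a CRI client asking the runtime to stop a container, roughly what kubelet’s StopContainer call amounts to before containerd relays the signal through the shim and runc. The containerd socket path and the container ID are assumptions for illustration only.

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// Assumed containerd CRI socket; other runtimes expose a different path.
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	rt := runtimeapi.NewRuntimeServiceClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// Timeout is the grace period in seconds: the runtime sends the stop signal
	// (SIGTERM by default) and falls back to SIGKILL once it expires.
	_, err = rt.StopContainer(ctx, &runtimeapi.StopContainerRequest{
		ContainerId: "abc123", // hypothetical container ID
		Timeout:     30,
	})
	if err != nil {
		panic(err)
	}
}
```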
Pod Removal Timeline: Events and Coordination
The following timeline shows the chronological flow of events during pod termination, organized by system component (the time slices are illustrative; they convey the order of magnitude rather than exact numbers):
Pod Removal Sequence: The Complete Flow
The following sequence diagram shows how all the components interact during pod removal, from the initial kubectl delete command to the final signal delivery:
[Sequence diagram: kubectl delete reaches the API Server, which marks the pod for deletion; the Endpoint Controller flips the endpoint to ready: false, serving: true, terminating: true and load balancers stop routing new connections while existing ones continue. kubelet receives the deletion event via its watch, calculates the grace period (30s default), runs the PreStop hook if defined, subtracts the hook time from the grace period, and determines the sidecar termination order (main containers first). It then calls StopContainer(containerID, gracePeriod) on the CRI, which relays SIGTERM through the shim and runc to PID 1 inside the container. The application stops accepting new requests and completes existing ones; if it exits gracefully, the status propagates back up (runc → shim → CRI → kubelet → API Server) and the endpoint is removed. If the grace period expires, SIGKILL is sent down the same chain, connections are forcibly closed, and the forced termination is reported the same way. Finally kubectl prints pod "my-app" deleted.]
Key Coordination Points:
- Parallel Updates: When the API server marks a pod for deletion, both the Controller Manager and Endpoint Controller are notified simultaneously
- Endpoint State Transition: The endpoint is immediately marked as terminating: true, ready: false, serving: true
- Load Balancer Coordination: New traffic is redirected while existing connections continue
- Connection Draining: Active clients can complete their requests during the grace period
The sequence diagram above illustrates how networking events happen in parallel with process termination to achieve zero-downtime deployments.