Part 5: Prometheus Graceful Shutdown
- Part 1: Signals and Linux
- Part 2: Containers and signals
- Part 3: Graceful shutdown of K8S pods
- Part 4: Celery Graceful Shutdown
- Part 5: Prometheus Graceful Shutdown [you’re here]
- Part 6: Other frameworks and libraries [WIP]
AI usage disclaimer
Disclaimer: this article is experimental in that it leans heavily on an AI agent for the heavy lifting: extracting data, following the article structure, and proofreading, since these articles are getting fairly procedural. Here’s the prompt. The output still requires tinkering and verification, as the agent sometimes provides wrong code links and builds a narrative structure that doesn’t make much sense to a human reader.
Prometheus graceful shutdown
Databases are complex beasts: they have multiple components and usually elevated requirements for data consistency, even when something goes very wrong.
Let’s take a look at how the Prometheus monitoring system and time-series database implements graceful shutdown. As a quick reminder, Prometheus is a monitoring system that continuously collects metrics from thousands of services, stores them in an optimized time-series database, and provides a powerful query language for analysis. It scrapes HTTP endpoints every N seconds across your infrastructure, ingesting a multitude of metrics per second while maintaining strict-ish data consistency guarantees.
Prometheus operates through several components:
- scrape manager maintains concurrent HTTP connections to collect metrics from targets
- time-series database (TSDB) stores and indexes metric data with specialized on-disk formats
- service discovery mechanisms automatically detect new targets to monitor
- rule engine evaluates alerting conditions in real-time
- web interface serves the UI and API endpoints for queries and configuration
- PromQL query engine processes and executes time-series queries.
Each component runs independently, but they all need to be coordinated with one another so that no data is lost.
Prometheus implements this coordination using the oklog/run.Group
pattern, managing everything from database head block persistence to scrape manager connection draining. This analysis examines how Prometheus coordinates the shutdown sequence while preserving its data-durability guarantees.
Prometheus Architecture Overview
Prometheus operates as an orchestrated collection of specialized components, each managing a critical aspect of the metrics pipeline.
Each component runs independently but coordinates through the central run.Group
pattern, ensuring that when shutdown begins, all components can be stopped in the correct order without data loss or resource leaks.
The run.Group Coordination Pattern
Prometheus relies on the oklog/run.Group
library to implement its components’ graceful shutdown. This library solves a common problem in Go applications: how to run multiple long-lived services together and shut them all down gracefully when any one fails or receives a termination signal.
Instead of manually managing goroutines, channels, and shutdown logic, run.Group lets you register pairs of functions—one to execute the service, and one to interrupt it. When any service exits (either successfully or with an error), all services are automatically interrupted in a coordinated fashion.
Here’s a simple example to illustrate the pattern:
package main
import (
"context"
"fmt"
"net/http"
"os"
"os/signal"
"syscall"
"time"
"github.com/oklog/run"
)
func main() {
var g run.Group
// HTTP server component
{
server := &http.Server{Addr: ":8080"}
g.Add(func() error {
fmt.Println("Starting HTTP server on :8080")
return server.ListenAndServe()
}, func(error) {
fmt.Println("Shutting down HTTP server...")
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
server.Shutdown(ctx)
})
}
// Background worker component
{
ctx, cancel := context.WithCancel(context.Background())
g.Add(func() error {
fmt.Println("Starting background worker")
ticker := time.NewTicker(2 * time.Second)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
fmt.Println("Worker processing...")
}
}
}, func(error) {
fmt.Println("Stopping background worker...")
cancel()
})
}
// Signal handling component
{
term := make(chan os.Signal, 1)
signal.Notify(term, os.Interrupt, syscall.SIGTERM)
cancel := make(chan struct{})
g.Add(func() error {
select {
case sig := <-term:
fmt.Printf("Received signal: %v\n", sig)
return nil
case <-cancel:
return nil
}
}, func(error) {
close(cancel)
})
}
// Run all components - blocks until any component exits
if err := g.Run(); err != nil {
fmt.Printf("Application stopped with error: %v\n", err)
}
fmt.Println("Application shutdown complete")
}
In this example, pressing Ctrl+C causes the signal handler to exit, which triggers coordinated shutdown of both the HTTP server and the background worker.
Now let’s see how Prometheus applies this pattern:
Termination handler (cmd/prometheus/main.go:1065-1090):
var g run.Group
{
// Termination handler.
term := make(chan os.Signal, 1)
signal.Notify(term, os.Interrupt, syscall.SIGTERM)
cancel := make(chan struct{})
g.Add(
func() error {
// Don't forget to release the reloadReady channel so that waiting blocks can exit normally.
select {
case sig := <-term:
logger.Warn("Received an OS signal, exiting gracefully...", "signal", sig.String())
reloadReady.Close()
case <-webHandler.Quit():
logger.Warn("Received termination request via web service, exiting gracefully...")
case <-cancel:
reloadReady.Close()
}
return nil
},
func(_ error) {
close(cancel)
webHandler.SetReady(web.Stopping)
notifs.AddNotification(notifications.ShuttingDown)
},
)
}
The termination handler demonstrates several key patterns:
- Multiple shutdown triggers: OS signals, web service requests, or internal cancellation
- State coordination: Closing reloadReady ensures waiting components can exit cleanly
- Observer pattern: Web handler state changes use the notification mechanism to inform interested consumers
- Clean resource management: The interrupt function handles cleanup consistently
Component Shutdown Orchestration
Scrape Manager: Connection Draining and In-Flight Handling
The scrape manager faces a particularly complex shutdown challenge: it must coordinate the termination of potentially thousands of active HTTP scrapes while ensuring no data loss.
Scrape manager registration (cmd/prometheus/main.go:1132-1155):
{
// Scrape manager.
g.Add(
func() error {
<-reloadReady.C
err := scrapeManager.Run(discoveryManagerScrape.SyncCh())
logger.Info("Scrape manager stopped")
return err
},
func(_ error) {
// Scrape manager needs to be stopped before closing the local TSDB
// so that it doesn't try to write samples to a closed storage.
logger.Info("Stopping scrape manager...")
scrapeManager.Stop()
},
)
}
Manager.Stop() coordination (scrape/manager.go:229-237):
// Stop cancels all running scrape pools and blocks until all have exited.
func (m *Manager) Stop() {
m.mtxScrape.Lock()
defer m.mtxScrape.Unlock()
for _, sp := range m.scrapePools {
sp.stop()
}
close(m.graceShut)
}
The manager coordinates shutdown across all scrape pools, ensuring thread-safe access and blocking until complete termination.
Scrape pool shutdown (scrape/scrape.go:450-470):
// stop terminates all scrape loops and returns after they all terminated.
func (sp *scrapePool) stop() {
sp.mtx.Lock()
defer sp.mtx.Unlock()
sp.cancel()
var wg sync.WaitGroup
sp.targetMtx.Lock()
for fp, l := range sp.loops {
wg.Add(1)
go func(l loop) {
l.stop()
wg.Done()
}(l)
delete(sp.loops, fp)
delete(sp.activeTargets, fp)
}
sp.targetMtx.Unlock()
wg.Wait()
sp.client.CloseIdleConnections()
}
The scrape pool shutdown reveals the following coordination:
- Context cancellation stops all loops via shared context
- Synchronization with WaitGroup ensures complete termination of the pool
- Connection cleanup closes idle HTTP connections
- Resource cleanup removes internal tracking data
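The loops themselves typically follow a cancel-and-wait idiom. The snippet below is not the actual scrapeLoop code, just a minimal sketch of the pattern: each loop owns a stopped channel that it closes on exit, so stop() can cancel the context and then block until the in-flight iteration has finished (all names here are hypothetical).
package main

import (
	"context"
	"fmt"
	"time"
)

// loop is a hypothetical stand-in for a single scrape loop.
type loop struct {
	cancel  context.CancelFunc
	stopped chan struct{} // closed by run() once the loop has fully exited
}

func newLoop(ctx context.Context) *loop {
	ctx, cancel := context.WithCancel(ctx)
	l := &loop{cancel: cancel, stopped: make(chan struct{})}
	go l.run(ctx)
	return l
}

func (l *loop) run(ctx context.Context) {
	defer close(l.stopped) // signal full termination, including the final iteration
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			fmt.Println("scraping target...")
		}
	}
}

// stop cancels the loop and blocks until it has actually returned,
// mirroring the "returns after they all terminated" contract above.
func (l *loop) stop() {
	l.cancel()
	<-l.stopped
}

func main() {
	l := newLoop(context.Background())
	time.Sleep(2500 * time.Millisecond)
	l.stop()
	fmt.Println("loop stopped")
}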
TSDB Storage: Data Persistence and Resource Coordination
The TSDB component faces the most complex shutdown requirements, needing to persist in-memory data, coordinate background compactions, and ensure clean WAL closure.
TSDB registration (cmd/prometheus/main.go:1295-1320):
if !agentMode {
// TSDB.
opts := cfg.tsdb.ToTSDBOptions()
cancel := make(chan struct{})
g.Add(
func() error {
logger.Info("Starting TSDB ...")
// ... TSDB initialization code ...
close(dbOpen)
<-cancel
return nil
},
func(_ error) {
if err := fanoutStorage.Close(); err != nil {
logger.Error("Error stopping storage", "err", err)
}
close(cancel)
},
)
}
The TSDB shutdown is triggered through fanoutStorage.Close(), which coordinates multiple storage backends and ensures proper resource cleanup.
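The real fanout storage lives in Prometheus’s storage package; as a rough sketch of the idea (the types and names below are made up for illustration), a fanout-style Close walks the primary backend and every secondary, aggregating errors instead of stopping at the first failure:
package main

import (
	"errors"
	"fmt"
)

// Storage is a hypothetical minimal interface for a storage backend.
type Storage interface {
	Close() error
}

// fanout is an illustrative stand-in for a fan-out storage wrapper.
type fanout struct {
	primary     Storage
	secondaries []Storage
}

// Close shuts down the primary backend and every secondary,
// collecting errors rather than aborting on the first one.
func (f *fanout) Close() error {
	errs := []error{f.primary.Close()}
	for _, s := range f.secondaries {
		errs = append(errs, s.Close())
	}
	return errors.Join(errs...) // nil entries are dropped
}

type noopStorage struct{ name string }

func (n noopStorage) Close() error {
	fmt.Println("closing", n.name)
	return nil
}

func main() {
	f := &fanout{
		primary:     noopStorage{"local TSDB"},
		secondaries: []Storage{noopStorage{"remote write"}},
	}
	fmt.Println("close error:", f.Close())
}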
DB.Close() coordination (tsdb/db.go:1023-1043):
func (db *DB) Close() error {
close(db.stopc) // Signal shutdown to background processes
if db.compactCancel != nil {
db.compactCancel() // Cancel ongoing compactions
}
<-db.donec // Wait for background processes to finish
db.mtx.Lock()
defer db.mtx.Unlock()
var g errgroup.Group
// blocks also contains all head blocks.
for _, pb := range db.blocks {
g.Go(pb.Close) // Close all block readers in parallel
}
errs := tsdb_errors.NewMulti(g.Wait(), db.locker.Release())
if db.head != nil {
errs.Add(db.head.Close()) // Close the head block last
}
return errs.Err()
}
The TSDB shutdown sequence demonstrates careful resource coordination:
- Background process signaling via channel closure
- Compaction cancellation prevents data corruption from interrupted operations
- Synchronous waiting ensures background goroutines complete
- Parallel block closure uses errgroup for concurrent operations
- Head block priority ensures memory data is persisted last
- Error aggregation collects all errors without stopping cleanup
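The parallel block closure relies on the errgroup idiom from golang.org/x/sync. A self-contained sketch of that idiom, with a hypothetical block type standing in for a TSDB block reader:
package main

import (
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

// block is a hypothetical resource with a Close method.
type block struct{ name string }

func (b *block) Close() error {
	time.Sleep(100 * time.Millisecond) // pretend closing takes a while
	fmt.Println("closed", b.name)
	return nil
}

func main() {
	blocks := []*block{{"block-01"}, {"block-02"}, {"block-03"}}

	var g errgroup.Group
	for _, b := range blocks {
		g.Go(b.Close) // each Close runs in its own goroutine
	}

	// Wait blocks until every Close has returned and reports the first non-nil error.
	if err := g.Wait(); err != nil {
		fmt.Println("close error:", err)
	}
}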
Head block closure (tsdb/head.go:1712-1730):
func (h *Head) Close() error {
h.closedMtx.Lock()
defer h.closedMtx.Unlock()
h.closed = true
h.mmapHeadChunks()
errs := tsdb_errors.NewMulti(h.chunkDiskMapper.Close())
if h.wal != nil {
errs.Add(h.wal.Close())
}
if h.wbl != nil {
errs.Add(h.wbl.Close()) // Close out-of-order WAL
}
if errs.Err() == nil && h.opts.EnableMemorySnapshotOnShutdown {
errs.Add(h.performChunkSnapshot()) // Optional memory snapshot
}
return errs.Err()
}
The head block closure is critical for data integrity:
- Memory mapping ensures all in-memory chunks are persisted to disk
- WAL closure guarantees write-ahead log consistency
- Out-of-order handling manages the separate WBL for late-arriving samples
- Optional snapshots provide additional recovery mechanisms
- State management prevents further operations on closed head
Advanced Coordination Mechanisms
Ready State Management
Prometheus uses a channel-based state system to coordinate component startup and shutdown dependencies.
Ready state coordination (cmd/prometheus/main.go:1034-1050):
// Wait until the server is ready to handle reloading.
reloadReady := &closeOnce{
C: make(chan struct{}),
}
reloadReady.Close = func() {
reloadReady.once.Do(func() {
close(reloadReady.C)
})
}
Configuration-dependent components wait on <-reloadReady.C
before starting their main loops, ensuring they don’t start until the initial configuration is successfully loaded.
Rule Manager startup coordination (cmd/prometheus/main.go:1123-1131):
// Rule manager.
g.Add(
func() error {
<-reloadReady.C
ruleManager.Run()
return nil
},
func(_ error) {
ruleManager.Stop()
},
)
Notifier Manager startup coordination (cmd/prometheus/main.go:1413-1427):
// Notifier.
g.Add(
func() error {
// When the notifier manager receives a new targets list
// it needs to read a valid config for each job.
// It depends on the config being in sync with the discovery manager
// so we wait until the config is fully loaded.
<-reloadReady.C
notifierManager.Run(discoveryManagerNotify.SyncCh())
return nil
},
func(_ error) {
notifierManager.Stop()
},
)
Infrastructure components (discovery managers, storage, and the web handler) start immediately to provide the foundation services needed for configuration loading and component coordination. reloadReady.Close() is called only after the initial configuration has been loaded successfully, ensuring all waiting components start with a valid configuration.
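The wiring is easiest to see in a condensed form. The sketch below is not the actual Prometheus bootstrap code, just a minimal reproduction of the idea with a placeholder loadConfig: one actor waits on reloadReady.C, another closes it only after a successful configuration load.
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/oklog/run"
)

// closeOnce mirrors the shape used above: a channel that may be closed from
// several code paths but must only ever be closed once.
type closeOnce struct {
	C     chan struct{}
	once  sync.Once
	Close func()
}

// loadConfig stands in for the real initial configuration loading step.
func loadConfig() error {
	time.Sleep(200 * time.Millisecond)
	fmt.Println("configuration loaded")
	return nil
}

func main() {
	var g run.Group

	reloadReady := &closeOnce{C: make(chan struct{})}
	reloadReady.Close = func() {
		reloadReady.once.Do(func() { close(reloadReady.C) })
	}

	// A configuration-dependent component: it refuses to do any work until the
	// initial configuration has been loaded, and exits right away afterwards so
	// that this example terminates on its own.
	g.Add(func() error {
		<-reloadReady.C
		fmt.Println("component started with a valid configuration")
		return nil
	}, func(error) {
		reloadReady.Close() // release the waiter if we are interrupted before the config loads
	})

	// The initial configuration loader: only a successful load releases the
	// waiting components by closing reloadReady.
	{
		cancel := make(chan struct{})
		g.Add(func() error {
			if err := loadConfig(); err != nil {
				return err
			}
			reloadReady.Close()
			<-cancel
			return nil
		}, func(error) {
			reloadReady.Close() // never leave waiters blocked on shutdown
			close(cancel)
		})
	}

	fmt.Println("g.Run() returned:", g.Run())
}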
WAL Persistence Guarantees
The Write-Ahead Log implementation ensures data durability through careful shutdown sequencing.
WAL persistence (tsdb/wlog/wlog.go:837-870):
func (w *WL) Close() (err error) {
w.mtx.Lock()
defer w.mtx.Unlock()
if w.closed {
return errors.New("wlog already closed")
}
// Flush the last page and zero out all its remaining size.
if w.page.alloc > 0 {
if err := w.flushPage(true); err != nil {
return err
}
}
donec := make(chan struct{})
w.stopc <- donec // Signal write goroutine to stop
<-donec // Wait for write goroutine to finish
if err = w.fsync(w.segment); err != nil {
w.logger.Error("sync previous segment", "err", err)
}
return nil
}
The WAL shutdown ensures data durability through:
- Final page flushing - guarantees all buffered writes reach disk
- Goroutine coordination - ensures the background writer completes
- Filesystem sync - forces kernel buffers to persistent storage
- Error handling - logs but continues with cleanup on sync failures
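The same flush-then-fsync discipline applies to any append-only file, not just Prometheus’s WAL. A generic, minimal sketch (the appendLog type is invented for illustration):
package main

import (
	"bufio"
	"log"
	"os"
)

// appendLog illustrates the flush-then-fsync discipline: buffered writes must
// be flushed to the file and then fsynced before Close, otherwise a crash
// right after Close could still lose the tail of the log.
type appendLog struct {
	f *os.File
	w *bufio.Writer
}

func openAppendLog(path string) (*appendLog, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &appendLog{f: f, w: bufio.NewWriter(f)}, nil
}

func (l *appendLog) Write(rec []byte) error {
	_, err := l.w.Write(append(rec, '\n'))
	return err
}

func (l *appendLog) Close() error {
	if err := l.w.Flush(); err != nil { // flush the in-memory buffer to the kernel
		return err
	}
	if err := l.f.Sync(); err != nil { // force kernel buffers onto persistent storage
		return err
	}
	return l.f.Close()
}

func main() {
	l, err := openAppendLog("example.wal")
	if err != nil {
		log.Fatal(err)
	}
	if err := l.Write([]byte("sample 42")); err != nil {
		log.Fatal(err)
	}
	if err := l.Close(); err != nil {
		log.Fatal(err)
	}
}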
Key Design Insights
The run.Group Pattern
Prometheus demonstrates how the oklog/run.Group library can coordinate a complex system. Each component provides:
- Execution function: Runs the component until termination
- Interrupt function: Handles cleanup and resource release
- Coordinated ordering: When any component exits, the interrupt functions of all components are invoked; Prometheus arranges registration and blocking interrupt functions so that, for example, the scrape manager stops before the TSDB closes
- Error propagation: Any component failure triggers system-wide shutdown
This pattern eliminates the complexity of manual goroutine management and ensures consistent cleanup behavior across all components.
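Error propagation in particular is easy to demonstrate in isolation: the moment any actor returns an error, every other actor’s interrupt function runs and g.Run() returns that first error. A tiny sketch:
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/oklog/run"
)

func main() {
	var g run.Group

	// A long-lived worker that only stops when asked to.
	{
		stop := make(chan struct{})
		g.Add(func() error {
			<-stop
			fmt.Println("worker: interrupted, exiting")
			return nil
		}, func(err error) {
			fmt.Println("worker: interrupt called because:", err)
			close(stop)
		})
	}

	// A component that fails shortly after startup.
	{
		g.Add(func() error {
			time.Sleep(100 * time.Millisecond)
			return errors.New("database connection lost")
		}, func(error) {
			// Nothing to clean up for this toy component.
		})
	}

	// Run blocks until the failing component returns, interrupts the worker,
	// waits for it to exit, and then reports the original error.
	fmt.Println("g.Run() returned:", g.Run())
}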
Data Integrity Guarantees
Prometheus maintains strict data integrity during shutdown through:
- WAL flushing - ensures all writes are persistent
- Memory mapping - persists in-memory chunks to disk
- Compaction cancellation - prevents corruption from interrupted operations
- Error aggregation - continues cleanup even if individual components fail
- Stale marker generation - properly handles metrics that stop being scraped
Resource Management
The implementation demonstrates excellent resource management:
- HTTP connection draining - closes idle connections properly
- Goroutine coordination - ensures clean termination of all background processes
- Memory cleanup - releases series data and internal caches
- File handle management - closes all disk resources consistently
- Lock release - frees directory locks for restart capability
References
Code Sources
- Prometheus Main Entry Point - Primary coordination logic and run.Group setup
- oklog/run Library - Goroutine coordination library used by Prometheus
- Scrape Manager Implementation - HTTP connection draining and pool coordination
- Scrape Pool Shutdown - Individual scrape loop termination logic
- TSDB Database Shutdown - Database closure and resource cleanup
- TSDB Head Block Management - In-memory data persistence during shutdown
- Write-Ahead Log Closure - WAL shutdown and data durability guarantees
- Ready State Coordination - Component startup synchronization mechanism
Documentation
- Prometheus Architecture Documentation - Official architecture overview
- Go sync.WaitGroup Documentation - Goroutine synchronization primitives
- Go errgroup Documentation - Error handling in concurrent operations