Part 5: Prometheus Graceful Shutdown
- Part 1: Signals and Linux
- Part 2: Containers and signals
- Part 3: Graceful shutdown of K8S pods
- Part 4: Celery Graceful Shutdown
- Part 5: Prometheus Graceful Shutdown [you’re here]
- Part 6: Other frameworks and libraries [WIP]
AI usage disclaimer
Disclaimer: this article is experimental in that it leans heavily on an AI agent for the heavy lifting: extracting data, following the article structure, and proofreading, since these articles are getting fairly procedural. Here’s the prompt. The output still requires tinkering and verification, as the agent sometimes provides wrong code links and builds a narrative structure that doesn’t make much sense to a human reader.
Prometheus graceful shutdown
Databases are complex beasts: they have multiple components and usually elevated requirements for data consistency, even when something goes very wrong.
Let’s take a look at how the Prometheus monitoring system and time-series database implements graceful shutdown. As a quick reminder, Prometheus is a monitoring system that continuously collects metrics from thousands of services, stores them in an optimized time-series database, and provides a powerful query language for analysis. It scrapes HTTP endpoints every N seconds across your infrastructure, ingesting a multitude of metrics per second while maintaining strict-ish data consistency guarantees.
Prometheus operates through several components:
- scrape manager maintains concurrent HTTP connections to collect metrics from targets
- time-series database (TSDB) stores and indexes metric data with specialized on-disk formats
- service discovery mechanisms automatically detect new targets to monitor
- rule engine evaluates alerting conditions in real-time
- web interface serves the UI and API endpoints for queries and configuration
- PromQL query engine processes and executes time-series queries.
Each component runs independently, but they all need to be coordinated with one another so that no data is lost.
Prometheus implements this coordination using the oklog/run.Group
pattern, managing everything from database head block persistence to scrape manager connection draining. This analysis examines how Prometheus coordinates the shutdown sequence while preserving its data-durability guarantees.
Prometheus Architecture Overview
Prometheus operates as an orchestrated collection of specialized components, each managing a critical aspect of the metrics pipeline.
Each component runs independently but coordinates through the central run.Group
pattern, ensuring that when shutdown begins, all components can be stopped in the correct order without data loss or resource leaks.
The run.Group Coordination Pattern
Prometheus relies on the oklog/run.Group
library to implement its components’ graceful shutdown. This library solves a common problem in Go applications: how to run multiple long-lived services together and shut them all down gracefully when any one fails or receives a termination signal.
Instead of manually managing goroutines, channels, and shutdown logic, run.Group lets you register pairs of functions—one to execute the service, and one to interrupt it. When any service exits (either successfully or with an error), all services are automatically interrupted in a coordinated fashion.
Here’s a simple example to illustrate the pattern:
package main
import (
"context"
"fmt"
"net/http"
"os"
"os/signal"
"syscall"
"time"
"github.com/oklog/run"
)
func main() {
var g run.Group
// HTTP server component
{
server := &http.Server{Addr: ":8080"}
g.Add(func() error {
fmt.Println("Starting HTTP server on :8080")
return server.ListenAndServe()
}, func(error) {
fmt.Println("Shutting down HTTP server...")
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
server.Shutdown(ctx)
})
}
// Background worker component
{
ctx, cancel := context.WithCancel(context.Background())
g.Add(func() error {
fmt.Println("Starting background worker")
ticker := time.NewTicker(2 * time.Second)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
fmt.Println("Worker processing...")
}
}
}, func(error) {
fmt.Println("Stopping background worker...")
cancel()
})
}
// Signal handling component
{
term := make(chan os.Signal, 1)
signal.Notify(term, os.Interrupt, syscall.SIGTERM)
cancel := make(chan struct{})
g.Add(func() error {
select {
case sig := <-term:
fmt.Printf("Received signal: %v\n", sig)
return nil
case <-cancel:
return nil
}
}, func(error) {
close(cancel)
})
}
// Run all components - blocks until any component exits
if err := g.Run(); err != nil {
fmt.Printf("Application stopped with error: %v\n", err)
}
fmt.Println("Application shutdown complete")
}
In this example, pressing Ctrl+C causes the signal handler to exit, which triggers coordinated shutdown of both the HTTP server and the background worker.
Now let’s see how Prometheus applies this pattern:
Termination handler (cmd/prometheus/main.go:1065-1090):
var g run.Group
{
// Termination handler.
term := make(chan os.Signal, 1)
signal.Notify(term, os.Interrupt, syscall.SIGTERM)
cancel := make(chan struct{})
g.Add(
func() error {
// Don't forget to release the reloadReady channel so that waiting blocks can exit normally.
select {
case sig := <-term:
logger.Warn("Received an OS signal, exiting gracefully...", "signal", sig.String())
reloadReady.Close()
case <-webHandler.Quit():
logger.Warn("Received termination request via web service, exiting gracefully...")
case <-cancel:
reloadReady.Close()
}
return nil
},
func(_ error) {
close(cancel)
webHandler.SetReady(web.Stopping)
notifs.AddNotification(notifications.ShuttingDown)
},
)
}
The termination handler demonstrates several key patterns:
- Multiple shutdown triggers: OS signals, web service requests, or internal cancellation
- State coordination: Closing reloadReady ensures waiting components can exit cleanly
- Observer pattern: Web handler state changes use the notification mechanism to inform interested consumers
- Clean resource management: The interrupt function handles cleanup consistently
Component Shutdown Orchestration
Scrape Manager: Connection Draining and In-Flight Handling
The scrape manager faces a particularly complex shutdown challenge: it must coordinate the termination of potentially thousands of active HTTP scrapes while ensuring no data loss.
Scrape manager registration (cmd/prometheus/main.go:1132-1155):
{
// Scrape manager.
g.Add(
func() error {
<-reloadReady.C
err := scrapeManager.Run(discoveryManagerScrape.SyncCh())
logger.Info("Scrape manager stopped")
return err
},
func(_ error) {
// Scrape manager needs to be stopped before closing the local TSDB
// so that it doesn't try to write samples to a closed storage.
logger.Info("Stopping scrape manager...")
scrapeManager.Stop()
},
)
}
Manager.Stop() coordination (scrape/manager.go:229-237):
// Stop cancels all running scrape pools and blocks until all have exited.
func (m *Manager) Stop() {
m.mtxScrape.Lock()
defer m.mtxScrape.Unlock()
for _, sp := range m.scrapePools {
sp.stop()
}
close(m.graceShut)
}
The manager coordinates shutdown across all scrape pools, ensuring thread-safe access and blocking until complete termination.
Scrape pool shutdown (scrape/scrape.go:450-470):
// stop terminates all scrape loops and returns after they all terminated.
func (sp *scrapePool) stop() {
sp.mtx.Lock()
defer sp.mtx.Unlock()
sp.cancel()
var wg sync.WaitGroup
sp.targetMtx.Lock()
for fp, l := range sp.loops {
wg.Add(1)
go func(l loop) {
l.stop()
wg.Done()
}(l)
delete(sp.loops, fp)
delete(sp.activeTargets, fp)
}
sp.targetMtx.Unlock()
wg.Wait()
sp.client.CloseIdleConnections()
}
The scrape pool shutdown reveals the following coordination:
- Context cancellation stops all loops via shared context
- Synchronization with WaitGroup ensures complete termination of the pool
- Connection cleanup closes idle HTTP connections
- Resource cleanup removes internal tracking data
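The loops themselves typically follow a cancel-and-wait idiom. The snippet below is not the actual scrapeLoop code, just a minimal sketch of the pattern: each loop owns a stopped channel that it closes on exit, so stop() can cancel the context and then block until the in-flight iteration has finished (all names here are hypothetical).
package main

import (
	"context"
	"fmt"
	"time"
)

// loop is a hypothetical stand-in for a single scrape loop.
type loop struct {
	cancel  context.CancelFunc
	stopped chan struct{} // closed by run() once the loop has fully exited
}

func newLoop(ctx context.Context) *loop {
	ctx, cancel := context.WithCancel(ctx)
	l := &loop{cancel: cancel, stopped: make(chan struct{})}
	go l.run(ctx)
	return l
}

func (l *loop) run(ctx context.Context) {
	defer close(l.stopped) // signal full termination, including the final iteration
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			fmt.Println("scraping target...")
		}
	}
}

// stop cancels the loop and blocks until it has actually returned,
// mirroring the "returns after they all terminated" contract above.
func (l *loop) stop() {
	l.cancel()
	<-l.stopped
}

func main() {
	l := newLoop(context.Background())
	time.Sleep(2500 * time.Millisecond)
	l.stop()
	fmt.Println("loop stopped")
}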
TSDB Storage: Data Persistence and Resource Coordination
The TSDB component faces the most complex shutdown requirements, needing to persist in-memory data, coordinate background compactions, and ensure clean WAL closure.
TSDB registration (cmd/prometheus/main.go:1295-1320):
if !agentMode {
// TSDB.
opts := cfg.tsdb.ToTSDBOptions()
cancel := make(chan struct{})
g.Add(
func() error {
logger.Info("Starting TSDB ...")
// ... TSDB initialization code ...
close(dbOpen)
<-cancel
return nil
},
func(_ error) {
if err := fanoutStorage.Close(); err != nil {
logger.Error("Error stopping storage", "err", err)
}
close(cancel)
},
)
}
The TSDB shutdown is triggered through fanoutStorage.Close(), which coordinates multiple storage backends and ensures proper resource cleanup.
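The real fanout storage lives in Prometheus’s storage package; as a rough sketch of the idea (the types and names below are made up for illustration), a fanout-style Close walks the primary backend and every secondary, aggregating errors instead of stopping at the first failure:
package main

import (
	"errors"
	"fmt"
)

// Storage is a hypothetical minimal interface for a storage backend.
type Storage interface {
	Close() error
}

// fanout is an illustrative stand-in for a fan-out storage wrapper.
type fanout struct {
	primary     Storage
	secondaries []Storage
}

// Close shuts down the primary backend and every secondary,
// collecting errors rather than aborting on the first one.
func (f *fanout) Close() error {
	errs := []error{f.primary.Close()}
	for _, s := range f.secondaries {
		errs = append(errs, s.Close())
	}
	return errors.Join(errs...) // nil entries are dropped
}

type noopStorage struct{ name string }

func (n noopStorage) Close() error {
	fmt.Println("closing", n.name)
	return nil
}

func main() {
	f := &fanout{
		primary:     noopStorage{"local TSDB"},
		secondaries: []Storage{noopStorage{"remote write"}},
	}
	fmt.Println("close error:", f.Close())
}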
DB.Close() coordination (tsdb/db.go:1023-1043):
func (db *DB) Close() error {
close(db.stopc) // Signal shutdown to background processes
if db.compactCancel != nil {
db.compactCancel() // Cancel ongoing compactions
}
<-db.donec // Wait for background processes to finish
db.mtx.Lock()
defer db.mtx.Unlock()
var g errgroup.Group
// blocks also contains all head blocks.
for _, pb := range db.blocks {
g.Go(pb.Close) // Close all block readers in parallel
}
errs := tsdb_errors.NewMulti(g.Wait(), db.locker.Release())
if db.head != nil {
errs.Add(db.head.Close()) // Close the head block last
}
return errs.Err()
}
The TSDB shutdown sequence demonstrates careful resource coordination:
- Background process signaling via channel closure
- Compaction cancellation prevents data corruption from interrupted operations
- Synchronous waiting ensures background goroutines complete
- Parallel block closure uses errgroup for concurrent operations
- Head block priority ensures memory data is persisted last
- Error aggregation collects all errors without stopping cleanup
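The parallel block closure relies on the errgroup idiom from golang.org/x/sync. A self-contained sketch of that idiom, with a hypothetical block type standing in for a TSDB block reader:
package main

import (
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

// block is a hypothetical resource with a Close method.
type block struct{ name string }

func (b *block) Close() error {
	time.Sleep(100 * time.Millisecond) // pretend closing takes a while
	fmt.Println("closed", b.name)
	return nil
}

func main() {
	blocks := []*block{{"block-01"}, {"block-02"}, {"block-03"}}

	var g errgroup.Group
	for _, b := range blocks {
		g.Go(b.Close) // each Close runs in its own goroutine
	}

	// Wait blocks until every Close has returned and reports the first non-nil error.
	if err := g.Wait(); err != nil {
		fmt.Println("close error:", err)
	}
}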
Head block closure (tsdb/head.go:1712-1730):
func (h *Head) Close() error {
h.closedMtx.Lock()
defer h.closedMtx.Unlock()
h.closed = true
h.mmapHeadChunks()
errs := tsdb_errors.NewMulti(h.chunkDiskMapper.Close())
if h.wal != nil {
errs.Add(h.wal.Close())
}
if h.wbl != nil {
errs.Add(h.wbl.Close()) // Close out-of-order WAL
}
if errs.Err() == nil && h.opts.EnableMemorySnapshotOnShutdown {
errs.Add(h.performChunkSnapshot()) // Optional memory snapshot
}
return errs.Err()
}
The head block closure is critical for data integrity:
- Memory mapping ensures all in-memory chunks are persisted to disk
- WAL closure guarantees write-ahead log consistency
- Out-of-order handling manages the separate WBL for late-arriving samples
- Optional snapshots provide additional recovery mechanisms
- State management prevents further operations on closed head
Advanced Coordination Mechanisms
Ready State Management
Prometheus uses a channel-based state system to coordinate component startup and shutdown dependencies.
Ready state coordination (cmd/prometheus/main.go:1034-1050):
// Wait until the server is ready to handle reloading.
reloadReady := &closeOnce{
C: make(chan struct{}),
}
reloadReady.Close = func() {
reloadReady.once.Do(func() {
close(reloadReady.C)
})
}
Configuration-dependent components wait on <-reloadReady.C
before starting their main loops, ensuring they don’t start until the initial configuration is successfully loaded.
Rule Manager startup coordination (cmd/prometheus/main.go:1123-1131):
// Rule manager.
g.Add(
func() error {
<-reloadReady.C
ruleManager.Run()
return nil
},
func(_ error) {
ruleManager.Stop()
},
)
Notifier Manager startup coordination (cmd/prometheus/main.go:1413-1427):
// Notifier.
g.Add(
func() error {
// When the notifier manager receives a new targets list
// it needs to read a valid config for each job.
// It depends on the config being in sync with the discovery manager
// so we wait until the config is fully loaded.
<-reloadReady.C
notifierManager.Run(discoveryManagerNotify.SyncCh())
return nil
},
func(_ error) {
notifierManager.Stop()
},
)
Infrastructure components (discovery managers, storage, and the web handler) start immediately to provide the foundation services needed for configuration loading and component coordination. reloadReady.Close() is called only after the initial configuration has been loaded successfully, ensuring all waiting components start with a valid configuration.
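The wiring is easiest to see in a condensed form. The sketch below is not the actual Prometheus bootstrap code, just a minimal reproduction of the idea with a placeholder loadConfig: one actor waits on reloadReady.C, another closes it only after a successful configuration load.
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/oklog/run"
)

// closeOnce mirrors the shape used above: a channel that may be closed from
// several code paths but must only ever be closed once.
type closeOnce struct {
	C     chan struct{}
	once  sync.Once
	Close func()
}

// loadConfig stands in for the real initial configuration loading step.
func loadConfig() error {
	time.Sleep(200 * time.Millisecond)
	fmt.Println("configuration loaded")
	return nil
}

func main() {
	var g run.Group

	reloadReady := &closeOnce{C: make(chan struct{})}
	reloadReady.Close = func() {
		reloadReady.once.Do(func() { close(reloadReady.C) })
	}

	// A configuration-dependent component: it refuses to do any work until the
	// initial configuration has been loaded, and exits right away afterwards so
	// that this example terminates on its own.
	g.Add(func() error {
		<-reloadReady.C
		fmt.Println("component started with a valid configuration")
		return nil
	}, func(error) {
		reloadReady.Close() // release the waiter if we are interrupted before the config loads
	})

	// The initial configuration loader: only a successful load releases the
	// waiting components by closing reloadReady.
	{
		cancel := make(chan struct{})
		g.Add(func() error {
			if err := loadConfig(); err != nil {
				return err
			}
			reloadReady.Close()
			<-cancel
			return nil
		}, func(error) {
			reloadReady.Close() // never leave waiters blocked on shutdown
			close(cancel)
		})
	}

	fmt.Println("g.Run() returned:", g.Run())
}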
WAL Persistence Guarantees
The Write-Ahead Log implementation ensures data durability through careful shutdown sequencing.
WAL persistence (tsdb/wlog/wlog.go:837-870):
func (w *WL) Close() (err error) {
w.mtx.Lock()
defer w.mtx.Unlock()
if w.closed {
return errors.New("wlog already closed")
}
// Flush the last page and zero out all its remaining size.
if w.page.alloc > 0 {
if err := w.flushPage(true); err != nil {
return err
}
}
donec := make(chan struct{})
w.stopc <- donec // Signal write goroutine to stop
<-donec // Wait for write goroutine to finish
if err = w.fsync(w.segment); err != nil {
w.logger.Error("sync previous segment", "err", err)
}
return nil
}
The WAL shutdown ensures data durability through:
- Final page flushing - guarantees all buffered writes reach disk
- Goroutine coordination - ensures the background writer completes
- Filesystem sync - forces kernel buffers to persistent storage
- Error handling - logs but continues with cleanup on sync failures
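The same flush-then-fsync discipline applies to any append-only file, not just Prometheus’s WAL. A generic, minimal sketch (the appendLog type is invented for illustration):
package main

import (
	"bufio"
	"log"
	"os"
)

// appendLog illustrates the flush-then-fsync discipline: buffered writes must
// be flushed to the file and then fsynced before Close, otherwise a crash
// right after Close could still lose the tail of the log.
type appendLog struct {
	f *os.File
	w *bufio.Writer
}

func openAppendLog(path string) (*appendLog, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &appendLog{f: f, w: bufio.NewWriter(f)}, nil
}

func (l *appendLog) Write(rec []byte) error {
	_, err := l.w.Write(append(rec, '\n'))
	return err
}

func (l *appendLog) Close() error {
	if err := l.w.Flush(); err != nil { // flush the in-memory buffer to the kernel
		return err
	}
	if err := l.f.Sync(); err != nil { // force kernel buffers onto persistent storage
		return err
	}
	return l.f.Close()
}

func main() {
	l, err := openAppendLog("example.wal")
	if err != nil {
		log.Fatal(err)
	}
	if err := l.Write([]byte("sample 42")); err != nil {
		log.Fatal(err)
	}
	if err := l.Close(); err != nil {
		log.Fatal(err)
	}
}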
Key Design Insights
The run.Group Pattern
Prometheus demonstrates how the oklog/run.Group library can coordinate a complex system. Each component provides:
- Execution function: Runs the component until termination
- Interrupt function: Handles cleanup and resource release
- Coordinated ordering: When any component exits, the interrupt functions of all components are invoked; Prometheus arranges registration and blocking interrupt functions so that, for example, the scrape manager stops before the TSDB closes
- Error propagation: Any component failure triggers system-wide shutdown
This pattern eliminates the complexity of manual goroutine management and ensures consistent cleanup behavior across all components.
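Error propagation in particular is easy to demonstrate in isolation: the moment any actor returns an error, every other actor’s interrupt function runs and g.Run() returns that first error. A tiny sketch:
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/oklog/run"
)

func main() {
	var g run.Group

	// A long-lived worker that only stops when asked to.
	{
		stop := make(chan struct{})
		g.Add(func() error {
			<-stop
			fmt.Println("worker: interrupted, exiting")
			return nil
		}, func(err error) {
			fmt.Println("worker: interrupt called because:", err)
			close(stop)
		})
	}

	// A component that fails shortly after startup.
	{
		g.Add(func() error {
			time.Sleep(100 * time.Millisecond)
			return errors.New("database connection lost")
		}, func(error) {
			// Nothing to clean up for this toy component.
		})
	}

	// Run blocks until the failing component returns, interrupts the worker,
	// waits for it to exit, and then reports the original error.
	fmt.Println("g.Run() returned:", g.Run())
}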
Data Integrity Guarantees
Prometheus maintains strict data integrity during shutdown through:
- WAL flushing - ensures all writes are persistent
- Memory mapping - persists in-memory chunks to disk
- Compaction cancellation - prevents corruption from interrupted operations
- Error aggregation - continues cleanup even if individual components fail
- Stale marker generation - properly handles metrics that stop being scraped
Resource Management
The implementation demonstrates excellent resource management:
- HTTP connection draining - closes idle connections properly
- Goroutine coordination - ensures clean termination of all background processes
- Memory cleanup - releases series data and internal caches
- File handle management - closes all disk resources consistently
- Lock release - frees directory locks for restart capability
References
Code Sources
- Prometheus Main Entry Point - Primary coordination logic and run.Group setup
- oklog/run Library - Goroutine coordination library used by Prometheus
- Scrape Manager Implementation - HTTP connection draining and pool coordination
- Scrape Pool Shutdown - Individual scrape loop termination logic
- TSDB Database Shutdown - Database closure and resource cleanup
- TSDB Head Block Management - In-memory data persistence during shutdown
- Write-Ahead Log Closure - WAL shutdown and data durability guarantees
- Ready State Coordination - Component startup synchronization mechanism
Documentation
- Prometheus Architecture Documentation - Official architecture overview
- Go sync.WaitGroup Documentation - Goroutine synchronization primitives
- Go errgroup Documentation - Error handling in concurrent operations