How does a Prometheus Histogram work?

How does a Prometheus Histogram work? We looked previously at thecounter, gauge, and summary, how does the Prometheus histogram work? The histogram has several similarities to the summary. A histogram is a combination of various counters. Like summary metrics, histogram metrics are used to track the size of events, usually how long they take, via their observe method. There’s usually also the exact utilities to make it easy to time things as there are for summarys.
Read more

Switching between Prometheus servers in Grafana using data source variables

Variables in Grafana (previously known as templates) allow parameterisation of a dashboard via a drop-down menu. Often this is used to switch between machines or services, so that you can have per-machine dashboards without needing to create a dashboard every time a new machine appears. They’re also stored in URL parameters, so could be linked from alert notifications or wiki pages. Source:

How much disk space do Prometheus blocks use?

Memory for ingestion is just one part of the resources Prometheus uses, let’s look at disk blocks. Every 2 hours Prometheus compacts the data that has been buffered up in memory onto blocks on disk. This will include the chunks, indexes, tombstones, and various metadata. The main part of this should usually be the chunks themselves, and you can see how much space each sample takes on average by graphing: Combined with rate(prometheus_tsdb_head_samples_appended_total[1h]) for the samples ingested per second, you should have a good idea of how much disk space you need given your retention window.
Read more

How much RAM does Prometheus 2.x need for cardinality and ingestion?

Prometheus 2.x has a very different ingestion system to 1.x, with many performance improvements. This time I’m also going to take into account the cost of cardinality in the head block. To start with I took a profile of a Prometheus 2.9.2 ingesting from a single target with 100k unique time series: This gives a good starting point to find the relevant bits of code, but as my Prometheus has just started doesn’t have quite everything.
Read more

How Uber Monitors 4,000 Microservices

With 4,000 proprietary microservices and a growing number of open source systems that needed to be monitored, by late 2014 Uber was outgrowing its usage of Graphite and Nagios for metrics. They evaluated several technologies, including Atlas and OpenTSDB, but the fact that a growing number of open source systems were adding native support for the Prometheus Metrics Exporter format tipped the scales in that direction. Uber found with its use of Prometheus and M3, Uber’s storage costs for ingesting metrics became 8.
Read more

Optimising Prometheus 2.6.0 Memory Usage with pprof

There have been some reportsthat compaction was causing larger memory spikes than was desirable. I dug into this and improved it for Prometheus 2.6.0, so let’s see how. Firstly I wrote a test setup that created some samples for 100k time series, in a way that would require compaction. It would be nice to get the active heap usage at its peak, however we don’t know where in the code this peak is to take a profile then.
Read more

Monitoring Kubernetes, part 1: the challenges + data sources

Our industry has long been relying on microservice-based architecture to deliver software faster and safer. The advent and ubiquity of microservices naturally paved the way for container technology, empowering us to rethink how we build and deploy our applications. Docker exploded onto the scene in 2013, and, for companies focusing on modernizing their infrastructure and cloud migration, a tool like Docker is critical to shipping applications quickly, at scale. But, with that speed comes challenges — containers introduce a non-trivial level of complexity when it comes to orchestration.
Read more

Thanos: long-term storage for your Prometheus Metrics

Thanos is a project that turns your Prometheus installation into a highly available metric system with unlimited storage capacity. From a very high-level view, it does this by deploying a sidecar to Prometheus, which uploads the data blocks to any object storage. A store component downloads the blocks again and makes them accessible to a query component, which has the same API as Prometheus itself. This works nicely with Grafana because its the same API.
Read more

The Billion Data Point Challenge: Building a Query Engine for High Cardinality Time Series Data

Uber, like most large technology companies, relies extensively on metrics to effectively monitor its entire stack. From low-level system metrics, such as memory utilization of a host, to high-level business metrics, including the number of Uber Eats orders in a particular city, they allow our engineers to gain insight into how our services are operating on a daily basis. As our dimensionality and usage of metrics increases, common solutions like Prometheus and Graphite become difficult to manage and sometimes cease to work.
Read more