Prometheus
How does a Prometheus Histogram work?
We looked previously at the counter, gauge, and summary; how does the histogram work? The histogram has several similarities to the summary.
A histogram is a combination of various counters. Like summary metrics, histogram metrics are used to track the size of events, usually how long they take, via their observe method. There are usually also the same utilities to make it easy to time things as there are for summaries.
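As an illustration (this code is not from the article), the Python client exposes this directly; the metric name and bucket boundaries below are arbitrary examples:

```python
from prometheus_client import Histogram

# A histogram is effectively a set of counters: one per bucket,
# plus a _sum and a _count. Name and buckets are example values.
REQUEST_TIME = Histogram(
    "request_processing_seconds",
    "Time spent processing a request",
    buckets=[0.1, 0.5, 1.0, 5.0],
)

# Record an observation directly...
REQUEST_TIME.observe(0.42)

# ...or use the same timing utilities that summaries offer.
@REQUEST_TIME.time()
def handle_request():
    pass
```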
Read more
Switching between Prometheus servers in Grafana using data source variables
Variables in Grafana (previously known as templates) allow parameterisation of a dashboard via a drop-down menu. Often this is used to switch between machines or services, so that you can have per-machine dashboards without needing to create a dashboard every time a new machine appears. They’re also stored in URL parameters, so could be linked from alert notifications or wiki pages.
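As a small, hypothetical illustration of that last point (not taken from the article): Grafana records the selected value of a variable as a var-<name> URL parameter, so an alert notification or wiki page can link straight to a dashboard with a particular Prometheus server pre-selected. The dashboard path and variable name below are made-up placeholders:

```python
# Hypothetical example: build a dashboard link that pre-selects a data
# source variable. The dashboard UID/slug and variable name are made up.
grafana = "https://grafana.example.com/d/abc123/node-overview"
prom_server = "prometheus-emea"  # name of a configured Prometheus data source

url = f"{grafana}?var-datasource={prom_server}"
print(url)
```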
Read more
How much disk space do Prometheus blocks use?
Memory for ingestion is just one part of the resources Prometheus uses; let's look at disk blocks. Every 2 hours Prometheus compacts the data that has been buffered up in memory into blocks on disk. These blocks include the chunks, indexes, tombstones, and various metadata.
The main part of this should usually be the chunks themselves, and you can graph how much space each sample takes on average. Combined with rate(prometheus_tsdb_head_samples_appended_total[1h]) for the samples ingested per second, that should give you a good idea of how much disk space you need given your retention window. It's a bit more complicated though, as there are also indexes to consider. If you have lots of churn in your metrics, these can end up taking a non-trivial amount of space.
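As a rough back-of-the-envelope sketch of that calculation (the inputs are assumed example values, not figures from the article):

```python
# Rough disk-space estimate for Prometheus blocks. The inputs are example
# values you would read off the two graphs described above, not real figures.
bytes_per_sample = 1.5        # average on-disk bytes per sample in chunks
samples_per_second = 100_000  # rate(prometheus_tsdb_head_samples_appended_total[1h])
retention_days = 15           # Prometheus' default retention window

retention_seconds = retention_days * 24 * 60 * 60
chunk_bytes = bytes_per_sample * samples_per_second * retention_seconds

print(f"~{chunk_bytes / 1024**3:.0f} GiB for chunks alone, "
      "before indexes and churn are accounted for")
```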
Read more
How much RAM does Prometheus 2.x need for cardinality and ingestion?
Prometheus 2.x has a very different ingestion system to 1.x, with many performance improvements. This time I'm also going to take into account the cost of cardinality in the head block. To start with I took a profile of a Prometheus 2.9.2 ingesting from a single target with 100k unique time series. This gives a good starting point to find the relevant bits of code, but as my Prometheus has just started, it doesn't have quite everything.
Read more
How Uber Monitors 4,000 Microservices
With 4,000 proprietary microservices and a growing number of open source systems that needed to be monitored, by late 2014 Uber was outgrowing its usage of Graphite and Nagios for metrics. They evaluated several technologies, including Atlas and OpenTSDB, but the fact that a growing number of open source systems were adding native support for the Prometheus Metrics Exporter format tipped the scales in that direction. Uber found that with Prometheus and M3, its metrics storage became 8.53x more cost effective per metric per replica.
Read more
Optimising Prometheus 2.6.0 Memory Usage with pprof
There have been some reports that compaction was causing larger memory spikes than was desirable. I dug into this and improved it for Prometheus 2.6.0, so let's see how. Firstly I wrote a test setup that created some samples for 100k time series, in a way that would require compaction.
It would be nice to get the active heap usage at its peak; however, we don't know where in the code that peak happens, so we can't take a profile at exactly that point. Instead we'll use the total allocations, which should at least point us in the right direction and may also spot places where we can reduce the amount of garbage generated. As before, we can grab a profile from the running Prometheus once the work is done. There's no external indication that compactions have completed, so we'll cheat by sleeping.
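A minimal sketch of grabbing such a profile, assuming Prometheus is listening on localhost:9090 with its standard Go pprof endpoints enabled (the sleep duration is an arbitrary illustration, not the article's setup):

```python
import time
import urllib.request

# There is no signal that compaction has finished, so, as described above,
# we cheat by sleeping for a while before profiling. Arbitrary duration.
time.sleep(60)

# Prometheus exposes the standard Go pprof endpoints; the heap profile
# includes cumulative allocation data, which is what we care about here.
urllib.request.urlretrieve("http://localhost:9090/debug/pprof/heap", "heap.pprof")

# The downloaded profile can then be inspected with `go tool pprof heap.pprof`.
```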
Read more
Monitoring Kubernetes, part 1: the challenges + data sources
Our industry has long relied on microservice-based architectures to deliver software faster and more safely. The advent and ubiquity of microservices naturally paved the way for container technology, empowering us to rethink how we build and deploy our applications. Docker exploded onto the scene in 2013, and, for companies focusing on modernizing their infrastructure and cloud migration, a tool like Docker is critical to shipping applications quickly, at scale.
Read more
Thanos: long-term storage for your Prometheus Metrics
Thanos is a project that turns your Prometheus installation into a highly available metric system with unlimited storage capacity. From a very high-level view, it does this by deploying a sidecar to Prometheus, which uploads the data blocks to any object storage. A store component downloads the blocks again and makes them accessible to a query component, which has the same API as Prometheus itself.
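Because the query component speaks the same HTTP API as Prometheus, existing clients work against it unchanged. A minimal sketch, assuming a Thanos querier reachable at localhost:10902 (host, port, and the query string are placeholder examples):

```python
import json
import urllib.parse
import urllib.request

# The Thanos query component serves the standard Prometheus HTTP API, so the
# usual /api/v1/query endpoint works. Host, port, and query are placeholders.
base_url = "http://localhost:10902"
params = urllib.parse.urlencode({"query": "up"})

with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}") as resp:
    result = json.load(resp)

for series in result["data"]["result"]:
    print(series["metric"], series["value"])
```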
Read more
The Billion Data Point Challenge: Building a Query Engine for High Cardinality Time Series Data
Uber, like most large technology companies, relies extensively on metrics to effectively monitor its entire stack. From low-level system metrics, such as memory utilization of a host, to high-level business metrics, including the number of Uber Eats orders in a particular city, they allow our engineers to gain insight into how our services are operating on a daily basis. As our dimensionality and usage of metrics increase, common solutions like Prometheus and Graphite become difficult to manage and sometimes cease to work.
Read more