monitoring

Telltale: Netflix Application Monitoring Simplified

Posted on Oct 15, 2020

Our Netflix teams need to quickly detect, diagnose, and remediate problems. The Telltale application health model yields intelligent monitoring and intelligent alerting. Netflix service owners get alerts they can trust with little configuration and no need for constant tuning. When health problems strike, Telltale presents only the most relevant context and suggests possible causes. An alert fires and you get paged in the middle of the night.

How should pipelines be monitored?

Posted on Aug 4, 2019

For online serving systems it’s fairly well known that you should look at request rate, errors and duration. What about offline processing pipelines though? For a typical web application, high latency or error rates are the sort of thing you want to wake someone up about, as they usually negatively affect the end-user’s experience. Request rate isn’t something to alert on in and of itself; however, it’s important to know, as it’s often related to errors and latency, and you’ll want it for capacity planning.
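As a rough illustration of the three signals named in this excerpt, the sketch below computes request rate, error ratio and a latency percentile from a list of request records. The `Request` record, the `red_summary` and `percentile` helpers, and the nearest-rank percentile choice are all assumptions made for this example; they are not taken from the linked post or from any particular monitoring library.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical record of one served request."""
    duration_s: float  # time taken to serve the request, in seconds
    ok: bool           # False if the request ended in an error


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def red_summary(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Summarise rate, errors and duration over one observation window."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = [r.duration_s for r in requests]
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_duration_s": percentile(durations, 95) if total else 0.0,
    }
```

In a setup like this, the error ratio and the duration percentile would be the candidates for paging alerts, while the rate is kept for context and capacity planning, matching the distinction the excerpt draws.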

MTTR is dead, long live CIRT

Posted on Aug 3, 2019

The game is changing for the IT ops community, which means the rules of the past make less and less sense. Organizations need accurate, understandable, and actionable metrics in the right context to measure operations performance and drive critical business transformation. The more customers use modern tools and the more variation in the types of incidents they manage, the less sense it makes to smash all those different incidents into one bucket to compute an average resolution time that will represent ops performance, which is what IT has been doing for a long time.
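The averaging problem described in this excerpt can be seen with a few lines of arithmetic. The sketch below uses made-up incident data and a hypothetical `mean_resolution_by_type` helper; it only contrasts a single pooled average with per-type averages and is not the CIRT metric from the linked article.

```python
from collections import defaultdict
from statistics import mean


def mean_resolution_by_type(incidents):
    """Group (incident_type, minutes_to_resolve) pairs and average each group."""
    by_type = defaultdict(list)
    for incident_type, minutes in incidents:
        by_type[incident_type].append(minutes)
    return {t: mean(ms) for t, ms in by_type.items()}


# Hypothetical incidents: two quick rollbacks and one long outage.
incidents = [
    ("failed deploy", 10),
    ("failed deploy", 20),
    ("regional outage", 300),
]

# One bucket for everything: (10 + 20 + 300) / 3 = 110 minutes.
overall = mean(minutes for _, minutes in incidents)

# Separate buckets: 15 minutes for deploys, 300 minutes for the outage.
per_type = mean_resolution_by_type(incidents)
```

The pooled figure of 110 minutes describes neither kind of incident well, which is the point the excerpt makes about computing one average resolution time across very different incidents.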

How to monitor Golden signals in Kubernetes

Posted on Jun 28, 2019

What are golden signals metrics? How do you monitor golden signals in Kubernetes applications? Golden signals can help to detect issues in a microservices application. These signals are a reduced set of metrics that offer a wide view of a service from a user or consumer perspective, so you can detect potential problems that might be directly affecting the behaviour of the application.

Infrastructure monitoring: Defense against surprise downtime

Posted on Mar 2, 2019

Infrastructure monitoring is an integral part of infrastructure management. It is an IT manager’s first line of defense against surprise downtime. Severe issues can cause considerable downtime in live infrastructure, sometimes leading to heavy losses of money and material. Source: opensource.com

Kubernetes Metrics and Monitoring

Posted on Feb 20, 2019

This post explores the current state of metrics and monitoring in Kubernetes by walking through the gradual thought process that I experienced when learning this topic. Kubernetes needs some metrics for its basic out-of-the-box functionality, like autoscaling and scheduling. This is regardless of any monitoring solution you may want for the purpose of troubleshooting and alerting. Kubernetes’ own need for metrics is often referred to as the ‘core metrics pipeline’, in contrast to a general monitoring solution.

How Uber Monitors 4,000 Microservices

Posted on Feb 8, 2019

With 4,000 proprietary microservices and a growing number of open source systems that needed to be monitored, by late 2014 Uber was outgrowing its usage of Graphite and Nagios for metrics. They evaluated several technologies, including Atlas and OpenTSDB, but the fact that a growing number of open source systems were adding native support for the Prometheus Metrics Exporter format tipped the scales in that direction. Uber found with its use of Prometheus and M3, Uber’s storage costs for ingesting metrics became 8.

Monitoring Kubernetes, part 1: the challenges + data sources

Posted on Jan 21, 2019

Our industry has long been relying on microservice-based architecture to deliver software faster and safer. The advent and ubiquity of microservices naturally paved the way for container technology, empowering us to rethink how we build and deploy our applications. Docker exploded onto the scene in 2013, and, for companies focusing on modernizing their infrastructure and cloud migration, a tool like Docker is critical to shipping applications quickly, at scale.

Observability at Scale: Building Uber’s Alerting Ecosystem

Posted on Dec 22, 2018

Uber’s software architecture consists of thousands of microservices that empower teams to iterate quickly and support our company’s global growth. These microservices support a variety of solutions, such as mobile applications, internal and infrastructure services, and products, along with complex configurations that affect these products at city and sub-city levels. To maintain our growth and architecture, Uber’s Observability team built a robust, scalable metrics and alerting pipeline responsible for detecting, mitigating, and notifying engineers of issues with their services as soon as they occur.

Stack Overflow: How We Do Monitoring

Posted on Dec 22, 2018

What is monitoring? As far as I can tell, it means different things to different people. But we more or less agree on the concept. I think. Maybe. Let’s find out! Source: nickcraver.com