Automating Datacenter Operations at Dropbox

Switch provisioning at Dropbox is handled by a Pirlo component called the TOR Starter. The TOR Starter is responsible for validating and configuring switches in our datacenter server racks, PoP server racks, and at the different layers of our datacenter fabric that connect racks in the same facility together. Writing the TOR Starter on top […]

Kubernetes Failure Stories

I started to compile a list of public failure/horror stories related to Kubernetes. It should make it easier for people tasked with operations to find outage reports to learn from. Since we started with Kubernetes at Zalando in 2016, we have collected many internal postmortems. Docker bugs (daemon unresponsive, process stuck in pipe wait, ...) were […]

The Many Faces of Envoy Proxy: Edge Gateway, Service Mesh, and Hybrid Networking Bridge

At the inaugural EnvoyCon in Seattle, USA, engineers from Pinterest, Yelp and Groupon presented their current use cases for the Envoy Proxy. The overarching message was that the Envoy Proxy appears to be moving closer to fulfilling its vision of providing the “universal [proxy] data plane API” for modern networking, including edge gateways, service meshes […]

The Biggest IT Failures of 2018

This year proved once again that IT-related failures “are universally unprejudiced: they happen in every country; to large companies and small; in commercial, nonprofit, and governmental organizations; and without regard to status or reputation.” Below is a review that just scratches the surface of the sundry failures, glitches, and other IT hiccups that made the news in […]

Observability at Scale: Building Uber’s Alerting Ecosystem

Uber’s software architecture consists of thousands of microservices that empower teams to iterate quickly and support our company’s global growth. These microservices support a variety of solutions, such as mobile applications, internal and infrastructure services, and products along with complex configurations that affect these products at city and sub-city levels. To maintain our growth and […]

Stack Overflow: How We Do Monitoring

What is monitoring? As far as I can tell, it means different things to different people. But we more or less agree on the concept. I think. Maybe. Let’s find out! Source: nickcraver

How Uber Beacon Helps Improve Safety for Riders and Drivers

Globally, there are approximately 1.3 million collision-related fatalities on the road every year. Crash fatalities are still the leading cause of death for people between 15-29 years old, impacting families, communities, and cities. Governments around the world are working to reduce the risks, committing more resources towards improving road safety. At Uber, we want to […]

Cape Technical Deep Dive

In this post, we’ll take a deep dive into the design of the Cape framework. First, we’ll discuss Cape’s architecture. Then we’ll look at the core scheduling component of the system. Throughout, we’ll focus the discussion on a few key design decisions. Before we begin, let’s touch on a few of our principles for developing […]

Kubernetes in production

I’ve provisioned Kubernetes clusters on bare metal before and have some examples here on how it can be done with CoreOS. (Warning: the content is rather old now and not maintained.) In the beginning, a bunch of tools & methods were considered: For network CNI, kube-router was used as I became one of […]