Case study

Taming ElastiCache with Auto-discovery at Scale

Our backend infrastructure at Tinder relies on Redis-based caching to fulfill the requests generated by more than 2 billion uses of the Swipe® feature per day and hosts more than 30 billion matches across 190 countries. Most of our data operations are reads, which motivates the general data flow architecture of our backend microservices. Source: medium.com

Library of Congress Storage Architecture

In 2026, is there demand for 7X more manufactured storage annually, and is there sufficient value in this storage to justify spending $122B more annually (2.4X) on it? Unlike HDD, tape magnetic physics is not the limiting issue, since tape bit cells are 60X larger than HDD bit cells … The projected tape areal density in 2025 (90 Gbit/in²) is 13x smaller than today’s HDD areal density and has already been demonstrated in laboratory environments.

How we 30x’d our Node parallelism

What’s the best way to safely increase parallelism in a production Node service? That’s a question my team needed to answer a couple of months ago. We were running 4,000 Node containers (or ‘workers’) for our bank integration service. The service was originally designed such that each worker would process only a single request at a time. This design lessened the impact of integrations that accidentally blocked the event loop, and allowed us to ignore the variability in resource usage across different integrations. But since our total capacity was capped at 4,000 concurrent requests, the system did not gracefully scale.

From 15,000 database connections to under 100: DigitalOcean’s tale of tech debt

A new hire recently asked me over lunch, “What does DigitalOcean’s tech debt look like?” I could not help but smile when I heard the question. Software engineers asking about a company’s tech debt is the equivalent of asking about a credit score. It’s their way of … As a cloud provider that manages our own servers and hardware, we have faced complications that many other startups have not encountered in this new era of cloud computing. These tough situations ultimately led to tradeoffs we had to make early in our existence. And as any quickly growing company knows, the technical decisions you make early on tend to catch up with you later.

Making the LinkedIn experimentation engine 20x faster

At LinkedIn, we like to say that experimentation is in our blood because no production release at the company happens without experimentation; by “experimentation,” we typically mean “A/B testing.” The company relies on employees to make decisions by analyzing data. Experimentation is a data-driven foundation of the decision-making process, which helps with measuring the precise impact of every change and release, and evaluating whether expectations meet reality.

Scaling Beyond a Billion Transactions Per Day with Sub-second Responses

Andrey Zolotov and Gideon Low present their transition to distributed data processing using GemFire and the challenges they faced along the way. Source: infoq.com

Lyft’s Journey through Mobile Networking

In 5 years, the number of endpoints consumed by Lyft’s mobile apps grew to over 500, and the size of our mobile engineering team increased by more than 15x. To scale with this growth, our infrastructure had to evolve dramatically to utilize new advances in modern networking in order to continue to provide benefits for our users. This post describes the journey through the evolution of Lyft’s mobile networking: how it’s changed, what we’ve learned, and why it’s important for us as a growing business.

Database Migration To Amazon Aurora

In this blog post we’ll show you how we migrated a critical Postgres database with 18 TB of data from Amazon RDS (Relational Database Service) to Amazon Aurora, with minimal downtime. To do so, we’ll discuss our experience at Codacy. We chose Amazon’s Aurora database as a solution for a few key reasons, including: 1) automatic storage growth (up to 64 TB); 2) ease of migration from RDS; and 3) performance benefits. Although Aurora’s official docs only claimed up to a 3x increase in throughput performance over stock PostgreSQL 9.6, testimonials claimed that performance increased 12x just by migrating to Aurora.

Automating Datacenter Operations at Dropbox

Switch provisioning at Dropbox is handled by a Pirlo component called the TOR Starter. The TOR Starter is responsible for validating and configuring switches in our datacenter server racks, PoP server racks, and at the different layers of our datacenter fabric that connect racks in the same facility together. Writing the TOR Starter on top of the ClusterOps queue provides us with a basic manager-worker queuing service.

Kubernetes Failure Stories

I started to compile a list of public failure/horror stories related to Kubernetes. It should make it easier for people tasked with operations to find outage reports to learn from. Since we started with Kubernetes at Zalando in 2016, we have collected many internal postmortems. Docker bugs (daemon unresponsive, process stuck in pipe wait, ...) were a major pain point in the beginning, but Docker itself has become more mature and has not bitten us recently. The biggest chunk of problems can be attributed to the nature of distributed systems and ‘cascading failures’, e.g. a Kubernetes API server outage should not affect running workloads, but it did, or see our recent CoreDNS incident.