casestudy

Taming ElastiCache with Auto-discovery at Scale

Our backend infrastructure at Tinder relies on Redis-based caching to fulfill the requests generated by more than 2 billion uses of the Swipe® feature per day and hosts more than 30 billion matches to 190 countries globally. Most of our data operations are reads, which motivates the general data flow architecture of our backend microservices. Source: medium.com

Library of Congress Storage Architecture

In 2026, is there demand for 7X more manufactured storage annually, and is there sufficient value in this storage to spend $122B more annually (2.4X) on it? Unlike HDD, tape magnetic physics is not the limiting issue, since tape bit cells are 60X larger than HDD bit cells … The projected tape areal density in 2025 (90 Gbit/in²) is 13x smaller than today’s HDD areal density and has already been demonstrated in laboratory environments.

How we 30x’d our Node parallelism

What’s the best way to safely increase parallelism in a production Node service? That’s a question my team needed to answer a couple of months ago. We were running 4,000 Node containers (or ‘workers’) for our bank integration service. The service was originally designed such that each worker would process only a single request at a time. This design lessened the impact of integrations that accidentally blocked the event loop, and allowed us to ignore the variability in resource usage across different integrations.
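The single-request-per-worker design described above can be pictured as a per-worker slot limit of 1, and raising parallelism as raising that limit. The following is a minimal sketch under that assumption, not the team's actual implementation: `WorkerSlots`, `handleWithLimit`, and the `maxInFlight` parameter are hypothetical names introduced here for illustration.

```typescript
// Hypothetical sketch: none of these names come from the original post.
// A worker owns a fixed number of slots; one request occupies one slot.
class WorkerSlots {
  private readonly maxInFlight: number;
  private inFlight = 0;

  constructor(maxInFlight: number) {
    if (!Number.isInteger(maxInFlight) || maxInFlight < 1) {
      throw new RangeError("maxInFlight must be a positive integer");
    }
    this.maxInFlight = maxInFlight;
  }

  // Claim a slot if one is free; returns false when the worker is at capacity.
  tryAcquire(): boolean {
    if (this.inFlight >= this.maxInFlight) {
      return false;
    }
    this.inFlight += 1;
    return true;
  }

  // Give a slot back once a request has finished, whether it succeeded or failed.
  release(): void {
    if (this.inFlight === 0) {
      throw new Error("release() called without a matching tryAcquire()");
    }
    this.inFlight -= 1;
  }

  // Number of requests currently being processed by this worker.
  get active(): number {
    return this.inFlight;
  }
}

type LimitedResult<T> = { accepted: true; value: T } | { accepted: false };

// Run the handler only when a slot is free; otherwise report the request as
// not accepted so the caller can route it to another worker.
async function handleWithLimit<T>(
  slots: WorkerSlots,
  handler: () => Promise<T>,
): Promise<LimitedResult<T>> {
  if (!slots.tryAcquire()) {
    return { accepted: false };
  }
  try {
    return { accepted: true, value: await handler() };
  } finally {
    slots.release();
  }
}
```

With `new WorkerSlots(1)` this mirrors the original one-request-at-a-time behaviour. A larger limit lets requests interleave on the same event loop, which is exactly where an integration that blocks the loop starts to affect its neighbours.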

From 15,000 database connections to under 100: DigitalOcean’s tale of tech debt

A new hire recently asked me over lunch, “What does DigitalOcean’s tech debt look like?” I could not help but smile when I heard the question. Software engineers asking about a company’s tech debt is the equivalent of asking about a credit score. It’s their way of … As a cloud provider that manages our own servers and hardware, we have faced complications that many other startups have not encountered in this new era of cloud computing.

Making the LinkedIn experimentation engine 20x faster

At LinkedIn, we like to say that experimentation is in our blood because no production release at the company happens without experimentation; by “experimentation,” we typically mean “A/B testing.” The company relies on employees to make decisions by analyzing data. Experimentation is a data-driven foundation of the decision-making process, which helps with measuring the precise impact of every change and release, and evaluating whether expectations meet reality. LinkedIn’s experimentation platform operates at an extremely large scale: it serves up to 800,000 QPS of network calls; it serves about 35,000 concurrently running A/B experiments; it handles up to 23 trillion experiment evaluations per day; the average latency of an experiment evaluation is 700 ns and the 99th percentile is 3 μs; and it is used in about 500 production services.

Scaling Beyond a Billion Transactions Per Day with Sub-second Responses

Andrey Zolotov and Gideon Low present their journey of transitioning to distributed data processing using GemFire and the challenges faced along the way. Source: infoq.com

Lyft’s Journey through Mobile Networking

In 5 years, the number of endpoints consumed by Lyft’s mobile apps grew to over 500, and the size of our mobile engineering team increased by more than 15x. To scale with this growth, our infrastructure had to evolve dramatically to utilize new advances in modern networking in order to continue to provide benefits for our users. This post describes the journey through the evolution of Lyft’s mobile networking: how it’s changed, what we’ve learned, and why it’s important for us as a growing business.

Database Migration To Amazon Aurora

In this blog post we’ll show you how we migrated a critical Postgres database with 18 TB of data from Amazon RDS (Relational Database Service) to Amazon Aurora, with minimal downtime. To do so, we’ll discuss our experience at Codacy. We chose Amazon’s Aurora database as a solution for a few key reasons, including: 1) automatic storage growth (up to 64 TB); 2) ease of migration from RDS; and 3) performance benefits. Although, Aurora’s official docs only claimed up to a 3x increase in throughput performance over stock PostgreSQL 9.

Automating Datacenter Operations at Dropbox

Switch provisioning at Dropbox is handled by a Pirlo component called the TOR Starter. The TOR Starter is responsible for validating and configuring switches in our datacenter server racks, PoP server racks, and at the different layers of our datacenter fabric that connect racks in the same facility together. Writing the TOR Starter on top of the ClusterOps queue provides us with a basic manager-worker queuing service. We also have the ability to customize the queue to fit our needs in switch provisioning.

Kubernetes Failure Stories

I started to compile a list of public failure/horror stories related to Kubernetes. It should make it easier for people tasked with operations to find outage reports to learn from. Since we started with Kubernetes at Zalando in 2016, we have collected many internal postmortems. Docker bugs (daemon unresponsive, process stuck in pipe wait, ...) were a major pain point in the beginning, but Docker itself has become more mature and has not bitten us recently.