Earlier this year, the Streaming PubSub team at Lyft got multiple Apache Kafka clusters ready to take on load that required 24/7 support. The team’s operational burden for Kafka quickly started heading towards burn-out territory. On-call rotations started getting miserable because we’d get woken up at night due to failing hosts. Business requirements kept coming and requiring us to scale the clusters further. The more we scaled, the more we’d get woken up.