Python at Netflix

As many of us prepare to go to PyCon, we wanted to share a sampling of how Python is used at Netflix. We use Python through the full content lifecycle, from deciding which content to fund all the way to operating the CDN that serves the final video to 148 million members. We use and contribute to many open-source Python packages, some of which are mentioned below. If any of this interests you, check out the jobs site or find us at PyCon.
Read more

Cilium 1.5: Scaling to 5k nodes and 100k pods, BPF-based SNAT, and Rolling Key Updates for Transparent Encryption

We are excited to announce the Cilium 1.5 release. Cilium 1.5 is the first release where we primarily focused on scalability with respect to the number of nodes, pods, and services. Our goal was to scale to 5k nodes, 20k pods, and 10k services. We went well past that goal with the 1.5 release and are now officially supporting 5k nodes, 100k pods, and 20k services. Along the way we learned a lot, some of it expected, some unexpected; this blog post dives into what we learned and how we improved.
Read more

Linkerd or Istio?

This week I set out to write a post comparing Istio and Linkerd, and I told myself: I’m going to create tables comparing features, and it’s going to be great and people will love it and the world will be happier for a few seconds. I promised myself it was going to be a fair comparison without bias from either side. While the ‘comparison table’ is still here, I shifted the focus of the article: the goal is not to decide which is better, but which is better for you, for your applications, and for your organization.
Read more

Guidance for Building a Control Plane for Envoy Part 5: Deployment Tradeoffs

Once you’ve designed and built your control plane and its various supporting components, you’ll want to decide exactly how those components get deployed. You’ll want to weigh various security, scalability, and usability concerns when settling on what’s best for your implementation. The options range from co-deploying control plane components with the data plane to completely separating the control plane from the data plane. There is also a middle ground: deploy some components co-located with the data plane and keep the rest centralized.
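As a rough sketch of the two extremes (not from the article; the image names and labels are hypothetical), the following Go snippet uses the Kubernetes API types to contrast an xDS control plane co-deployed as a sidecar next to Envoy with a standalone, centralized control-plane Deployment.

```go
package main

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Option 1: co-deploy the control plane with the data plane by running the
// xDS server as a sidecar in the same Pod as the Envoy proxy.
// Image names below are hypothetical placeholders.
func coDeployedPod() corev1.PodSpec {
	return corev1.PodSpec{
		Containers: []corev1.Container{
			{Name: "envoy", Image: "envoyproxy/envoy:v1.10.0"},
			{Name: "xds-server", Image: "example.com/my-xds-server:latest"},
		},
	}
}

// Option 2: run the control plane as its own centralized Deployment, fully
// separated from the Envoy data plane; proxies connect to it over the network.
func centralizedControlPlane() *appsv1.Deployment {
	replicas := int32(2)
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "xds-control-plane"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "xds-control-plane"},
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels: map[string]string{"app": "xds-control-plane"},
				},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{
						{Name: "xds-server", Image: "example.com/my-xds-server:latest"},
					},
				},
			},
		},
	}
}

func main() {
	_ = coDeployedPod()
	_ = centralizedControlPlane()
}
```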
Read more

Setting up Kubernetes Network Policies

The container orchestrator war is over, and Kubernetes has won. With companies large and small rapidly adopting the platform, security has emerged as an important concern — partly because of the learning curve inherent in understanding any new infrastructure, and partly because of recently announced vulnerabilities. Kubernetes brings another security dynamic to the table — its defaults are geared towards making it easy for users to get up and running quickly, as well as being backward compatible with earlier releases of Kubernetes that lacked important security features.
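To make the permissive default concrete: with no NetworkPolicy selecting a Pod, all traffic to and from it is allowed. Here is a minimal sketch (not from the article) of a “default deny all ingress” policy built with the Kubernetes Go API types; the namespace and policy name are illustrative.

```go
package main

import (
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultDenyIngress selects every Pod in a namespace and, because it lists no
// ingress rules, blocks all incoming traffic to those Pods. Applying a policy
// like this flips the permissive Kubernetes default to deny-by-default.
func defaultDenyIngress(namespace string) *networkingv1.NetworkPolicy {
	return &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "default-deny-ingress",
			Namespace: namespace,
		},
		Spec: networkingv1.NetworkPolicySpec{
			// An empty selector matches all Pods in the namespace.
			PodSelector: metav1.LabelSelector{},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			// No Ingress rules: nothing is allowed in.
		},
	}
}

func main() {
	_ = defaultDenyIngress("production")
}
```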
Read more

Debugging Istio control plane with Squash

Solo.io Squash is a distributed debugger that supports multiple languages. When applications run in a container environment like Kubernetes, debugging them can be difficult, especially when they are distributed across multiple containers with implementations in potentially different languages. Istio is a service-mesh implementation that exemplifies this “microservice” architecture by implementing its control plane as a set of services. We can use Solo.io Squash to debug the Istio control plane without any modification to the images (i.e., adding debuggers, scripts, etc.).
Read more

Introducing kube-iptables-tailer: Better Networking Visibility in Kubernetes Clusters

At Box, we use Kubernetes to empower our engineers to own the whole lifecycle of their microservices. When it comes to networking, our engineers use Tigera’s Project Calico to declaratively manage network policies for their apps running in our Kubernetes clusters. App owners define a Calico policy in order to enable their Pods to send and receive network traffic, and that policy is instantiated as iptables rules. There may be times, however, when such a network policy is missing or declared incorrectly by app owners.
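kube-iptables-tailer watches iptables packet-drop logs and surfaces the drops back to the owners of the affected Pods. As a rough, hypothetical sketch of the parsing step (the log prefix and field handling are assumptions, not Box’s actual implementation), the following Go snippet pulls the source and destination IPs out of a standard iptables LOG line.

```go
package main

import (
	"fmt"
	"regexp"
)

// dropLine matches the SRC= and DST= fields that the iptables LOG target
// writes for each logged packet. The "calico-drop:" prefix is a hypothetical
// log prefix; real deployments configure their own.
var dropLine = regexp.MustCompile(`calico-drop:.*SRC=(\S+) .*DST=(\S+)`)

// parseDrop extracts the source and destination IPs from one log line, which
// a tool like kube-iptables-tailer could then map back to Pod names via the
// Kubernetes API and report to the Pods' owners.
func parseDrop(line string) (src, dst string, ok bool) {
	m := dropLine.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	line := "calico-drop: IN=eth0 OUT= SRC=10.0.1.5 DST=10.0.2.9 PROTO=TCP SPT=54321 DPT=8080"
	if src, dst, ok := parseDrop(line); ok {
		fmt.Printf("packet dropped: %s -> %s\n", src, dst)
	}
}
```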
Read more

Optimizing M3: How Uber Halved Our Metrics Ingestion Latency by Forking the Go Compiler

In Uber’s New York engineering office, our Observability team maintains a robust, scalable metrics and alerting pipeline responsible for detecting, mitigating, and notifying engineers of issues with their services as soon as they occur. Monitoring the health of our thousands of microservices helps us ensure that our platform runs smoothly and efficiently for our millions of users across the globe, from riders and driver-partners to eaters and restaurant-partners. A few months ago, a routine deployment in a core service of M3, our open source metrics and monitoring platform, caused a doubling in overall latency for collecting and persisting metrics to storage, elevating the metrics’ P99 from approximately 10 seconds to over 20 seconds.
Read more

Surface Kubernetes Errors with Sentry

Kubernetes, like a lot of other tools, can be noisy. Errors and warnings often go completely unnoticed in the event stream. Or sometimes they are noticed, but are hard to understand in the context of what else is happening in the cluster. Sentry, unlike a lot of other tools, works to eliminate that noise as much as possible, including Kubernetes-related noise. sentry-kubernetes is a small container that can be launched inside your Kubernetes cluster to send errors and warnings to Sentry, where they will be cleanly presented and intelligently grouped.
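The underlying technique is watching the cluster’s event stream and forwarding anything interesting. Here is a minimal, hypothetical client-go sketch that prints Warning events; sentry-kubernetes itself does more, forwarding them to Sentry with grouping and enrichment.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the program runs inside the cluster, as sentry-kubernetes does.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch events across all namespaces
	// (older client-go signature; newer versions take a context first).
	watcher, err := clientset.CoreV1().Events(metav1.NamespaceAll).Watch(metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for ev := range watcher.ResultChan() {
		event, ok := ev.Object.(*corev1.Event)
		if !ok {
			continue
		}
		// Only surface warnings; "Normal" events are the noise we want to skip.
		// A real reporter would send these to Sentry instead of printing them.
		if event.Type == corev1.EventTypeWarning {
			fmt.Printf("[%s/%s] %s: %s\n",
				event.InvolvedObject.Namespace, event.InvolvedObject.Name,
				event.Reason, event.Message)
		}
	}
}
```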
Read more

The Future of Cloud Providers in Kubernetes

Approximately 9 months ago, the Kubernetes community agreed to form the Cloud Provider Special Interest Group (SIG). The justification was to have a single governing SIG to own and shape the integration points between Kubernetes and the many cloud providers it supports. A lot has been in motion since then, and we’re here to share with you what has been accomplished so far and what we hope to see in the future.
Read more