News

Running Kubernetes in the Federal Government

Tackling security compliance is a long and challenging process for agencies, systems integrators, and vendors trying to launch new information systems in the federal government. Each new information system must go through the Risk Management Framework (RMF), created by the National Institute of Standards and Technology (NIST), to obtain an Authority to Operate (ATO). The process is tedious and can take more than a year.
Read more

Monitoring Kubernetes, part 1: the challenges + data sources

Our industry has long relied on microservice-based architectures to deliver software faster and more safely. The advent and ubiquity of microservices naturally paved the way for container technology, empowering us to rethink how we build and deploy our applications. Docker exploded onto the scene in 2013, and for companies focused on modernizing their infrastructure and migrating to the cloud, a tool like Docker is critical to shipping applications quickly and at scale.
Read more

Creating a Zoo of Atari-Playing Agents to Catalyze the Understanding of Deep Reinforcement Learning

Some of the most exciting recent advances in AI have come from the field of deep reinforcement learning (deep RL), where deep neural networks learn to perform complicated tasks from reward signals. RL operates much like teaching a dog a new trick: treats are offered to reinforce improved behavior. Deep RL agents have now exceeded human performance on benchmarks such as classic Atari 2600 video games, the board game Go, and modern computer games like Dota 2.
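The dog-training analogy maps directly onto RL's core update rule: nudge value estimates toward the observed reward plus discounted future value. Below is a minimal tabular Q-learning sketch on a toy corridor environment (a hand-rolled illustration of the underlying idea, not the deep networks used for the Atari agents; all names here are hypothetical):

```python
import random

# Toy corridor: states 0..4; reward only for reaching the right end.
# A hypothetical stand-in for an environment like an Atari game.
N_STATES = 5
ACTIONS = (-1, +1)                      # step left / step right
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy: usually exploit the best-known action, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        reward = 1.0 if s_next == N_STATES - 1 else 0.0   # the "treat"
        # Q-learning update: move the estimate toward reward + discounted future value.
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

# The learned greedy policy: best action for each non-terminal state.
print([max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)])
```

Deep RL replaces the lookup table with a neural network that generalizes across states, which is what makes Atari-scale observation spaces tractable.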
Read more

MySQL High Availability Framework Explained – Part II

In Part I, we introduced a High Availability (HA) framework for MySQL hosting and discussed its various components and their functionality. Now, in Part II, we will discuss the details of MySQL semisynchronous replication and the related configuration settings that help us ensure redundancy and consistency of the data in our HA setup. Make sure to check back for Part III, where we will review the various failure scenarios that could arise and how the framework responds to and recovers from them.
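As a preview of the kind of settings the post walks through, here is a minimal sketch assuming the standard semisynchronous replication plugins that ship with MySQL 5.7; the exact values (timeout, ACK count) depend on your redundancy requirements:

```sql
-- Master: load the plugin and block commits until at least one replica ACKs.
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 10000;           -- ms before falling back to async
SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 1;  -- replicas that must ACK

-- Each replica: enable the slave-side plugin.
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
```

The timeout is the key durability/availability trade-off: if no replica acknowledges within it, the master silently degrades to asynchronous replication.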
Read more

Enterprise grade CI/CD with GitOps

Implementing Continuous Delivery[1] at enterprise scale is a major challenge. Because every company must keep improving how it delivers software, individual teams need room to learn and refine their own delivery pipelines. This is especially true in the Cloud Native world, where many best practices are still emerging. However, the flexibility to experiment has to be balanced against security and compliance requirements. In this post, I will explore how we successfully employed the GitOps architecture pattern to strike that balance at a large enterprise customer of Container Solutions.
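To make the pattern concrete, here is a hypothetical example of what GitOps looks like in practice: the desired cluster state lives as declarative manifests in a Git repository, and an agent in the cluster continuously reconciles toward it. The paths and image names below are illustrative, not from the post:

```yaml
# config-repo: apps/payments/deployment.yaml (hypothetical layout)
# A merged pull request *is* the deployment; the cluster pulls this state from Git.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: payments
          image: registry.example.com/payments:1.4.2  # version bumped via PR review
```

Security and compliance largely fall out of the workflow: Git history is the audit trail, and branch protection plus code review gate every production change.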
Read more

Courier: Dropbox migration to gRPC

Dropbox runs hundreds of services, written in different languages, which exchange millions of requests per second. At the core of our Service Oriented Architecture is Courier, our gRPC-based Remote Procedure Call (RPC) framework. While developing Courier, we learned a lot about extending gRPC, optimizing performance for scale, and providing a bridge from our legacy RPC system. Courier is not Dropbox’s first RPC framework. Even before we started to break our Python monolith into services in earnest, we needed a solid foundation for inter-service communication.
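For readers new to gRPC, the contract-first workflow at the heart of a framework like Courier starts with a Protocol Buffers service definition, from which typed client and server stubs are generated for each language. A minimal illustrative example (not Courier's actual interface):

```protobuf
// greeter.proto: a toy service definition, not part of Courier itself.
syntax = "proto3";

package demo;

// protoc generates client and server stubs from this in every supported language.
service Greeter {
  rpc SayHello (HelloRequest) returns (HelloReply);
}

message HelloRequest {
  string name = 1;
}

message HelloReply {
  string greeting = 1;
}
```

Cross-cutting concerns of the kind the post describes (authentication, tracing, deadlines) are typically layered on via gRPC interceptors rather than baked into each service by hand.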
Read more

New whitepaper: Achieving Operational Resilience in the Financial Sector and Beyond

AWS has released a new whitepaper, Amazon Web Services’ Approach to Operational Resilience in the Financial Sector and Beyond, in which we discuss how AWS and customers build for resiliency on the AWS cloud. We’re constantly amazed at the applications our customers build using AWS services — including what our financial services customers have built, from credit risk simulations to mobile banking applications. Depending on their internal and regulatory requirements, financial services companies may need to meet specific resiliency objectives and withstand low-probability events that could otherwise disrupt their businesses.
Read more

Western Digital HDD Simulation at Cloud Scale – 2.5 Million HPC Tasks, 40K EC2 Spot Instances

Earlier this month, my colleague Bala Thekkedath published a story about Extreme Scale HPC describing how AWS customer Western Digital built a cloud-scale HPC cluster on AWS and used it to simulate crucial elements of upcoming head designs for its next-generation hard disk drives (HDDs). The simulation described in the story encompassed a little over 2.5 million tasks and ran to completion in just 8 hours on a million-vCPU Amazon EC2 cluster.
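A quick back-of-envelope from the figures in the story (averages only, since actual instance sizes and task runtimes varied):

```python
tasks, vcpus, instances, hours = 2_500_000, 1_000_000, 40_000, 8

print(vcpus / instances)       # ~25 vCPUs per Spot instance, on average
print(vcpus * hours / tasks)   # ~3.2 vCPU-hours of compute per simulation task
print(tasks / (hours * 3600))  # ~87 tasks completing per second, sustained
```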
Read more

AI year in review

At Facebook, we think that artificial intelligence that learns in new, more efficient ways – much like humans do – can play an important role in bringing people together. That core belief drives our AI strategy: we focus our investments on long-term research into systems that learn from real-world data, encourage our engineers to share cutting-edge tools and platforms with the wider AI community, and ultimately demonstrate new ways to use the technology to benefit the world.
Read more

Rate Limiting at the Edge

I’m sure many of you have heard of the “Death Star Security” model: harden the perimeter and pay little attention to the inner core. While this is generally considered bad form in the current cloud native landscape, there are still many things that need to be implemented at the edge to provide both operational and business-logic support. One of these things is rate limiting. Modern applications and APIs can experience bursts of traffic over short periods, for both good and bad reasons, and these bursts need to be managed well if your business model relies on paying customers’ requests completing successfully.
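The usual algorithm behind this is a token bucket, which permits short bursts while enforcing a sustained rate. A minimal single-process sketch (edge proxies apply the same idea per client, often against a shared store such as Redis; this is an illustration, not any particular product's API):

```python
import time

class TokenBucket:
    """Classic token-bucket rate limiter: tokens refill at a steady rate,
    and each request spends one, so bursts up to `capacity` are allowed."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Allow a sustained 100 req/s per client, with bursts of up to 200.
limiter = TokenBucket(rate=100, capacity=200)
if not limiter.allow():
    pass  # reject the request, e.g. with HTTP 429 Too Many Requests
```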
Read more