Spark

Making Apache Spark Effortless for All of Uber

Apache Spark is a foundational piece of Uber’s Big Data infrastructure that powers many critical aspects of our business. We currently run more than one hundred thousand Spark applications per day across multiple compute environments. Spark’s versatility, which allows us to build applications and run them everywhere that we need, makes this scale possible. However, our ever-growing infrastructure means that these environments are constantly changing, making it increasingly difficult for both new and existing users to give their applications reliable access to data sources, compute resources, and supporting tools. Also, as the number of users grows, it becomes more challenging for the data team to communicate these environmental changes to users, and for us to understand exactly how Spark is being used. We built the Uber Spark Compute Service (uSCS) to help manage the complexities of running Spark at this scale.

Scaling Spark Streaming for Logging Event Ingestion

Walking over a stream during an Airbnb Experience in Kuala Lumpur. Searching, viewing, and booking such Experiences all produce logging events that are processed by our stream processing framework.

Logging events are emitted from clients (such as mobile apps and web browsers) and online services, and carry key information and context about the actions or operations performed. Each event carries a specific piece of information. For example, when a guest searches for a beach house in Malibu on Airbnb.com, a search event containing the location, check-in and check-out dates, and similar fields is generated (and anonymized for privacy protection). At Airbnb, event logging is crucial for us to understand guests and hosts and then provide them with a better experience.
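
As a rough illustration of the search event described above, here is a minimal Python sketch. The class name `SearchEvent`, its field names, the `anonymize` helper, and the salt value are all hypothetical and are not taken from Airbnb's actual schema. The salted hash only stands in for whatever anonymization the real pipeline applies.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class SearchEvent:
    """Hypothetical search event; field names are illustrative only."""
    user_token: str  # opaque token derived from the user ID, never the raw ID
    location: str
    checkin: str     # ISO 8601 date string
    checkout: str    # ISO 8601 date string


def anonymize(user_id: str, salt: str) -> str:
    """Replace a raw user ID with a salted SHA-256 digest (an assumed approach)."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


def build_search_event(user_id: str, location: str,
                       checkin: date, checkout: date, salt: str) -> SearchEvent:
    """Validate the date range and build an event that omits the raw user ID."""
    if checkout <= checkin:
        raise ValueError("checkout must be after checkin")
    return SearchEvent(
        user_token=anonymize(user_id, salt),
        location=location,
        checkin=checkin.isoformat(),
        checkout=checkout.isoformat(),
    )


# A guest searches for a stay in Malibu; the serialized payload is what a
# client might emit for downstream stream processing.
event = build_search_event("guest-123", "Malibu",
                           date(2024, 7, 1), date(2024, 7, 5), salt="demo-salt")
payload = json.dumps(asdict(event), sort_keys=True)
```

Keeping the event immutable and serializing it to JSON is just one reasonable choice for this sketch; a production pipeline would more likely use a schema-managed binary format.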