data

Scaling a Mature Data Pipeline—Managing Overhead

Before delving into the specifics, I want to take a moment to discuss the technical stack backing our pipeline. Our platform uses a mixture of Spark and Hive jobs. Our core pipeline is primarily implemented in Scala, although we use Spark SQL in certain contexts. We rely on YARN for job scheduling and resource management, and run our jobs on Amazon EMR. Airflow handles task orchestration.

Accelerating Uber’s Self-Driving Vehicle Development with Data

A key challenge faced by self-driving vehicles comes during interactions with pedestrians. In our development of self-driving vehicles, the Data Engineering and Data Science teams at Uber ATG (Advanced Technologies Group) contribute to the data processing and analysis that help make these interactions safe. Through data, we can learn the movement of cars and pedestrians in a city, and train our self-driving vehicles how to drive. We map pedestrian movement in cities with LiDAR-equipped cars, search video collected from the roads for interesting, real-life situations that can be used in model training, build and report on simulations, and test on both a closed track and real roads to reinforce our training.

Scio 0.7: a deep dive

Large-scale data processing is a critical component of Spotify’s business model. It drives music recommendations, artist payouts based on stream counts, and insights about how users interact with Spotify. Every day we capture hundreds of terabytes of event data, in addition to database snapshots and derived datasets. It’s imperative that engineers who want to work with this data can quickly write and execute application-level code without worrying about the low-level semantics of Map/Reduce frameworks, provisioning the right amount of compute power, or writing extensive boilerplate code for every job.

Matplotlib—Making data visualization interesting

Data visualization is a key step in understanding a dataset and drawing inferences from it. While one can always inspect the data row by row, cell by cell, doing so is tedious and does not highlight the big picture. Visuals, on the other hand, present data in a form that is easy to understand at a glance and keep the audience engaged. Matplotlib is a basic plotting library that offers a variety of plot types along with extensive customization of labels, titles, font sizes, and more.

Druid @ Airbnb Data Platform

Airbnb serves millions of guests and hosts in our community. Every second, their activities on Airbnb.com, such as searching, booking, and messaging, generate a huge amount of data we anonymize and use to improve the community’s experience on our platform. The Data Platform Team at Airbnb strives to leverage this data to improve our customers’ experiences and optimize Airbnb’s business. Our mission is to provide infrastructure to collect, organize, and process this deluge of data (all in privacy-safe ways), and empower various organizations across Airbnb to derive necessary analytics and make data-informed decisions from it.

Scaling Time Series Data Storage—Part II

In January 2016, Netflix expanded worldwide, opening service to 130 additional countries and supporting 20 total languages. Later in 2016, the TV experience evolved to include video previews during browsing. More members, more languages, and more video playbacks stretched the time series data storage architecture from Part 1 close to its breaking point. In Part 2, we explore the limitations of that architecture and describe how we’re re-architecting for this next phase in our evolution.