Airbnb serves millions of guests and hosts in our community. Every second, their activities on Airbnb.com, such as searching, booking, and messaging, generate a huge amount of data we anonymize and use to improve the community’s experience on our platform. The Data Platform Team at Airbnb strives to leverage this data to improve our customers’ experiences and optimize Airbnb’s business.
Our mission is to provide infrastructure to collect, organize, and process this deluge of data (all in privacy-safe ways), and empower various organizations across Airbnb to derive necessary analytics and make data-informed decisions from it. The primary way high-level analytics is exposed and shared within the company is through various dashboards. A lot of people use these dashboards every day to make various decisions.
Dashboards also allow real-time tracking and monitoring of various aspects of our business and systems. As a result, the timeliness of these dashboards is critical to the daily operation of Airbnb. However, we are faced with three challenges: First, it would take a long time to aggregate data in the warehouse and generate the necessary data for these dashboards using systems like Hive and Presto at query time.
Hive/Presto has to read all the data and aggregate them on demand, resulting in all necessary computation getting invoked at query time. Even if those engines are used to pre-compute the aggregation and store them, the storage format is not optimized for repeated slicing and dicing of data that analytics queries demand.