Early in 2017 we started exploring Presto for OLAP use cases and quickly realized the potential of this query engine. It started as an ad hoc querying tool that let data engineers and analysts prototype their SQL queries faster than they could with Apache Hive. Many of our internal dashboards were powered by AWS Redshift back then, which coupled data storage and compute together.
Aria is a set of initiatives to dramatically increase PrestoDB efficiency. Our goal is to achieve a 2-3x decrease in CPU time for Hive queries against tables stored in ORC format. For Aria, we are pursuing improvements in three areas: table scan, repartitioning (exchange, shuffle), and hash join.
Nearly 60 percent of our global Presto CPU time is attributed to table scan, which makes scan improvements high-leverage and is why we chose to focus on this area first.
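
To make the leverage argument concrete, here is a minimal back-of-the-envelope sketch using Amdahl's law. It assumes, purely for illustration, that the roughly 60 percent global scan share also applies to the workload being optimized; the function name `overall_speedup` and the constant `SCAN_FRACTION` are hypothetical and not part of Presto.

```python
def overall_speedup(fraction: float, component_speedup: float) -> float:
    """Amdahl's law: total speedup when `fraction` of the CPU time
    becomes `component_speedup` times cheaper and the rest is unchanged."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if component_speedup <= 0.0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


# Assumption: scan accounts for about 60% of CPU time, per the figure above.
SCAN_FRACTION = 0.60

# Effect of making only the scan cheaper by various factors.
for scan_speedup in (2.0, 3.0, 10.0):
    total = overall_speedup(SCAN_FRACTION, scan_speedup)
    print(f"scan {scan_speedup:.0f}x cheaper -> overall {total:.2f}x")

# Upper bound if scan cost dropped to zero and nothing else changed.
scan_only_limit = 1.0 / (1.0 - SCAN_FRACTION)
print(f"scan-only upper bound: {scan_only_limit:.2f}x")
```

Under these assumptions, halving scan cost alone yields roughly a 1.43x overall reduction, and scan-only improvements cap out at 2.5x. That is consistent with starting on scan while still pursuing repartitioning and hash join to reach the upper end of the 2-3x goal.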