At Uber, we use robust data processing systems such as Apache Flink and Apache Spark to power the streaming applications that help us calculate up-to-date pricing, enhance driver dispatching, and fight fraud on our platform. Such solutions can process data at a massive scale in real time with exactly-once semantics, and the emergence of these systems over the past several years has unlocked an industry-wide ability to write streaming data processing applications at low latencies, a capability previously impossible to achieve at scale. However, since streaming systems are inherently unable to guarantee event order, they must make trade-offs in how they handle late data.
Typically, streaming systems mitigate this out-of-order problem by using event-time windows and watermarking. While efficient, this strategy can cause inaccuracies by dropping any events that arrive after the watermark has already passed their window. To support systems that require both the low latency of a streaming pipeline and the correctness of a batch pipeline, many organizations utilize Lambda architectures, a concept first proposed by Nathan Marz.
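To make the watermarking trade-off concrete, here is a minimal sketch in plain Python rather than Flink or Spark code. It assumes tumbling event-time windows and a watermark that trails the largest event time seen so far by a fixed delay. The names `WINDOW_SIZE`, `ALLOWED_DELAY`, and `process` are hypothetical and chosen only for illustration; real engines expose their own APIs and more sophisticated watermark strategies.

```python
from collections import defaultdict

WINDOW_SIZE = 10    # seconds per tumbling window (hypothetical value)
ALLOWED_DELAY = 5   # watermark trails the max event time by this much (hypothetical value)


def process(events):
    """Sum values per tumbling event-time window.

    `events` is an iterable of (event_time, value) pairs in arrival order.
    Returns (window_sums, dropped), where `dropped` lists events that
    arrived after the watermark had already passed the end of their window.
    """
    open_windows = defaultdict(int)  # window start -> running sum
    window_sums = {}                 # finalized results
    dropped = []
    max_event_time = float("-inf")

    for event_time, value in events:
        window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
        watermark = max_event_time - ALLOWED_DELAY

        # The event's window has already been finalized: drop the event.
        if window_start + WINDOW_SIZE <= watermark:
            dropped.append((event_time, value))
            continue

        open_windows[window_start] += value
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_DELAY

        # Finalize every window whose end is at or before the watermark.
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                window_sums[start] = open_windows.pop(start)

    # End of stream: flush whatever is still open.
    window_sums.update(open_windows)
    return window_sums, dropped


# Arrival order differs from event-time order: (4, 1) and (2, 1) arrive late.
events = [(1, 1), (3, 1), (12, 1), (4, 1), (17, 1), (2, 1), (21, 1)]
window_sums, dropped = process(events)
print(window_sums)  # {0: 3, 10: 2, 20: 1}
print(dropped)      # [(2, 1)]
```

In this run, the event at time 4 is late but still counted because the watermark has not yet passed its window. The event at time 2 arrives after window `[0, 10)` has been finalized and is dropped, so the streaming result for that window is 3 while a batch recomputation over the full data would yield 4. That gap is the kind of inaccuracy a batch layer is meant to correct.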