Samza - Hedwig - Converting Hadoop M/R ETL systems to Stream Processing

tripadvisor.com

Hedwig - Converting Hadoop M/R ETL systems to Stream Processing

TripAdvisor is one of the world’s largest travel websites that provides hotel and restaurant reviews, accommodation bookings and other travel-related content. It produces and processes billions events everyday including billing records, reports, monitoring events and application notifications.

Prior to migrating to Samza, TripAdvisor used Hadoop to ETL its data. In this model, raw data was rolled up to hourly and daily snapshots in a number of stages with joins and sliding windows applied. Session data was then extracted from the daily snapshots. About 300 million sessions were produced daily. With this solution, the engineering team were faced with a few challenges

Long lag time to produce business-critical metrics
Difficult to debug and troubleshoot due to scripts, environments, etc.

The engineering team at TripAdvisor decided to replace the Hadoop solution with a multi-stage Samza pipeline.

Samza pipeline at TripAdvisor

In the new solution, after raw data is first collected by Flume and ingested through a Kafka cluster, it is parsed, cleansed and re-partitioned by the Lookback Router; then processing logic such as windowing, grouping, joining, fraud detection are applied by the Session Collector and the Fraud Collector, The pipeline uses Samza’s RocksDB store to perform stateful aggregations; finally the Uploader writes results to ElasticSearch, RedShift and Hive.

The new solution achieved significant improvements:

Processing time is reduced from 3 hours to 1 hour
Individual stages in the pipeline are scaled independently
Overall hardware requirement is reduced to ⅓ thanks to optimized usage
Much simpler to debug and test the solution

Key Samza features: Stateful processing, Windowing, Kafka-integration

More information

Converting Hadoop M/R ETL to use Stream Processing at TripAdvisor

tripadvisor.com

Hedwig - Converting Hadoop M/R ETL systems to Stream Processing

More Case Studies...