Beam on Samza Quick Start
Apache Beam is an open-source SDK that provides a state-of-the-art data processing API and model for both batch and streaming pipelines across multiple languages, namely Java, Python and Go. By collaborating with Beam, Samza offers the capability of executing the Beam API on Samza’s large-scale, stateful streaming engine. Samza currently supports the full Beam Java API; support for Python and Go is a work in progress.
Setting up the Word-Count Project
To get started, you need to install the Java 8 SDK as well as Apache Maven. After that, the easiest way to get a copy of the WordCount examples written with the Beam API is to use the following command to generate a simple Maven project:
This command creates a Maven project, word-count-beam, which contains a series of example pipelines that count words in text files:
Let’s use the MinimalWordCount example to demonstrate how to create a simple Beam pipeline:
In this example, we first create a Beam Pipeline object to build the graph of transformations to be executed. We then use the Read transform to consume a public data set and split it into words. Next, we apply Beam's built-in Count transform, which returns key/value pairs where each key represents a unique element from the input collection and each value represents the number of times that key appeared in the input collection. Finally, we format the results and write them to a file. A detailed walkthrough of the example code can be found here.
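To make the split, count, and format steps concrete, here is a minimal sketch of the same logic in plain Java collections. It deliberately does not use the Beam API, so it runs without any Beam or Samza dependency. The class name WordCountSketch, the method names, the "word: count" output format, and the regular expression used for splitting are all assumptions made for illustration and are not taken from the example project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, dependency-free illustration of the word-count steps.
// This is NOT Beam code; it only mirrors the semantics described above.
class WordCountSketch {

    // Step 1: split each line into words on runs of non-letter characters
    // (an assumed tokenization rule) and drop empty tokens.
    public static List<String> extractWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String token : line.split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
        }
        return words;
    }

    // Step 2: produce one key/value pair per unique word, where the value
    // is the number of times that word appeared in the input.
    public static Map<String, Long> countPerElement(List<String> words) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    // Step 3: format each pair as a printable line (assumed format).
    public static List<String> formatResults(Map<String, Long> counts) {
        List<String> formatted = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            formatted.add(entry.getKey() + ": " + entry.getValue());
        }
        return formatted;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "that is the question");
        for (String line : formatResults(countPerElement(extractWords(lines)))) {
            System.out.println(line);
        }
    }
}
```

The sketch uses a TreeMap, so its output is sorted by word. A distributed pipeline makes no such ordering promise, and it spreads the counting work across parallel workers rather than a single in-memory map.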
Let’s run the WordCount example with Samza using the following command:
After the pipeline finishes, you can check the output count files in the /tmp folder. Note that Beam generates multiple output files for parallel processing. If you prefer a single output file, update the code to use TextIO.write().withoutSharding().
For more examples, and for instructions on deploying your job locally, in standalone mode, or on a YARN cluster, see the code examples. Please don’t hesitate to reach out if you encounter any issues.