Samza SQL Quick Start
Overview
Samza SQL allows you to define your stream processing logic declaratively as a a SQL query. This allows you to create streaming pipelines without Java code or configuration unless you require user-defined functions (UDF).
You can run Samza SQL locally on your machine or on a YARN cluster.
Running Samza SQL on your local machine
The Samza SQL console allows you to experiment with Samza SQL locally on your machine.
Setup Kafka
Follow the instructions from the Kafka quickstart to start the zookeeper and Kafka server.
Let us create a Kafka topic named “ProfileChangeStream” for this demo.
./deploy/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic ProfileChangeStream
Download the Samza tools package from here and use the generate-kafka-events
script populate the stream with sample data.
cd samza-tools-<version>
./scripts/generate-kafka-events.sh -t ProfileChangeStream -e ProfileChange
Using the Samza SQL Console
The simplest SQL query is to read all events from a Kafka topic ProfileChangeStream
and print them to the console.
./scripts/samza-sql-console.sh --sql "insert into log.consoleoutput select * from kafka.ProfileChangeStream"
Next, let us project a few fields from the input stream.
./scripts/samza-sql-console.sh --sql "insert into log.consoleoutput select Name, OldCompany, NewCompany from kafka.ProfileChangeStream"
You can also filter messages in the input stream based on some predicate. In this example, we filter profiles currently working at LinkedIn, whose previous employer matches the regex .*soft
. The function RegexMatch(regex, company)
is an example of
a UDF that defines a predicate.
./scripts/samza-sql-console.sh --sql "insert into log.consoleoutput select Name as __key__, Name, NewCompany, RegexMatch('.*soft', OldCompany) from kafka.ProfileChangeStream where NewCompany = 'LinkedIn'"
Running Samza SQL on YARN
The hello-samza project has examples to get started with Samza on YARN. You can define your SQL query in a configuration file and submit it to a YARN cluster.
./deploy/samza/bin/run-app.sh --config-path=$PWD/deploy/samza/config/page-view-filter-sql.properties
How to write a UDF
Right now Samza SQL support Scalar UDFs which means that each UDF should act on each record at a time and return the result corresponding to the record. In essence it exhibits the behavior of 1 output to an input. Users need to implement the following interface to create a UDF.