Go through the Spark Streaming Programming Guide before beginning this section. In particular, it covers the concept of DStreams.
This section requires another dependency on the Spark Streaming library:
<dependency> <!-- Spark Streaming -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.1.0</version>
</dependency>
The earlier examples demonstrated how to compute statistics on an existing log file, but not how to monitor logs in real time. Spark Streaming enables that functionality.
To run the streaming examples, you will tail a log file into netcat to send to Spark.
This is not the ideal way to get data into Spark in a production system,
but is an easy workaround for a first Spark Streaming example. We will cover best practices for how to import data for Spark Streaming in Chapter 2.
In a terminal window, just run this command on a logfile which you will append to:
% tail -f [[YOUR_LOG_FILE]] | nc -lk 9999
If you don't have a live log file that is being updated on the fly, you can add lines manually with the included data file or your own log file:
% cat ../../data/apache.accesslog >> [[YOUR_LOG_FILE]]
When data is streamed into Spark, there are two common use cases covered:

1. Windowed calculations, where you only care about data received in the last N units of time. The window function of the streaming library splits the input into the desired time windows for easy processing, and the foreachRDD function gives you access to the RDDs created at each time interval.
2. Cumulative calculations, where you keep running statistics that are refreshed as new data streams in. The updateStateByKey function lets you maintain that state across batches.

In addition, DStreams offer transform functions which allow you to apply arbitrary RDD-to-RDD functions, and thus to reuse code from the batch mode of Spark; a sketch combining these functions follows.
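As a rough illustration of how these pieces fit together, here is a minimal sketch that reads the lines served by the netcat command above on localhost:9999 with a 10-second batch interval. The object name, checkpoint path, window sizes, and the countHitsByIp helper (which assumes the client IP is the first whitespace-separated field, as in an Apache access log) are illustrative assumptions, not part of the reference application.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._                // pair-RDD functions (Spark 1.x)
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair-DStream functions (Spark 1.x)

object StreamingLogAnalyzerSketch {

  // Batch-style logic that can be reused via transform(): count hits per client IP.
  // Assumes the first whitespace-separated field of each line is the IP address.
  def countHitsByIp(lines: RDD[String]): RDD[(String, Long)] =
    lines.map(line => (line.split(" ")(0), 1L)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // local[2]: the socket receiver occupies one core, so local runs need at least two.
    val conf = new SparkConf().setMaster("local[2]").setAppName("Streaming Log Analyzer")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/log-analyzer-checkpoint")     // required by updateStateByKey

    // Lines sent by: tail -f [[YOUR_LOG_FILE]] | nc -lk 9999
    val logLines = ssc.socketTextStream("localhost", 9999)

    // 1. Windowed calculation: statistics over the last 30 seconds, computed every 10 seconds.
    logLines.window(Seconds(30), Seconds(10)).foreachRDD { rdd =>
      println(s"Lines received in the last 30 seconds: ${rdd.count()}")
    }

    // 2. Cumulative calculation: a running total of all lines seen so far.
    val runningTotal = logLines
      .map(_ => ("lines", 1L))
      .updateStateByKey((newCounts: Seq[Long], state: Option[Long]) =>
        Some(newCounts.sum + state.getOrElse(0L)))
    runningTotal.print()

    // 3. Code reuse: apply the batch function to each micro-batch with transform().
    logLines.transform(countHitsByIp _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}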