Databricks Spark Reference Applications

Kafka

Although the previous example picks up new log files right away, the log files aren't copied over until long after the HTTP requests in them actually occurred. That enables auto-refresh of log data, but it still isn't real time. To process logs in real time, we need a way to send log lines over immediately. Kafka is a high-throughput distributed messaging system that is well suited to that use case. Spark provides an external module for importing data from Kafka.

Here is some useful documentation to set up Kafka for Spark Streaming:

Kafka Documentation
KafkaUtils class in the external module of the Spark project - This external module imports data from Kafka into Spark Streaming.
Spark Streaming Example of using Kafka - This is an example that demonstrates how to call KafkaUtils.
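
To give a rough idea of what calling KafkaUtils can look like, here is a minimal Scala sketch. It assumes the spark-streaming-kafka-0-10 integration module is on the classpath. Older Spark releases shipped a different, receiver-based KafkaUtils.createStream API, so check the documentation for your version. The broker address, consumer group, and topic name are hypothetical placeholders, and the sketch assumes each Kafka message value is one raw log line.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaLogStream {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val conf = new SparkConf().setAppName("KafkaLogStream")
    // Group incoming messages into 10-second batches.
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical connection settings; replace with your own.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "log-analyzer",
      "auto.offset.reset" -> "latest"
    )
    val topics = Seq("access-logs") // hypothetical topic name

    // Create a direct stream that reads from Kafka without a receiver.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )

    // Assumption: each record value is one raw log line.
    val logLines = stream.map(record => record.value)

    // Print the number of log lines received in each batch.
    logLines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

From here, the logLines stream could be fed into the same kind of parsing and statistics logic used for file-based logs, since it is an ordinary DStream of strings.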