Databricks Spark Reference Applications

Kafka

The previous example picks up new log files right away, but the log files themselves aren't copied over until long after the HTTP requests they record actually occurred. That enables auto-refresh of log data, but it still isn't real time. To process logs in real time, we need a way to send log lines over immediately. Kafka is a high-throughput distributed messaging system that is well suited to that use case. Spark provides an external module for importing data from Kafka.
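As a rough illustration of reading log lines from Kafka, here is a minimal sketch. It uses PySpark's Structured Streaming Kafka source rather than the DStream-based Spark Streaming API, so treat it as a swapped-in technique and not this application's actual code. The broker address `localhost:9092`, the topic name `access-logs`, and the helper names `kafka_source_options`, `read_log_lines`, and `main` are all assumptions made for the example.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Build reader options for Spark's Kafka source.

    The option keys are the ones the Structured Streaming Kafka source
    expects; the values passed in are assumptions for this sketch.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,  # Kafka broker list
        "subscribe": topic,                            # topic carrying log lines
        "startingOffsets": "latest",                   # only read newly sent lines
    }


def read_log_lines(spark, options):
    """Return a streaming DataFrame with a single string column, `line`.

    `spark` is an existing SparkSession. Kafka delivers values as bytes,
    so each value is cast to a string to recover the raw log line.
    """
    return (
        spark.readStream.format("kafka")
        .options(**options)
        .load()
        .selectExpr("CAST(value AS STRING) AS line")
    )


def main():
    # Imported here so the helpers above can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("KafkaLogLines").getOrCreate()
    # Hypothetical broker address and topic name.
    options = kafka_source_options("localhost:9092", "access-logs")
    lines = read_log_lines(spark, options)
    # Print each micro-batch of log lines to the console as it arrives.
    query = lines.writeStream.format("console").start()
    query.awaitTermination()
```

Calling `main()` requires a running Kafka broker and a Spark installation with the Kafka source package on the classpath; the two helper functions can be reused with any existing SparkSession.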

Here is some useful documentation to set up Kafka for Spark Streaming: