Databricks Spark Reference Applications

Section 1: Introduction to Apache Spark

In this section, we demonstrate how simple it is to analyze web logs using Apache Spark. We'll show how to load a Resilient Distributed Dataset (RDD) of access log lines and use Spark transformations and actions to compute some statistics for web server monitoring. In the process, we'll introduce the Spark SQL and Spark Streaming libraries.

The code snippets in this explanation are in Java 8. Sample code in Java 6, Scala, and Python is also included in this directory. Each of those folders contains a README with instructions on how to build and run the examples, along with the build files that declare all the required dependencies.
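To give a feel for the kind of statistics computed in this section before any Spark code appears, here is a minimal sketch in plain Java 8 streams, not Spark. It is only an analogy: the `map` and `filter` steps play the role of Spark transformations, and `sum` and `count` play the role of actions. The class name `LogStatsSketch`, the simplified log pattern, and the sample lines are all assumptions made up for illustration; they are not taken from the reference application.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogStatsSketch {
    // Hypothetical, simplified pattern for one access log line:
    // host ident user [timestamp] "method path protocol" status bytes
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+)$");

    // Returns {status, contentSize}, or null when the line does not match.
    static long[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new long[] { Long.parseLong(m.group(8)), Long.parseLong(m.group(9)) };
    }

    // Sum of content sizes over all parseable lines (map, filter, then sum).
    static long totalBytes(List<String> lines) {
        return lines.stream()
            .map(LogStatsSketch::parse)
            .filter(Objects::nonNull)
            .mapToLong(r -> r[1])
            .sum();
    }

    // Number of parseable lines with the given response status.
    static long countStatus(List<String> lines, int status) {
        return lines.stream()
            .map(LogStatsSketch::parse)
            .filter(Objects::nonNull)
            .filter(r -> r[0] == status)
            .count();
    }

    public static void main(String[] args) {
        // Made-up sample lines, including one that is deliberately malformed.
        List<String> lines = Arrays.asList(
            "127.0.0.1 - - [21/Jul/2014:09:00:00 -0800] \"GET /index.html HTTP/1.1\" 200 1024",
            "127.0.0.1 - - [21/Jul/2014:09:00:01 -0800] \"GET /missing HTTP/1.1\" 404 512",
            "127.0.0.1 - - [21/Jul/2014:09:00:02 -0800] \"GET /about.html HTTP/1.1\" 200 2048",
            "not a log line");
        System.out.println("Total bytes: " + totalBytes(lines));
        System.out.println("200 responses: " + countStatus(lines, 200));
    }
}
```

In a Spark application the list would be replaced by an RDD of lines loaded from a file, so the same pipeline could run across a cluster rather than in a single JVM.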

This section covers the following topics:

  1. First Log Analyzer in Spark - A first standalone Spark application for log analysis.
  2. Spark SQL - This example computes the same statistics as the first example, but uses SQL syntax instead of Spark transformations and actions.
  3. Spark Streaming - This example shows how to calculate log statistics using the Spark Streaming library.