Databricks Spark Reference Applications


S3 is Amazon Web Services' solution for storing large files in the cloud. On a production system, you want your Amazon EC2 compute nodes in the same region as your S3 files, for both speed and cost reasons. While S3 files can be read from machines outside AWS, doing so is slow and expensive (Amazon charges different S3 data transfer rates for reads within AWS than for reads from elsewhere on the internet).

See Running Spark on EC2 if you want to launch a Spark cluster on AWS (charges apply).

If you choose to run this example with a local Spark cluster on your machine rather than on EC2 compute nodes, use a small input data source, since the S3 files must be transferred out of AWS to be read.

  1. Sign up for an Amazon Web Services Account.
  2. Load example log files to S3.
    • Log into the AWS console for S3
    • Create an S3 bucket.
    • Upload a couple of example log files to that bucket.
    • Your files will be at the path: s3n://YOUR_BUCKET_NAME/YOUR_LOGFILE.log
  3. Configure your security credentials for AWS:
    • Create and download your security credentials
    • Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to the correct values on all machines in your cluster. They can also be set programmatically on your SparkContext's Hadoop configuration, like this:
jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY);
jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY);
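The credential setup above can be sketched without a live cluster. The class and method names below are hypothetical illustrations, not part of the reference app; in a real job the resulting entries would be applied with hadoopConfiguration().set(...):

```java
import java.util.HashMap;
import java.util.Map;

public class S3Credentials {
    // Hypothetical helper: collect the s3n credential properties that would
    // be applied to the SparkContext's Hadoop configuration.
    public static Map<String, String> s3nProperties(String accessKey, String secretKey) {
        Map<String, String> props = new HashMap<>();
        props.put("fs.s3n.awsAccessKeyId", accessKey);
        props.put("fs.s3n.awsSecretAccessKey", secretKey);
        return props;
    }

    public static void main(String[] args) {
        // Read the keys from the environment variables named above,
        // falling back to placeholders if they are unset.
        String accessKey = System.getenv().getOrDefault("AWS_ACCESS_KEY_ID", "YOUR_ACCESS_KEY");
        String secretKey = System.getenv().getOrDefault("AWS_SECRET_ACCESS_KEY", "YOUR_SECRET_KEY");
        s3nProperties(accessKey, secretKey)
            .forEach((key, value) -> System.out.println(key + " is set"));
    }
}
```

Reading the keys from the environment first keeps the secrets out of your source code; hard-code them only for quick local experiments.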

Now, run the example, passing in the s3n path to your files.
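As a sanity check before submitting, the s3n path format from step 2 can be validated with a small parser. This helper is a hypothetical sketch, not part of the reference app:

```java
public class S3nPath {
    // Hypothetical helper: split an s3n://BUCKET/KEY URL into its bucket
    // and key components, failing fast on anything malformed.
    public static String[] bucketAndKey(String s3nUrl) {
        final String scheme = "s3n://";
        if (!s3nUrl.startsWith(scheme)) {
            throw new IllegalArgumentException("expected an s3n:// URL: " + s3nUrl);
        }
        String rest = s3nUrl.substring(scheme.length());
        int slash = rest.indexOf('/');
        if (slash <= 0 || slash == rest.length() - 1) {
            throw new IllegalArgumentException("expected s3n://BUCKET/KEY: " + s3nUrl);
        }
        return new String[] { rest.substring(0, slash), rest.substring(slash + 1) };
    }

    public static void main(String[] args) {
        String[] parts = bucketAndKey("s3n://YOUR_BUCKET_NAME/YOUR_LOGFILE.log");
        // prints: bucket=YOUR_BUCKET_NAME key=YOUR_LOGFILE.log
        System.out.println("bucket=" + parts[0] + " key=" + parts[1]);
    }
}
```

Catching a malformed path this way is cheaper than discovering it through a failed Hadoop filesystem lookup at job launch.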