Databricks Spark Reference Applications

Exporting Small Datasets

If the data you are exporting out of Spark is small, you can use an action to convert the RDD into objects in memory on the driver program, and then write that output directly to any data storage solution of your choosing. This section walks through example code that writes the log statistics to a file. Recall that we called the take(N) action, where N is some finite number, instead of the collect() action: take(N) bounds how much data is brought back to the driver, so the output fits in memory no matter how big the input data set may be. This is good practice.
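To make that distinction concrete, here is a minimal standalone sketch. The class name TakeVsCollect and the toy data set are illustrative only and are not part of the reference application:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TakeVsCollect {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("Take vs Collect").setMaster("local"));

    JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

    // take(N) ships at most N elements to the driver, so driver memory
    // use is bounded no matter how large the RDD is.
    List<Integer> firstThree = numbers.take(3);

    // collect() ships the ENTIRE RDD to the driver - safe only when the
    // dataset is known to be small.
    List<Integer> everything = numbers.collect();

    System.out.println("take(3):   " + firstThree);
    System.out.println("collect(): " + everything);

    sc.stop();
  }
}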

Writing these statistics to a file may not be that useful in itself; in practice, you might write them to a database for your presentation layer to access.

LogStatistics logStatistics = logAnalyzerRDD.processRdd(accessLogs);

// Write the statistics to the output file given as the second program argument.
String outputFile = args[1];
Writer out = new BufferedWriter(
    new OutputStreamWriter(new FileOutputStream(outputFile)));

// The content size stats are a (sum, count, min, max) tuple, so the
// average is the sum (_1) divided by the count (_2).
Tuple4<Long, Long, Long, Long> contentSizeStats =
    logStatistics.getContentSizeStats();
out.write(String.format("Content Size Avg: %s, Min: %s, Max: %s\n",
    contentSizeStats._1() / contentSizeStats._2(),
    contentSizeStats._3(),
    contentSizeStats._4()));

// Counts of each HTTP response code seen in the logs.
List<Tuple2<Integer, Long>> responseCodeToCount =
    logStatistics.getResponseCodeToCount();
out.write(String.format("Response code counts: %s\n", responseCodeToCount));

// IP addresses that accessed the server more than 10 times.
List<String> ipAddresses = logStatistics.getIpAddresses();
out.write(String.format("IPAddresses > 10 times: %s\n", ipAddresses));

// The most frequently requested endpoints.
List<Tuple2<String, Long>> topEndpoints = logStatistics.getTopEndpoints();
out.write(String.format("Top Endpoints: %s\n", topEndpoints));

out.close();
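One detail worth noting: if any of the writes above throws an IOException, the final out.close() is never reached and the file handle leaks. Wrapping the writer in a try-with-resources block (Java 7 and later) closes it automatically, making the explicit close() unnecessary. A sketch of the same output logic written that way:

try (Writer out = new BufferedWriter(
    new OutputStreamWriter(new FileOutputStream(outputFile)))) {
  out.write(String.format("Response code counts: %s\n", responseCodeToCount));
  // ... the remaining writes from above ...
}  // out is closed here even if one of the writes throws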

Now run LogAnalyzerExportSmallData.java. Then try modifying it to write to a database of your own choosing; a sketch of one way to do that over JDBC follows.
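As a starting point, here is a sketch of how the response-code counts might be written to a relational database over JDBC instead of a file. The JDBC URL, the credentials, the class name ResponseCodeExporter, and the response_code_counts table are all placeholder assumptions to replace with your own, not part of the reference application:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

import scala.Tuple2;

public class ResponseCodeExporter {
  // Assumes a table created with:
  //   CREATE TABLE response_code_counts (response_code INT, count BIGINT)
  public static void export(List<Tuple2<Integer, Long>> responseCodeToCount)
      throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost:5432/logs", "user", "password");
         PreparedStatement insert = conn.prepareStatement(
             "INSERT INTO response_code_counts (response_code, count) VALUES (?, ?)")) {
      for (Tuple2<Integer, Long> pair : responseCodeToCount) {
        insert.setInt(1, pair._1());
        insert.setLong(2, pair._2());
        insert.addBatch();  // batch the inserts to avoid one round trip per row
      }
      insert.executeBatch();
    }
  }
}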