Tutorial: Modern Spark DataFrame and Dataset



Apache Spark has changed substantially in the past year, from the new APIs introduced in Spark 1.4 to the execution improvements and further API refinements in 2.0. In this intermediate-level tutorial, I'll address the question of which Spark APIs to use through a series of brief technical explanations and demos that highlight best practices, the latest APIs, and new features.

We'll examine how Dataset and DataFrame behave in Spark 2.0, explore Whole-Stage Code Generation, and walk through a simple example of Spark 2.0 Structured Streaming (streaming with DataFrames) that you can run in your own free instance of Databricks.

Follow Along: You can run all the examples in this tutorial yourself. Just register for a free instance of Databricks Community Edition, and import this notebook.

Spark Training from ProTech

If you're just getting started with Spark development, see our 3-day Spark Programming course page for upcoming public classes, or request onsite training for your team.

Published June 13, 2016