Querying
Table of contents
Introduction
Within Apache Spark, engineers & scientists can query data programmatically, either via SQL or via Dataset objects.
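As a minimal sketch of the two routes, the snippet below runs one query twice, once through `spark.sql` and once through typed Dataset operations. The `Person` case class, the sample rows, the `people` view name, and the local `master` setting are all assumptions made for illustration; none of them come from this project.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, introduced only for this illustration.
case class Person(name: String, age: Int)

object SqlVersusDatasetSketch {

  // Runs the same query twice and returns (names via SQL, names via Dataset).
  def run(): (Seq[String], Seq[String]) = {
    val spark = SparkSession.builder()
      .appName("sql-versus-dataset-sketch")
      .master("local[*]") // assumption: a local session, for experimentation only
      .getOrCreate()
    import spark.implicits._ // encoders for case classes and primitives

    try {
      // Hypothetical sample data.
      val people: Dataset[Person] =
        Seq(Person("Ada", 36), Person("Linus", 17)).toDS()

      // Route 1: SQL against a temporary view of the same data.
      people.createOrReplaceTempView("people")
      val viaSql: Dataset[String] =
        spark.sql("SELECT name FROM people WHERE age >= 18").as[String]

      // Route 2: typed Dataset operations; field access is checked at compile time.
      val viaDataset: Dataset[String] =
        people.filter(_.age >= 18).map(_.name)

      // collect() is an action: only here are the two queries actually executed.
      (viaSql.collect().toSeq, viaDataset.collect().toSeq)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (sqlNames, datasetNames) = run()
    println(s"SQL: $sqlNames, Dataset: $datasetNames")
  }
}
```

Both routes should return the same single name here. The practical difference is that a misspelt column in the SQL string only fails at run time, whereas a misspelt field in the Dataset version fails to compile.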
Why Datasets?
A valid question is: why Dataset objects? Amongst other characteristics, Dataset objects are strongly typed and
“... are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner.” [Dataset]
In addition to this ScalaDoc definition, the Databricks articles What are Datasets? & Introduction to Datasets are helpful references. The Parallel Examples section has links to parallel Dataset & SQL queries.
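The lazy behaviour described in the ScalaDoc quotation can be observed directly. The sketch below is illustrative only: it assumes a local session and uses `spark.range` to generate hypothetical data. Defining the filter merely extends the logical plan, `explain(true)` prints the logical and physical plans, and the `count()` action is what triggers the computation.

```scala
import org.apache.spark.sql.SparkSession

object LazyDatasetSketch {

  // Builds a small plan, prints it, then executes it; returns the number of even ids.
  def run(): Long = {
    val spark = SparkSession.builder()
      .appName("lazy-dataset-sketch")
      .master("local[*]") // assumption: a local session, for experimentation only
      .getOrCreate()
    import spark.implicits._ // enables the $"column" syntax

    try {
      // Hypothetical data: a Dataset with a single column "id" holding 0 to 999.
      val numbers = spark.range(0, 1000)

      // Transformation: this only describes the computation, no job runs yet.
      val evens = numbers.filter($"id" % 2 === 0)

      // Prints the parsed, analysed, and optimised logical plans plus the physical plan.
      evens.explain(true)

      // Action: the plan is now optimised and executed.
      evens.count()
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(run()) // prints 500
  }
}
```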