Querying


Table of contents

  1. Introduction
  2. Why Datasets?

Introduction

In Apache Spark, engineers and scientists can query data programmatically, either with SQL or through Dataset objects.
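As a rough illustration of the two options, the sketch below runs the same query once through SQL and once through the Dataset API. The `Reading` case class, its fields, the view name `readings`, and the sample values are all hypothetical, invented here for the example; a local Spark session with the Spark SQL dependency on the classpath is assumed.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Reading(station: String, temperature: Double)

object QueryingSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is sufficient for experimenting.
    val spark = SparkSession.builder()
      .appName("querying-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, turned into a strongly typed Dataset[Reading].
    val readings: Dataset[Reading] = Seq(
      Reading("A", 12.5),
      Reading("B", 21.0),
      Reading("C", 25.3)
    ).toDS()

    // Option 1: SQL against a temporary view of the same data.
    readings.createOrReplaceTempView("readings")
    val viaSql = spark.sql(
      "SELECT station FROM readings WHERE temperature > 20.0")

    // Option 2: the Dataset API, where field access is checked by the compiler.
    val viaDataset: Dataset[String] =
      readings.filter(_.temperature > 20.0).map(_.station)

    viaSql.show()
    viaDataset.show()

    spark.stop()
  }
}
```

Both queries select the same stations. The SQL text is only validated at run time, whereas a misspelt field name in the Dataset version would be rejected at compile time.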


Why Datasets?

A fair question is: why Dataset objects? Among other characteristics, Dataset objects are strongly typed and

“... are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner.” [Dataset]

In addition to this ScalaDoc definition, the Databricks articles "What are Datasets?" and "Introduction to Datasets" are helpful references. The Parallel Examples section links to parallel Dataset and SQL queries.
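The lazy behaviour described in the ScalaDoc quote can be observed with a minimal sketch such as the one below. It assumes a local Spark session; the object name and the choice of data (the numbers 0 to 9) are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object LazySketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is sufficient for experimenting.
    val spark = SparkSession.builder()
      .appName("lazy-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A Dataset with a single column, id, holding the values 0 to 9.
    val numbers = spark.range(0, 10)

    // Transformation: this only extends the logical plan; no data is processed yet.
    val evens = numbers.filter($"id" % 2 === 0)

    // Prints the logical and physical plans without running the query.
    evens.explain(true)

    // Action: the optimized plan is executed here.
    val howMany = evens.count()
    println(howMany) // 5, i.e. the values 0, 2, 4, 6, 8

    spark.stop()
  }
}
```

Comparing the output of `explain(true)` with the point at which `count()` returns is one way to see the separation between describing a computation and executing it.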