Parallel Examples
Examples
Each entry in the sample programs column of the tables below links to an example within an Apache Spark program. The programs explore:
- Apple stock prices
- A subset of CrunchBase companies data
- Washington D.C. Capital BikeShare data
- United States Census Bureau buildings data
Filtering Operators
The examples do not yet cover two groups of operators: (a) the logical operators and, or, not; (b) the filtering operators distinct, fetch.
| sample programs | comment |
|---|---|
| where, filter, limit | Explicit filtering operators. |
| like, in, between, is null | Logical operators. |
| $=$, $\neq$, $\gt$, $\lt$, $\ge$, $\le$ | Relational operators. |
Aggregating, Ordering
| sample programs | comment |
|---|---|
| count(), sum(), avg(), min(), max() | Aggregate functions. |
| order by | |
Conditionals
| sample programs | comment |
|---|---|
| case statement | Read more about the case statement |
Grouping
| sample programs | comment |
|---|---|
| group by, having | Dataset[Row] does not have a having function; instead, there are effective proxy functions, e.g., applying where or filter to the aggregated result. Beware of the SQL query structure when using having in Spark SQL: the having clause follows group by. |
| roll up | For hierarchical arithmetic. |
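
The having note above can be sketched as follows. This is a minimal illustration, not the repository's code: the object name HavingProxySketch, the column names station and trips, the view name bike_trips, and the rows are all invented, and it assumes Spark SQL is on the classpath. It shows one possible proxy (aggregate, then where) next to the equivalent Spark SQL query.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

// Hypothetical sketch: names and data are invented for illustration.
object HavingProxySketch {

  // Dataset form: aggregate first, then filter on the aggregated column.
  // The trailing where(...) plays the role of HAVING.
  def viaDataset(trips: DataFrame, minimum: Long): DataFrame =
    trips
      .groupBy(col("station"))
      .agg(sum(col("trips")).as("total"))
      .where(col("total") >= minimum)

  // Spark SQL form: HAVING comes after GROUP BY.
  def viaSql(spark: SparkSession, trips: DataFrame, minimum: Long): DataFrame = {
    trips.createOrReplaceTempView("bike_trips")
    spark.sql(
      s"""SELECT station, SUM(trips) AS total
         |FROM bike_trips
         |GROUP BY station
         |HAVING SUM(trips) >= $minimum""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("having-proxy-sketch")
      .getOrCreate()
    import spark.implicits._

    // Invented rows: station totals are A = 12, B = 2, C = 10.
    val trips = Seq(("A", 5L), ("A", 7L), ("B", 2L), ("C", 9L), ("C", 1L))
      .toDF("station", "trips")

    // Both forms keep the stations whose total is at least 10 (A and C).
    viaDataset(trips, 10L).show()
    viaSql(spark, trips, 10L).show()

    spark.stop()
  }
}
```

Filtering after agg keeps the condition on the aggregated column, which is what distinguishes it from a where applied before groupBy.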
Window Operations
| sample programs | comment |
|---|---|
| sum().over() | |
| rank(), dense_rank() | Read more about rank() and dense_rank() |
| row_number() | Read more about row_number() |
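
To make the difference between the window functions in this table concrete, here is a small sketch. It is an illustration under stated assumptions rather than one of the linked programs: the object name WindowSketch, the columns region and amount, and the rows are invented, and Spark SQL is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, rank, row_number, sum}

// Hypothetical sketch: names and data are invented for illustration.
object WindowSketch {

  def withRankings(sales: DataFrame): DataFrame = {
    // Unordered window: one frame per region, used for a per-region total.
    val byRegion = Window.partitionBy(col("region"))
    // Ordered window: rows in each region, largest amount first.
    val ordered = byRegion.orderBy(col("amount").desc)

    sales
      .withColumn("region_total", sum(col("amount")).over(byRegion))
      .withColumn("rank", rank().over(ordered))             // ties share a rank, gaps follow
      .withColumn("dense_rank", dense_rank().over(ordered)) // ties share a rank, no gaps
      .withColumn("row_number", row_number().over(ordered)) // always 1, 2, 3, ...
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("window-sketch")
      .getOrCreate()
    import spark.implicits._

    // Invented rows: the two 30s in "east" tie for first place.
    val sales = Seq(("east", 30), ("east", 30), ("east", 10), ("west", 20))
      .toDF("region", "amount")

    withRankings(sales).show()

    spark.stop()
  }
}
```

With these rows, the east record with amount 10 gets rank 3 but dense_rank 2, because rank() skips a position after the tie and dense_rank() does not. The order of row_number() among the tied rows is not deterministic.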
Upcoming
- Combinatorial queries, via cube().
- Joins. More join examples, e.g., right outer join and full outer join, as well as left semi join, which filters the left table for records whose keys are present in the right table, and left anti join, which filters the left table for records whose keys are NOT present in the right table.
- Pivoting. Pivoting via Dataset objects is explicit and elegant. The DataReconfiguration class, which reconfigures the data used by the buildings project/module, uses the Dataset pivot function for a reconfiguration step.
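
The left semi and left anti joins mentioned in the list above could look roughly like this. It is a sketch only, since the repository's join examples are still upcoming: the object name SemiAntiJoinSketch, the id, name and city columns, and the rows are invented, and Spark SQL is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch: names and data are invented for illustration.
object SemiAntiJoinSketch {

  // Left rows whose id also appears in the right table.
  def matched(left: DataFrame, right: DataFrame): DataFrame =
    left.join(right, Seq("id"), "left_semi")

  // Left rows whose id does NOT appear in the right table.
  def unmatched(left: DataFrame, right: DataFrame): DataFrame =
    left.join(right, Seq("id"), "left_anti")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("semi-anti-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Invented rows: ids 2 and 3 occur in both tables.
    val left = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
    val right = Seq((2, "x"), (3, "y"), (4, "z")).toDF("id", "city")

    matched(left, right).show()   // ids 2 and 3, left columns only
    unmatched(left, right).show() // id 1, left columns only

    spark.stop()
  }
}
```

Both join types act as filters on the left table: no column of the right table appears in the result.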