Category : Microsoft Azure Data Engineering | Sub Category : Practice Assessment for Exam DP-203 - Data Engineering on Microsoft Azure | By Prasad Bonam Last updated: 2023-09-11 12:04:37 Viewed : 639
Window functions in Spark SQL (Databricks):
Window functions in Spark SQL (Databricks) allow you to perform calculations across a set of table rows that are related to the current row. Here are some examples of window functions:
Assuming you have a DataFrame or table named mytable
, you can use the following examples to demonstrate various window functions:
Basic Window Function:
sqlSELECT
column1,
column2,
ROW_NUMBER() OVER (ORDER BY column1) AS row_num
FROM mytable;
PARTITION BY Clause:
sqlSELECT
column1,
column2,
SUM(column3) OVER (PARTITION BY column1) AS partition_sum
FROM mytable;
ORDER BY Clause:
sqlSELECT
column1,
column2,
SUM(column3) OVER (PARTITION BY column1 ORDER BY column2) AS running_total
FROM mytable;
RANGE and ROWS Clauses:
RANGE BETWEEN
or ROWS BETWEEN
to calculate aggregations within a specific range.sqlSELECT
column1,
column2,
SUM(column3) OVER (PARTITION BY column1 ORDER BY column2
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum
FROM mytable;
DENSE_RANK and RANK Functions:
sqlSELECT
column1,
column2,
DENSE_RANK() OVER (PARTITION BY column1 ORDER BY column2) AS dense_rank,
RANK() OVER (PARTITION BY column1 ORDER BY column2) AS rank
FROM mytable;
LAG and LEAD Functions:
sqlSELECT
column1,
column2,
LAG(column2) OVER (PARTITION BY column1 ORDER BY column2) AS prev_value,
LEAD(column2) OVER (PARTITION BY column1 ORDER BY column2) AS next_value
FROM mytable;
These examples demonstrate various use cases for window functions in Spark SQL (Databricks). Window functions are powerful tools for performing complex calculations and aggregations within partitions of your data, which can be very useful for advanced analytics and reporting tasks.
Here are examples of Spark SQL queries in Databricks that use window functions along with sample outputs. These examples assume you have a table named mytable
with the appropriate columns.
Basic Window Function - ROW_NUMBER:
sqlSELECT
column1,
column2,
ROW_NUMBER() OVER (ORDER BY column1) AS row_num
FROM mytable;
Sample Output:
css+-------+-------+-------+
|column1|column2|row_num|
+-------+-------+-------+
|A |X |1 |
|A |Y |2 |
|B |Z |3 |
+-------+-------+-------+
PARTITION BY Clause - SUM:
sqlSELECT
column1,
column2,
SUM(column3) OVER (PARTITION BY column1) AS partition_sum
FROM mytable;
Sample Output:
css+-------+-------+--------------+
|column1|column2|partition_sum |
+-------+-------+--------------+
|A |X |8 |
|A |Y |8 |
|B |Z |7 |
+-------+-------+--------------+
ORDER BY Clause - Running Total:
sqlSELECT
column1,
column2,
SUM(column3) OVER (PARTITION BY column1 ORDER BY column2) AS running_total
FROM mytable;
Sample Output:
css+-------+-------+--------------+
|column1|column2|running_total |
+-------+-------+--------------+
|A |X |3 |
|A |Y |8 |
|B |Z |7 |
+-------+-------+--------------+
DENSE_RANK and RANK Functions:
sqlSELECT
column1,
column2,
DENSE_RANK() OVER (PARTITION BY column1 ORDER BY column2) AS dense_rank,
RANK() OVER (PARTITION BY column1 ORDER BY column2) AS rank
FROM mytable;
Sample Output:
css+-------+-------+-----------+-----+
|column1|column2|dense_rank |rank |
+-------+-------+-----------+-----+
|A |X |1 |1 |
|A |Y |2 |2 |
|B |Z |1 |1 |
+-------+-------+-----------+-----+
LAG and LEAD Functions:
sqlSELECT
column1,
column2,
LAG(column2) OVER (PARTITION BY column1 ORDER BY column2) AS prev_value,
LEAD(column2) OVER (PARTITION BY column1 ORDER BY column2) AS next_value
FROM mytable;
Sample Output:
sql+-------+-------+-----------+-----------+
|column1|column2|prev_value |next_value |
+-------+-------+-----------+-----------+
|A |X |null |Y |
|A |Y |X |null |
|B |Z |null |null |
+-------+-------+-----------+-----------+
These examples demonstrate the use of window functions in Spark SQL (Databricks) with sample outputs to help you understand the results of each query. Window functions are valuable for performing complex calculations and analytics on your data within specific partitions or ordered sets of rows.