RunnerDev | Home Page

Window functions in Spark SQL (Databricks)

Category : Microsoft Azure Data Engineering | Sub Category : Practice Assessment for Exam DP-203 - Data Engineering on Microsoft Azure | By Prasad Bonam Last updated: 2023-09-11 12:04:37 Viewed : 292

Window functions in Spark SQL (Databricks):

Window functions in Spark SQL (Databricks) allow you to perform calculations across a set of table rows that are related to the current row. Here are some examples of window functions:

Assuming you have a DataFrame or table named mytable, you can use the following examples to demonstrate various window functions:

Basic Window Function:

Calculate the row number for each row within a partition.

sql
SELECT
    column1,
    column2,
    ROW_NUMBER() OVER (ORDER BY column1) AS row_num
FROM mytable;

PARTITION BY Clause:

Calculate the sum of a column within different partitions.

sql
SELECT
    column1,
    column2,
    SUM(column3) OVER (PARTITION BY column1) AS partition_sum
FROM mytable;

ORDER BY Clause:

Calculate a running total within a partition based on a specific order.

sql
SELECT
    column1,
    column2,
    SUM(column3) OVER (PARTITION BY column1 ORDER BY column2) AS running_total
FROM mytable;

RANGE and ROWS Clauses:

Define the frame specification using RANGE BETWEEN or ROWS BETWEEN to calculate aggregations within a specific range.

sql
SELECT
    column1,
    column2,
    SUM(column3) OVER (PARTITION BY column1 ORDER BY column2
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum
FROM mytable;

DENSE_RANK and RANK Functions:

Calculate the dense rank and rank of rows within a partition.

sql
SELECT
    column1,
    column2,
    DENSE_RANK() OVER (PARTITION BY column1 ORDER BY column2) AS dense_rank,
    RANK() OVER (PARTITION BY column1 ORDER BY column2) AS rank
FROM mytable;

LAG and LEAD Functions:

Retrieve values from previous and subsequent rows within a partition.

sql
SELECT
    column1,
    column2,
    LAG(column2) OVER (PARTITION BY column1 ORDER BY column2) AS prev_value,
    LEAD(column2) OVER (PARTITION BY column1 ORDER BY column2) AS next_value
FROM mytable;

These examples demonstrate various use cases for window functions in Spark SQL (Databricks). Window functions are powerful tools for performing complex calculations and aggregations within partitions of your data, which can be very useful for advanced analytics and reporting tasks.

Here are examples of Spark SQL queries in Databricks that use window functions along with sample outputs. These examples assume you have a table named mytable with the appropriate columns.

Basic Window Function - ROW_NUMBER:

sql
SELECT
    column1,
    column2,
    ROW_NUMBER() OVER (ORDER BY column1) AS row_num
FROM mytable;

Sample Output:

css
+-------+-------+-------+
|column1|column2|row_num|
+-------+-------+-------+
|A      |X      |1      |
|A      |Y      |2      |
|B      |Z      |3      |
+-------+-------+-------+

PARTITION BY Clause - SUM:

sql
SELECT
    column1,
    column2,
    SUM(column3) OVER (PARTITION BY column1) AS partition_sum
FROM mytable;

Sample Output:

css
+-------+-------+--------------+
|column1|column2|partition_sum |
+-------+-------+--------------+
|A      |X      |8             |
|A      |Y      |8             |
|B      |Z      |7             |
+-------+-------+--------------+

ORDER BY Clause - Running Total:

sql
SELECT
    column1,
    column2,
    SUM(column3) OVER (PARTITION BY column1 ORDER BY column2) AS running_total
FROM mytable;

Sample Output:

css
+-------+-------+--------------+
|column1|column2|running_total |
+-------+-------+--------------+
|A      |X      |3             |
|A      |Y      |8             |
|B      |Z      |7             |
+-------+-------+--------------+

DENSE_RANK and RANK Functions:

sql
SELECT
    column1,
    column2,
    DENSE_RANK() OVER (PARTITION BY column1 ORDER BY column2) AS dense_rank,
    RANK() OVER (PARTITION BY column1 ORDER BY column2) AS rank
FROM mytable;

Sample Output:

css
+-------+-------+-----------+-----+
|column1|column2|dense_rank |rank |
+-------+-------+-----------+-----+
|A      |X      |1          |1    |
|A      |Y      |2          |2    |
|B      |Z      |1          |1    |
+-------+-------+-----------+-----+

LAG and LEAD Functions:

sql
SELECT
    column1,
    column2,
    LAG(column2) OVER (PARTITION BY column1 ORDER BY column2) AS prev_value,
    LEAD(column2) OVER (PARTITION BY column1 ORDER BY column2) AS next_value
FROM mytable;

Sample Output:

sql
+-------+-------+-----------+-----------+
|column1|column2|prev_value |next_value |
+-------+-------+-----------+-----------+
|A      |X      |null       |Y          |
|A      |Y      |X          |null       |
|B      |Z      |null       |null       |
+-------+-------+-----------+-----------+

These examples demonstrate the use of window functions in Spark SQL (Databricks) with sample outputs to help you understand the results of each query. Window functions are valuable for performing complex calculations and analytics on your data within specific partitions or ordered sets of rows.

Window functions in Spark SQL (Databricks)

Search

Categories

Sub-Categories

Related Articles

Leave a Comment: