Window functions in Spark SQL (Databricks)

Category : Microsoft Azure Data Engineering | Sub Category : Practice Assessment for Exam DP-203 - Data Engineering on Microsoft Azure | By Prasad Bonam Last updated: 2023-09-11 12:04:37 Viewed : 639


Window functions in Spark SQL (Databricks):

Window functions in Spark SQL (Databricks) allow you to perform calculations across a set of table rows that are related to the current row. Here are some examples of window functions:

Assuming you have a DataFrame or table named mytable, you can use the following examples to demonstrate various window functions:

  1. Basic Window Function:

    • Calculate the row number for each row within a partition.
    sql
    SELECT column1, column2, ROW_NUMBER() OVER (ORDER BY column1) AS row_num FROM mytable;
  2. PARTITION BY Clause:

    • Calculate the sum of a column within different partitions.
    sql
    SELECT column1, column2, SUM(column3) OVER (PARTITION BY column1) AS partition_sum FROM mytable;
  3. ORDER BY Clause:

    • Calculate a running total within a partition based on a specific order.
    sql
    SELECT column1, column2, SUM(column3) OVER (PARTITION BY column1 ORDER BY column2) AS running_total FROM mytable;
  4. RANGE and ROWS Clauses:

    • Define the frame specification using RANGE BETWEEN or ROWS BETWEEN to calculate aggregations within a specific range.
    sql
    SELECT column1, column2, SUM(column3) OVER (PARTITION BY column1 ORDER BY column2 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum FROM mytable;
  5. DENSE_RANK and RANK Functions:

    • Calculate the dense rank and rank of rows within a partition.
    sql
    SELECT column1, column2, DENSE_RANK() OVER (PARTITION BY column1 ORDER BY column2) AS dense_rank, RANK() OVER (PARTITION BY column1 ORDER BY column2) AS rank FROM mytable;
  6. LAG and LEAD Functions:

    • Retrieve values from previous and subsequent rows within a partition.
    sql
    SELECT column1, column2, LAG(column2) OVER (PARTITION BY column1 ORDER BY column2) AS prev_value, LEAD(column2) OVER (PARTITION BY column1 ORDER BY column2) AS next_value FROM mytable;

These examples demonstrate various use cases for window functions in Spark SQL (Databricks). Window functions are powerful tools for performing complex calculations and aggregations within partitions of your data, which can be very useful for advanced analytics and reporting tasks.

Here are examples of Spark SQL queries in Databricks that use window functions along with sample outputs. These examples assume you have a table named mytable with the appropriate columns.

  1. Basic Window Function - ROW_NUMBER:

    sql
    SELECT column1, column2, ROW_NUMBER() OVER (ORDER BY column1) AS row_num FROM mytable;

    Sample Output:

    css
    +-------+-------+-------+ |column1|column2|row_num| +-------+-------+-------+ |A |X |1 | |A |Y |2 | |B |Z |3 | +-------+-------+-------+
  2. PARTITION BY Clause - SUM:

    sql
    SELECT column1, column2, SUM(column3) OVER (PARTITION BY column1) AS partition_sum FROM mytable;

    Sample Output:

    css
    +-------+-------+--------------+ |column1|column2|partition_sum | +-------+-------+--------------+ |A |X |8 | |A |Y |8 | |B |Z |7 | +-------+-------+--------------+
  3. ORDER BY Clause - Running Total:

    sql
    SELECT column1, column2, SUM(column3) OVER (PARTITION BY column1 ORDER BY column2) AS running_total FROM mytable;

    Sample Output:

    css
    +-------+-------+--------------+ |column1|column2|running_total | +-------+-------+--------------+ |A |X |3 | |A |Y |8 | |B |Z |7 | +-------+-------+--------------+
  4. DENSE_RANK and RANK Functions:

    sql
    SELECT column1, column2, DENSE_RANK() OVER (PARTITION BY column1 ORDER BY column2) AS dense_rank, RANK() OVER (PARTITION BY column1 ORDER BY column2) AS rank FROM mytable;

    Sample Output:

    css
    +-------+-------+-----------+-----+ |column1|column2|dense_rank |rank | +-------+-------+-----------+-----+ |A |X |1 |1 | |A |Y |2 |2 | |B |Z |1 |1 | +-------+-------+-----------+-----+
  5. LAG and LEAD Functions:

    sql
    SELECT column1, column2, LAG(column2) OVER (PARTITION BY column1 ORDER BY column2) AS prev_value, LEAD(column2) OVER (PARTITION BY column1 ORDER BY column2) AS next_value FROM mytable;

    Sample Output:

    sql
    +-------+-------+-----------+-----------+ |column1|column2|prev_value |next_value | +-------+-------+-----------+-----------+ |A |X |null |Y | |A |Y |X |null | |B |Z |null |null | +-------+-------+-----------+-----------+

These examples demonstrate the use of window functions in Spark SQL (Databricks) with sample outputs to help you understand the results of each query. Window functions are valuable for performing complex calculations and analytics on your data within specific partitions or ordered sets of rows.

Search
Related Articles

Leave a Comment: