Spark - an ETL example in Apache Spark using Scala with sample code

Category: Apache Spark | Sub Category: Apache Spark Programs | By Prasad Bonam | Last updated: 2023-10-01


Here is an ETL example in Apache Spark using Scala with sample code and the expected output:

scala
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions.{col, upper, concat_ws}

object ETLExample {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession
    val spark = SparkSession.builder()
      .appName("ETLExample")
      .getOrCreate()

    // Define the path to your source data (e.g., a CSV file)
    val sourcePath = "path_to_source_data.csv"

    // Extract: read the source data into a DataFrame
    val sourceDF: DataFrame = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(sourcePath)

    // You can also apply filtering, selection, and other extraction operations
    // here if needed, for example selecting specific columns
    val extractedDF: DataFrame = sourceDF.select("column1", "column2", "column3")

    // Transform: apply transformations to the data
    // In this example, we convert text to uppercase and concatenate two columns
    val transformedDF: DataFrame = extractedDF
      .withColumn("column1_upper", upper(col("column1")))
      .withColumn("concatenated_columns", concat_ws(" ", col("column2"), col("column3")))

    // Show the transformed data
    transformedDF.show()

    // Load: write the transformed data to a destination (e.g., a new CSV file)
    val destinationPath = "path_to_destination_data.csv"
    transformedDF.write
      .option("header", "true")
      .mode("overwrite")
      .csv(destinationPath)

    // Stop the SparkSession
    spark.stop()
  }
}
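A note on the read step: option("inferSchema", "true") makes Spark perform an extra pass over the input to guess column types, which can be costly on large files. Supplying an explicit schema avoids that pass. Here is a minimal sketch; the column names and StringType types are assumptions taken from the example above, so adjust them to match your real data:

scala
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Hypothetical schema matching the example's column names; adjust types as needed
val schema = StructType(Seq(
  StructField("column1", StringType, nullable = true),
  StructField("column2", StringType, nullable = true),
  StructField("column3", StringType, nullable = true)
))

val sourceDF = spark.read
  .option("header", "true")
  .schema(schema) // no inference pass over the data
  .csv("path_to_source_data.csv")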

Expected Output:

The transformedDF.show() line will display the transformed data on the console. The output will look something like this:

text
+-------+-------+-------+-------------+--------------------+
|column1|column2|column3|column1_upper|concatenated_columns|
+-------+-------+-------+-------------+--------------------+
|  data1|  data2|  data3|        DATA1|         data2 data3|
|  data4|  data5|  data6|        DATA4|         data5 data6|
+-------+-------+-------+-------------+--------------------+

Additionally, the transformed data will be written to the specified destination path in CSV format. Note that Spark writes the output as a directory containing one or more part files rather than a single CSV file.
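If a single output file is required, one option is to coalesce the DataFrame to one partition before writing. This is only a sketch: coalesce(1) funnels the entire write through a single task, so it is advisable only for small results:

scala
// Collapse to a single partition so the output directory contains one part file
// (suitable only for small results; this serializes the write)
transformedDF.coalesce(1).write
  .option("header", "true")
  .mode("overwrite")
  .csv("path_to_destination_data.csv")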

Make sure to replace "path_to_source_data.csv" and "path_to_destination_data.csv" with the actual paths to your source and destination data files. Ensure you have Apache Spark and Scala properly installed and configured before running this code.
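To sanity-check the load step, the written files can be read back and displayed before calling spark.stop(). A minimal sketch, using the same placeholder destination path as above:

scala
// Read the written CSV back to confirm the load step produced the expected rows
// (run this before spark.stop(); the path is the placeholder from the example)
val verifyDF = spark.read
  .option("header", "true")
  .csv("path_to_destination_data.csv")
verifyDF.show()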


