Spark - an ETL example in Apache Spark using Scala with sample code

Category: Apache Spark | Sub Category: Apache Spark Programs | By Prasad Bonam | Last updated: 2023-10-01


Here is an ETL example in Apache Spark using Scala with sample code and the expected output:

scala
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions.{col, upper, concat_ws}

object ETLExample {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession
    val spark = SparkSession.builder()
      .appName("ETLExample")
      .getOrCreate()

    // Define the path to your source data (e.g., a CSV file)
    val sourcePath = "path_to_source_data.csv"

    // Extract: read the source data into a DataFrame
    val sourceDF: DataFrame = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(sourcePath)

    // You can also apply filtering, selection, and other extraction operations
    // here if needed, for example selecting specific columns
    val extractedDF: DataFrame = sourceDF.select("column1", "column2", "column3")

    // Transform: apply transformations to the data
    // In this example, we convert text to uppercase and concatenate two columns
    val transformedDF: DataFrame = extractedDF
      .withColumn("column1_upper", upper(col("column1")))
      .withColumn("concatenated_columns", concat_ws(" ", col("column2"), col("column3")))

    // Show the transformed data
    transformedDF.show()

    // Load: write the transformed data to a destination (e.g., a new CSV file)
    val destinationPath = "path_to_destination_data.csv"
    transformedDF.write
      .option("header", "true")
      .mode("overwrite")
      .csv(destinationPath)

    // Stop the SparkSession
    spark.stop()
  }
}
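A note on the read step: option("inferSchema", "true") makes Spark perform an extra pass over the input to guess column types, which can be costly on large files. Supplying an explicit schema avoids that pass. Here is a minimal sketch; the column names and StringType types are assumptions taken from the example above, so adjust them to match your real data:

scala
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Hypothetical schema matching the example's column names; adjust types as needed
val schema = StructType(Seq(
  StructField("column1", StringType, nullable = true),
  StructField("column2", StringType, nullable = true),
  StructField("column3", StringType, nullable = true)
))

val sourceDF = spark.read
  .option("header", "true")
  .schema(schema) // no inference pass over the data
  .csv("path_to_source_data.csv")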

Expected Output:

The transformedDF.show() line will display the transformed data on the console. The output will look something like this:

text
+-------+-------+-------+-------------+--------------------+
|column1|column2|column3|column1_upper|concatenated_columns|
+-------+-------+-------+-------------+--------------------+
|  data1|  data2|  data3|        DATA1|         data2 data3|
|  data4|  data5|  data6|        DATA4|         data5 data6|
+-------+-------+-------+-------------+--------------------+

Additionally, the transformed data will be written to the specified destination path in CSV format. Note that Spark writes the output as a directory containing one or more part files rather than a single CSV file.
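If a single output file is required, one option is to coalesce the DataFrame to one partition before writing. This is only a sketch: coalesce(1) funnels the entire write through a single task, so it is advisable only for small results:

scala
// Collapse to a single partition so the output directory contains one part file
// (suitable only for small results; this serializes the write)
transformedDF.coalesce(1).write
  .option("header", "true")
  .mode("overwrite")
  .csv("path_to_destination_data.csv")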

Make sure to replace "path_to_source_data.csv" and "path_to_destination_data.csv" with the actual paths to your source and destination data files. Ensure you have Apache Spark and Scala properly installed and configured before running this code.
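To sanity-check the load step, the written files can be read back and displayed before calling spark.stop(). A minimal sketch, using the same placeholder destination path as above:

scala
// Read the written CSV back to confirm the load step produced the expected rows
// (run this before spark.stop(); the path is the placeholder from the example)
val verifyDF = spark.read
  .option("header", "true")
  .csv("path_to_destination_data.csv")
verifyDF.show()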


