Category : Apache Spark | Sub Category : Apache Spark Programs | By Prasad Bonam Last updated: 2023-10-01 00:33:49
Here is an ETL (extract, transform, load) example in Apache Spark using Scala, with sample code and the expected output:
```scala
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions.{col, upper, concat_ws}

object ETLExample {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession
    val spark = SparkSession.builder()
      .appName("ETLExample")
      .getOrCreate()

    // Define the path to your source data (e.g., a CSV file)
    val sourcePath = "path_to_source_data.csv"

    // Read the source data into a DataFrame
    val sourceDF: DataFrame = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(sourcePath)

    // Extract: load the data from the source.
    // You can also apply filtering, selection, and other extraction
    // operations here if needed -- for example, selecting specific columns.
    val extractedDF: DataFrame = sourceDF.select("column1", "column2", "column3")

    // Transform: apply transformations to the data.
    // In this example, we convert text to uppercase and concatenate two columns.
    val transformedDF: DataFrame = extractedDF
      .withColumn("column1_upper", upper(col("column1")))
      .withColumn("concatenated_columns", concat_ws(" ", col("column2"), col("column3")))

    // Show the transformed data
    transformedDF.show()

    // Load: write the transformed data to a destination (e.g., a new CSV file)
    val destinationPath = "path_to_destination_data.csv"
    transformedDF.write
      .option("header", "true")
      .mode("overwrite")
      .csv(destinationPath)

    // Stop the SparkSession
    spark.stop()
  }
}
```
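Once the DataFrame machinery is stripped away, the two transformations are ordinary string operations: upper(col("column1")) uppercases a value, and concat_ws(" ", ...) joins values with a single space. The following plain-Scala sketch illustrates that logic without requiring Spark; the Record case class and the helper names are hypothetical stand-ins, not part of the original example:

```scala
// Hypothetical stand-in for one row of the extracted DataFrame
case class Record(column1: String, column2: String, column3: String)

object TransformSketch {
  // Mirrors upper(col("column1")): uppercase the first column
  def columnUpper(r: Record): String = r.column1.toUpperCase

  // Mirrors concat_ws(" ", col("column2"), col("column3")):
  // join the two columns with a single space separator
  def concatenated(r: Record): String = Seq(r.column2, r.column3).mkString(" ")

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Record("data1", "data2", "data3"),
      Record("data4", "data5", "data6")
    )
    rows.foreach { r =>
      println(s"${columnUpper(r)} | ${concatenated(r)}")
    }
  }
}
```

Running this prints one line per row, e.g. "DATA1 | data2 data3", matching the two derived columns in the Spark output below.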
Expected Output:
The transformedDF.show() line will display the transformed data on the console. The output will look something like this:
```
+-------+-------+-------+-------------+--------------------+
|column1|column2|column3|column1_upper|concatenated_columns|
+-------+-------+-------+-------------+--------------------+
|  data1|  data2|  data3|        DATA1|         data2 data3|
|  data4|  data5|  data6|        DATA4|         data5 data6|
+-------+-------+-------+-------------+--------------------+
```
Additionally, the transformed data will be written to the specified destination path in CSV format. Note that Spark writes the destination as a directory containing one or more part files, not as a single CSV file.
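Because the writer is configured with option("header", "true"), each written part file starts with a header line followed by one comma-separated line per row. For intuition only, here is a plain-Scala sketch of that shape; the toCsv helper is hypothetical and ignores the quoting and escaping rules Spark's real CSV writer applies:

```scala
object CsvSketch {
  // Hypothetical helper: one header line, then one comma-separated line per row.
  // Spark's actual writer additionally handles quoting, escaping, and nulls.
  def toCsv(header: Seq[String], rows: Seq[Seq[String]]): String =
    (header +: rows).map(_.mkString(",")).mkString("\n")

  def main(args: Array[String]): Unit = {
    val header = Seq("column1", "column2", "column3",
                     "column1_upper", "concatenated_columns")
    val rows = Seq(
      Seq("data1", "data2", "data3", "DATA1", "data2 data3"),
      Seq("data4", "data5", "data6", "DATA4", "data5 data6")
    )
    println(toCsv(header, rows))
  }
}
```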
Make sure to replace "path_to_source_data.csv" and "path_to_destination_data.csv" with the actual paths to your source and destination data files. Ensure you have Apache Spark and Scala properly installed and configured before running this code.