Spark - Some DataFrame examples using Scala in Apache Spark

Category : Apache Spark | Sub Category : Apache Spark Programs | By Prasad Bonam Last updated: 2023-10-01 14:22:01 Viewed : 273


Here are some DataFrame examples using Scala in Apache Spark:

scala
// Import the SparkSession library
import org.apache.spark.sql.SparkSession

// Create a SparkSession
val spark = SparkSession.builder()
  .appName("DataFrameExamples")
  .getOrCreate()

// Needed for the toDF() syntax used below
import spark.implicits._

// Example 1: Creating a DataFrame from a sequence of case class objects

// Define a case class
case class Person(name: String, age: Int)

// Create a sequence of case class objects
val peopleSeq = Seq(Person("Alice", 25), Person("Bob", 30), Person("Charlie", 35))

// Create a DataFrame from the sequence
val peopleDF = spark.createDataFrame(peopleSeq)

// Show the DataFrame
peopleDF.show()

// Example 2: Loading data from a CSV file into a DataFrame
val csvDF = spark.read
  .option("header", "true")      // Treat the first row as the header
  .option("inferSchema", "true") // Infer column data types
  .csv("/path/to/your/file.csv")

// Show the DataFrame
csvDF.show()

// Example 3: Performing operations on DataFrames

// Select specific columns
val selectedDF = csvDF.select("name", "age")
selectedDF.show()

// Filter rows
val filteredDF = csvDF.filter(csvDF("age") > 30)
filteredDF.show()

// Grouping and aggregation
import org.apache.spark.sql.functions._
val groupAggDF = csvDF.groupBy("gender").agg(avg("age"), max("salary"))
groupAggDF.show()

// Example 4: Joining DataFrames

// Create two DataFrames that share a dept_id key
val employeesDF = Seq(("Alice", 1), ("Bob", 2), ("Charlie", 3)).toDF("name", "dept_id")
val departmentDF = Seq((1, "HR"), (2, "Finance"), (3, "Engineering")).toDF("dept_id", "dept_name")

// Join the two DataFrames on the common dept_id column
val joinedDF = employeesDF.join(departmentDF, Seq("dept_id"), "inner")

// Show the joined DataFrame
joinedDF.show()

// Example 5: Writing DataFrames to various formats

// Write the DataFrame to Parquet format
csvDF.write.parquet("/path/to/output/parquet")

// Write the DataFrame to JSON format
csvDF.write.json("/path/to/output/json")

// Stop the SparkSession
spark.stop()

In these examples:

  • Example 1 shows how to create a DataFrame from a sequence of case class objects.
  • Example 2 demonstrates loading data from a CSV file into a DataFrame, where we specify options like header and schema inference.
  • Example 3 illustrates common DataFrame operations, such as selecting columns, filtering rows, and performing group aggregations.
  • Example 4 showcases joining two DataFrames based on a common key.
  • Example 5 demonstrates how to write DataFrames to different file formats like Parquet and JSON.
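To make the semantics of the filter and groupBy/agg steps concrete without needing a running Spark cluster, here is a plain-Scala sketch that computes the same results on an in-memory collection. The column names (gender, salary) and sample rows are hypothetical, matching the assumed CSV schema from Example 3; this is an illustration of what Spark computes, not Spark code.

```scala
// Sample records standing in for rows of the CSV DataFrame (hypothetical data)
case class Rec(name: String, gender: String, age: Int, salary: Double)

val rows = Seq(
  Rec("Alice", "F", 25, 50000.0),
  Rec("Bob", "M", 32, 60000.0),
  Rec("Charlie", "M", 35, 70000.0)
)

// Equivalent of csvDF.filter(csvDF("age") > 30)
val over30 = rows.filter(_.age > 30)
println(over30.map(_.name))

// Equivalent of csvDF.groupBy("gender").agg(avg("age"), max("salary")):
// for each gender, compute the average age and the maximum salary
val byGender = rows.groupBy(_.gender).map { case (g, rs) =>
  g -> ((rs.map(_.age).sum.toDouble / rs.size, rs.map(_.salary).max))
}
byGender.foreach(println)
```

Spark distributes the same logic across partitions, but the per-group result is identical to this collection-based version.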

You can run these Scala examples in a Spark shell or as part of a Scala Spark application. Make sure to replace the file paths and column names to match your own data.
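If you want to package the examples as a standalone application rather than pasting them into spark-shell, a minimal sbt build is enough. The project name and the Scala/Spark versions below are assumptions; match them to the versions your cluster runs.

```scala
// build.sbt -- minimal sketch for a standalone Spark application
// (versions shown are assumptions; align them with your cluster)
name := "dataframe-examples"

scalaVersion := "2.12.18"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.4.1"
```

With this in place, `sbt package` produces a jar you can submit with `spark-submit`.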
