Category : Apache Spark | Sub Category : Apache Spark Programs | By Prasad Bonam Last updated: 2023-10-01 00:39:26 Viewed : 284
To convert an RDD (Resilient Distributed Dataset) to a DataFrame in Apache Spark using Scala, you can use the createDataFrame method. The example below walks through the conversion and shows the output.
First, make sure you have a SparkSession created:
```scala
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val spark = SparkSession.builder()
  .appName("RDDToDataFrameExample")
  .getOrCreate()
```
Next, let's create an RDD and convert it into a DataFrame:
```scala
// Create an RDD of tuples
val rdd = spark.sparkContext.parallelize(Seq(
  (1, "Alice"),
  (2, "Bob"),
  (3, "Charlie")
))

// Define the schema for the DataFrame
val schema = StructType(Seq(
  StructField("id", IntegerType, false),
  StructField("name", StringType, false)
))

// Convert the RDD to a DataFrame using createDataFrame
val df = spark.createDataFrame(rdd.map { case (id, name) => Row(id, name) }, schema)
```
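As a side note, when you do not need to control the schema explicitly, the same conversion can often be done more concisely with toDF, which Spark enables through its implicits. This is a minimal self-contained sketch (the local master and app name here are illustrative choices, not from the article); column types are inferred from the tuple elements:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ToDFExample")
  .master("local[*]") // local master so the sketch runs standalone; adjust for a cluster
  .getOrCreate()

// Importing the session's implicits enables toDF on RDDs of tuples and case classes
import spark.implicits._

val rdd2 = spark.sparkContext.parallelize(Seq((1, "Alice"), (2, "Bob"), (3, "Charlie")))

// Column names are supplied directly; types (integer, string) are inferred
val df2 = rdd2.toDF("id", "name")
df2.printSchema()
```

One trade-off of this shortcut is that you give up explicit control over nullability and exact field types, which createDataFrame with a StructType gives you.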
Now the RDD rdd has been converted into a DataFrame df. Here is how you can display its contents:

```scala
df.show()
```
The output will look like this:
```
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+
```
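The conversion also works in the other direction: calling df.rdd on any DataFrame returns its rows as an RDD[Row]. Here is a minimal self-contained sketch (the session setup and sample data are illustrative, mirroring the example above):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrameToRDDExample")
  .master("local[*]") // local master so the sketch runs standalone
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "Alice"), (2, "Bob"), (3, "Charlie")).toDF("id", "name")

// df.rdd exposes the DataFrame's contents as an RDD[Row]
val backToRdd = df.rdd

// Fields of a Row can be read back by position or by name
val names = backToRdd.map(_.getAs[String]("name")).collect().sorted
```

Because Row is untyped, you must read each field with the expected type (getAs, getInt, getString, and so on); a mismatch surfaces only at runtime.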
In this example, we created an RDD of tuples, defined a schema for the DataFrame, and used the createDataFrame method to convert the RDD into a DataFrame. Finally, we displayed the DataFrame with the show method, which printed its contents as a table.
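For completeness, a third common pattern (a sketch under my own naming, not from the article) builds the DataFrame from an RDD of case classes, letting Spark infer the schema by reflection from the class's fields:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class for illustration; must be defined at top level
case class Person(id: Int, name: String)

object PersonExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CaseClassExample")
      .master("local[*]") // local master so the sketch runs standalone
      .getOrCreate()
    import spark.implicits._

    val peopleRdd = spark.sparkContext.parallelize(
      Seq(Person(1, "Alice"), Person(2, "Bob"), Person(3, "Charlie")))

    // Field names and types are inferred from the Person case class
    val peopleDf = peopleRdd.toDF()
    peopleDf.show()

    spark.stop()
  }
}
```

This keeps the column definitions in one place (the case class) and avoids both a hand-written StructType and hard-coded column-name strings.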