Convert an RDD (Resilient Distributed Dataset) to a DataFrame in Apache Spark

By Prasad Bonam | Last updated: 2023-10-01


To convert an RDD (Resilient Distributed Dataset) to a DataFrame in Apache Spark using Scala, you can use the createDataFrame method. The example below walks through the conversion step by step and shows the expected output.

First, make sure you have a SparkSession created:

scala
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val spark = SparkSession.builder()
  .appName("RDDToDataFrameExample")
  .getOrCreate()

Next, let's create an RDD and convert it into a DataFrame:

scala
// Create an RDD of (id, name) tuples
val rdd = spark.sparkContext.parallelize(Seq(
  (1, "Alice"),
  (2, "Bob"),
  (3, "Charlie")
))

// Define the schema for the DataFrame
val schema = StructType(Seq(
  StructField("id", IntegerType, false),
  StructField("name", StringType, false)
))

// Convert the RDD to a DataFrame using createDataFrame
val df = spark.createDataFrame(
  rdd.map { case (id, name) => Row(id, name) },
  schema
)

Now, you have converted the RDD rdd into a DataFrame df. Here is how you can display its contents:

scala
df.show()

The output will look like this:

text
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+
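
You can also verify that the explicit schema, including the nullable flags, was applied. A quick check using the standard printSchema method:

scala
df.printSchema()

which prints:

text
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = false)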

In this example, we created an RDD of tuples, defined a schema for the DataFrame, and then used the createDataFrame method to convert the RDD into a DataFrame. Finally, we displayed the DataFrame using the show method, which printed the contents as a table.
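
As a side note, when you do not need full control over the schema, Spark's implicit conversions offer a shorter route via toDF. A minimal sketch, assuming the same spark session and rdd as above (df2 is just an illustrative name); note that toDF infers the schema from the tuple element types, so the resulting columns may be nullable, unlike the explicit StructType version:

scala
// Bring implicit RDD-to-DataFrame conversions into scope
import spark.implicits._

// Name the columns inline; the schema is inferred from the tuple element types
val df2 = rdd.toDF("id", "name")
df2.show()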
