Category : Apache Spark | Sub Category : Apache Spark Programs | By Prasad Bonam Last updated: 2023-10-01 00:39:26 Viewed : 284
To convert an RDD (Resilient Distributed Dataset) to a DataFrame in Apache Spark using Scala, you can use the createDataFrame method. The example below walks through the conversion and shows the output.
First, make sure you have a SparkSession created:
```scala
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val spark = SparkSession.builder()
  .appName("RDDToDataFrameExample")
  .getOrCreate()
```
Next, let's create an RDD and convert it into a DataFrame:
```scala
// Create an RDD of tuples
val rdd = spark.sparkContext.parallelize(Seq(
  (1, "Alice"),
  (2, "Bob"),
  (3, "Charlie")
))

// Define the schema for the DataFrame
val schema = StructType(Seq(
  StructField("id", IntegerType, false),
  StructField("name", StringType, false)
))

// Convert the RDD to a DataFrame using createDataFrame
val df = spark.createDataFrame(rdd.map { case (id, name) => Row(id, name) }, schema)
```
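As a side note, when you do not need to control the schema explicitly, the same conversion can often be done more concisely with toDF, which Spark enables through its implicits. This is a minimal self-contained sketch (the local master and app name here are illustrative choices, not from the article); column types are inferred from the tuple elements:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ToDFExample")
  .master("local[*]") // local master so the sketch runs standalone; adjust for a cluster
  .getOrCreate()

// Importing the session's implicits enables toDF on RDDs of tuples and case classes
import spark.implicits._

val rdd2 = spark.sparkContext.parallelize(Seq((1, "Alice"), (2, "Bob"), (3, "Charlie")))

// Column names are supplied directly; types (integer, string) are inferred
val df2 = rdd2.toDF("id", "name")
df2.printSchema()
```

One trade-off of this shortcut is that you give up explicit control over nullability and exact field types, which createDataFrame with a StructType gives you.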
Now the RDD rdd has been converted into a DataFrame df. Here is how you can display its contents:

```scala
df.show()
```
The output will look like this:
```
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+
```
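The conversion also works in the other direction: calling df.rdd on any DataFrame returns its rows as an RDD[Row]. Here is a minimal self-contained sketch (the session setup and sample data are illustrative, mirroring the example above):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrameToRDDExample")
  .master("local[*]") // local master so the sketch runs standalone
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "Alice"), (2, "Bob"), (3, "Charlie")).toDF("id", "name")

// df.rdd exposes the DataFrame's contents as an RDD[Row]
val backToRdd = df.rdd

// Fields of a Row can be read back by position or by name
val names = backToRdd.map(_.getAs[String]("name")).collect().sorted
```

Because Row is untyped, you must read each field with the expected type (getAs, getInt, getString, and so on); a mismatch surfaces only at runtime.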
In this example, we created an RDD of tuples, defined a schema for the DataFrame, and used the createDataFrame method to convert the RDD into a DataFrame. Finally, we displayed the DataFrame with the show method, which printed its contents as a table.
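For completeness, a third common pattern (a sketch under my own naming, not from the article) builds the DataFrame from an RDD of case classes, letting Spark infer the schema by reflection from the class's fields:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class for illustration; must be defined at top level
case class Person(id: Int, name: String)

object PersonExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CaseClassExample")
      .master("local[*]") // local master so the sketch runs standalone
      .getOrCreate()
    import spark.implicits._

    val peopleRdd = spark.sparkContext.parallelize(
      Seq(Person(1, "Alice"), Person(2, "Bob"), Person(3, "Charlie")))

    // Field names and types are inferred from the Person case class
    val peopleDf = peopleRdd.toDF()
    peopleDf.show()

    spark.stop()
  }
}
```

This keeps the column definitions in one place (the case class) and avoids both a hand-written StructType and hard-coded column-name strings.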