Category : Apache Spark | Sub Category : Apache Spark Programs | By Prasad Bonam Last updated: 2023-09-30
Here are some examples of RDD (Resilient Distributed Dataset) actions in Apache Spark along with their expected outputs:
1. collect(): Collects all elements of the RDD and returns them to the driver as an array.

```scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
val collectedData = data.collect()
// collectedData: Array[Int] = Array(1, 2, 3, 4, 5)
```
2. count(): Returns the number of elements in the RDD.

```scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
val count = data.count()
// count: Long = 5
```
3. first(): Returns the first element of the RDD.

```scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
val firstElement = data.first()
// firstElement: Int = 1
```
4. take(n): Returns the first n elements of the RDD as an array.

```scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
val firstThreeElements = data.take(3)
// firstThreeElements: Array[Int] = Array(1, 2, 3)
```
5. reduce(func): Applies a binary operator func to the elements of the RDD and returns the result.

```scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
val sum = data.reduce((x, y) => x + y)
// sum: Int = 15
```
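Note that reduce computes partial results per partition and then combines them, so the function passed to it should be associative and commutative; otherwise the result can depend on how the data is partitioned. A small sketch of the pitfall (subtraction is neither associative nor commutative, so this example is intentionally fragile):

```scala
// Non-commutative operator: the result may vary with partition layout,
// so do not rely on a specific value here.
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
val diff = data.reduce((x, y) => x - y)  // partition-dependent result
```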
6. foreach(func): Applies a function func to each element of the RDD. This action is typically used for side effects.

```scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
data.foreach(x => println(x))
// Output (printed to the console; order may vary across partitions):
// 1
// 2
// 3
// 4
// 5
```
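Keep in mind that the function passed to foreach runs on the executors, not the driver. In cluster mode the println output therefore lands in the executor logs rather than the driver console. A common pattern when the RDD is small enough to fit in driver memory is to collect it first:

```scala
// Prints on the driver, in a deterministic order, because collect()
// brings the elements back to the driver first. Only safe for small RDDs.
data.collect().foreach(println)
```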
7. countByKey(): Only available for RDDs of key-value pairs. Returns a map of each unique key to its count in the RDD.

```scala
val data = sc.parallelize(Seq(("A", 1), ("B", 2), ("A", 3), ("C", 1)))
val counts = data.countByKey()
// counts: scala.collection.Map[String,Long] = Map(B -> 1, C -> 1, A -> 2)
```
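Because countByKey materializes the entire result map on the driver, it is best suited to RDDs with a modest number of distinct keys. A hedged sketch of the commonly suggested distributed alternative using reduceByKey (the result here stays an RDD instead of a driver-side map):

```scala
// Aggregate counts distributively; nothing is pulled to the driver
// until an action such as collect() is called.
val data = sc.parallelize(Seq(("A", 1), ("B", 2), ("A", 3), ("C", 1)))
val countsRDD = data.mapValues(_ => 1L).reduceByKey(_ + _)
```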
8. saveAsTextFile(path): Writes the RDD data to text files at the specified path.

```scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
data.saveAsTextFile("output.txt")
// Output: the RDD data is saved as text files in the "output.txt" directory.
```
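Note that the path is created as a directory, not a single file: Spark writes one part file per partition (part-00000, part-00001, and so on), and the call fails if the directory already exists. A short sketch of reading the saved data back:

```scala
// Read the part files back as an RDD[String]; each line becomes one element.
val reloaded = sc.textFile("output.txt")
```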
These examples demonstrate common RDD actions in Apache Spark and the expected output for each. Keep in mind that actual outputs may vary with your Spark configuration and how the data is partitioned.