Spark - Examples of RDD Actions in Apache Spark

By Prasad Bonam | Last updated: 2023-09-30


Here are some examples of RDD (Resilient Distributed Dataset) actions in Apache Spark along with their expected outputs:

1. collect(): Collects all elements from the RDD and returns them as an array.

scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
val collectedData = data.collect()
// Output
// collectedData: Array[Int] = Array(1, 2, 3, 4, 5)

2. count(): Returns the number of elements in the RDD.

scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
val count = data.count()
// Output
// count: Long = 5

3. first(): Returns the first element of the RDD.

scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
val firstElement = data.first()
// Output
// firstElement: Int = 1

4. take(n): Returns the first n elements of the RDD as an array.

scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
val firstThreeElements = data.take(3)
// Output
// firstThreeElements: Array[Int] = Array(1, 2, 3)

5. reduce(func): Aggregates the elements of the RDD using a binary operator func and returns the result. The operator must be commutative and associative so it can be computed correctly in parallel across partitions.

scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
val sum = data.reduce((x, y) => x + y)
// Output
// sum: Int = 15

6. foreach(func): Applies a function func to each element of the RDD, typically for side effects. Note that func runs on the executors, so in cluster mode any println output appears in the executors' stdout rather than the driver's console, and the order is not guaranteed.

scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
data.foreach(x => println(x))
// Output (printed to the executors' stdout; order may vary):
// 1
// 2
// 3
// 4
// 5

7. countByKey(): Only available for RDDs of key-value pairs. Returns a map of each unique key to its count in the RDD.

scala
val data = sc.parallelize(Seq(("A", 1), ("B", 2), ("A", 3), ("C", 1)))
val counts = data.countByKey()
// Output
// counts: scala.collection.Map[String,Long] = Map(B -> 1, C -> 1, A -> 2)

8. saveAsTextFile(path): Writes the RDD's elements as a directory of text files at the specified path, with one part file per partition.

scala
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
data.saveAsTextFile("output.txt")
// Output: despite the name, "output.txt" is created as a directory
// containing part files (part-00000, part-00001, ...), one per partition.

These examples demonstrate various RDD actions in Apache Spark and the expected outputs for each action. Keep in mind that the actual outputs may vary depending on your Spark cluster and the order of execution.
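Since the snippets above all assume an `sc` from a running spark-shell, it can help to see the same contracts without a cluster. The sketch below is an illustrative plain-Scala analogue of these actions using ordinary collections, not Spark API code; the names (`RddActionAnalogues`, `pairs`) are chosen for this example only:

```scala
// Plain-Scala analogues of the RDD actions above (illustrative; no Spark needed).
object RddActionAnalogues extends App {
  val data = Seq(1, 2, 3, 4, 5)

  // collect(): materialize all elements as an array
  val collected = data.toArray

  // count(): number of elements, as a Long
  val count = data.size.toLong

  // first(): the first element
  val first = data.head

  // take(n): the first n elements
  val firstThree = data.take(3)

  // reduce(func): combine elements with a binary operator
  val sum = data.reduce(_ + _)

  // countByKey(): count occurrences of each key in a pair collection
  val pairs = Seq(("A", 1), ("B", 2), ("A", 3), ("C", 1))
  val counts = pairs.groupBy(_._1).map { case (k, v) => k -> v.size.toLong }

  assert(collected.sameElements(Array(1, 2, 3, 4, 5)))
  assert(count == 5L)
  assert(first == 1)
  assert(firstThree == Seq(1, 2, 3))
  assert(sum == 15)
  assert(counts == Map("A" -> 2L, "B" -> 1L, "C" -> 1L))
  println("all analogue checks passed")
}
```

The key difference is that a Seq lives in one JVM, whereas an RDD is partitioned across executors, which is why foreach side effects and reduce's operator requirements behave differently on a cluster.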
