Category: Apache Spark | Sub Category: Apache Spark Programs | By Prasad Bonam | Last updated: 2020-10-25
This example exercises the following pair-RDD operations:

· collectAsMap — Return the key-value pairs in this RDD to the master as a Map. Warning: this doesn't return a multimap, so if you have multiple values for the same key, only one value per key is preserved in the returned map.
· keys — Return an RDD with the keys of each tuple.
· sortByKey — Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling collect or save on the resulting RDD will return or output an ordered list of records (in the save case, they will be written to multiple part-X files in the filesystem, in order of the keys).
· flatMap — Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
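The collectAsMap warning is easy to see with plain Scala collections, since toMap has the same one-value-per-key behavior (a minimal sketch, no Spark required; the names here are illustrative):

```scala
// Converting key-value pairs to a Map keeps only one value per key,
// just like collectAsMap: the later value for a duplicated key wins.
object CollectAsMapSketch extends App {
  val pairs = List(("a", 1), ("a", 2), ("b", 3))
  val asMap = pairs.toMap
  // Only one value survives for the duplicated key "a"
  println(asMap) // Map(a -> 2, b -> 3)
}
```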
· JDK 1.7 or higher
· Scala 2.10.3
· spark-core_2.10
· scala-library
SparkPairFunctions.scala
package com.runnerdev.rdd
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object SparkPairFunctions extends App {

  val conf = new SparkConf().setAppName("SparkPairFunctions").setMaster("local")
  // Create a Scala Spark context
  val sc = new SparkContext(conf)
  val rdd = sc.parallelize(List(
    "Asia Egypt Turkey",
    "India USA Singapore Canada,Turkey",
    "USA India Germany China",
    "USA India Russia"))
val wordsRdd = rdd.flatMap(_.split(" "))
val pairRDD = wordsRdd.map(m => (m, 1))
println("pairRDD: ")
pairRDD.foreach(println)
println("Distinct ==>")
pairRDD.distinct().foreach(println)
//SortByKey
println("Sort by Key ==>")
val sortRDD = pairRDD.sortByKey()
sortRDD.foreach(println)
//reduceByKey
println("Reduce by Key ==>")
val wordCount = pairRDD.reduceByKey((x, y) => x + y)
wordCount.foreach(println)
def param1 = (accu: Int, v: Int) => accu + v
def param2 = (accu1: Int, accu2: Int) => accu1 + accu2
println("Aggregate by Key ==> wordcount")
val wordCount2 = pairRDD.aggregateByKey(0)(param1, param2)
wordCount2.foreach(println)
//keys
println("Keys ==>")
wordCount2.keys.foreach(println)
//values
println("values ==>")
wordCount2.values.foreach(println)
println("Count :" + wordCount2.count())
println("collectAsMap ==>")
pairRDD.collectAsMap().foreach(println)

  sc.stop()
}
Compile and run the above example as follows: build the project with mvn clean install, then run it as a Scala application.
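To see what the program computes before looking at the Spark output, the same word count can be sketched with plain Scala collections: groupBy plus sum plays the role of reduceByKey((x, y) => x + y). This is a minimal sketch for intuition, not the Spark API itself:

```scala
// Plain-Scala analogue of the Spark word count above.
object WordCountSketch extends App {
  val lines = List(
    "Asia Egypt Turkey",
    "India USA Singapore Canada,Turkey",
    "USA India Germany China",
    "USA India Russia")
  val words = lines.flatMap(_.split(" "))        // like rdd.flatMap(_.split(" "))
  val pairs = words.map(w => (w, 1))             // like wordsRdd.map(m => (m, 1))
  // groupBy + sum corresponds to reduceByKey((x, y) => x + y)
  val counts = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
  println(counts("USA"))   // 3
  println(counts("India")) // 3
  println(counts.size)     // 10 distinct keys
}
```

Note that "Canada,Turkey" stays a single key because the input is split only on spaces, which is why it appears as (Canada,Turkey,1) in the output below.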
Output:
pairRDD:
(Asia,1)
(Egypt,1)
(Turkey,1)
(India,1)
(USA,1)
(Singapore,1)
(Canada,Turkey,1)
(USA,1)
(India,1)
(Germany,1)
(China,1)
(USA,1)
(India,1)
(Russia,1)
Distinct ==>
(Germany,1)
(Singapore,1)
(India,1)
(Egypt,1)
(Turkey,1)
(Canada,Turkey,1)
(China,1)
(USA,1)
(Asia,1)
(Russia,1)
Sort by Key ==>
(Asia,1)
(Canada,Turkey,1)
(China,1)
(Egypt,1)
(Germany,1)
(India,1)
(India,1)
(India,1)
(Russia,1)
(Singapore,1)
(Turkey,1)
(USA,1)
(USA,1)
Reduce by Key ==>
(Asia,1)
(China,1)
(USA,3)
(Singapore,1)
(Germany,1)
(Canada,Turkey,1)
(Egypt,1)
(Russia,1)
(Turkey,1)
(India,3)
Aggregate by Key ==> wordcount
(Asia,1)
(China,1)
(USA,3)
(Singapore,1)
(Germany,1)
(Canada,Turkey,1)
(Egypt,1)
(Russia,1)
(Turkey,1)
(India,3)
Keys ==>
Asia
China
USA
Singapore
Germany
Canada,Turkey
Egypt
Russia
Turkey
India
values ==>
1
1
3
1
1
1
1
1
1
3
Count :10
collectAsMap ==>
(Egypt,1)
(Asia,1)
(Canada,Turkey,1)
(Germany,1)
(China,1)
(Russia,1)
(India,1)
(Turkey,1)
(Singapore,1)
(USA,1)