A Resilient Distributed Dataset (RDD), the
basic abstraction in Spark. Represents an immutable, partitioned collection of
elements that can be operated on in parallel. This class contains the basic
operations available on all RDDs, such as
persist. In addition, org.apache.spark.rdd.PairRDDFunctions contains
operations available only on RDDs of key-value pairs, such as
join; org.apache.spark.rdd.DoubleRDDFunctions contains operations
available only on RDDs of Doubles; and
org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on
RDDs that can be saved as SequenceFiles. All operations are automatically
available on any RDD of the right type (e.g., RDD[(Int, Int)]) through implicit conversions.
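The mechanism behind this is Scala's enrichment pattern: an implicit class whose constructor only accepts a container of pairs, so the extra operations become callable only when the element type matches. A minimal sketch of the idea, with hypothetical names (Spark's real conversions live in the RDD companion object and PairRDDFunctions):

```scala
// A stand-in for RDD[T]; hypothetical, for illustration only.
class MiniRDD[T](val data: Seq[T])

// Only applies when the element type is a pair (K, V), mirroring how
// PairRDDFunctions adds key-value operations such as reduceByKey and join.
implicit class MiniPairFunctions[K, V](rdd: MiniRDD[(K, V)]) {
  def reduceByKey(f: (V, V) => V): Map[K, V] =
    rdd.data.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduceLeft(f)) }
}

val pairs = new MiniRDD(Seq(("a", 1), ("b", 2), ("a", 3)))
println(pairs.reduceByKey(_ + _))  // Map(a -> 4, b -> 2)
```

Calling `reduceByKey` on a `MiniRDD[String]` would not compile, which is exactly the "right type" restriction described above.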
Internally, each RDD is characterized by five main properties:
§ A list of partitions
§ A function for computing each split
§ A list of dependencies on other RDDs
§ Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
§ Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
parallelize distributes a local Scala collection to form an RDD.
Note: avoid using parallelize(Seq()) to create an empty RDD; use
emptyRDD for an RDD with no partitions, or
parallelize(Seq[T]()) for an RDD of T with empty partitions.
Note: parallelize acts lazily. If
seq is a mutable collection and is altered after the call to
parallelize and before the first action on the RDD, the resultant RDD will
reflect the modified collection. Pass a copy of the argument to avoid this.
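The same hazard can be reproduced outside Spark with a lazy view over a mutable buffer, which makes the defensive-copy advice concrete (plain Scala, no Spark required):

```scala
import scala.collection.mutable.ArrayBuffer

val seq = ArrayBuffer(1, 2, 3)

// The view plays the role of parallelize: the map is not evaluated yet,
// and the view stays backed by the live buffer.
val lazyResult = seq.view.map(_ * 10)

// Defensive copy: materialized immediately, detached from `seq`.
val safeCopy = seq.toList.map(_ * 10)

seq(0) = 100  // mutate before the first "action"

println(lazyResult.sum)  // 1050: reflects the mutation
println(safeCopy.sum)    // 60: unaffected
```

Passing `seq.toList` (or any copy) to parallelize gives the RDD a snapshot that later mutations cannot reach.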
Return the count of each unique value in this RDD as a local map of (value, count) pairs.
Note that this method should only be used if the resulting map is expected to be small, as the whole thing is loaded into the driver's memory. To handle very large results, consider using rdd.map(x => (x, 1L)).reduceByKey(_ + _), which returns an RDD[(T, Long)] instead of a map.
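What countByValue computes can be expressed over a plain Scala collection. The word list below is a hypothetical reconstruction from the printed output further down; the second map sketches the reduceByKey alternative with the same fold:

```scala
// Input reconstructed from the tutorial's printed output (assumption).
val words = Seq("Scala", "cassandra", "hadoop", "spark", "Scala",
                "hadoop", "spark", "Apache pig", "hive", "Java")

// Semantics of countByValue: count occurrences of each distinct element.
val counts: Map[String, Long] =
  words.groupBy(identity).map { case (w, ws) => (w, ws.size.toLong) }

// The scalable alternative, (x, 1L) pairs summed per key, expressed locally.
val viaReduce: Map[String, Long] =
  words.map(w => (w, 1L))
       .groupBy(_._1)
       .map { case (w, ps) => (w, ps.map(_._2).sum) }

println(counts)  // e.g. Map(Scala -> 2, hadoop -> 2, spark -> 2, ...)
```

On an RDD the first form ships the whole map to the driver, while the second keeps the counts distributed until an action is called.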
· JDK 1.7 or higher
· Scala 2.10.3
Compile and run the above example:
mvn clean install
Then run it as a Scala application.
CountByValue:Map(Scala -> 2, cassandra -> 1, hadoop -> 2, spark -> 2, Apache pig -> 1, hive -> 1, Java -> 1)
Scala : 2
cassandra : 1
hadoop : 2
spark : 2
Apache pig : 1
hive : 1
Java : 1