Spark - Different ways to create RDD in Scala

Category : Apache Spark | Sub Category : Apache Spark Programs | By Runner Dev Last updated: 2020-10-25 11:22:11 Viewed : 155



·        org.apache.spark.rdd.RDD

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.

Internally, each RDD is characterized by five main properties:

§  A list of partitions

§  A function for computing each split

§  A list of dependencies on other RDDs

§  Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

§  Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)


·        def apply[A](xs: A*): List[A]

Creates a List with the specified elements. A call such as List("Java", "Scala") is shorthand for List.apply("Java", "Scala").
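A quick plain-Scala illustration of the apply factory that the later example relies on when it builds its input collection:

```scala
// List("Java", ...) desugars to List.apply("Java", ...),
// producing an immutable List that sc.parallelize can later distribute.
val languages = List("Java", "Scala", "spark", "Scala")

val size = languages.size          // 4
val distinct = languages.distinct  // List(Java, Scala, spark), first occurrences kept in order
```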

·        def parallelize[T](seq: Seq[T], numSlices: Int)(implicit evidence$1: ClassTag[T]): RDD[T]

Distribute a local Scala collection to form an RDD.


Avoid using parallelize(Seq()) to create an empty RDD. Consider emptyRDD for an RDD with no partitions, or parallelize(Seq[T]()) for an RDD of T with empty partitions.

parallelize acts lazily. If seq is a mutable collection and is altered after the call to parallelize but before the first action on the RDD, the resultant RDD will reflect the modified collection. Pass a copy of the argument to avoid this.
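The mutation gotcha above can be illustrated without a SparkContext. In this plain-Scala analogue, a lazy view stands in for the deferred evaluation of parallelize, and an eager copy shows the safe pattern the note recommends:

```scala
import scala.collection.mutable.ArrayBuffer

val data = ArrayBuffer(1, 2, 3)

// Deferred: nothing is read from `data` yet (like an RDD before an action).
val deferred = data.view.map(_ * 10)
// Eager copy taken now (the "pass a copy" pattern from the note).
val copied = data.toList

data(0) = 100 // mutate before the first "action"

val fromDeferred = deferred.toList // sees the mutation: List(1000, 20, 30)
val fromCopy = copied.map(_ * 10)  // unaffected: List(10, 20, 30)
```

The deferred result reflects the mutation; the copy does not. The same reasoning applies to an RDD built with parallelize over a mutable collection.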

·        def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long]

Return the count of each unique value in this RDD as a local map of (value, count) pairs.

Note that this method should only be used if the resulting map is expected to be small, as the whole thing is loaded into the driver's memory. To handle very large results, consider using rdd.map(x => (x, 1L)).reduceByKey(_ + _), which returns an RDD[(T, Long)] instead of a map.
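The (value, 1L) pairing and per-key summation behind that suggestion can be sketched in plain Scala (no Spark runtime needed); groupBy plays the role that the shuffle plays in reduceByKey:

```scala
val words = Seq("spark", "scala", "spark", "hive")

// Plain-Scala equivalent of rdd.map(x => (x, 1L)).reduceByKey(_ + _):
// pair each element with 1L, then sum the 1s per key.
val counts: Map[String, Long] =
  words.map(x => (x, 1L))
       .groupBy(_._1)
       .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// Map(spark -> 2, scala -> 1, hive -> 1)
```

In Spark the same pattern keeps the result distributed as an RDD[(T, Long)], so nothing large is collected to the driver.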


1.  Program Setup instructions:

Create a Maven project and make sure the following dependencies are present in pom.xml:

·        JDK 1.7 or higher

·        Scala 2.10.3

·        spark-core_2.10

·        scala-library

2.  Example:

The following example illustrates different ways to create an RDD in Scala.

Save the file as CreateSparkRDD.scala


package com.runnerdev.rdd

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object CreateSparkRDD extends App {

  /* Create SparkConf */
  val conf = new SparkConf().setAppName("sparkRDD").setMaster("local[*]")
  val sc = new SparkContext(conf)

  /* Create a local collection using List
   * val inputRDD: List[String] */
  val inputRDD = List("Java", "Scala", "spark", "Scala", "hadoop", "spark", "hive", "Apache pig", "cassandra", "hadoop")

  /* Create an RDD using parallelize
   * val wordRdd: RDD[String] */
  val wordRdd = sc.parallelize(inputRDD)
  println("Count: " + wordRdd.count())

  /* val wordCountByValue: Map[String, Long] */
  val wordCountByValue = wordRdd.countByValue()
  // Map(Scala -> 2, cassandra -> 1, ...)
  println("CountByValue:" + wordCountByValue)

  for ((word, count) <- wordCountByValue) {
    println(word + " : " + count)
  }
}

Compile and run the above example as follows:

mvn clean install

Run it as a Scala application. The output is:




Count: 10

CountByValue:Map(Scala -> 2, cassandra -> 1, hadoop -> 2, spark -> 2, Apache pig -> 1, hive -> 1, Java -> 1)

Scala : 2

cassandra : 1

hadoop : 2

spark : 2

Apache pig : 1

hive : 1

Java : 1
