Read data from HBase using Spark RDD and Scala

Category: Scala | Sub Category: Scala Programs | By Prasad Bonam | Last updated: 2020-10-18


Read data from HBase using Spark and Scala


org.apache.spark.SparkContext

    Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.

    Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
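
As a quick illustration, a SparkContext for a local run can be created, used to build an RDD, and stopped as shown below (a minimal sketch; the application name and sample numbers are placeholders, unrelated to the HBase example that follows):

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextDemo {
  def main(args: Array[String]): Unit = {
    // Local master using all available cores; values here are placeholders
    val conf = new SparkConf().setAppName("sparkContextDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Use the context to create an RDD and run a simple action
    val rdd = sc.parallelize(1 to 5)
    println(rdd.sum())

    // Only one SparkContext may be active per JVM, so stop it when done
    sc.stop()
  }
}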

 1. Create a Maven project


  Add the following dependencies to the pom.xml file:


<properties>
    <hbase.version>1.2.0-cdh5.8.5</hbase.version>
    <spark.version>1.6.0-cdh5.8.0</spark.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.10.6</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>${spark.version}</version>
        <exclusions>
            <exclusion>
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-library</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>${spark.version}</version>
        <exclusions>
            <exclusion>
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-library</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <dependency>
        <groupId>it.nerdammer.bigdata</groupId>
        <artifactId>spark-hbase-connector_2.10</artifactId>
        <version>1.0.3</version>
        <exclusions>
            <exclusion>
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-library</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>


Example:

The following example illustrates how to read data from HBase using Spark and Scala.

Save the file as Runner.scala.

 

HBase table name: Employee (column family: cf)

Each row key has the form <empId>:<empName>, and the columns empId, empName, location and salary are stored in the column family cf:

     Row key        cf:empId   cf:empName   cf:location   cf:salary
     001:Ram        001        Ram          India         50000
     002:Rajesh     002        Rajesh       USA           70000
     003:xingpang   003        xingpang     China         50000
     004:Yishun     004        Yishun       Singapore     60000
     005:smita      005        smita        India         70000
     006:swetha     006        swetha       India         90000
     006:Archana    006        Archana      India         90000
     007:Mukhtar    07         Mukhtar      India         70000


    

File Name: Runner.scala


package com.runnerdev

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import it.nerdammer.spark.hbase._

object Runner {

  def main(args: Array[String]): Unit = {
    var sc: SparkContext = null
    try {
      val master = "local[*]"
      val zookeeperHost = "localhost"
      val zookeeperPort = "2189"
      val appName = "myAppName"

      // Point the connector at the HBase ZooKeeper quorum
      val sparkConf = new SparkConf().setAppName(appName).setMaster(master)
      sparkConf.set("spark.hbase.host", zookeeperHost)
      sparkConf.set("spark.hbase.clientport", zookeeperPort)
      sc = new SparkContext(sparkConf)

      val empList: Array[Employee] = readEmployeeList(sc)
      println("empList::" + empList)
      empList.foreach(println)
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (sc != null) sc.stop()
    }
  }

  // Reads the row key plus the empId, empName and location columns of the
  // "Employee" table (column family "cf") and collects them to the driver.
  def readEmployeeList(sc: SparkContext): Array[Employee] = {
    sc.hbaseTable[(String, Option[String], Option[String], Option[String])]("Employee")
      .select("empId", "empName", "location")
      .inColumnFamily("cf")
      .map(a => Employee(a._1, a._2.getOrElse(""), a._3.getOrElse(""), a._4.getOrElse("")))
      .collect()
  }

  case class Employee(rowKey: String, empId: String, empName: String, location: String)
}
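
The same read can be extended to additional columns, for example salary, and the rows can be filtered before collecting them. The method below is a hedged sketch that could be added inside the Runner object (it reuses the imports already present there) and assumes salary values are stored as plain strings in cf:salary:

  // Sketch: also read the salary column and keep only employees located in India.
  def readIndiaEmployees(sc: SparkContext): Array[(String, String, String)] =
    sc.hbaseTable[(String, Option[String], Option[String], Option[String])]("Employee")
      .select("empName", "location", "salary")
      .inColumnFamily("cf")
      .filter { case (_, _, location, _) => location.exists(_ == "India") }
      .map { case (rowKey, name, _, salary) =>
        (rowKey, name.getOrElse(""), salary.getOrElse("0"))
      }
      .collect()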

Build the project as follows:

  mvn clean install

Then run Runner as a Java application.

Output:

empList::[Lcom.runnerdev.Runner$Employee;@750ff7d3

Employee(001:Manoj,001,Manoj,Mumbai)

Employee(001:Ram,001,Ram,India)

Employee(002:Rajesh,002,Rajesh,USA)

Employee(003:xingpang,003,xingpang,China)

Employee(004:Yishun,004,Yishun,Singapore)

Employee(005:smita,005,smita,India)

Employee(006:Archana,006,Archana,India)

Employee(006:swetha,006,swetha,India)

Employee(007:Mukhtar,07,Mukhtar,India) 


Project structure (read data using Spark RDD):



Get the source code from GitHub:
