Spark – read, write and delete files in Scala

By Prasad Bonam | Last updated: 2020-10-25


This post shows how to read, write, and delete files in Spark with Scala. The following java.io.File and Spark RDD methods are used; a short sketch of the File methods follows the list.
·        File[] java.io.File.listFiles()

Returns an array of abstract pathnames denoting the files in the directory denoted by this abstract pathname.

·        boolean java.io.File.delete()

Deletes the file or directory denoted by this abstract pathname. If this pathname denotes a directory, then the directory must be empty in order to be deleted.

·        boolean java.io.File.exists()

Tests whether the file or directory denoted by this abstract pathname exists.

·        def textFile(path: String, minPartitions: Int): RDD[String]

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

·        def reduceByKey(func: (V, V) => V): RDD[(K, V)]

Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/parallelism level.

·        def saveAsTextFile(path: String): Unit

Save this RDD as a text file, using string representations of elements.

·        def foreach(f: String => Unit): Unit

Applies a function f to all elements of this RDD.

·        def map[U: ClassTag](f: String => U): RDD[U]

Return a new RDD by applying a function to all elements of this RDD.

·        def flatMap[U: ClassTag](f: String => TraversableOnce[U]): RDD[U]

Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
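As a quick, minimal sketch of the three java.io.File methods above (the FileOpsDemo name is an illustrative assumption; the directory is the sample directory used later in this article):

import java.io.File

object FileOpsDemo extends App {
  val dir = new File("/temp/WordCount")

  // exists(): test the path before touching it
  if (dir.exists) {
    // listFiles(): returns the children of a directory
    // (it returns null for a non-directory, so guard with isDirectory)
    if (dir.isDirectory)
      dir.listFiles.foreach(f => println(f.getAbsolutePath))

    // delete(): removes a file, or a directory only if it is empty;
    // on a non-empty directory it simply returns false
    println("deleted: " + dir.delete())
  }
}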

1.  Program Setup instructions:

Create a Maven project and make sure the following dependencies are present in pom.xml (a sample snippet is shown after the list):

·        JDK 1.7 or higher

·        Scala 2.10.3

·        spark-core_2.10

·        scala-library
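A minimal dependency snippet for the pom.xml might look like the following (the Spark version shown is an assumption; any release published for Scala 2.10, such as 1.6.x, will work):

<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.3</version>
  </dependency>
  <dependency>
    <!-- version is an assumption; pick any spark-core_2.10 release -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.3</version>
  </dependency>
</dependencies>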

CSV file name: employeeData.csv (fields are separated by semicolons; note that the last two rows use commas instead, which matters for the word count below)

name;age;job

Jorge;30;Developer

Bob;32;Developer

Bob;32;Developer

swetha;39;Tester

Ram;34;DataScience

Ram;34;DataScience

Prasad,34,DataScience

Bonam,34,Developer 

2.  Example:

The following example illustrates how to read, write, and delete files with Spark in Scala.

Save the file as ReadFiles.scala:

ReadFiles.scala 

package com.runnerdev.rdd

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import java.io.File

object ReadFiles extends App {

  // Create a SparkContext to initialize Spark
  val conf = new SparkConf().setAppName("Read File").setMaster("local")
  val sc = new SparkContext(conf)

  val counts = countWordsMethod(sc, "/temp/WordCount/employeeData.csv")

  // saveAsTextFile requires that the target directory does not already exist,
  // so delete any output left over from a previous run before writing.
  // The results go to a subdirectory so the input CSV is left untouched.
  val output = new File("/temp/WordCount/output")
  deleteRecursively(output)

  /** below logic is used to write a file */
  counts.saveAsTextFile(output.getAbsolutePath)

  /** below method is used to delete a file or directory recursively */
  def deleteRecursively(file: File): Unit = {
    if (file.isDirectory)
      file.listFiles.foreach(deleteRecursively)
    if (file.exists && !file.delete)
      throw new Exception(s"Unable to delete ${file.getAbsolutePath}")
  }

  /** below method is used to read a file and count the words in it */
  def countWordsMethod(sc: SparkContext, inputFileName: String): RDD[(String, Int)] = {

    val textFile: RDD[String] = sc.textFile(inputFileName)
    println("File Data :")
    textFile.foreach(println)

    // word count: split each line on ";", pair every token with 1,
    // then sum the counts per token
    val empCounts: RDD[(String, Int)] = textFile
      .flatMap(line => line.split(";"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    println("empCounts Data :")
    empCounts.collect().foreach(println)

    println("Total words: " + empCounts.count())
    empCounts
  }
}

 

 

Compile the above example with:

mvn clean install

Then run it as a Scala application from your IDE.
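Alternatively, if you package the project into a jar, it can be submitted to Spark directly (the jar name below is an assumption; it depends on the artifactId and version in your pom.xml):

spark-submit --class com.runnerdev.rdd.ReadFiles --master local target/spark-examples-1.0.jar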

  

 

Output:

  File Data :

name;age;job

Jorge;30;Developer

Bob;32;Developer

Bob;32;Developer

swetha;39;Tester

Ram;34;DataScience

Ram;34;DataScience

Prasad,34,DataScience

Bonam,34,Developer

 

  empCounts Data :

(Tester,1)

(Ram,2)

(job,1)

(Jorge,1)

(name,1)

(32,2)

(34,2)

(age,1)

(30,1)

(Prasad,34,DataScience,1)

(DataScience,2)

(Bonam,34,Developer,1)

(swetha,1)

(39,1)

(Developer,3)

(Bob,2) 

Total words: 16

Note: the two comma-separated rows, Prasad,34,DataScience and Bonam,34,Developer, contain no ";" delimiter, so each survives the split as a single token with a count of 1, while the semicolon-separated fields are counted individually.
