Spark – read, write and delete files in Scala

By Prasad Bonam | Last updated: 2020-10-25


This post shows how to read, write, and delete files in Spark with Scala. The following java.io.File and Spark RDD methods are used; a short sketch of the File methods follows the list.
·        File[] java.io.File.listFiles()

Returns an array of abstract pathnames denoting the files in the directory denoted by this abstract pathname.

·        boolean java.io.File.delete()

Deletes the file or directory denoted by this abstract pathname. If this pathname denotes a directory, then the directory must be empty in order to be deleted.

·        boolean java.io.File.exists()

Tests whether the file or directory denoted by this abstract pathname exists.

·        def textFile(path: String, minPartitions: Int): RDD[String]

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

·        def reduceByKey(func: (V, V) => V): RDD[(K, V)]

Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/parallelism level.

·        def saveAsTextFile(path: String): Unit

Save this RDD as a text file, using string representations of elements.

·        def foreach(f: String => Unit): Unit

Applies a function f to all elements of this RDD.

·        def map[U: ClassTag](f: String => U): RDD[U]

Return a new RDD by applying a function to all elements of this RDD.

·        def flatMap[U: ClassTag](f: String => TraversableOnce[U]): RDD[U]

Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
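As a quick, minimal sketch of the three java.io.File methods above (the FileOpsDemo name is an illustrative assumption; the directory is the sample directory used later in this article):

import java.io.File

object FileOpsDemo extends App {
  val dir = new File("/temp/WordCount")

  // exists(): test the path before touching it
  if (dir.exists) {
    // listFiles(): returns the children of a directory
    // (it returns null for a non-directory, so guard with isDirectory)
    if (dir.isDirectory)
      dir.listFiles.foreach(f => println(f.getAbsolutePath))

    // delete(): removes a file, or a directory only if it is empty;
    // on a non-empty directory it simply returns false
    println("deleted: " + dir.delete())
  }
}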

1.  Program Setup instructions:

Create a Maven project and make sure the following dependencies are present in pom.xml (a sample snippet is shown after the list):

·        JDK 1.7 or higher

·        Scala 2.10.3

·        spark-core_2.10

·        scala-library
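A minimal dependency snippet for the pom.xml might look like the following (the Spark version shown is an assumption; any release published for Scala 2.10, such as 1.6.x, will work):

<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.3</version>
  </dependency>
  <dependency>
    <!-- version is an assumption; pick any spark-core_2.10 release -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.3</version>
  </dependency>
</dependencies>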

CSV file name: employeeData.csv (fields are separated by semicolons; note that the last two rows use commas instead, which matters for the word count below)

name;age;job

Jorge;30;Developer

Bob;32;Developer

Bob;32;Developer

swetha;39;Tester

Ram;34;DataScience

Ram;34;DataScience

Prasad,34,DataScience

Bonam,34,Developer 

2.  Example:

The following example illustrates how to read, write, and delete files with Spark in Scala.

Save the file as ReadFiles.scala:

ReadFiles.scala 

package com.runnerdev.rdd

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import java.io.File

object ReadFiles extends App {

  // Create a SparkContext to initialize Spark
  val conf = new SparkConf().setAppName("Read File").setMaster("local")
  val sc = new SparkContext(conf)

  val counts = countWordsMethod(sc, "/temp/WordCount/employeeData.csv")

  // saveAsTextFile requires that the target directory does not already exist,
  // so delete any output left over from a previous run before writing.
  // The results go to a subdirectory so the input CSV is left untouched.
  val output = new File("/temp/WordCount/output")
  deleteRecursively(output)

  /** below logic is used to write a file */
  counts.saveAsTextFile(output.getAbsolutePath)

  /** below method is used to delete a file or directory recursively */
  def deleteRecursively(file: File): Unit = {
    if (file.isDirectory)
      file.listFiles.foreach(deleteRecursively)
    if (file.exists && !file.delete)
      throw new Exception(s"Unable to delete ${file.getAbsolutePath}")
  }

  /** below method is used to read a file and count the words in it */
  def countWordsMethod(sc: SparkContext, inputFileName: String): RDD[(String, Int)] = {

    val textFile: RDD[String] = sc.textFile(inputFileName)
    println("File Data :")
    textFile.foreach(println)

    // word count: split each line on ";", pair every token with 1,
    // then sum the counts per token
    val empCounts: RDD[(String, Int)] = textFile
      .flatMap(line => line.split(";"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    println("empCounts Data :")
    empCounts.collect().foreach(println)

    println("Total words: " + empCounts.count())
    empCounts
  }
}

 

 

Compile the above example with:

mvn clean install

Then run it as a Scala application from your IDE.
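Alternatively, if you package the project into a jar, it can be submitted to Spark directly (the jar name below is an assumption; it depends on the artifactId and version in your pom.xml):

spark-submit --class com.runnerdev.rdd.ReadFiles --master local target/spark-examples-1.0.jar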

  

 

Output:

  File Data :

name;age;job

Jorge;30;Developer

Bob;32;Developer

Bob;32;Developer

swetha;39;Tester

Ram;34;DataScience

Ram;34;DataScience

Prasad,34,DataScience

Bonam,34,Developer

 

  empCounts Data :

(Tester,1)

(Ram,2)

(job,1)

(Jorge,1)

(name,1)

(32,2)

(34,2)

(age,1)

(30,1)

(Prasad,34,DataScience,1)

(DataScience,2)

(Bonam,34,Developer,1)

(swetha,1)

(39,1)

(Developer,3)

(Bob,2) 

Total words: 16

Note: the two comma-separated rows, Prasad,34,DataScience and Bonam,34,Developer, contain no ";" delimiter, so each survives the split as a single token with a count of 1, while the semicolon-separated fields are counted individually.
