Spark - Read files in various formats using the appropriate Spark API

Category: Apache Spark | Sub Category: Apache Spark Programs | By Prasad Bonam | Last updated: 2023-09-29


In Apache Spark, you can read files in various formats using the appropriate Spark API. Spark provides built-in support for reading and processing data from different file formats. Here are some common file formats and how to read them using Spark:

1. Reading Text Files (e.g., CSV, JSON, TXT):

You can use the spark.read.text method to read text files. It loads any line-based file (plain text, CSV, JSON, and so on) as a DataFrame with a single value column, one row per line; it does not parse the contents.

scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TextFileReadExample")
  .getOrCreate()

val textDF = spark.read.text("path/to/text/file")
textDF.show()
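
A closely related call, spark.read.textFile, returns a Dataset[String] instead of a single-column DataFrame, which is handy when you want to transform lines functionally. A minimal sketch:

scala
// textFile returns Dataset[String]: each element is one line of the file.
val lines = spark.read.textFile("path/to/text/file")
val nonEmpty = lines.filter(_.nonEmpty) // drop blank lines
nonEmpty.show()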

2. Reading CSV Files:

You can use the spark.read.csv method to read CSV files into a DataFrame. Note that by default every column is read as a string and auto-named _c0, _c1, and so on; header handling and schema inference must be requested explicitly via options.

scala
val csvDF = spark.read.csv("path/to/csv/file")
csvDF.show()
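
A sketch with explicit options (assuming the file has a header row):

scala
// Use the first row as column names and sample the data to infer types.
val csvWithSchemaDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("path/to/csv/file")
csvWithSchemaDF.printSchema()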

3. Reading JSON Files:

To read JSON files, you can use the spark.read.json method. It infers the schema based on the JSON structure.

scala
val jsonDF = spark.read.json("path/to/json/file")
jsonDF.show()
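
Note that spark.read.json expects JSON Lines by default, i.e., one complete JSON object per line. For a pretty-printed file containing a single multi-line document or array, enable the multiLine option:

scala
// Read a multi-line (pretty-printed) JSON file.
val multiLineJsonDF = spark.read
  .option("multiLine", "true")
  .json("path/to/multiline/json/file")
multiLineJsonDF.show()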

4. Reading Parquet Files:

Apache Parquet is a columnar storage format. You can use the spark.read.parquet method to read Parquet files.

scala
val parquetDF = spark.read.parquet("path/to/parquet/file")
parquetDF.show()
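
Because Parquet stores data column by column and embeds its schema, selecting only the columns you need lets Spark skip reading the rest of the file. A minimal sketch (assuming the file has a name column, as in the sample data later in this article):

scala
// Column pruning: only the `name` column is read from disk.
val namesDF = spark.read.parquet("path/to/parquet/file").select("name")
namesDF.show()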

5. Reading ORC Files:

ORC (Optimized Row Columnar) is another columnar storage format. You can use the spark.read.orc method to read ORC files.

scala
val orcDF = spark.read.orc("path/to/orc/file")
orcDF.show()
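
Equivalently, any built-in format can be read through the generic DataFrameReader API, which is useful when the format name is decided at runtime:

scala
// Generic form: format name passed as a string.
val orcDF2 = spark.read.format("orc").load("path/to/orc/file")
orcDF2.show()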

6. Reading Avro Files:

You can use the spark.read.format("avro").load method to read Avro files. Note that Avro support ships as a separate module: add the spark-avro package matching your Spark and Scala versions (for example, via --packages org.apache.spark:spark-avro_2.12:<your-spark-version>).

scala
val avroDF = spark.read.format("avro").load("path/to/avro/file")
avroDF.show()

7. Reading XML Files:

Spark does not have native support for XML, but you can use external libraries such as spark-xml to read XML files. The rowTag option names the XML element that should become one row (here, each <record> element).

scala
val xmlDF = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("path/to/xml/file")
xmlDF.show()

8. Reading Sequence Files:

To read Hadoop Sequence Files, you can use the spark.sparkContext.sequenceFile method.

scala
val sequenceRDD = spark.sparkContext.sequenceFile[K, V]("path/to/sequence/file")

Remember to replace K and V with the actual key and value types in your Sequence File.
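
For illustration, here is a hypothetical Sequence File whose keys are stored as Text and values as IntWritable; Spark's implicit WritableConverters let you request the corresponding Scala types directly:

scala
// Keys stored as Text, values as IntWritable (hypothetical file layout).
val seqRDD = spark.sparkContext.sequenceFile[String, Int]("path/to/sequence/file")
seqRDD.take(5).foreach { case (k, v) => println(s"$k -> $v") }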

These are some common file formats you can read in Apache Spark. Depending on your use case and data, you may choose the appropriate format and API for reading files in Spark.


Here are examples of how to read files in different formats using Apache Spark, along with sample outputs:

1. Reading Text Files (e.g., CSV, JSON, TXT):

You can read text-based files using the spark.read.text method; the file is not parsed, so each line lands in a single value column. This example shows reading a CSV file:

scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TextFileReadExample")
  .getOrCreate()

val textDF = spark.read.text("path/to/csv/file.csv")
textDF.show()

Sample Output:

+------------------------+
|                   value|
+------------------------+
|           name,age,city|
|       Alice,25,New York|
|      Bob,30,Los Angeles|
|Charlie,22,San Francisco|
+------------------------+
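
Because spark.read.text leaves each line in a single value column, a CSV read this way still has to be parsed by hand. A minimal sketch using the built-in split function (assuming the comma-separated, three-column layout shown above):

scala
import org.apache.spark.sql.functions.split

// Split the single `value` column into named columns.
val parts = split(textDF("value"), ",")
val parsedDF = textDF.select(
  parts.getItem(0).as("name"),
  parts.getItem(1).as("age"),
  parts.getItem(2).as("city")
)
parsedDF.show()

Note that the header line still appears as an ordinary row; the dedicated CSV reader in the next section handles headers for you.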

2. Reading CSV Files:

You can read CSV files using the spark.read.csv method (here with default options, so the header row is read as data and the columns are auto-named):

scala
val csvDF = spark.read.csv("path/to/csv/file.csv")
csvDF.show()

Sample Output:

+-------+---+-------------+
|    _c0|_c1|          _c2|
+-------+---+-------------+
|   name|age|         city|
|  Alice| 25|     New York|
|    Bob| 30|  Los Angeles|
|Charlie| 22|San Francisco|
+-------+---+-------------+

3. Reading JSON Files:

To read JSON files, you can use the spark.read.json method. The inferred columns come out in alphabetical order (age before name), which is how Spark orders fields when inferring a JSON schema:

scala
val jsonDF = spark.read.json("path/to/json/file.json")
jsonDF.show()

Sample Output:

+---+-------+
|age|   name|
+---+-------+
| 25|  Alice|
| 30|    Bob|
| 22|Charlie|
+---+-------+

4. Reading Parquet Files:

You can read Parquet files using the spark.read.parquet method:

scala
val parquetDF = spark.read.parquet("path/to/parquet/file.parquet")
parquetDF.show()

Sample Output:

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 22|
+-------+---+

5. Reading ORC Files:

ORC files can be read using the spark.read.orc method:

scala
val orcDF = spark.read.orc("path/to/orc/file.orc")
orcDF.show()

Sample Output:

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 22|
+-------+---+

6. Reading Avro Files:

You can read Avro files using the spark.read.format("avro").load method:

scala
val avroDF = spark.read.format("avro").load("path/to/avro/file.avro")
avroDF.show()

Sample Output:

+---+-------+
|age|   name|
+---+-------+
| 25|  Alice|
| 30|    Bob|
| 22|Charlie|
+---+-------+

These examples demonstrate how to read various file formats in Apache Spark, and the sample outputs show the data loaded into Spark DataFrames. Depending on your specific data and use case, you can choose the appropriate file format and reading method.
