Category : Apache Spark | Sub Category : Apache Spark Programs | By Prasad Bonam Last updated: 2023-09-29 00:33:12
In Apache Spark, you can read files in various formats using the appropriate Spark API. Spark provides built-in support for reading and processing data from different file formats. Here are some common file formats and how to read them using Spark:
1. Reading Text Files (e.g., CSV, JSON, TXT):
You can use the spark.read.text method to read plain text files. Each line of the file becomes a single-column string row, so this is useful for raw text and logs, or for inspecting CSV or JSON files line by line before parsing them with a format-aware reader.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TextFileReadExample")
  .getOrCreate()

val textDF = spark.read.text("path/to/text/file")
textDF.show()
```
2. Reading CSV Files:
You can use the spark.read.csv method to read CSV files into a DataFrame. Note that by default Spark treats every column as a string and assigns generic names (_c0, _c1, ...); pass options such as header and inferSchema if you want column names and types derived from the file.

```scala
val csvDF = spark.read.csv("path/to/csv/file")
csvDF.show()
```
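As a sketch of the common options (the file path is a placeholder), header and inferSchema let Spark take column names from the first row and guess column types:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CsvOptionsExample")
  .getOrCreate()

// "header" uses the first line of the file as column names;
// "inferSchema" makes Spark scan the data to guess column types.
val csvDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("path/to/csv/file.csv") // placeholder path

csvDF.printSchema()
csvDF.show()
```

Inferring the schema requires an extra pass over the data, so for large files it is often faster to supply an explicit schema instead.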
3. Reading JSON Files:
To read JSON files, you can use the spark.read.json method. It infers the schema based on the JSON structure.

```scala
val jsonDF = spark.read.json("path/to/json/file")
jsonDF.show()
```
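By default spark.read.json expects one JSON object per line (JSON Lines). If the file holds a single pretty-printed document or a top-level array, the multiLine option is needed; a sketch with a placeholder path:

```scala
// Default mode: one JSON object per line (JSON Lines).
// For a pretty-printed document or top-level array, enable multiLine.
val jsonDF = spark.read
  .option("multiLine", "true")
  .json("path/to/json/file.json") // placeholder path

jsonDF.printSchema() // schema inferred from the JSON structure
jsonDF.show()
```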
4. Reading Parquet Files:
Apache Parquet is a columnar storage format. You can use the spark.read.parquet method to read Parquet files.

```scala
val parquetDF = spark.read.parquet("path/to/parquet/file")
parquetDF.show()
```
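Because Parquet is columnar, selecting only the columns you need allows Spark to skip the rest on disk. A minimal sketch (the path and column name are assumptions):

```scala
val parquetDF = spark.read.parquet("path/to/parquet/file.parquet")

// Column pruning: with a columnar format like Parquet, Spark reads
// only the selected columns from storage ("name" is an assumed column).
parquetDF.select("name").show()
```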
5. Reading ORC Files:
ORC (Optimized Row Columnar) is another columnar storage format. You can use the spark.read.orc method to read ORC files.

```scala
val orcDF = spark.read.orc("path/to/orc/file")
orcDF.show()
```
6. Reading Avro Files:
You can use the spark.read.format("avro").load method to read Avro files. Note that Avro support ships as a separate module, so you need the external spark-avro package (org.apache.spark:spark-avro) on your application's classpath.

```scala
val avroDF = spark.read.format("avro").load("path/to/avro/file")
avroDF.show()
```
7. Reading XML Files:
Spark does not have native support for XML, but you can use external libraries like "spark-xml" to read XML files.
```scala
val xmlDF = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("path/to/xml/file")
xmlDF.show()
```
8. Reading Sequence Files:
To read Hadoop Sequence Files, you can use the spark.sparkContext.sequenceFile method.

```scala
val sequenceRDD = spark.sparkContext.sequenceFile[K, V]("path/to/sequence/file")
```

Remember to replace K and V with the actual key and value types in your Sequence File.
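For instance, a Sequence File keyed by strings with integer values could be read like this (the path and types are assumptions; Spark supplies implicit converters from the Hadoop Writable types):

```scala
// Reads a Hadoop SequenceFile as an RDD of (String, Int) pairs.
// Spark's implicit WritableConverters map Text -> String and
// IntWritable -> Int behind the scenes.
val sequenceRDD = spark.sparkContext
  .sequenceFile[String, Int]("path/to/sequence/file") // placeholder path

sequenceRDD.take(5).foreach(println)
```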
These are some common file formats you can read in Apache Spark. Depending on your use case and data, you may choose the appropriate format and API for reading files in Spark.
Here are examples of how to read files in different formats using Apache Spark, along with sample outputs:
1. Reading Text Files (e.g., CSV, JSON, TXT):
You can read text-based files using the spark.read.text method. This example reads a CSV file as raw lines:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TextFileReadExample")
  .getOrCreate()

val textDF = spark.read.text("path/to/csv/file.csv")
textDF.show()
```
Sample Output:
```
+------------------------+
|                   value|
+------------------------+
|           name,age,city|
|       Alice,25,New York|
|      Bob,30,Los Angeles|
|Charlie,22,San Francisco|
+------------------------+
```
2. Reading CSV Files:
You can read CSV files using the spark.read.csv method:

```scala
val csvDF = spark.read.csv("path/to/csv/file.csv")
csvDF.show()
```
Sample Output:
```
+-------+---+-------------+
|    _c0|_c1|          _c2|
+-------+---+-------------+
|   name|age|         city|
|  Alice| 25|     New York|
|    Bob| 30|  Los Angeles|
|Charlie| 22|San Francisco|
+-------+---+-------------+
```
3. Reading JSON Files:
To read JSON files, you can use the spark.read.json method:

```scala
val jsonDF = spark.read.json("path/to/json/file.json")
jsonDF.show()
```
Sample Output:
```
+---+-------+
|age|   name|
+---+-------+
| 25|  Alice|
| 30|    Bob|
| 22|Charlie|
+---+-------+
```
4. Reading Parquet Files:
You can read Parquet files using the spark.read.parquet method:

```scala
val parquetDF = spark.read.parquet("path/to/parquet/file.parquet")
parquetDF.show()
```
Sample Output:
```
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 22|
+-------+---+
```
5. Reading ORC Files:
ORC files can be read using the spark.read.orc method:

```scala
val orcDF = spark.read.orc("path/to/orc/file.orc")
orcDF.show()
```
Sample Output:
```
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 22|
+-------+---+
```
6. Reading Avro Files:
You can read Avro files using the spark.read.format("avro").load method:

```scala
val avroDF = spark.read.format("avro").load("path/to/avro/file.avro")
avroDF.show()
```
Sample Output:
```
+---+-------+
|age|   name|
+---+-------+
| 25|  Alice|
| 30|    Bob|
| 22|Charlie|
+---+-------+
```
These examples demonstrate how to read various file formats in Apache Spark, and the sample outputs show the data loaded into Spark DataFrames. Depending on your specific data and use case, you can choose the appropriate file format and reading method.