Quang Hoang Xuan edited this page Sep 4, 2021 · 18 revisions

Welcome to the spark-excel wiki!

Pages with the "Examples" prefix are examples; each one highlights one or more main use cases with the given options in action. Basically, each example "borrows" its idea from #issues as a starting point. The preferred approach is documentation by example, with actual data alongside the reported issue (if any). Otherwise, examples use a pseudo ca_dataset dataset (you can find a copy in src/test/resources). Some of these examples can also be found in docs/spark_excel_example.ipynb.

This wiki will be built gradually, and it is meant to be open for editing. Feel free to jump in and correct existing bugs and issues.

TBD:

Setup Spark Excel

Dependencies

As of August 2021, besides the bundled jars, spark-excel needs the following dependencies.

Credit to #133 (Apache commons dependency issue) and the solutions from @jakeatmsft and @fwani.

Method I: Put into $SPARK_HOME/jars

Put a copy of the spark-excel jar and its dependency jars into the $SPARK_HOME/jars folder. In a cluster setup, make sure the Spark driver and all executor nodes have their $SPARK_HOME/jars updated as well.

Restart Spark (or the Spark cluster) to make these jars available.
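As a quick sanity check before restarting, you can verify that the expected jars are present under $SPARK_HOME/jars. This is a sketch: the `missing_jars` helper is hypothetical, and the jar names listed are illustrative; match them to the versions you actually downloaded.

```python
import os
from pathlib import Path

def missing_jars(jars_dir, required):
    """Return the names from `required` that are not present in `jars_dir`."""
    present = {p.name for p in Path(jars_dir).glob("*.jar")}
    return [name for name in required if name not in present]

# Illustrative jar names; use the versions you actually installed.
REQUIRED = [
    "spark-excel_2.12-0.14.0.jar",
    "poi-ooxml-schemas-4.1.2.jar",
    "commons-collections4-4.4.jar",
    "xmlbeans-3.1.0.jar",
]

if __name__ == "__main__":
    jars_dir = os.path.join(os.environ.get("SPARK_HOME", "."), "jars")
    for name in missing_jars(jars_dir, REQUIRED):
        print(f"missing: {name}")
```

If this prints nothing, the jars are in place and a restart should pick them up.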

Method II: When spark-excel cannot be put into $SPARK_HOME/jars

Use this method with pyspark's bundled Spark distribution, or with an existing SPARK_HOME to which you cannot add jar files. In this case, make sure the spark-excel and dependency jar files are available locally, then pass them to the Spark session builder as follows:

from pyspark.sql import SparkSession

SPARK_EXCEL_LIBS="/path/to/spark-excel/jars" # Update this please

DEPENDENCIES_JARS = [
    f"{SPARK_EXCEL_LIBS}/spark-excel_2.12-0.14.0.jar",
    f"{SPARK_EXCEL_LIBS}/poi-ooxml-schemas-4.1.2.jar",
    f"{SPARK_EXCEL_LIBS}/commons-collections4-4.4.jar",
    f"{SPARK_EXCEL_LIBS}/xmlbeans-3.1.0.jar"
]

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", ",".join(DEPENDENCIES_JARS)) \
    .getOrCreate()

Restart the Jupyter Python kernel and rerun the SparkSession.builder paragraph to make sure the Spark session can access the given jar files.
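Once the session can see the jars, reading an Excel file looks like the following. This is a sketch: the file path and option values are placeholders, and `excel_read_options` is just a hypothetical convenience for building the option map (Spark options are passed as strings).

```python
def excel_read_options(header=True, infer_schema=True, data_address="A1"):
    """Build a spark-excel option map; Spark option values are strings."""
    return {
        "header": str(header).lower(),
        "inferSchema": str(infer_schema).lower(),
        "dataAddress": data_address,
    }

# With a SparkSession named `spark` (see the builder snippet above):
# df = (spark.read.format("excel")
#       .options(**excel_read_options(data_address="'Sheet1'!A1"))
#       .load("/path/to/file.xlsx"))
```

The available options are described in the table below.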

Spark Excel Options

| Option | Default | Reading | Writing | Description |
| ------ | ------- | ------- | ------- | ----------- |
| columnNameOfCorruptRecord | None | Support | | Name for the column of corrupted records |
| columnNameOfRowNumber | None | Support | | Additional column for the Excel row number |
| dataAddress | A1 | Support | Support | Data address, defaults to everything |
| dateFormat | | Support | | String dateFormat |
| enforceSchema | False | Support | | Forcibly apply the specified or inferred schema to data files. If the option is enabled, headers of Excel files will be ignored |
| excerptSize | None (all) | Support | | excerptSize |
| fileExtension | xlsx | Support | Support | Output Excel file extension, defaults to xlsx |
| header | Yes | Support | Support | Whether there is a header line when reading and writing |
| ignoreAfterHeader | None (0) | Support | | Number of rows to ignore after the header. Reading only |
| ignoreLeadingWhiteSpace | False | Support | | ignoreLeadingWhiteSpace |
| ignoreTrailingWhiteSpace | False | Support | | ignoreTrailingWhiteSpace |
| inferSchema | False | Support | | inferSchema |
| keepUndefinedRows | False | Support | | If true, keep undefined Excel rows |
| locale | US | Support | Support | A language tag in IETF BCP 47 format |
| nanValue | NaN | Support | | nanValue |
| negativeInf | -Inf | Support | | negativeInf |
| nullValue | (empty) | Support | | nullValue |
| parseMode | permissive | Support | | Parsing mode; how to handle corrupted records. Defaults to permissive |
| positiveInf | Inf | Support | | positiveInf |
| samplingRatio | 1.0 (all) | Support | | Fraction of the file used for schema inferring. For default and invalid values, 1.0 will be used |
| timestampFormat | | Support | | Format to parse timestamps from text cells |
| useNullForErrorCells | Yes | Support | | Use null value for error cells |
| usePlainNumberFormat | False | Support | | If true, format the cells without rounding and scientific notation |
| workbookPassword | (empty) | Support | Support | Workbook password, optional |
| zoneId | (empty) | Support | | |

Or you can check out the actual ExcelOptions.scala.

References: