
XSLT

The XSLTProcessor class defined in spark-xml-utils provides methods that enable the transformation of a record by applying a stylesheet. The record is assumed to be a string of XML.

Imports

The following import is required for the XSLTProcessor.

	import com.elsevier.spark_xml_utils.xslt.XSLTProcessor

Creating the XSLTProcessor

All that is required is the stylesheet that will be used for the transformation. Typically I store the stylesheet (as a string) in an S3 bucket. The stylesheet can then be easily retrieved using sc.textFile. Alternatively, the stylesheet could be defined in the code as a string.

	val stylesheet = sc.textFile("s3n://spark-xml-utils/stylesheets/srctitle.xsl").collect.mkString("\n")
	val proc = XSLTProcessor.getInstance(stylesheet)
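
Alternatively, here is a minimal sketch of defining the stylesheet inline as a string (the srctitle element name is just illustrative, not tied to any particular schema):

	val stylesheet = """<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	                      <xsl:output method="text"/>
	                      <xsl:template match="/">
	                        <!-- Emit the text of the (illustrative) srctitle element -->
	                        <xsl:value-of select="//srctitle"/>
	                      </xsl:template>
	                    </xsl:stylesheet>"""
	val proc = XSLTProcessor.getInstance(stylesheet)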

Transform

The result of a transform operation is the output of applying the stylesheet to the content (a string of XML). The transformation can occur locally on the driver (if you have returned records to the driver) or on the workers. In practice, the transformation will typically occur on the workers, but I will show examples of both. The transform() method accepts either a String or an InputStream.

When transforming locally on the driver, the code would be something like the following. In the example below, local is an Array of (String, String) where the first item is the key and the second item is the string of XML.

	import com.elsevier.spark_xml_utils.xslt.XSLTProcessor

	val xmlKeyPair = sc.sequenceFile[String, String]("s3n://spark-xml-utils/xml/part*")
	val local = xmlKeyPair.take(10)
	
	val stylesheet = sc.textFile("s3n://spark-xml-utils/stylesheets/srctitle.xsl").collect.mkString("\n")

	val proc = XSLTProcessor.getInstance(stylesheet)

	val localSrctitles = local.map(rec => proc.transform(rec._2))

When transforming on the workers, the code would be something like the following. In the example below, xmlKeyPair is an RDD of (String, String) where the first item is the key and the second item is the string of XML. We use mapPartitions to initialize the XSLT processor once per partition for optimal performance, and then use the partition's iterator to process each record.

	import com.elsevier.spark_xml_utils.xslt.XSLTProcessor

	val xmlKeyPair = sc.sequenceFile[String, String]("s3n://spark-xml-utils/xml/part*")

	val stylesheet = sc.textFile("s3n://spark-xml-utils/stylesheets/srctitle.xsl").collect.mkString("\n")

	val srctitles = xmlKeyPair.mapPartitions(recsIter => {
                      val proc = XSLTProcessor.getInstance(stylesheet)
                      recsIter.map(rec => proc.transform(rec._2))
                    })
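
Since transform() also accepts an InputStream, the worker-side code could instead wrap each record in a stream. A minimal sketch, assuming the XML is UTF-8 encoded:

	import java.io.ByteArrayInputStream
	import java.nio.charset.StandardCharsets

	val srctitlesFromStreams = xmlKeyPair.mapPartitions(recsIter => {
	                             val proc = XSLTProcessor.getInstance(stylesheet)
	                             recsIter.map(rec => {
	                               // Wrap the XML string in an InputStream before transforming
	                               val stream = new ByteArrayInputStream(rec._2.getBytes(StandardCharsets.UTF_8))
	                               proc.transform(stream)
	                             })
	                           })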

If an error is encountered during the operation, it will be logged and an exception will be raised.
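
If you would rather skip bad records than fail the job, one option (not something spark-xml-utils does for you) is to wrap the call in scala.util.Try. A sketch:

	import scala.util.Try

	val safeSrctitles = xmlKeyPair.mapPartitions(recsIter => {
	                      val proc = XSLTProcessor.getInstance(stylesheet)
	                      // Drop records whose transformation raises an exception
	                      recsIter.flatMap(rec => Try(proc.transform(rec._2)).toOption)
	                    })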

There is also support for stylesheet parameters. The following example sets the stylesheet parameter named publisher to the value <p>Elsevier</p>. This parameter can then be easily accessed in the stylesheet.

	import com.elsevier.spark_xml_utils.xslt.XSLTProcessor
	import scala.collection.JavaConverters._
	import java.util.HashMap

	val stylesheet = sc.textFile("/mnt/spark-xml-utils/stylesheets/params.xsl").collect.mkString("\n")

	xmlKeyPair.mapPartitions(recsIter => {
	             val proc = XSLTProcessor.getInstance(stylesheet)
	             // The parameter values do not vary per record, so build the map once per partition
	             val stylesheetParams = new HashMap[String,String](Map("publisher" -> "<p>Elsevier</p>").asJava)
	             recsIter.map(rec => proc.transform(rec._2, stylesheetParams))
	           }).collect.foreach(println)
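
On the stylesheet side, the parameter just needs a matching top-level xsl:param declaration. A minimal sketch of what params.xsl might look like (illustrative, not the actual stylesheet used above):

	val paramsStylesheet = """<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	                            <!-- Declared at the top level so it can be set by the caller -->
	                            <xsl:param name="publisher"/>
	                            <xsl:template match="/">
	                              <xsl:value-of select="$publisher"/>
	                            </xsl:template>
	                          </xsl:stylesheet>"""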

Spark-Shell and Notebooks

I have successfully used XSLTProcessor from the spark-shell and notebook environments (such as Databricks and Zeppelin). Depending on the environment, you just need to get the spark-xml-utils.jar installed and available to the driver and workers. For the spark-shell, something like the following would be done.

	cd {spark-install-dir}
	./bin/spark-shell --jars lib/uber-spark-xml-utils-1.4.0.jar

Alternatively, you can use the --packages option.

	cd {spark-install-dir}
	./bin/spark-shell --packages elsevierlabs-os:spark-xml-utils:1.4.0