==Requirements==
Spark 1.5+
Maven 3.0+
Hadoop 2.6
- Create a Spark cluster on VCL by following the instructions at this repository.
- Download the English Wikipedia XML BZip2 file. It can be downloaded in one of the two ways:
a) as torrent from [dump torrents] (https://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki)
b) direct download as multiple BZIP2 streams from the [dump](link https://dumps.wikimedia.org/enwiki/) - Uncompress the bzip2 file using the below command. The uncompressed file is over 50GB. So make sure there is enough disk space. Uncompressing will take some time.
bzip2 -dk <filename>.bz2
- Tranfer the uncompressed XML file to HDFS using the below command. This may take a lot of time (well over 8 hours).
- Git clone this repository.
- Run mvn package to build an assembly jar. It will download parent dependencies as well the dependencies for modules lda & xml.
- After running the above step, an assembly jar LDA-1.0.2-jar-with-dependencies.jar should be present in lda/target folder. Transfer this jar to Spark master. You can use scp to do this.
- ssh to Spark master and submit the Spark job.