WikiLDA

==Requirements==
Spark 1.5+
Maven 3.0+
Hadoop 2.6

Create a Spark cluster on VCL by following the instructions at this repository.
Download the English Wikipedia XML bzip2 file. It can be downloaded in one of two ways:
a) as a torrent from [dump torrents](https://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki)
b) as a direct download of multiple bzip2 streams from the [dump](https://dumps.wikimedia.org/enwiki/)
Uncompress the bzip2 file with `bzip2 -dk <filename>.bz2`. The uncompressed file is over 50 GB, so make sure there is enough disk space. Uncompressing will take some time.
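
As a minimal sketch of the disk-space check, the snippet below decompresses only if the current directory has enough free space. The helper name `has_free_gb`, the 60 GB threshold, and the dump file name are assumptions for illustration and are not part of this repository.

```shell
# Hypothetical helper: succeeds only if DIR has at least REQUIRED_GB of free space.
has_free_gb() {
    dir="$1"
    required_gb="$2"
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # the fourth column is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((required_gb * 1024 * 1024)) ]
}

# Assumed file name; substitute the dump you actually downloaded.
DUMP="enwiki-latest-pages-articles.xml.bz2"

if [ ! -f "$DUMP" ]; then
    echo "Dump file $DUMP not found" >&2
elif has_free_gb . 60; then
    # -d decompresses, -k keeps the original .bz2 file
    bzip2 -dk "$DUMP"
else
    echo "Not enough free disk space for the uncompressed dump" >&2
fi
```

The threshold leaves some headroom over the 50 GB mentioned above; since `-k` keeps the compressed file, both copies must fit on the disk.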
Transfer the uncompressed XML file to HDFS. This may take a long time (well over 8 hours).
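
The exact transfer command is not shown here. One common way to do it, assuming the standard `hdfs dfs` client is on the PATH, is sketched below. The helper name `upload_to_hdfs`, the local file name, and the target directory are placeholders, not values taken from this repository.

```shell
# Assumed locations; adjust both to your own setup.
LOCAL_XML="enwiki-latest-pages-articles.xml"
HDFS_DIR="/user/$(id -un)/wikipedia"

# Hypothetical helper: create the target directory, copy the file, list the result.
upload_to_hdfs() {
    src="$1"
    dest_dir="$2"
    hdfs dfs -mkdir -p "$dest_dir" &&
    hdfs dfs -put "$src" "$dest_dir/" &&
    hdfs dfs -ls "$dest_dir"
}

if command -v hdfs >/dev/null 2>&1 && [ -f "$LOCAL_XML" ]; then
    upload_to_hdfs "$LOCAL_XML" "$HDFS_DIR"
else
    echo "hdfs is not on the PATH or $LOCAL_XML is missing; nothing uploaded" >&2
fi
```

Because the upload can run for hours, it may be worth starting it inside a terminal multiplexer or with `nohup` so that a dropped SSH session does not interrupt it.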
Git clone this repository.
Run `mvn package` to build an assembly jar. It will download the parent dependencies as well as the dependencies for the modules lda and xml.
After this step, an assembly jar named LDA-1.0.2-jar-with-dependencies.jar should be present in the lda/target folder. Transfer this jar to the Spark master; you can use scp to do this.
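
The build and copy steps can be sketched as a single helper, run from the repository root. The function name `build_and_ship` and the `user@spark-master` login are hypothetical; only the jar path comes from the text above.

```shell
JAR="lda/target/LDA-1.0.2-jar-with-dependencies.jar"
# Hypothetical login and host; replace with your own Spark master.
MASTER="user@spark-master"

# Build with Maven, confirm the assembly jar exists, then copy it to the master.
build_and_ship() {
    mvn package || return 1
    if [ ! -f "$JAR" ]; then
        echo "Expected assembly jar not found at $JAR" >&2
        return 1
    fi
    # Copies the jar into the remote user's home directory.
    scp "$JAR" "$MASTER:~/"
}
```

Calling `build_and_ship` stops early if the Maven build fails or the jar is missing, so a stale or absent artifact is never copied.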
SSH to the Spark master and submit the Spark job.
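
The submit command is not given here, so the following is only a sketch of the general `spark-submit` shape. The wrapper name `submit_lda` is hypothetical, and the main class, master URL, and application arguments are placeholders that must be taken from the lda module and your cluster.

```shell
# Allow overriding the launcher path; defaults to spark-submit on the PATH.
SPARK_SUBMIT="${SPARK_SUBMIT:-spark-submit}"

# Hypothetical wrapper: submit_lda MAIN_CLASS MASTER_URL JAR [APP_ARGS...]
submit_lda() {
    main_class="$1"
    master_url="$2"
    jar="$3"
    shift 3
    # Remaining arguments are passed through to the application unchanged.
    "$SPARK_SUBMIT" --class "$main_class" --master "$master_url" "$jar" "$@"
}

# Example call, assuming a standalone master on the default port 7077.
# Everything in angle brackets is a placeholder to fill in:
# submit_lda "<main-class>" "spark://<master-host>:7077" \
#     LDA-1.0.2-jar-with-dependencies.jar "<hdfs-path-to-xml>"
```

Keeping the launcher in a variable makes it easy to point at a specific Spark installation, for example `SPARK_SUBMIT=/opt/spark/bin/spark-submit`, where that path is again only an example.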
