ml2npy - Export Spark ML SparseVectors as a NumPy CSR matrix

The aim of this project is to provide tools that efficiently implement the components required for large-scale text mining.

The idea for this project came from two observations drawn from experience:

- Most of the time, it is data preprocessing that is expensive and demanding.
- Distributed algorithm implementations are still not as efficient as multicore or sequential implementations.

This project intends to leverage the best of both worlds: distributed preprocessing followed by single-machine algorithms. In text mining, a traditional and powerful approach is to use TF-IDF as the numerical representation of a document. This enables a variety of machine learning techniques to be readily applied to the data. Converting a document into TF-IDF or any other numerical format is compute-intensive, but once a numerical representation is available, various algorithms and models can be tried out on the preprocessed data.
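To make the TF-IDF representation mentioned above concrete, here is a minimal sketch in plain Python. It is not how ml2npy or Spark ML computes the weights: the function name `tf_idf` is hypothetical, and the formula (raw term count times `log(N / df)`) is the unsmoothed textbook variant, chosen here as an assumption for brevity.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: TF is the raw term count and IDF is log(N / df).
    Libraries usually apply smoothed variants, so their values differ.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights


docs = [
    "spark exports sparse vectors".split(),
    "numpy loads sparse matrices".split(),
]
weights = tf_idf(docs)
```

A term that occurs in every document (here `sparse`) gets weight zero, while terms unique to one document get the largest weight. Each document only carries entries for the terms it contains, which is why the resulting vectors are sparse.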

Numerical representations of text tend to be very sparse. Storing this data in a sparse matrix format saves memory and disk space. ml2npy provides tools and utilities to load a large corpus of text and save its numerical representation as a CSR matrix in NumPy format.
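The CSR layout referred to above can be illustrated with a short, dependency-free sketch. It does not reproduce ml2npy's actual export code: the helper `to_csr` and the sample rows are hypothetical, and each row is assumed to be a list of `(column, value)` pairs, similar in spirit to the indices and values of a sparse vector.

```python
def to_csr(sparse_rows, n_cols):
    """Pack rows of (column, value) pairs into the three CSR arrays.

    data    : non-zero values, row after row
    indices : column index of each value in `data`
    indptr  : row i occupies data[indptr[i]:indptr[i + 1]]
    """
    data, indices, indptr = [], [], [0]
    for row in sparse_rows:
        for col, value in row:
            if not 0 <= col < n_cols:
                raise ValueError(f"column {col} out of range for {n_cols} columns")
            indices.append(col)
            data.append(value)
        # Record where this row ends (and the next one begins).
        indptr.append(len(data))
    return data, indices, indptr


# Three rows of a 3 x 4 matrix; the second row is entirely zero.
rows = [
    [(0, 1.5), (3, 2.0)],
    [],
    [(1, 0.5)],
]
data, indices, indptr = to_csr(rows, n_cols=4)
```

Only the three non-zero values are stored, instead of all twelve cells. On the Python side, `scipy.sparse.csr_matrix` accepts the same triple in the form `csr_matrix((data, indices, indptr), shape=(3, 4))`.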

Why Npy format?

The Python and scikit-learn ecosystem has made machine learning much more accessible. Being able to load the data into Python means that many algorithms can be applied easily.
