Export spark ml SparseVectors as numpy csr matrix

indix/ml2npy

ml2npy - Export spark ml SparseVectors as numpy csr matrix

Maven Central

The aim of this project is to provide tools that efficiently implement the components required for large-scale text mining.

The idea for this project came from experience:

  1. Most of the time, data preprocessing is the expensive and demanding part.
  2. Distributed algorithm implementations are still not as effective as multicore or sequential implementations.

This project intends to leverage the best of both worlds: distributed preprocessing, followed by efficient single-machine algorithms. In text mining, a traditional and powerful approach is to use TF-IDF as the numerical representation of a document, which lets a variety of machine learning techniques be applied to the data directly. Converting a document into TF-IDF or any other numerical format is compute-intensive, but once a numerical representation is available, various algorithms and models can be tried out on the preprocessed data.
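To make the TF-IDF representation concrete, here is a minimal pure-Python sketch. It assumes one common variant (term frequency normalized by document length, unsmoothed logarithmic inverse document frequency) and whitespace tokenization; libraries often use smoothed variants, so this is not necessarily the weighting ml2npy produces. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocab, rows); rows[i] maps a term index to its TF-IDF weight.

    Illustrative variant only: tf = count / doc_length, idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    index = {term: i for i, term in enumerate(vocab)}
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    rows = []
    for tokens in tokenized:
        row = {}
        for term, count in Counter(tokens).items():
            weight = (count / len(tokens)) * math.log(n_docs / df[term])
            if weight != 0.0:  # keep only non-zero entries, i.e. stay sparse
                row[index[term]] = weight
        rows.append(row)
    return vocab, rows


vocab, rows = tf_idf(["spark exports sparse vectors",
                      "numpy loads sparse matrices"])
```

Each row stores only its non-zero weights. A term that occurs in every document gets a weight of zero under this variant and is dropped from the row.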

Numerical representations of text tend to be very sparse. Storing this data in a sparse matrix format saves both memory and disk space. ml2npy provides tools and utilities to load a large corpus of text and save its numerical representation as a CSR matrix in numpy format.
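As an illustration of why CSR (compressed sparse row) saves space, the sketch below builds the three CSR arrays from rows that store only their non-zero entries. It shows the general CSR layout, not the exact files ml2npy writes; the helper names `to_csr` and `csr_row_to_dense` are hypothetical.

```python
def to_csr(rows, n_cols):
    """Convert a list of {column: value} dicts into CSR arrays.

    data    : non-zero values, row by row
    indices : column index of each value in data
    indptr  : row i occupies data[indptr[i]:indptr[i + 1]]
    """
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col in sorted(row):
            if not 0 <= col < n_cols:
                raise ValueError(f"column {col} out of range")
            data.append(row[col])
            indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


def csr_row_to_dense(data, indices, indptr, i, n_cols):
    """Expand row i of a CSR matrix back into a dense list."""
    dense = [0.0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        dense[indices[k]] = data[k]
    return dense


# Three documents over a four-term vocabulary; the second one is empty.
data, indices, indptr = to_csr([{0: 1.5, 3: 2.0}, {}, {2: 0.5}], n_cols=4)
```

Only the three non-zero values are stored instead of all twelve cells. In Python, `scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 4))` accepts arrays in this layout.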

Why Npy format?

The Python and scikit-learn ecosystem has made machine learning much more accessible. Once the data can be loaded into Python, many algorithms can be applied to it easily.
