The course covered broad topics in information retrieval, from traditional ways of scoring the relevance between a query and documents to state-of-the-art machine learning algorithms for scoring documents. The project was split into three phases as follows:
Firstly, I designed a tree-based data structure (trie) to index the words of each document, storing each word's number of occurrences and its positions in the text. Querying this index returns the posting list of each word.
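A minimal sketch of such a trie-based index is shown below; the class and method names are illustrative, not the exact structure used in the project.

```python
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = {}
        # postings: doc_id -> list of positions where the word occurs
        self.postings = defaultdict(list)

class TrieIndex:
    """Character-level trie mapping each word to its posting list."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word, doc_id, position):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.postings[doc_id].append(position)

    def posting_list(self, word):
        """Return {doc_id: [positions]} for the word, or {} if absent."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return {}
        return dict(node.postings)

# Indexing a small toy corpus (doc_id -> text)
docs = {1: "world news today", 2: "sports news update"}
index = TrieIndex()
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        index.add(word, doc_id, pos)

print(index.posting_list("news"))  # {1: [1], 2: [1]}
```

The number of occurrences of a word in a document falls out of the structure as the length of its position list.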
Then, I implemented spell correction on the query using edit distance and computed similarity scores between the query and documents based on term frequency (tf) and inverse document frequency (idf) in the documents' vector space. The output of the algorithm was evaluated with MAP, F-measure, R-precision, and NDCG.
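The following is a hedged sketch of these two steps (not the project's exact implementation): Levenshtein edit distance for correcting query terms against the index vocabulary, and cosine similarity over tf-idf vectors for ranking.

```python
import math
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (rolling row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def correct(word, vocabulary):
    """Replace an out-of-vocabulary query word by the closest vocabulary word."""
    return word if word in vocabulary else min(vocabulary, key=lambda v: edit_distance(word, v))

def tf_idf_vectors(docs):
    """Build tf-idf vectors; tf is the raw count and idf = log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs], idf

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [d.split() for d in ["world business news", "sports news today", "science and tech"]]
vectors, idf = tf_idf_vectors(docs)
query = [correct(w, set(idf)) for w in "bussiness news".split()]  # -> ["business", "news"]
q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
ranking = sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)), reverse=True)
print(ranking)  # ranked (score, doc_index) pairs
```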
In this phase, each document was classified by its content into four categories: world, sports, business, and science/tech. We used simple classifiers such as Naive Bayes and k-nearest neighbors (with cosine similarity and Euclidean distance), trained them on the training set, and validated their performance on the test set.
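A minimal sketch of this comparison with scikit-learn is given below; the texts and labels are placeholders for the project's labeled news corpus, and the hyperparameters are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Placeholder data; in the project these come from the labeled training/test sets
train_texts = ["stocks fall sharply", "team wins final", "new planet found", "peace talks resume"]
train_labels = ["business", "sports", "science/tech", "world"]
test_texts = ["market rallies", "striker scores twice"]
test_labels = ["business", "sports"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

classifiers = {
    "naive_bayes": MultinomialNB(),
    "knn_cosine": KNeighborsClassifier(n_neighbors=1, metric="cosine"),
    "knn_euclidean": KNeighborsClassifier(n_neighbors=1, metric="euclidean"),
}
for name, clf in classifiers.items():
    clf.fit(X_train, train_labels)
    preds = clf.predict(X_test)
    print(name, accuracy_score(test_labels, preds))
```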
Then we examined the effects of text-processing methods, i.e., stopword removal, lemmatization, and stemming, on classification accuracy. The results showed that the changes in accuracy were minuscule.
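A hedged sketch of these preprocessing variants, here written with NLTK (which requires downloading the "stopwords" and "wordnet" resources); the project's actual toolkit may differ.

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(tokens, remove_stopwords=False, lemmatize=False, stem=False):
    """Apply the selected text-processing steps to a token list."""
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if lemmatize:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    if stem:
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens

print(preprocess("the runners were running quickly".split(),
                 remove_stopwords=True, lemmatize=True))
# ['runner', 'running', 'quickly']  (lemmatizer defaults to noun POS)
```

Toggling the flags produces the different preprocessing configurations whose classification accuracies were compared.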
We clustered the documents into four categories in tf-idf space with the k-means algorithm and visualized the clustered documents with the t-SNE method in two-dimensional space:
Ground truth:
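A minimal sketch of this clustering and visualization step with scikit-learn and matplotlib; `texts` is a placeholder for the project's document collection, and the t-SNE settings are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

texts = ["stocks fall", "market rallies", "team wins", "striker scores",
         "new planet found", "rocket launch", "peace talks", "election held"]

X = TfidfVectorizer().fit_transform(texts)

# Cluster in tf-idf space into the four target categories
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Project to two dimensions with t-SNE for visual inspection of the clusters
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.title("k-means clusters of tf-idf vectors (t-SNE projection)")
plt.show()
```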
In the next part, we computed word embeddings for the words in each document with the CBOW method of word2vec. Then, we examined different methods for building each document's representation from its word vectors: element-wise maximum, element-wise minimum, concatenation of the previous max and min vectors, and element-wise average. I found that the element-wise average worked better than the other methods by a large margin. The result of clustering with the new document representation is as follows:
Ground truth:
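A hedged sketch of this document-representation step, assuming gensim for CBOW word2vec; the corpus and hyperparameters are illustrative, not the project's exact setup.

```python
import numpy as np
from gensim.models import Word2Vec

corpus = [["stocks", "fall", "sharply"], ["team", "wins", "the", "final"],
          ["new", "planet", "found"], ["peace", "talks", "resume"]]

# sg=0 selects the CBOW training algorithm
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0, epochs=50)

def doc_vector(tokens, method="average"):
    """Pool the word vectors of a document into a single representation."""
    vecs = np.array([model.wv[t] for t in tokens if t in model.wv])
    if method == "max":
        return vecs.max(axis=0)
    if method == "min":
        return vecs.min(axis=0)
    if method == "concat":  # concatenation of the max and min vectors
        return np.concatenate([vecs.max(axis=0), vecs.min(axis=0)])
    return vecs.mean(axis=0)  # element-wise average (worked best in the project)

doc_vecs = np.stack([doc_vector(doc, method="average") for doc in corpus])
print(doc_vecs.shape)  # (4, 50)
```

The resulting document vectors replace the tf-idf vectors as input to the same k-means clustering and t-SNE visualization as before.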