
This repository contains codes, data and the report for the project of the Modern Information Retrieval course.


Modern Information Retrieval

The course covered broad topics in information retrieval, from traditional ways of scoring the relevance between a query and documents to state-of-the-art machine learning algorithms for ranking. The project was split into three phases as follows:

1. Processing Persian Wikipedia Documents

First, I designed a tree-based data structure (a trie) to index the words of each document, storing each word's number of occurrences and its positions in the text. Querying this index returns the posting list of a word.
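A positional trie index along these lines can be sketched as follows (class and method names are illustrative, not the repository's actual API):

```python
class TrieNode:
    """One node per character; postings live at word-terminal nodes."""
    def __init__(self):
        self.children = {}
        # postings: doc_id -> list of positions where the word occurs
        self.postings = {}


class TrieIndex:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word, doc_id, position):
        """Walk/extend the trie for `word` and record one occurrence."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.postings.setdefault(doc_id, []).append(position)

    def posting_list(self, word):
        """Return {doc_id: [positions]} for the word, or {} if absent."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return {}
        return node.postings
```

Storing positions (not just counts) at the terminal nodes is what makes phrase and proximity queries possible later on.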


Then I implemented spell correction on the query using edit distance, and computed query–document similarity scores based on term frequency (tf) and inverse document frequency (idf) in the documents' vector space. The output of the algorithm was evaluated with MAP, F-measure, R-precision, and NDCG.
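The spell correction relies on edit distance; a standard Levenshtein dynamic-programming implementation looks like this (the report's exact variant may differ, e.g. it might also count transpositions):

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b via dynamic programming."""
    m, n = len(a), len(b)
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]


def correct(query_term, vocabulary):
    """Pick the vocabulary word closest to a possibly misspelled term."""
    return min(vocabulary, key=lambda w: edit_distance(query_term, w))
```

In practice the candidate set would be the index's vocabulary, optionally pruned by length or first letter to keep the search fast.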

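The tf-idf vector-space scoring can be sketched as below; log-tf weighting with cosine normalization is assumed here, and the report's exact weighting scheme may differ:

```python
import math
from collections import Counter


def tfidf_scores(query_terms, docs):
    """Rank documents against a query with tf-idf cosine scores.

    docs: list of tokenized documents (lists of terms).
    Returns (doc_index, score) pairs sorted by descending score.
    """
    N = len(docs)
    df = Counter()                     # document frequency of each term
    for doc in docs:
        df.update(set(doc))

    def weight_vector(terms):
        """Log-tf * idf weights, cosine-normalized; unknown terms dropped."""
        tf = Counter(terms)
        vec = {t: (1 + math.log10(tf[t])) * math.log10(N / df[t])
               for t in tf if df[t] > 0}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    qvec = weight_vector(query_terms)
    scores = []
    for i, doc in enumerate(docs):
        dvec = weight_vector(doc)
        scores.append((i, sum(qvec.get(t, 0.0) * w for t, w in dvec.items())))
    return sorted(scores, key=lambda s: -s[1])
```

A document sharing no terms with the query scores exactly zero; everything else is ordered by the cosine of its weighted vector with the query's.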

2. Classification (KNN, Naive Bayes)

In this phase, we classified each document by its content into four categories: world, sports, business, and science/tech. We trained simple classifiers, Naive Bayes and k-nearest neighbors (with cosine similarity and Euclidean distance), on the training set and evaluated their performance on the test set.
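A minimal k-nearest-neighbors classifier with cosine similarity can be sketched as follows (a sketch over generic document vectors, not the project's code):

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_predict(train, x, k=5):
    """Classify x by majority vote among the k training vectors
    most cosine-similar to it. train: list of (vector, label)."""
    neighbors = sorted(train, key=lambda tv: cosine(tv[0], x),
                       reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Swapping `cosine` for a negated Euclidean distance gives the other variant the project compared.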

KNN Results


Naive Bayes Results

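A multinomial Naive Bayes text classifier of the kind used in this phase can be sketched as follows (illustrative only; add-one Laplace smoothing is assumed):

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.vocab = {t for doc in docs for t in doc}
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)  # class -> term frequencies
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        self.total = len(docs)
        return self

    def predict(self, doc):
        """Return the class maximizing log P(c) + sum log P(t|c)."""
        best, best_logp = None, float("-inf")
        V = len(self.vocab)
        for c in self.class_counts:
            logp = math.log(self.class_counts[c] / self.total)  # prior
            denom = sum(self.term_counts[c].values()) + V
            for t in doc:
                logp += math.log((self.term_counts[c][t] + 1) / denom)
            if logp > best_logp:
                best, best_logp = c, logp
        return best
```

Working in log space avoids underflow from multiplying many small per-term probabilities.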

Then we examined the effect of text-processing steps, i.e., stopword removal, lemmatization, and stemming, on classification accuracy. The results showed the changes in accuracy were minuscule.


3. Clustering (K-means, t-SNE, Word2Vec)

We clustered documents into four categories in tf-idf space with the k-means algorithm and visualized the clustered documents in two-dimensional space with t-SNE:
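The project presumably used a library implementation of k-means; as a minimal sketch of Lloyd's algorithm over document vectors (deterministic first-k initialization chosen here for simplicity, where random initialization is more typical):

```python
import numpy as np


def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    centroids = X[:k].astype(float).copy()   # simple init: first k points
    for _ in range(n_iter):
        # pairwise Euclidean distances, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # empty clusters keep their old centroid
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):      # converged
            break
        centroids = new
    return labels, centroids
```

t-SNE is then applied only for visualization, projecting the clustered tf-idf vectors down to 2-D without affecting the cluster assignments.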


Ground truth:


In the next part, we computed word embeddings for the words of each document with the CBOW variant of Word2Vec. We then examined different ways of building a document representation from its word vectors: element-wise maximum, element-wise minimum, concatenation of the max and min vectors, and element-wise average. I found the element-wise average worked better than the other methods by a wide margin. The result of clustering with this new representation of the documents is as follows:
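The four pooling strategies above can be sketched as follows, assuming the word vectors come from a trained Word2Vec (CBOW) model stacked into an `n_words × dim` array:

```python
import numpy as np


def doc_vector(word_vectors, method="avg"):
    """Collapse a document's word vectors (shape: n_words x dim) into one
    fixed-size document vector via element-wise pooling."""
    W = np.asarray(word_vectors, dtype=float)
    if method == "max":
        return W.max(axis=0)
    if method == "min":
        return W.min(axis=0)
    if method == "minmax":               # concatenation of max and min
        return np.concatenate([W.max(axis=0), W.min(axis=0)])
    if method == "avg":                  # best-performing in this project
        return W.mean(axis=0)
    raise ValueError(f"unknown pooling method: {method}")
```

Note that `minmax` doubles the dimensionality of the document vector, while the other three preserve it.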


Ground truth:

