The course covered broad topics in information retrieval, from traditional ways of scoring the relevance between a query and documents to state-of-the-art machine learning algorithms for scoring documents. The project was split into three phases as follows:
Firstly, I designed a tree-based data structure (trie) to index the words of each document, storing each word's number of occurrences and its positions in the text. Querying this index returns the posting list of each word.
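A minimal sketch of such a trie-based index is shown below; the class and method names are illustrative, not the exact structure used in the project.

```python
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = {}
        # postings: doc_id -> list of positions where the word occurs
        self.postings = defaultdict(list)

class TrieIndex:
    """Character-level trie mapping each word to its posting list."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word, doc_id, position):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.postings[doc_id].append(position)

    def posting_list(self, word):
        """Return {doc_id: [positions]} for the word, or {} if absent."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return {}
        return dict(node.postings)

# Indexing a small toy corpus (doc_id -> text)
docs = {1: "world news today", 2: "sports news update"}
index = TrieIndex()
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        index.add(word, doc_id, pos)

print(index.posting_list("news"))  # {1: [1], 2: [1]}
```

The number of occurrences of a word in a document falls out of the structure as the length of its position list.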
Then, I implemented spell correction on the query using edit distance and computed similarity scores between the query and documents based on term frequency (tf) and inverse document frequency (idf) in the documents' vector space. The output of the algorithm was evaluated with MAP, F-measure, R-precision, and NDCG.
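The following is a hedged sketch of these two steps (not the project's exact implementation): Levenshtein edit distance for correcting query terms against the index vocabulary, and cosine similarity over tf-idf vectors for ranking.

```python
import math
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (rolling row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def correct(word, vocabulary):
    """Replace an out-of-vocabulary query word by the closest vocabulary word."""
    return word if word in vocabulary else min(vocabulary, key=lambda v: edit_distance(word, v))

def tf_idf_vectors(docs):
    """Build tf-idf vectors; tf is the raw count and idf = log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs], idf

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [d.split() for d in ["world business news", "sports news today", "science and tech"]]
vectors, idf = tf_idf_vectors(docs)
query = [correct(w, set(idf)) for w in "bussiness news".split()]  # -> ["business", "news"]
q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
ranking = sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)), reverse=True)
print(ranking)  # ranked (score, doc_index) pairs
```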
In this phase, each document was classified by its content into four categories: world, sports, business, and science/tech. We used simple classifiers such as Naive Bayes and k-nearest neighbors (with cosine similarity and Euclidean distance), trained them on the training set, and validated their performance on the test set.
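A minimal sketch of this comparison with scikit-learn is given below; the texts and labels are placeholders for the project's labeled news corpus, and the hyperparameters are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Placeholder data; in the project these come from the labeled training/test sets
train_texts = ["stocks fall sharply", "team wins final", "new planet found", "peace talks resume"]
train_labels = ["business", "sports", "science/tech", "world"]
test_texts = ["market rallies", "striker scores twice"]
test_labels = ["business", "sports"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

classifiers = {
    "naive_bayes": MultinomialNB(),
    "knn_cosine": KNeighborsClassifier(n_neighbors=1, metric="cosine"),
    "knn_euclidean": KNeighborsClassifier(n_neighbors=1, metric="euclidean"),
}
for name, clf in classifiers.items():
    clf.fit(X_train, train_labels)
    preds = clf.predict(X_test)
    print(name, accuracy_score(test_labels, preds))
```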
Then we examined the effects of text-processing methods, i.e., stopword removal, lemmatization, and stemming, on classification accuracy. The results showed that the changes in accuracy were minuscule.
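A hedged sketch of these preprocessing variants, here written with NLTK (which requires downloading the "stopwords" and "wordnet" resources); the project's actual toolkit may differ.

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(tokens, remove_stopwords=False, lemmatize=False, stem=False):
    """Apply the selected text-processing steps to a token list."""
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if lemmatize:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    if stem:
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens

print(preprocess("the runners were running quickly".split(),
                 remove_stopwords=True, lemmatize=True))
# ['runner', 'running', 'quickly']  (lemmatizer defaults to noun POS)
```

Toggling the flags produces the different preprocessing configurations whose classification accuracies were compared.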
We clustered the documents into four categories in tf-idf space with the k-means algorithm and visualized the clustered documents with the t-SNE method in two-dimensional space:
Ground truth:
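A minimal sketch of this clustering and visualization step with scikit-learn and matplotlib; `texts` is a placeholder for the project's document collection, and the t-SNE settings are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

texts = ["stocks fall", "market rallies", "team wins", "striker scores",
         "new planet found", "rocket launch", "peace talks", "election held"]

X = TfidfVectorizer().fit_transform(texts)

# Cluster in tf-idf space into the four target categories
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Project to two dimensions with t-SNE for visual inspection of the clusters
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.title("k-means clusters of tf-idf vectors (t-SNE projection)")
plt.show()
```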
In the next part, we computed word embeddings for the words in each document with the CBOW method of word2vec. Then, we examined different methods for building each document's representation from its word vectors: element-wise maximum, element-wise minimum, concatenation of the previous max and min vectors, and element-wise average. I found that the element-wise average worked better than the other methods by a large margin. The result of clustering with the new document representation is as follows:
Ground truth:
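A hedged sketch of this document-representation step, assuming gensim for CBOW word2vec; the corpus and hyperparameters are illustrative, not the project's exact setup.

```python
import numpy as np
from gensim.models import Word2Vec

corpus = [["stocks", "fall", "sharply"], ["team", "wins", "the", "final"],
          ["new", "planet", "found"], ["peace", "talks", "resume"]]

# sg=0 selects the CBOW training algorithm
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0, epochs=50)

def doc_vector(tokens, method="average"):
    """Pool the word vectors of a document into a single representation."""
    vecs = np.array([model.wv[t] for t in tokens if t in model.wv])
    if method == "max":
        return vecs.max(axis=0)
    if method == "min":
        return vecs.min(axis=0)
    if method == "concat":  # concatenation of the max and min vectors
        return np.concatenate([vecs.max(axis=0), vecs.min(axis=0)])
    return vecs.mean(axis=0)  # element-wise average (worked best in the project)

doc_vecs = np.stack([doc_vector(doc, method="average") for doc in corpus])
print(doc_vecs.shape)  # (4, 50)
```

The resulting document vectors replace the tf-idf vectors as input to the same k-means clustering and t-SNE visualization as before.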