A part-of-speech (PoS) tagger implemented with three models: a Support Vector Machine (SVM), a Hidden Markov Model (HMM) and a Bi-directional Long Short-Term Memory network (Bi-LSTM).
The results, a detailed error analysis, the strengths and weaknesses of each model, and the references are also available in Report.pdf.
-
Instructions for running the code
SVM_PoS_tagger.ipynb contains the SVM implementation.
Run the notebook on Colab: https://colab.research.google.com/drive/178a6M4J3lt-twzGr1Nv-Y2EliW77NEt9?usp=sharing
It can also be run as a Python script, svm.py, from the Support Vector Machine directory.
No dependencies need to be installed when running from the Colab notebook.
-
Results
-
Per-POS accuracy vs Relative Frequency
-
Accuracy
| Model | Test Accuracy(%) | Train Accuracy(%) |
|---|---|---|
| HMM | 83.25 | 83.36 |
-
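The overall accuracy and the per-POS accuracy against relative frequency reported above can both be derived from a list of gold tags and a list of predicted tags. The sketch below shows one way to compute them; the function name `tag_accuracies` and the toy tag lists are illustrative assumptions and are not taken from the notebooks.

```python
from collections import Counter


def tag_accuracies(gold_tags, predicted_tags):
    """Return (overall accuracy, {tag: accuracy}, {tag: relative frequency})."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold_tags:
        raise ValueError("at least one tag is required")

    n = len(gold_tags)
    # How often each gold tag occurs, and how often it was predicted correctly.
    total = Counter(gold_tags)
    correct = Counter(g for g, p in zip(gold_tags, predicted_tags) if g == p)

    overall = sum(correct.values()) / n
    per_tag = {tag: correct[tag] / count for tag, count in total.items()}
    rel_freq = {tag: count / n for tag, count in total.items()}
    return overall, per_tag, rel_freq


# Toy data (hypothetical), using universal tagset labels.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
pred = ["DET", "NOUN", "NOUN", "DET", "VERB"]
overall, per_tag, rel_freq = tag_accuracies(gold, pred)
```

Plotting `per_tag` against `rel_freq` gives a chart of the same kind as the per-POS accuracy plot.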
Feature Engineering
| Features Selected | Accuracy(%) |
|---|---|
| Word length, capitalisation, upper-case, lower-case, isNumeric | 40 |
| Prefix and suffix for nouns, verbs, adjectives and adverbs | 55 |
| Word stems using PorterStemmer | 60 |
| Tag of Previous Word | 65 |
| Pre-trained Glove word-embeddings (1 lakh 50-dimensional vectors) | 83 |
| Pre-trained word2vec embeddings (10 crore 300-dimensional vectors) | 90 |
| Including the features of previous 3 words and following 3 words | 95 |
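As a rough illustration of the hand-crafted features listed above, the sketch below builds a feature dictionary for one word and its neighbours within a 3-word window. It is a simplified assumption of how such features might look, not the code in svm.py: the function name `word_features`, the feature keys and the fixed 3-character prefix and suffix are all hypothetical, and stems, previous tags and embeddings are left out to keep it dependency-free.

```python
def word_features(sentence, index, window=3):
    """Build a feature dict for sentence[index] and its neighbours."""

    def basic(word, key):
        # Surface-form features of a single word, namespaced by `key`.
        return {
            f"{key}length": len(word),
            f"{key}is_capitalised": word[:1].isupper(),
            f"{key}is_upper": word.isupper(),
            f"{key}is_lower": word.islower(),
            f"{key}is_numeric": word.isdigit(),
            f"{key}prefix3": word[:3].lower(),
            f"{key}suffix3": word[-3:].lower(),
        }

    features = basic(sentence[index], "w0_")
    # Add the same features for up to `window` words on each side.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = index + offset
        if 0 <= j < len(sentence):
            features.update(basic(sentence[j], f"w{offset:+d}_"))
    return features


sentence = ["The", "dog", "barked", "3", "times"]
features = word_features(sentence, 1)
```

A list of such dictionaries can be turned into a sparse matrix (for example with scikit-learn's `DictVectorizer`) before training a classifier.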
-
Hidden Markov Model
-
Instructions for running the code
main.ipynb contains the HMM implementation. Open it with Jupyter Notebook.
The following packages must be installed: nltk, pandas, seaborn, matplotlib, scikit-learn (imported as sklearn), tqdm
Also run the following commands in the notebook:
nltk.download('brown')
nltk.download('universal_tagset')
The Hidden Markov Model directory contains a copy of all the images, plots and data produced by main.ipynb.
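For readers unfamiliar with HMM tagging, the sketch below shows Viterbi decoding, the standard way to find the most probable tag sequence under a bigram HMM. It is a generic illustration on a toy model, not the code in main.ipynb; the function signature, the probability tables and the `floor` smoothing constant are assumptions made for the example.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    if not words:
        return []

    def log(p):
        # Floor zero probabilities so that unseen events do not give log(0).
        return math.log(max(p, floor))

    # best[t][tag]: log-probability of the best path ending in `tag` at position t.
    best = [{
        tag: log(start_p.get(tag, 0)) + log(emit_p[tag].get(words[0], 0))
        for tag in tags
    }]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for tag in tags:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p].get(tag, 0))) for p in tags),
                key=lambda item: item[1],
            )
            best[t][tag] = score + log(emit_p[tag].get(words[t], 0))
            back[t][tag] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model (hypothetical probabilities).
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"dog": 0.05, "barks": 0.6},
}
decoded = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
```

In a real tagger the three probability tables would be estimated from tagged training sentences, typically with some smoothing for unseen words.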
-
Results
-
Baseline Model Architecture
| Layer (type) | Output Shape | # Param |
|---|---|---|
| embedding (Embedding) | (None, 180, 300) | 14944800 |
| bidirectional | (None, 180, 64) | 85248 |
| time_distributed | (None, 180, 13) | 845 |

Total params: 15,030,893
Trainable params: 86,093
Non-trainable params: 14,944,800
-
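The parameter counts in the baseline architecture summary can be checked by hand. The sketch below reproduces them under two assumptions that are inferred from the output shapes rather than stated in the source: the bidirectional layer wraps a standard LSTM with 32 units per direction (hence the output width of 64), and the time-distributed layer is a dense layer mapping 64 features to 13 outputs. The vocabulary size of 49,816 is likewise inferred by dividing the embedding parameter count by the embedding dimension.

```python
def embedding_params(vocab_size, dim):
    # One `dim`-dimensional vector per vocabulary entry.
    return vocab_size * dim


def bilstm_params(input_dim, units):
    # A standard LSTM has four gates, each with input weights,
    # recurrent weights and a bias; a bidirectional wrapper doubles it.
    one_direction = 4 * (units * (input_dim + units) + units)
    return 2 * one_direction


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output.
    return input_dim * output_dim + output_dim


vocab_size = 14944800 // 300  # inferred from the embedding row
embedding = embedding_params(vocab_size, 300)
bilstm = bilstm_params(300, 32)
dense = dense_params(64, 13)

total = embedding + bilstm + dense
trainable = bilstm + dense  # the embedding is listed as non-trainable
```

Under these assumptions the totals match the summary: the frozen embedding accounts for all non-trainable parameters, and the LSTM and dense layers account for the trainable ones.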
Instructions for running the code
BiLSTMBaseline.ipynb contains the baseline Bi-LSTM implementation.
Run the baseline code from https://colab.research.google.com/drive/1lhBd-gxsXNVeQJtBXoLZ7HABI5yYttrw?usp=sharing
The CNN-based code is available in two formats:
- CNNBiLSTMCRF.ipynb, for Jupyter Notebook
- main.py, which can be run as a Python script
The following packages must be installed for the CNN-based code: pytorch, nltk, torchvision, numpy, seaborn, pandas, matplotlib, tqdm, scikit-learn (imported as sklearn)
Also download the GloVe embeddings from http://nlp.stanford.edu/data/wordvecs/glove.6B.zip and extract them in the source directory.
Execute the following commands for the CNN-based code:
nltk.download('brown')
nltk.download('universal_tagset')
The results of running the CNN-based code are stored in the Bi-LSTM folder.
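Once the GloVe archive is extracted, each of its .txt files holds one word per line followed by the vector components, separated by spaces. The sketch below parses such a file into a dictionary. It is a minimal illustration, not the loader used in main.py: the function name `load_glove` and the `expected_dim` parameter are hypothetical, and which of the extracted files the project actually reads is not stated in the source.

```python
def load_glove(path, expected_dim=None):
    """Parse a GloVe .txt file into a {word: [float, ...]} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word = parts[0]
            values = [float(x) for x in parts[1:]]
            if expected_dim is not None and len(values) != expected_dim:
                raise ValueError(
                    f"line {line_number}: expected {expected_dim} values, "
                    f"got {len(values)}"
                )
            vectors[word] = values
    return vectors


# Example call, assuming the 50-dimensional file from glove.6B.zip is present:
# vectors = load_glove("glove.6B.50d.txt", expected_dim=50)
```

Words from the corpus that are missing from the dictionary need a fallback, for example a zero vector or a shared unknown-word vector.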
-
Results
References
- Leon Bottou "UNE APPROCHE THEORIQUE DE L'APPRENTISSAGE CONNEXIONNISTE ET APPLICATIONS A LA RECONNAISSANCE DE LA PAROLE" PhD thesis (1991)
- Xuezhe Ma and Eduard Hovy "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" (2016)
- Jesús Giménez and Lluís Màrquez "Fast and Accurate Part-of-Speech Tagging: The SVM Approach Revisited" (2003)
- Mathieu Blondel, Akinori Fujino and Naonori Ueda "Large-scale Multiclass Support Vector Machine Training via Euclidean Projection onto the Simplex" (2014)