Part-of-Speech-Tagger

An implementation of a part-of-speech tagger using a Support Vector Machine (SVM), a Hidden Markov Model (HMM), and a Bi-directional Long Short-Term Memory network (Bi-LSTM).

The results, a detailed error analysis, the strengths and weaknesses of each model, and the references are also available in Report.pdf.

Support Vector Machine

Instructions for running the code

SVM_PoS_tagger.ipynb contains the implementation of SVM

Run the code from https://colab.research.google.com/drive/178a6M4J3lt-twzGr1Nv-Y2EliW77NEt9?usp=sharing

It can also be run as a Python script from svm.py in the Support Vector Machine directory.

No dependencies need to be installed when running from the Colab notebook.

Results

Per-POS accuracy vs Relative Frequency
Accuracy

Model	Test Accuracy(%)	Train Accuracy(%)
SVM	83.25	83.36
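
The per-POS accuracy and relative frequency reported in the plots can be computed from flat lists of gold and predicted tags. The following is a minimal sketch under that assumption; `per_tag_accuracy` is a hypothetical helper and the toy tag lists are made up for illustration, not taken from the notebooks.

```python
from collections import Counter


def per_tag_accuracy(gold_tags, predicted_tags):
    """Return {tag: (relative_frequency, accuracy)} for two flat tag lists."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have equal length")
    totals = Counter(gold_tags)
    # Count only positions where the prediction matches the gold tag.
    correct = Counter(g for g, p in zip(gold_tags, predicted_tags) if g == p)
    n = len(gold_tags)
    return {tag: (totals[tag] / n, correct[tag] / totals[tag]) for tag in totals}


# Toy data, for illustration only.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADJ"]
pred = ["DET", "NOUN", "VERB", "DET", "VERB", "NOUN"]
stats = per_tag_accuracy(gold, pred)
for tag, (freq, acc) in sorted(stats.items()):
    print(f"{tag}: frequency={freq:.2f}, accuracy={acc:.2f}")
```

Plotting accuracy against relative frequency makes it easy to see whether rare tags are the ones a model gets wrong.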

Feature Engineering

Features Selected	Accuracy(%)
Word length, capitalisation, upper-case, lower-case, isNumeric	40
Prefix and suffix for nouns, verbs, adjectives and adverbs	55
Word stems using PorterStemmer	60
Tag of Previous Word	65
Pre-trained GloVe word embeddings (100,000 50-dimensional vectors)	83
Pre-trained word2vec embeddings (100 million 300-dimensional vectors)	90
Including the features of previous 3 words and following 3 words	95
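
A few of the hand-crafted features in the table can be sketched as a single function that returns a feature dictionary per token. This is only an illustration under stated assumptions: `word_features` is a hypothetical helper, the prefix and suffix lengths are arbitrary, the context window uses the lower-cased neighbouring words as a stand-in for their full feature sets, and stems and embeddings are omitted.

```python
def word_features(sentence, index, window=3, previous_tag=None):
    """Build a feature dict for sentence[index] (hypothetical helper)."""
    word = sentence[index]
    features = {
        "length": len(word),
        "is_capitalised": word[:1].isupper(),
        "is_upper": word.isupper(),
        "is_lower": word.islower(),
        "is_numeric": word.isdigit(),
        "prefix_2": word[:2].lower(),  # prefix length is an assumption
        "suffix_3": word[-3:].lower(),  # suffix length is an assumption
    }
    if previous_tag is not None:
        features["previous_tag"] = previous_tag
    # Context window: up to `window` words on each side, padded at the edges.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        pos = index + offset
        inside = 0 <= pos < len(sentence)
        features[f"word_{offset:+d}"] = sentence[pos].lower() if inside else "<PAD>"
    return features


sentence = ["The", "Fulton", "County", "Jury", "said"]
features = word_features(sentence, 1, previous_tag="DET")
print(features)
```

Dictionaries of this shape can be turned into sparse vectors for a linear SVM, for example with scikit-learn's `DictVectorizer`.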

Hidden Markov Model

Instructions for running the code

The file main.ipynb contains the HMM implementation. Open it with Jupyter Notebook.

The following packages must be installed: nltk, pandas, seaborn, matplotlib, scikit-learn, tqdm

Also make sure to run the following commands in the Jupyter notebook:
```
import nltk
# Download the Brown corpus and the universal tagset mapping
nltk.download('brown')
nltk.download('universal_tagset')
```
The Hidden Markov Model directory has a copy of all the images, plots and data produced by main.ipynb.

Results
- Per-POS accuracy vs Relative Frequency
- Accuracy
  
  Model	Test Accuracy(%)	Train Accuracy(%)
  HMM	96.01	97.35
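
For readers unfamiliar with HMM tagging, decoding is commonly done with the Viterbi algorithm. The sketch below is a generic bigram-HMM decoder in log space, not necessarily what main.ipynb does; the `viterbi` function, the probability floor for unseen events, and all toy probabilities are assumptions made for illustration.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    if not words:
        return []

    def log(p):
        # Floor zero probabilities so unseen events do not produce log(0).
        return math.log(max(p, floor))

    # best[t][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{
        tag: (log(start_p.get(tag, 0)) + log(emit_p[tag].get(words[0], 0)), None)
        for tag in tags
    }]
    for word in words[1:]:
        column = {}
        for tag in tags:
            score, prev = max(
                (best[-1][p][0] + log(trans_p[p].get(tag, 0)), p) for p in tags
            )
            column[tag] = (score + log(emit_p[tag].get(word, 0)), prev)
        best.append(column)

    # Backtrack from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]


# Toy model, for illustration only.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"dog": 0.05, "barks": 0.6},
}
decoded = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
print(decoded)
```

In practice the start, transition, and emission probabilities would be estimated from tag and word counts in the training split of the Brown corpus.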

Bi-LSTM

Baseline Model Architecture

Layer (type)	Output Shape	# Param
embedding (Embedding)	(None, 180, 300)	14944800
bidirectional	(None, 180, 64)	85248
time_distributed	(None, 180, 13)	845

Total params: 15,030,893
Trainable params: 86,093
Non-trainable params: 14,944,800
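
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas for an embedding layer, an LSTM with biases, and a dense layer; the vocabulary size (49,816) and the 32 LSTM units per direction are not stated in this README and are inferred from the table.

```python
def embedding_params(vocab_size, dim):
    """One vector of size `dim` per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, output_dim):
    """Weight matrix plus one bias per output."""
    return input_dim * output_dim + output_dim


EMBED_DIM = 300
NUM_CLASSES = 13
UNITS_PER_DIRECTION = 32  # inferred: the bidirectional output width is 64
VOCAB_SIZE = 14944800 // EMBED_DIM  # inferred from the embedding row

embedding = embedding_params(VOCAB_SIZE, EMBED_DIM)
bidirectional = 2 * lstm_params(EMBED_DIM, UNITS_PER_DIRECTION)
time_distributed = dense_params(2 * UNITS_PER_DIRECTION, NUM_CLASSES)

print("embedding:", embedding)
print("bidirectional:", bidirectional)
print("time_distributed:", time_distributed)
print("trainable:", bidirectional + time_distributed)
print("total:", embedding + bidirectional + time_distributed)
```

The embedding layer accounts for all non-trainable parameters, which is consistent with frozen pre-trained 300-dimensional embeddings.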

Instructions for running the code

BiLSTMBaseline.ipynb contains the implementation of the baseline Bi-LSTM model.

Run the baseline code from https://colab.research.google.com/drive/1lhBd-gxsXNVeQJtBXoLZ7HABI5yYttrw?usp=sharing

The CNN-based code is available in two formats:
1. CNNBiLSTMCRF.ipynb, for Jupyter Notebook
2. main.py, which can be run as a Python script

The following packages must be installed for the CNN-based code: pytorch, torchvision, nltk, numpy, pandas, seaborn, matplotlib, tqdm, scikit-learn

Also download the GloVe embeddings from http://nlp.stanford.edu/data/wordvecs/glove.6B.zip and extract them into the source directory.
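
The extracted GloVe files are plain text, with one word per line followed by its space-separated vector components. A minimal sketch of a loader follows; `load_glove` is a hypothetical helper rather than the function used in main.py, and the demo writes a tiny three-dimensional sample file so that it runs without the download.

```python
import os
import tempfile


def load_glove(path, expected_dim=None):
    """Parse a GloVe text file into {word: [float, ...]} (hypothetical helper)."""
    embeddings = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if expected_dim is not None and len(values) != expected_dim:
                continue  # skip lines that do not match the expected dimension
            embeddings[word] = [float(v) for v in values]
    return embeddings


# Demo on a tiny sample file in the same format (values are made up).
sample = "the 0.1 0.2 0.3\ndog -0.4 0.5 0.6\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
vectors = load_glove(tmp.name, expected_dim=3)
os.remove(tmp.name)
print(vectors["dog"])
```

With the real archive, the call would look like `load_glove("glove.6B.100d.txt", expected_dim=100)`; which of the extracted files the CNN-based code actually reads is an assumption here.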

Run the following commands for the CNN-based code:
```
import nltk
# Download the Brown corpus and the universal tagset mapping
nltk.download('brown')
nltk.download('universal_tagset')
```
The results of running the CNN-based code are stored in the Bi-LSTM folder.

Results
- Per-POS accuracy vs Relative Frequency
- Accuracy
  
  Model	Test Accuracy(%)	Train Accuracy(%)
  Bi-LSTM Baseline	79.38	78.39
  Bi-LSTM-CNN	87.15	87.19

References

Léon Bottou, "UNE APPROCHE THEORIQUE DE L'APPRENTISSAGE CONNEXIONNISTE ET APPLICATIONS A LA RECONNAISSANCE DE LA PAROLE", PhD thesis (1991)
Xuezhe Ma and Eduard Hovy, "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" (2016)
Jesús Giménez and Lluís Màrquez, "Fast and Accurate Part-of-Speech Tagging: The SVM Approach Revisited" (2003)
Mathieu Blondel, Akinori Fujino, and Naonori Ueda, "Large-scale Multiclass Support Vector Machine Training via Euclidean Projection onto the Simplex" (2014)
