sp1999/Part-of-Speech-Tagger

Part-of-Speech-Tagger

A part-of-speech tagger implemented with three models: a Support Vector Machine (SVM), a Hidden Markov Model (HMM) and a bidirectional Long Short-Term Memory network (Bi-LSTM).

The full results, a detailed error analysis, the strengths and weaknesses of each model, and the references are also available in Report.pdf.

Support Vector Machine

  • Instructions for running the code

    SVM_PoS_tagger.ipynb contains the SVM implementation.

    Run it on Colab: https://colab.research.google.com/drive/178a6M4J3lt-twzGr1Nv-Y2EliW77NEt9?usp=sharing

    It can also be run as a Python script, svm.py, from the Support Vector Machine directory.

    No dependencies need to be installed when running from the Colab notebook.

  • Results

    • Per-POS accuracy vs Relative Frequency

      [Figures: per-POS accuracy (train); per-POS accuracy (test)]

    • Accuracy

      | Model | Test Accuracy (%) | Train Accuracy (%) |
      |-------|-------------------|--------------------|
      | SVM   | 83.25             | 83.36              |

    • Feature Engineering

      | Features selected                                                        | Accuracy (%) |
      |--------------------------------------------------------------------------|--------------|
      | Word length, capitalisation, upper-case, lower-case, isNumeric           | 40           |
      | Prefix and suffix for nouns, verbs, adjectives and adverbs               | 55           |
      | Word stems using PorterStemmer                                           | 60           |
      | Tag of previous word                                                     | 65           |
      | Pre-trained GloVe word embeddings (100,000 50-dimensional vectors)       | 83           |
      | Pre-trained word2vec embeddings (100 million 300-dimensional vectors)    | 90           |
      | Features of the previous 3 words and the following 3 words               | 95           |
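
The hand-crafted features in the first two rows, combined with the context window from the last row, can be illustrated with a small sketch. This is not the code from svm.py: the function name `word_features`, the feature names and the fixed prefix/suffix lengths are assumptions made for illustration, and the stem, previous-tag and embedding features are left out.

```python
def word_features(sentence, i, window=3):
    """Surface features for the word at index i plus its neighbours.

    Hypothetical helper: names and affix lengths are illustrative only.
    """
    def single(word, prefix):
        return {
            f"{prefix}length": len(word),
            f"{prefix}is_capitalised": word[:1].isupper(),
            f"{prefix}is_upper": word.isupper(),
            f"{prefix}is_lower": word.islower(),
            f"{prefix}is_numeric": word.isdigit(),
            f"{prefix}prefix2": word[:2].lower(),
            f"{prefix}suffix3": word[-3:].lower(),
        }

    # Features of the target word itself.
    features = single(sentence[i], "w0_")

    # Features of up to `window` words on each side, skipping positions
    # that fall outside the sentence.
    for offset in range(-window, window + 1):
        j = i + offset
        if offset == 0 or not 0 <= j < len(sentence):
            continue
        features.update(single(sentence[j], f"w{offset:+d}_"))
    return features
```

Dictionaries of this shape could then be vectorised (for example with scikit-learn's `DictVectorizer`) before being passed to an SVM.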

Hidden Markov Model

  • Instructions for running the code

    The file main.ipynb contains the HMM implementation. Open it with Jupyter Notebook.

    Install the following packages: nltk, pandas, seaborn, matplotlib, scikit-learn, tqdm

    Then run the following commands in the notebook:

    import nltk
    nltk.download('brown')              # Brown corpus
    nltk.download('universal_tagset')   # mapping to the universal POS tagset
    

    The Hidden Markov Model directory contains a copy of all the images, plots and data produced by main.ipynb.

  • Results

    • Per-POS accuracy vs Relative Frequency

      [Figures: per-POS accuracy (train); per-POS accuracy (test)]

    • Accuracy

      | Model | Test Accuracy (%) | Train Accuracy (%) |
      |-------|-------------------|--------------------|
      | HMM   | 96.01             | 97.35              |
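
For readers new to HMM tagging, the idea can be sketched in a few lines: count tag-to-tag transitions and tag-to-word emissions from tagged sentences, then decode with the Viterbi algorithm. This is a minimal illustration, not the code in main.ipynb. The function names, the `<s>` start symbol and the add-one smoothing are all assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # assumed start-of-sentence symbol
        for word, tag in sentence:
            word = word.lower()
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words`."""
    tags = sorted(emissions)
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 slot for unseen words
    # best[tag] = (log score of the best path ending in tag, that path)
    best = {"<s>": (0.0, [])}
    for word in words:
        word = word.lower()
        new_best = {}
        for tag in tags:
            emit = log_prob(emissions[tag], word, n_words)
            score, path = max(
                (prev_score + log_prob(transitions[prev], tag, n_tags) + emit, prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values())[1]
```

A toy usage example, with made-up training sentences:

```python
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_hmm(corpus)
print(viterbi(["the", "cat", "barks"], *model))  # ['DET', 'NOUN', 'VERB']
```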

Bi-LSTM

  • Baseline Model Architecture

    | Layer (type)          | Output Shape     | Param #  |
    |-----------------------|------------------|----------|
    | embedding (Embedding) | (None, 180, 300) | 14944800 |
    | bidirectional         | (None, 180, 64)  | 85248    |
    | time_distributed      | (None, 180, 13)  | 845      |

    Total params: 15,030,893
    Trainable params: 86,093
    Non-trainable params: 14,944,800
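
The parameter counts in the table can be cross-checked with simple arithmetic. The sketch below assumes a standard LSTM cell with biases, a frozen embedding layer (which matches the non-trainable count), and a dense output layer applied at every time step. The vocabulary size and the number of LSTM units per direction are not stated in this README; they are inferred from the table.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def bilstm_params(input_dim, units):
    """Two LSTMs (forward and backward); each has 4 gates with
    input weights, recurrent weights and a bias vector."""
    one_direction = 4 * (units * (input_dim + units) + units)
    return 2 * one_direction

def dense_params(input_dim, output_dim):
    """Weight matrix plus bias, shared across time steps."""
    return input_dim * output_dim + output_dim

VOCAB_SIZE = 49_816  # inferred: 14,944,800 / 300
EMBED_DIM = 300
LSTM_UNITS = 32      # inferred: bidirectional output width 64 / 2
NUM_TAGS = 13

frozen = embedding_params(VOCAB_SIZE, EMBED_DIM)
trainable = bilstm_params(EMBED_DIM, LSTM_UNITS) + dense_params(2 * LSTM_UNITS, NUM_TAGS)

print(frozen)              # 14944800
print(trainable)           # 86093
print(frozen + trainable)  # 15030893
```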

  • Instructions for running the code

    BiLSTMBaseline.ipynb contains the implementation of the Bi-LSTM baseline.

    Run the baseline code on Colab: https://colab.research.google.com/drive/1lhBd-gxsXNVeQJtBXoLZ7HABI5yYttrw?usp=sharing

    The CNN-based code is available in two formats:

    1. CNNBiLSTMCRF.ipynb, a Jupyter notebook
    2. main.py, which can be run as a Python script

    Install the following packages for the CNN-based code: pytorch, nltk, torchvision, numpy, seaborn, pandas, matplotlib, tqdm, scikit-learn

    Also download the GloVe embeddings from http://nlp.stanford.edu/data/wordvecs/glove.6B.zip and extract them in the source directory.

    Run the following commands before running the CNN-based code:

    import nltk
    nltk.download('brown')              # Brown corpus
    nltk.download('universal_tagset')   # mapping to the universal POS tagset
    

    The results of running the CNN-based code are stored in the Bi-LSTM folder.

  • Results

    • Per-POS accuracy vs Relative Frequency

      [Figures: per-POS accuracy (train); per-POS accuracy (test)]

    • Accuracy

      | Model            | Test Accuracy (%) | Train Accuracy (%) |
      |------------------|-------------------|--------------------|
      | Bi-LSTM Baseline | 79.38             | 78.39              |
      | Bi-LSTM-CNN      | 87.15             | 87.19              |
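
The "Per-POS accuracy vs Relative Frequency" plots in each Results section compare how often a tag is predicted correctly with how common that tag is. A minimal sketch of how both quantities can be computed from flat lists of gold and predicted tags follows. The function name `per_pos_report` and the output layout are assumptions of this sketch, not taken from the repository code.

```python
from collections import Counter

def per_pos_report(gold_tags, predicted_tags):
    """Per-tag accuracy and relative frequency, keyed by gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted tag lists must have equal length")

    totals = Counter(gold_tags)
    correct = Counter(g for g, p in zip(gold_tags, predicted_tags) if g == p)
    n_tokens = len(gold_tags)

    return {
        tag: {
            # fraction of tokens with this gold tag that were tagged correctly
            "accuracy": correct[tag] / count,
            # share of all tokens that carry this gold tag
            "relative_frequency": count / n_tokens,
        }
        for tag, count in totals.items()
    }
```

Plotting accuracy against relative frequency for each tag then shows whether rare tags are the ones a model struggles with.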

References

  1. Léon Bottou, "Une Approche Théorique de l'Apprentissage Connexionniste et Applications à la Reconnaissance de la Parole", PhD thesis (1991)
  2. Xuezhe Ma and Eduard Hovy, "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" (2016)
  3. Jesús Giménez and Lluís Màrquez, "Fast and Accurate Part-of-Speech Tagging: The SVM Approach Revisited" (2003)
  4. Mathieu Blondel, Akinori Fujino and Naonori Ueda, "Large-scale Multiclass Support Vector Machine Training via Euclidean Projection onto the Simplex" (2014)
