In this project, I download the Persian wiki-dump dataset from Wikipedia, preprocess it, and finally train a spell checker and a KenLM language model.
Download the Persian wiki dump using the following bash script. The dataset is about 1 GB, so be patient!
Note: If you live in Iran (and you most likely do, since this repo is for the Persian language), turn on your VPN!
language=fa
bash download_wiki_dump.sh $language
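For reference, the script essentially fetches the latest articles dump for the given language; a rough sketch of what it likely runs (the exact script may differ):
# assumed: download the latest Persian articles dump from the official Wikimedia mirror
language=fa
wget "https://dumps.wikimedia.org/${language}wiki/latest/${language}wiki-latest-pages-articles.xml.bz2"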
Extract and convert the .bz2 format to .txt. Using wikiextractor, the dump is cleaned and converted to a .txt file. This may take some time as well!
n_processors=16
bash extract_and_clean_wiki_dump.sh ${language}wiki-latest-pages-articles.xml.bz2 $n_processors
Note: In case of a pdb error, change the expand_templates=True argument to expand_templates=False; it is an input argument to the clean_text function located at around line 948 of wikiextractor/wikiextractor/extract.py.
The output text should be preprocessed and normalized to remove unnecessary text like "[doc]" and to normalize it using the hazm and nltk libraries!
Install the requirements:
pip install -r requirements.txt
Main Processing. It may take some time!
python preprocess_wiki_dump.py fawiki-latest-pages-articles.txt
python cleaner.py
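To give a rough idea of what this step does, here is a minimal sketch using hazm; the doc-tag pattern, file names, and helper function are illustrative assumptions, not the exact contents of preprocess_wiki_dump.py or cleaner.py:
import re
from hazm import Normalizer

# assumed pattern for leftover wikiextractor doc markers such as "[doc]" or "<doc ...>"
DOC_TAG = re.compile(r"\[/?doc[^\]]*\]|</?doc[^>]*>")
normalizer = Normalizer()

def clean_line(line):
    line = DOC_TAG.sub("", line)               # drop doc markers
    return normalizer.normalize(line).strip()  # hazm normalization (character unification, spacing)

with open("fawiki-latest-pages-articles.txt", encoding="utf-8") as src, \
     open("fawiki_preprocessed.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cleaned = clean_line(line)
        if cleaned:
            dst.write(cleaned + "\n")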
Using this script, the corpus words will be counted. Before counting, some extra normalization and cleaning is applied to the words as well.
sudo apt-get install pv
bash get_counts.sh
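Conceptually, the result is a term-frequency file with one "word count" pair per line (the format SymSpell expects later on). A rough Python equivalent of the counting step, with file names assumed from this README:
from collections import Counter

counts = Counter()
with open("fawiki_preprocessed.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

# write "word count" pairs, most frequent first
with open("fa_wiki.counts", "w", encoding="utf-8") as out:
    for word, count in counts.most_common():
        out.write(f"{word} {count}\n")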
SymSpell needs a text file that contains the vocabs and their occurrence counts. The fa_wiki.counts file created in the Get the word-count of the corpus section should be trimmed to contain only the 80k most frequent words, dropping those that occur fewer than 25 times.
python get_spellchecker_top_vocabs.py --top-vocabs 80000 --ignore-less 25 --output wiki_fa_80k.txt
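The script roughly amounts to filtering and truncating the counts file; a minimal sketch (function and file names assumed, mirroring the command above):
def trim_counts(src="fa_wiki.counts", dst="wiki_fa_80k.txt", top_vocabs=80000, ignore_less=25):
    pairs = []
    with open(src, encoding="utf-8") as f:
        for line in f:
            word, count = line.split()
            if int(count) >= ignore_less:
                pairs.append((word, int(count)))
    pairs.sort(key=lambda p: p[1], reverse=True)  # most frequent first
    with open(dst, "w", encoding="utf-8") as out:
        for word, count in pairs[:top_vocabs]:
            out.write(f"{word} {count}\n")

trim_counts()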
SymSpell is a simple spell checker. First, install it from PyPI using the following command:
pip install symspellpy
To use it, just instantiate it with the vocab dictionary we created in the Get top frequent vocabs for SymSpell section:
# import symspell
from symspellpy import SymSpell, Verbosity
# instantiate it
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = "wiki_fa_80k.txt"
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
# input sample:
input_term = "اهوار" # misspelling of "اهواز" It's a city name!
# lookup the dictionary
suggestions = sym_spell.lookup(input_term, Verbosity.ALL, max_edit_distance=2)
# display the suggestion term, edit distance, and term frequency
for suggestion in suggestions[:5]:
print(suggestion)
The output is as follows. As you can see, اهواز is correctly chosen!
اهواز, 1, 4692
ادوار, 1, 1350
الوار, 1, 651
انوار, 1, 305
اهورا, 1, 225
Using the following script, the 80K most frequent vocabs are written to wiki_fa_kenlm_vocabs.txt. To make training faster, vocabs with fewer than 25 occurrences are discarded!
python get_kenlm_top_vocabs.py --top-vocabs 80000 --ignore-less 25 --output wiki_fa_kenlm_vocabs.txt
First install the KenLM requirements using the following commands:
sudo apt-get update
sudo apt-get install cmake build-essential libssl-dev libeigen3-dev libboost-all-dev zlib1g-dev libbz2-dev liblzma-dev -y
Then clone the repository and build the C++ modules:
git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4
If everything goes fine, you can find lmplz and build_binary under the ./kenlm/build/bin directory. Eventually, train the KenLM language model using the following bash script:
bash train_kenlm.sh -o 4 -l fa
Note: a binarized model is also created because it is much faster to load and query than the non-binarized (ARPA) one.
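Under the hood, the script presumably builds an ARPA model with lmplz and then binarizes it with build_binary; a sketch of the likely commands (input file names are assumptions based on this README):
# assumed: train a 4-gram model on the preprocessed corpus
./kenlm/build/bin/lmplz -o 4 < fawiki_preprocessed.txt > fa_wiki.arpa
# the script may also restrict the vocabulary, e.g. with lmplz's --limit_vocab_file wiki_fa_kenlm_vocabs.txt
# convert the ARPA file to the much faster binary format
./kenlm/build/bin/build_binary fa_wiki.arpa fa_wiki.binary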
Install the KenLM Python module:
pip install https://github.com/kpu/kenlm/archive/master.zip
How to use it:
import kenlm
model = kenlm.Model('fa_wiki.binary')
print("score: ", model.score('کشور ایران شهر تهران', bos=True, eos=True))
print("score: ", model.score('کشور تهران شهر ایران', bos=True, eos=True))
# score: -11.683658599853516
# score: -15.572178840637207
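The Python module also exposes a perplexity helper built on the same scores; a quick sketch (lower perplexity means the sentence looks more natural to the model):
import kenlm

# same binarized model as above
model = kenlm.Model('fa_wiki.binary')
print("ppl: ", model.perplexity('کشور ایران شهر تهران'))
print("ppl: ", model.perplexity('کشور تهران شهر ایران'))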
For more examples check out the following link: https://github.com/kpu/kenlm/blob/master/python/example.py