Merge pull request #82 from dayyass/develop
release v0.1.5
dayyass authored Oct 21, 2021
2 parents c9d6876 + 095c958 commit 3353a53
Showing 28 changed files with 28,775 additions and 172 deletions.
1 change: 1 addition & 0 deletions .coveragerc
@@ -9,6 +9,7 @@ exclude_lines =
    raise AssertionError
    raise NotImplementedError
    if __name__ == .__main__.:
+    ...

omit =
    text_clf/__main__.py
2 changes: 1 addition & 1 deletion Makefile
@@ -17,4 +17,4 @@ pypi_twine:
pypi_clean:
	rm -rf dist text_classification_baseline.egg-info
clean:
-	rm -rf models/model*
+	rm -rf models/*
29 changes: 23 additions & 6 deletions README.md
@@ -12,10 +12,10 @@
[![pypi version](https://img.shields.io/pypi/v/text-classification-baseline)](https://pypi.org/project/text-classification-baseline)
[![pypi downloads](https://img.shields.io/pypi/dm/text-classification-baseline)](https://pypi.org/project/text-classification-baseline)

-### Text Classification Baseline
+## Text Classification Baseline
Pipeline for fast building text classification baselines with **TF-IDF + LogReg**.
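The core recipe can be sketched directly with scikit-learn. The toy data below is illustrative, and this is a stand-in for what the package builds, not its internal code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus standing in for a real csv of (text, target) rows.
train_texts = ["good movie", "bad movie", "great film", "awful film"]
train_labels = ["pos", "neg", "pos", "neg"]

# TF-IDF features feeding a logistic-regression classifier,
# mirroring the config.yaml defaults shown further down.
pipeline = Pipeline([
    ("tf-idf", TfidfVectorizer(lowercase=True, ngram_range=(1, 1))),
    ("logreg", LogisticRegression(C=1.0, class_weight="balanced", solver="saga")),
])
pipeline.fit(train_texts, train_labels)
print(pipeline.predict(["good film"]))
```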

-### Usage
+## Usage
Instead of writing custom code for a specific text classification task, you just need to:
1. install pipeline:
```shell script
pip install text-classification-baseline
```

@@ -41,7 +41,7 @@ No data preparation is needed, only a **csv** file with two raw columns (with ar

The **target** can be presented in any format, including text - not necessarily integers from *0* to *n_classes-1*.
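One way such free-form targets can be mapped to integers from *0* to *n_classes-1* (and back) is sklearn's `LabelEncoder`; the labels below are illustrative, and this only sketches the idea, not the package's internals:

```python
from sklearn.preprocessing import LabelEncoder

# Text targets in arbitrary format, as the README allows.
targets = ["sci.space", "rec.autos", "sci.space", "talk.politics"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(targets)  # sorted classes -> 0..n_classes-1
print(encoded.tolist())        # -> [1, 0, 1, 2]
print(list(encoder.classes_))  # mapping back to the original names
```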

-#### Config
+### Config
The user interface consists of two files:
- [**config.yaml**](https://github.com/dayyass/text-classification-baseline/blob/main/config.yaml) - general configuration with sklearn **TF-IDF** and **LogReg** parameters
- [**hyperparams.py**](https://github.com/dayyass/text-classification-baseline/blob/main/hyperparams.py) - sklearn **GridSearchCV** parameters
@@ -62,6 +62,7 @@ Default **config.yaml**:
```yaml
seed: 42
path_to_save_folder: models
+experiment_name: model

# data
data:
@@ -71,6 +72,11 @@ data:
  text_column: text
  target_column: target_name_short

+# preprocessing
+# (included in resulting model pipeline, so preserved for inference)
+preprocessing:
+  lemmatization: null # pymorphy2

# tf-idf
tf-idf:
  lowercase: true
```

@@ -96,18 +102,29 @@ grid-search:

**NOTE**: `tf-idf` and `logreg` are sklearn [**TfidfVectorizer**](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidf#sklearn.feature_extraction.text.TfidfVectorizer) and [**LogisticRegression**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) parameters respectively, so you can parameterize instances of these classes however you want. The same logic applies to `grid-search`, which is sklearn [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) parameterized with [**hyperparams.py**](https://github.com/dayyass/text-classification-baseline/blob/main/hyperparams.py).
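The pass-through described in the note can be sketched as plain keyword-argument unpacking; the dict below stands in for the parsed config.yaml, and this is an illustration of the idea, not the package's actual loading code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in for the parsed config.yaml sections.
config = {
    "tf-idf": {"lowercase": True, "ngram_range": (1, 1), "max_df": 1.0, "min_df": 1},
    "logreg": {"penalty": "l2", "C": 1.0, "class_weight": "balanced",
               "solver": "saga", "n_jobs": -1},
}

# Each section maps one-to-one onto the sklearn constructor's parameters.
vectorizer = TfidfVectorizer(**config["tf-idf"])
classifier = LogisticRegression(**config["logreg"])
print(vectorizer.lowercase, classifier.C, classifier.solver)
```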

-#### Output
+### Output
After training the model, the pipeline produces the following files:
- `model.joblib` - sklearn pipeline with TF-IDF and LogReg steps
- `target_names.json` - mapping from target labels encoded as *0* to *n_classes-1* back to their original names
- `config.yaml` - config that was used to train the model
- `hyperparams.py` - grid-search parameters (if grid-search was used)
- `logging.txt` - logging file
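A downstream consumer of these files might look like the following sketch; the temporary folder and toy training step stand in for a real run's `path_to_save_folder`:

```python
import json
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in for a finished training run: write model.joblib and
# target_names.json the way the output list above describes them.
folder = tempfile.mkdtemp()
pipe = Pipeline([("tf-idf", TfidfVectorizer()), ("logreg", LogisticRegression())])
pipe.fit(["good movie", "bad movie"], [1, 0])
joblib.dump(pipe, os.path.join(folder, "model.joblib"))
with open(os.path.join(folder, "target_names.json"), "w") as f:
    json.dump({"0": "negative", "1": "positive"}, f)

# Inference from the saved artifacts, as a downstream user would do it.
model = joblib.load(os.path.join(folder, "model.joblib"))
with open(os.path.join(folder, "target_names.json")) as f:
    target_names = json.load(f)
pred = model.predict(["good movie"])[0]
print(target_names[str(pred)])
```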

-### Requirements

+### Additional functions
+- `text_clf.token_frequency.get_token_frequency(path_to_config)` - <br> get token frequency of **train dataset** according to the config file parameters
+
+**Only for binary classifiers**:
+- `text_clf.pr_roc_curve.get_precision_recall_curve(path_to_model_folder)` - <br> get *precision* and *recall* metrics for precision-recall curve
+- `text_clf.pr_roc_curve.get_roc_curve(path_to_model_folder)` - <br> get *false positive rate (fpr)* and *true positive rate (tpr)* metrics for roc curve
+- `text_clf.pr_roc_curve.plot_precision_recall_curve(precision, recall)` - <br> plot *precision-recall curve*
+- `text_clf.pr_roc_curve.plot_roc_curve(fpr, tpr)` - <br> plot *roc curve*
+- `text_clf.pr_roc_curve.plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds)` - <br> plot *precision*, *recall*, *f1-score* curves for probability thresholds
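The quantities these helpers expose (*precision*/*recall* and *fpr*/*tpr*) are the standard sklearn curve metrics; the stand-in below computes them directly on toy scores and is not the package's code:

```python
from sklearn.metrics import precision_recall_curve, roc_curve

# Toy binary labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
print(precision.tolist(), recall.tolist())
print(fpr.tolist(), tpr.tolist())
```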

+## Requirements
Python >= 3.6

-### Citation
+## Citation
If you use **text-classification-baseline** in a scientific publication, we would appreciate references to the following BibTeX entry:
```bibtex
@misc{dayyass2021textclf,
6 changes: 6 additions & 0 deletions config.yaml
@@ -1,5 +1,6 @@
seed: 42
path_to_save_folder: models
+experiment_name: model

# data
data:
@@ -9,6 +10,11 @@ data:
  text_column: text
  target_column: target_name_short

+# preprocessing
+# (included in resulting model pipeline, so preserved for inference)
+preprocessing:
+  lemmatization: null # pymorphy2

# tf-idf
tf-idf:
  lowercase: true
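One detail worth noting in the config above: YAML has no tuple literal, so `ngram_range: (1, 1)` is parsed as a string and must be converted before it reaches `TfidfVectorizer`. A conversion like the following is an assumption about how this is handled, not the package's confirmed implementation:

```python
from ast import literal_eval

# As read from the ngram_range field of config.yaml: a plain string.
raw_ngram_range = "(1, 1)"

# literal_eval safely turns the string into the tuple sklearn expects.
ngram_range = literal_eval(raw_ngram_range)
print(ngram_range)
```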
7 changes: 7 additions & 0 deletions requirements.txt
@@ -1,3 +1,10 @@
coverage==6.0.2 # dev
matplotlib>=3.3.4
numpy >= 1.19.5
pandas>=1.1.5
parameterized==0.8.1 # dev
pre-commit==2.15.0 # dev
pymorphy2>=0.9.1
PyYAML>=5.4.1
scikit-learn>=0.24.2
scipy >= 1.5.4
8 changes: 6 additions & 2 deletions setup.cfg
@@ -1,6 +1,6 @@
[metadata]
name = text-classification-baseline
-version = 0.1.4
+version = 0.1.5
author = Dani El-Ayyass
author_email = [email protected]
description = TF-IDF + LogReg baseline for text classification
@@ -18,9 +18,13 @@ classifiers =
packages = find:
python_requires = >=3.6
install_requires =
    numpy >= 1.19.5
    scipy >= 1.5.4
+    pandas >= 1.1.5
-    PyYAML >= 5.4.1
+    scikit-learn >= 0.24.2
+    matplotlib >= 3.3.4
+    pymorphy2 >= 0.9.1
+    PyYAML >= 5.4.1

[options.entry_points]
console_scripts =
35 changes: 35 additions & 0 deletions tests/config/config.yaml
@@ -0,0 +1,35 @@
seed: 42
path_to_save_folder: tests/models

# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: null # pymorphy2

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py
31 changes: 31 additions & 0 deletions tests/config/config_grid_search.yaml
@@ -0,0 +1,31 @@
seed: 42
path_to_save_folder: tests/models
experiment_name: grid_search

# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: true
  grid_search_params_path: tests/hyperparams/hyperparams_for_tests.py
36 changes: 36 additions & 0 deletions tests/config/config_lemmatizer_error.yaml
@@ -0,0 +1,36 @@
seed: 42
path_to_save_folder: tests/models
experiment_name: spacy

# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: spacy

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py
36 changes: 36 additions & 0 deletions tests/config/config_pymorphy2.yaml
@@ -0,0 +1,36 @@
seed: 42
path_to_save_folder: tests/models
experiment_name: pymorphy2

# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: pymorphy2

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py
36 changes: 36 additions & 0 deletions tests/config/config_russian.yaml
@@ -0,0 +1,36 @@
seed: 42
path_to_save_folder: tests/models
experiment_name: russian

# data
data:
  train_data_path: tests/data/russian_language_toxic_comments.csv
  test_data_path: tests/data/russian_language_toxic_comments.csv
  sep: ','
  text_column: comment
  target_column: toxic

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: null # pymorphy2

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py
36 changes: 36 additions & 0 deletions tests/config/config_russian_pymorphy2.yaml
@@ -0,0 +1,36 @@
seed: 42
path_to_save_folder: tests/models
experiment_name: russian_pymorphy2

# data
data:
  train_data_path: tests/data/russian_language_toxic_comments.csv
  test_data_path: tests/data/russian_language_toxic_comments.csv
  sep: ','
  text_column: comment
  target_column: toxic

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: pymorphy2

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py
3 changes: 3 additions & 0 deletions tests/data/README.md
@@ -0,0 +1,3 @@
### Data

[Russian Language Toxic Comments](https://www.kaggle.com/blackmoon/russian-language-toxic-comments)