Merge pull request #82 from dayyass/develop
release v0.1.5
dayyass authored Oct 21, 2021
2 parents c9d6876 + 095c958 commit 3353a53
Showing 28 changed files with 28,775 additions and 172 deletions.
1 change: 1 addition & 0 deletions .coveragerc
@@ -9,6 +9,7 @@ exclude_lines =
    raise AssertionError
    raise NotImplementedError
    if __name__ == .__main__.:
+    ...

omit =
    text_clf/__main__.py
2 changes: 1 addition & 1 deletion Makefile
@@ -17,4 +17,4 @@ pypi_twine:
pypi_clean:
	rm -rf dist text_classification_baseline.egg-info
clean:
-	rm -rf models/model*
+	rm -rf models/*
29 changes: 23 additions & 6 deletions README.md
@@ -12,10 +12,10 @@
[![pypi version](https://img.shields.io/pypi/v/text-classification-baseline)](https://pypi.org/project/text-classification-baseline)
[![pypi downloads](https://img.shields.io/pypi/dm/text-classification-baseline)](https://pypi.org/project/text-classification-baseline)

-### Text Classification Baseline
+## Text Classification Baseline
Pipeline for fast building text classification baselines with **TF-IDF + LogReg**.
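The core recipe can be sketched directly with scikit-learn. The toy data below is illustrative, and this is a stand-in for what the package builds, not its internal code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus standing in for a real csv of (text, target) rows.
train_texts = ["good movie", "bad movie", "great film", "awful film"]
train_labels = ["pos", "neg", "pos", "neg"]

# TF-IDF features feeding a logistic-regression classifier,
# mirroring the config.yaml defaults shown further down.
pipeline = Pipeline([
    ("tf-idf", TfidfVectorizer(lowercase=True, ngram_range=(1, 1))),
    ("logreg", LogisticRegression(C=1.0, class_weight="balanced", solver="saga")),
])
pipeline.fit(train_texts, train_labels)
print(pipeline.predict(["good film"]))
```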

-### Usage
+## Usage
Instead of writing custom code for a specific text classification task, you just need to:
1. install pipeline:
```shell script
pip install text-classification-baseline
```

@@ -41,7 +41,7 @@ No data preparation is needed, only a **csv** file with two raw columns (with ar

The **target** can be presented in any format, including text - not necessarily integers from *0* to *n_classes-1*.
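One way such free-form targets can be mapped to integers from *0* to *n_classes-1* (and back) is sklearn's `LabelEncoder`; the labels below are illustrative, and this only sketches the idea, not the package's internals:

```python
from sklearn.preprocessing import LabelEncoder

# Text targets in arbitrary format, as the README allows.
targets = ["sci.space", "rec.autos", "sci.space", "talk.politics"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(targets)  # sorted classes -> 0..n_classes-1
print(encoded.tolist())        # -> [1, 0, 1, 2]
print(list(encoder.classes_))  # mapping back to the original names
```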

-#### Config
+### Config
The user interface consists of two files:
- [**config.yaml**](https://github.com/dayyass/text-classification-baseline/blob/main/config.yaml) - general configuration with sklearn **TF-IDF** and **LogReg** parameters
- [**hyperparams.py**](https://github.com/dayyass/text-classification-baseline/blob/main/hyperparams.py) - sklearn **GridSearchCV** parameters
@@ -62,6 +62,7 @@ Default **config.yaml**:
```yaml
seed: 42
path_to_save_folder: models
+experiment_name: model

# data
data:
@@ -71,6 +72,11 @@ data:
  text_column: text
  target_column: target_name_short

+# preprocessing
+# (included in resulting model pipeline, so preserved for inference)
+preprocessing:
+  lemmatization: null # pymorphy2

# tf-idf
tf-idf:
  lowercase: true
```

@@ -96,18 +102,29 @@ grid-search:

**NOTE**: `tf-idf` and `logreg` are sklearn [**TfidfVectorizer**](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidf#sklearn.feature_extraction.text.TfidfVectorizer) and [**LogisticRegression**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) parameters respectively, so you can parameterize instances of these classes however you want. The same logic applies to `grid-search`, which is sklearn [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) parameterized with [**hyperparams.py**](https://github.com/dayyass/text-classification-baseline/blob/main/hyperparams.py).
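The pass-through described in the note can be sketched as plain keyword-argument unpacking; the dict below stands in for the parsed config.yaml, and this is an illustration of the idea, not the package's actual loading code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in for the parsed config.yaml sections.
config = {
    "tf-idf": {"lowercase": True, "ngram_range": (1, 1), "max_df": 1.0, "min_df": 1},
    "logreg": {"penalty": "l2", "C": 1.0, "class_weight": "balanced",
               "solver": "saga", "n_jobs": -1},
}

# Each section maps one-to-one onto the sklearn constructor's parameters.
vectorizer = TfidfVectorizer(**config["tf-idf"])
classifier = LogisticRegression(**config["logreg"])
print(vectorizer.lowercase, classifier.C, classifier.solver)
```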

-#### Output
+### Output
After training the model, the pipeline produces the following files:
- `model.joblib` - sklearn pipeline with TF-IDF and LogReg steps
- `target_names.json` - mapping from target labels encoded as *0* to *n_classes-1* back to their original names
- `config.yaml` - config that was used to train the model
- `hyperparams.py` - grid-search parameters (if grid-search was used)
- `logging.txt` - logging file
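A downstream consumer of these files might look like the following sketch; the temporary folder and toy training step stand in for a real run's `path_to_save_folder`:

```python
import json
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in for a finished training run: write model.joblib and
# target_names.json the way the output list above describes them.
folder = tempfile.mkdtemp()
pipe = Pipeline([("tf-idf", TfidfVectorizer()), ("logreg", LogisticRegression())])
pipe.fit(["good movie", "bad movie"], [1, 0])
joblib.dump(pipe, os.path.join(folder, "model.joblib"))
with open(os.path.join(folder, "target_names.json"), "w") as f:
    json.dump({"0": "negative", "1": "positive"}, f)

# Inference from the saved artifacts, as a downstream user would do it.
model = joblib.load(os.path.join(folder, "model.joblib"))
with open(os.path.join(folder, "target_names.json")) as f:
    target_names = json.load(f)
pred = model.predict(["good movie"])[0]
print(target_names[str(pred)])
```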

-### Requirements

+### Additional functions
+- `text_clf.token_frequency.get_token_frequency(path_to_config)` - <br> get token frequency of **train dataset** according to the config file parameters
+
+**Only for binary classifiers**:
+- `text_clf.pr_roc_curve.get_precision_recall_curve(path_to_model_folder)` - <br> get *precision* and *recall* metrics for precision-recall curve
+- `text_clf.pr_roc_curve.get_roc_curve(path_to_model_folder)` - <br> get *false positive rate (fpr)* and *true positive rate (tpr)* metrics for roc curve
+- `text_clf.pr_roc_curve.plot_precision_recall_curve(precision, recall)` - <br> plot *precision-recall curve*
+- `text_clf.pr_roc_curve.plot_roc_curve(fpr, tpr)` - <br> plot *roc curve*
+- `text_clf.pr_roc_curve.plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds)` - <br> plot *precision*, *recall*, *f1-score* curves for probability thresholds
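The quantities these helpers expose (*precision*/*recall* and *fpr*/*tpr*) are the standard sklearn curve metrics; the stand-in below computes them directly on toy scores and is not the package's code:

```python
from sklearn.metrics import precision_recall_curve, roc_curve

# Toy binary labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
print(precision.tolist(), recall.tolist())
print(fpr.tolist(), tpr.tolist())
```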

+## Requirements
Python >= 3.6

-### Citation
+## Citation
If you use **text-classification-baseline** in a scientific publication, we would appreciate references to the following BibTeX entry:
```bibtex
@misc{dayyass2021textclf,
6 changes: 6 additions & 0 deletions config.yaml
@@ -1,5 +1,6 @@
seed: 42
path_to_save_folder: models
+experiment_name: model

# data
data:
@@ -9,6 +10,11 @@ data:
  text_column: text
  target_column: target_name_short

+# preprocessing
+# (included in resulting model pipeline, so preserved for inference)
+preprocessing:
+  lemmatization: null # pymorphy2

# tf-idf
tf-idf:
  lowercase: true
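One detail worth noting in the config above: YAML has no tuple literal, so `ngram_range: (1, 1)` is parsed as a string and must be converted before it reaches `TfidfVectorizer`. A conversion like the following is an assumption about how this is handled, not the package's confirmed implementation:

```python
from ast import literal_eval

# As read from the ngram_range field of config.yaml: a plain string.
raw_ngram_range = "(1, 1)"

# literal_eval safely turns the string into the tuple sklearn expects.
ngram_range = literal_eval(raw_ngram_range)
print(ngram_range)
```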
7 changes: 7 additions & 0 deletions requirements.txt
@@ -1,3 +1,10 @@
coverage==6.0.2 # dev
matplotlib>=3.3.4
numpy >= 1.19.5
pandas>=1.1.5
parameterized==0.8.1 # dev
pre-commit==2.15.0 # dev
pymorphy2>=0.9.1
PyYAML>=5.4.1
scikit-learn>=0.24.2
scipy >= 1.5.4
8 changes: 6 additions & 2 deletions setup.cfg
@@ -1,6 +1,6 @@
[metadata]
name = text-classification-baseline
-version = 0.1.4
+version = 0.1.5
author = Dani El-Ayyass
author_email = [email protected]
description = TF-IDF + LogReg baseline for text classification
@@ -18,9 +18,13 @@ classifiers =
packages = find:
python_requires = >=3.6
install_requires =
    numpy >= 1.19.5
    scipy >= 1.5.4
+    pandas >= 1.1.5
-    PyYAML >= 5.4.1
+    scikit-learn >= 0.24.2
+    matplotlib >= 3.3.4
+    pymorphy2 >= 0.9.1
+    PyYAML >= 5.4.1

[options.entry_points]
console_scripts =
35 changes: 35 additions & 0 deletions tests/config/config.yaml
@@ -0,0 +1,35 @@
seed: 42
path_to_save_folder: tests/models

# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: null # pymorphy2

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py
31 changes: 31 additions & 0 deletions tests/config/config_grid_search.yaml
@@ -0,0 +1,31 @@
seed: 42
path_to_save_folder: tests/models
experiment_name: grid_search

# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: true
  grid_search_params_path: tests/hyperparams/hyperparams_for_tests.py
36 changes: 36 additions & 0 deletions tests/config/config_lemmatizer_error.yaml
@@ -0,0 +1,36 @@
seed: 42
path_to_save_folder: tests/models
experiment_name: spacy

# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: spacy

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py
36 changes: 36 additions & 0 deletions tests/config/config_pymorphy2.yaml
@@ -0,0 +1,36 @@
seed: 42
path_to_save_folder: tests/models
experiment_name: pymorphy2

# data
data:
  train_data_path: data/train.csv
  test_data_path: data/test.csv
  sep: ','
  text_column: text
  target_column: target_name_short

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: pymorphy2

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py
36 changes: 36 additions & 0 deletions tests/config/config_russian.yaml
@@ -0,0 +1,36 @@
seed: 42
path_to_save_folder: tests/models
experiment_name: russian

# data
data:
  train_data_path: tests/data/russian_language_toxic_comments.csv
  test_data_path: tests/data/russian_language_toxic_comments.csv
  sep: ','
  text_column: comment
  target_column: toxic

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: null # pymorphy2

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py
36 changes: 36 additions & 0 deletions tests/config/config_russian_pymorphy2.yaml
@@ -0,0 +1,36 @@
seed: 42
path_to_save_folder: tests/models
experiment_name: russian_pymorphy2

# data
data:
  train_data_path: tests/data/russian_language_toxic_comments.csv
  test_data_path: tests/data/russian_language_toxic_comments.csv
  sep: ','
  text_column: comment
  target_column: toxic

# preprocessing
# (included in resulting model pipeline, so preserved for inference)
preprocessing:
  lemmatization: pymorphy2

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# logreg
logreg:
  penalty: l2
  C: 1.0
  class_weight: balanced
  solver: saga
  n_jobs: -1

# grid-search
grid-search:
  do_grid_search: false
  grid_search_params_path: hyperparams.py
3 changes: 3 additions & 0 deletions tests/data/README.md
@@ -0,0 +1,3 @@
### Data

[Russian Language Toxic Comments](https://www.kaggle.com/blackmoon/russian-language-toxic-comments)