Initial release
Rustam Aliyev committed Jan 17, 2019
1 parent 859623b commit cf82c5a
Showing 5 changed files with 338 additions and 0 deletions.
62 changes: 62 additions & 0 deletions README.md
@@ -0,0 +1,62 @@
# General Audience Content Classifier
General Audience Content Classifier (GACC) is a pre-trained deep neural network model that classifies whether images are suitable for general audiences, with a particular focus on children and adolescents below the age of 12.

Currently, the model detects sexually explicit content. In the future, we plan to extend it to other types of harmful visual content such as violence and horror.

Please note that GACC is not designed to be a general purpose porn classifier. It is deliberately trained to be stricter. We like to think of it as the parent of an 8-year-old, although even that is a very subjective criterion.

To minimise the number of false positives in this particular context, the "benign" training images emphasise the scenes which children come across most often: cartoons, children's movies, toys, games, nurseries, playgrounds, etc.

If you find it useful, please let us know about your use case by filling in the short form here. As a non-profit organisation, it's essential for us to gauge the impact of our work.

## Download
The pre-trained model is available in Keras HDF5 format and can be [downloaded here](https://github.com/purify-ai/gacc/releases).
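
A minimal inference sketch, assuming Keras with the TensorFlow backend (as used for training) and that the downloaded file is saved locally as `PurifyAI_GACC_MobileNetV2_224.h5` (the name used in `training/evaluate.py`; the release asset may be named differently). The helpers `label_from_scores` and `classify_image` are hypothetical and not part of this repository. The class order (index 0 = benign, index 1 = malign) follows the column order used in `training/evaluate.py`.

```python
CLASS_NAMES = ("benign", "malign")  # index order as in training/evaluate.py


def label_from_scores(scores, cutoff=0.5):
    """Map a [benign, malign] softmax pair to a class name.

    The strict comparison matches argmax for two classes at cutoff=0.5.
    """
    return CLASS_NAMES[1] if scores[1] > cutoff else CLASS_NAMES[0]


def classify_image(model, image_path, img_size=224):
    """Score one image with an already loaded GACC model (hypothetical helper)."""
    # Imported lazily so label_from_scores stays usable without Keras.
    # Module path as used in this repository's scripts; newer Keras
    # releases name it keras.applications.mobilenet_v2.
    import numpy as np
    from keras.applications.mobilenetv2 import preprocess_input
    from keras.preprocessing.image import img_to_array, load_img

    img = load_img(image_path, target_size=(img_size, img_size))
    batch = np.expand_dims(img_to_array(img), axis=0)  # shape (1, 224, 224, 3)
    scores = model.predict(preprocess_input(batch))[0]
    return label_from_scores(scores), scores
```

With Keras installed, a call could look like `model = keras.models.load_model("PurifyAI_GACC_MobileNetV2_224.h5")` followed by `classify_image(model, "example.jpg")`; both paths are placeholders.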

## Definitions
The definition of "General Audience" varies depending on the country and type of content. Our definition is influenced by [Television Content Rating systems](https://en.wikipedia.org/wiki/Television_content_rating_system), which are stricter than movie and gaming rating systems. The TV content rating systems of many countries define "General Audience" content as suitable for children under 12.

The precise definition of sexually explicit content is also highly subjective. In this project we consider _any visual material that may cause sexual arousal or fantasy, whether intentional or unintentional_, to be harmful. Framing the problem in that particular way prioritises child safety over objectivity, removes ambiguity and makes the machine learning model more robust.

## Deep Neural Network architecture
This model is based on the lightweight [MobileNetV2](https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html) architecture, which was specifically designed to run on personal mobile devices.

This choice aligns with our vision for on-device child protection systems, which we [described here](https://medium.com/purify-foundation/how-artificial-intelligence-can-help-protect-children-37ce51b75c35).

In the future, we plan to release pre-trained models using other architectures which may provide higher accuracy for those who don't need to run inference on mobile devices (e.g. Inception).

## Data set
The dataset consists of 50,000 images, equally split between two classes: _benign_ and _malign_. In addition, a test dataset of ~3,600 images is used for validation.

| Class | Training Images | Test Images |
| -------- | ------- | ------ |
| Benign | 25,000 | 1,800 |
| Malign | 25,000 | 1,800 |

Images in this dataset were mainly collected online, while some were taken from [Caltech256](https://authors.library.caltech.edu/7694/) and [Porn Database](https://sites.google.com/site/pornographydatabase/).

We do not provide the original dataset due to the nature of the data and the fact that Purify Foundation does not own the copyright of those images. However, in the future, we plan to publish the list of URLs.

## Training Process
Training was performed using Keras with the TensorFlow backend.

Instead of training from scratch, we used a fine-tuning approach. A MobileNetV2 model pre-trained on ImageNet was used as the basis for the GACC model. The top fully connected layer was replaced for binary classification (malign/benign). All layers except the top 30 layers were frozen.
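
The freezing rule can be illustrated without Keras. The sketch below uses a stand-in `Layer` class and an assumed layer count of 100 (both hypothetical, chosen only for the demonstration); the slice `[:-trainable_layers]` is the same one used in the training script.

```python
class Layer:
    """Stand-in for a Keras layer; only the `trainable` flag matters here."""

    def __init__(self, name):
        self.name = name
        self.trainable = True


def freeze_bottom_layers(layers, trainable_layers):
    """Freeze every layer except the last `trainable_layers` ones."""
    for layer in layers[:-trainable_layers]:
        layer.trainable = False
    return layers


# Hypothetical 100-layer network with the top 30 layers left trainable.
network = freeze_bottom_layers([Layer("layer_%d" % i) for i in range(100)], 30)
trainable = [layer.name for layer in network if layer.trainable]
print(len(trainable), trainable[0], trainable[-1])  # 30 layer_70 layer_99
```

Note that with `trainable_layers = 0` the slice `[:-0]` is empty, so this pattern would freeze nothing rather than everything.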

Images were resized and cropped to match the 224x224 input size and augmented to improve accuracy.
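
As a rough illustration: the scripts pass images through Keras' MobileNetV2 `preprocess_input`, which rescales 8-bit pixel values to the range [-1, 1]. The sketch below reproduces that arithmetic in plain Python and adds a centre-crop box computation. The crop strategy is an assumption (this README does not say how images were cropped), and both function names are hypothetical.

```python
def scale_pixel(value):
    """Map an 8-bit channel value (0..255) to the range [-1, 1]."""
    return value / 127.5 - 1.0


def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centred square.

    Assumed strategy: crop to a square first, then resize to 224x224.
    """
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


print(scale_pixel(0), scale_pixel(127.5), scale_pixel(255))  # -1.0 0.0 1.0
print(center_crop_box(640, 480))  # (80, 0, 560, 480)
```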

Hyperparameters and other details can be found in the source code of the training script.

## Results
The GACC model achieved more than 95% accuracy on the test dataset. The confusion matrix below gives a more granular view of the results (`cutoff=0.5`).

![alt text](assets/gacc-cm.png?raw=true "GACC Results Confusion Matrix")
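
To make the role of the cutoff concrete, here is a small sketch of how a confusion matrix and accuracy can be derived from per-image malign scores. The function names and the six scores are made up for illustration; they are not GACC outputs, and the evaluation script itself uses `sklearn.metrics.confusion_matrix`.

```python
def confusion_matrix_at_cutoff(true_labels, malign_scores, cutoff=0.5):
    """Return [[TN, FP], [FN, TP]] with malign (1) as the positive class."""
    matrix = [[0, 0], [0, 0]]
    for truth, score in zip(true_labels, malign_scores):
        predicted = 1 if score > cutoff else 0
        matrix[truth][predicted] += 1  # rows: true label, columns: prediction
    return matrix


def accuracy(matrix):
    """Share of images on the diagonal (correct predictions)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up scores for six images: 0 = benign, 1 = malign.
truth = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.40, 0.70, 0.90, 0.60, 0.30]

cm_default = confusion_matrix_at_cutoff(truth, scores)             # cutoff=0.5
cm_strict = confusion_matrix_at_cutoff(truth, scores, cutoff=0.2)  # stricter
print(cm_default, accuracy(cm_default))
print(cm_strict, accuracy(cm_strict))
```

Lowering the cutoff flags more images as malign: in this toy data it removes the missed malign image at the cost of an extra benign image being flagged.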

## How does it compare to Yahoo! OpenNSFW?
For comparison, we also ran our test data through the OpenNSFW model with `cutoff=0.5`. As can be seen in the confusion matrix below, the OpenNSFW model is less strict with malign images.

![alt text](assets/opennsfw-cm.png?raw=true "OpenNSFW Results Confusion Matrix")

## Disclaimer
This project is currently in the early development stage. We do not provide guarantees of output accuracy.

## License
Models and source code are licensed under [Apache License 2.0](LICENSE)
Binary file added assets/gacc-cm.png
Binary file added assets/opennsfw-cm.png
134 changes: 134 additions & 0 deletions training/evaluate.py
@@ -0,0 +1,134 @@
# Copyright 2019 Purify Foundation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Evaluate GACC trained model

from keras import layers, models, callbacks, backend
from keras.optimizers import Adam
from keras.applications.mobilenetv2 import MobileNetV2, preprocess_input
from keras.preprocessing.image import ImageDataGenerator
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix

### Hyperparameters
output_classes = 2
learning_rate = 4e-5 #0.00004
img_size = 224 # img width and height
batch_size = 128
epochs = 90
resume_model = False
trainable_layers = 30 # number of trainable layers at the top of the model; all other bottom layers will be frozen

train_dir = "../training_data/train"
test_dir = "../training_data/validate"

output_name = "PurifyAI_GACC_MobileNetV2_{dim_img}_lr{lr}bs{bs}ep{ep}tl{tl}".format(dim_img=img_size, lr=learning_rate, bs=batch_size, ep=epochs, tl=trainable_layers)

def draw_confusion_matrix(confusion_matrix, class_names, figsize = (10,7), fontsize=14):
    """Prints a confusion matrix, as returned by sklearn.metrics.confusion_matrix, as a heatmap.
    Arguments
    ---------
    confusion_matrix: numpy.ndarray
        The numpy.ndarray object returned from a call to sklearn.metrics.confusion_matrix.
        Similarly constructed ndarrays can also be used.
    class_names: list
        An ordered list of class names, in the order they index the given confusion matrix.
    figsize: tuple
        A 2-long tuple, the first value determining the horizontal size of the outputted figure,
        the second determining the vertical size. Defaults to (10,7).
    fontsize: int
        Font size for axes labels. Defaults to 14.
    Returns
    -------
    matplotlib.figure.Figure
        The resulting confusion matrix figure
    """
    df_cm = pd.DataFrame(
        confusion_matrix, index=class_names, columns=class_names,
    )
    fig = plt.figure(figsize=figsize)
    try:
        heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
    except ValueError:
        raise ValueError("Confusion matrix values must be integers.")

    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    # Return the figure, as promised by the docstring.
    return fig


#%%
### Load model
model = models.load_model('../models/PurifyAI_GACC_MobileNetV2_224.h5')

#%%
### Load test images
img_generator = ImageDataGenerator(preprocessing_function=preprocess_input)

test_img_generator = img_generator.flow_from_directory(
    test_dir,
    target_size = (img_size, img_size),
    class_mode = 'categorical',
    batch_size= batch_size,
    interpolation = 'lanczos',
    shuffle = False)

class_names = list(test_img_generator.class_indices.keys())
print("""Class names: {}""".format(class_names))

steps_test = test_img_generator.n // batch_size
test_classes = test_img_generator.classes[:steps_test*batch_size]
print("""Steps on test: {}""".format(steps_test))

#%%
### Evaluate model accuracy and loss
test_img_generator.reset()
results = model.evaluate_generator(test_img_generator, steps_test, workers=4)
print("Loss: ", "{0:.4f}".format(results[0]), "Accuracy: ", "{0:.4f}".format(results[1]))

#%%
### Produce confusion matrix
test_img_generator.reset()
predictions = model.predict_generator(test_img_generator, steps_test, workers=4)

# Convert the predicted classes from arrays to integers.
predicted_class_indices = np.argmax(predictions, axis=1)

# Get the confusion matrix using sklearn.
cfmx = confusion_matrix(y_true=test_classes,  # True class for test-set.
                        y_pred=predicted_class_indices)  # Predicted class.

draw_confusion_matrix(cfmx, class_names, (4,3))

#%%
### Detailed list of all filenames, predictions and scores
labels = (test_img_generator.class_indices)
labels = dict((v,k) for k,v in labels.items())
predictions_labeled = [labels[k] for k in predicted_class_indices]

pd.set_option('display.max_rows', None)
cdf = pd.DataFrame({"Filename": test_img_generator.filenames[:len(test_classes)],
                    "Prediction": predictions_labeled,
                    "Benign": ["{0:.4f}".format(i[0]) for i in predictions],
                    "Malign": ["{0:.4f}".format(i[1]) for i in predictions]})

output_file = '../data/'+'{name}.csv'.format(name=output_name)
cdf.to_csv(output_file, encoding='utf-8', index=False)
142 changes: 142 additions & 0 deletions training/train.py
@@ -0,0 +1,142 @@
# Copyright 2019 Purify Foundation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Fine-tuning ImageNet trained MobileNetV2 with GACC dataset

#%%
from keras import layers, models, callbacks, backend
from keras.optimizers import Adam
from keras.applications.mobilenetv2 import MobileNetV2, preprocess_input
from keras.preprocessing.image import ImageDataGenerator
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import datetime

### Hyperparameters
output_classes = 2
learning_rate = 4e-5 #0.00004
img_size = 224 # img width and height
batch_size = 128
epochs = 60
trainable_layers = 30 # number of trainable layers at the top of the model; all other bottom layers will be frozen

train_dir = "../training_data/train"
test_dir = "../training_data/validate"

output_name = "PurifyAI_GACC_MobileNetV2_{dim_img}_lr{lr}bs{bs}ep{ep}tl{tl}".format(dim_img=img_size, lr=learning_rate, bs=batch_size, ep=epochs, tl=trainable_layers)
tensorboard_logs = "./tb_logs/"

print('Available GPUs:', backend.tensorflow_backend._get_available_gpus())
print('TensorBoard events:', tensorboard_logs)

#%%
### Prepare images for training
img_generator = ImageDataGenerator(
    rotation_range=23,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    preprocessing_function=preprocess_input)

print('Train image generator')
train_img_generator = img_generator.flow_from_directory(
    train_dir,
    target_size = (img_size, img_size),
    batch_size = batch_size,
    class_mode = 'categorical',
    interpolation = 'lanczos',
    shuffle = True)

print('Test image generator')
test_img_generator = img_generator.flow_from_directory(
    test_dir,
    target_size = (img_size, img_size),
    batch_size= batch_size,
    class_mode = 'categorical',
    interpolation = 'lanczos',
    shuffle = False)

train_classes = train_img_generator.classes
test_classes = test_img_generator.classes

class_names = list(train_img_generator.class_indices.keys())
print("""Class names: {}""".format(class_names))

steps_train = train_img_generator.n // batch_size
print("""Steps on train: {}""".format(steps_train))

steps_test = test_img_generator.n // batch_size
print("""Steps on test: {}""".format(steps_test))

#%%
### Calculate class weights for balancing
#counter = Counter(train_classes)
#max_val = float(max(counter.values()))
#class_weights = {class_id : max_val/num_images for class_id, num_images in counter.items()}
#print("Class weights:", class_weights)

#%%
### Build and compile the model
def build_model():
    """ Build new model through the following steps:
        1. Load ImageNet trained MobileNetV2 without fully-connected layer at the top of the network
        2. Freeze all layers except the top M layers. Top M layers will be trainable.
        3. Add final fully-connected (Dense) layer
    """
    input_tensor = layers.Input(shape=(img_size, img_size, 3))
    base_model = MobileNetV2(
        include_top=False,
        weights='imagenet',
        input_tensor=input_tensor,
        input_shape=(img_size, img_size, 3),
        pooling='avg'
    )

    # Only top M layers are trainable
    for layer in base_model.layers[:-trainable_layers]:
        layer.trainable = False

    output_tensor = layers.Dense(output_classes, activation='softmax')(base_model.output)
    model = models.Model(inputs=input_tensor, outputs=output_tensor)

    return model

model = build_model()
#model.summary()

model.compile(optimizer=Adam(lr=learning_rate),
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

#%%
### Train model
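# Note: early_stop is defined here but is not passed to fit_generator below,
# so early stopping is inactive unless it is added to the callbacks list.
early_stop = callbacks.EarlyStopping(monitor = 'val_loss', min_delta=0.01, patience=10)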
tensorboard = callbacks.TensorBoard(log_dir=tensorboard_logs)

checkpoint_file = '../models/' + output_name + "_{epoch:02d}_{val_loss:.2f}.h5"
checkpointer = callbacks.ModelCheckpoint(filepath=checkpoint_file, verbose=1, save_best_only=True)

model.fit_generator(train_img_generator,
                    steps_per_epoch = steps_train,
                    epochs = epochs,
                    validation_data = test_img_generator,
                    validation_steps = steps_test,
                    #class_weight = class_weights,
                    callbacks=[tensorboard, checkpointer])

#%%
### Save final model
output_file = '../models/'+'{name}_final.h5'.format(name=output_name)
model.save(output_file)
