Initial release

purify-ai · Jan 17, 2019 · cf82c5a
1 parent 859623b
commit cf82c5a

Showing 5 changed files with 338 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,62 @@
+# General Audience Content Classifier
+General Audience Content Classifier (GACC) is a pre-trained deep neural network model that classifies whether images are suitable for general audiences, with a particular focus on children and adolescents below the age of 12.
+
+Currently, the model detects sexually explicit content. In the future, we plan to extend it to other types of harmful visual content such as violence and horror.
+
+Please note that GACC is not designed to be a general-purpose porn classifier. It is deliberately trained to be stricter. We like to think of it as the parent of an 8-year-old, although even that is a very subjective criterion.
+
+To minimise the number of false positives in this particular context, the "benign" training images emphasise the scenes that children come across most often: cartoons, children's movies, toys, games, nurseries, playgrounds, etc.
+
+If you find it useful, please let us know about your use case by filling in a short form here. As a non-profit organisation, we find it essential to gauge the impact of our work.
+
+## Download
+The pre-trained model is available in Keras HDF5 format and can be [downloaded here](https://github.com/purify-ai/gacc/releases).
+
+## Definitions
+The definition of "General Audience" varies depending on the country and type of content. Our definition is influenced by [Television Content Rating systems](https://en.wikipedia.org/wiki/Television_content_rating_system), which are stricter than movie and gaming rating systems. The TV content rating systems of many countries define "General Audience" content as suitable for children under 12.
+
+The precise definition of sexually explicit content is also highly subjective. In this project we consider _any visual material that may cause sexual arousal or fantasy, whether intentional or unintentional_, to be harmful. Framing the problem this way prioritises child safety over objectivity, removes ambiguity and makes the machine learning model more robust.
+
+## Deep Neural Network architecture
+This model is based on the lightweight [MobileNetV2](https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html) architecture, which was specifically designed to run on personal mobile devices.
+
+This choice aligns with our vision for on-device child protection systems, which we [described here](https://medium.com/purify-foundation/how-artificial-intelligence-can-help-protect-children-37ce51b75c35).
+
+In the future, we plan to release pre-trained models using other architectures which may provide higher accuracy for those who don't need to run inference on mobile devices (e.g. Inception).
+
+## Data set
+The dataset consists of 50,000 images, equally split between two classes: _benign_ and _malign_. In addition, a test dataset of ~3,600 images was used for validation.
+
+| Class    | Training Images | Test Images |
+| -------- | ------- | ------ |
+| Benign   | 25,000  | 1,800  |
+| Malign   | 25,000  | 1,800  |
+
+Images in this dataset were mainly collected online, while some were taken from [Caltech256](https://authors.library.caltech.edu/7694/) and [Porn Database](https://sites.google.com/site/pornographydatabase/).
+
+We do not provide the original dataset due to the nature of the data and the fact that Purify Foundation does not own the copyright of those images. However, in the future, we plan to publish the list of URLs.
+
+## Training Process
+Training was performed using Keras with the TensorFlow backend.
+
+Instead of training from scratch, we used a fine-tuning approach. A MobileNetV2 model pre-trained on ImageNet was used as the basis for the GACC model. The top fully connected layer was replaced with one for binary classification (malign/benign). All layers except the top 30 were frozen.
+
+Images were resized and cropped to match 224x224 input size and augmented to improve accuracy.
+
+Hyperparameters and other details can be found in the source code of the training script.
+
+## Results
+The GACC model achieved more than 95% accuracy on the test dataset. The confusion matrix below gives a more granular view of the results (`cutoff=0.5`).
+
+![alt text](assets/gacc-cm.png?raw=true "GACC Results Confusion Matrix")
+
+## How does it compare to Yahoo! OpenNSFW?
+For comparison, we also ran our test data through the OpenNSFW model with `cutoff=0.5`. As can be seen in the confusion matrix below, the OpenNSFW model is less strict with malign images.
+
+![alt text](assets/opennsfw-cm.png?raw=true "OpenNSFW Results Confusion Matrix")
+
+## Disclaimer
+This project is currently at an early stage of development. We do not provide guarantees of output accuracy.
+
+## License
+Models and source code are licensed under [Apache License 2.0](LICENSE)
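
A note on the README's Results section: the reported confusion matrices use `cutoff=0.5`, and `training/evaluate.py` picks the class with `np.argmax`, which for a two-class softmax output amounts to the same rule except at exact ties. The sketch below, in plain Python with made-up scores rather than real model output, shows how a cutoff on the malign score turns into a 2x2 confusion matrix and an accuracy figure. The helper names (`classify`, `confusion_counts`, `accuracy`) are hypothetical and not part of the repository.

```python
from typing import List, Sequence


def classify(malign_scores: Sequence[float], cutoff: float = 0.5) -> List[int]:
    """Return 1 (malign) where the score reaches the cutoff, else 0 (benign)."""
    return [1 if score >= cutoff else 0 for score in malign_scores]


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> List[List[int]]:
    """Build a 2x2 matrix indexed [true][predicted], with 0 = benign, 1 = malign."""
    matrix = [[0, 0], [0, 0]]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Illustrative values only: invented malign scores and ground-truth labels.
example_scores = [0.02, 0.40, 0.55, 0.97, 0.61, 0.08]
example_truth = [0, 0, 0, 1, 1, 1]

example_matrix = confusion_counts(example_truth, classify(example_scores, cutoff=0.5))
print(example_matrix)            # [[2, 1], [1, 2]]
print(accuracy(example_matrix))  # 4 correct out of 6
```

Lowering `cutoff` flags more images as malign, which trades extra false positives for fewer missed malign images.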
diff --git a/assets/gacc-cm.png b/assets/gacc-cm.png
diff --git a/assets/opennsfw-cm.png b/assets/opennsfw-cm.png
diff --git a/training/evaluate.py b/training/evaluate.py
@@ -0,0 +1,134 @@
+#    Copyright 2019 Purify Foundation
+#
+#    Licensed under the Apache License, Version 2.0 (the "License");
+#    you may not use this file except in compliance with the License.
+#    You may obtain a copy of the License at
+#
+#        http://www.apache.org/licenses/LICENSE-2.0
+#
+#    Unless required by applicable law or agreed to in writing, software
+#    distributed under the License is distributed on an "AS IS" BASIS,
+#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#    See the License for the specific language governing permissions and
+#    limitations under the License.
+
+# Evaluate GACC trained model
+
+from keras import layers, models, callbacks, backend
+from keras.optimizers import Adam
+from keras.applications.mobilenetv2 import MobileNetV2, preprocess_input
+from keras.preprocessing.image import ImageDataGenerator
+from collections import Counter
+import numpy as np
+import matplotlib.pyplot as plt
+import pandas as pd
+import seaborn as sns
+from sklearn.metrics import confusion_matrix
+
+### Hyperparameters
+output_classes = 2
+learning_rate = 4e-5  #0.00004
+img_size = 224 # img width and height
+batch_size = 128
+epochs = 90
+resume_model = False
+trainable_layers = 30 # number of trainable layers at the top of the model; all other bottom layers will be frozen
+
+train_dir = "../training_data/train"
+test_dir  = "../training_data/validate"
+
+output_name = "PurifyAI_GACC_MobileNetV2_{dim_img}_lr{lr}bs{bs}ep{ep}tl{tl}".format(dim_img=img_size, lr=learning_rate, bs=batch_size, ep=epochs, tl=trainable_layers)
+
+def draw_confusion_matrix(confusion_matrix, class_names, figsize = (10,7), fontsize=14):
+    """Prints a confusion matrix, as returned by sklearn.metrics.confusion_matrix, as a heatmap.
+    
+    Arguments
+    ---------
+    confusion_matrix: numpy.ndarray
+        The numpy.ndarray object returned from a call to sklearn.metrics.confusion_matrix. 
+        Similarly constructed ndarrays can also be used.
+    class_names: list
+        An ordered list of class names, in the order they index the given confusion matrix.
+    figsize: tuple
+        A 2-long tuple, the first value determining the horizontal size of the outputted figure,
+        the second determining the vertical size. Defaults to (10,7).
+    fontsize: int
+        Font size for axes labels. Defaults to 14.
+        
+    Returns
+    -------
+    matplotlib.figure.Figure
+        The resulting confusion matrix figure
+    """
+    df_cm = pd.DataFrame(
+        confusion_matrix, index=class_names, columns=class_names, 
+    )
+    fig = plt.figure(figsize=figsize)
+    try:
+        heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
+    except ValueError:
+        raise ValueError("Confusion matrix values must be integers.")
+
+    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
+    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize)
+    plt.ylabel('True label')
+    plt.xlabel('Predicted label')
+    return fig
+
+
+#%%
+### Load model
+model = models.load_model('../models/PurifyAI_GACC_MobileNetV2_224.h5')
+
+#%%
+### Load test images
+img_generator = ImageDataGenerator(preprocessing_function=preprocess_input)
+
+test_img_generator = img_generator.flow_from_directory(
+                        test_dir,
+                        target_size = (img_size, img_size),
+                        class_mode = 'categorical',
+                        batch_size= batch_size,
+                        interpolation = 'lanczos',
+                        shuffle = False)
+
+class_names = list(test_img_generator.class_indices.keys())
+print("""Class names: {}""".format(class_names))
+
+steps_test = test_img_generator.n // batch_size
+test_classes = test_img_generator.classes[:steps_test*batch_size]
+print("""Steps on test: {}""".format(steps_test))
+
+#%%
+### Evaluate model accuracy and loss
+test_img_generator.reset()
+results = model.evaluate_generator(test_img_generator, steps_test, workers=4)
+print("Loss: ", "{0:.4f}".format(results[0]), "Accuracy: ", "{0:.4f}".format(results[1]))
+
+#%%
+### Produce confusion matrix
+test_img_generator.reset()
+predictions = model.predict_generator(test_img_generator, steps_test, workers=4)
+
+# Convert the predicted classes from arrays to integers.
+predicted_class_indices = np.argmax(predictions, axis=1)
+
+# Get the confusion matrix using sklearn.
+cfmx = confusion_matrix(y_true=test_classes,  # True class for test-set.
+                        y_pred=predicted_class_indices)  # Predicted class.
+
+draw_confusion_matrix(cfmx, class_names, (4,3))
+
+#%%
+### Detailed list of all filenames, predictions and scores
+labels = (test_img_generator.class_indices)
+labels = dict((v,k) for k,v in labels.items())
+predictions_labeled = [labels[k] for k in predicted_class_indices]
+
+pd.set_option('display.max_rows', None)
+cdf = pd.DataFrame({"Filename": test_img_generator.filenames[:len(test_classes)],
+              "Prediction": predictions_labeled,
+              "Benign": ["{0:.4f}".format(i[0]) for i in predictions],
+              "Malign": ["{0:.4f}".format(i[1]) for i in predictions]})
+
+output_file = '../data/'+'{name}.csv'.format(name=output_name)
+cdf.to_csv(output_file, encoding='utf-8', index=False)
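
A note on `training/evaluate.py`: `steps_test = test_img_generator.n // batch_size` rounds down, so any images left over after the last full batch are not evaluated, and `test_classes` is truncated to match. The arithmetic below is a minimal sketch using the README's table figure of 3,600 test images (the README also describes the count as approximate) and `batch_size = 128`; `steps_and_coverage` is a hypothetical helper, not part of the repository.

```python
import math
from typing import Tuple


def steps_and_coverage(n_samples: int, batch_size: int) -> Tuple[int, int, int]:
    """Return (steps when rounding down, steps when rounding up, samples skipped).

    "Samples skipped" counts the images that fall outside the full batches
    when the number of steps is rounded down.
    """
    floor_steps = n_samples // batch_size
    ceil_steps = math.ceil(n_samples / batch_size)
    skipped = n_samples - floor_steps * batch_size
    return floor_steps, ceil_steps, skipped


floor_steps, ceil_steps, skipped = steps_and_coverage(3600, 128)
print(floor_steps, ceil_steps, skipped)  # 28 29 16
```

Rounding up instead would be one way to cover the whole test set, assuming the generator yields a final partial batch; that assumption is worth checking against the installed Keras version.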
diff --git a/training/train.py b/training/train.py
@@ -0,0 +1,142 @@
+#    Copyright 2019 Purify Foundation
+#
+#    Licensed under the Apache License, Version 2.0 (the "License");
+#    you may not use this file except in compliance with the License.
+#    You may obtain a copy of the License at
+#
+#        http://www.apache.org/licenses/LICENSE-2.0
+#
+#    Unless required by applicable law or agreed to in writing, software
+#    distributed under the License is distributed on an "AS IS" BASIS,
+#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#    See the License for the specific language governing permissions and
+#    limitations under the License.
+
+# Fine-tuning ImageNet trained MobileNetV2 with GACC dataset
+
+#%%
+from keras import layers, models, callbacks, backend
+from keras.optimizers import Adam
+from keras.applications.mobilenetv2 import MobileNetV2, preprocess_input
+from keras.preprocessing.image import ImageDataGenerator
+from collections import Counter
+import numpy as np
+import matplotlib.pyplot as plt
+import datetime
+
+### Hyperparameters
+output_classes = 2
+learning_rate = 4e-5  #0.00004
+img_size = 224 # img width and height
+batch_size = 128
+epochs = 60
+trainable_layers = 30 # number of trainable layers at the top of the model; all other bottom layers will be frozen
+
+train_dir = "../training_data/train"
+test_dir  = "../training_data/validate"
+
+output_name = "PurifyAI_GACC_MobileNetV2_{dim_img}_lr{lr}bs{bs}ep{ep}tl{tl}".format(dim_img=img_size, lr=learning_rate, bs=batch_size, ep=epochs, tl=trainable_layers)
+tensorboard_logs = "./tb_logs/"
+
+print('Available GPUs:', backend.tensorflow_backend._get_available_gpus())
+print('TensorBoard events:', tensorboard_logs)
+
+#%%
+### Prepare images for training
+img_generator = ImageDataGenerator(
+    rotation_range=23,
+    width_shift_range=0.2,
+    height_shift_range=0.2,
+    horizontal_flip=True,
+    preprocessing_function=preprocess_input)
+
+print('Train image generator')
+train_img_generator = img_generator.flow_from_directory(
+                        train_dir,
+                        target_size = (img_size, img_size),
+                        batch_size = batch_size,
+                        class_mode = 'categorical',
+                        interpolation = 'lanczos',
+                        shuffle = True)
+
+print('Test image generator')
+test_img_generator = img_generator.flow_from_directory(
+                        test_dir,
+                        target_size = (img_size, img_size),
+                        batch_size= batch_size,
+                        class_mode = 'categorical',
+                        interpolation = 'lanczos',
+                        shuffle = False)
+
+train_classes = train_img_generator.classes
+test_classes = test_img_generator.classes
+
+class_names = list(train_img_generator.class_indices.keys())
+print("""Class names: {}""".format(class_names))
+
+steps_train = train_img_generator.n // batch_size
+print("""Steps on train: {}""".format(steps_train))
+
+steps_test = test_img_generator.n // batch_size
+print("""Steps on test: {}""".format(steps_test))
+
+#%%
+### Calculate class weights for balancing
+#counter = Counter(train_classes)
+#max_val = float(max(counter.values()))
+#class_weights = {class_id : max_val/num_images for class_id, num_images in counter.items()}
+#print("Class weights:", class_weights)
+
+#%%
+### Build and compile the model
+def build_model():
+    """ Build new model through the following steps:
+        1. Load ImageNet trained MobileNetV2 without fully-connected layer at the top of the network
+        2. Freeze all layers except the top M layers. Top M layers will be trainable.
+        3. Add final fully-connected (Dense) layer
+    """
+    input_tensor = layers.Input(shape=(img_size, img_size, 3))
+    base_model = MobileNetV2(
+        include_top=False,
+        weights='imagenet',
+        input_tensor=input_tensor,
+        input_shape=(img_size, img_size, 3),
+        pooling='avg'
+    )
+
+    # Only top M layers are trainable
+    for layer in base_model.layers[:-trainable_layers]:
+        layer.trainable = False
+
+    output_tensor = layers.Dense(output_classes, activation='softmax')(base_model.output)
+    model = models.Model(inputs=input_tensor, outputs=output_tensor)
+
+    return model
+
+model = build_model()
+#model.summary()
+
+model.compile(optimizer=Adam(lr=learning_rate),
+              loss='categorical_crossentropy',
+              metrics=['categorical_accuracy'])
+
+#%% 
+### Train model
+early_stop  = callbacks.EarlyStopping(monitor = 'val_loss', min_delta=0.01, patience=10)
+tensorboard = callbacks.TensorBoard(log_dir=tensorboard_logs)
+
+checkpoint_file = '../models/' + output_name + "_{epoch:02d}_{val_loss:.2f}.h5"
+checkpointer = callbacks.ModelCheckpoint(filepath=checkpoint_file, verbose=1, save_best_only=True)
+
+model.fit_generator(train_img_generator,
+             steps_per_epoch = steps_train,
+             epochs = epochs,
+             validation_data = test_img_generator,
+             validation_steps = steps_test,
+             #class_weight = class_weights,
+             callbacks=[tensorboard, checkpointer])
+
+#%%
+### Save final model
+output_file = '../models/'+'{name}_final.h5'.format(name=output_name)
+model.save(output_file)
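
A note on `training/train.py`: the script creates `early_stop` with `monitor='val_loss'`, `min_delta=0.01` and `patience=10`, but the `callbacks` list passed to `fit_generator` contains only `tensorboard` and `checkpointer`, so training runs for the full number of epochs. The sketch below illustrates, in plain Python, the general patience rule such a callback applies. It is a simplified illustration of the idea, not Keras's implementation, and `stopping_epoch` is a hypothetical helper.

```python
from typing import Optional, Sequence


def stopping_epoch(val_losses: Sequence[float],
                   min_delta: float = 0.01,
                   patience: int = 10) -> Optional[int]:
    """Return the 0-based epoch at which training would stop, or None.

    A loss counts as an improvement only if it beats the best loss seen
    so far by more than min_delta. Training stops once `patience`
    consecutive epochs pass without such an improvement.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Illustrative values only: the loss plateaus after the second epoch.
plateau = [1.0, 0.5, 0.495, 0.494, 0.493]
print(stopping_epoch(plateau, min_delta=0.01, patience=2))  # 3
```

With `min_delta=0.01`, small gains such as 0.5 to 0.495 do not reset the counter, which is why the example stops even though the loss is still falling slightly.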