TF-IDF Baseline Classifiers
| Classifier | # of Features | N-Grams | Accuracy |
| --- | --- | --- | --- |
| XGBoost | 10000 | (1, 1) | 0.5616 |
| Random Forest | 5000 | (1, 1) | 0.5386 |
| Logistic Regression | 5000 | (1, 1) | 0.5824 |
| Naive Bayes | 5000 | (1, 2) | 0.4700 |
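These baselines are not shown in code; a minimal sketch of how the strongest one (logistic regression, 5000 features, unigrams) could be reproduced with scikit-learn, assuming train_texts/train_labels and test_texts/test_labels hold the raw documents and their five class labels (hypothetical names):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF features capped at 5000 unigrams, fed to logistic regression.
baseline = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 1)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)
print(baseline.score(test_texts, test_labels))  # reported accuracy: 0.5824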
Neural Networks
| Model | Vocabulary Size | Sequence Length | Accuracy |
| --- | --- | --- | --- |
| ANN | 5000 | 250 | 0.6717 |
| LSTM | 5000 | 250 | 0.7120 |
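Both architectures below reference vectorize_layer, max_features, and sequence_length, which are defined outside the code excerpts. A plausible setup, assuming train_texts holds the raw training strings (hypothetical name):

import tensorflow as tf

max_features = 5000      # vocabulary size
sequence_length = 250    # pad/truncate every document to 250 tokens

# Maps raw strings to padded sequences of integer token ids. The Embedding
# layers below use max_features + 2 input dimensions, presumably headroom
# for the padding and OOV ids.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length,
)
vectorize_layer.adapt(train_texts)  # build the vocabulary from training text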
ANN: 5000 Vocab Size, 250 Sequence Length
Initial Model:
Architecture:
embedding_dim = 64
ANN_Model = tf.keras.Sequential([
    vectorize_layer,  # raw text -> padded integer sequences
    tf.keras.layers.Embedding(input_dim=max_features + 2, output_dim=embedding_dim, input_length=sequence_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(32, activation='sigmoid'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(5, activation='softmax'),  # one unit per class
])
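The compile/fit calls are not shown; given the five-way softmax output, training was presumably along these lines (train_ds/val_ds are assumed tf.data dataset names, and sparse_categorical_crossentropy assumes integer labels):

ANN_Model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  # integer class labels
    metrics=['accuracy'],
)
history = ANN_Model.fit(train_ds, validation_data=val_ds, epochs=20)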
Epochs: 20
Epoch 1/20
1124/1124 [==============================] - 27s 23ms/step - loss: 1.6131 - accuracy: 0.2572 - val_loss: 1.5710 - val_accuracy: 0.2758
Epoch 2/20
1124/1124 [==============================] - 34s 30ms/step - loss: 1.4006 - accuracy: 0.3691 - val_loss: 1.0783 - val_accuracy: 0.5938
...
Epoch 19/20
1124/1124 [==============================] - 31s 28ms/step - loss: 0.1632 - accuracy: 0.9581 - val_loss: 2.1774 - val_accuracy: 0.6160
Epoch 20/20
1124/1124 [==============================] - 30s 26ms/step - loss: 0.1512 - accuracy: 0.9611 - val_loss: 2.0996 - val_accuracy: 0.6093
Accuracy Plot:
Loss Plot:
Test Loss: 2.2156 Test Accuracy: 0.6069
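The test metrics were presumably obtained by evaluating on a held-out split (test_ds is an assumed name):

test_loss, test_acc = ANN_Model.evaluate(test_ds)  # -> 2.2156, 0.6069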
Summary: The loss and accuracy plots show clear overfitting: validation accuracy plateaus while training accuracy keeps climbing, and validation loss bottoms out and then rises while training loss keeps falling. The final model attempts to fix this with a smaller network, L2 regularization, and early stopping.
Final Model:
Architecture:
from tensorflow.keras.regularizers import l2

# Stop training one epoch after validation loss stops improving.
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)
embedding_dim = 16
ANN_Model_Final = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(input_dim=max_features + 2, output_dim=embedding_dim, input_length=sequence_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(16, activation='relu', kernel_regularizer=l2(l2=0.01)),  # L2 penalty against overfitting
    tf.keras.layers.Dense(5, activation='softmax'),
])
Epochs: 9 (early stopping halted training at epoch 9 of a 50-epoch maximum)
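With the callback defined above, the fit call presumably caps training at 50 epochs and lets patience=1 end it once val_loss stops improving (train_ds/val_ds are assumed names):

# Stops one epoch after val_loss fails to improve, here at epoch 9 of 50.
history = ANN_Model_Final.fit(
    train_ds,
    validation_data=val_ds,
    epochs=50,
    callbacks=[callback],
)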
Epoch 1/50
1124/1124 [==============================] - 7s 5ms/step - loss: 1.4487 - accuracy: 0.3688 - val_loss: 1.2911 - val_accuracy: 0.4538
Epoch 2/50
1124/1124 [==============================] - 6s 6ms/step - loss: 1.1887 - accuracy: 0.5440 - val_loss: 1.1458 - val_accuracy: 0.5944
...
Epoch 8/50
1124/1124 [==============================] - 8s 7ms/step - loss: 0.8352 - accuracy: 0.7403 - val_loss: 1.0029 - val_accuracy: 0.6511
Epoch 9/50
1124/1124 [==============================] - 7s 6ms/step - loss: 0.7973 - accuracy: 0.7585 - val_loss: 0.9971 - val_accuracy: 0.6627
Accuracy Plot:
Loss Plot:
Test Loss: 0.9996 Test Accuracy: 0.6807
Confusion Matrix:
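The matrix image is not reproduced here; one way it could have been computed, assuming test_ds yields (text, label) batches:

import numpy as np
from sklearn.metrics import confusion_matrix

# True labels and argmax predictions over the test set.
y_true = np.concatenate([labels.numpy() for _, labels in test_ds])
y_pred = np.argmax(ANN_Model_Final.predict(test_ds), axis=1)
print(confusion_matrix(y_true, y_pred))  # 5x5 count matrix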
LSTM: 5000 Vocab Size, 250 Sequence Length
Initial Model:
Architecture:
embedding_dim = 16
LSTM_Model = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(input_dim=max_features + 2, output_dim=embedding_dim, input_length=sequence_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50)),  # 50 units per direction
    tf.keras.layers.Dense(5, activation='softmax')
])
Epochs: 10
Epoch 1/10
1124/1124 [==============================] - 274s 238ms/step - loss: 1.2051 - accuracy: 0.4932 - val_loss: 0.9076 - val_accuracy: 0.6516
Epoch 2/10
1124/1124 [==============================] - 242s 215ms/step - loss: 0.7858 - accuracy: 0.7133 - val_loss: 0.7986 - val_accuracy: 0.7124
...
Epoch 9/10
1124/1124 [==============================] - 215s 191ms/step - loss: 0.4754 - accuracy: 0.8427 - val_loss: 0.9166 - val_accuracy: 0.7044
Epoch 10/10
1124/1124 [==============================] - 213s 190ms/step - loss: 0.4442 - accuracy: 0.8554 - val_loss: 0.9620 - val_accuracy: 0.7049
Accuracy Plot:
Loss Plot:
Test Loss: 0.9745 Test Accuracy: 0.6922
Summary: The loss and accuracy plots show overfitting setting in early: validation accuracy plateaus while training accuracy keeps climbing, and validation loss bottoms out and then rises while training loss keeps falling. The final model attempts to fix this with dropout and early stopping.
Final Model:
Architecture:
# Stop training two epochs after validation loss stops improving.
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)
embedding_dim = 16
# An l1_l2 kernel regularizer was also tried but is not used in the final model.
LSTM_Model_Final = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(input_dim=max_features + 2, output_dim=embedding_dim, input_length=sequence_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50)),
    tf.keras.layers.Dropout(0.2),  # dropout added to reduce overfitting
    tf.keras.layers.Dense(5, activation='softmax')
])
Epochs: 5 (early stopping halted training at epoch 5 of a 50-epoch maximum)
Epoch 1/50
1124/1124 [==============================] - 276s 239ms/step - loss: 1.2163 - accuracy: 0.4829 - val_loss: 0.9061 - val_accuracy: 0.6611
Epoch 2/50
1124/1124 [==============================] - 194s 173ms/step - loss: 0.7986 - accuracy: 0.7093 - val_loss: 0.7975 - val_accuracy: 0.7151
...
Epoch 4/50
1124/1124 [==============================] - 186s 166ms/step - loss: 0.6694 - accuracy: 0.7746 - val_loss: 0.7817 - val_accuracy: 0.7289
Epoch 5/50
1124/1124 [==============================] - 189s 168ms/step - loss: 0.6369 - accuracy: 0.7828 - val_loss: 0.8004 - val_accuracy: 0.7167
Accuracy Plot:
Loss Plot:
Test Loss: 0.8066 Test Accuracy: 0.7073
Confusion Matrix: