TF-IDF Baseline Classifiers
| Classifier | # of Features | N-Grams | Accuracy |
| --- | --- | --- | --- |
| XGBoost | 10000 | (1, 1) | 0.5616 |
| Random Forest | 5000 | (1, 1) | 0.5386 |
| Logistic Regression | 5000 | (1, 1) | 0.5824 |
| Naive Bayes | 5000 | (1, 2) | 0.4700 |
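These baselines are not shown in code; a minimal sketch of how the strongest one (logistic regression, 5000 features, unigrams) could be reproduced with scikit-learn, assuming train_texts/train_labels and test_texts/test_labels hold the raw documents and their five class labels (hypothetical names):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF features capped at 5000 unigrams, fed to logistic regression.
baseline = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 1)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)
print(baseline.score(test_texts, test_labels))  # reported accuracy: 0.5824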
Neural Networks
| Model | Vocabulary Size | Sequence Length | Accuracy |
| --- | --- | --- | --- |
| ANN | 5000 | 250 | 0.6717 |
| LSTM | 5000 | 250 | 0.7120 |
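Both architectures below reference vectorize_layer, max_features, and sequence_length, which are defined outside the code excerpts. A plausible setup, assuming train_texts holds the raw training strings (hypothetical name):

import tensorflow as tf

max_features = 5000      # vocabulary size
sequence_length = 250    # pad/truncate every document to 250 tokens

# Maps raw strings to padded sequences of integer token ids. The Embedding
# layers below use max_features + 2 input dimensions, presumably headroom
# for the padding and OOV ids.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length,
)
vectorize_layer.adapt(train_texts)  # build the vocabulary from training text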
ANN: 5000 Vocab Size, 250 Sequence Length
Initial Model:
Architecture:
embedding_dim = 64
ANN_Model = tf.keras.Sequential([
    vectorize_layer,  # raw text -> padded integer sequences
    tf.keras.layers.Embedding(input_dim=max_features + 2, output_dim=embedding_dim, input_length=sequence_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(32, activation='sigmoid'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(5, activation='softmax'),  # one unit per class
])
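The compile/fit calls are not shown; given the five-way softmax output, training was presumably along these lines (train_ds/val_ds are assumed tf.data dataset names, and sparse_categorical_crossentropy assumes integer labels):

ANN_Model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  # integer class labels
    metrics=['accuracy'],
)
history = ANN_Model.fit(train_ds, validation_data=val_ds, epochs=20)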
Epochs: 20
Epoch 1/20
1124/1124 [==============================] - 27s 23ms/step - loss: 1.6131 - accuracy: 0.2572 - val_loss: 1.5710 - val_accuracy: 0.2758
Epoch 2/20
1124/1124 [==============================] - 34s 30ms/step - loss: 1.4006 - accuracy: 0.3691 - val_loss: 1.0783 - val_accuracy: 0.5938
...
Epoch 19/20
1124/1124 [==============================] - 31s 28ms/step - loss: 0.1632 - accuracy: 0.9581 - val_loss: 2.1774 - val_accuracy: 0.6160
Epoch 20/20
1124/1124 [==============================] - 30s 26ms/step - loss: 0.1512 - accuracy: 0.9611 - val_loss: 2.0996 - val_accuracy: 0.6093
Accuracy Plot:
Loss Plot:
Test Loss: 2.2156 Test Accuracy: 0.6069
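The test metrics were presumably obtained by evaluating on a held-out split (test_ds is an assumed name):

test_loss, test_acc = ANN_Model.evaluate(test_ds)  # -> 2.2156, 0.6069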
Summary: The loss and accuracy plots show clear overfitting: validation accuracy plateaus while training accuracy keeps climbing, and validation loss bottoms out and then rises while training loss keeps falling. The final model attempts to fix this with a smaller network, L2 regularization, and early stopping.
Final Model:
Architecture:
from tensorflow.keras.regularizers import l2

# Stop training one epoch after validation loss stops improving.
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)
embedding_dim = 16
ANN_Model_Final = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(input_dim=max_features + 2, output_dim=embedding_dim, input_length=sequence_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(16, activation='relu', kernel_regularizer=l2(l2=0.01)),  # L2 penalty against overfitting
    tf.keras.layers.Dense(5, activation='softmax'),
])
Epochs: 9 (early stopping halted training at epoch 9 of a 50-epoch maximum)
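With the callback defined above, the fit call presumably caps training at 50 epochs and lets patience=1 end it once val_loss stops improving (train_ds/val_ds are assumed names):

# Stops one epoch after val_loss fails to improve, here at epoch 9 of 50.
history = ANN_Model_Final.fit(
    train_ds,
    validation_data=val_ds,
    epochs=50,
    callbacks=[callback],
)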
Epoch 1/50
1124/1124 [==============================] - 7s 5ms/step - loss: 1.4487 - accuracy: 0.3688 - val_loss: 1.2911 - val_accuracy: 0.4538
Epoch 2/50
1124/1124 [==============================] - 6s 6ms/step - loss: 1.1887 - accuracy: 0.5440 - val_loss: 1.1458 - val_accuracy: 0.5944
...
Epoch 8/50
1124/1124 [==============================] - 8s 7ms/step - loss: 0.8352 - accuracy: 0.7403 - val_loss: 1.0029 - val_accuracy: 0.6511
Epoch 9/50
1124/1124 [==============================] - 7s 6ms/step - loss: 0.7973 - accuracy: 0.7585 - val_loss: 0.9971 - val_accuracy: 0.6627
Accuracy Plot:
Loss Plot:
Test Loss: 0.9996 Test Accuracy: 0.6807
Confusion Matrix:
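The matrix image is not reproduced here; one way it could have been computed, assuming test_ds yields (text, label) batches:

import numpy as np
from sklearn.metrics import confusion_matrix

# True labels and argmax predictions over the test set.
y_true = np.concatenate([labels.numpy() for _, labels in test_ds])
y_pred = np.argmax(ANN_Model_Final.predict(test_ds), axis=1)
print(confusion_matrix(y_true, y_pred))  # 5x5 count matrix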
LSTM: 5000 Vocab Size, 250 Sequence Length
Initial Model:
Architecture:
embedding_dim = 16
LSTM_Model = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(input_dim=max_features + 2, output_dim=embedding_dim, input_length=sequence_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50)),  # 50 units per direction
    tf.keras.layers.Dense(5, activation='softmax')
])
Epochs: 10
Epoch 1/10
1124/1124 [==============================] - 274s 238ms/step - loss: 1.2051 - accuracy: 0.4932 - val_loss: 0.9076 - val_accuracy: 0.6516
Epoch 2/10
1124/1124 [==============================] - 242s 215ms/step - loss: 0.7858 - accuracy: 0.7133 - val_loss: 0.7986 - val_accuracy: 0.7124
...
Epoch 9/10
1124/1124 [==============================] - 215s 191ms/step - loss: 0.4754 - accuracy: 0.8427 - val_loss: 0.9166 - val_accuracy: 0.7044
Epoch 10/10
1124/1124 [==============================] - 213s 190ms/step - loss: 0.4442 - accuracy: 0.8554 - val_loss: 0.9620 - val_accuracy: 0.7049
Accuracy Plot:
Loss Plot:
Test Loss: 0.9745 Test Accuracy: 0.6922
Summary: The loss and accuracy plots show overfitting setting in early: validation accuracy plateaus while training accuracy keeps climbing, and validation loss bottoms out and then rises while training loss keeps falling. The final model attempts to fix this with dropout and early stopping.
Final Model:
Architecture:
# Stop training two epochs after validation loss stops improving.
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)
embedding_dim = 16
# An l1_l2 kernel regularizer was also tried but is not used in the final model.
LSTM_Model_Final = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(input_dim=max_features + 2, output_dim=embedding_dim, input_length=sequence_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50)),
    tf.keras.layers.Dropout(0.2),  # dropout added to reduce overfitting
    tf.keras.layers.Dense(5, activation='softmax')
])
Epochs: 5 (early stopping halted training at epoch 5 of a 50-epoch maximum)
Epoch 1/50
1124/1124 [==============================] - 276s 239ms/step - loss: 1.2163 - accuracy: 0.4829 - val_loss: 0.9061 - val_accuracy: 0.6611
Epoch 2/50
1124/1124 [==============================] - 194s 173ms/step - loss: 0.7986 - accuracy: 0.7093 - val_loss: 0.7975 - val_accuracy: 0.7151
...
Epoch 4/50
1124/1124 [==============================] - 186s 166ms/step - loss: 0.6694 - accuracy: 0.7746 - val_loss: 0.7817 - val_accuracy: 0.7289
Epoch 5/50
1124/1124 [==============================] - 189s 168ms/step - loss: 0.6369 - accuracy: 0.7828 - val_loss: 0.8004 - val_accuracy: 0.7167
Accuracy Plot:
Loss Plot:
Test Loss: 0.8066 Test Accuracy: 0.7073
Confusion Matrix: