EDA
Exploratory data analysis is an important part of the data science process. It verifies that the data looks the way it should and is distributed as expected, and it surfaces potential issues that could lead to poorly performing or incorrect models.
Missing Values
The columns used for classification are ‘Sentiment’ and ‘OriginalTweet’. There are no missing values in either of those columns, so nothing more needs to be done.
RangeIndex: 44955 entries, 0 to 44954
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   UserName       44955 non-null  int64
 1   ScreenName     44955 non-null  int64
 2   Location       35531 non-null  object
 3   TweetAt        44955 non-null  object
 4   OriginalTweet  44955 non-null  object
 5   Sentiment      44955 non-null  object
Label Balance
Ideally, with 5 classes each class would make up 20% of the dataset. In this case some classes deviate slightly, but each is close enough to 20% that training the model shouldn’t be an issue. To be safe, the F1 score could be used for evaluation, or class weights could be applied during training, as sketched below.
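As an illustration only, here is a minimal sketch of how such class weights could be computed with scikit-learn; it assumes the tweets_df DataFrame loaded in the EDA code further down and is not part of the original EDA code.
# Illustrative sketch (not part of the original EDA code): balanced class weights
# computed from the Sentiment column, which could be passed to a model during training.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

labels = tweets_df['Sentiment']
classes = np.unique(labels)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=labels)
class_weights = dict(zip(classes, weights))  # weights near 1.0 reflect the roughly balanced classes
print(class_weights)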
Common Words By Class
The following plots show some of the most common words across all tweets in each class, after the tweets have undergone pre-processing.
Positive Tweets
Extremely Positive Tweets
Negative Tweets
Extremely Negative Tweets
Neutral Tweets
These plots reveal that many of the most common words are shared across classes. This suggests that additional pre-processing may be needed to remove the noise common to all classes in order to build a better classification model. Using TF-IDF could also help, since it reduces the weight of words that appear frequently across all classes. However, a closer look would likely show that ‘positive’ and ‘extremely positive’ tweets share much of their vocabulary, since words used in a positive way can also be used in an extremely positive way, and the same holds for ‘negative’ and ‘extremely negative’. TF-IDF most likely wouldn’t capture those subtle differences effectively; a neural network with embeddings would be required instead. Additional pre-processing to address this shared-vocabulary issue won’t be done initially in this project, but it would be worth exploring in the future to potentially boost model performance. The toy example below illustrates how TF-IDF down-weights words shared across documents.
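A small illustration of the down-weighting idea, using made-up sentences rather than project data:
# Toy example: a word that appears in every document ('store') receives a lower IDF
# than a word that appears in only one document ('panic').
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'grocery store prices rising',
    'grocery store panic buying',
    'grocery store open today',
]
vec = TfidfVectorizer().fit(docs)
idf = dict(zip(vec.get_feature_names_out(), vec.idf_))
print(idf['store'], idf['panic'])  # 'store' has the lower idf, so it contributes less to the TF-IDF features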
The code for the EDA and the csv file containing the tweets can be found below.
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 18 14:18:23 2023
@author: casey
"""
## LOAD LIBRARIES
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Import all we need from nltk
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import re
# ---------------------------------------------------------------------------------------- #
## LOAD DATA
tweets_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/Data_Science/Projects/Movie_Review_Sentiment_Analysis/data/Corona_tweets.csv', encoding='latin1')
# ---------------------------------------------------------------------------------------- #
## EDA STARTS HERE
## CHECK FOR MISSING VALUES
tweets_df.info() # there are no missing values in the Sentiment or OriginalTweet columns which is what we care about
## LOOK AT LABEL BALANCE
tweets_df['Sentiment'].value_counts()
# percent of labels equal to 'Positive'
positive_percent = round(tweets_df['Sentiment'].value_counts()['Positive'] / len(tweets_df.index) * 100, 2)
# percent of labels equal to 'Negative'
negative_percent = round(tweets_df['Sentiment'].value_counts()['Negative'] / len(tweets_df.index) * 100, 2)
# percent of labels equal to 'Neutral'
neutral_percent = round(tweets_df['Sentiment'].value_counts()['Neutral'] / len(tweets_df.index) * 100, 2)
# percent of labels equal to 'Extremely Positive'
extremely_positive_percent = round(tweets_df['Sentiment'].value_counts()['Extremely Positive'] / len(tweets_df.index) * 100, 2)
# percent of labels equal to 'Extremely Negative'
extremely_negative_percent = round(tweets_df['Sentiment'].value_counts()['Extremely Negative'] / len(tweets_df.index) * 100, 2)
sns.countplot(data=tweets_df, x='Sentiment')
plt.ylim(0, 26000)
plt.text(-0.34, 6700, str(extremely_negative_percent) + '%', size='medium', color='black', weight='semibold')
plt.text(0.65, 13000, str(positive_percent) + '%', size='medium', color='black', weight='semibold')
plt.text(1.7, 7600, str(extremely_positive_percent) + '%', size='medium', color='black', weight='semibold')
plt.text(2.7, 11500, str(negative_percent) + '%', size='medium', color='black', weight='semibold')
plt.text(3.7, 9000, str(neutral_percent) + '%', size='medium', color='black', weight='semibold')
plt.title('Label Balance')
plt.xticks(rotation=45)
plt.show()
## COMMON WORDS PLOTS
# Normalize First
## CONVERT TO LOWERCASE
def to_lowercase(text):
    text = text.lower()
    return text
#df['review'] = df['review'].apply(to_lowercase)
# ------------------------------------------------------------------------------------------------ #
## REMOVE NUMBERS
def remove_numbers(text):
    text = re.sub(r'\d+', '', text)
    return text
#df['review'] = df['review'].apply(remove_numbers)
# ------------------------------------------------------------------------------------------------ #
## REMOVE PUNCTUATION
def remove_punctuations(text):
    return text.translate(str.maketrans('', '', string.punctuation))
#df['review'] = df['review'].apply(remove_punctuations)
# ------------------------------------------------------------------------------------------------ #
## REMOVE SPECIAL CHARACTERS
def remove_special_chars(text):
    return re.sub('[^a-zA-Z]', ' ', text)
#df['review'] = df['review'].apply(remove_special_chars)
# ------------------------------------------------------------------------------------------------ #
## REMOVE UNNECESSARY WHITE SPACE
def remove_whitespace(text):
    return " ".join(text.split())
#df['review'] = df['review'].apply(remove_whitespace)
# ------------------------------------------------------------------------------------------------ #
## REMOVE STOPWORDS
# create list of your own words to also remove
my_stopwords = ['br', 'b']
def remove_stopwords(text):
    new_list = []
    words = word_tokenize(text)
    stopwrds = stopwords.words('english') + my_stopwords
    for word in words:
        if word not in stopwrds:
            new_list.append(word)
    return ' '.join(new_list)
#df['review'] = df['review'].apply(remove_stopwords)
# ------------------------------------------------------------------------------------------------ #
## LEMMATIZATION
# usually preferred over stemming
# considers context (word part of speech)
# caring -> care
#lem_df = df.copy()
# Part of speech tagger function
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
# Instantiate lemmatizer
lemmatizer = WordNetLemmatizer()
def lemmatize_word(text):
    pos_tagged = nltk.pos_tag(nltk.word_tokenize(text))
    wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged))
    lemmatized_review = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # if there is no available tag, append the token as is
            lemmatized_review.append(word)
        else:
            # else use the tag to lemmatize the token
            lemmatized_review.append(lemmatizer.lemmatize(word, tag))
    lemmatized_review = " ".join(lemmatized_review)
    return lemmatized_review
#lem_df['review'] = lem_df['review'].apply(lemmatize_word)
# Twitter Specific Cleaning
def remove_hyperlinks(text):
    return re.sub(r'https?://\S+', ' ', text)
def remove_hashtag_symbol(text):
    return re.sub(r'#', ' ', text)
def remove_retweet_text(text):
    return re.sub(r'^RT[\s]+', ' ', text)
# ---------------------------------------------------------------------------------------------------------------------------- #
## CUSTOM NORMALIZATION FUNCTION
# choose which preprocessing functions to use
# Twitter-specific cleaning runs first so hyperlinks, the retweet marker, and hashtag
# symbols are still intact when they are removed; lemmatization runs last
def custom_normalization(reviews):
    reviews = reviews.apply(remove_hyperlinks)
    reviews = reviews.apply(remove_retweet_text)
    reviews = reviews.apply(remove_hashtag_symbol)
    reviews = reviews.apply(to_lowercase)
    reviews = reviews.apply(remove_numbers)
    reviews = reviews.apply(remove_punctuations)
    reviews = reviews.apply(remove_special_chars)
    reviews = reviews.apply(remove_stopwords)
    reviews = reviews.apply(lemmatize_word)
    return reviews
tweets_df['OriginalTweet'] = custom_normalization(tweets_df['OriginalTweet'])
# ---------------------------------------------------------------------------------------------------------------------------- #
## GET WORD FREQUENCIES
num_words = 30
pos_freq = tweets_df.iloc[:,1:][tweets_df.Sentiment == 'Positive'].OriginalTweet.str.split(expand=True).stack().value_counts()[0:num_words]
ex_pos_freq = tweets_df.iloc[:,1:][tweets_df.Sentiment == 'Extremely Positive'].OriginalTweet.str.split(expand=True).stack().value_counts()[0:num_words]
neg_freq = tweets_df.iloc[:,1:][tweets_df.Sentiment == 'Negative'].OriginalTweet.str.split(expand=True).stack().value_counts()[0:num_words]
ex_neg_freq = tweets_df.iloc[:,1:][tweets_df.Sentiment == 'Extremely Negative'].OriginalTweet.str.split(expand=True).stack().value_counts()[0:num_words]
neu_freq = tweets_df.iloc[:,1:][tweets_df.Sentiment == 'Neutral'].OriginalTweet.str.split(expand=True).stack().value_counts()[0:num_words]
# ---------------------------------------------------------------------------------------------------------------------------- #
## GET WORDS
pos_cols = pos_freq.index
ex_pos_cols = ex_pos_freq.index
neg_cols = neg_freq.index
ex_neg_cols = ex_neg_freq.index
neu_cols = neu_freq.index
# ---------------------------------------------------------------------------------------------------------------------------- #
## GENERATE PLOTS
sns.barplot(x=pos_cols, y=pos_freq)
plt.xticks(rotation=90)
plt.title('Top ' + str(num_words) + ' Words Among Positive Tweets')
plt.show()
sns.barplot(x=ex_pos_cols, y=ex_pos_freq)
plt.xticks(rotation=90)
plt.title('Top ' + str(num_words) + ' Words Among Extremely Positive Tweets')
plt.show()
sns.barplot(x=neg_cols, y=neg_freq)
plt.xticks(rotation=90)
plt.title('Top ' + str(num_words) + ' Words Among Negative Tweets')
plt.show()
sns.barplot(x=ex_neg_cols, y=ex_neg_freq)
plt.xticks(rotation=90)
plt.title('Top ' + str(num_words) + ' Words Among Extremely Negative Tweets')
plt.show()
sns.barplot(x=neu_cols, y=neu_freq)
plt.xticks(rotation=90)
plt.title('Top ' + str(num_words) + ' Words Among Neutral Tweets')
plt.show()
Data Prep
TF-IDF
The data set contains 44955 tweets about the pandemic and their corresponding labels. The labels are categories representing the sentiment of each tweet, and there are five of them: positive, negative, neutral, extremely positive, and extremely negative. Preparing the data for machine learning models using TF-IDF requires four main steps: encoding the labels, normalizing the text, splitting the data into train and test sets, and creating the TF-IDF matrices to be passed to the model. The labels were encoded as follows: {0: Extremely Negative, 1: Extremely Positive, 2: Negative, 3: Neutral, 4: Positive}. The text of each tweet was normalized using typical normalization techniques, plus a couple of extra techniques required specifically for tweets: converting the text to lowercase, removing numbers, removing punctuation, removing stop words, removing special characters, removing hyperlinks, removing the hashtag symbol, removing the retweet text, and lemmatization. The data set was split into train and test sets, with the train set containing 80% of the data and the test set 20%. Finally, the train and test sets were transformed into TF-IDF matrices. It’s important to note that the TF-IDF vectorizer was fit on the train set only; the test set was transformed without refitting. This prevents information from the test set from leaking into training.
The code for the data prep is done in the same file as the modeling for ease of use and can be found on the Modeling page under the TF-IDF section.
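For quick reference, a minimal sketch of the four steps is shown below. It is illustrative only: the variable names, the random_state value, and the use of custom_normalization (the function from the EDA code above) are assumptions rather than the exact code used.
# Illustrative sketch of the TF-IDF data prep (the actual code is on the Modeling page).
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Encode the labels (alphabetical encoding yields the mapping listed above)
le = LabelEncoder()
y = le.fit_transform(tweets_df['Sentiment'])

# 2. Normalize the text (custom_normalization is the function from the EDA code)
X = custom_normalization(tweets_df['OriginalTweet'])

# 3. Split into 80% train / 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Fit the vectorizer on the train set only, then transform both sets
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)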
Neural Networks
The data set contains 44955 tweets about the pandemic and their corresponding labels. The labels are categories representing the sentiment of each tweet, and there are five of them: positive, negative, neutral, extremely positive, and extremely negative. In this case learned embeddings are used instead of pre-trained embeddings. Preparing the data for neural networks in Keras requires eight main steps: encoding the labels, normalizing the text, splitting the data into train, validation, and test sets, defining a vocabulary, defining a sequence length, encoding the text, padding the text, and generating embeddings. The labels were encoded as follows: {0: Extremely Negative, 1: Extremely Positive, 2: Negative, 3: Neutral, 4: Positive}. The text of each tweet was normalized using typical normalization techniques, plus a couple of extra techniques required specifically for tweets: converting the text to lowercase, removing numbers, removing punctuation, removing stop words, removing special characters, removing hyperlinks, removing the hashtag symbol, removing the retweet text, and lemmatization. The data set was split into train, validation, and test sets containing 80%, 10%, and 10% of the data respectively. The sequence length and vocabulary size are parameters that must be defined; there are no fixed values, since they generally depend on the task and corpus size, so different values are often tested to find the optimal ones. Once the vocabulary size is chosen, the vocabulary is usually built by taking the unique words that occur most frequently in the corpus (all of the tweets) until the vocabulary size is reached. It’s important to generate the vocabulary from the train set only; otherwise information from the validation and test sets leaks into training. Transforming the data against the vocabulary (replacing words not in the vocabulary with a special [OOV] token), padding the data (adding special [PAD] tokens to the end of each tweet until it reaches the defined sequence length), and encoding the data (representing each unique word in the vocabulary, as well as the [OOV] and [PAD] tokens, with a number) are all done using the TextVectorization layer from Keras. Finally, embeddings are generated using a defined embedding dimension and the Embedding layer from Keras. The embedding dimension is also a parameter that must be tuned to find the optimal value.
The code for the data prep is done in the same file as the modeling for ease of use and can be found on the Modeling page under the Neural Networks section.
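For quick reference, a minimal sketch of the Keras-specific steps is shown below. It is illustrative only: the vocabulary size, sequence length, and embedding dimension are placeholder values rather than the tuned ones used in the project, and the X_train/X_val/X_test names are assumed to hold the normalized tweet text from the 80/10/10 split described above.
# Illustrative sketch of the Keras data prep (the actual code and tuned values are on the Modeling page).
from tensorflow.keras.layers import TextVectorization, Embedding

vocab_size = 10000    # placeholder values; the optimal settings depend on the corpus and must be tuned
seq_length = 50
embedding_dim = 100

# Build the vocabulary from the TRAIN split only, then integer-encode each tweet,
# map out-of-vocabulary words to the OOV token, and pad/truncate to seq_length.
vectorize_layer = TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=seq_length,
)
vectorize_layer.adapt(X_train)          # X_train: normalized tweet strings from the train split

X_train_ids = vectorize_layer(X_train)  # padded, integer-encoded sequences
X_val_ids = vectorize_layer(X_val)
X_test_ids = vectorize_layer(X_test)

# The Embedding layer (typically the first layer of the network) learns a dense
# embedding_dim-sized vector for each token id during training.
embedding_layer = Embedding(input_dim=vocab_size, output_dim=embedding_dim)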