Exploratory data analysis is an important part of the data science process. It checks that the data looks and is distributed as expected. It also allows for the discovery of potential issues with the data that may lead to poorly performing or incorrect models.
First, the label distribution will be explored. Both Naive Bayes and Support Vector Machines are supervised classification methods. This means that the data needs labels, and the models are easier to train and evaluate fairly when those labels are roughly balanced.
There are two labels, True_News and Fake_News, and they are approximately balanced. This is good and means no additional action, such as random sampling, needs to be taken.
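As a rough illustration of the balance check described above, the sketch below computes each label's share of a dataset. It uses only the standard library and a made-up list of labels (`toy_labels` and `label_percentages` are hypothetical names, not part of the project code), so the numbers are not the real dataset's.

```python
from collections import Counter


def label_percentages(labels):
    """Return each label's share of the dataset as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(count / total * 100, 2) for label, count in counts.items()}


# Hypothetical toy labels, NOT the real dataset.
toy_labels = ["True_News"] * 48 + ["Fake_News"] * 52
shares = label_percentages(toy_labels)
print(shares)
```

With pandas, `all_news_df['Label'].value_counts(normalize=True)` gives the same proportions directly.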
Next, the news subject distribution will be explored.
There are three dominant categories: politics, world, and news in general. The smaller categories could easily be considered subcategories of politics or world and included in them. For example, Government_News and left-news could be considered politics, while US_News and Middle-east could be considered world. The fact that so many articles carry only the general news category is not ideal, since that category provides little information. Knowing what the common news subjects are could be useful for a couple of reasons. First, it could reveal what types of words or terms might be commonly found in the articles, which is useful for NLP. Second, the articles could potentially be grouped by subject and a model built for each group. This might reveal that articles on some subjects are easier to classify as true or fake than others.
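The grouping suggested above could be sketched as a simple lookup table. This is only an illustration: the mapping follows the examples in the text, the subject spellings are assumed to match the values in the subject column, and `SUBJECT_GROUPS`, `merge_subject`, and `toy_subjects` are hypothetical names.

```python
from collections import Counter

# Mapping taken from the grouping suggested in the text; the exact spellings
# are assumptions and may differ from the values in the 'subject' column.
SUBJECT_GROUPS = {
    "Government_News": "politics",
    "left-news": "politics",
    "US_News": "world",
    "Middle-east": "world",
}


def merge_subject(subject):
    """Map a minor subject to its dominant category; leave others unchanged."""
    return SUBJECT_GROUPS.get(subject, subject)


# Hypothetical toy subjects, NOT the real dataset.
toy_subjects = ["politics", "left-news", "Government_News",
                "world", "US_News", "Middle-east", "news"]
merged_counts = Counter(merge_subject(s) for s in toy_subjects)
print(merged_counts)
```

On the real data, the same mapping could be applied with `all_news_df['subject'].replace(SUBJECT_GROUPS)` before plotting the counts again.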
Next, the common words among all articles will be explored.
This plot shows the top 20 words found across all of the news articles in the dataset, and it reveals some definite problems. Words like said, would, could, and also show up. These words probably won’t carry much signal for determining whether an article is true or fake, so they should be removed before modeling. The other major problem is the word reuters, which is the name of a publisher. Normally this wouldn’t be a major issue. However, since all of the true articles came from Reuters, keeping this word would let the model identify true articles by their publisher rather than by their content. The project aims to determine whether a news article is true or fake based on the text it contains, not on who published it. So, while keeping the word reuters would most likely lead to a model that performs extremely well on this dataset, that model wouldn’t perform as well when making predictions about future news articles from other publishers.
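One way to picture the clean-up described above is to filter the flagged words out before counting. The sketch below is a minimal, library-free illustration on made-up sentences; `WORDS_TO_DROP`, `top_words`, and `toy_texts` are hypothetical names, and a real stop-word list would be much longer than the five words named in the text.

```python
from collections import Counter

# Only the words flagged in the text; a real stop-word list would be longer.
WORDS_TO_DROP = {"said", "would", "could", "also", "reuters"}


def top_words(texts, n=20, drop=WORDS_TO_DROP):
    """Count lowercase whitespace-separated words, skipping those in `drop`."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            if word not in drop:
                counts[word] += 1
    return counts.most_common(n)


# Hypothetical toy articles, NOT the real dataset.
toy_texts = [
    "Reuters said the senate would vote",
    "the senate could also vote today",
]
top3 = top_words(toy_texts, n=3)
print(top3)
```

Since all_news_cv is already a table of word counts, the equivalent step there would be dropping the matching columns, for example `all_news_cv_df.drop(columns=list(WORDS_TO_DROP), errors='ignore')`.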
Finally, the top words from the true and fake articles will be explored.
These two plots show common words from the true and fake articles. There seem to be many similarities between the two. For example, the words trump, president, and state show up in both plots. No words stand out visually as being correlated with true or fake articles. However, less common words may still correlate with an article being true or fake. This should be revealed during modeling.
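Before modeling, one rough way to look for words that separate the two classes is to compare each word's relative frequency in true versus fake articles. The sketch below is an illustrative assumption, not the method used in this project, and it runs on made-up sentences; `relative_freq`, `most_distinctive`, and the toy texts are hypothetical.

```python
from collections import Counter


def relative_freq(texts):
    """Return each word's share of all words in `texts`."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def most_distinctive(true_texts, fake_texts, n=5):
    """Rank words by |true share - fake share|; positive means more 'true'."""
    true_f = relative_freq(true_texts)
    fake_f = relative_freq(fake_texts)
    vocab = set(true_f) | set(fake_f)
    diffs = {w: true_f.get(w, 0.0) - fake_f.get(w, 0.0) for w in vocab}
    # Sort by size of the difference, breaking ties alphabetically.
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:n]


# Hypothetical toy articles, NOT the real dataset.
toy_true = ["trump president state officials", "trump state minister"]
toy_fake = ["trump president state video", "trump president video"]
ranked = most_distinctive(toy_true, toy_fake, n=5)
print(ranked)
```

In this toy example trump is equally common in both classes, so it drops out of the ranking, which mirrors the observation that the shared top words are not the ones that distinguish true from fake articles.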
The Python file (EDA) that produces the exploratory data analysis, along with the .csv files (All_News and all_news_cv) needed to run it, can be downloaded below.
The Python code can be found below.
# -*- coding: utf-8 -*-
"""
Created on Thu Mar 23 14:41:10 2023
@author: casey
"""
## LOAD LIBRARIES
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# ---------------------------------------------------------------------------------------- #
## LOAD DATA
all_news_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Data_Mining/Course_Project/data/All_News.csv')
all_news_cv_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Data_Mining/Course_Project/data/all_news_cv.csv')
# ---------------------------------------------------------------------------------------- #
## CHECK FOR MISSING VALUES
all_news_df.info()
all_news_cv_df.info()
# ---------------------------------------------------------------------------------------- #
## EDA STARTS HERE
## LOOK AT LABEL BALANCE
# percent of labels equal to 'Fake' (value_counts() sorts by count, so position 0 is the most frequent label)
fake_percent = round(all_news_df['Label'].value_counts().iloc[0] / len(all_news_df.index) * 100, 2)
# percent of labels equal to 'True' (position 1 is the less frequent label)
true_percent = round(all_news_df['Label'].value_counts().iloc[1] / len(all_news_df.index) * 100, 2)
sns.countplot(data=all_news_df, x='Label')
plt.ylim(0, 26000)
plt.text(-0.11, 22000, str(true_percent) + '%', size='medium', color='black', weight='semibold')
plt.text(0.89, 24000, str(fake_percent) + '%', size='medium', color='black', weight='semibold')
plt.title('Label Balance')
plt.show()
## LOOK AT SUBJECTS
sns.countplot(data=all_news_df, x='subject')
plt.title('News Subjects')
plt.xticks(rotation=45)
plt.show()
## COMMON WORDS PLOTS
# column 0 is assumed to hold the label; the remaining columns are per-word counts
All_Y = all_news_cv_df.iloc[:, 1:].sum().sort_values(ascending=False)[:20]
Fake_Y = all_news_cv_df.iloc[:, 1:][all_news_cv_df.Label == 'Fake_News'].sum().sort_values(ascending=False)[:20]
True_Y = all_news_cv_df.iloc[:, 1:][all_news_cv_df.Label == 'True_News'].sum().sort_values(ascending=False)[:20]
All_cols = All_Y.index
Fake_cols = Fake_Y.index
True_cols = True_Y.index
sns.barplot(x=All_cols, y=All_Y)
plt.xticks(rotation=90)
plt.title('Top 20 Words Among All Articles')
plt.show()
sns.barplot(x=Fake_cols, y=Fake_Y)
plt.xticks(rotation=90)
plt.title('Top 20 Words Among Fake Articles')
plt.show()
sns.barplot(x=True_cols, y=True_Y)
plt.xticks(rotation=90)
plt.title('Top 20 Words Among True Articles')
plt.show()