There are two datasets, both in .csv format. The first dataset, ‘Fake.csv’, contains 23481 articles, all of which were flagged as untrustworthy by Politifact, a fact-checking organization that assesses the trustworthiness of news articles. These articles come from many different websites flagged by Politifact, and the majority focus on world and politics news. The second dataset, ‘True.csv’, contains 21417 real, trustworthy articles, all from Reuters; the majority of these also focus on world and politics news. Both datasets contain the title of the article, the text of the article, the subject of the article, and the date the article was published. The majority of articles in both datasets are from 2016 to 2017.
Below are definitions of the columns in the Fake.csv and True.csv datasets.
title: the title of the article
text: the text of the article
subject: the subject of the article
date: the date the article was published
Below are images of the two datasets.
The important things to keep in mind when preparing data for Support Vector Machines and Naive Bayes in Python are that all variables must be quantitative and the dataset must contain labels.
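To make that target format concrete, here is a minimal sketch (the words and counts are made up for illustration only) of what a labeled, fully quantitative dataframe looks like: one label column plus numeric word-count columns.
import pandas as pd

# Illustrative example only: made-up word counts showing the format
# that SVM and Naive Bayes models expect (numeric features + a label column)
example = pd.DataFrame({
    'Label': ['True_News', 'Fake_News'],
    'election': [3, 5],   # quantitative feature: word count
    'president': [2, 1],  # quantitative feature: word count
})
print(example)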
These datasets cannot be used to create machine learning models in their current state. A few NLP cleaning techniques need to be performed on them first, and labels need to be added. The first step is to create a column containing a label of true or fake for each dataset. Then the two datasets need to be joined together.
The steps above are done in the following Python code. The Python file (Data_Prep.py) and the .csv files (Fake_News.csv, True_News.csv) necessary for running it can be downloaded below.
The following is the code that Data_Prep.py contains.
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 11 11:24:45 2023
@author: casey
"""
## LOAD LIBRARIES
import pandas as pd
# ---------------------------------------------------------------------------------------- #
## LOAD DATA
true_news = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Data_Mining/Course_Project/data/True_News.csv')
fake_news = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Data_Mining/Course_Project/data/Fake_News.csv')
# ---------------------------------------------------------------------------------------- #
## ADD LABELS
true_news['Label'] = 'True_News'
fake_news['Label'] = 'Fake_News'
# ---------------------------------------------------------------------------------------- #
## MERGE DATAFRAMES
# ignore_index=True gives the merged dataframe a clean 0..n-1 index
news_df = pd.concat([true_news, fake_news], axis=0, ignore_index=True)
# ---------------------------------------------------------------------------------------- #
## CHECK FOR MISSING VALUES
news_df.info()
# ---------------------------------------------------------------------------------------- #
## WRITE MERGED DATAFRAME TO .csv
# since there are no missing values, no further data cleaning needs to be done yet
news_df.to_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Data_Mining/Course_Project/data/All_News.csv', index=False)
The result of the above steps is the .csv file All_News.csv. This file can be viewed and downloaded below.
At this point, the true and fake news articles have been assigned labels and joined into one dataset. Next, only the text column and the Label column will be kept. Then all words need to be converted to lowercase, and stop words, numbers, and special characters need to be removed. Finally, the text needs to be vectorized and turned into a document term matrix so that models can be built. This is all done using vectorizer classes in Python. The text will be vectorized into counts and also into frequencies, because one of the two may perform better in modeling.
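For intuition on the difference between counts and frequencies, here is a minimal sketch using a toy two-document corpus (not the project data) showing how CountVectorizer produces raw counts while TfidfVectorizer produces weighted frequencies.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus for illustration only (not the news data)
docs = ['the president spoke today', 'the election results today']

cv = CountVectorizer()   # raw word counts
counts = cv.fit_transform(docs)
print(pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out()))

tfidf = TfidfVectorizer()   # counts reweighted by inverse document frequency
freqs = tfidf.fit_transform(docs)
print(pd.DataFrame(freqs.toarray(), columns=tfidf.get_feature_names_out()))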
The steps above are done in the following Python code. The Python file (Text_Vectorizer.py) and the .csv file (All_News.csv) necessary for running it can be downloaded below.
The following is the code that Text_Vectorizer.py contains.
# -*- coding: utf-8 -*-
"""
Created on Mon Mar 13 17:06:45 2023
@author: casey
"""
## Walks through how to create a vectorized dataframe from text data using count vectorizer
## and tfidf vectorizer, each row of the .csv file should contain a document
## LOAD LIBRARIES
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import wordcloud
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
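# NOTE (environment assumption): the NLTK stopword list must be downloaded once
# before stopwords.words('english') will work; uncomment on the first run:
# nltk.download('stopwords')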
import re
import seaborn as sns
# ------------------------------------------------------------------------------------- #
## LOAD DATA
news_df = pd.read_csv('C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Data_Mining/Course_Project/data/All_News.csv', on_bad_lines='skip')  # error_bad_lines was removed in newer pandas
# ------------------------------------------------------------------------------------- #
labels = ['True_News', 'Fake_News']
# ------------------------------------------------------------------------------------- #
## TEXT PROCESSING
## REMOVE any rows with NaN in them
news_df = news_df.dropna()
## REMOVE NUMBERS
def remove_numbers(text):
    text = re.sub(r'\d+', '', text)
    return text

news_df['text'] = news_df['text'].apply(remove_numbers)
# ------------------------------------------------------------------------------------- #
### Tokenize and Vectorize the Headlines
## Create the list of headlines
## Keep the labels!
TextLIST = []
LabelLIST = []
for nexthead, nextlabel in zip(news_df["text"], news_df["Label"]):
    TextLIST.append(nexthead)
    LabelLIST.append(nextlabel)
#print("The headline list is:\n")
#print(TextLIST)
#print("The label list is:\n")
#print(LabelLIST)
# ------------------------------------------------------------------------------------- #
## Remove all words that match the labels.
## For example, if the labels are True and Fake, remove these exact words.
## We will need to do this by hand.
NewTextLIST = []
lowered_labels = [label.lower() for label in labels]  # lowercase copies so the comparison below can match
for element in TextLIST:
    #print(element)
    #print(type(element))
    ## make into list
    AllWords = element.split(" ")
    #print(AllWords)
    ## Now remove words that match the labels
    NewWordsList = []
    for word in AllWords:
        #print(word)
        word = word.lower()
        if word in lowered_labels:
            print(word)
        else:
            NewWordsList.append(word)
    ## turn back into a string
    NewWords = " ".join(NewWordsList)
    ## Place into NewTextLIST
    NewTextLIST.append(NewWords)

## Set TextLIST to the new one
TextLIST = NewTextLIST
#print(TextLIST)
# ------------------------------------------------------------------------------------- #
## Build the labeled dataframes for count vectorizer and Tfidf Vectorizer
# add any additional words you want to remove in my_stop_words
my_stop_words = ['reuters', 'also', 'said']
### Vectorize
## Instantiate your Count Vectorizer
## Choose how many words to include with max_features
CountV = CountVectorizer(
    input="content",  ## "content" because we pass a list of strings, not file names
    lowercase=True,
    stop_words=stopwords.words('english') + my_stop_words,
    max_features=200
)
## Instantiate your Tfidf Vectorizer
Tfidf = TfidfVectorizer(
    input="content",  ## "content" because we pass a list of strings, not file names
    lowercase=True,
    stop_words=stopwords.words('english') + my_stop_words,
    max_features=200
)
## Use your CV
CV_DTM = CountV.fit_transform(TextLIST) # create a sparse matrix
#print(type(CV_DTM))
tfidf_DTM = Tfidf.fit_transform(TextLIST) # create a sparse matrix
#print(type(tfidf_DTM))
CV_Column_Names = CountV.get_feature_names_out()  # get_feature_names() was removed in newer scikit-learn
#print(type(CV_Column_Names))
tfidf_Column_Names = Tfidf.get_feature_names_out()
#print(type(tfidf_Column_Names))
## Build the data frame
CV_DTM_DF=pd.DataFrame(CV_DTM.toarray(), columns=CV_Column_Names)
tfidf_DTM_DF=pd.DataFrame(tfidf_DTM.toarray(), columns=tfidf_Column_Names)
## Convert the labels from list to df
Labels_DF = pd.DataFrame(LabelLIST,columns=['Label'])
## Check your new DF and your new Labels df:
#print("Labels\n")
#print(Labels_DF)
#print("News df\n")
#print(CV_DTM_DF.iloc[:,0:])
#print(tfidf_DTM_DF.iloc[:,0:])
## Save original DF - without the labels
My_Orig_CV_DF = CV_DTM_DF
#print(My_Orig_CV_DF)
My_Orig_tfidf_DF = tfidf_DTM_DF
#print(My_Orig_tfidf_DF)
## Now - let's create a complete and labeled
## dataframe:
CV_dfs = [Labels_DF, CV_DTM_DF]
#print(CV_dfs)
tfidf_dfs = [Labels_DF, tfidf_DTM_DF]
#print(tfidf_dfs)
CV_Final_News_DF_Labeled = pd.concat(CV_dfs,axis=1, join='inner')
tfidf_Final_News_DF_Labeled = pd.concat(tfidf_dfs,axis=1, join='inner')
## DF with labels
#print(CV_Final_News_DF_Labeled)
#print(tfidf_Final_News_DF_Labeled)
## Write the labeled dataframes to .csv (column names are written to the first row)
CV_Final_News_DF_Labeled.to_csv("C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Data_Mining/Course_Project/data/all_news_cv.csv", index=False)
tfidf_Final_News_DF_Labeled.to_csv("C:/Users/casey/OneDrive/Documents/MSDS_Courses/Spring_2023/Data_Mining/Course_Project/data/all_news_tfidf.csv", index=False)
The result of the steps above is two .csv files (all_news_cv.csv and all_news_tfidf.csv), which contain the vectorized articles in document term matrix format. The datasets can be viewed below, and they can be downloaded below as they will be used for modeling.
all_news_cv.csv contains the count vectorized article text
all_news_tfidf.csv contains the term frequency (TF-IDF) vectorized article text
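As a preview of the modeling step, the following is a minimal sketch (the file path, train/test split, and model choice are assumptions for illustration, not part of the scripts above) of how the labeled count dataframe could feed a Naive Bayes model.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Assumed path; point this at wherever all_news_cv.csv was written above
news = pd.read_csv('all_news_cv.csv')

X = news.drop(columns=['Label'])  # quantitative word-count features
y = news['Label']                 # True_News / Fake_News labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb = MultinomialNB()
nb.fit(X_train, y_train)
print(accuracy_score(y_test, nb.predict(X_test)))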