
Spam Filtering Based on Naive Bayes Classification

In this project, I use the Naive Bayes method, a widely used statistical classifier, to build a spam filter. Each sample email carries one of two labels, Spam or Ham. Naive Bayes classification is explained below and then applied to the sample emails for testing.

Sample emails:

index Text label
0 gpcm summary this been great yea... Ham
1 national charity suffering since .. Spam
: : :
In [1]:
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
import string
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report,confusion_matrix
import random as rd
import os
import json
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('mode.chained_assignment', None)
In [2]:
#Read emails from the Ham/ and Spam/ folders of the Enron2 dataset
path = "enron2/"
labels = ['Ham','Spam']
Category=[]
textdata = []
for l in labels:
    for file in os.listdir(path+l):
        Category.append(l)
        with open(os.path.join(path+l,file),encoding='utf8',errors="ignore") as f:
            data = f.read()
            textdata.append(data)
df = pd.DataFrame({"Text" : textdata , "Category":Category})
In [3]:
#Remove non-letter characters and words with three or fewer letters
df = df.replace("[^a-zA-Z]"," ", regex=True)
#Repeat the pass: each match consumes its trailing space, so consecutive short words
#are only caught on later iterations
for i in range(5):
    df = df.replace("\s(\w{1,3})\s",' ', regex=True)
In [4]:
df.rename(columns = {'Category': 'label', 'Text': 'Email'}, inplace = True)
df['label'] = df['label'].map({'Ham': 0, 'Spam': 1})
df=df.sample(frac=1).reset_index(drop=True)
df.head()
Out[4]:
Email label
0 Subject book lacima course attendees julie ... 0
1 Subject summary spreadsheet data vendor res... 0
2 Subject california update have que... 0
3 Subject real options openings vince tha... 0
4 Subject resco database customer capture ste... 0
In [5]:
#Graph for Spam Words
spam_words = ' '.join(list(df[df['label'] == 1]['Email']))
spam_wc = WordCloud(width = 512,height = 512).generate(spam_words)
plt.figure(figsize = (10, 8), facecolor = 'k')
plt.imshow(spam_wc)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()
In [6]:
#Graph for Ham Words
ham_words = ' '.join(list(df[df['label'] == 0]['Email']))
ham_wc = WordCloud(width = 512,height = 512).generate(ham_words)
plt.figure(figsize = (10, 8), facecolor = 'k')
plt.imshow(ham_wc)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()
In [7]:
df['Email']=df['Email'].str.lower()
df['words']=df['Email'].str.split()
df.head()
Out[7]:
Email label words
0 subject book lacima course attendees julie ... 0 [subject, book, lacima, course, attendees, jul...
1 subject summary spreadsheet data vendor res... 0 [subject, summary, spreadsheet, data, vendor, ...
2 subject california update have que... 0 [subject, california, update, have, questions,...
3 subject real options openings vince tha... 0 [subject, real, options, openings, vince, than...
4 subject resco database customer capture ste... 0 [subject, resco, database, customer, capture, ...

Split Data to Train and Test

In [8]:
#Split into training and test sets; pass random_state so the split is reproducible
train, test = train_test_split(df, test_size=0.3, random_state=42)
In [9]:
train=train.reset_index(drop=True)
test=test.reset_index(drop=True)
In [10]:
train.head()
Out[10]:
Email label words
0 subject citi wells enron form v... 0 [subject, citi, wells, enron, form, venture, g...
1 subject bullet points please respond vince ... 0 [subject, bullet, points, please, respond, vin...
2 subject hello from vince kaminski enron shm... 0 [subject, hello, from, vince, kaminski, enron,...
3 subject chicago partners have received inquir... 0 [subject, chicago, partners, have, received, i...
4 subject enron announcement rental options enr... 0 [subject, enron, announcement, rental, options...
In [11]:
print('Number of Emails')
print ("Train_Spam: {}".format(len(train[train.label==1])))
print ("Train_Ham : {}".format(len(train[train.label==0])))
print ("Test_Spam : {}".format(len(test[test.label==1])))
print ("Test_Ham  : {}".format(len(test[test.label==0])))
print ("Total     : {}".format(len(df)))
Number of Emails
Train_Spam: 1047
Train_Ham : 3052
Test_Spam : 449
Test_Ham  : 1309
Total     : 5857

Training Data

The following tables show how many times each token appears in the training emails, overall and separately for the Ham and Spam classes.

In [12]:
WORDS = []
for cell in train.words:
    for word in cell:
        WORDS.append(word)
words=pd.DataFrame(WORDS, columns=['word'])
words['count']=1
Words=words.groupby('word', as_index=False).sum().iloc[:,0:2]
#Words=Words[Words['count']>10]
Words.head()
Out[12]:
word count
0 a 1
1 aada 1
2 aaldous 3
3 aaliyah 1
4 aall 2
In [13]:
Hamwords = []
for cell in train[train.label==0].words:
    for word in cell:
        Hamwords.append(word)
words=pd.DataFrame(Hamwords, columns=['word'])
words['Appearances in Ham']=1
HamWord=words.groupby('word', as_index=False).sum().iloc[:,0:2]
#HamWord=HamWord[HamWord['Appearances in Ham']>10]
HamWord.head()
Out[13]:
word Appearances in Ham
0 a 1
1 aaldous 3
2 aanalysis 1
3 aaron 2
4 abacus 16
In [14]:
Spamwords = []
for cell in train[train.label==1].words:
    for word in cell:
        Spamwords.append(word)
words=pd.DataFrame(Spamwords, columns=['word'])
words['Appearances in Spam']=1
SpamWord=words.groupby('word', as_index=False).sum().iloc[:,0:2]
#SpamWord=SpamWord[SpamWord['Appearances in Spam']>10]
SpamWord.head()
Out[14]:
word Appearances in Spam
0 aada 1
1 aaliyah 1
2 aall 2
3 aanbracht 1
4 aangekondigde 1
In [15]:
merged = pd.merge(Words,HamWord, on='word', how='outer')
Words=pd.merge(merged,SpamWord, on='word', how='outer')
Words=Words.fillna(0)
Words.head()
Out[15]:
word count Appearances in Ham Appearances in Spam
0 a 1 1.0 0.0
1 aada 1 0.0 1.0
2 aaldous 3 3.0 0.0
3 aaliyah 1 0.0 1.0
4 aall 2 0.0 2.0
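The same count table can be built more compactly with pandas itself; the following is a minimal sketch (equivalent in spirit to the three loops above, not the notebook's original code), assuming pandas 0.25+ for DataFrame.explode:

# Explode each email's word list into one row per word occurrence,
# then cross-tabulate word counts against the class label
exploded = train[['words', 'label']].explode('words')
counts = pd.crosstab(exploded['words'], exploded['label'])
counts.columns = ['Appearances in Ham', 'Appearances in Spam']
counts['count'] = counts['Appearances in Ham'] + counts['Appearances in Spam']
counts = counts.reset_index().rename(columns={'words': 'word'})
counts.head()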

Filtering Process

First, we break the email we want to classify into its individual words $w_1, \ldots, w_n$ and denote the email by $E$. The probability of receiving the email $E$ is the probability of receiving the list of words $w_1, \ldots, w_n$; under the naive Bayes assumption the words are treated as independent of one another, so these probabilities factor into products:

$$P(E) = P(w_1, \ldots, w_n) = \prod_{i=1}^n P(w_i)$$
$$P(E|H) = P(w_1, \ldots, w_n|H) = \prod_{i=1}^n P(w_i|H)$$

Using the training dataset, we estimate for each word $w_i$ the probability $P(w_i|S)$ that an email from the spam class $S$ contains $w_i$, and the probability $P(w_i|H)$ that an email from the ham class $H$ contains it.

In the following formulas, $P(w_i \cap S)$ is the probability that a given email is spam and contains the word $w_i$. By the definition of conditional probability:

$$P(w_i|S)=\frac{P(w_i\cap S)}{P(S)}$$
$$P(w_i|H)=\frac{P(w_i\cap H)}{P(H)}$$
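To turn these quantities into a decision rule, apply Bayes' theorem to both classes and take the ratio: the $P(E)$ terms cancel and, with the independence assumption above, the posterior odds of spam versus ham factor over the words,

$$\frac{P(S|E)}{P(H|E)} = \frac{P(E|S)\,P(S)}{P(E|H)\,P(H)} = \frac{P(S)}{P(H)} \prod_{i=1}^{n} \frac{P(w_i|S)}{P(w_i|H)}$$

Taking logarithms turns the product into a sum, which is what the code below accumulates:

$$\log\frac{P(S|E)}{P(H|E)} = \log\frac{P(S)}{P(H)} + \sum_{i=1}^{n}\log P(w_i|S) - \sum_{i=1}^{n}\log P(w_i|H)$$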
In [16]:
#Smoothing
#Estimating P(W|H) or P(W|S) directly would give probability 0 for words unseen in a class,
#which makes the whole product zero. A common fix is smoothing: here a small constant,
#0.1 divided by the total word count, is added to every estimate.
Words['P(W|H)']=(Words['Appearances in Ham'])/(len(train[train.label==0]))+0.1/Words['count'].sum()
Words['P(W|S)']=(Words['Appearances in Spam'])/(len(train[train.label==1]))+0.1/Words['count'].sum()
#Words['log(P(W|S)/P(W|H))']=np.log(Words['P(W|S)']/Words['P(W|H)'])
Words.head()
Out[16]:
word count Appearances in Ham Appearances in Spam P(W|H) P(W|S)
0 a 1 1.0 0.0 3.278334e-04 1.794356e-07
1 aada 1 0.0 1.0 1.794356e-07 9.552893e-04
2 aaldous 3 3.0 0.0 9.831414e-04 1.794356e-07
3 aaliyah 1 0.0 1.0 1.794356e-07 9.552893e-04
4 aall 2 0.0 2.0 1.794356e-07 1.910399e-03
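For comparison, textbook add-one (Laplace) smoothing of the multinomial word probabilities, as used by classifiers such as MultinomialNB, would look like the following minimal sketch (an alternative to the constant 0.1/total-count term above, not what this notebook uses; the _laplace column names are illustrative):

# Multinomial add-one smoothing: (count + 1) / (class total + vocabulary size)
V = len(Words)                                   # vocabulary size
ham_total = Words['Appearances in Ham'].sum()    # total ham word occurrences
spam_total = Words['Appearances in Spam'].sum()  # total spam word occurrences
Words['P(W|H)_laplace'] = (Words['Appearances in Ham'] + 1) / (ham_total + V)
Words['P(W|S)_laplace'] = (Words['Appearances in Spam'] + 1) / (spam_total + V)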

We add $\log\frac{P(S)}{P(H)}$, $\sum_i \log P(w_i|S)$ and $\sum_i \log P(w_i|H)$ as columns of the test data frame, as shown below, to obtain $\log\frac{P(S|E)}{P(H|E)}$ for each email. If $\log\frac{P(S|E)}{P(H|E)} > 0$ we classify the email as spam; if it is less than zero we classify it as ham.

Sample emails:

index Text label $\log\frac{P(S)}{P(H)}$ $\sum_i\log P(w_i|S)$ $\sum_i\log P(w_i|H)$ predict_label
0 gpcm summary this been great yea... Ham ... ... ... ...
1 national charity suffering since .. Spam ... ... ... ...
: : : ... ... ... ...
$$\log\Big(\prod_{i=1}^n P(w_i|S)\Big) = \log P(w_1|S) + \log P(w_2|S) + \cdots + \log P(w_n|S)$$
$$\log\Big(\prod_{i=1}^n P(w_i|H)\Big) = \log P(w_1|H) + \log P(w_2|H) + \cdots + \log P(w_n|H)$$
In [17]:
#For each test email, look up the smoothed probability of every word and sum the logs
arr = ['P(W|H)', 'P(W|S)']
arr2 = ['log(P(wi|H))', 'log(P(wi|S))']

for p,p2 in zip(arr,arr2):
    #one single-column frame of words per test email
    Y=[]
    for i in range (0, len(test)):
        Y.append((pd.DataFrame(test.words[i], columns=['word'])))
    #look up the probability of each word (empty if the word was not seen in training)
    A=[]
    for i in range(len(test)):
        A.append([np.array(Words[Words.word==wrds][p]) for wrds in Y[i].word])
    #take logs, skip the first word, drop unseen words and sum per email
    ZZ=[]
    for i in range (0, len(test)):
        ZZ.append(np.log(pd.DataFrame(A[i]).iloc[1:,]).dropna().sum())
    test[p2]=pd.DataFrame(ZZ)
test.head()
Out[17]:
Email label words log(P(wi|H)) log(P(wi|S))
0 subject agenda mission possible meeting agend... 0 [subject, agenda, mission, possible, meeting, ... -67.598837 -106.870520
1 subject summer internship position vince ... 0 [subject, summer, internship, position, vince,... -514.663007 -1176.028929
2 subject graphics software available cheap v... 1 [subject, graphics, software, available, cheap... -651.106740 -223.812194
3 subject class request chart exce... 0 [subject, class, request, chart, excel, charti... -169.999713 -346.530882
4 subject wharton tiger team have reserve... 0 [subject, wharton, tiger, team, have, reserved... -725.153785 -1943.115134
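The lookups above scan the whole Words frame once per word, which is slow; an equivalent, much faster sketch (same log-likelihood sums, using the Words and test frames already defined, though unlike the loop above it does not skip the first token, so the numbers can differ slightly) maps each word to its log-probability through a dictionary:

# Precompute word -> log-probability lookups once
log_ph = dict(zip(Words['word'], np.log(Words['P(W|H)'])))
log_ps = dict(zip(Words['word'], np.log(Words['P(W|S)'])))

# Sum log-probabilities over each email, ignoring words unseen in training
def log_likelihood(words, table):
    return sum(table[w] for w in words if w in table)

test['log(P(wi|H))'] = test['words'].apply(lambda ws: log_likelihood(ws, log_ph))
test['log(P(wi|S))'] = test['words'].apply(lambda ws: log_likelihood(ws, log_ps))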
In [18]:
#Prior log-odds of spam vs. ham (computed here from the test labels)
test['log(P(S)/P(H))']=np.log(len(test[test.label==1])/len(test[test.label==0]))
In [19]:
test['log(P(S|E)/P(H|E))']=test['log(P(S)/P(H))']+test['log(P(wi|S))']-test['log(P(wi|H))']
test.head()
Out[19]:
Email label words log(P(wi|H)) log(P(wi|S)) log(P(S)/P(H)) log(P(S|E)/P(H|E))
0 subject agenda mission possible meeting agend... 0 [subject, agenda, mission, possible, meeting, ... -67.598837 -106.870520 -1.069996 -40.341679
1 subject summer internship position vince ... 0 [subject, summer, internship, position, vince,... -514.663007 -1176.028929 -1.069996 -662.435918
2 subject graphics software available cheap v... 1 [subject, graphics, software, available, cheap... -651.106740 -223.812194 -1.069996 426.224550
3 subject class request chart exce... 0 [subject, class, request, chart, excel, charti... -169.999713 -346.530882 -1.069996 -177.601165
4 subject wharton tiger team have reserve... 0 [subject, wharton, tiger, team, have, reserved... -725.153785 -1943.115134 -1.069996 -1219.031345
In [20]:
#Classify as spam (1) if the log posterior odds are positive, ham (0) otherwise
label=[]
for i in range (0, len(test)):
    if test['log(P(S|E)/P(H|E))'][i]<0:
        label.append(0)
    else:
        label.append(1)
In [21]:
test['predict_label']=label
In [22]:
print(metrics.confusion_matrix(test['label'], test['predict_label']))
[[1308    1]
 [  27  422]]
In [23]:
array = confusion_matrix(test['label'], test['predict_label'])
df_cm = pd.DataFrame(array, range(2),range(2))
(df_cm)
Out[23]:
0 1
0 1308 1
1 27 422
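Labelling the rows (true class) and columns (predicted class) makes the matrix easier to read; a small cosmetic sketch using the same array:

# Same confusion matrix with named classes
df_cm_named = pd.DataFrame(array,
                           index=['Ham (true)', 'Spam (true)'],
                           columns=['Ham (pred)', 'Spam (pred)'])
df_cm_named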
In [24]:
array = confusion_matrix(test['label'], test['predict_label'])
df_cm = pd.DataFrame(array, range(2),
                  range(2))
#plt.figure(figsize = (10,7))
sns.set(font_scale=1.4)#for label size
sns.heatmap(df_cm, annot=True,annot_kws={"size": 16}, fmt="d")# font size
print('Number of Emails')
print ("Train_Spam: {}".format(len(train[train.label==1])))
print ("Train_Ham : {}".format(len(train[train.label==0])))
print ("Test_Spam : {}".format(len(test[test.label==1])))
print ("Test_Ham  : {}".format(len(test[test.label==0])))
print ("Total     : {}".format(len(df)))
print ("Accuracy : %{:.4f}".format(((array[0][0]+array[1][1])/len(test))*100))
Number of Emails
Train_Spam: 1047
Train_Ham : 3052
Test_Spam : 449
Test_Ham  : 1309
Total     : 5857
Accuracy : 98.4073%
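The same accuracy can be read directly from scikit-learn's metrics module, which is already imported above:

# Equivalent accuracy via sklearn
print("Accuracy : {:.4f}%".format(metrics.accuracy_score(test['label'], test['predict_label']) * 100))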
In [25]:
print(classification_report(test['label'], test['predict_label']))
             precision    recall  f1-score   support

          0       0.98      1.00      0.99      1309
          1       1.00      0.94      0.97       449

avg / total       0.98      0.98      0.98      1758
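As a cross-check, the classes imported at the top (CountVectorizer, TfidfTransformer, MultinomialNB, Pipeline) can reproduce the whole filter in a few lines of scikit-learn; a minimal sketch on the same train/test split (its numbers will not match the hand-rolled filter exactly):

# Bag-of-words -> tf-idf -> multinomial Naive Bayes, trained on the raw email text
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(train['Email'], train['label'])
sk_pred = pipeline.predict(test['Email'])
print(confusion_matrix(test['label'], sk_pred))
print(classification_report(test['label'], sk_pred))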

Published: October 03, 2018