
Spam Filtering Based on Naive Bayes Classification

In this project, I use the Naive Bayes method, a widely used statistical classifier, to build a spam filter. Each sample email carries one of two labels, Spam or Ham. Naive Bayes classification is explained below and then applied to the sample emails for testing.

Sample emails:

index Text label
0 gpcm summary this been great yea... Ham
1 national charity suffering since .. Spam
: : :
In [1]:
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
import string
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report,confusion_matrix
import random as rd
import os
import json
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('mode.chained_assignment', None)
In [2]:
#Read emails from the Ham/ and Spam/ folders of the Enron2 dataset
path = "enron2/"
labels = ['Ham','Spam']
Category=[]
textdata = []
for l in labels:
    for file in os.listdir(path+l):
        Category.append(l)
        with open(os.path.join(path+l,file),encoding='utf8',errors="ignore") as f:
            data = f.read()
            textdata.append(data)
df = pd.DataFrame({"Text" : textdata , "Category":Category})
In [3]:
#Remove non-letter characters and words with three or fewer letters
df = df.replace("[^a-zA-Z]"," ", regex=True)
#Repeat the pass: each match consumes its trailing space, so consecutive short words
#are only caught on later iterations
for i in range(5):
    df = df.replace("\s(\w{1,3})\s",' ', regex=True)
In [4]:
df.rename(columns = {'Category': 'label', 'Text': 'Email'}, inplace = True)
df['label'] = df['label'].map({'Ham': 0, 'Spam': 1})
df=df.sample(frac=1).reset_index(drop=True)
df.head()
Out[4]:
Email label
0 Subject book lacima course attendees julie ... 0
1 Subject summary spreadsheet data vendor res... 0
2 Subject california update have que... 0
3 Subject real options openings vince tha... 0
4 Subject resco database customer capture ste... 0
In [5]:
#Graph for Spam Words
spam_words = ' '.join(list(df[df['label'] == 1]['Email']))
spam_wc = WordCloud(width = 512,height = 512).generate(spam_words)
plt.figure(figsize = (10, 8), facecolor = 'k')
plt.imshow(spam_wc)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()
In [6]:
#Graph for Ham Words
ham_words = ' '.join(list(df[df['label'] == 0]['Email']))
ham_wc = WordCloud(width = 512,height = 512).generate(ham_words)
plt.figure(figsize = (10, 8), facecolor = 'k')
plt.imshow(ham_wc)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()
In [7]:
df['Email']=df['Email'].str.lower()
df['words']=df['Email'].str.split()
df.head()
Out[7]:
Email label words
0 subject book lacima course attendees julie ... 0 [subject, book, lacima, course, attendees, jul...
1 subject summary spreadsheet data vendor res... 0 [subject, summary, spreadsheet, data, vendor, ...
2 subject california update have que... 0 [subject, california, update, have, questions,...
3 subject real options openings vince tha... 0 [subject, real, options, openings, vince, than...
4 subject resco database customer capture ste... 0 [subject, resco, database, customer, capture, ...

Split Data to Train and Test

In [8]:
#Split into training and test sets; pass random_state so the split is reproducible
train, test = train_test_split(df, test_size=0.3, random_state=42)
In [9]:
train=train.reset_index(drop=True)
test=test.reset_index(drop=True)
In [10]:
train.head()
Out[10]:
Email label words
0 subject citi wells enron form v... 0 [subject, citi, wells, enron, form, venture, g...
1 subject bullet points please respond vince ... 0 [subject, bullet, points, please, respond, vin...
2 subject hello from vince kaminski enron shm... 0 [subject, hello, from, vince, kaminski, enron,...
3 subject chicago partners have received inquir... 0 [subject, chicago, partners, have, received, i...
4 subject enron announcement rental options enr... 0 [subject, enron, announcement, rental, options...
In [11]:
print('Number of Emails')
print ("Train_Spam: {}".format(len(train[train.label==1])))
print ("Train_Ham : {}".format(len(train[train.label==0])))
print ("Test_Spam : {}".format(len(test[test.label==1])))
print ("Test_Ham  : {}".format(len(test[test.label==0])))
print ("Total     : {}".format(len(df)))
Number of Emails
Train_Spam: 1047
Train_Ham : 3052
Test_Spam : 449
Test_Ham  : 1309
Total     : 5857

Training Data

The following tables show how many times each token appears in the training emails, overall and separately for the Ham and Spam classes.

In [12]:
WORDS = []
for cell in train.words:
    for word in cell:
        WORDS.append(word)
words=pd.DataFrame(WORDS, columns=['word'])
words['count']=1
Words=words.groupby('word', as_index=False).sum().iloc[:,0:2]
#Words=Words[Words['count']>10]
Words.head()
Out[12]:
word count
0 a 1
1 aada 1
2 aaldous 3
3 aaliyah 1
4 aall 2
In [13]:
Hamwords = []
for cell in train[train.label==0].words:
    for word in cell:
        Hamwords.append(word)
words=pd.DataFrame(Hamwords, columns=['word'])
words['Appearances in Ham']=1
HamWord=words.groupby('word', as_index=False).sum().iloc[:,0:2]
#HamWord=HamWord[HamWord['Appearances in Ham']>10]
HamWord.head()
Out[13]:
word Appearances in Ham
0 a 1
1 aaldous 3
2 aanalysis 1
3 aaron 2
4 abacus 16
In [14]:
Spamwords = []
for cell in train[train.label==1].words:
    for word in cell:
        Spamwords.append(word)
words=pd.DataFrame(Spamwords, columns=['word'])
words['Appearances in Spam']=1
SpamWord=words.groupby('word', as_index=False).sum().iloc[:,0:2]
#SpamWord=SpamWord[SpamWord['Appearances in Spam']>10]
SpamWord.head()
Out[14]:
word Appearances in Spam
0 aada 1
1 aaliyah 1
2 aall 2
3 aanbracht 1
4 aangekondigde 1
In [15]:
merged = pd.merge(Words,HamWord, on='word', how='outer')
Words=pd.merge(merged,SpamWord, on='word', how='outer')
Words=Words.fillna(0)
Words.head()
Out[15]:
word count Appearances in Ham Appearances in Spam
0 a 1 1.0 0.0
1 aada 1 0.0 1.0
2 aaldous 3 3.0 0.0
3 aaliyah 1 0.0 1.0
4 aall 2 0.0 2.0
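The same count table can be built more compactly with pandas itself; the following is a minimal sketch (equivalent in spirit to the three loops above, not the notebook's original code), assuming pandas 0.25+ for DataFrame.explode:

# Explode each email's word list into one row per word occurrence,
# then cross-tabulate word counts against the class label
exploded = train[['words', 'label']].explode('words')
counts = pd.crosstab(exploded['words'], exploded['label'])
counts.columns = ['Appearances in Ham', 'Appearances in Spam']
counts['count'] = counts['Appearances in Ham'] + counts['Appearances in Spam']
counts = counts.reset_index().rename(columns={'words': 'word'})
counts.head()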

Filtering Process

First, we break the email we want to classify into its individual words $w_1, \ldots, w_n$ and denote the email by $E$. The probability of receiving the email $E$ is the probability of receiving the list of words $w_1, \ldots, w_n$; under the naive Bayes assumption the words are treated as independent of one another, so these probabilities factor into products:

$$P(E) = P(w_1, \ldots, w_n) = \prod_{i=1}^n P(w_i)$$
$$P(E|H) = P(w_1, \ldots, w_n|H) = \prod_{i=1}^n P(w_i|H)$$

Using the training dataset, we estimate for each word $w_i$ the probability $P(w_i|S)$ that an email from the spam class $S$ contains $w_i$, and the probability $P(w_i|H)$ that an email from the ham class $H$ contains it.

In the following formulas, $P(w_i \cap S)$ is the probability that a given email is spam and contains the word $w_i$. By the definition of conditional probability:

$$P(w_i|S)=\frac{P(w_i\cap S)}{P(S)}$$
$$P(w_i|H)=\frac{P(w_i\cap H)}{P(H)}$$
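To turn these quantities into a decision rule, apply Bayes' theorem to both classes and take the ratio: the $P(E)$ terms cancel and, with the independence assumption above, the posterior odds of spam versus ham factor over the words,

$$\frac{P(S|E)}{P(H|E)} = \frac{P(E|S)\,P(S)}{P(E|H)\,P(H)} = \frac{P(S)}{P(H)} \prod_{i=1}^{n} \frac{P(w_i|S)}{P(w_i|H)}$$

Taking logarithms turns the product into a sum, which is what the code below accumulates:

$$\log\frac{P(S|E)}{P(H|E)} = \log\frac{P(S)}{P(H)} + \sum_{i=1}^{n}\log P(w_i|S) - \sum_{i=1}^{n}\log P(w_i|H)$$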
In [16]:
#Smoothing
#Estimating P(W|H) or P(W|S) directly would give probability 0 for words unseen in a class,
#which makes the whole product zero. A common fix is smoothing: here a small constant,
#0.1 divided by the total word count, is added to every estimate.
Words['P(W|H)']=(Words['Appearances in Ham'])/(len(train[train.label==0]))+0.1/Words['count'].sum()
Words['P(W|S)']=(Words['Appearances in Spam'])/(len(train[train.label==1]))+0.1/Words['count'].sum()
#Words['log(P(W|S)/P(W|H))']=np.log(Words['P(W|S)']/Words['P(W|H)'])
Words.head()
Out[16]:
word count Appearances in Ham Appearances in Spam P(W|H) P(W|S)
0 a 1 1.0 0.0 3.278334e-04 1.794356e-07
1 aada 1 0.0 1.0 1.794356e-07 9.552893e-04
2 aaldous 3 3.0 0.0 9.831414e-04 1.794356e-07
3 aaliyah 1 0.0 1.0 1.794356e-07 9.552893e-04
4 aall 2 0.0 2.0 1.794356e-07 1.910399e-03
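For comparison, textbook add-one (Laplace) smoothing of the multinomial word probabilities, as used by classifiers such as MultinomialNB, would look like the following minimal sketch (an alternative to the constant 0.1/total-count term above, not what this notebook uses; the _laplace column names are illustrative):

# Multinomial add-one smoothing: (count + 1) / (class total + vocabulary size)
V = len(Words)                                   # vocabulary size
ham_total = Words['Appearances in Ham'].sum()    # total ham word occurrences
spam_total = Words['Appearances in Spam'].sum()  # total spam word occurrences
Words['P(W|H)_laplace'] = (Words['Appearances in Ham'] + 1) / (ham_total + V)
Words['P(W|S)_laplace'] = (Words['Appearances in Spam'] + 1) / (spam_total + V)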

We add $\log\frac{P(S)}{P(H)}$, $\sum_i \log P(w_i|S)$ and $\sum_i \log P(w_i|H)$ as columns of the test data frame, as shown below, to obtain $\log\frac{P(S|E)}{P(H|E)}$ for each email. If $\log\frac{P(S|E)}{P(H|E)} > 0$ we classify the email as spam; if it is less than zero we classify it as ham.

Sample emails:

index Text label $\log\frac{P(S)}{P(H)}$ $\sum_i\log P(w_i|S)$ $\sum_i\log P(w_i|H)$ predict_label
0 gpcm summary this been great yea... Ham ... ... ... ...
1 national charity suffering since .. Spam ... ... ... ...
: : : ... ... ... ...
$$\log\Big(\prod_{i=1}^n P(w_i|S)\Big) = \log P(w_1|S) + \log P(w_2|S) + \cdots + \log P(w_n|S)$$
$$\log\Big(\prod_{i=1}^n P(w_i|H)\Big) = \log P(w_1|H) + \log P(w_2|H) + \cdots + \log P(w_n|H)$$
In [17]:
#For each test email, look up the smoothed probability of every word and sum the logs
arr = ['P(W|H)', 'P(W|S)']
arr2 = ['log(P(wi|H))', 'log(P(wi|S))']

for p,p2 in zip(arr,arr2):
    #one single-column frame of words per test email
    Y=[]
    for i in range (0, len(test)):
        Y.append((pd.DataFrame(test.words[i], columns=['word'])))
    #look up the probability of each word (empty if the word was not seen in training)
    A=[]
    for i in range(len(test)):
        A.append([np.array(Words[Words.word==wrds][p]) for wrds in Y[i].word])
    #take logs, skip the first word, drop unseen words and sum per email
    ZZ=[]
    for i in range (0, len(test)):
        ZZ.append(np.log(pd.DataFrame(A[i]).iloc[1:,]).dropna().sum())
    test[p2]=pd.DataFrame(ZZ)
test.head()
Out[17]:
Email label words log(P(wi|H)) log(P(wi|S))
0 subject agenda mission possible meeting agend... 0 [subject, agenda, mission, possible, meeting, ... -67.598837 -106.870520
1 subject summer internship position vince ... 0 [subject, summer, internship, position, vince,... -514.663007 -1176.028929
2 subject graphics software available cheap v... 1 [subject, graphics, software, available, cheap... -651.106740 -223.812194
3 subject class request chart exce... 0 [subject, class, request, chart, excel, charti... -169.999713 -346.530882
4 subject wharton tiger team have reserve... 0 [subject, wharton, tiger, team, have, reserved... -725.153785 -1943.115134
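The lookups above scan the whole Words frame once per word, which is slow; an equivalent, much faster sketch (same log-likelihood sums, using the Words and test frames already defined, though unlike the loop above it does not skip the first token, so the numbers can differ slightly) maps each word to its log-probability through a dictionary:

# Precompute word -> log-probability lookups once
log_ph = dict(zip(Words['word'], np.log(Words['P(W|H)'])))
log_ps = dict(zip(Words['word'], np.log(Words['P(W|S)'])))

# Sum log-probabilities over each email, ignoring words unseen in training
def log_likelihood(words, table):
    return sum(table[w] for w in words if w in table)

test['log(P(wi|H))'] = test['words'].apply(lambda ws: log_likelihood(ws, log_ph))
test['log(P(wi|S))'] = test['words'].apply(lambda ws: log_likelihood(ws, log_ps))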
In [18]:
#Prior log-odds of spam vs. ham (computed here from the test labels)
test['log(P(S)/P(H))']=np.log(len(test[test.label==1])/len(test[test.label==0]))
In [19]:
test['log(P(S|E)/P(H|E))']=test['log(P(S)/P(H))']+test['log(P(wi|S))']-test['log(P(wi|H))']
test.head()
Out[19]:
Email label words log(P(wi|H)) log(P(wi|S)) log(P(S)/P(H)) log(P(S|E)/P(H|E))
0 subject agenda mission possible meeting agend... 0 [subject, agenda, mission, possible, meeting, ... -67.598837 -106.870520 -1.069996 -40.341679
1 subject summer internship position vince ... 0 [subject, summer, internship, position, vince,... -514.663007 -1176.028929 -1.069996 -662.435918
2 subject graphics software available cheap v... 1 [subject, graphics, software, available, cheap... -651.106740 -223.812194 -1.069996 426.224550
3 subject class request chart exce... 0 [subject, class, request, chart, excel, charti... -169.999713 -346.530882 -1.069996 -177.601165
4 subject wharton tiger team have reserve... 0 [subject, wharton, tiger, team, have, reserved... -725.153785 -1943.115134 -1.069996 -1219.031345
In [20]:
#Classify as spam (1) if the log posterior odds are positive, ham (0) otherwise
label=[]
for i in range (0, len(test)):
    if test['log(P(S|E)/P(H|E))'][i]<0:
        label.append(0)
    else:
        label.append(1)
In [21]:
test['predict_label']=label
In [22]:
print(metrics.confusion_matrix(test['label'], test['predict_label']))
[[1308    1]
 [  27  422]]
In [23]:
array = confusion_matrix(test['label'], test['predict_label'])
df_cm = pd.DataFrame(array, range(2),range(2))
(df_cm)
Out[23]:
0 1
0 1308 1
1 27 422
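Labelling the rows (true class) and columns (predicted class) makes the matrix easier to read; a small cosmetic sketch using the same array:

# Same confusion matrix with named classes
df_cm_named = pd.DataFrame(array,
                           index=['Ham (true)', 'Spam (true)'],
                           columns=['Ham (pred)', 'Spam (pred)'])
df_cm_named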
In [24]:
array = confusion_matrix(test['label'], test['predict_label'])
df_cm = pd.DataFrame(array, range(2),
                  range(2))
#plt.figure(figsize = (10,7))
sns.set(font_scale=1.4)#for label size
sns.heatmap(df_cm, annot=True,annot_kws={"size": 16}, fmt="d")# font size
print('Number of Emails')
print ("Train_Spam: {}".format(len(train[train.label==1])))
print ("Train_Ham : {}".format(len(train[train.label==0])))
print ("Test_Spam : {}".format(len(test[test.label==1])))
print ("Test_Ham  : {}".format(len(test[test.label==0])))
print ("Total     : {}".format(len(df)))
print ("Accuracy : %{:.4f}".format(((array[0][0]+array[1][1])/len(test))*100))
Number of Emails
Train_Spam: 1047
Train_Ham : 3052
Test_Spam : 449
Test_Ham  : 1309
Total     : 5857
Accuracy : 98.4073%
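The same accuracy can be read directly from scikit-learn's metrics module, which is already imported above:

# Equivalent accuracy via sklearn
print("Accuracy : {:.4f}%".format(metrics.accuracy_score(test['label'], test['predict_label']) * 100))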
In [25]:
print(classification_report(test['label'], test['predict_label']))
             precision    recall  f1-score   support

          0       0.98      1.00      0.99      1309
          1       1.00      0.94      0.97       449

avg / total       0.98      0.98      0.98      1758
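As a cross-check, the classes imported at the top (CountVectorizer, TfidfTransformer, MultinomialNB, Pipeline) can reproduce the whole filter in a few lines of scikit-learn; a minimal sketch on the same train/test split (its numbers will not match the hand-rolled filter exactly):

# Bag-of-words -> tf-idf -> multinomial Naive Bayes, trained on the raw email text
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(train['Email'], train['label'])
sk_pred = pipeline.predict(test['Email'])
print(confusion_matrix(test['label'], sk_pred))
print(classification_report(test['label'], sk_pred))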

Published: October 03, 2018