NLTK_NLP

Using natural language processing to build a spam filter for text messages

In [3]:
from PIL import Image
Image.open('NLP.png')
Out[3]:

Table of contents

  • Inspecting the dataset

  • Text preprocessing

  • Feature engineering

  • Training and evaluating a model

  • What terms are the top predictors of spam?

    In [4]:
    from wordcloud import WordCloud
    import nltk
    from nltk.corpus import stopwords
    import string
    from sklearn import metrics
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import classification_report,confusion_matrix
    import random as rd
    import os
    import json
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    pd.set_option('mode.chained_assignment', None)
    

    1. Inspecting the dataset

    In [5]:
    #Read the raw email files into a DataFrame
    path = "Emails/"
    labels = ['Ham','Spam']
    Category = []
    textdata = []
    for label in labels:
        for file in os.listdir(os.path.join(path, label)):
            Category.append(label)
            with open(os.path.join(path, label, file), encoding='utf8', errors="ignore") as f:
                textdata.append(f.read())
    df = pd.DataFrame({"Text": textdata, "Category": Category})
    
    In [6]:
    df.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 5857 entries, 0 to 5856
    Data columns (total 2 columns):
    Text        5857 non-null object
    Category    5857 non-null object
    dtypes: object(2)
    memory usage: 91.6+ KB
    

We have a collection of text data known as a corpus. Specifically, there are 5857 emails written in English, serving as training examples. The Text column holds the email message itself, stored as a string, and the Category column is the target variable containing the class labels, which tell us whether a message is spam or ham (i.e. not spam).

    Since the target variable contains discrete values, this is a classification task. Let's start by placing the target variable in its own table and checking out how the two classes are distributed.

    In [7]:
    y = df.Category
    y.value_counts()
    
    Out[7]:
    Ham     4361
    Spam    1496
    Name: Category, dtype: int64
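
    Roughly one message in four is spam, so the classes are imbalanced but not severely. A quick bar plot (a minimal sketch using the pandas plotting API already imported above) makes the skew visible:

    y.value_counts().plot(kind='bar', title='Class distribution')
    plt.show()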

    2. Text preprocessing

    There are many feature engineering strategies for transforming text data into features. Some involve assigning each unique word-like term to a feature and counting the number of occurrences per training example. However, if we were to perform this strategy right now, we'd end up with an absurd number of features, a result of the myriad possible terms. The classifier would take too long to train and likely overfit. As a result, each NLP problem requires a tailored approach to determine which terms are relevant and meaningful.
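
    To make the counting strategy concrete, here is a minimal sketch on a toy two-document corpus (the strings are made up, not taken from the dataset): every unique term becomes a feature column holding its occurrence count.

    from sklearn.feature_extraction.text import CountVectorizer

    toy_corpus = ["free prize click now", "meeting moved to friday"]
    cv = CountVectorizer()
    counts = cv.fit_transform(toy_corpus)
    print(cv.get_feature_names())  # one feature per unique term
    print(counts.toarray())        # per-document occurrence counts

    Even these two short messages already produce eight distinct features; a corpus of thousands of real emails produces hundreds of thousands.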

    Normalization

Let's begin by taking a step back and examining the terms of a hypothetical email.

    In [8]:
    #Remove non-letter characters and drop words of three or fewer letters
    df = df.replace(r"[^a-zA-Z]", " ", regex=True)
    df = df.replace(r"\s(\w{1,3})\s", ' ', regex=True)
    
    In [9]:
    df.head()
    
    Out[9]:
                                                     Text Category
    0                             Subject rankings thank      Ham
    1                   Subject visit enron vince wil...      Ham
    2      Subject research group move the floor hell...      Ham
    3    Subject christmas baskets christmas baskets h...      Ham
    4     Subject japan candidate vince spoke with wh...      Ham
    In [10]:
    df.rename(columns={'Category': 'label', 'Text': 'Email'}, inplace=True)
    df['label'] = df['label'].map({'Ham': 0, 'Spam': 1})
    df = df.sample(frac=1).reset_index(drop=True)
    df.head()
    
    Out[10]:
                                                    Email  label
    0     Subject datren williams acceptance are righ...      0
    1        Subject position smith norma need a pos...      0
    2     Subject transport model andy scale effect t...      0
    3    Subject considered unsolicited bulk email fro...      1
    4          Subject subject thanks vince vince ka...      0
    In [11]:
    processed = df.Email
    
    In [12]:
    processed = processed.str.lower()
    

    Removing stop words

    Some words in the English language, while necessary, don't contribute much to the meaning of a phrase. These words, such as "when", "had", "those" or "before", are called stop words and should be filtered out. The Natural Language Toolkit (NLTK), a popular Python library for NLP, provides common stop words.

    In [13]:
    stop_words = nltk.corpus.stopwords.words('english')
    
    In [14]:
    stop_set = set(stop_words)
    processed = processed.apply(lambda x: ' '.join(
        term for term in x.split() if term not in stop_set)
    )
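
    As a quick sanity check on a made-up sentence, the filter keeps only the content-bearing terms:

    example = "when had those words before the meeting"
    print(' '.join(t for t in example.split() if t not in stop_set))
    # -> 'words meeting'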
    

    Stemming

    It's likely the corpus contains words with various suffixes such as "distribute", "distributing", "distributor" or "distribution". We can replace these four words with just "distribut" via a preprocessing step called stemming. There are numerous stemming strategies, some more aggressive than others.

    In [15]:
    porter = nltk.PorterStemmer()
    processed = processed.apply(lambda x: ' '.join(
        porter.stem(term) for term in x.split())
    )
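
    Spot-checking the stemmer on the suffix family mentioned above:

    for w in ['distribute', 'distributing', 'distributor', 'distribution']:
        print(w, '->', porter.stem(w))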
    

    3. Feature engineering

    Implementing the tf-idf statistic
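
    The tf-idf statistic weights each term's frequency in a document (tf) by how rare the term is across the corpus (idf), so ubiquitous terms like "subject" carry less weight than discriminative ones. With its default smoothing, scikit-learn's TfidfVectorizer computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is how many contain term t, and then L2-normalizes each row. A minimal sketch on two made-up documents shows the effect:

    from sklearn.feature_extraction.text import TfidfVectorizer

    demo = TfidfVectorizer()
    weights = demo.fit_transform(['subject free prize', 'subject meeting notes'])
    # 'subject' appears in both documents, so its weight in row 0 is
    # lower than the weights of the rarer terms 'free' and 'prize'
    print(dict(zip(demo.get_feature_names(), weights.toarray()[0].round(2))))

    Passing ngram_range=(1, 2) below extends the vocabulary beyond single terms to include every adjacent pair of terms (bigrams), which lets the model pick up on short phrases.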

    In [16]:
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn import svm
    
    In [17]:
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X_ngrams = vectorizer.fit_transform(processed)
    
    In [18]:
    X_ngrams.shape
    
    Out[18]:
    (5857, 319282)
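
    Each email is now a sparse vector over 319,282 unigram and bigram features, far more features than training examples. This is exactly the regime where a regularized linear classifier is a sensible choice.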

    4. Training and evaluating a model

    In [19]:
    y_enc = df['label']
    
    In [20]:
    X_train, X_test, y_train, y_test = train_test_split(
        X_ngrams,
        y_enc,
        test_size=0.2,
        random_state=42,
        stratify=y_enc
    )
    
    clf = svm.LinearSVC(loss='hinge')
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    metrics.f1_score(y_test, y_pred)
    
    Out[20]:
    0.9847715736040609
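
    The F1 score is reported instead of plain accuracy because the classes are imbalanced; it balances precision and recall on the spam class. As a sketch of how the fitted pipeline would classify a brand-new message (predict_email is a hypothetical helper, not part of the notebook; it mirrors the preprocessing applied above):

    import re

    def predict_email(raw_text):
        # repeat the same normalization, stop word removal and stemming
        text = re.sub('[^a-zA-Z]', ' ', raw_text).lower()
        terms = [porter.stem(t) for t in text.split()
                 if len(t) > 3 and t not in stop_set]
        # reuse the fitted vectorizer; never re-fit it on new data
        features = vectorizer.transform([' '.join(terms)])
        return 'Spam' if clf.predict(features)[0] == 1 else 'Ham'

    predict_email('Subject: click here to claim your free prize')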
    In [21]:
    pd.DataFrame(
        metrics.confusion_matrix(y_test, y_pred),
        index=[['actual', 'actual'], ['ham', 'spam']],
        columns=[['predicted', 'predicted'], ['ham', 'spam']]
    )
    
    Out[21]:
                   predicted     
                         ham spam
    actual ham           872    1
           spam            8  291
    In [22]:
    cm = confusion_matrix(y_test, y_pred)
    df_cm = pd.DataFrame(cm, index=range(2), columns=range(2))
    #plt.figure(figsize = (10,7))
    sns.set(font_scale=1.4)  # label size
    sns.heatmap(df_cm, annot=True, annot_kws={"size": 16}, fmt="d")  # annotation font size
    
    Out[22]:
    <matplotlib.axes._subplots.AxesSubplot at 0x1d0c81ce438>
    In [23]:
    print(classification_report(y_test, y_pred))
    
                  precision    recall  f1-score   support
    
               0       0.99      1.00      0.99       873
               1       1.00      0.97      0.98       299
    
       micro avg       0.99      0.99      0.99      1172
       macro avg       0.99      0.99      0.99      1172
    weighted avg       0.99      0.99      0.99      1172
    
    

    Diagnosing the model with learning curves
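
    A learning curve retrains the model on increasingly large subsets of the data and plots the training and validation scores side by side: a persistent gap between the two curves suggests high variance (overfitting), while two low curves that converge suggest high bias (underfitting).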

    In [75]:
    from sklearn.model_selection import learning_curve
    from sklearn.model_selection import StratifiedShuffleSplit
    sample_space = np.linspace(500, len(df.Email) * 0.8, 10, dtype='int')
    
    train_sizes, train_scores, valid_scores = learning_curve(
        estimator=svm.LinearSVC(loss='hinge', C=1e10),
        X=X_ngrams,
        y=y_enc,
        train_sizes=sample_space,
        cv=StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=40),
        scoring='f1',
        n_jobs=-1
    )
    
    In [76]:
    def make_tidy(sample_space, train_scores, valid_scores):
        messy_format = pd.DataFrame(
            np.stack((sample_space, train_scores.mean(axis=1),
                      valid_scores.mean(axis=1)), axis=1),
            columns=['# of training examples', 'Training set', 'Validation set']
        )
        
        return pd.melt(
            messy_format,
            id_vars='# of training examples',
            value_vars=['Training set', 'Validation set'],
            var_name='Scores',
            value_name='F1 score'
        )
    
    In [77]:
    g = sns.FacetGrid(
        make_tidy(sample_space, train_scores, valid_scores), hue='Scores', height=5
    )
    
    g.map(plt.scatter, '# of training examples', 'F1 score')
    g.map(plt.plot, '# of training examples', 'F1 score').add_legend();
    

    Notice the performance on the training set is near perfect regardless of dataset size, which makes sense because we're evaluating the classifier on the same data used to train it. At first, it looks like the classifier is suffering from high variance and is overfitting since the validation scores never reach the same level. However, taking a closer look at the scale of the y-axis makes me believe this issue isn't that pronounced.

    Using nested cross-validation to minimize information leakage
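
    Tuning the regularization parameter C on the same folds used to report performance would leak information and bias the score upward. Nested cross-validation avoids this: an inner loop (GridSearchCV) selects C, while an outer loop (cross_val_score) measures how well the whole tune-and-fit procedure generalizes.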

    In [78]:
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import cross_val_score
    param_grid = [{'C': np.logspace(-4, 4, 20)}]
    
    grid_search = GridSearchCV(
        estimator=svm.LinearSVC(loss='hinge'),
        param_grid=param_grid,
        cv=StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42),
        scoring='f1',
        n_jobs=-1
    )
    
    scores = cross_val_score(
        estimator=grid_search,
        X=X_ngrams,
        y=y_enc,
        cv=StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
        scoring='f1',
        n_jobs=-1
    )
    
    scores
    
    Out[78]:
    array([0.98132428, 0.96763203, 0.98333333, 0.98138748, 0.97800338,
           0.97792869, 0.97959184, 0.98138748, 0.97627119, 0.97822446])
    In [79]:
    scores.mean()
    
    Out[79]:
    0.9785084152308429

    5. What terms are the top predictors of spam?

    In [80]:
    grid_search.fit(X_ngrams, y_enc)
    final_clf = svm.LinearSVC(loss='hinge', C=grid_search.best_params_['C'])
    final_clf.fit(X_ngrams, y_enc);
    
    C:\Users\AKAYA\Anaconda\lib\site-packages\sklearn\svm\base.py:922: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
      "the number of iterations.", ConvergenceWarning)
    
    In [92]:
    pd.Series(
        final_clf.coef_.T.ravel(),
        index=vectorizer.get_feature_names()
    ).sort_values(ascending=False)[:10].plot(kind='barh', figsize=(5,10))
    
    Out[92]:
    <matplotlib.axes._subplots.AxesSubplot at 0x1eb125fc438>
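
    Since LinearSVC assigns spam the positive side of the decision function, the terms with the largest positive coefficients are the strongest spam indicators.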

    A few obvious spam indicators appear at the top: click, life, softwar (the stem of "software") and http.