An Exploratory Data Analysis (EDA) of the Home Price Data¶
#!pip install pypng
#from PIL import Image
#Image.open('housesbanner.png')
We will look at each variable and use visualisation and quantitative methods to do a philosophical analysis of their meaning and importance for this problem before diving into machine learning or statistical modeling.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
from collections import Counter
import math
warnings.filterwarnings('ignore')
%matplotlib inline
1. Download and Load Data File¶
The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. The data split in two-part train and test. The training dataset has target value (home price), the test set doesn't have a target value and it needs to be predicted.
File descriptions:
*train.csv - the training set
*test.csv - the test set
*data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
*sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms
df= pd.read_csv('../input/train.csv')
print (df.shape)
df.head()
We want to save the 'Id' columns from both datasets for later use when preparing the submission data.
The data has 79 columns (features) and 1460 entries. We will look in the data description file.
df=df.iloc[:,1:81]
f=open('data_description.txt', 'r')
print(f.read())
# check the columns
names=df.columns
print(names)
Before looking at each variable we will check missing values.
df.isna().sum().sort_values(ascending=False).head(20)
We will analyze the missing values for each variables.
2. Data Preparation¶
Since our target value is the sale price, we will first look into Sale Price. We will look data type, min max and visualize the histogram to see if it shows normal distribution behave.
SalePrice¶
#descriptive statistics summary
df['SalePrice'].describe()
#histogram
#histogram and normal probability plot
sns.distplot(df['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df['SalePrice'], plot=plt)
The 'SalePrice' is not showing normal distribution behave. It shows positive (right) skewness and does not follow the diagonal line. We will transform the data to solve problem. Log transformations usually works well.
#applying log transformation
df['SalePrice'] = np.log(df['SalePrice'])
#histogram and normal probability plot
sns.distplot(df['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df['SalePrice'], plot=plt)
1stFlrSF¶
First Floor square feet
We will apply log transformation and change the column name as "FrstFlrSF" due to transforming in .csv file. We would like to have columns name without numbers to reduce errors when we will loading the data
df['FrstFlrSF'] = np.log(df['1stFlrSF'])
df=df.drop(columns=['1stFlrSF'])
#histogram and normal probability plot
sns.distplot((df['FrstFlrSF']), fit=norm);
fig = plt.figure()
res = stats.probplot(df['FrstFlrSF'], plot=plt)
df=df.drop(columns=['2ndFlrSF'])
LowQualFinSF¶
Low quality finished square feet (all floors)
Most values 0 so I will remove that column
df[df['LowQualFinSF']==0]['LowQualFinSF'].count()
df=df.drop(columns=['LowQualFinSF'])
LotFrontage¶
Linear feet of street connected to property
df['LotFrontage'].isna().sum()
There are 259 missing values. Replacing the missing values with the mean / median / mode is a crude way of treating missing values. Depending on the context, like if the variation is low or if the variable has low leverage over the response, such a rough approximation is acceptable and could possibly give satisfactory results.
To fix the missing value problem, we will fill in those locations with the median due to impact of the outliers.
sns.boxplot(df[~np.isnan(df['LotFrontage'])]['LotFrontage'], orient='v' )
print (df['LotFrontage'].describe())
print(df['LotFrontage'].median())
df['LotFrontage']=df['LotFrontage'].fillna(df['LotFrontage'].median())
LotArea¶
Lot size in square feet
df[['LotArea']]=np.sqrt(df[['LotArea']])
sns.distplot(df['LotArea'], fit=norm);
fig = plt.figure()
res = stats.probplot(df['LotArea'], plot=plt)
ax = sns.regplot(x="SalePrice", y="LotArea", data=df)
This will be a very important feature within my analysis, due to such a high correlation with Saleprice.
GrLivArea¶
Above grade (ground) living area square feet
sns.distplot(np.sqrt(df['GrLivArea']), fit=norm);
fig = plt.figure()
res = stats.probplot(np.sqrt(df['GrLivArea']), plot=plt)
df['GrLivArea']=np.sqrt(df['GrLivArea'])
'MasVnrArea'¶
Masonry veneer area in square feet
print ('missing Values')
print(df['MasVnrArea'].isna().sum())
print ('Values=0')
print(df[df['MasVnrArea']==0]['MasVnrArea'].count())
df[df['MasVnrArea']==0]['MasVnrArea'].count()
There are 5 missing values and 861 values are 0. We need to check 0 values. We will look at the 'MasVnrType' to fill vlues which equal to 0.
#box plot overallqual/saleprice
var = 'MasVnrType'
data = pd.concat([df['MasVnrArea'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(10,5))
fig = sns.boxplot(x=var, y='MasVnrArea', data=df)
fig.axis(ymin=0, ymax=1600);
it is highly correlated with "MasVnrType" (if "MasVnrType = "None" then it has to be equal to 0). So, I will drop this feature.
df=df.drop(columns=['MasVnrArea'])
MasVnrType¶
Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
print("na Valuest")
print(df['MasVnrType'].isna().sum())
The 8 values of variable are na so I will fill na as 0
# I will fill na as 0
df['MasVnrType']=df['MasVnrType'].fillna(0)
pd.Series(df['MasVnrType']).value_counts().plot('bar')
#box plot overallqual/saleprice
var = 'MasVnrType'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(10,5))
fig = sns.boxplot(x=var, y='SalePrice', data=df)
Since this a categorical feature without order, I will create dummy features here.
cleanup_nums = {"MasVnrType": {"None": 1, "BrkFace": 2, "Stone":3, "BrkCmn":4}}
df.replace(cleanup_nums, inplace=True)
#box plot
var = 'MasVnrType'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(10,5))
fig = sns.boxplot(x=var, y='SalePrice', data=df)
BsmtFinSF1¶
Type 1 finished square feet
we would like to see if there is any coralation between 'BsmtFinSF1' and 'BldgType'
#box plot
var = 'BldgType'
data = pd.concat([df['BsmtFinSF1'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(10,5))
fig = sns.boxplot(x=var, y='BsmtFinSF1', data=df)
#box plot
var = 'BldgType'
data = pd.concat([df[df['BsmtFinSF1']>0]['BsmtFinSF1'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(10,5))
fig = sns.boxplot(x=var, y=df[df['BsmtFinSF1']>0]['BsmtFinSF1'], data=df)
print ('missing Values')
print(df['BsmtFinSF1'].isna().sum())
print ('Values=0')
print(df[df['BsmtFinSF1']==0]['BsmtFinSF1'].count())
#histogram and normal probability plot
sns.distplot(np.sqrt(df[df['BsmtFinSF1']>0]['BsmtFinSF1']), fit=norm);
fig = plt.figure()
res = stats.probplot(np.sqrt(df[df['BsmtFinSF1']>0]['BsmtFinSF1']), plot=plt)
ax = sns.regplot(x=df[df.BsmtFinSF1>0]["SalePrice"], y=np.sqrt(df[df.BsmtFinSF1>0]['BsmtFinSF1']), data=df)
Aplying sqrt to values is showing normal distribution behave.
df['BsmtFinSF1']=np.sqrt(df['BsmtFinSF1'])
'BsmtFinSF2'¶
Type 2 finished square feet Most Values are 0 so I will put flag on it. And I also apply sqrt to transform it.
print ('missing Values')
print(df['BsmtFinSF2'].isna().sum())
print ('Values=0')
print(df[df['BsmtFinSF2']==0]['BsmtFinSF2'].count())
df['BsmtFinSF2_Flag']=np.sqrt(df['BsmtFinSF2'])
dfTrain=df.drop(columns=['BsmtFinSF2'])
'BsmtUnfSF'¶
Unfinished square feet of basement area
print ('missing Values')
print(df['BsmtUnfSF'].isna().sum())
print ('Values=0')
print(df[df['BsmtUnfSF']==0]['BsmtUnfSF'].count())
#box plot
var = 'BldgType'
data = pd.concat([df['BsmtUnfSF'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(10,5))
fig = sns.boxplot(x=var, y=df['BsmtUnfSF'], data=df)
print((df[df['BldgType']=='Duplex']['BsmtUnfSF']==0).count())
ax = sns.regplot(x=df[df.BsmtUnfSF>0]["SalePrice"], y=np.sqrt(df[df.BsmtUnfSF>0]['BsmtUnfSF']), data=df)
I will transform it sqrt
df['BsmtUnfSF']=np.sqrt(df['BsmtUnfSF'])
print ('missing Values')
print(df['TotalBsmtSF'].isna().sum())
print ('Values=0')
print(df[df['TotalBsmtSF']==0]['TotalBsmtSF'].count())
There is no missing value, but it has 37 values equal to 0.
#box plot
var = 'BldgType'
data = pd.concat([df['TotalBsmtSF'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(10,5))
fig = sns.boxplot(x=var, y=df['TotalBsmtSF'], data=df)
Most of Duplex's doesnt have basement.
#histogram and normal probability plot
sns.distplot(np.sqrt(df[df['TotalBsmtSF']>0]['TotalBsmtSF']), fit=norm);
fig = plt.figure()
res = stats.probplot(np.sqrt(df[df['TotalBsmtSF']>0]['TotalBsmtSF']), plot=plt)
ax = sns.regplot(x=df[df.TotalBsmtSF>0]["SalePrice"], y=np.sqrt(df[df.TotalBsmtSF>0]['TotalBsmtSF']), data=df)
I will apply sqrt to transform it
df['TotalBsmtSF']=np.sqrt(df['TotalBsmtSF'])
BsmtHalfBath:¶
Basement half bathrooms
it was assigned as 1 and 0.
print ('missing Values')
print(df['BsmtHalfBath'].isna().sum())
pd.Series(df['BsmtHalfBath']).value_counts().plot('bar')
MSZoning¶
Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
print ('missing Values')
print(df['MSZoning'].isna().sum())
pd.Series(df['MSZoning']).value_counts().plot('bar')
#box plot overallqual/saleprice
var = 'MSZoning'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(10,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
cleanup_nums = {"MSZoning": {"C (all)": 1, "RM": 2, "RH":3, "RL":4, "FV":5}}
df.replace(cleanup_nums, inplace=True)
#df = df.copy()
#df = pd.get_dummies(df, columns=['MSZoning'], prefix = ['MSZoning'])
#df.head()
#df=df.drop(columns=['MSZoning_RH'])
#df=df.drop(columns=['MSZoning_C (all)'])
#df=df.drop(columns=['MSZoning'])
One-Hot encoding:The basic strategy is to convert each category value into a column and assign a 1, 2, 3, 4, and 5. This has the benefit of not weighting a value improperly.
print ('missing Values')
print(df['Street'].isna().sum())
pd.Series(df['Street']).value_counts().plot('bar')
The categorical value represents the numerical value of the entry in the dataset. The categories are of type character, so I will use the following function to enumerate them
cleanup_nums = {"Street": {"Pave": 1, "Grvl": 2}}
df.replace(cleanup_nums, inplace=True)
print(df['Alley'].isna().sum())
pd.Series(df['Alley']).value_counts().plot('bar')
Since he most of valuse are missing and it is not a decisive feature on sale price so I will drop "Alley"
df=df.drop(columns=['Alley'])
LotShape¶
General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
print(df['LotShape'].isna().sum())
pd.Series(df['LotShape']).value_counts().plot('bar')
#box plot overallqual/saleprice
var = 'LotShape'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=dfTrain)
I will enumerate characters as following;
cleanup_nums = {"LotShape": {"Reg": 1, "IR1": 2, "IR2": 3, "IR3": 4}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'LotShape'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
LandContour¶
Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
print("na Values")
print(df['LandContour'].isna().sum())
pd.Series(df['LandContour']).value_counts().plot('bar')
#box plot
var = 'LandContour'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
I will enumerate characters as following;
cleanup_nums = {"LandContour": {"Lvl": 1, "Bnk": 2, "HLS": 3, "Low": 4}}
df.replace(cleanup_nums, inplace=True)
#box plot
var = 'LandContour'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
Utilities¶
Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
print("na Values")
print(df['Utilities'].isna().sum())
pd.Series(df['Utilities']).value_counts().plot('bar')
I will drop it becouse all of the homes have "AllPub".
df=df.drop(columns=['Utilities'])
LotConfig¶
Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
print("na Values")
print(df['LotConfig'].isna().sum())
pd.Series(df['LotConfig']).value_counts().plot('bar')
I will enumerate characters as following;
cleanup_nums = {"LotConfig": {"Inside": 1, "Corner": 2, "CulDSac": 3, "FR2": 4, "FR3": 5}}
df.replace(cleanup_nums, inplace=True)
#box plot
var = 'LotConfig'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
print("na Values")
print(df['LandSlope'].isna().sum())
pd.Series(df['LandSlope']).value_counts().plot('bar')
I will enumerate characters as following;
#cleanup_nums = {"LandSlope": {"Gtl": 1, "Mod": 2, "Sev": 3}}
#df.replace(cleanup_nums, inplace=True)
#box plot
var = 'LandSlope'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
df = df.copy()
df = pd.get_dummies(df, columns=['LandSlope'], prefix = ['LandSlope'])
df.head()
Neighborhood¶
Physical locations within Ames city limits
Blmngtn Bloomington Heights
Blueste Bluestem
BrDale Briardale
BrkSide Brookside
ClearCr Clear Creek
CollgCr College Creek
Crawfor Crawford
Edwards Edwards
Gilbert Gilbert
IDOTRR Iowa DOT and Rail Road
MeadowV Meadow Village
Mitchel Mitchell
Names North Ames
NoRidge Northridge
NPkVill Northpark Villa
NridgHt Northridge Heights
NWAmes Northwest Ames
OldTown Old Town
SWISU South & West of Iowa State University
Sawyer Sawyer
SawyerW Sawyer West
Somerst Somerset
StoneBr Stone Brook
Timber Timberland
Veenker Veenker
print("na Values")
print(df['Neighborhood'].isna().sum())
pd.Series(df['Neighborhood']).value_counts().plot('bar')
Since this is a categorical feature without order, I will create dummy features.
cleanup_nums = {"Neighborhood": {"Blmngtn":0, "Blueste":1, "BrDale":2, "BrkSide":3, "ClearCr":4, "CollgCr": 5, "Crawfor":6, "Edwards":7, "Gilbert":8,¶
"IDOTRR":9, "MeadowV":10, "Mitchel":11, "Names":12, "NoRidge":13, "NPkVill":14,
"NridgHt":15, "NWAmes":16, "OldTown":17, "SWISU":18, "Sawyer":19,
"SawyerW":20, "Somerst":21, "StoneBr":22,
"Timber":23, "Veenker":24, "Names":25}}
df.replace(cleanup_nums, inplace=True)
box plot¶
var = 'Neighborhood' data = pd.concat([df['SalePrice'], df[var]], axis=1) f, ax = plt.subplots(figsize=(15,10)) fig = sns.boxplot(x=var, y="SalePrice", data=df)
df = df.copy()
df = pd.get_dummies(df, columns=['Neighborhood'], prefix = ['Neighborhood'])
df.head()
Condition1¶
Proximity to various conditions
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to postive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
print("na Values")
print(df['Condition1'].isna().sum())
pd.Series(df['Condition1']).value_counts().plot('bar')
cleanup_nums = {"Condition1": {"Norm": 1, "Feedr": 2, "Artery":3, "RRAn":4,"PosN":5, "RRAe":6, "PosA":7, "RRNn":8, "RRNe":9}}
df.replace(cleanup_nums, inplace=True)
#box plot
var = 'Condition1'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
df = df.copy() df = pd.get_dummies(df, columns=['Condition1'], prefix = ['Condition1'])
df.head()
Condition2¶
Proximity to various conditions (if more than one is present)
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to postive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
print("na Values")
print(df['Condition2'].isna().sum())
pd.Series(df['Condition2']).value_counts().plot('bar')
cleanup_nums = {"Condition2": {"Norm": 1, "Feedr": 2, "Artery":3, "RRAn":4,"PosN":5, "RRAe":6, "PosA":7, "RRNn":8, "RRNe":9}}
df.replace(cleanup_nums, inplace=True)
#box plot
var = 'Condition2'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
BldgType¶
Type of dwelling
1Fam Single-family Detached
2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit
print("na Values")
print(df['BldgType'].isna().sum())
pd.Series(df['BldgType']).value_counts().plot('bar')
I will enumerate characters as following;
#Lets asign 1Fam=1, 2FmCon=2, Duplx=3, TwnhsE=4, TwnhsI=5
# For each row in the column,
BldgType=[]
for row in df['BldgType']:
if row =='1Fam':
BldgType.append(1)
elif row =='2fmCon':
BldgType.append(2)
elif row =='Duplex':
BldgType.append(3)
elif row =='TwnhsE':
BldgType.append(4)
elif row =='Twnhs':
BldgType.append(5)
#change asign numbers to column values
df['BldgType']=pd.DataFrame(BldgType)
df['BldgType'].head()
var = 'BldgType'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
HouseStyle¶
Style of dwelling
1Story One story
1.5Fin One and one-half story: 2nd level finished
1.5Unf One and one-half story: 2nd level unfinished
2Story Two story
2.5Fin Two and one-half story: 2nd level finished
2.5Unf Two and one-half story: 2nd level unfinished
SFoyer Split Foyer
SLvl Split Level
print("na Values")
print(df['HouseStyle'].isna().sum())
pd.Series(df['HouseStyle']).value_counts().plot('bar')
cleanup_nums = {"HouseStyle": {"1Story": 1, "1.5Unf": 1.5, "1.5Fin":1.8, "2Story":2,"2.5Unf":2.5, "2.5Fin":2.8, "SFoyer":3, "SLvl":4}}
df.replace(cleanup_nums, inplace=True)
#box plot
var = 'HouseStyle'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
OverallQual¶
Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
print("na Values")
print(df['OverallQual'].isna().sum())
pd.Series(df['OverallQual']).value_counts().plot('bar')
#box plot
var = 'OverallQual'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
OverallCond¶
Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
print("na Values")
print(df['OverallCond'].isna().sum())
pd.Series(df['OverallCond']).value_counts().plot('bar')
#box plot
var = 'OverallCond'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
YearRemodAdd¶
Remodel date (same as construction date if no remodeling or additions)
(df['YearRemodAdd']-df['YearBuilt']).head()
If the remodeling and building date are same I will assign 0, otherwise, I will assign 1.
# Create a list to store the data
YY=df['YearRemodAdd']-df['YearBuilt']
# For each row in the column,
RemodAdd=[]
for row in YY:
if row == 0:
RemodAdd.append(0)
elif row>0:
RemodAdd.append(1)
df['RemodAdd']=pd.DataFrame(RemodAdd)
df.head()
YearBuilt¶
Here we can see a fairly consistent upward trend for the SalePrice as houses are more modern.
ax = sns.regplot(x="SalePrice", y="YearBuilt", data=df)
there is linear a correlation between 'YearBuilt' and 'Sale Price'.
For this feature, I am going to create bins and dummy features
df['YearBuilt'].describe()
bins = [1872, 1954, 1973, 1990, 2000, 2010]
df['binned'] = pd.cut(df['YearBuilt'], bins)
df = df.copy()
df = pd.get_dummies(df, columns=['binned'], prefix = ['binned'])
df.head()
df=df.rename(columns={"binned_(1872, 1954]":"BuiltYear1", "binned_(1954, 1973]":"BuiltYear2",
"binned_(1973, 1990]":"BuiltYear3","binned_(1990, 2000]":"BuiltYear4",
"binned_(2000, 2010]":"BuiltYear5"})
df.head()
print("na Values")
print(df['RoofStyle'].isna().sum())
pd.Series(df['RoofStyle']).value_counts().plot('bar')
#I will assign numbers for each type of roof
cleanup_nums = {"RoofStyle": {"Gable": 1, "Hip": 2, "Flat":3, "Gambrel":4,"Mansard":5, "Shed":6 }}
df.replace(cleanup_nums, inplace=True)
#box plot
var = 'RoofStyle'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
RoofMatl¶
Roof material
ClyTile Clay or Tile
CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles
print("na Values")
print(df['RoofMatl'].isna().sum())
pd.Series(df['RoofMatl']).value_counts().plot('bar')
#I will assign numbers for each type of roof matarial
cleanup_nums = {"RoofMatl": {"CompShg": 1, "Tar&Grv": 2, "WdShngl":3, "WdShake":4,"ClyTile":4, "Roll":4, "Membran":4, "Metal":4 }}
df.replace(cleanup_nums, inplace=True)
#box plot
var = 'RoofMatl'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
Exterior1st¶
Exterior covering on house
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
*They very close eachother
*I will put flag on the they ar smilar so I will drop 'Exterior2nd'
df=df.drop(columns=['Exterior2nd'])
print("na Values for Exterior1st")
print(df['Exterior1st'].isna().sum())
pd.Series(df['Exterior1st']).value_counts().plot('bar')
#I will assign numbers for each type of
cleanup_nums = {"Exterior1st": {"VinylSd": 1, "HdBoard": 2, "MetalSd":3, "Wd Sdng":4,"Plywood":5, "CemntBd":6, "BrkFace":7, "WdShing":8, "Stucco":9, "AsbShng":10,"Stone":0,"BrkComm":0, "ImStucc":0, "AsphShn":0, "CBlock":0}}
df.replace(cleanup_nums, inplace=True)
#box plot 'Exterior1st'
var = 'Exterior1st'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
ExterQual¶
Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
print("na Values")
print(df['ExterQual'].isna().sum())
pd.Series(df['ExterQual']).value_counts().plot('bar')
#I will assign numbers
cleanup_nums = {"ExterQual": {"Fa": 2, "TA": 3, "Gd":4, "Ex":5}}
df.replace(cleanup_nums, inplace=True)
#box plot 'Exterior1st'
var = 'ExterQual'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
There is high liner coralation between "ExterQual" and "SalePrice"
ExterCond¶
Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
print("na Values")
print(df['ExterCond'].isna().sum())
pd.Series(df['ExterCond']).value_counts().plot('bar')
#I will assign numbers
cleanup_nums = {"ExterCond": {"Po":1, "Fa": 2, "TA": 3, "Gd":4, "Ex":5}}
df.replace(cleanup_nums, inplace=True)
#box plot 'Exterior1st'
var = 'ExterCond'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
Foundation¶
Type of foundation
BrkTil Brick & Tile
CBlock Cinder Block
PConc Poured Contrete
Slab Slab
Stone Stone
Wood Wood
print("na Values")
print(df['Foundation'].isna().sum())
pd.Series(df['Foundation']).value_counts().plot('bar')
#I will assign numbers
cleanup_nums = {"Foundation": {"BrkTil":3, "CBlock":2, "PConc":1, "Slab":4, "Stone":5, "Wood":6}}
df.replace(cleanup_nums, inplace=True)
#box plot 'Exterior1st'
var = 'Foundation'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
BsmtQual¶
Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches
NA No Basement
print("na Values")
print(df['BsmtQual'].isna().sum())
The 37 values of variable are na so I will fill na as 0
# I will fill na as 0
df['BsmtQual']=df['BsmtQual'].fillna(0)
pd.Series(df['BsmtQual']).value_counts().plot('bar')
#I will assign numbers
cleanup_nums = {"BsmtQual": {"Fa": 2, "TA": 3, "Gd":4, "Ex":5}}
df.replace(cleanup_nums, inplace=True)
#box plot 'Exterior1st'
var = 'BsmtQual'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
BsmtCond¶
Evaluates the general condition of the basement
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
print("na Values")
print(df['BsmtCond'].isna().sum())
The 37 values of variable are na so I will fill na as 0
# I will fill na as 0
df['BsmtCond']=df['BsmtCond'].fillna(0)
pd.Series(df['BsmtCond']).value_counts().plot('bar')
#I will assign numbers
cleanup_nums = {"BsmtCond": {"Po": 1, "Fa": 2, "TA": 3, "Gd":4, "Ex":5}}
df.replace(cleanup_nums, inplace=True)
#box plot 'Exterior1st'
var = 'BsmtCond'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
BsmtExposure¶
Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Mimimum Exposure
No No Exposure
NA No Basement
print("na Values")
print(df['BsmtExposure'].isna().sum())
The 38 values of variable are na so I will fill na as 0
# I will fill na as 0
df['BsmtExposure']=df['BsmtExposure'].fillna(0)
pd.Series(df['BsmtExposure']).value_counts().plot('bar')
#I will assign numbers
cleanup_nums = {"BsmtExposure": {"No": 1, "Mn": 2, "Av": 3, "Gd":4}}
df.replace(cleanup_nums, inplace=True)
#box plot 'Exterior1st'
var = 'BsmtExposure'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
BsmtFinType2¶
Rating of basement finished area (if multiple types)
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinshed
NA No Basement
Since 'Bsmt Fin Type 1' and 'Bsmt FreeType2' are highly correlated I will assign as 1 and 0
df[['BsmtFinType2']].isna().sum()
# For each row in the column,
BsmtFTM_Flag=[]
x=df['BsmtFinType1']
y=df['BsmtFinType2']
ZZ=x == y
YY=pd.DataFrame(ZZ)
for row in YY[0]:
if row==True:
BsmtFTM_Flag.append(1)
else:
BsmtFTM_Flag.append(0)
df['BsmtFTM_Flag']=pd.DataFrame(BsmtFTM_Flag)
df.head()
df=df.drop(columns=['BsmtFinType2'])
BsmtFinType1¶
Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinshed
NA No Basement
print("na Values")
print(df['BsmtFinType1'].isna().sum())
Since NA is No BasementThe, I will fill na as 0
# I will fill na as 0
df['BsmtFinType1']=df['BsmtFinType1'].fillna(0)
pd.Series(df['BsmtFinType1']).value_counts().plot('bar')
#I will assign numbers
cleanup_nums = {"BsmtFinType1": {"Unf": 1, "GLQ": 2, "ALQ": 3, "BLQ":4, "Rec": 5, "LwQ":5}}
df.replace(cleanup_nums, inplace=True)
#box plot 'BsmtFinType1'
var = 'BsmtFinType1'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
Heating¶
Type of heating
Floor Floor Furnace
GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace
print("na Values")
print(df['Heating'].isna().sum())
pd.Series(df['Heating']).value_counts().plot('bar')
It can be seen frm figure most of properties have GasA and coralataion between SalePrice and Gas is depends on having heating system or not so I will create dummy with 1 and 0
#box plot overallqual/saleprice
var = 'Heating'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
#I will assign numbers
#cleanup_nums = {"Heating": {"GasA":6, 'Floor':2, 'GasW':5,'Grav':1, 'OthW':4, 'Wall':3}}
#df.replace(cleanup_nums, inplace=True)
#box plot 'BsmtFinType1'
var = 'Heating'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
df = df.copy()
df = pd.get_dummies(df, columns=['Heating'], prefix = ['Heating'])
df.head()
df=df.drop(columns=['Heating_Floor', 'Heating_GasW','Heating_Grav', 'Heating_OthW', 'Heating_Wall'])
df=df.rename(columns = {'Heating_GasA':'GasA_Flag'})
df=df.rename(columns = {'Heating_GasA':'GasA_Flag'})
print("na Values")
print(df['HeatingQC'].isna().sum())
pd.Series(df['HeatingQC']).value_counts().plot('bar')
#I will assign numbers
cleanup_nums = {"HeatingQC": {"Po": 1, "Fa": 2, "TA": 3, "Gd":4, "Ex":5}}
df.replace(cleanup_nums, inplace=True)
#box plot 'BsmtFinType1'
var = 'HeatingQC'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
print("na Values")
print(df['CentralAir'].isna().sum())
pd.Series(df['CentralAir']).value_counts().plot('bar')
# For each row in the column,
CentralAir=[]
for row in df['CentralAir']:
if row == 'Y':
CentralAir.append(1)
elif row=='N':
CentralAir.append(0)
df['CentralAir']=pd.DataFrame(CentralAir)
df['CentralAir'].head()
Electrical¶
Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
print("na Values")
print(df['Electrical'].isna().sum())
df['Electrical']=df['Electrical'].fillna(0)
pd.Series(df['Electrical']).value_counts().plot('bar')
cleanup_nums = {"Electrical": {"SBrkr": 5, "FuseA": 4, "FuseF": 3, "FuseP":2, "Mix":1}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'Electrical'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
print("na Values")
print(df['KitchenQual'].isna().sum())
pd.Series(df['KitchenQual']).value_counts().plot('bar')
cleanup_nums = {"KitchenQual": {"Fa": 2, "TA": 3, "Gd":4, "Ex":5}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'KitchenQual'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df, order=[2,3,4,5])
Functional¶
Home functionality (Assume typical unless deductions are warranted)
Typ Typical Functionality
Min1 Minor Deductions 1
Min2 Minor Deductions 2
Mod Moderate Deductions
Maj1 Major Deductions 1
Maj2 Major Deductions 2
Sev Severely Damaged
Sal Salvage only
print("na Values")
print(df['Functional'].isna().sum())
pd.Series(df['Functional']).value_counts().plot('bar')
cleanup_nums = {"Functional": {"Typ": 1, "Min1": 2, "Min2": 3, "Mod":4, "Maj1":5, "Maj2":6, "Sev":7}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'Functional'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df, order=[1,2,3,4,5,6,7])
FireplaceQu¶
Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
print("na Values")
print(df['FireplaceQu'].isna().sum())
690 values are na for "FireplaceQu" so I will check the 'Fireplaces'
print("na Values")
print(df['Fireplaces'].isna().sum())
pd.Series(df['Fireplaces']).value_counts().plot('bar')
df[['Fireplaces', 'FireplaceQu']].head()
It can be seen no fire place is same as na fire place quality so I will asign the na as 0
df['FireplaceQu']=df['FireplaceQu'].fillna(0)
cleanup_nums = {"FireplaceQu": {"Po":1, "Fa": 2, "TA": 3, "Gd":4, "Ex":5}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'FireplaceQu'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df, order=[0, 1,2,3,4,5])
GarageArea¶
Size of garage in square feet
df[df['GarageArea']==0]['GarageArea'].count()
ax = sns.regplot(x="SalePrice", y=np.sqrt(df["GarageArea"]), data=df)
df['GarageArea']=np.sqrt(df['GarageArea'])
GarageType and GarageFinish¶
GarageType: Garage location
2Types More than one type of garage
Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage
GarageFinish : Interior finish of the garage
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
df[["GarageFinish","GarageType" ]].isna().sum()
df[df.GarageType.isnull()][['GarageType', "GarageFinish"]].head()
Since NA mean is No Garage I will asign it with 0.
df['GarageFinish']=df['GarageFinish'].fillna(0)
df['GarageType']=df['GarageType'].fillna(0)
pd.Series(df['GarageType']).value_counts().plot('bar')
cleanup_nums = {"GarageType": {"Attchd":1, "Detchd": 2, "BuiltIn": 3, "CarPort":4, "Basment":5, "2Types":6}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'GarageType'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df, order=[0, 1,2,3,4,5,6])
pd.Series(df['GarageFinish']).value_counts().plot('bar')
cleanup_nums = {"GarageFinish": {"Unf":1, "RFn": 2, "Fin": 3}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'GarageFinish'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df, order=[0, 1,2,3])
GarageQual and GarageCond¶
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
df[["GarageQual", "GarageCond" ]].isna().sum()
Since NA mean is No Garage I will asign it with 0.
df['GarageQual']=df['GarageQual'].fillna(0)
df['GarageCond']=df['GarageCond'].fillna(0)
pd.Series(df['GarageQual']).value_counts().plot('bar')
pd.Series(df['GarageCond']).value_counts().plot('bar')
cleanup_nums = {"GarageQual": {"Po":1, "Fa": 2, "TA": 3, "Gd":4, "Ex":5}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'GarageQual'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df, order=[0, 1,2,3, 4, 5])
cleanup_nums = {"GarageCond": {"Po":1, "Fa": 2, "TA": 3, "Gd":4, "Ex":5}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'GarageCond'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df, order=[0, 1,2,3, 4, 5])
ax = sns.regplot(x="GarageCond", y="GarageQual", data=df)
df=df.drop(columns=['GarageQual'])
GarageYrBlt¶
Year garage was built
df[["GarageYrBlt" ]].isna().sum()
df['GarageYrBlt']=df['GarageYrBlt'].fillna(0)
#create column for new variable (one is enough because it's a binary categorical feature)
#if area>0 it gets 1, for area==0 it gets 0
df['HasGrg'] = pd.Series(len(df['GarageYrBlt']), index=df.index)
df['HasGrg'] = 0
df.loc[df['GarageYrBlt']>0,'HasGrg'] = 1
df=df.drop(columns=['GarageYrBlt']) #ekledim
df[["PavedDrive" ]].isna().sum()
pd.Series(df["PavedDrive"]).value_counts().plot('bar')
cleanup_nums = {"PavedDrive": {"Y":3, "N": 1, "P": 2}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'PavedDrive'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df, order=[1,2,3])
print("NA Pools")
print(df[["PoolQC" ]].isna().sum())
print("0 area Pools")
print(df[df.PoolArea==0]["PoolArea"].count())
df[["PoolQC", "PoolArea" ]].head()
I will asign NA as 0
df['PoolQC']=df['PoolQC'].fillna(0)
pd.Series(df["PoolQC"]).value_counts().plot('bar')
cleanup_nums = {"PoolQC": {"Fa": 2, "Gd":4, "Ex":5}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'PoolQC'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df, order=[0, 2, 4, 5])
df=df.drop(columns=['PoolQC'])
PoolArea¶
Pool area in square feet
print("NA Pools")
print(df[["PoolArea" ]].isna().sum())
print("0 area Pools")
print(df[df.PoolArea==0]["PoolArea"].count())
df[df.PoolArea==0]["PoolArea"].count()
#create column for new variable (one is enough because it's a binary categorical feature)
#if area>0 it gets 1, for area==0 it gets 0
df['HasPool'] = pd.Series(len(df['PoolArea']), index=df.index)
df['HasPool'] = 0
df.loc[df['PoolArea']>0,'HasPool'] = 1
#df=df.drop(columns=['PoolArea']) #ekledim
df=df.drop(columns=['PoolArea'])
Fence¶
Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
print("NA")
print(df[["Fence"]].isna().sum())
#I will asign NA as 0
df['Fence']=df['Fence'].fillna(0)
pd.Series(df["Fence"]).value_counts().plot('bar')
cleanup_nums = {"Fence": {"MnPrv": 1,"GdPrv":2, "GdWo":4,"MnWw":3}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'Fence'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df, order=[0, 1, 2, 3, 4])
MiscFeature¶
Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
print("NA")
print(df[["MiscFeature"]].isna().sum())
#I will asign NA as 0
df['MiscFeature']=df['MiscFeature'].fillna(0)
pd.Series(df["MiscFeature"]).value_counts().plot('bar')
cleanup_nums = {"MiscFeature": {"Shed": 2,"Gar2":3, "Othr":1,"TenC":4}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'MiscFeature'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df, order=[0, 1, 2, 3, 4])
SaleType¶
Type of sale
WD Warranty Deed - Conventional
CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
COD Court Officer Deed/Estate
Con Contract 15% Down payment regular terms
ConLw Contract Low Down payment and low interest
ConLI Contract Low Interest
ConLD Contract Low Down
Oth Other
print("NA")
print(df[["SaleType"]].isna().sum())
pd.Series(df["SaleType"]).value_counts().plot('bar')
cleanup_nums = {"SaleType": {"WD": 1,"New":2, "COD":3,"ConLD":4,"ConLw":5, "ConLI":6, "CWD":7, "Oth":8, "Con":9}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'SaleType'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
SaleCondition¶
Condition of sale
Normal Normal Sale
Abnorml Abnormal Sale - trade, foreclosure, short sale
AdjLand Adjoining Land Purchase
Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
Family Sale between family members
Partial Home was not completed when last assessed (associated with New Homes)
print("NA")
print(df[["SaleCondition"]].isna().sum())
pd.Series(df["SaleCondition"]).value_counts().plot('bar')
cleanup_nums = {"SaleCondition": {"Normal": 1,"Partial":2, "Abnorml":3,"Family":4,"Alloca":5, "AdjLand":6}}
df.replace(cleanup_nums, inplace=True)
#box plot overallqual/saleprice
var = 'SaleCondition'
data = pd.concat([df['SalePrice'], df[var]], axis=1)
f, ax = plt.subplots(figsize=(7,5))
fig = sns.boxplot(x=var, y="SalePrice", data=df)
df['OpenPorchSF']=np.sqrt(df['OpenPorchSF'])
df[df.OpenPorchSF==0]['OpenPorchSF'].count()
ax = sns.regplot(x="SalePrice", y="OpenPorchSF", data=df)
EnclosedPorch¶
Enclosed porch area in square feet
df['EnclosedPorch']=np.sqrt(df['EnclosedPorch'])
print("NA")
print(df[["EnclosedPorch"]].isna().sum())
df[df.EnclosedPorch==0]['EnclosedPorch'].count()
df[df.EnclosedPorch>0]['EnclosedPorch'].count()
more than 1200 values are missing or 0 so I will remove that columnm
df=df.drop(columns=['EnclosedPorch'])
3SsnPorch¶
Three season porch area in square feet
print("NA")
print(df[["3SsnPorch"]].isna().sum())
df[df['3SsnPorch']==0]['3SsnPorch'].count()
more than 1400 values are missing or 0 so I will remve that columnm
df=df.drop(columns=['3SsnPorch'])
ScreenPorch¶
Screen porch area in square feet
print("NA")
print(df[["ScreenPorch"]].isna().sum())
df[df['ScreenPorch']==0]['ScreenPorch'].count()
df=df.drop(columns=['ScreenPorch'])
WoodDeckSF¶
Wood deck area in square feet
print("NA")
print(df[["WoodDeckSF"]].isna().sum())
df[df['WoodDeckSF']==0]['WoodDeckSF'].count()
df['WoodDeckSF']=np.sqrt(df['WoodDeckSF'])
Replace SalePrice as target¶
XX=df['SalePrice']
df=df.drop(columns=['SalePrice'])
df['SalePrice']=pd.DataFrame(XX)
df.head()
3. Saving the Data as a .CSV File¶
Now that the problems associated with the data are ironed out, we can save this version of the file to separate file. The data will be saved to a .csv due to its versatility.
#df.to_csv('HPP.csv', index=False)
When needed, this file can now be loaded using the command df = pd.read_csv('HPP.csv') to variable df.