House Price Predictions Using Keras

THE FOREFRONT OF TECHNOLOGY

3 years ago

This is a starter tutorial on modeling with Keras that covers hyper-parameter tuning along with callbacks.

Problem Statement

Create a Keras regression model that can accurately analyse the features of a given house and predict its price accordingly.

Steps Involved

1. Analysis and imputation of missing values
2. One-hot encoding of categorical features
3. Exploratory Data Analysis (EDA) and outlier detection
4. Keras regression modelling along with hyper-parameter tuning
5. Training the model with the EarlyStopping callback
6. Prediction and evaluation

Kaggle Notebook Link

Importing Libraries

We would be using numpy and pandas for processing our dataset, matplotlib and seaborn for data visualization, and Keras for implementing our neural network.

Also, we would be using scikit-learn for outlier detection and for scaling our dataset.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()
    from sklearn.preprocessing import StandardScaler  # Standardization
    from sklearn.ensemble import IsolationForest  # Outlier detection
    from keras.models import Sequential  # Sequential neural network
    from keras.layers import Dense
    from keras.callbacks import EarlyStopping  # Early stopping callback
    from keras.optimizers import Adam  # Optimizer
    from kerastuner.tuners import RandomSearch  # Hyper-parameter tuning
    import warnings
    warnings.filterwarnings('ignore')  # To ignore warnings

Loading the Dataset

Here we have used the dataset from House Prices - Advanced Regression Techniques.

    train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
    test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
    y = train['SalePrice'].values
    # Combine train and test so that imputation and encoding are applied to both
    data = pd.concat([train, test], axis=0, sort=False)
    data.drop(['SalePrice'], axis=1, inplace=True)
    data.head()

Figure: First 10 columns of the dataset

Analysis and Imputation of missing values

We would first list all the features that have missing values, across both the training and the testing data.

    missing_values = data.isnull().sum()
    missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
    NAN_col = list(missing_values.to_dict().keys())
    missing_values_data = pd.DataFrame(missing_values)
    missing_values_data.reset_index(level=0, inplace=True)
    missing_values_data.columns = ['Feature', 'Number of Missing Values']
    missing_values_data['Percentage of Missing Values'] = (100.0 * missing_values_data['Number of Missing Values']) / len(data)
    missing_values_data

Figure: Top 20 features with missing values

In total, 33 features have missing values.

For some of the features with the highest percentage of missing values, such as PoolQC, a missing value simply means that the house does not have that feature (in this case, the house does not have a pool). This is evident from the PoolArea feature, which shows a value of 0 for all the rows where PoolQC is missing.

We fill the missing values in the following way:

Basement: This includes BsmtFinSF1, BsmtFinSF2, TotalBsmtSF and BsmtUnfSF. We fill all missing values with 0, since a NaN here simply means that the house does not have a basement, so the basement area is 0.

Electrical: Only one row has a missing value for this feature, so after manual inspection we set it to 'FuseA'.

KitchenQual: Again only one row has a missing value, so we set it to 'TA', which is the most common value for this feature in the dataset.

LotFrontage: We first fill each missing value with the mean LotFrontage of the rows that share the same 1stFlrSF value, because LotFrontage has a high correlation with 1stFlrSF. However, all the LotFrontage values for a particular 1stFlrSF value can be missing. To handle such cases, we then use the interpolate function of pandas to fill the remaining gaps linearly.

MasVnrArea: We apply the same approach as for LotFrontage, this time grouping by MasVnrType.

Others: For the other features we follow the most generic approach: numeric features are filled with the mean of that feature, and categorical features are filled with the string 'NA'.

    data['BsmtFinSF1'].fillna(0, inplace=True)
    data['BsmtFinSF2'].fillna(0, inplace=True)
    data['TotalBsmtSF'].fillna(0, inplace=True)
    data['BsmtUnfSF'].fillna(0, inplace=True)
    data['Electrical'].fillna('FuseA', inplace=True)
    data['KitchenQual'].fillna('TA', inplace=True)
    # Group mean first, then linear interpolation for whatever is still missing
    data['LotFrontage'].fillna(data.groupby('1stFlrSF')['LotFrontage'].transform('mean'), inplace=True)
    data['LotFrontage'].interpolate(method='linear', inplace=True)
    data['MasVnrArea'].fillna(data.groupby('MasVnrType')['MasVnrArea'].transform('mean'), inplace=True)
    data['MasVnrArea'].interpolate(method='linear', inplace=True)

    for col in NAN_col:
        data_type = data[col].dtype
        if data_type == 'object':
            data[col].fillna('NA', inplace=True)
        else:
            data[col].fillna(data[col].mean(), inplace=True)

Adding New Features

After thoroughly understanding the data, we have also created some new features by combining the given ones.

    data['Total_Square_Feet'] = (data['BsmtFinSF1'] + data['BsmtFinSF2'] + data['1stFlrSF'] + data['2ndFlrSF'] + data['TotalBsmtSF'])
    data['Total_Bath'] = (data['FullBath'] + (0.5 * data['HalfBath']) + data['BsmtFullBath'] + (0.5 * data['BsmtHalfBath']))
    data['Total_Porch_Area'] = (data['OpenPorchSF'] + data['3SsnPorch'] + data['EnclosedPorch'] + data['ScreenPorch'] + data['WoodDeckSF'])
    data['SqFtPerRoom'] = data['GrLivArea'] / (data['TotRmsAbvGrd'] + data['FullBath'] + data['HalfBath'] + data['KitchenAbvGr'])

One Hot Encoding of the Categorical Features

We would first see the distribution of features between numeric and categorical.
    column_data_type = []
    for col in data.columns:
        data_type = data[col].dtype
        if data_type in ['int64', 'float64']:
            column_data_type.append('numeric')
        else:
            column_data_type.append('categorical')
    plt.figure(figsize=(15, 5))
    sns.countplot(x=column_data_type)
    plt.show()

Figure: Bar plot of categorical and numeric features

Here we see that the number of categorical features actually exceeds the number of numeric features, which shows how important these features are.

Here we have chosen one-hot encoding to convert these categorical features into numeric ones.

    data = pd.get_dummies(data)

After this operation, our original 80 features have been expanded to 314 features. Basically, each label of a categorical feature turns into a new feature with binary values (1 for present and 0 for not present).
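To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy frame. The column values are made up for illustration and are not taken from the competition data. Passing dtype=int is an assumption made here so that the indicator columns print as 1/0 as described above, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame (hypothetical values): one numeric and one categorical column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Numeric columns pass through unchanged; each label of a categorical
# column becomes its own 0/1 indicator column
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The single Street column becomes two columns, Street_Grvl and Street_Pave, which is how 80 original features can grow to several hundred.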

Now we would split our combined data into training and testing data to do some exploratory analysis on our training data.

    train = data[:1460].copy()
    test = data[1460:].copy()
    train['SalePrice'] = y

Exploratory Data Analysis & Outliers Detection

We would first extract the features from our training dataset that have the highest correlation with SalePrice.

    top_features = train.corr()[['SalePrice']].sort_values(by=['SalePrice'], ascending=False).head(30)
    plt.figure(figsize=(5, 10))
    sns.heatmap(top_features, cmap='rainbow', annot=True, annot_kws={"size": 16}, vmin=-1

Figure: Top 30 features by positive correlation with SalePrice

Thus we have extracted the top 30 features having the highest positive correlation with SalePrice, in descending order. Now we would plot some of these features against SalePrice to find outliers in our dataset.

    def plot_data(col, discrete=False):
        if discrete:
            # Discrete feature: strip plot against SalePrice plus a count plot
            fig, ax = plt.subplots(1, 2, figsize=(14, 6))
            sns.stripplot(x=col, y='SalePrice', data=train, ax=ax[0])
            sns.countplot(x=train[col], ax=ax[1])
            fig.suptitle(str(col) + ' Analysis')
        else:
            # Continuous feature: scatter plot against SalePrice plus a histogram
            fig, ax = plt.subplots(1, 2, figsize=(12, 6))
            sns.scatterplot(x=col, y='SalePrice', data=train, ax=ax[0])
            sns.distplot(train[col], kde=False, ax=ax[1])
            fig.suptitle(str(col) + ' Analysis')

This is the plot function that we would use to plot graphs of various features.

OverallQual

    plot_data('OverallQual', True)

We see there are two outliers with an overall quality of 10 and a sale price of less than 200000.

    train = train.drop(train[(train['OverallQual'] == 10) & (train['SalePrice'] < 200000)].index)

Thus we dropped these outliers from our dataset.
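The drop pattern used above can be sketched on a toy frame: build a boolean mask, take the index labels of the matching rows, and drop them. The values below are invented for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values), with one top-quality house priced very low
toy = pd.DataFrame({
    "OverallQual": [10, 10, 7, 5],
    "SalePrice": [160000, 450000, 210000, 120000],
})

# True for rows with top quality but a suspiciously low price
mask = (toy["OverallQual"] == 10) & (toy["SalePrice"] < 200000)

# Drop by index label, as in the tutorial code
cleaned = toy.drop(toy[mask].index)
print(cleaned)

# Equivalent alternative: keep the rows where the mask is False
cleaned_alt = toy[~mask]
```

Note that the surviving rows keep their original index labels, so the index has gaps until reset_index is called.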

Now we move on to analyze another feature.

Total_Square_Feet

    plot_data('Total_Square_Feet')

This seems to be a more or less appropriate distribution with no outliers.

GrLivArea

    plot_data('GrLivArea')

Again, there are no outliers to eliminate.

Total_Bath

    plot_data('Total_Bath')

Here we see two outliers that have Total_Bath more than 4 but a sale price of less than 200000.

    train = train.drop(train[(train['Total_Bath'] > 4) & (train['SalePrice'] < 200000)].index)

Thus we removed these outliers.

TotalBsmtSF

    plot_data('TotalBsmtSF')

Here as well we see one clear outlier that has TotalBsmtSF more than 3000 but a sale price of less than 300000. The filter below uses a cutoff of 400000.

    train = train.drop(train[(train['TotalBsmtSF'] > 3000) & (train['SalePrice'] < 400000)].index)
    train.reset_index(drop=True, inplace=True)  # To reset the index

Now that we have taken care of the top features of our dataset, we would further remove outliers using the Isolation Forest algorithm.

We use this algorithm because it would be difficult to go through all the features and eliminate the outliers manually, although it was important to do it manually for the features that have a high correlation with SalePrice.

    clf = IsolationForest(max_samples=100, random_state=42)
    clf.fit(train)
    y_noano = clf.predict(train)  # 1 for inliers, -1 for outliers
    y_noano = pd.DataFrame(y_noano, columns=['Top'])
    y_noano[y_noano['Top'] == 1].index.values
    train = train.iloc[y_noano[y_noano['Top'] == 1].index.values]
    train.reset_index(drop=True, inplace=True)
    print("Number of Outliers:", y_noano[y_noano['Top'] == -1].shape[0])
    print("Number of rows without outliers:", train.shape[0])

We would finally use StandardScaler from sklearn to scale our data.

    X = train.copy()
    X.drop(['SalePrice'], axis=1, inplace=True)  # Dropped the y feature
    y = train['SalePrice'].values
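The scaling call itself does not appear in the snippet above. As a minimal sketch of how StandardScaler is typically applied, here it is on a small made-up feature matrix. The array X_toy and its values are hypothetical and are not the tutorial's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 4 houses, 2 features on very different scales
X_toy = np.array([
    [1500.0, 3.0],
    [2000.0, 4.0],
    [1200.0, 2.0],
    [1800.0, 3.0],
])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation,
# then rescales every column to zero mean and unit variance
X_scaled = scaler.fit_transform(X_toy)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

In a setup like this one, the same fitted scaler would usually be reused on the test features with scaler.transform, so that both sets share one scale.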

This takes care of our dataset preprocessing, and we are finally ready for the next step, which is modeling our data.

MODELLING

We would use the RandomSearch tuner from Keras Tuner for hyper-parameter tuning of the model.

    def build_model(hp):
        model =