Goal

To predict the sale prices of homes in Ames, Iowa using the 79 explanatory variables provided.

 

Work Outline

  • Data exploration
    • Examining the data set
    • Clean data:
      • Check for missing values
      • Impute/Replace missing values
      • Change ordinal variables to numerical variables
    • Visualize data and calculate correlations
    • Handle outliers
    • Feature engineering
  • Hypothesis

  • Model Building
    • Analyzing variable distributions
    • Transforming data
    • Scaling data
    • Removing Variables with low variance
    • Creating dummy variables for categorical variables
    • Since this is a prediction problem, I am planning to use a Gradient Boost and an XGBM model to predict sale prices

Data Exploration

Examining the data set

  • The data set consists of 1460 rows and 81 columns including the target column, Sale Price and Id a unique Id for each home
  • There are 79 columns describing various features of the house which can be divided into 6 categories:
  1. Age attributes (5)
    -Year built, Yead Remod Add, YrSold, MoSold, GarageYrBlt

  2. Location,lot and home style attributes (14)
    -Location - MSZoning, MSSubClass, Neighborhood, Street, Alley -Lot - Lot Frontage, Lot Config, Lot shape, Landslope, LandContour -Home style - Building type, House style, Street, Alley

  3. Condition and Quality attributes (14)
    -Condition1, Condition2, ExterCond, ExterQual, BsmtQual, BsmtCond, KitchenQual, GarageQual,GarageCond, HeatingQC, OverallQual, OverallCond, SaleCondition,PoolQC

  4. Technical attributes (15)
    -RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, Foundation, BsmtExposure, Electrical, CentralAir, GarageType, GarageFinish, BsmtFinType1, BsmtFinType2,Utilities, Heating

  5. Size attributes (22)
    -LotArea, GrLivArea, TotalBsmtSF, TotRmsAbvGrd, FullBath, HalfBath, BsmtFullBath, BsmtHalfBath,BedroomAbvGr, KitchenAbvGr, GarageArea, WoodDeckSF, EnclosedPorch, OpenPorchSF, X3SsnPorch,ScreenPorch,BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, X1stFlrSF, X2ndFlrSF,LowQualFinSF

  6. Luxury attributes (7)
    -Fireplaces,GarageCars,Fence,MiscFeature,Pool Area, MiscVal,PavedDrive

  7. Other (2) -Saletype, Functional

  • There are 36 numerical columns and 42 categorical columns. Some of the categorical columns are ordinal and during the data cleaning stage, we will convert these to numerical variables.

  • There is redundancy in some columns:
    -TotalBsmtSF is the sum of BsmtFinSF1, BsmtFinSF2 and BsmtUnSF.
    -GrLIvArea which describes the Above grade (ground) living area square feet is the sum of X1stFlrSF, X2ndFlrSF and LowQualFinSF.

  • We will retain the TotalBsmtSF and GrlivArea and drop the other 6 columns.

  • This leaves us with 73 columns to explore

Check for missing data

  • 3 columns (PoolQC, misc feature,and alley) have more than 90% missing values. Fence and fireplace quality columns have ~80% and ~49% missing values
  • All these columns are factor/categorical variables and the description text describes NA is either “Not Applicable” or “None” for all factor variable columns
  • This means that these five columns have big chunks of data which is a single value. So, we will not explore these features and drop these columns.

Data cleaning

  • Drop 5 columns PoolQC, miscfeature, alley, fence, and fireplace quality
  • We will replace NA with NOTAPP in the remaining factor variables columns which consist NA values
  • We will replace missing values with 0 for all numerical variables
  • We will also convert 8 ordinal variable columns to numerical variables
  • Additionally, utilities, heating, street, condition2, and miscVal columns contain a single value. So, we will eliminate these columns since they wont be very useful for modeling.
  • So far, we have eliminated 15 columns which leaves us with 62 columns

Data Visualization & correlation values

  • For each category attribute as described in data description, we will calculate correlation values
  • Plots and values are displayed for columns that have \(R^2\)>0.25
  • Numerical variables are plotted as scatter plots while categorical variables are plotted as box plots

Age attributes

Age of the house has very little correlation with the saleprice

Location,lot and home style attributes

Of the 14 columns that describe location,lot, and home style attributes, Neighborhood is the only variable that shows a strong correlation to sale price

Condition and Quality attributes

External, kitchen, basement, and overall quality show a strong correlation to sale price

Technical attributes

Garage Finish and Foundation show correlation to sale price.

Size attributes

A number of size attributes have \(R^2\)>0.3. Note that most of these attributes are related to indoor area

Luxury attributes

People are willing to pay extra for a bigger garage but not for fireplaces

 

Table of attributes that have \(R^2\)> 0.25

Attribute \(R^2\)
Foundation 0.26
Year RemodAdd 0.26
Year Built 0.27
TotRmsAbvGrd 0.28
Full Bath 0.31
Garage Finish 0.31
Total BsmtSF 0.38
Bsmt Qual 0.39
Garage Area 0.39
Garage Cars 0.41
Kitchen Qual 0.44
ExterQual 0.47
GrLivinArea 0.50
Neighborhood 0.55
Overall Qual 0.63
  • Based on \(R^2\) values, it seems that overall quality, neighborhood, and grlivingarea(Above ground living area square feet) are the top features that have the strongest correlations (>=0.5) with sale price.
  • External and kitchen quality also seem to have an effect on sale price.
  • It is interesting that while GrLivingArea has a strong impact, the TotRmsAbvGrd doesnt seem to have any impact. Does this mean that the total number of rooms in the house dont matter as long as there is enough space?

Handling outliers

  • Visualizing the data allowed us to observe some outliers. TotalBsmtSF and GrLivingArea show outliers when SF > 4000. So, lets examine these.  
Id YearBuilt GrLivArea TotalBsmtSF SalePrice
524 524 2007 4676 3138 184750
692 692 1994 4316 2444 755000
1183 1183 1996 4476 2396 745000
1299 1299 2008 5642 6110 160000
Despite being relatively newer homes and having a living area and basment area >4000 sqft, Ids 524 and 1299 have a really low price. Something seems wrong here. So, will eliminate these two data points to help modeling

Feature Engineering

Based on our correlation results, lets build some new features

  • GrLiving Area and TotalBsmt Sqft have a strong correlation with saleprice. It might be beneficial to add these two columns to create a new variable called TotalIDRSF
  • Notice that the size variables that had a correlation > 0.25 were all related to indoor area. None of the outdoor area variables showed a correlation. Maybe combining them all might change this. Combine all variables (6 columns) related to outdoor sqft to create a new variable called TotalODRSF
  • Combine TotalIDRSF and TotalODRSF to create TotalSF
  • Combine Full Baths and Half baths to create TotalBaths
  • We will also add columns Total Age and RemodAge
    -Totalage = Difference between year built and year sold
    -Remodage = Difference between year built and RemodAdd

 

Lets take a look at the \(R^2\) values after removing the outliers and adding new feature columns

name new_r2 old_r2
4 YearRemodAdd 0.26 0.26
6 Foundation 0.26 0.26
21 RemodAge 0.26 NA
3 YearBuilt 0.27 0.27
20 TotalAge 0.27 NA
12 TotRmsAbvGrd 0.29 0.28
13 GarageFinish 0.31 0.31
10 FullBath 0.32 0.31
19 TotalBaths 0.38 NA
7 BsmtQual 0.39 0.39
15 GarageArea 0.40 0.39
14 GarageCars 0.41 0.41
8 TotalBsmtSF 0.42 0.38
11 KitchenQual 0.44 0.44
17 TotalODRSF 0.44 NA
5 ExterQual 0.47 0.47
9 GrLivArea 0.54 0.50
1 Neighborhood 0.55 0.55
2 OverallQual 0.63 0.63
16 TotalIDRSF 0.69 NA
18 TotalSF 0.73 NA
  • Eliminating the two outliers has improved the \(R^2\) values for GrLving Area and TotalBsmtSF

  • Four of the new features we added, Total IDRSF, TotalODRSF, TotalSF, and TotalBaths, show strong correlations. Look at the Total SF!!!

  • Adding all the outdoor spaces shows that the outdoor area can have a significant correlation however not as much as the indoor area.

  • Of the 64 variables we explored, 21 variables including the 6 new variables we added seem to have an effect on Sale Price.

  • The variables which, we used to create the new features, are redundant since they have the same information and probably exhibit collinearity. Lets take a look at the correlation matrix.
  • As expected, we observe collinearity between:
  • IDRSF,ODRSF,GrLivArea,TotalBsmtSF and TotalSF
  • Garage Cars, Garage Area, and TotalODRSF
  • We will eliminate columns with redundant information:
    • Remove IDRSF, ODRSF, GrLivArea, TotalBSmtSF, Garage Area, OpenPorchSF, ScreenPorch, EnclosedPorch, X3SsnPotch, WoodDeckSF = All of this is in the TotalSF variable
    • Remove Half Baths and Full Baths - retain Total Baths
    • Remove YearRemodAdd and YearBuilt - retain TotalAge and RemodAge

Hypothesis

  • Size of the house (TOTALSF - includes indoor and outdoor spaces such as porch,garage space but not the lot area) seems to be the most important feature that governs the price of the house followed by the overall quality and neighborhood
  • External appearance and Kitchen Quality also have some effect on the sale price

Model Building

Analyze distributions

  • Before we beging modeling, we need to do a few more things
  • First we will analyze distributions of the target variable (Sale price) and other numerical variables that show a strong correlation to sale price

Sale Price

Its right skewed. Lets try to transform this to a normal distribution by taking the log value

Looks better!  

Lets take a look at the distributions of all the numerical variables  

While most of them are normally distributed, there are a few variables that are not.

Transform Variables

  • Lets use a box-cox transformation to transform the numerical variables that are not normally distributed
  • We will use transformTukey function from the rcompanion package to do this

Before we proceed further, lets check if the transformations affected the correlation values

name newr2 oldr2
8 TotRmsAbvGrd 0.29 0.29
1 Foundation 0.30 0.26
4 RemodAge 0.32 0.26
3 TotalAge 0.35 0.27
13 GarageFinish 0.38 0.31
6 BsmtQual 0.42 0.39
11 TotalBaths 0.44 0.38
7 KitchenQual 0.45 0.44
9 GarageCars 0.46 0.41
5 ExterQual 0.47 0.47
12 Neighborhood 0.57 0.55
2 OverallQual 0.67 0.63
10 TotalSF 0.73 0.73

A few more steps before modeling

  • We will use the caret package to accomplish these tasks
  • lets center and scale the numeric variables
    • preProcess()
  • Create dummy variables for categorical variables
    • dummyVars()
  • Remove variables with low variance
    • nearZeroVar()
  • Partition the data for modeling (70/30)
    • CreateDataPartition()
## [1] "The training set has 1023 observations"
## [1] "Validation set has 435 observations"

Modeling

  • Using caret package for modeling

GBM model

  • Model parameters:
  • n.trees=700
  • interaction.depth=5
  • shrinkage=0.05
  • n.minobsinnode=10

Variable Importance

RMSE of validation set: 0.1181907

Kaggle Leaderboard score/RMSE = 0.12563

XGBM model

RMSE of validation set: 0.1144982

Kaggle Leaderboard score/RMSE = 0.12358

Total sqft and Overall quality of the house are the biggest drivers of home prices in Ames, Iowa.