Goal

To predict the sale prices of homes in Ames, Iowa using the 79 explanatory variables provided.

Work Outline

Data exploration
- Examining the data set
- Clean data:
  - Check for missing values
  - Impute/Replace missing values
  - Change ordinal variables to numerical variables
- Visualize data and calculate correlations
- Handle outliers
- Feature engineering
Hypothesis
Model Building
- Analyzing variable distributions
- Transforming data
- Scaling data
- Removing Variables with low variance
- Creating dummy variables for categorical variables
- Since this is a prediction problem, I am planning to use a Gradient Boost and an XGBM model to predict sale prices

Data Exploration

Examining the data set

The data set consists of 1460 rows and 81 columns including the target column, Sale Price and Id a unique Id for each home
There are 79 columns describing various features of the house which can be divided into 6 categories:

Age attributes (5)
-Year built, Yead Remod Add, YrSold, MoSold, GarageYrBlt
Location,lot and home style attributes (14)
-Location - MSZoning, MSSubClass, Neighborhood, Street, Alley -Lot - Lot Frontage, Lot Config, Lot shape, Landslope, LandContour -Home style - Building type, House style, Street, Alley
Condition and Quality attributes (14)
-Condition1, Condition2, ExterCond, ExterQual, BsmtQual, BsmtCond, KitchenQual, GarageQual,GarageCond, HeatingQC, OverallQual, OverallCond, SaleCondition,PoolQC
Technical attributes (15)
-RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, Foundation, BsmtExposure, Electrical, CentralAir, GarageType, GarageFinish, BsmtFinType1, BsmtFinType2,Utilities, Heating
Size attributes (22)
-LotArea, GrLivArea, TotalBsmtSF, TotRmsAbvGrd, FullBath, HalfBath, BsmtFullBath, BsmtHalfBath,BedroomAbvGr, KitchenAbvGr, GarageArea, WoodDeckSF, EnclosedPorch, OpenPorchSF, X3SsnPorch,ScreenPorch,BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, X1stFlrSF, X2ndFlrSF,LowQualFinSF
Luxury attributes (7)
-Fireplaces,GarageCars,Fence,MiscFeature,Pool Area, MiscVal,PavedDrive
Other (2) -Saletype, Functional

There are 36 numerical columns and 42 categorical columns. Some of the categorical columns are ordinal and during the data cleaning stage, we will convert these to numerical variables.
There is redundancy in some columns:
-TotalBsmtSF is the sum of BsmtFinSF1, BsmtFinSF2 and BsmtUnSF.
-GrLIvArea which describes the Above grade (ground) living area square feet is the sum of X1stFlrSF, X2ndFlrSF and LowQualFinSF.
We will retain the TotalBsmtSF and GrlivArea and drop the other 6 columns.
This leaves us with 73 columns to explore

Check for missing data

3 columns (PoolQC, misc feature,and alley) have more than 90% missing values. Fence and fireplace quality columns have ~80% and ~49% missing values
All these columns are factor/categorical variables and the description text describes NA is either “Not Applicable” or “None” for all factor variable columns
This means that these five columns have big chunks of data which is a single value. So, we will not explore these features and drop these columns.

Data cleaning

Drop 5 columns PoolQC, miscfeature, alley, fence, and fireplace quality
We will replace NA with NOTAPP in the remaining factor variables columns which consist NA values
We will replace missing values with 0 for all numerical variables
We will also convert 8 ordinal variable columns to numerical variables
Additionally, utilities, heating, street, condition2, and miscVal columns contain a single value. So, we will eliminate these columns since they wont be very useful for modeling.
So far, we have eliminated 15 columns which leaves us with 62 columns

Data Visualization & correlation values

For each category attribute as described in data description, we will calculate correlation values
Plots and values are displayed for columns that have \(R^2\)>0.25
Numerical variables are plotted as scatter plots while categorical variables are plotted as box plots

Age attributes

Age of the house has very little correlation with the saleprice

Location,lot and home style attributes

Of the 14 columns that describe location,lot, and home style attributes, Neighborhood is the only variable that shows a strong correlation to sale price

Condition and Quality attributes

External, kitchen, basement, and overall quality show a strong correlation to sale price

Technical attributes

Garage Finish and Foundation show correlation to sale price.

Size attributes

A number of size attributes have \(R^2\)>0.3. Note that most of these attributes are related to indoor area

Luxury attributes

People are willing to pay extra for a bigger garage but not for fireplaces

Table of attributes that have \(R^2\)> 0.25

Attribute	\(R^2\)
Foundation	0.26
Year RemodAdd	0.26
Year Built	0.27
TotRmsAbvGrd	0.28
Full Bath	0.31
Garage Finish	0.31
Total BsmtSF	0.38
Bsmt Qual	0.39
Garage Area	0.39
Garage Cars	0.41
Kitchen Qual	0.44
ExterQual	0.47
GrLivinArea	0.50
Neighborhood	0.55
Overall Qual	0.63

Based on \(R^2\) values, it seems that overall quality, neighborhood, and grlivingarea(Above ground living area square feet) are the top features that have the strongest correlations (>=0.5) with sale price.
External and kitchen quality also seem to have an effect on sale price.
It is interesting that while GrLivingArea has a strong impact, the TotRmsAbvGrd doesnt seem to have any impact. Does this mean that the total number of rooms in the house dont matter as long as there is enough space?

Handling outliers

Visualizing the data allowed us to observe some outliers. TotalBsmtSF and GrLivingArea show outliers when SF > 4000. So, lets examine these.

	Id	YearBuilt	GrLivArea	TotalBsmtSF	SalePrice
524	524	2007	4676	3138	184750
692	692	1994	4316	2444	755000
1183	1183	1996	4476	2396	745000
1299	1299	2008	5642	6110	160000

Despite being relatively newer homes and having a living area and basment area >4000 sqft, Ids 524 and 1299 have a really low price. Something seems wrong here. So, will eliminate these two data points to help modeling

Feature Engineering

Based on our correlation results, lets build some new features

GrLiving Area and TotalBsmt Sqft have a strong correlation with saleprice. It might be beneficial to add these two columns to create a new variable called TotalIDRSF
Notice that the size variables that had a correlation > 0.25 were all related to indoor area. None of the outdoor area variables showed a correlation. Maybe combining them all might change this. Combine all variables (6 columns) related to outdoor sqft to create a new variable called TotalODRSF
Combine TotalIDRSF and TotalODRSF to create TotalSF
Combine Full Baths and Half baths to create TotalBaths
We will also add columns Total Age and RemodAge
-Totalage = Difference between year built and year sold
-Remodage = Difference between year built and RemodAdd

Lets take a look at the \(R^2\) values after removing the outliers and adding new feature columns

	name	new_r2	old_r2
4	YearRemodAdd	0.26	0.26
6	Foundation	0.26	0.26
21	RemodAge	0.26	NA
3	YearBuilt	0.27	0.27
20	TotalAge	0.27	NA
12	TotRmsAbvGrd	0.29	0.28
13	GarageFinish	0.31	0.31
10	FullBath	0.32	0.31
19	TotalBaths	0.38	NA
7	BsmtQual	0.39	0.39
15	GarageArea	0.40	0.39
14	GarageCars	0.41	0.41
8	TotalBsmtSF	0.42	0.38
11	KitchenQual	0.44	0.44
17	TotalODRSF	0.44	NA
5	ExterQual	0.47	0.47
9	GrLivArea	0.54	0.50
1	Neighborhood	0.55	0.55
2	OverallQual	0.63	0.63
16	TotalIDRSF	0.69	NA
18	TotalSF	0.73	NA

Eliminating the two outliers has improved the \(R^2\) values for GrLving Area and TotalBsmtSF
Four of the new features we added, Total IDRSF, TotalODRSF, TotalSF, and TotalBaths, show strong correlations. Look at the Total SF!!!
Adding all the outdoor spaces shows that the outdoor area can have a significant correlation however not as much as the indoor area.
Of the 64 variables we explored, 21 variables including the 6 new variables we added seem to have an effect on Sale Price.
The variables which, we used to create the new features, are redundant since they have the same information and probably exhibit collinearity. Lets take a look at the correlation matrix.

As expected, we observe collinearity between:
IDRSF,ODRSF,GrLivArea,TotalBsmtSF and TotalSF
Garage Cars, Garage Area, and TotalODRSF
We will eliminate columns with redundant information:
- Remove IDRSF, ODRSF, GrLivArea, TotalBSmtSF, Garage Area, OpenPorchSF, ScreenPorch, EnclosedPorch, X3SsnPotch, WoodDeckSF = All of this is in the TotalSF variable
- Remove Half Baths and Full Baths - retain Total Baths
- Remove YearRemodAdd and YearBuilt - retain TotalAge and RemodAge

Hypothesis

Size of the house (TOTALSF - includes indoor and outdoor spaces such as porch,garage space but not the lot area) seems to be the most important feature that governs the price of the house followed by the overall quality and neighborhood
External appearance and Kitchen Quality also have some effect on the sale price

Model Building

Analyze distributions

Before we beging modeling, we need to do a few more things
First we will analyze distributions of the target variable (Sale price) and other numerical variables that show a strong correlation to sale price

Sale Price

Its right skewed. Lets try to transform this to a normal distribution by taking the log value

Looks better!

Lets take a look at the distributions of all the numerical variables

While most of them are normally distributed, there are a few variables that are not.

Transform Variables

Lets use a box-cox transformation to transform the numerical variables that are not normally distributed
We will use transformTukey function from the rcompanion package to do this

Before we proceed further, lets check if the transformations affected the correlation values

	name	newr2	oldr2
8	TotRmsAbvGrd	0.29	0.29
1	Foundation	0.30	0.26
4	RemodAge	0.32	0.26
3	TotalAge	0.35	0.27
13	GarageFinish	0.38	0.31
6	BsmtQual	0.42	0.39
11	TotalBaths	0.44	0.38
7	KitchenQual	0.45	0.44
9	GarageCars	0.46	0.41
5	ExterQual	0.47	0.47
12	Neighborhood	0.57	0.55
2	OverallQual	0.67	0.63
10	TotalSF	0.73	0.73

A few more steps before modeling

We will use the caret package to accomplish these tasks
lets center and scale the numeric variables
- preProcess()
Create dummy variables for categorical variables
- dummyVars()
Remove variables with low variance
- nearZeroVar()
Partition the data for modeling (70/30)
- CreateDataPartition()

## [1] "The training set has 1023 observations"

## [1] "Validation set has 435 observations"

Modeling

Using caret package for modeling

GBM model

Model parameters:
n.trees=700
interaction.depth=5
shrinkage=0.05
n.minobsinnode=10

Variable Importance

RMSE of validation set: 0.1181907

Kaggle Leaderboard score/RMSE = 0.12563

XGBM model

RMSE of validation set: 0.1144982

Kaggle Leaderboard score/RMSE = 0.12358

Total sqft and Overall quality of the house are the biggest drivers of home prices in Ames, Iowa.

Home price prediction

Spandana Makeneni