Hypothesis
Age attributes (5)
-YearBuilt, YearRemodAdd, YrSold, MoSold, GarageYrBlt
Location, lot and home style attributes (14)
-Location - MSZoning, MSSubClass, Neighborhood, Street, Alley
-Lot - LotFrontage, LotConfig, LotShape, LandSlope, LandContour
-Home style - BldgType, HouseStyle, Street, Alley
Condition and Quality attributes (14)
-Condition1, Condition2, ExterCond, ExterQual, BsmtQual, BsmtCond, KitchenQual, GarageQual, GarageCond, HeatingQC, OverallQual, OverallCond, SaleCondition, PoolQC
Technical attributes (15)
-RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, Foundation, BsmtExposure, Electrical, CentralAir, GarageType, GarageFinish, BsmtFinType1, BsmtFinType2,Utilities, Heating
Size attributes (22)
-LotArea, GrLivArea, TotalBsmtSF, TotRmsAbvGrd, FullBath, HalfBath, BsmtFullBath, BsmtHalfBath, BedroomAbvGr, KitchenAbvGr, GarageArea, WoodDeckSF, EnclosedPorch, OpenPorchSF, X3SsnPorch, ScreenPorch, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, X1stFlrSF, X2ndFlrSF, LowQualFinSF
Luxury attributes (7)
-Fireplaces, GarageCars, Fence, MiscFeature, PoolArea, MiscVal, PavedDrive
Other (2)
-SaleType, Functional
There are 36 numerical columns and 42 categorical columns. Some of the categorical columns are ordinal and during the data cleaning stage, we will convert these to numerical variables.
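As a rough illustration of the ordinal conversion, here is a minimal sketch on a toy data frame. The worst-to-best ordering of the quality codes (Po, Fa, TA, Gd, Ex) and the helper name `encode_quality` are assumptions for illustration, not necessarily what the cleaning stage uses.

```r
# Toy data; the real columns would come from the training file.
houses <- data.frame(
  KitchenQual = c("TA", "Gd", "Ex", "Fa"),
  ExterQual   = c("Gd", "TA", "TA", "Ex"),
  stringsAsFactors = FALSE
)

# Assumed ordering of the quality codes, worst to best.
quality_levels <- c("Po", "Fa", "TA", "Gd", "Ex")

# Hypothetical helper: map each code to its position in the ordering (1..5).
encode_quality <- function(x, levels = quality_levels) {
  as.integer(factor(x, levels = levels, ordered = TRUE))
}

ordinal_cols <- c("KitchenQual", "ExterQual")
houses[ordinal_cols] <- lapply(houses[ordinal_cols], encode_quality)
print(houses)
```

Codes that are not in `quality_levels` (including missing values) become `NA`, so they would need separate handling.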
There is redundancy in some columns:
-TotalBsmtSF is the sum of BsmtFinSF1, BsmtFinSF2 and BsmtUnfSF.
-GrLivArea, the above-grade (ground) living area in square feet, is the sum of X1stFlrSF, X2ndFlrSF and LowQualFinSF.
We will retain TotalBsmtSF and GrLivArea and drop the other 6 columns.
This leaves us with 73 columns to explore
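The redundancy can be checked before dropping anything. The sketch below uses a two-row toy data frame with made-up values; on the real data the same checks would run on the full training set.

```r
# Toy rows with illustrative values only.
houses <- data.frame(
  BsmtFinSF1 = c(706, 978), BsmtFinSF2 = c(0, 0), BsmtUnfSF = c(150, 284),
  TotalBsmtSF = c(856, 1262),
  X1stFlrSF = c(856, 1262), X2ndFlrSF = c(854, 0), LowQualFinSF = c(0, 0),
  GrLivArea = c(1710, 1262)
)

bsmt_parts  <- c("BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF")
floor_parts <- c("X1stFlrSF", "X2ndFlrSF", "LowQualFinSF")

# Confirm that the totals really are the sums of their parts.
bsmt_ok <- all(rowSums(houses[bsmt_parts]) == houses$TotalBsmtSF)
area_ok <- all(rowSums(houses[floor_parts]) == houses$GrLivArea)

# Keep the totals and drop the six component columns.
houses <- houses[, !(names(houses) %in% c(bsmt_parts, floor_parts)), drop = FALSE]
print(names(houses))
```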
Age of the house has very little correlation with the sale price
Of the 14 columns that describe location, lot, and home style attributes, Neighborhood is the only variable that shows a strong correlation with sale price
Exterior, kitchen, basement, and overall quality (ExterQual, KitchenQual, BsmtQual, OverallQual) show a strong correlation with sale price
Garage Finish and Foundation show a moderate correlation with sale price (\(R^2\) of 0.31 and 0.26).
A number of size attributes have \(R^2\)>0.3. Note that most of these attributes are related to indoor area
People are willing to pay extra for a bigger garage but not for fireplaces
Table of attributes that have \(R^2\)> 0.25
Attribute | \(R^2\) |
---|---|
Foundation | 0.26 |
Year RemodAdd | 0.26 |
Year Built | 0.27 |
TotRmsAbvGrd | 0.28 |
Full Bath | 0.31 |
Garage Finish | 0.31 |
Total BsmtSF | 0.38 |
Bsmt Qual | 0.39 |
Garage Area | 0.39 |
Garage Cars | 0.41 |
Kitchen Qual | 0.44 |
ExterQual | 0.47 |
GrLivArea | 0.50 |
Neighborhood | 0.55 |
Overall Qual | 0.63 |
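One way such a per-attribute table could be produced is to fit a single-predictor linear model of SalePrice on each column and read off its \(R^2\). The write-up does not state how the values were computed, so treat this as an assumed approach. The data below is synthetic, and `r2_for` is a hypothetical helper.

```r
set.seed(1)
n <- 200

# Synthetic data: price depends on OverallQual but not on Neighborhood.
houses <- data.frame(
  OverallQual  = sample(1:10, n, replace = TRUE),
  Neighborhood = sample(c("A", "B", "C"), n, replace = TRUE)
)
houses$SalePrice <- 100000 + 20000 * houses$OverallQual + rnorm(n, sd = 30000)

# Hypothetical helper: R^2 of a one-predictor linear model.
# Character columns are treated as categorical by lm().
r2_for <- function(col, data, target = "SalePrice") {
  fit <- lm(reformulate(col, response = target), data = data)
  summary(fit)$r.squared
}

predictors <- setdiff(names(houses), "SalePrice")
r2 <- sapply(predictors, r2_for, data = houses)

# Sort ascending, like the table above.
r2_table <- data.frame(name = predictors, r2 = round(r2, 2))
r2_table <- r2_table[order(r2_table$r2), ]
print(r2_table)
```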
Row | Id | YearBuilt | GrLivArea | TotalBsmtSF | SalePrice |
---|---|---|---|---|---|
524 | 524 | 2007 | 4676 | 3138 | 184750 |
692 | 692 | 1994 | 4316 | 2444 | 755000 |
1183 | 1183 | 1996 | 4476 | 2396 | 745000 |
1299 | 1299 | 2008 | 5642 | 6110 | 160000 |
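All four houses in this table have a GrLivArea above 4000, and two of them sold for far less than the other two. A filter along these lines would remove those two rows. The thresholds are an assumption chosen to match the table, not a rule taken from the analysis.

```r
# The four rows from the table above.
big_houses <- data.frame(
  Id        = c(524, 692, 1183, 1299),
  GrLivArea = c(4676, 4316, 4476, 5642),
  SalePrice = c(184750, 755000, 745000, 160000)
)

# Assumed rule: very large living area combined with a low sale price.
is_outlier <- big_houses$GrLivArea > 4000 & big_houses$SalePrice < 300000

kept    <- big_houses[!is_outlier, ]
removed <- big_houses[is_outlier, ]
print(removed$Id)
```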
Based on our correlation results, let's build some new features
Let's take a look at the \(R^2\) values after removing the outliers and adding new feature columns
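A sketch of how some of these features might be derived is shown here on toy rows. The formulas are assumptions inferred from the feature names (for example, a half bath counted as 0.5), and TotalIDRSF and TotalSF are left out because their exact composition is not stated.

```r
# Toy rows with illustrative values only.
houses <- data.frame(
  YearBuilt = c(2003, 1976), YearRemodAdd = c(2003, 1990), YrSold = c(2008, 2007),
  FullBath = c(2, 2), HalfBath = c(1, 0),
  BsmtFullBath = c(1, 0), BsmtHalfBath = c(0, 1),
  WoodDeckSF = c(0, 298), OpenPorchSF = c(61, 0), EnclosedPorch = c(0, 0),
  X3SsnPorch = c(0, 0), ScreenPorch = c(0, 0)
)

houses <- transform(
  houses,
  # Assumed: ages measured at the time of sale.
  TotalAge   = YrSold - YearBuilt,
  RemodAge   = YrSold - YearRemodAdd,
  # Assumed: half baths count as 0.5 of a bath.
  TotalBaths = FullBath + BsmtFullBath + 0.5 * (HalfBath + BsmtHalfBath),
  # Assumed: outdoor area is the sum of deck and porch areas.
  TotalODRSF = WoodDeckSF + OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch
)
print(houses[c("TotalAge", "RemodAge", "TotalBaths", "TotalODRSF")])
```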
Row | name | new_r2 | old_r2 |
---|---|---|---|
4 | YearRemodAdd | 0.26 | 0.26 |
6 | Foundation | 0.26 | 0.26 |
21 | RemodAge | 0.26 | NA |
3 | YearBuilt | 0.27 | 0.27 |
20 | TotalAge | 0.27 | NA |
12 | TotRmsAbvGrd | 0.29 | 0.28 |
13 | GarageFinish | 0.31 | 0.31 |
10 | FullBath | 0.32 | 0.31 |
19 | TotalBaths | 0.38 | NA |
7 | BsmtQual | 0.39 | 0.39 |
15 | GarageArea | 0.40 | 0.39 |
14 | GarageCars | 0.41 | 0.41 |
8 | TotalBsmtSF | 0.42 | 0.38 |
11 | KitchenQual | 0.44 | 0.44 |
17 | TotalODRSF | 0.44 | NA |
5 | ExterQual | 0.47 | 0.47 |
9 | GrLivArea | 0.54 | 0.50 |
1 | Neighborhood | 0.55 | 0.55 |
2 | OverallQual | 0.63 | 0.63 |
16 | TotalIDRSF | 0.69 | NA |
18 | TotalSF | 0.73 | NA |
Eliminating the two outliers has improved the \(R^2\) values for GrLivArea (0.50 to 0.54) and TotalBsmtSF (0.38 to 0.42).
Four of the new features we added, TotalIDRSF, TotalODRSF, TotalSF, and TotalBaths, show strong correlations. TotalSF has the highest \(R^2\) of any attribute, at 0.73.
Adding up all the outdoor spaces shows that outdoor area can have a significant correlation with sale price (TotalODRSF, 0.44), though not as much as indoor area (TotalIDRSF, 0.69).
Of the 64 variables we explored, 21 variables including the 6 new variables we added seem to have an effect on Sale Price.
Sale Price
It's right skewed. Let's try to transform it toward a normal distribution by taking the log of the sale price.
Looks better!
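To put a number on "looks better", we can compare skewness before and after the log transform. The sketch below uses synthetic right-skewed prices rather than the real SalePrice column, and defines a small moment-based `skewness` helper (hypothetical, written here to avoid extra packages).

```r
set.seed(42)

# Synthetic right-skewed prices standing in for SalePrice.
sale_price <- exp(rnorm(1000, mean = 12, sd = 0.4))

# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

log_price <- log(sale_price)

skew_before <- skewness(sale_price)
skew_after  <- skewness(log_price)
print(c(before = skew_before, after = skew_after))
```

A skewness near zero after the transform indicates a roughly symmetric distribution.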
Let's take a look at the distributions of all the numerical variables
While most of them are normally distributed, there are a few variables that are not.
Before we proceed further, let's check if the transformations affected the correlation values
Row | name | newr2 | oldr2 |
---|---|---|---|
8 | TotRmsAbvGrd | 0.29 | 0.29 |
1 | Foundation | 0.30 | 0.26 |
4 | RemodAge | 0.32 | 0.26 |
3 | TotalAge | 0.35 | 0.27 |
13 | GarageFinish | 0.38 | 0.31 |
6 | BsmtQual | 0.42 | 0.39 |
11 | TotalBaths | 0.44 | 0.38 |
7 | KitchenQual | 0.45 | 0.44 |
9 | GarageCars | 0.46 | 0.41 |
5 | ExterQual | 0.47 | 0.47 |
12 | Neighborhood | 0.57 | 0.55 |
2 | OverallQual | 0.67 | 0.63 |
10 | TotalSF | 0.73 | 0.73 |
## [1] "The training set has 1023 observations"
## [1] "Validation set has 435 observations"
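The counts above add up to 1458 rows, split roughly 70/30. A simple random split that reproduces those sizes might look like the sketch below. The splitting function actually used is not shown, so the use of `sample` and the seed are assumptions, and the data frame is a placeholder.

```r
set.seed(123)

# Placeholder data with the same number of rows as the cleaned training data.
n <- 1458
houses <- data.frame(Id = seq_len(n), SalePrice = runif(n, 50000, 500000))

# Assumed: plain random sampling of 1023 rows for training.
train_idx <- sample(seq_len(n), size = 1023)

train <- houses[train_idx, ]
valid <- houses[-train_idx, ]

print(paste("The training set has", nrow(train), "observations"))
print(paste("Validation set has", nrow(valid), "observations"))
```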
RMSE of validation set: 0.1181907
Kaggle Leaderboard score/RMSE = 0.12563
RMSE of validation set: 0.1144982
Kaggle Leaderboard score/RMSE = 0.12358
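RMSE values of this size suggest the error is measured on the log of the sale price, which matches the earlier log transform. A minimal sketch of that calculation, with made-up actual and predicted prices:

```r
# Root mean squared error between two numeric vectors.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up prices for illustration only.
actual_price    <- c(200000, 150000, 320000)
predicted_price <- c(210000, 140000, 300000)

# Error on the log scale, so relative errors count equally for cheap and expensive homes.
log_rmse <- rmse(log(actual_price), log(predicted_price))
print(log_rmse)
```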
Total square footage (TotalSF) and overall quality (OverallQual) of the house are the biggest drivers of home prices in Ames, Iowa.