---
title: "House Price Prediction"
author: "Zhiyuan Xie"
output:
flexdashboard::flex_dashboard:
theme: simplex
orientation: columns
social:
source_code: embed
---
```{r setup, include=FALSE}
# load necessary packages
library(ggplot2)
library(plotly)
library(plyr)
library(flexdashboard) # needed to create the dashboard layout
# read the King County house sales data set
prodata <- read.csv("D:/ECE/MTH543/project/kc_house_data.csv")
```
Introduction
=======================================================================
Column {.tabset data-width=650}
-----------------------------------------------------------------------
### Get to know the data
```{r}
head(prodata)
# ensure these columns are treated as numeric regressors
prodata$condition=as.numeric(prodata$condition)
prodata$grade=as.numeric(prodata$grade)
prodata$yr_built=as.numeric(prodata$yr_built)
prodata$yr_renovated=as.numeric(prodata$yr_renovated)
#head(prodata)
# fix the random seed (an assumed value) so the 80/20 split is reproducible
set.seed(1)
index = sample(1:nrow(prodata),size=0.8*nrow(prodata))
train=prodata[index,]
test=prodata[-index,]
train$zipcat=rep(NA,nrow(train))
test$zipcat=rep(NA,nrow(test))
# assign each zip code to a high/low price category by comparing its
# mean training price with the overall mean training price
meanpr=mean(train$price)
zip=names(table(train$zipcode))
for(i in seq_along(zip))
{
index1=which(train$zipcode==zip[i])
index2=which(test$zipcode==zip[i])
if(mean(train$price[index1])>=meanpr)
{
train$zipcat[index1]=1
test$zipcat[index2]=1
}
else
{
train$zipcat[index1]=0
test$zipcat[index2]=0
}
}
```
### Zipcode region with regional median price

Column {data-width=350}
-----------------------------------------------------------------------
### Abstract
This project predicts house prices in King County, Washington, using the county's house sales data from May 2014 to May 2015, downloaded from Kaggle. We carefully selected and processed the data and then built a multiple linear regression model. The model passes the diagnostic checks and hypothesis tests, has a high R-squared, and achieves a reasonable mean absolute error on the testing data.
### Introduction
Housing is a central part of everyday life, and many people face the question of price when buying or selling a house. The goal of this project is therefore to predict house prices in King County, Washington. The dataset was provided by King County and downloaded from Kaggle [1]; it contains the selling prices of 21613 houses sold in the county between May 2014 and May 2015. A map of zip code regions colored by regional median price shows large price differences between regions.
The available variables include:
- Id
- Date
- Price
- Bedrooms
- Sqft_living
- Sqft_lot
- Floors
- Waterfront
- View
- Condition
- Grade
- Sqft_basement
- Yr_built
- Yr_renovated
- Zipcode
- Lat
- Long
- Sqft_living15
- Sqft_lot15
Data Exploration & Method Part 1
=======================================================================
Column {.tabset data-width=650}
-----------------------------------------------------------------------
### Transform response
```{r}
par(mfrow=c(1,2))
hist(train$price,xlab="price", main="Histogram of price")
train$price <- log(train$price)
hist(train$price,xlab="log(price)", main="Histogram of log(price)")
```
### Price vs Floors
```{r}
boxplot(train$price~train$floors,xlab="floors",ylab="log(price)", col="#03f8fc")
```
### Price vs Grade
```{r}
boxplot(train$price~train$grade,col="#03f8fc",xlab="grade", ylab="log(price)")
```
### Price vs View
```{r}
boxplot(train$price~train$view,col="#03f8fc",xlab="view", ylab="log(price)")
```
### Price vs Condition
```{r}
# recode the few 3.5-floor houses as 3 floors, then treat floors as a factor
train$floors<- ifelse(train$floors>3,3,train$floors)
train$floors<-as.factor(train$floors)
train$zipcat<-as.factor(train$zipcat)
# basement indicator: 1 if the house has a basement, 0 otherwise
train$sqft_basement<-as.factor(ifelse(train$sqft_basement>0,1,0))
train$waterfront<-as.factor(train$waterfront)
# years from construction to renovation; 0 if the house was never renovated
train$yr_renovated<-ifelse((train$yr_renovated-train$yr_built)<0,0,train$yr_renovated-train$yr_built)
#train$renovated<-as.factor(ifelse(train$yr_renovated>0,0,1))
# numeric columns to standardize: price, bedrooms, bathrooms, sqft_living,
# sqft_lot, condition, grade, yr_built, yr_renovated
varind=c(3,4,5,6,7,11,12,15,16)
trainmean=apply(train[varind],2,mean)
trainvar=apply(train[varind],2,var)
# apply the same transformations to the testing data
test$floors<- ifelse(test$floors>3,3,test$floors)
test$price <- log(test$price)
#test$sqft_living <- log(test$sqft_living)
test$floors<-as.factor(test$floors)
test$zipcat<-as.factor(test$zipcat)
test$sqft_basement<-as.factor(ifelse(test$sqft_basement>0,1,0))
test$waterfront<-as.factor(test$waterfront)
test$yr_renovated<-ifelse((test$yr_renovated-test$yr_built)<0,0,test$yr_renovated-test$yr_built)
# condition 1 has few samples, so compare the training data with all data
par(mfrow=c(1,2))
boxplot(exp(train$price)~train$condition,col="#03f8fc",xlab="condition", ylab="price",ylim=c(0,2e+6))
boxplot(prodata$price~prodata$condition,col="#03f8fc",xlab="condition", ylab="price",ylim=c(0,2e+6))
```
Column {data-width=350}
-----------------------------------------------------------------------
The dataset is divided into training and testing sets: 80% for training (17290 observations) and 20% for testing (4323 observations). Both sets are standardized, and the testing data is standardized using the mean and standard deviation of the training data.
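Concretely, each standardized testing value is computed from the training statistics:

$$x^{\text{test}}_{\text{std}} = \frac{x^{\text{test}} - \bar{x}^{\text{train}}}{s^{\text{train}}},$$

where $\bar{x}^{\text{train}}$ and $s^{\text{train}}$ are the mean and standard deviation of the corresponding training column.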
After inspecting the price distribution in the histogram, we transform price with the log() function. As the histograms show, the distribution of price is close to normal after the transformation.
The regressors selected for the multiple regression are listed below (blue font marks indicator variables):
- Bedrooms
- Sqft_living
- Sqft_lot
- $\color{blue}{\text{Floors}}$
- $\color{blue}{\text{Waterfront}}$
- View
- Condition
- Grade
- $\color{blue}{\text{Sqft_basement}}$
- Yr_built
- Yr_renovated
- $\color{blue}{\text{Zipcat}}$
Among these variables, sqft_living is the sum of sqft_above and sqft_basement; we therefore drop sqft_above and convert sqft_basement into an indicator of whether the house has a basement. Waterfront is already a flag for whether the house is on the water, so we treat it as an indicator variable as well.
To further study the data, we examine the relationships between price and several variables with box plots. Because condition 1 has few samples, we plot both the training data and the full data to limit the influence of extreme values. The box plots show a clear positive association between price and each of these variables except floors; therefore, floors is also treated as an indicator variable.
We also found that there are not enough samples with 3.5 floors, so we recode all 3.5-floor houses as 3 floors. Yr_renovated records the year a house was renovated and is zero if it was never renovated; we replace it with the number of years between construction and renovation, as sketched below.
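The renovation recoding is a one-line operation in R (a sketch only; the document applies an equivalent ifelse() recoding to both the training and testing sets):

```{r, eval=FALSE}
# years from construction to renovation; 0 if the house was never renovated
train$yr_renovated <- pmax(train$yr_renovated - train$yr_built, 0)
```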
Finally, we add a new variable named zipcat, which is discussed in Part 2 of the Data Exploration & Method section.
Data Exploration & Method Part 2
=======================================================================
```{r}
#test$renovated<-as.factor(ifelse(test$yr_renovated>0,0,1))
# standardize each selected column with the training mean and variance
for(i in c(1:length(varind)))
{
train[varind[i]]=(train[varind[i]]-trainmean[i])/sqrt(trainvar[i])
test[varind[i]]=(test[varind[i]]-trainmean[i])/sqrt(trainvar[i])
}
```
Column {.tabset data-width=650}
-----------------------------------------------------------------------
### Model summary using Zipcode as an indicator variable
```{r}
train$zipcode<-as.factor(train$zipcode)
fit_zipcode=lm(formula = price~bedrooms+bathrooms+sqft_living+floors+condition+grade
+yr_built+yr_renovated+sqft_basement+sqft_lot+waterfront+zipcode+0,data=train)
summary(fit_zipcode)
```
### Linearity
```{r}
Fitted.Values <- fit_zipcode$fitted.values
# Extract residuals
Residuals <- fit_zipcode$residuals
# Calculate standardized residuals
Standardized.Residuals <- scale(fit_zipcode$residuals)
# Theoretical normal quantiles for the Q-Q plot
Theoretical.Quantiles <- qqnorm(Residuals, plot.it = F)$x
# Square root of absolute standardized residuals (for the Scale-Location plot)
Root.Residuals <- sqrt(abs(Standardized.Residuals))
# Calculate Leverage
Leverage <- lm.influence(fit_zipcode)$hat
# Create data frame
# Will be used as input to plot_ly
diagnostics <- data.frame(Fitted.Values,
Residuals,
Standardized.Residuals,
Theoretical.Quantiles,
Root.Residuals,
Leverage)
m <- list(
l = 100,
r = 100,
b = 100,
t = 100,
pad = 4
)
# Fitted vs Residuals
p1 <- plot_ly(diagnostics, x = Fitted.Values, y = Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data",
marker = list(size = 10, opacity = 0.5))%>%
layout(title = "Residuals vs Fitted Values",
xaxis = list(title="Fitted Values", font=list(size=14)),
yaxis = list(title="Residuals", font=list(size=14)),
plot_bgcolor = "#e6e6e6",
font=list(size=14), margin=m)
p1
```
### Normality
```{r}
# Q-Q plot
p2 <- plot_ly(diagnostics, x = Theoretical.Quantiles, y = Standardized.Residuals, type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
add_trace(x = Theoretical.Quantiles, y = Theoretical.Quantiles, type = "scatter", mode = "lines", name = "45-degree line", line = list(width = 2))%>%
layout(title = "Q-Q Plot", plot_bgcolor = "#e6e6e6",
xaxis = list(title="Theoretical Quantiles", font=list(size=14)),
yaxis = list(title="Standardized Residuals", font=list(size=14)), font=list(size=14), margin=m)
p2
```
### Equal Variance
```{r}
# Scale Location
p3 <- plot_ly(diagnostics, x = Fitted.Values, y = Root.Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data",
marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
layout(title = "Scale-Location", plot_bgcolor = "#e6e6e6", xaxis = list(title="Fitted Values", font=list(size=14)),
yaxis = list(title="sqrt(|Standardized Residuals|)", font=list(size=14)), font=list(size=14), margin=m)
p3
```
### Residuals vs Leverage
```{r}
s <- loess.smooth(Leverage, Residuals)
p4 <- plot_ly(diagnostics, x = Leverage, y = Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F) %>%
add_trace(x = s$x, y = s$y, type = "scatter", mode = "lines", name = "Smooth", line = list(width = 2)) %>%
layout(title = "Residuals vs Leverage", plot_bgcolor = "#e6e6e6", xaxis = list(title="Leverage", font=list(size=14)),
yaxis = list(title="Residuals", font=list(size=14)), font=list(size=14), margin=m)
p4
```
Column {.tabset data-width=350}
-----------------------------------------------------------------------
We first use zipcode as an indicator variable in the linear regression. Although the R-squared of this model is not low, the t-tests show that some zip codes are insignificant at the 0.05 level, and the Q-Q plot shows that the points do not follow the 45-degree line.
To address this, we decided not to use the zip code directly as an indicator variable. The data contain 70 distinct zip codes, each of which would be its own category. To reduce the number of categories, we collapse the zip codes into two groups by comparing the mean price of each zip code with the overall mean price: if a zip code's mean price is below the overall mean, it belongs to zip category zero; otherwise it belongs to zip category one. Across all data, 9933 samples belong to zip category one and 11680 to zip category zero.
Latitude, longitude, and zip code all describe the location of a house. In this project we use the zip code rather than latitude and longitude, because the zip code is already a well-defined grouping, whereas latitude and longitude would require a clustering strategy before they could be used. A compact sketch of the zip-category rule follows.
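For reference, the zip-category assignment can be written without the loop used in the setup chunk (a sketch shown with eval=FALSE, since zipcat has already been assigned; like the loop, it leaves zipcat as NA for any testing zip code that never appears in the training data):

```{r, eval=FALSE}
# mean training price for each zip code
zip_means <- tapply(train$price, train$zipcode, mean)
# category 1 if a zip code's mean price is at least the overall training mean
train$zipcat <- as.integer(zip_means[as.character(train$zipcode)] >= mean(train$price))
test$zipcat <- as.integer(zip_means[as.character(test$zipcode)] >= mean(train$price))
```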
Diagnostics
=======================================================================
Column {.tabset data-width=650}
-----------------------------------------------------------------------
```{r}
#train[varind]<-as.data.frame(apply(train[varind],2,scale))
fit=lm(formula = price~bedrooms+bathrooms+sqft_living+floors+condition+grade
+yr_built+yr_renovated+sqft_basement+sqft_lot+waterfront+zipcat+0,data=train)
#fit=lm(formula = price~bedrooms+bathrooms+sqft_living+floors+condition+grade
# +yr_built+renovated+yr_renovated*renovated+sqft_basement+sqft_lot+waterfront+zipcat+0,data=train)
# undo the standardization and the log transform to report prices in dollars
testprice <- exp(test$price*sqrt(trainvar[1])+trainmean[1])
predictprice <- exp(predict(fit,newdata=test)*sqrt(trainvar[1])+trainmean[1])
MAE=mean(abs(testprice - predictprice))
Fitted.Values <- fit$fitted.values
# Extract residuals
Residuals <- fit$residuals
# Calculate standardized residuals
Standardized.Residuals <- scale(fit$residuals)
# Theoretical normal quantiles for the Q-Q plot
Theoretical.Quantiles <- qqnorm(Residuals, plot.it = F)$x
# Square root of absolute standardized residuals (for the Scale-Location plot)
Root.Residuals <- sqrt(abs(Standardized.Residuals))
# Calculate Leverage
Leverage <- lm.influence(fit)$hat
# Create data frame
# Will be used as input to plot_ly
diagnostics <- data.frame(Fitted.Values,
Residuals,
Standardized.Residuals,
Theoretical.Quantiles,
Root.Residuals,
Leverage)
```
### Linearity
```{r}
m <- list(
l = 100,
r = 100,
b = 100,
t = 100,
pad = 4
)
# Fitted vs Residuals
p1 <- plot_ly(diagnostics, x = Fitted.Values, y = Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data",
marker = list(size = 10, opacity = 0.5))%>%
layout(title = "Residuals vs Fitted Values",
xaxis = list(title="Fitted Values", font=list(size=14)),
yaxis = list(title="Residuals", font=list(size=14)),
plot_bgcolor = "#e6e6e6",
font=list(size=14), margin=m)
p1
```
### Normality
```{r}
# Q-Q plot
p2 <- plot_ly(diagnostics, x = Theoretical.Quantiles, y = Standardized.Residuals, type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
add_trace(x = Theoretical.Quantiles, y = Theoretical.Quantiles, type = "scatter", mode = "lines", name = "45-degree line", line = list(width = 2))%>%
layout(title = "Q-Q Plot", plot_bgcolor = "#e6e6e6",
xaxis = list(title="Theoretical Quantiles", font=list(size=14)),
yaxis = list(title="Standardized Residuals", font=list(size=14)), font=list(size=14), margin=m)
p2
```
### Equal Variance
```{r}
# Scale Location
p3 <- plot_ly(diagnostics, x = Fitted.Values, y = Root.Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data",
marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
layout(title = "Scale-Location", plot_bgcolor = "#e6e6e6", xaxis = list(title="Fitted Values", font=list(size=14)),
yaxis = list(title="sqrt(|Standardized Residuals|)", font=list(size=14)), font=list(size=14), margin=m)
p3
```
### Residuals vs Leverage
```{r}
s <- loess.smooth(Leverage, Residuals)
p4 <- plot_ly(diagnostics, x = Leverage, y = Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F) %>%
add_trace(x = s$x, y = s$y, type = "scatter", mode = "lines", name = "Smooth", line = list(width = 2)) %>%
layout(title = "Residuals vs Leverage", plot_bgcolor = "#e6e6e6", xaxis = list(title="Leverage", font=list(size=14)),
yaxis = list(title="Residuals", font=list(size=14)), font=list(size=14), margin=m)
p4
```
Column {data-width=350}
-----------------------------------------------------------------------
The final model is built after the data selection and transformation described above, so we check it with the standard diagnostic plots.
**1. Linearity Assumption**
We check the linearity assumption with the Residuals vs Fitted Values plot: the residuals show no systematic pattern around zero, so the relationship between price and the regressors is approximately linear.
**2. Normality Assumption**
We check the normality assumption with the Q-Q plot: the standardized residuals fall close to the 45-degree line, so they approximately follow a normal distribution.
**3. Equal Variance Assumption**
We check the equal variance assumption, that the variability of the errors is about the same for all values of each predictor, with the Scale-Location plot: no strong trend appears.
**4. Leverage and Influential Points**
Since the Q-Q plot looks good, checking for leverage and influential points is less critical, but we examine the Residuals vs Leverage plot anyway. No samples stand out as influential.
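As a quick cross-check of the interactive plots above, the same four diagnostics are available from base R's plot method for lm objects (not evaluated here):

```{r, eval=FALSE}
# Residuals vs Fitted, Q-Q, Scale-Location, and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(fit)
```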
Results & Discussion
=======================================================================
Column {data-width=600}
-----------------------------------------------------------------------
### Model summary
```{r}
summary(fit)
```
Column {data-width=400}
-----------------------------------------------------------------------
### The Analysis of Variance Table
```{r}
options(knitr.kable.NA = '')
variance_table <- anova(fit)
k <- nrow(variance_table)                      # the last row of anova() is the residual row
SSR <- sum(variance_table$`Sum Sq`[1:(k-1)])   # regression sum of squares (all model terms)
SSE <- variance_table$`Sum Sq`[k]              # residual sum of squares
df_reg <- sum(variance_table$Df[1:(k-1)])
df_res <- variance_table$Df[k]
V_table <- data.frame(matrix(NA, nrow=3, ncol=5))
rownames(V_table) <- c("Regression", "Residual Error", "Total")
colnames(V_table) <- c("DF", "Sum Sq", "Mean Sq", "F Value", "Pr(>F)")
V_table$DF <- c(df_reg, df_res, df_reg + df_res)
V_table$`Sum Sq` <- c(SSR, SSE, SSR + SSE)
V_table$`Mean Sq`[1:2] <- c(SSR/df_reg, SSE/df_res)
V_table$`F Value`[1] <- V_table$`Mean Sq`[1]/V_table$`Mean Sq`[2]
V_table$`Pr(>F)`[1] <- pf(V_table$`F Value`[1], df_reg, df_res, lower.tail = F)
library(knitr)
kable(V_table)
```
### Mean absolute error & standard deviation of testing price
```{r}
cat("Mean absolute error:", MAE)
cat("Standard deviation of testing price:", sd(testprice))
```
### Discussion
According to the model summary, all variables are significant at the 0.01 level of the t-test, and the selected variables explain more than 77% of the total variability in price.
Examining the p-value in the Analysis of Variance Table, which is far below 0.01, we reject $H_0$ at the 0.01 significance level.
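Here $H_0$ is the null hypothesis of the overall F-test, namely that none of the regressors has explanatory power:

$$H_0:\ \beta_1 = \beta_2 = \cdots = \beta_k = 0 \qquad \text{vs.} \qquad H_1:\ \beta_j \neq 0 \ \text{for at least one } j.$$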
Furthermore, the mean absolute error on the testing prices may look large, but it is reasonable given the high standard deviation of the testing prices.
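For reference, the mean absolute error is computed on the back-transformed (dollar) prices:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,$$

where $y_i$ and $\hat{y}_i$ are the observed and predicted testing prices.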
### Conclusion
In this project, we predict house prices in King County, Washington. The model passes the diagnostic checks and hypothesis tests, has a high R-squared, and achieves a reasonable mean absolute error on the testing data. Overall, the project is a success.
### Reference
[1] “House Sales in King County, USA.” [Online]. Available: https://kaggle.com/harlfoxem/housesalesprediction. [Accessed: 04-Dec-2019].