---
title: "House Price Prediction"
author: "Zhiyuan Xie"
output:
flexdashboard::flex_dashboard:
theme: simplex
orientation: columns
social:
source_code: embed
---
```{r setup, include=FALSE}
# load necessary packages
library(ggplot2)
library(plotly)
library(plyr)
library(flexdashboard) # needed to create the dashboard layout
# read the King County house sales data set
prodata <- read.csv("D:/ECE/MTH543/project/kc_house_data.csv")
```
Introduction
=======================================================================
Column {.tabset data-width=650}
-----------------------------------------------------------------------
### Get to know the data
```{r}
head(prodata)
# ensure these columns are treated as numeric regressors
prodata$condition=as.numeric(prodata$condition)
prodata$grade=as.numeric(prodata$grade)
prodata$yr_built=as.numeric(prodata$yr_built)
prodata$yr_renovated=as.numeric(prodata$yr_renovated)
#head(prodata)
# fix the random seed (an assumed value) so the 80/20 split is reproducible
set.seed(1)
index = sample(1:nrow(prodata),size=0.8*nrow(prodata))
train=prodata[index,]
test=prodata[-index,]
train$zipcat=rep(NA,nrow(train))
test$zipcat=rep(NA,nrow(test))
# assign each zip code to a high/low price category by comparing its
# mean training price with the overall mean training price
meanpr=mean(train$price)
zip=names(table(train$zipcode))
for(i in seq_along(zip))
{
index1=which(train$zipcode==zip[i])
index2=which(test$zipcode==zip[i])
if(mean(train$price[index1])>=meanpr)
{
train$zipcat[index1]=1
test$zipcat[index2]=1
}
else
{
train$zipcat[index1]=0
test$zipcat[index2]=0
}
}
```
### Zipcode region with regional median price

Column {data-width=350}
-----------------------------------------------------------------------
### Abstract
This project predicts house prices in King County, Washington, using the county's house sales data from May 2014 to May 2015, downloaded from Kaggle. We carefully selected and processed the data and then built a multiple linear regression model. The model passes the diagnostic checks and hypothesis tests, has a high R-squared, and achieves a reasonable mean absolute error on the testing data.
### Introduction
Housing is a central part of everyday life, and many people face the question of price when buying or selling a house. The goal of this project is therefore to predict house prices in King County, Washington. The dataset was provided by King County and downloaded from Kaggle [1]; it contains the selling prices of 21613 houses sold in the county between May 2014 and May 2015. A map of zip code regions colored by regional median price shows large price differences between regions.
The available variables include:
- Id
- Date
- Price
- Bedrooms
- Sqft_living
- Sqft_lot
- Floors
- Waterfront
- View
- Condition
- Grade
- Sqft_basement
- Yr_built
- Yr_renovated
- Zipcode
- Lat
- Long
- Sqft_living15
- Sqft_lot15
Data Exploration & Method Part 1
=======================================================================
Column {.tabset data-width=650}
-----------------------------------------------------------------------
### Transform response
```{r}
par(mfrow=c(1,2))
hist(train$price,xlab="price", main="Histogram of price")
train$price <- log(train$price)
hist(train$price,xlab="log(price)", main="Histogram of log(price)")
```
### Price vs Floors
```{r}
boxplot(train$price~train$floors,xlab="floors",ylab="log(price)", col="#03f8fc")
```
### Price vs Grade
```{r}
boxplot(train$price~train$grade,col="#03f8fc",xlab="grade", ylab="log(price)")
```
### Price vs View
```{r}
boxplot(train$price~train$view,col="#03f8fc",xlab="view", ylab="log(price)")
```
### Price vs Condition
```{r}
# recode the few 3.5-floor houses as 3 floors, then treat floors as a factor
train$floors<- ifelse(train$floors>3,3,train$floors)
train$floors<-as.factor(train$floors)
train$zipcat<-as.factor(train$zipcat)
# basement indicator: 1 if the house has a basement, 0 otherwise
train$sqft_basement<-as.factor(ifelse(train$sqft_basement>0,1,0))
train$waterfront<-as.factor(train$waterfront)
# years from construction to renovation; 0 if the house was never renovated
train$yr_renovated<-ifelse((train$yr_renovated-train$yr_built)<0,0,train$yr_renovated-train$yr_built)
#train$renovated<-as.factor(ifelse(train$yr_renovated>0,0,1))
# numeric columns to standardize: price, bedrooms, bathrooms, sqft_living,
# sqft_lot, condition, grade, yr_built, yr_renovated
varind=c(3,4,5,6,7,11,12,15,16)
trainmean=apply(train[varind],2,mean)
trainvar=apply(train[varind],2,var)
# apply the same transformations to the testing data
test$floors<- ifelse(test$floors>3,3,test$floors)
test$price <- log(test$price)
#test$sqft_living <- log(test$sqft_living)
test$floors<-as.factor(test$floors)
test$zipcat<-as.factor(test$zipcat)
test$sqft_basement<-as.factor(ifelse(test$sqft_basement>0,1,0))
test$waterfront<-as.factor(test$waterfront)
test$yr_renovated<-ifelse((test$yr_renovated-test$yr_built)<0,0,test$yr_renovated-test$yr_built)
# condition 1 has few samples, so compare the training data with all data
par(mfrow=c(1,2))
boxplot(exp(train$price)~train$condition,col="#03f8fc",xlab="condition", ylab="price",ylim=c(0,2e+6))
boxplot(prodata$price~prodata$condition,col="#03f8fc",xlab="condition", ylab="price",ylim=c(0,2e+6))
```
Column {data-width=350}
-----------------------------------------------------------------------
The dataset is divided into training and testing sets: 80% for training (17290 observations) and 20% for testing (4323 observations). Both sets are standardized, and the testing data is standardized using the mean and standard deviation of the training data.
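Concretely, each standardized testing value is computed from the training statistics:

$$x^{\text{test}}_{\text{std}} = \frac{x^{\text{test}} - \bar{x}^{\text{train}}}{s^{\text{train}}},$$

where $\bar{x}^{\text{train}}$ and $s^{\text{train}}$ are the mean and standard deviation of the corresponding training column.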
After inspecting the price distribution in the histogram, we transform price with the log() function. As the histograms show, the distribution of price is close to normal after the transformation.
The regressors selected for the multiple regression are listed below (blue font marks indicator variables):
- Bedrooms
- Sqft_living
- Sqft_lot
- $\color{blue}{\text{Floors}}$
- $\color{blue}{\text{Waterfront}}$
- View
- Condition
- Grade
- $\color{blue}{\text{Sqft_basement}}$
- Yr_built
- Yr_renovated
- $\color{blue}{\text{Zipcat}}$
Among these variables, sqft_living is the sum of sqft_above and sqft_basement; we therefore drop sqft_above and convert sqft_basement into an indicator of whether the house has a basement. Waterfront is already a flag for whether the house is on the water, so we treat it as an indicator variable as well.
To further study the data, we examine the relationships between price and several variables with box plots. Because condition 1 has few samples, we plot both the training data and the full data to limit the influence of extreme values. The box plots show a clear positive association between price and each of these variables except floors; therefore, floors is also treated as an indicator variable.
We also found that there are not enough samples with 3.5 floors, so we recode all 3.5-floor houses as 3 floors. Yr_renovated records the year a house was renovated and is zero if it was never renovated; we replace it with the number of years between construction and renovation, as sketched below.
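The renovation recoding is a one-line operation in R (a sketch only; the document applies an equivalent ifelse() recoding to both the training and testing sets):

```{r, eval=FALSE}
# years from construction to renovation; 0 if the house was never renovated
train$yr_renovated <- pmax(train$yr_renovated - train$yr_built, 0)
```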
Finally, we add a new variable named zipcat, which is discussed in Part 2 of the Data Exploration & Method section.
Data Exploration & Method Part 2
=======================================================================
```{r}
#test$renovated<-as.factor(ifelse(test$yr_renovated>0,0,1))
# standardize each selected column with the training mean and variance
for(i in c(1:length(varind)))
{
train[varind[i]]=(train[varind[i]]-trainmean[i])/sqrt(trainvar[i])
test[varind[i]]=(test[varind[i]]-trainmean[i])/sqrt(trainvar[i])
}
```
Column {.tabset data-width=650}
-----------------------------------------------------------------------
### Model summary using Zipcode as an indicator variable
```{r}
train$zipcode<-as.factor(train$zipcode)
fit_zipcode=lm(formula = price~bedrooms+bathrooms+sqft_living+floors+condition+grade
+yr_built+yr_renovated+sqft_basement+sqft_lot+waterfront+zipcode+0,data=train)
summary(fit_zipcode)
```
### Linearity
```{r}
Fitted.Values <- fit_zipcode$fitted.values
# Extract residuals
Residuals <- fit_zipcode$residuals
# Calculate standardized residuals
Standardized.Residuals <- scale(fit_zipcode$residuals)
# Theoretical normal quantiles for the Q-Q plot
Theoretical.Quantiles <- qqnorm(Residuals, plot.it = F)$x
# Square root of absolute standardized residuals (for the Scale-Location plot)
Root.Residuals <- sqrt(abs(Standardized.Residuals))
# Calculate Leverage
Leverage <- lm.influence(fit_zipcode)$hat
# Create data frame
# Will be used as input to plot_ly
diagnostics <- data.frame(Fitted.Values,
Residuals,
Standardized.Residuals,
Theoretical.Quantiles,
Root.Residuals,
Leverage)
m <- list(
l = 100,
r = 100,
b = 100,
t = 100,
pad = 4
)
# Fitted vs Residuals
p1 <- plot_ly(diagnostics, x = Fitted.Values, y = Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data",
marker = list(size = 10, opacity = 0.5))%>%
layout(title = "Residuals vs Fitted Values",
xaxis = list(title="Fitted Values", font=list(size=14)),
yaxis = list(title="Residuals", font=list(size=14)),
plot_bgcolor = "#e6e6e6",
font=list(size=14), margin=m)
p1
```
### Normality
```{r}
# Q-Q plot
p2 <- plot_ly(diagnostics, x = Theoretical.Quantiles, y = Standardized.Residuals, type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
add_trace(x = Theoretical.Quantiles, y = Theoretical.Quantiles, type = "scatter", mode = "lines", name = "45-degree line", line = list(width = 2))%>%
layout(title = "Q-Q Plot", plot_bgcolor = "#e6e6e6",
xaxis = list(title="Theoretical Quantiles", font=list(size=14)),
yaxis = list(title="Standardized Residuals", font=list(size=14)), font=list(size=14), margin=m)
p2
```
### Equal Variance
```{r}
# Scale Location
p3 <- plot_ly(diagnostics, x = Fitted.Values, y = Root.Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data",
marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
layout(title = "Scale-Location", plot_bgcolor = "#e6e6e6", xaxis = list(title="Fitted Values", font=list(size=14)),
yaxis = list(title="sqrt(|Standardized Residuals|)", font=list(size=14)), font=list(size=14), margin=m)
p3
```
### Residuals vs Leverage
```{r}
s <- loess.smooth(Leverage, Residuals)
p4 <- plot_ly(diagnostics, x = Leverage, y = Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F) %>%
add_trace(x = s$x, y = s$y, type = "scatter", mode = "lines", name = "Smooth", line = list(width = 2)) %>%
layout(title = "Residuals vs Leverage", plot_bgcolor = "#e6e6e6", xaxis = list(title="Leverage", font=list(size=14)),
yaxis = list(title="Residuals", font=list(size=14)), font=list(size=14), margin=m)
p4
```
Column {.tabset data-width=350}
-----------------------------------------------------------------------
We first use zipcode as an indicator variable in the linear regression. Although the R-squared of this model is not low, the t-tests show that some zip codes are insignificant at the 0.05 level, and the Q-Q plot shows that the points do not follow the 45-degree line.
To address this, we decided not to use the zip code directly as an indicator variable. The data contain 70 distinct zip codes, each of which would be its own category. To reduce the number of categories, we collapse the zip codes into two groups by comparing the mean price of each zip code with the overall mean price: if a zip code's mean price is below the overall mean, it belongs to zip category zero; otherwise it belongs to zip category one. Across all data, 9933 samples belong to zip category one and 11680 to zip category zero.
Latitude, longitude, and zip code all describe the location of a house. In this project we use the zip code rather than latitude and longitude, because the zip code is already a well-defined grouping, whereas latitude and longitude would require a clustering strategy before they could be used. A compact sketch of the zip-category rule follows.
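For reference, the zip-category assignment can be written without the loop used in the setup chunk (a sketch shown with eval=FALSE, since zipcat has already been assigned; like the loop, it leaves zipcat as NA for any testing zip code that never appears in the training data):

```{r, eval=FALSE}
# mean training price for each zip code
zip_means <- tapply(train$price, train$zipcode, mean)
# category 1 if a zip code's mean price is at least the overall training mean
train$zipcat <- as.integer(zip_means[as.character(train$zipcode)] >= mean(train$price))
test$zipcat <- as.integer(zip_means[as.character(test$zipcode)] >= mean(train$price))
```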
Diagnostics
=======================================================================
Column {.tabset data-width=650}
-----------------------------------------------------------------------
```{r}
#train[varind]<-as.data.frame(apply(train[varind],2,scale))
fit=lm(formula = price~bedrooms+bathrooms+sqft_living+floors+condition+grade
+yr_built+yr_renovated+sqft_basement+sqft_lot+waterfront+zipcat+0,data=train)
#fit=lm(formula = price~bedrooms+bathrooms+sqft_living+floors+condition+grade
# +yr_built+renovated+yr_renovated*renovated+sqft_basement+sqft_lot+waterfront+zipcat+0,data=train)
# undo the standardization and the log transform to report prices in dollars
testprice <- exp(test$price*sqrt(trainvar[1])+trainmean[1])
predictprice <- exp(predict(fit,newdata=test)*sqrt(trainvar[1])+trainmean[1])
MAE=mean(abs(testprice - predictprice))
Fitted.Values <- fit$fitted.values
# Extract residuals
Residuals <- fit$residuals
# Calculate standardized residuals
Standardized.Residuals <- scale(fit$residuals)
# Theoretical normal quantiles for the Q-Q plot
Theoretical.Quantiles <- qqnorm(Residuals, plot.it = F)$x
# Square root of absolute standardized residuals (for the Scale-Location plot)
Root.Residuals <- sqrt(abs(Standardized.Residuals))
# Calculate Leverage
Leverage <- lm.influence(fit)$hat
# Create data frame
# Will be used as input to plot_ly
diagnostics <- data.frame(Fitted.Values,
Residuals,
Standardized.Residuals,
Theoretical.Quantiles,
Root.Residuals,
Leverage)
```
### Linearity
```{r}
m <- list(
l = 100,
r = 100,
b = 100,
t = 100,
pad = 4
)
# Fitted vs Residuals
p1 <- plot_ly(diagnostics, x = Fitted.Values, y = Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data",
marker = list(size = 10, opacity = 0.5))%>%
layout(title = "Residuals vs Fitted Values",
xaxis = list(title="Fitted Values", font=list(size=14)),
yaxis = list(title="Residuals", font=list(size=14)),
plot_bgcolor = "#e6e6e6",
font=list(size=14), margin=m)
p1
```
### Normality
```{r}
# Q-Q plot
p2 <- plot_ly(diagnostics, x = Theoretical.Quantiles, y = Standardized.Residuals, type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
add_trace(x = Theoretical.Quantiles, y = Theoretical.Quantiles, type = "scatter", mode = "lines", name = "45-degree line", line = list(width = 2))%>%
layout(title = "Q-Q Plot", plot_bgcolor = "#e6e6e6",
xaxis = list(title="Theoretical Quantiles", font=list(size=14)),
yaxis = list(title="Standardized Residuals", font=list(size=14)), font=list(size=14), margin=m)
p2
```
### Equal Variance
```{r}
# Scale Location
p3 <- plot_ly(diagnostics, x = Fitted.Values, y = Root.Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data",
marker = list(size = 10, opacity = 0.5), showlegend = F)%>%
layout(title = "Scale-Location", plot_bgcolor = "#e6e6e6", xaxis = list(title="Fitted Values", font=list(size=14)),
yaxis = list(title="sqrt(|Standardized Residuals|)", font=list(size=14)), font=list(size=14), margin=m)
p3
```
### Residuals vs Leverage
```{r}
s <- loess.smooth(Leverage, Residuals)
p4 <- plot_ly(diagnostics, x = Leverage, y = Residuals,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data", marker = list(size = 10, opacity = 0.5), showlegend = F) %>%
add_trace(x = s$x, y = s$y, type = "scatter", mode = "lines", name = "Smooth", line = list(width = 2)) %>%
layout(title = "Residuals vs Leverage", plot_bgcolor = "#e6e6e6", xaxis = list(title="Leverage", font=list(size=14)),
yaxis = list(title="Residuals", font=list(size=14)), font=list(size=14), margin=m)
p4
```
Column {data-width=350}
-----------------------------------------------------------------------
The final model is built after the data selection and transformation described above, so we check it with the standard diagnostic plots.
**1. Linearity Assumption**
We check the linearity assumption with the Residuals vs Fitted Values plot: the residuals show no systematic pattern around zero, so the relationship between price and the regressors is approximately linear.
**2. Normality Assumption**
We check the normality assumption with the Q-Q plot: the standardized residuals fall close to the 45-degree line, so they approximately follow a normal distribution.
**3. Equal Variance Assumption**
We check the equal variance assumption, that the variability of the errors is about the same for all values of each predictor, with the Scale-Location plot: no strong trend appears.
**4. Leverage and Influential Points**
Since the Q-Q plot looks good, checking for leverage and influential points is less critical, but we examine the Residuals vs Leverage plot anyway. No samples stand out as influential.
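As a quick cross-check of the interactive plots above, the same four diagnostics are available from base R's plot method for lm objects (not evaluated here):

```{r, eval=FALSE}
# Residuals vs Fitted, Q-Q, Scale-Location, and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(fit)
```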
Results & Discussion
=======================================================================
Column {data-width=600}
-----------------------------------------------------------------------
### Model summary
```{r}
summary(fit)
```
Column {data-width=400}
-----------------------------------------------------------------------
### The Analysis of Variance Table
```{r}
options(knitr.kable.NA = '')
variance_table <- anova(fit)
k <- nrow(variance_table)                      # the last row of anova() is the residual row
SSR <- sum(variance_table$`Sum Sq`[1:(k-1)])   # regression sum of squares (all model terms)
SSE <- variance_table$`Sum Sq`[k]              # residual sum of squares
df_reg <- sum(variance_table$Df[1:(k-1)])
df_res <- variance_table$Df[k]
V_table <- data.frame(matrix(NA, nrow=3, ncol=5))
rownames(V_table) <- c("Regression", "Residual Error", "Total")
colnames(V_table) <- c("DF", "Sum Sq", "Mean Sq", "F Value", "Pr(>F)")
V_table$DF <- c(df_reg, df_res, df_reg + df_res)
V_table$`Sum Sq` <- c(SSR, SSE, SSR + SSE)
V_table$`Mean Sq`[1:2] <- c(SSR/df_reg, SSE/df_res)
V_table$`F Value`[1] <- V_table$`Mean Sq`[1]/V_table$`Mean Sq`[2]
V_table$`Pr(>F)`[1] <- pf(V_table$`F Value`[1], df_reg, df_res, lower.tail = F)
library(knitr)
kable(V_table)
```
### Mean absolute error & standard deviation of testing price
```{r}
cat("Mean absolute error:", MAE)
cat("Standard deviation of testing price:", sd(testprice))
```
### Discussion
According to the model summary, all variables are significant at the 0.01 level of the t-test, and the selected variables explain more than 77% of the total variability in price.
Examining the p-value in the Analysis of Variance Table, which is far below 0.01, we reject $H_0$ at the 0.01 significance level.
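Here $H_0$ is the null hypothesis of the overall F-test, namely that none of the regressors has explanatory power:

$$H_0:\ \beta_1 = \beta_2 = \cdots = \beta_k = 0 \qquad \text{vs.} \qquad H_1:\ \beta_j \neq 0 \ \text{for at least one } j.$$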
Furthermore, the mean absolute error on the testing prices may look large, but it is reasonable given the high standard deviation of the testing prices.
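For reference, the mean absolute error is computed on the back-transformed (dollar) prices:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,$$

where $y_i$ and $\hat{y}_i$ are the observed and predicted testing prices.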
### Conclusion
In this project, we predict house prices in King County, Washington. The model passes the diagnostic checks and hypothesis tests, has a high R-squared, and achieves a reasonable mean absolute error on the testing data. Overall, the project is a success.
### Reference
[1] “House Sales in King County, USA.” [Online]. Available: https://kaggle.com/harlfoxem/housesalesprediction. [Accessed: 04-Dec-2019].