
City-to-City Housing Value Impact Modeler: Code and Stats Explained








In our last blog post I introduced Intrepid Insight's City-to-City Home Value Impact Calculator. In this post, as promised, I will expand on the statistics and the code behind the tool. One of the main prongs of Intrepid Insight's mission is to "show our work", that is, to be transparent about how we get to our results. This has two benefits. First, if it turns out we made a mistake, someone can correct us and we can update our methodology. In fact, while making this blog post, I realized an error: I had been controlling for realtor fees rather than the mortgage interest rate. You can check out the previous post for more details. Second, it allows our audience to learn from our methods and our code. If you would rather not read my commentary and explanation, and want to get straight to the code, it is available in full on I2's GitHub.

Because of our many-faceted audience, we think it is wise to warn the reader when a blog post is going to dive into highly technical content. Paul Krugman at the New York Times puts "Wonkish" in his column titles when they are mathematically or economically involved. We think this is a good practice, so we promise to put "TECHNICAL" in the titles of posts that have a lot to do with programming, mathematics or statistics. You'll see that we have started this little tradition with this very blog post.

Now enough of that administrative fluff, and onto the content. Here is how CTCHVIM works!

Data Building

There are only two data sources used. The first is Zillow's Home Value Index, specifically the "Median Home Value Per Square Foot" time series by city, available here. Essentially, Zillow maintains what it calls a "living database" of more than 100 million homes across the United States. Three times a week, it calculates estimates of each home's value based on "prior sales, county records, tax assessments, real estate listings, mortgage information and GIS data…and users." The Median Home Value per Square Foot is then a statistic derived from these estimates, and like Zillow's other derived statistics, it is updated monthly. The Zillow metrics also boast "95% coverage of US housing stock by market value." You can read more extensively about Zillow's methodology here.

The compelling claim in their documentation is that their indices, while not perfect, are unbiased. In layman's terms, they may not always hit the mark, but their valuations are just as likely to be high as they are to be low. Zillow also claims to suffer less from selection bias than other home value measures. Other metrics essentially rely on sold homes to compute statistics, which leads to an underrepresentation of newly constructed homes and of homes that do not sell frequently.

After downloading this series, I began cleaning the data using the data.table framework. Since learning about the package from Jimmy Lepore (another team member at I2), I have become a total data.table believer. Part of this is because I started my data science forays using Stata, and I find the structure imposed by data.table to be the most similar. The following lines read in the data and reshape it. The data is produced by Zillow in a wide format, with each column representing a date, which I find a little hard to work with. You can see why if you look at the printed sections of the data tables. NOTE: This project only uses data up to December 2017.

# packages used throughout; zillowdata is assumed to have been read in from the downloaded Zillow CSV
library(data.table); library(knitr); library(stringr)
zillowdata<-data.table(zillowdata)
kable(zillowdata[1:1, 1:12], align='c')
| RegionID | RegionName | State | Metro | CountyName | SizeRank | X1996.04 | X1996.05 | X1996.06 | X1996.07 | X1996.08 | X1996.09 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 6181 | New York | NY | New York | Queens | 1 | NA | NA | NA | NA | NA | NA |
zillowdata<-melt(zillowdata, measure=patterns("^X"), value.name="pricepersq", variable.name="monthlydate")
kable(zillowdata[1:12], align='c')
| RegionID | RegionName | State | Metro | CountyName | SizeRank | monthlydate | pricepersq |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 6181 | New York | NY | New York | Queens | 1 | X1996.04 | NA |
| 12447 | Los Angeles | CA | Los Angeles-Long Beach-Anaheim | Los Angeles | 2 | X1996.04 | 110 |
| 17426 | Chicago | IL | Chicago | Cook | 3 | X1996.04 | 94 |
| 13271 | Philadelphia | PA | Philadelphia | Philadelphia | 4 | X1996.04 | 39 |
| 40326 | Phoenix | AZ | Phoenix | Maricopa | 5 | X1996.04 | 58 |
| 18959 | Las Vegas | NV | Las Vegas | Clark | 6 | X1996.04 | 76 |
| 54296 | San Diego | CA | San Diego | San Diego | 7 | X1996.04 | 112 |
| 33839 | San Jose | CA | San Jose | Santa Clara | 8 | X1996.04 | 147 |
| 25290 | Jacksonville | FL | Jacksonville | Duval | 9 | X1996.04 | 51 |
| 20330 | San Francisco | CA | San Francisco | San Francisco | 10 | X1996.04 | 201 |
| 32149 | Indianapolis | IN | Indianapolis | Marion | 11 | X1996.04 | NA |
| 10221 | Austin | TX | Austin | Travis | 12 | X1996.04 | NA |

After doing this, the only remaining data cleaning needed at the global level is to create a monthly date variable (stored as an actual R date variable, not a string), limit to California, remove records with a missing price per square foot, slightly clean up the variable storing the city name, and group San Francisco with San Mateo County (because otherwise San Francisco would not be in the tool at all, since it is the only city in a county named after itself). I also threw in a few stopifnot checks to make sure that there are no internal gaps in each city's time series. Because these pass, we know that our removal of NA records only trims the tail ends of each series; there are no missing values scattered in the middle of the time series.

zillowdata<-zillowdata[,monthlydate := as.Date(paste0(substring(monthlydate, 2),".01"), "%Y.%m.%d")][,month_year := format(monthlydate, "%Y-%m")]
setkey(zillowdata,RegionID, monthlydate)
zillowdata<-zillowdata[,diff := monthlydate - shift(monthlydate, fill=first(monthlydate)), by = RegionID]
zillowdata<-zillowdata[,gap := is.na(pricepersq) & 
                         !is.na(shift(pricepersq, fill=first(pricepersq), type="lag")) & 
                                  !is.na(shift(pricepersq, fill=first(pricepersq), type="lead")), by = RegionID]
## limit to CA
zillowdata<-zillowdata[State=="CA"]
## destroy NA records
zillowdata<-zillowdata[!is.na(pricepersq)]
## create cityname from region and state
zillowdata<-zillowdata[,city := str_replace_all(RegionName, "-", " ")]
zillowdata<-zillowdata[CountyName=="San Francisco" | CountyName=="San Mateo", CountyName:="San Mateo + San Francisco"]

## no internal gaps?
stopifnot(unique(zillowdata[,diff])<=31)
stopifnot(unique(zillowdata[,gap])==FALSE)

The second data source is the “30-Year Fixed Rate Mortgage Average in the United States” available through the St Louis Fed’s FRED Economic Data website and originally produced by Freddie Mac. This data will be used as an exogenous control in all of our VAR models. The time series itself is a percentage recorded weekly, and is not seasonally adjusted. The only major cleaning needed for this data is to limit it to the first available mortgage interest rate each month, since our Zillow data is monthly and this data in its raw form is weekly.

# the exogenous mortgage rate data (assumed already read in from the FRED CSV download).
mortgage<-data.table(mortgage)
mortgage<-mortgage[,DATE := as.Date(DATE, "%Y-%m-%d")][MORTGAGE30US!="."][,month_year := format(DATE, "%Y-%m")]
kable(mortgage[1:12], align='c')
DATE MORTGAGE30US month_year
1971-04-02 7.33 1971-04
1971-04-09 7.31 1971-04
1971-04-16 7.31 1971-04
1971-04-23 7.31 1971-04
1971-04-30 7.29 1971-04
1971-05-07 7.38 1971-05
1971-05-14 7.42 1971-05
1971-05-21 7.44 1971-05
1971-05-28 7.46 1971-05
1971-06-04 7.52 1971-06
1971-06-11 7.52 1971-06
1971-06-18 7.54 1971-06
# keep the first mortgage rate in the month
mortgage<-mortgage[,first_date := min(DATE), by=month_year][DATE==first_date][,first_date:=NULL]
kable(mortgage[1:12], align='c')
DATE MORTGAGE30US month_year
1971-04-02 7.33 1971-04
1971-05-07 7.38 1971-05
1971-06-04 7.52 1971-06
1971-07-02 7.54 1971-07
1971-08-06 7.66 1971-08
1971-09-03 7.71 1971-09
1971-10-01 7.67 1971-10
1971-11-05 7.59 1971-11
1971-12-03 7.49 1971-12
1972-01-07 7.46 1972-01
1972-02-04 7.35 1972-02
1972-03-03 7.32 1972-03
setnames(mortgage, "MORTGAGE30US", "mort_rate")

Analysis

For the remainder of this blog post, I will focus on just one pair of cities: Norwalk and La Mirada. I chose this pair for illustration purposes only. The process outlined for these two cities is identical for all city pairs available in the tool. If you are interested in the full code which performs the calculations for all 11,000+ city combinations, check out our GitHub.

Because we are using the vars package, we need to declare all three pieces (Norwalk, La Mirada, and the mortgage interest rate) as time series, then combine them into one object that covers the same period. This is actually best done using ts.intersect because it takes care of differences in time coverage automatically. Finally, we want to remove the mortgage interest rate series from the combined object and vectorize it. Again, this is only because of the way the data needs to enter the VAR command later on.

ts_norwalk<-ts(data=zillowdata[city=="Norwalk",pricepersq],start=c(as.numeric(format(min(zillowdata[city=="Norwalk",monthlydate]),"%Y")),
                                                  as.numeric(format(min(zillowdata[city=="Norwalk",monthlydate]),"%m"))), frequency=12)
ts_lamirada<-ts(data=zillowdata[city=="La Mirada",pricepersq],start=c(as.numeric(format(min(zillowdata[city=="La Mirada",monthlydate]),"%Y")),
                                                  as.numeric(format(min(zillowdata[city=="La Mirada",monthlydate]),"%m"))), frequency=12)
ts_mort<-ts(data=mortgage[,mort_rate], 
                start=c(as.numeric(format(min(mortgage[,DATE]),"%Y")),
                      as.numeric(format(min(mortgage[,DATE]),"%m"))), frequency=12)
## limit to just when they overlap.
together=ts.intersect(ts_norwalk, ts_lamirada, ts_mort)

together=data.table(together)

## vectorize mort rate. remove from df.
mort_vector <-together[,ts_mort]
together<-together[,ts_mort := NULL]

After this, we can move on to selecting the model. The tool uses the same algorithm for all city pairs, but it is worth noting that this is not good practice if one is interested in making inferences or predictions based on the resulting vector autoregression model (and if one is only computing a single model, not 11,000). It was done in this case mainly because individually evaluating each of the 11,000+ combinations was impossible. We will publish a later blog post that takes a longer, more thorough route to specifying a VAR model, employing cross-validation, structural break analysis and additional specification tests. For now, we will continue using the methodology employed in the tool itself on our La Mirada-Norwalk example.

The main thing we need to specify in the model is how many lags to use. To decide, we used the following algorithm, which is in line with the typical information-criteria approach to selecting lag length:

  1. Use the VARselect command in the vars package to calculate the best model under the AIC, HQ, SC, and FPE criteria.
  2. Select the lag length chosen by the most criteria (we define a mode function to do this). If there is a tie, prioritize parsimony and select the simpler model.

## use VARselect from the vars package to compare lag lengths under each criterion
library(vars)
choicemodel<-VARselect(together, lag.max=12, type="trend", season=3, exogen = cbind(mort_vector))
## define mode function.
Mode <- function(x) {
   ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}
choice<-Mode(cbind(choicemodel$selection))
## when there is a tie, use the more simple model (less lags)
choice<-min(choice)
var_model=VAR(together, p=choice, type="trend",season=3, exogen = cbind(mort_vector))
summary(var_model)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: ts_norwalk, ts_lamirada 
## Deterministic variables: trend 
## Sample size: 259 
## Log Likelihood: -731.103 
## Roots of the characteristic polynomial:
## 0.9755 0.9755 0.963 0.4951
## Call:
## VAR(y = together, p = choice, type = "trend", season = 3L, exogen = cbind(mort_vector))
## 
## 
## Estimation results for equation ts_norwalk: 
## =========================================== 
## ts_norwalk = ts_norwalk.l1 + ts_lamirada.l1 + ts_norwalk.l2 + ts_lamirada.l2 + trend + sd1 + sd2 + mort_vector 
## 
##                 Estimate Std. Error t value Pr(>|t|)    
## ts_norwalk.l1   1.735591   0.046875  37.026  < 2e-16 ***
## ts_lamirada.l1  0.237534   0.056703   4.189 3.88e-05 ***
## ts_norwalk.l2  -0.757709   0.046566 -16.272  < 2e-16 ***
## ts_lamirada.l2 -0.214021   0.058085  -3.685 0.000281 ***
## trend          -0.001971   0.002085  -0.945 0.345455    
## sd1            -0.029707   0.159841  -0.186 0.852712    
## sd2            -0.047697   0.159835  -0.298 0.765635    
## mort_vector    -0.046654   0.034892  -1.337 0.182399    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 1.051 on 251 degrees of freedom
## Multiple R-Squared:     1,   Adjusted R-squared:     1 
## F-statistic: 2.025e+06 on 8 and 251 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation ts_lamirada: 
## ============================================ 
## ts_lamirada = ts_norwalk.l1 + ts_lamirada.l1 + ts_norwalk.l2 + ts_lamirada.l2 + trend + sd1 + sd2 + mort_vector 
## 
##                 Estimate Std. Error t value Pr(>|t|)    
## ts_norwalk.l1   0.205634   0.047287   4.349 1.99e-05 ***
## ts_lamirada.l1  1.671200   0.057200  29.217  < 2e-16 ***
## ts_norwalk.l2  -0.218107   0.046975  -4.643 5.54e-06 ***
## ts_lamirada.l2 -0.660417   0.058595 -11.271  < 2e-16 ***
## trend           0.002011   0.002103   0.956    0.340    
## sd1            -0.071435   0.161245  -0.443    0.658    
## sd2            -0.078095   0.161238  -0.484    0.629    
## mort_vector     0.033951   0.035198   0.965    0.336    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 1.06 on 251 degrees of freedom
## Multiple R-Squared:     1,   Adjusted R-squared:     1 
## F-statistic: 2.1e+06 on 8 and 251 DF,  p-value: < 2.2e-16 
## 
## 
## 
## Covariance matrix of residuals:
##             ts_norwalk ts_lamirada
## ts_norwalk      1.1048      0.4572
## ts_lamirada     0.4572      1.1243
## 
## Correlation matrix of residuals:
##             ts_norwalk ts_lamirada
## ts_norwalk      1.0000      0.4103
## ts_lamirada     0.4103      1.0000

Statistical Aside

Before moving on, there is a huge caveat to the above output and methodology. The vector autoregression (VAR) model I ran is in levels. We have displayed the coefficients and the full output of the estimation command. You will see that none of this is displayed in the tool, and this was indeed purposeful. Significance, hypothesis testing, and many other inferences are not reliable when working with non-stationary time series in levels. It is necessary to use some method to make the data stationary before performing these inferences. One of the most prevalent methods is differencing. I chose not to difference the data for the impulse response functions because I was interested in the co-movements of the series in both the long and the short term, and differencing, for all its merits, does throw away some of that information.

A good way to think about this is that differencing is analogous to taking the derivative of a function. Take f(x) = x^4 − x^2 + 4, the derivative of which is f′(x) = 4x^3 − 2x. The derivative provides valuable information about the function itself, but it is not a complete picture of the original function. In fact, it is impossible to know that the constant, 4, was even in the original function without additional information. Indeed, this is why if we attempt to return to the original function using the Fundamental Theorem of Calculus, we are still left with ∫f′(x)dx = x^4 − x^2 + C, where C is unknown. In the same way, differencing can yield helpful information, but it is not without cost.
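
To make the analogy concrete, here is a minimal sketch in R (not part of the tool's code, using a made-up series) showing that differencing preserves the movements of a series while discarding its level, which cannot be recovered from the differences alone:

x <- c(100, 102, 105, 109, 114)   # a made-up level series
dx <- diff(x)                     # 2 3 4 5 -- the period-to-period changes
cumsum(dx)                        # 2 5 9 14 -- the shape is recoverable, the level is not
x[1] + cumsum(dx)                 # 102 105 109 114 -- recovering levels requires knowing x[1] (the "constant C")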

The question of whether to difference and then use a VAR, estimate a vector error correction model (VECM), or do something else is very situation-dependent, and based on my survey of online resources, many of the questions are not completely resolved in the literature. I found several Stack Exchange threads helpful in digging through the debates and literature. Here are a few threads:

  1. “Under cointegration, evidence is mixed as to whether VECMs yield better forecasts than unrestricted VAR models” The thread is here.
  2. “If I remember well, there are kind of contradictory results from the literature, Hoffman and Rasche saying advantages of VECM appear at a long horizon only, but Christoffersen and Diebold claiming you are fine with a VAR for long term…” The thread is here.
  3. Finally, the most helpful thread, which contained the best textbook reference I found:

“There is an issue of whether the variables in a VAR need to be stationary. Sims (1980) and Sims, Stock and Watson (1990) recommend against differencing even if the variables contain a unit root. They argued that the goal of a VAR analysis is to determine the interrelationships among the variables, not to determine the parameter estimates. The main argument against differencing is that it “throws away” information concerning the comovements in the data (such as the possibility of cointegrating relationships). Similarly, it is argued that the data need not be detrended. In a VAR, a trending variable will be well approximated by a unit root plus drift. However, majority view is that the form of variables in the VAR should mimic the true data-generating process. This is particularly true if the aim is to estimate a structural model.” –Applied Econometric Time Series, 3rd ed., by Walter Enders.

My final conclusion is that there are improvements that can be made over a VAR in levels, but as long as there is cointegration and I am NOT making inferences based on the model coefficients or standard errors (Granger causality, etc.), then my approach is fine. For the sake of covering all of my bases, I will show that both series are non-stationary, become stationary after differencing three times (i.e., they are I(3)), and are cointegrated.

Test for initial stationarity of La Mirada median home values per square foot:

library('urca')
library('tseries')   # adf.test() and kpss.test() come from tseries; PP.test() is in base stats
## establish that each series is non-stationary. use three tests: KPSS, ADF, Phillips-Perron
adf.test(together$ts_lamirada)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  together$ts_lamirada
## Dickey-Fuller = -2.7932, Lag order = 6, p-value = 0.2418
## alternative hypothesis: stationary
kpss.test(together$ts_lamirada)
## 
##  KPSS Test for Level Stationarity
## 
## data:  together$ts_lamirada
## KPSS Level = 4.1262, Truncation lag parameter = 3, p-value = 0.01
PP.test(together$ts_lamirada)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  together$ts_lamirada
## Dickey-Fuller = -0.93588, Truncation lag parameter = 5, p-value =
## 0.9475

Using a significance level of 0.05: for the Augmented Dickey-Fuller test, we fail to reject the null hypothesis of non-stationarity. For the KPSS test, we reject the null hypothesis of stationarity. For the Phillips-Perron test, we fail to reject the null hypothesis of non-stationarity. In summary, all three tests provide evidence that the La Mirada series is non-stationary.

Test for initial stationarity of Norwalk median home values per square foot:

adf.test(together$ts_norwalk)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  together$ts_norwalk
## Dickey-Fuller = -2.9467, Lag order = 6, p-value = 0.1772
## alternative hypothesis: stationary
kpss.test(together$ts_norwalk)
## 
##  KPSS Test for Level Stationarity
## 
## data:  together$ts_norwalk
## KPSS Level = 3.6827, Truncation lag parameter = 3, p-value = 0.01
PP.test(together$ts_norwalk)
## 
##  Phillips-Perron Unit Root Test
## 
## data:  together$ts_norwalk
## Dickey-Fuller = -0.98806, Truncation lag parameter = 5, p-value =
## 0.9391

The results are the same for the Norwalk series. All tests provide evidence of non-stationarity. Note that it is better practice to use the Bonferroni correction or something similar when doing multiple tests on multiple series. I am not extremely familiar with how to adapt the correction to situations where the null hypotheses are different (as is the case here), so I have not done so.
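
For what it is worth, here is a minimal, purely illustrative sketch (not used in the tool) of how base R's p.adjust could apply a Bonferroni correction to the two tests that do share a unit-root null (the ADF and Phillips-Perron tests); extending this to the KPSS test, whose null is reversed, is the part I have not attempted:

raw_p <- c(adf = adf.test(together$ts_norwalk)$p.value,
           pp  = PP.test(together$ts_norwalk)$p.value)
p.adjust(raw_p, method = "bonferroni")   # Bonferroni-adjusted p-values for the shared unit-root null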

I now test for the level of integration by looping through the tests until stationarity is achieved.

norwalk_test<-together$ts_norwalk
for (x in 1:10){
  # re-run all three tests on the current (possibly differenced) series
  a<-adf.test(norwalk_test)$p.value
  b<-kpss.test(norwalk_test)$p.value
  c<-PP.test(norwalk_test)$p.value
  if (a<0.05 & b>0.05 & c<0.05){
    print(paste0("Norwalk is stationary after differencing ", x, " times."))
    break
  }
  # difference again before the next pass (note that diff()'s second argument here is the lag)
  norwalk_test<-diff(norwalk_test, x)
}
## [1] "Norwalk is stationary after differencing 3 times."
lamirada_test<-together$ts_lamirada
for (x in 1:10){
  
  a<-adf.test(lamirada_test)$p.value
  b<-kpss.test(lamirada_test)$p.value
  c<-PP.test(lamirada_test)$p.value
  print(paste0("ADF p-value: ", a, " KPSS p-value: ", b, " PP p-value:", c))
  if (a<0.05 & b>0.05 & c<0.05){
    break
  }
  lamirada_test<-diff(lamirada_test, x)
}
## [1] "ADF p-value: 0.241821977731071 KPSS p-value: 0.01 PP p-value:0.947502871695296"
## [1] "ADF p-value: 0.554354480901373 KPSS p-value: 0.0566494767085807 PP p-value:0.199950193996418"
## [1] "ADF p-value: 0.01 KPSS p-value: 0.1 PP p-value:0.01"

Because both series have the same order of integration, I(3), we can now test for cointegration. We use the Engle-Granger two-step approach. More details on the procedure can be found here. Since we have I(3) variables, we can adjust the procedure slightly and say that our series are cointegrated if the residuals reject a unit root (that is, are stationary) in levels, first differences, or second differences, as opposed to requiring stationarity in levels only as in the Stack Exchange thread.

Performing the Engle-Granger 2-step approach:

testmodel<-lm(together$ts_lamirada~together$ts_norwalk)
adf.test(testmodel$residuals)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  testmodel$residuals
## Dickey-Fuller = -2.7109, Lag order = 6, p-value = 0.2765
## alternative hypothesis: stationary
adf.test(diff(testmodel$residuals))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(testmodel$residuals)
## Dickey-Fuller = -3.5106, Lag order = 6, p-value = 0.04221
## alternative hypothesis: stationary
adf.test(diff(diff(testmodel$residuals)))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(diff(testmodel$residuals))
## Dickey-Fuller = -9.8839, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

Because this is not a normal Augmented Dickey-Fuller test, but one being conducted as part of the Engle-Granger procedure, it is necessary to use a different distribution than the one built into the adf.test command to map the test statistics to a conclusion. The relevant 5% critical value for this two-variable case with no trend is -3.34 (James G. MacKinnon, 1990. "Critical Values for Cointegration Tests," Working Papers 1227, Queen's University, Department of Economics). At the 0.05 level, the non-differenced residuals fail to reject the null hypothesis of non-stationarity. However, the first-differenced residuals just barely reject the null hypothesis at the 0.05 level. Because the test statistic is so close to the critical value, we difference a second time. In this case the null hypothesis is again rejected, but by a much wider margin. There is then evidence that the residuals are integrated of order 1 or 2, both of which are less than the original order of integration of the two series, 3. All of this can be summarized as follows:
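
For clarity, here is a small sketch of the comparison just described, simply copying the Dickey-Fuller statistics reported above and the 5% MacKinnon critical value:

eg_stats <- c(levels = -2.7109, first_diff = -3.5106, second_diff = -9.8839)  # from the adf.test output above
crit_5pct <- -3.34                     # MacKinnon (1990), two variables, no trend, 5% level
eg_stats < crit_5pct                   # TRUE = reject the unit-root null for the residuals at that differencing level
##      levels  first_diff second_diff 
##       FALSE        TRUE        TRUE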

The La Mirada and Norwalk Zillow Median Home Values per Square Foot are non-stationary, integrated of order 3, and cointegrated. This is evidence that for these two cities, our VAR in levels methodology is justifiable for the purposes of forecasts and the impulse response functions.

Returning to the Explanation of the Tool

After justifying our methodology (somewhat), we can return to the final step used in the CTCHVIM: calculating the impulse response functions. The code used to calculate them is below.

irf<-irf(var_model, n.ahead=12, impulse="ts_lamirada", seed=99418, runs=250, ortho=FALSE)
irf2<-irf(var_model, n.ahead=12, impulse="ts_norwalk", seed=99418, runs=250, ortho=FALSE)

The forecast horizon I use is 12 (12 months, so 1 year). The irf command's default number of bootstrap runs for the error bands is fairly low, and I found that sensible error bounds were better achieved using a higher number of iterations. The full code (which will be available on GitHub shortly) uses 500 runs for the irf2 line. This was an oversight: the original number of runs was 500 for both, but that proved computationally cumbersome and was subsequently reduced for one of the two irf calls. ortho is set to FALSE, as the orthogonalized impulse responses are not calculated for the tool. The impulse response functions are graphed below:

plot(irf)

plot(irf2)
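
As an aside, the numbers behind those plots can be pulled straight out of the objects returned by irf (objects of class "varirf" from the vars package), in case you want the responses and bands rather than the pictures:

head(irf$irf$ts_lamirada)     # point estimates: responses of both cities to the La Mirada shock
head(irf$Lower$ts_lamirada)   # lower bootstrapped error band
head(irf$Upper$ts_lamirada)   # upper bootstrapped error band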

It is worth noting that the impulse response functions themselves are just a special kind of prediction, and the error bands are bootstrapped rather than based on the coefficient standard errors (which, as I stated earlier, cannot be trusted because the VAR is in levels on non-stationary data).

So the tool itself is mainly just two loops repeating these steps for all city combinations within each county in California; a simplified, hypothetical sketch of that outer loop is included at the very end of this post. The user interface is a Shiny application that loads the saved data. The code for these parts will be provided in full on I2's GitHub. Thanks for reading, and feel free to reach out to me or comment if you have questions, comments or concerns!
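
Finally, for readers curious about what those two loops look like, here is a simplified, hypothetical sketch of the outer structure. The real code on GitHub also builds the time series, fits the models and saves the impulse responses for each pair; this skeleton only records which within-county pairs get processed:

results <- list()
for (cty in unique(zillowdata[, CountyName])) {
  cities <- unique(zillowdata[CountyName == cty, city])
  if (length(cities) < 2) next
  for (pair in combn(cities, 2, simplify = FALSE)) {
    ## in the real code: build the two ts objects plus the mortgage series,
    ## run ts.intersect(), VARselect()/VAR() and irf() exactly as above
    results[[paste(pair, collapse = " - ")]] <- pair
  }
}
length(results)   # one entry per within-county city pair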
