Real Estate Dictionaries Across Space and Time


Leveraging high-dimensional variable selection methods, we show the textual information provided in real estate agents’ remarks about a property can be used to address spatial and temporal heterogeneity in housing markets. Including the textual information in the pricing model decreases in-sample prediction errors by as much as 18.7% at the MSA-level and 39.1% at the zip code level. These results are robust to transforming the raw text using a real estate specific word list, the choice of n-grams, word stemming, and heteroscedasticity in the hedonic and repeat-sales models. These findings suggest the raw text in the remarks can be included directly in predictive pricing models.


  1.

    For example, Sirmans et al. (2006) run a meta analysis that examines the relationship between house prices and nine housing characteristics that are commonly included in hedonic pricing models. The authors find the estimated coefficients for some characteristics vary significantly by geographical location, but not across time.

  2.

    For example, “vice” and “crude” are negative in everyday use but neutral when they appear in “vice president” or “crude oil” in 10-K reports.

  3.

    The phrase “realtor speak” is analogous to the phrase “netspeak” in Crystal (2001), which refers to abbreviations and acronyms that are often used on the internet to speed up the typing of messages.

  4.

    Throughout the paper we emphasize the importance of transparency when performing the transformation, tokenization, and variable selection processes. For this reason, we provide the word lists and an R package that performs the variable selection processes employed in this paper.

  5.

    The authors of this study spent more than 50 hours manually identifying realtor speak in the public remarks for one city (Atlanta).

  6.

    The decision to use the textual information to manually filter the data may have been partially driven by the small sample size (347 repeat-sales pairs) in their study.

  7.

    The penalty on g implies LASSO is not invariant to linear transformations of the \(w_{nt,k}\). To address this lack of invariance, and without loss of generality, LASSO (i) normalizes each \(w_{nt,k}\) to have unit variance, (ii) solves for the associated standardized beta coefficients, and (iii) undoes this normalization.

  8.

    Define the set of strong predictors as the set of all predictors that have a sufficiently large effect size (Bühlmann and Van De Geer 2011). Given that these tokens are not highly correlated (i.e., the tokens are not interchangeable), \(\hat{S}\) will contain all tokens that are strong predictors of \(p_{nt}\) with high probability. This may also be viewed as a variable screening technique for hedonic models.

  9.

    We use the wordStem function in the SnowballC package in R to stem each version of the remarks.

  10.

    The traditional repeat-sales methodology in Bailey et al. (1963) and Case and Shiller (1989) uses differenced sale prices as the dependent variable to create home price indexes. When including property fixed effects we use the log of the level sale price as the dependent variable. The two approaches produce identical results when each house sells exactly twice. Nowak and Smith (2019) demonstrate how textual information can be incorporated using the traditional repeat-sales methodology to create quality-adjusted home price indexes.

  11.

    Nowak and Smith (2019) use textual analysis to show local (zip code level) home price indexes are biased under the constant quality assumption. However, the repeat-sales methodology employed in Nowak and Smith (2019) mitigates the spatial heterogeneity we are interested in documenting in this section.

  12.

    The Internet Appendix provides results showing our approach also improves out-of-sample predictive performance over time. Not surprisingly, the improvements to the out-of-sample predictive performance over space are much more modest due to the highly localized nature of real estate markets.
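The three-step normalization described in note 7 can be illustrated with a minimal Python sketch. This is not the authors' R package: the function names (`soft_threshold`, `lasso_cd`, `standardized_lasso`), the coordinate-descent solver, the simulated token indicators, and the penalty value are all assumptions chosen for illustration. The sketch rescales one token column by a factor of 10 and checks that, after standardizing, solving, and undoing the normalization, the fitted contribution of that token is unchanged.

```python
import random

def soft_threshold(z, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1.
    Assumes y and the columns of X are already centered."""
    n, k_total = len(X), len(X[0])
    beta = [0.0] * k_total
    resid = list(y)  # residuals when all coefficients are zero
    col_sq = [sum(row[k] ** 2 for row in X) / n for k in range(k_total)]
    for _ in range(n_iter):
        for k in range(k_total):
            if col_sq[k] == 0.0:
                continue  # constant column carries no information
            # Covariance of column k with the partial residual.
            rho = sum(X[i][k] * resid[i] for i in range(n)) / n + col_sq[k] * beta[k]
            new = soft_threshold(rho, lam) / col_sq[k]
            step = new - beta[k]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= X[i][k] * step
                beta[k] = new
    return beta

def standardized_lasso(W, prices, lam):
    """(i) scale each column to unit variance, (ii) solve, (iii) undo the scaling."""
    n, k_total = len(W), len(W[0])
    p_bar = sum(prices) / n
    y = [v - p_bar for v in prices]
    means = [sum(row[k] for row in W) / n for k in range(k_total)]
    sds = [(sum((row[k] - means[k]) ** 2 for row in W) / n) ** 0.5
           for k in range(k_total)]
    # (i) center and normalize each column
    Z = [[(row[k] - means[k]) / sds[k] if sds[k] > 0 else 0.0
          for k in range(k_total)] for row in W]
    # (ii) solve for the standardized coefficients
    beta_std = lasso_cd(Z, y, lam)
    # (iii) map back to the original units of each column
    return [b / s if s > 0 else 0.0 for b, s in zip(beta_std, sds)]

# Simulated data (hypothetical): two 0/1 token indicators and a log price.
random.seed(0)
n_obs = 200
W = [[float(random.random() < 0.5), float(random.random() < 0.1)]
     for _ in range(n_obs)]
prices = [0.3 * w[0] - 0.8 * w[1] + random.gauss(0, 0.1) for w in W]

beta = standardized_lasso(W, prices, lam=0.01)

# Rescale the first token column by 10; its coefficient should shrink by 10.
W_scaled = [[10.0 * w[0], w[1]] for w in W]
beta_scaled = standardized_lasso(W_scaled, prices, lam=0.01)
```

Under these assumptions, `10 * beta_scaled[0]` matches `beta[0]` up to floating-point error, so the selected tokens and their fitted contributions do not depend on the units in which a token variable is recorded.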


  1. Antweiler, W., & Frank, M.Z. (2004). Is all that talk just noise? The information content of internet stock message boards. The Journal of Finance, 59(3), 1259–1294.

  2. Bailey, M.J., Muth, R.F., Nourse, H.O. (1963). A regression method for real estate price index construction. Journal of the American Statistical Association, 58(304), 933–942.

  3. Belloni, A., Chen, D., Chernozhukov, V., Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6), 2369–2429.

  4. Ben-David, I. (2011). Financial constraints and inflated home prices during the real estate boom. American Economic Journal: Applied Economics, 3(3), 55–87.

  5. Bogin, A., Doerner, W., Larson, W. (2019). Local house price paths: accelerations, declines, and recoveries. The Journal of Real Estate Finance and Economics, 58(2), 201–222.

  6. Bühlmann, P., & Van De Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.

  7. Case, K.E., & Shiller, R.J. (1989). The efficiency of the market for single-family homes. The American Economic Review, 79(1), 125–137.

  8. Crystal, D. (2001). Language and the internet. Cambridge: CUP.

  9. Das, S.R., & Chen, M.Y. (2007). Yahoo! for Amazon: sentiment extraction from small talk on the web. Management Science, 53(9), 1375–1388.

  10. Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.

  11. Frazier, K.B., Ingram, R.W., Tennyson, B.M. (1984). A methodology for the analysis of narrative accounting disclosures. Journal of Accounting Research, 22(1), 318–331.

  12. Gatzlaff, D.H., & Haurin, D.R. (1998). Sample selection and biases in local house value indices. Journal of Urban Economics, 43(2), 199–222.

  13. Gertheiss, J., & Tutz, G. (2010). Sparse modeling of categorical explanatory variables. The Annals of Applied Statistics, 4(4), 2150–2180.

  14. Goodwin, K., Waller, B., Weeks, H.S. (2014). The impact of broker vernacular in residential real estate. Journal of Housing Research, 23(2), 143–161.

  15. Haag, J., Rutherford, R., Thomson, T. (2000). Real estate agent remarks: help or hype? Journal of Real Estate Research, 20(1-2), 205–215.

  16. Hill, R.C., Knight, J.R., Sirmans, C.F. (1997). Estimating capital asset price indexes. The Review of Economics and Statistics, 79(2), 226–233.

  17. Levitt, S.D., & Syverson, C. (2008). Market distortions when agents are better informed: the value of information in real estate transactions. The Review of Economics and Statistics, 90(4), 599–611.

  18. Liu, C., Nowak, A., Smith, P. (2019). Asymmetric or incomplete information about asset values? The Review of Financial Studies. forthcoming.

  19. Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35–65.

  20. Loughran, T., & McDonald, B. (2016). Textual analysis in accounting and finance: a survey. Journal of Accounting Research, 54(4), 1187–1230.

  21. Luchtenberg, K.F., Seiler, M.J., Sun, H. (2018). Listing agent signals: does a picture paint a thousand words? The Journal of Real Estate Finance and Economics, 1–32.

  22. Miller, N., Sah, V., Sklarz, M. (2018). Estimating property condition effect on residential property value: evidence from US home sales data. Journal of Real Estate Research, 40(2), 179–198.

  23. Nowak, A., & Smith, P. (2017). Textual analysis in real estate. Journal of Applied Econometrics, 32(4), 896–918.

  24. Nowak, A., & Smith, P. (2019). Quality-adjusted house price indexes. SSRN Working Paper #3424240.

  25. Rutherford, J., Rutherford, R.C., Strom, E., Wedge, L. (2016). The subsequent market value of former REO properties. Real Estate Economics.

  26. Sirmans, G.S., MacDonald, L., Macpherson, D.A., Zietz, E.N. (2006). The value of housing characteristics: a meta analysis. The Journal of Real Estate Finance and Economics, 33(3), 215–240.

  27. Soyeh, K.W., Wiley, J.A., Johnson, K.H. (2014). Do buyer incentives work for houses during a real estate downturn? The Journal of Real Estate Finance and Economics, 48(2), 380–396.

  28. Tetlock, P.C. (2007). Giving content to investor sentiment: the role of media in the stock market. The Journal of Finance, 62(3), 1139–1168.

  29. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.



We thank Brian Chew and the Georgia Multiple Listing Service for providing access to the MLS data used in this study. The code used to create the real estate dictionaries is available in an R package on the authors’ personal websites and

Author information



Corresponding author

Correspondence to Patrick S. Smith.

About this article

Cite this article

Nowak, A.D., Price, B.S. & Smith, P.S. Real Estate Dictionaries Across Space and Time. J Real Estate Finan Econ 62, 139–163 (2021).


  • House prices
  • Machine learning
  • Real estate dictionary
  • Textual analysis