Location identification for real estate investment using data analytics


The modeling and control of complex systems, such as transportation, communication, power grids or real estate, requires vast amounts of data to be analyzed. The models of such systems typically involve a large number of variables, often a few hundred or even thousands. Computing the relationships between these variables, extracting the dominant variables and predicting their temporal and spatial dynamics are the general focuses of data analytics research. Statistical modeling and artificial intelligence have emerged as crucial enablers of solutions to these problems. The problem of real estate investment involves social, governmental, environmental and financial factors. Existing work on real estate investment focuses predominantly on predicting house-price trends exclusively from financial factors. In practice, real estate investment is influenced by all of the factors stated above, and computing an optimal choice is a multivariate optimization problem that lends itself naturally to machine learning-based solutions. In this work, we focus on setting up a machine learning framework to identify an optimal location for investment, given the preference set of an investor. In this paper, we restrict the problem to direct real estate factors (bedroom type, garage spaces, etc.); indirect factors, such as social and governmental ones, will be incorporated into the same framework in future work. Two solution approaches are presented: in the first, decision trees and principal component analysis (PCA) with K-means clustering compute optimal locations; in the second, PCA is replaced by artificial neural networks, and the two methods are contrasted. To the best of our knowledge, this is the first work to introduce a machine learning framework that incorporates all realistic parameters influencing a real estate investment decision. The algorithms are verified on the real estate data available in the TerraFly platform.




  1.

    Since machine learning is a method under the umbrella of data analytics, the phrase "machine learning framework" is used interchangeably with "data analytics framework" in this paper.

  2.

    For example, if the number of bedrooms in a property is an attribute, a user can specify the desired number of bedrooms.

  3.

    Here a user does not select a specific landmark but rather a cluster, which is a group of landmarks.

  4.

    \(\chi \) is simply a representative of a condominium, obtained by summing two numbers; it is not a predicted value.

  5.

    Here the attribute linked to a condominium has data for all the units available in that condominium. A proper entry for these units is sometimes unavailable: entries may contain NAs, incomplete words, typographical errors, and so on. These improper entries are removed, and the ratio of valid data points to the total number of units in that condominium is computed. All the attributes associated with a condominium are available as a downloadable .csv file, with the condominium's units as the rows and the attributes as the columns.

  6.

    // represents a comment.

  7.

    We have restricted our work to these three discrete distributions; the rest will be considered in our future work. It is intuitive that the data do not follow a geometric distribution.

  8.

    In Fig. 4, node B is split into the children nodes {FF}. In our case the children in this figure are identical, but a general decision tree may have nonidentical children as well; hence, in Procedure 1 we consider the general scenario without distinguishing identical from nonidentical children. In addition, the occurrences of 1's and 0's in Table 3 need not cover all possible user cases; they vary with the users' responses. For example, if no user is interested in the number of full baths, the entire column is filled with 0's.

  9.

    We call the probabilities of truths and falses of a child node, together with the probabilities of landmark occurrences in the target nodes, the system probabilities, the system being the decision tree.

  10.

    Time complexity is the measure of the time the tree takes to arrive at a leaf node from the root node.

  11.

    Space complexity is the measure of the program size and the data size that occupies the memory of a system.

  12.

    Pre-order: the root is processed first, then the left and right subtrees. Post-order: the left subtree is processed first, then the right subtree and finally the root. In-order: the left subtree is processed, then the root and finally the right subtree. Level-order: processing starts at the root, proceeds to the nodes in the next level, and continues until the traversal reaches the leaf nodes.

  13.

    This deals with the complexity involved in the communication between nodes in a tree.

  14.

    In our case, there are nine landmarks; hence, \(c=9\) and \(p_{1}, p_{2}\ldots \) are the probabilities of their occurrences.

  15.

    Let us consider the truth table in Table 3. For the attribute \({ Number}~of~{ Beds}\), \(p_\mathrm{t}= \frac{4}{8}=0.5,p_\mathrm{f}=\frac{4}{8}=0.5\). The landmarks associated with the falses are: James Ave, West Ave, Lincoln CT, Lincoln Rd; similarly, the landmarks associated with the truths are: Bay Rd, Alton Rd, Lincoln CT, Lincoln Rd. Hence, \(p_j= \{\frac{1}{4},\frac{1}{4},\frac{1}{4},\frac{1}{4}\}\) and same for \(p_k\).

  16.


  17.

    The parent node splits with the truth probability of \(p_\mathrm{t}\) and false probability of \(p_\mathrm{f}\).
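The four tree-traversal orders defined in Footnote 12 can be sketched in Python (a minimal illustration; the `Node` class and the sample tree are our own, not taken from the paper):

```python
from collections import deque

class Node:
    """Minimal binary-tree node, used only to illustrate the traversal orders."""
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def pre_order(n):
    # Root first, then the left subtree, then the right subtree.
    return [n.value] + pre_order(n.left) + pre_order(n.right) if n else []

def in_order(n):
    # Left subtree first, then the root, then the right subtree.
    return in_order(n.left) + [n.value] + in_order(n.right) if n else []

def post_order(n):
    # Left subtree, then the right subtree, and the root last.
    return post_order(n.left) + post_order(n.right) + [n.value] if n else []

def level_order(root):
    # Root level first, then each deeper level in turn.
    out, queue = [], deque([root])
    while queue:
        n = queue.popleft()
        if n:
            out.append(n.value)
            queue.extend((n.left, n.right))
    return out

tree = Node("A", Node("B", Node("D"), Node("E")), Node("C"))
print(level_order(tree))  # -> ['A', 'B', 'C', 'D', 'E']
```

For the decision trees discussed in this paper, level-order corresponds to reading the tree one layer of attribute questions at a time.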



Author information



Corresponding author

Correspondence to E. Sandeep Kumar.



Appendix A

Procedure 1

Let \(\mathscr {F}=\{f_{1},f_{2},f_{3}\ldots f_{n}\}\) be the set of features (attributes), \(\forall \mathscr {F}\in \mathbb {R}^{n}\). A feature \(f_{*}\) is called a root of D if its information gain attains the supremum over the feature set: \(\mathrm{IG}|_{f_{*}}=\sup _{f_{j}\in \mathscr {F}}\mathrm{IG}|_{f_{j}}\).

Steps. Let \(\mathscr {F}=\{f_{1},f_{2},f_{3}\ldots f_{n}\}\) be the set of features \(\forall \mathscr {F}\in \mathbb {R}^{n}.\) Let the randomness in any variable be defined by entropy:

$$\begin{aligned} H=-p\mathrm{log}_{2}p \end{aligned}$$

where p is the probability of occurrences of instances in the column of a truth table.

Let the target be \(\tau =\{p_{1},p_{2}\ldots p_{c}\}\), where c is the number of classes.Footnote 14

Let D be the decision tree \(\ni D:\mathscr {F}\rightarrow \tau \); we find the root of D.

We find the information before the split of a parent node (in our case, the output column) by

$$\begin{aligned} I_\mathrm{BS}=-p_{1}\mathrm{log}_{2}p_{1}-p_{2}\mathrm{log}_{2}p_{2} -\cdots -p_{c}\mathrm{log}_{2}p_{c}=\sum \limits _{d=1}^c-p_{d} \mathrm{log}_{2} p_{d} \end{aligned}$$

Consider the feature \(f_{i}\in \mathscr {F}\) taking two values (1 or 0). The net information of the children nodes is given by

$$\begin{aligned} I_{AS}= p_\mathrm{t}\bigg (\sum \limits _{j=1}^c -p_{j} \mathrm{log}_{2}p_{j}\bigg )+ p_\mathrm{f}\bigg (\sum \limits _{k=1}^c -p_{k}\mathrm{log}_{2}p_{k}\bigg ) \end{aligned}$$

Let the truth occurrences in the children (the split probability of truths) be \(p_\mathrm{t}\) and that for the falses be \(p_\mathrm{f}\). Let \(p_{j}\) and \(p_{k}\) be the probabilities of the target accompanying the truths and the falses, respectively.Footnote 15 Every instance in Eq. (12) is written according to the entropy in (10). The total information gain is obtained by subtracting (12) from (11). Therefore, \(I_\mathrm{g}=I_\mathrm{BS}-I_\mathrm{AS}\)

$$\begin{aligned} I_\mathrm{g}= & {} -\sum \limits _{d=1}^cp_{d} \mathrm{log}_{2} p_{d}+p_\mathrm{t}\bigg (\sum \limits _{j=1}^c p_{j} \mathrm{log}_{2}p_{j}\bigg )\nonumber \\&+p_\mathrm{f}\bigg (\sum \limits _{k=1}^c p_{k}\mathrm{log}_{2}p_{k}\bigg ) \end{aligned}$$

The following conditions are applied throughout the root identification process.

  • \(0\le p_\mathrm{t}\le 1,0\le p_\mathrm{f}\le 1\ni p_\mathrm{t}+p_\mathrm{f}=1\)

  • \(\Bigg \{ 0\le \sum \nolimits _{j=1}^cp_{j}\le 1, 0\le \sum \nolimits _{k=1}^cp_{k}\le 1\Bigg \} \ni \Bigg \{\sum \nolimits _{j=1}^cp_{j}+\sum \nolimits _{k=1}^cp_{k}= \sum \nolimits _{d=1}^c p_d\Bigg \}\)

  • \(\sum \nolimits _{d=1}^cp_{d}=1\)
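Under these constraints, Eqs. (11)–(13) can be sketched numerically (a minimal illustration with our own helper names and example values; \(0\,\mathrm{log}_{2}\,0\) is taken as 0, as usual, and the class masses follow the paper's convention \(p_{j}+p_{k}=p_{d}\)):

```python
from math import log2

def entropy_terms(ps):
    """Sum of -p*log2(p) over class masses, with 0*log2(0) taken as 0."""
    return -sum(p * log2(p) for p in ps if p > 0)

def information_gain(p_d, p_t, p_j):
    """I_g = I_BS - I_AS for a binary split, in the paper's convention:
    p_d -- target class probabilities (summing to 1)
    p_t -- probability of truths of the split attribute
    p_j -- class masses accompanying the truths (so p_k = p_d - p_j)"""
    p_k = [a - b for a, b in zip(p_d, p_j)]
    i_bs = entropy_terms(p_d)                                         # Eq. (11)
    i_as = p_t * entropy_terms(p_j) + (1 - p_t) * entropy_terms(p_k)  # Eq. (12)
    return i_bs - i_as                                                # Eq. (13)

p_d = [0.1, 0.1, 0.8]
print(round(information_gain(p_d, 1.0, p_d), 4))        # Case 2 (p_t=1, p_j=p_d): 0.0
print(round(information_gain(p_d, 1.0, [0, 0, 0]), 4))  # Case 1 (p_t=1, p_j=0): 0.9219
```

The two printed values match Cases 1 and 2 below: a split that reproduces the parent distribution gains nothing, while the (contradictory) configuration of Case 1 attains the parent entropy 0.9219.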

Let us now identify the root node (the feature with the highest information gain) by induction on Eq. (13), for the five cases and their variants (eleven in total) discussed prior:

Case 1: When \(p_\mathrm{t}=1,p_{j}=0 \bigvee p_\mathrm{t}=0,p_{j}=p_{d}\) with \(p_{j}+p_{k}=p_{d},p_\mathrm{t}+p_\mathrm{f}=1\).

If \(p_{j}=0\), we have \(p_{k}=p_{d}\); substituting in (13) and changing the limits, we have:

$$\begin{aligned} I_\mathrm{g}= & {} -\sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d}+(1-p_\mathrm{t}) \sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d}\nonumber \\= & {} -\sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d} \end{aligned}$$

Case 2: When \(p_\mathrm{t}=1,p_{j}=p_{d} \bigvee p_\mathrm{t}=0,p_{j}=0\) with \(p_{j}+p_{k}=p_{d},p_\mathrm{t}+p_\mathrm{f}=1\).

If \(p_{j}=p_{d}\), we have \(p_{k}=0\); substituting in (13) and changing the limits we have:

$$\begin{aligned} I_\mathrm{g}=-\sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d} +\sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d}=0 \end{aligned}$$

Case 3: When \(p_\mathrm{t}=1,p_{j}=p_{k}\bigvee p_\mathrm{t}=0,p_{j}=p_{k}\) with \(p_{j}+p_{k}=p_{d},p_\mathrm{t}+p_\mathrm{f}=1\).

If \(p_{j}=p_{k}\), we have \(p_{k}=\frac{p_{d}}{2}\); substituting in (13) and on further simplification, we have:

$$\begin{aligned} I_\mathrm{g}=-\sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d} +\frac{1}{2}\sum \limits _{d=1}^c\bigg (p_{d}\mathrm{log}_{2}p_{d}-p_{d}\bigg ) =-\frac{1}{2}\sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d}-\frac{1}{2}\sum \limits _{d=1}^c p_{d}\nonumber \\ \end{aligned}$$

Case 4: When \(0<p_\mathrm{t}<\frac{1}{2},p_{j}=p_{d}\) with \(p_{j}+p_{k}=p_{d},p_\mathrm{t}+p_\mathrm{f}=1\)

If \(p_{j}=p_{d}\), then \(p_{k}=0\); substituting in (13) and on further simplification, we have:

$$\begin{aligned} I_\mathrm{g}= & {} -\sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d}+p_\mathrm{t} \sum \limits _{d=1}^c\bigg (p_{d}\mathrm{log}_{2}p_{d}\bigg )\nonumber \\= & {} (p_\mathrm{t}-1)\sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d}> -\frac{1}{2} \sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d} \end{aligned}$$

Case 5: When \(0<p_\mathrm{t}<\frac{1}{2},p_{j}=0\) with \(p_{j}+p_{k}=p_{d},p_\mathrm{t}+p_\mathrm{f}=1\)

If \(p_{j}=0\), then \(p_{k}=p_{d}\); substituting in (13) and on further simplification, we have:

$$\begin{aligned} I_\mathrm{g}= & {} -\sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d}+(1-p_\mathrm{t}) \sum \limits _{d=1}^c\bigg (p_{d}\mathrm{log}_{2}p_{d}\bigg )\nonumber \\= & {} (-p_\mathrm{t})\sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d} \end{aligned}$$

Case 6: When \(0<p_\mathrm{t}<\frac{1}{2},p_{j}=p_{k}\) with \(p_{j}+p_{k}=p_{d},p_\mathrm{t}+p_\mathrm{f}=1\)

If \(p_{j}=p_{k}\), then \(p_{k}=\frac{p_{d}}{2}\); substituting in (13) and on further simplification, we have:

$$\begin{aligned} I_\mathrm{g}= & {} -\sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d} +\frac{1}{2}\sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d} -\frac{1}{2}\sum \limits _{d=1}^c p_{d}\nonumber \\= & {} -\frac{1}{2}\sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d} -\frac{1}{2}\sum \limits _{d=1}^c p_{d}<0 \end{aligned}$$

Case 7: When \(p_\mathrm{t}>\frac{1}{2},p_{j}=0\) with \(p_{j}+p_{k}=p_{d},p_\mathrm{t}+p_\mathrm{f}=1\)

If \(p_{j}=0\), we have \(p_{k}=p_{d}\); substituting in (13) and changing the limits, we have:

$$\begin{aligned} I_\mathrm{g}= & {} -\sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d}+(1-p_\mathrm{t}) \sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d}\nonumber \\= & {} -p_\mathrm{t}\sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d}> -\frac{1}{2} \sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d} \end{aligned}$$

Case 8: When \(p_\mathrm{t}>\frac{1}{2},p_{j}=p_{d}\) with \(p_{j}+p_{k}=p_{d},p_\mathrm{t}+p_\mathrm{f}=1\)

If \(p_{j}=p_{d}\), we have \(p_{k}=0\); substituting in (13) and changing the limits, we have:

$$\begin{aligned} I_\mathrm{g}= & {} -\sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d}+p_\mathrm{t} \sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d}\nonumber \\= & {} (p_\mathrm{t}-1)\sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d}< -\frac{1}{2} \sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d} \end{aligned}$$

Case 9: When \(p_\mathrm{t}>\frac{1}{2},p_{j}=p_{k}\) with \(p_{j}+p_{k}=p_{d},p_\mathrm{t}+p_\mathrm{f}=1\)

If \(p_{j}=p_{k}\), then \(p_{k}=\frac{p_{d}}{2}\); substituting in (13) and on further simplification, we have:

$$\begin{aligned} I_\mathrm{g}=-\frac{1}{2}\sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d} -\frac{1}{2}\sum \limits _{d=1}^cp_{d}<0 \end{aligned}$$

Case 10: When \(p_\mathrm{t}=\frac{1}{2},p_{j}=p_{d} \bigvee p_\mathrm{t}=\frac{1}{2},p_{j}=0\) with \(p_{j}+p_{k}=p_{d},p_\mathrm{t}+p_\mathrm{f}=1\)

$$\begin{aligned} I_\mathrm{g}= & {} -\sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d}+(1-\frac{1}{2}) \sum \limits _{d=1}^c\bigg (p_{d}\mathrm{log}_{2}p_{d}\bigg )\nonumber \\= & {} -\frac{1}{2}\sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d} \end{aligned}$$

Case 11: When \(p_\mathrm{t}=\frac{1}{2},p_{j}=p_{k}\) with \(p_{j}+p_{k}=p_{d},p_\mathrm{t}+p_\mathrm{f}=1\)

$$\begin{aligned} I_\mathrm{g}=-\frac{1}{2}\sum \limits _{d=1}^cp_{d}\mathrm{log}_{2}p_{d} -\frac{1}{2}\sum \limits _{d=1}^cp_{d}<0 \end{aligned}$$

For the remaining conditions, we can apply (13) directly to obtain the information gain. Let us analyze the above cases. The conditions used to obtain (14) contradict one another, i.e., \(p_\mathrm{t}=1\) and \(p_{j}=0\) cannot hold at the same time; hence, this case can never occur in a decision tree. The \(I_\mathrm{g}\) in Eqs. (17) and (20) are optimal for the information gain and best suited for the decision tree operation. In the rest of the cases, the probability conditions either do not occur, due to contradiction, or do not lead to the maximum information gain.
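The root choice of Procedure 1 — take the attribute whose split attains the supremum of the information gain — can be sketched as follows (a minimal illustration; the toy truth table and the column names `beds`, `garage`, `landmark` are hypothetical, not the paper's Table 3):

```python
from math import log2

def entropy_terms(ps):
    """Sum of -p*log2(p) over class masses, with 0*log2(0) taken as 0."""
    return -sum(p * log2(p) for p in ps if p > 0)

def gain(rows, attr, target):
    """Information gain of splitting on boolean column `attr`; class masses
    are kept relative to the parent (so p_j + p_k = p_d, as in the text)."""
    n = len(rows)
    def masses(subset):
        counts = {}
        for r in subset:
            counts[r[target]] = counts.get(r[target], 0) + 1
        return [c / n for c in counts.values()]
    truths = [r for r in rows if r[attr]]
    falses = [r for r in rows if not r[attr]]
    p_t = len(truths) / n
    return (entropy_terms(masses(rows))
            - p_t * entropy_terms(masses(truths))
            - (1 - p_t) * entropy_terms(masses(falses)))

def root(rows, attrs, target):
    """The root of D is the feature attaining the supremum of the gains."""
    return max(attrs, key=lambda a: gain(rows, a, target))

# Toy truth table: `beds` separates the landmarks perfectly, `garage` does not.
table = [
    {"beds": True,  "garage": True,  "landmark": "Bay Rd"},
    {"beds": True,  "garage": False, "landmark": "Bay Rd"},
    {"beds": False, "garage": True,  "landmark": "West Ave"},
    {"beds": False, "garage": False, "landmark": "West Ave"},
]
print(root(table, ["beds", "garage"], "landmark"))  # -> beds
```

Note that the perfectly separating attribute attains a gain of half the parent entropy here, which is consistent with Case 10 above (\(p_\mathrm{t}=\frac{1}{2}\), \(p_{j}=p_{d}\)).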

Relation between information gain \(I_\mathrm{g}\) and entropy \(H_{s}\) (a general result)

Let us denote the overall entropy (combined entropy of parent and children) as \(H_{s}\). We find a relation between \(H_s\) and \(I_\mathrm{g}\).

$$\begin{aligned} H_{s}= -\sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d}-p_\mathrm{t} \sum \limits _{j=1}^c p_{j}\mathrm{log}_{2}p_{j}-p_\mathrm{f} \sum \limits _{k=1}^c p_{k}\mathrm{log}_{2}p_{k} \end{aligned}$$

When we add Eq. (13) and the above expression for \(H_{s}\), we get:

$$\begin{aligned} I_\mathrm{g}+H_{s}=-2\sum \limits _{d=1}^c p_{d}\mathrm{log}_{2}p_{d} \end{aligned}$$

In (26), the R.H.S is a constant because the class probabilities in the target column will not change. Hence, we can conclude that

$$\begin{aligned} I_\mathrm{g}+H_{s}= \mathrm{constant} \end{aligned}$$

This is the equation of a straight line with a negative slope when \(I_\mathrm{g}\) is plotted against \(H_{s}\).Footnote 16
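The identity \(I_\mathrm{g}+H_{s}=\mathrm{constant}\) can be confirmed numerically (a small sketch; the split configurations swept below are arbitrary choices of ours, with \(p_{k}=p_{d}-p_{j}\)):

```python
from math import log2

def entropy_terms(ps):
    """Sum of -p*log2(p), with 0*log2(0) taken as 0."""
    return -sum(p * log2(p) for p in ps if p > 0)

p_d = [0.1, 0.1, 0.8]           # target class probabilities
const = 2 * entropy_terms(p_d)  # claimed constant: -2 * sum(p_d log2 p_d)

checks = []
# Sweep a few (p_t, p_j) split configurations with p_k = p_d - p_j.
for p_t in (0.0, 0.3, 0.5, 0.7, 1.0):
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
        p_j = [frac * p for p in p_d]
        p_k = [p - q for p, q in zip(p_d, p_j)]
        i_as = p_t * entropy_terms(p_j) + (1 - p_t) * entropy_terms(p_k)
        i_g = entropy_terms(p_d) - i_as   # Eq. (13)
        h_s = entropy_terms(p_d) + i_as   # combined entropy of parent and children
        checks.append(abs(i_g + h_s - const) < 1e-12)

print(all(checks))  # -> True
```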

Simulation results of Procedure 1: We simulated the equations in MATLAB 2014. The simulation parameters were as follows: the number of classes is 3 (in our work it is a 9-class problem, since a cluster has 9 landmarks, but for the analysis and the simulations we choose 3 classes), with class probabilities \(p_{1}=0.1,p_{2}=0.1\) and \(p_{3}=0.8\). Let the truth occurrences in the children (the split probability of truths) be \(p_\mathrm{t}\) and that for the falses be \(p_\mathrm{f}\).Footnote 17 Let \(p_{j}\) and \(p_{k}\) be the probabilities of the target accompanying the truths and the falses, respectively. The graphs are plotted for the different conditions of \(p_\mathrm{t}\): \(p_\mathrm{t}=0,p_\mathrm{t}=1,p_\mathrm{t}=\frac{1}{2},p_\mathrm{t}=0.3,p_\mathrm{t}=0.7\). The value \(p_\mathrm{t}=0.3\) is a representative of the condition \(0\le p_\mathrm{t}<\frac{1}{2}\), and \(p_\mathrm{t}=0.7\) is a representative of the condition \(p_\mathrm{t}>\frac{1}{2}\). Since the information gain is always positive, iterating over \(p_\mathrm{f}\) gives the same results, since \(p_\mathrm{t}+p_\mathrm{f}=1\). Let the terms associated with \(p_\mathrm{t}\) be \(p_{k1},p_{k2},p_{k3}\) and those associated with \(p_\mathrm{f}\) be \(p_{f1},p_{f2},p_{f3}\). The information gain in (13) can then be written as: \(I_\mathrm{g}=-p_{1}\mathrm{log}_{2}p_{1}-p_{2}\mathrm{log}_{2}p_{2}-p_{3}\mathrm{log}_{2}p_{3} +p_\mathrm{t}\{p_{k1}\mathrm{log}_{2}p_{k1}+p_{k2}\mathrm{log}_{2}p_{k2}+p_{k3}\mathrm{log}_{2}p_{k3}\} +p_\mathrm{f}\{p_{f1}\mathrm{log}_{2}p_{f1}+p_{f2}\mathrm{log}_{2}p_{f2}+p_{f3}\mathrm{log}_{2}p_{f3}\}\).

Since \(p_{j}+p_{k}=p_{d}\) and \(p_\mathrm{t}+p_\mathrm{f}=1\), this equation can be rewritten as: \(I_\mathrm{g}=-p_{1}\mathrm{log}_{2}p_{1}-p_{2} \mathrm{log}_{2}p_{2}-p_{3}\mathrm{log}_{2}p_{3}+p_\mathrm{t}\{p_{k1}\mathrm{log}_{2}p_{k1} +p_{k2}\mathrm{log}_{2}p_{k2}+p_{k3}\mathrm{log}_{2}p_{k3}\}+(1-p_\mathrm{t}) \{(p_{1}-p_{k1})\mathrm{log}_{2}(p_{1}-p_{k1})+(p_{2}-p_{k2})\mathrm{log}_{2}(p_{2}-p_{k2}) +(p_{3}-p_{k3})\mathrm{log}_{2}(p_{3}-p_{k3})\}\).

We vary the probabilities \(p_{k1},p_{k2},p_{k3}\) such that \(0\le p_{k1}\le p_{1}, 0\le p_{k2}\le p_{2}\) and \(0\le p_{k3}\le p_{3}\). The obtained graphs are shown in Fig. 10.
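The sweep just described (run in MATLAB 2014 in the paper) can be re-sketched in Python. This is an approximate reconstruction: the grid below scales 11 steps to each class's own probability (our choice), and only the maximum gain found is reported:

```python
from math import log2
from itertools import product

def entropy_terms(ps):
    """Sum of -p*log2(p), with 0*log2(0) taken as 0."""
    return -sum(p * log2(p) for p in ps if p > 0)

p_d = [0.1, 0.1, 0.8]   # class probabilities used in the simulations
p_t = 0.0               # fixed truth probability, as in Fig. 10a

# 11 grid points per class, scaled to each class probability (our assumption).
grids = [[i * p / 10 for i in range(11)] for p in p_d]

best = float("-inf")
for p_k in product(*grids):                       # masses accompanying the truths
    p_f_side = [p - q for p, q in zip(p_d, p_k)]  # masses accompanying the falses
    i_g = (entropy_terms(p_d)
           - p_t * entropy_terms(p_k)
           - (1 - p_t) * entropy_terms(p_f_side))
    best = max(best, i_g)

print(round(best, 4))   # maximum information gain over the grid; the text reports 0.9219
```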

Fig. 10

a Plot of information gain versus possible combinations of \(p_{t},p_{k},p_{j}\), where \(p_\mathrm{t}=0\). b Plot of system entropy versus possible combinations of \(p_{t},p_{k},p_{j}\), where \(p_\mathrm{t}=0\). c Plot of information gain versus possible combinations of \(p_{t},p_{k},p_{j}\), where \(p_\mathrm{t}=1\). d Plot of system entropy versus possible combinations of \(p_{t},p_{k},p_{j}\), where \(p_\mathrm{t}=1\). e Plot of information gain versus possible combinations of \(p_{t},p_{k},p_{j}\), where \(p_\mathrm{t}=0.5\). f Plot of system entropy versus possible combinations of \(p_{t},p_{k},p_{j}\), where \(p_\mathrm{t}=0.5\). g Plot of information gain versus possible combinations of \(p_{t},p_{k},p_{j}\), where \(p_\mathrm{t}=0.3\). h Plot of system entropy versus possible combinations of \(p_{t},p_{k},p_{j}\), where \(p_\mathrm{t}=0.3\). i Plot of information gain versus possible combinations of \(p_{t},p_{k},p_{j}\), where \(p_\mathrm{t}=0.7\). j Plot of system entropy versus possible combinations of \(p_{t},p_{k},p_{j}\), where \(p_\mathrm{t}=0.7\)

In Fig. 10a, we have fixed the truth occurrences at \(p_\mathrm{t}=0\) (meaning the feature has only false occurrences and no truths); the probability of class-1 occurrences is 0.1, that of class-2 is 0.1 and that of class-3 is 0.8 in the target column. The information gain reaches its maximum when \(p_{k}=p_{d}\) (meaning that all the classes of the target are associated with the truths). This is a contradiction: since there are no truth occurrences in the feature, the classes cannot be associated with the truths of the children nodes. Hence, we can omit this condition and this system configuration (set of probabilities used), even though the \(I_\mathrm{g}\) obtained, 0.9219, is the maximum over all the probability configurations. Moving along the x-axis, we can see 11 lobes in the information gain plot; each main lobe has 11 sublobes, and each sublobe has 11 points running vertically. This is because of the possible combinations of \(p_{1},p_{2},p_{3}\), each having 11 instances (i.e., 0 to 0.1 in steps of 0.01). Also, there is a decreasing slope between \(I_\mathrm{g}\) and \(H_{s}\), in accordance with Eq. (26).

In Fig. 10c, we have repeated the simulations with \(p_\mathrm{t}=1\) (meaning that all entries in the considered feature column are truths). The maximum information gain, with the value 0.9219, occurs when \(p_{k}=0\). This implies that the feature column has only truths, yet no classes are associated with the truths. This is a contradiction, as the two cannot hold at the same time. Hence, the system with the aforementioned probability conditions is neglected.

In Fig. 10e, we can notice that the information gain is symmetric about \(p_{k}=\frac{p_{d}}{2}\), where it drops to its minimum. The information gain reaches its maximum value of 0.4610 when \(p_{k}=p_{d}\); it can be seen that this maximum is exactly half of the information gain obtained according to (13). This is not a point of operation for a decision tree, because the information gain goes slightly negative at its minimum point \(p_{k}=\frac{p_{d}}{2}\), or we can take it as 0: the uncertainty in the system would otherwise go beyond zero, which is a contradiction in the present scenario. But we can call the point \(p_{k}=p_{d}\) the equilibrium point of operation: there is neither gain nor loss, and the information of the parent is split equally among the children nodes.

Figure 10g shows the case \(p_\mathrm{t}=0.3\), an instance of \(0<p_\mathrm{t}<\frac{1}{2}\); we get the maximum information gain of 0.6453 when \(p_{k}=p_{d}\). It was also found that the information gain always stays above 0.4610, in accordance with Eq. (17). It is clear that the information gain has a hard threshold above which it always stays. A feature with \(0<p_\mathrm{t}<\frac{1}{2}\) has maximum gain when all the classes are associated with the truths: a low truth probability with all classes associated with it gives the optimal information gain.

Figure 10i shows the case \(p_\mathrm{t}=0.7\), an instance of \(p_\mathrm{t}>\frac{1}{2}\); we get the maximum information gain of 0.6453 when \(p_{k}=0\). It was also found that the information gain always stays above 0.4610, in accordance with Eq. (20). It is clear that the information gain has a hard threshold above which it always stays. A feature with \(p_\mathrm{t}>\frac{1}{2}\) has maximum gain when all the classes are associated with the falses, meaning that none are associated with the truths. Even though the classes are associated with the falses, the parent can attain the maximum information gain in this case as well. We conclude that the probability conditions mentioned in Case 4 and Case 7 are the best conditions for choosing an attribute as the root node. In other words, whichever attribute satisfies the conditions of Case 4 or Case 7 is placed as the root node of the tree.

Figure 10b, d, f, h, j are the plots of the system entropy versus the system probabilities, in accordance with Eq. (26).

Appendix B

See Tables 6, 7 and 8.

Table 6 Validation of decision tree (layer-1 classification)
Table 7 Validation of PCA\(+\)K-means (layer-2 classification)
Table 8 Validation of ANNs\(+\)K-means (layer-2 classification, cluster centers match error)


About this article


Cite this article

Sandeep Kumar, E., Talasila, V., Rishe, N. et al. Location identification for real estate investment using data analytics. Int J Data Sci Anal 8, 299–323 (2019). https://doi.org/10.1007/s41060-018-00170-0



  • Real estate investment
  • Machine learning
  • Artificial intelligence
  • Decision trees
  • Principal component analysis
  • K-means clustering
  • Artificial neural networks
  • Complex systems