Abstract
A panel data set contains observations on multiple phenomena recorded over multiple time periods for the same subjects (e.g., firms or individuals). Panel data sets appear frequently in marketing, economics, and many other social sciences. An important task in panel data analysis is to analyze and predict a variable of interest. Because the number of records collected for each subject in the social sciences is usually too small to support accurate and reliable analysis, a common solution is to pool all subjects together and then run a linear regression in an attempt to discover the underlying relationship between the variable of interest and the other observed variables. However, this method suffers from two limitations. First, the subjects might not be poolable because of their heterogeneous nature. Second, not all variables necessarily have a significant relationship to the variable of interest, and a regression on many irrelevant regressors tends to produce inaccurate predictions. To address these two issues, we propose a novel approach, called Selecting and Clustering, which derives the underlying linear models by first selecting the variables highly correlated with the variable of interest and then clustering the subjects into homogeneous groups that share the same linear model with respect to those variables. Furthermore, we formulate the problem as an optimization model whose solution selects variables and clusters subjects simultaneously. Because the problem is combinatorial in nature, we also propose an effective and efficient algorithm to solve it. Experiments on real data sets validate the approach, which performs significantly better than existing approaches.
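To make the two steps concrete, the following is a minimal sketch and not the chapter's algorithm: it selects variables by absolute correlation with the response on the pooled data, then clusters subjects by alternating between fitting one least-squares model per cluster and reassigning each subject to the model that fits it best. The sequential ordering, the function names (`select_variables`, `cluster_subjects`, `partition_sse`), the synthetic data, and the choice of two clusters and two variables are all assumptions made for illustration; the chapter itself solves selection and clustering jointly through an optimization model.

```python
import numpy as np

def select_variables(X, y, k):
    """Indices of the k columns most correlated (in absolute value) with y."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.sort(np.argsort(corr)[-k:])

def design(X):
    """Prepend an intercept column."""
    return np.column_stack([np.ones(len(X)), X])

def fit_ols(X, y):
    beta, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return beta

def sse(X, y, beta):
    resid = y - design(X) @ beta
    return float(resid @ resid)

def pool(panels):
    """Stack the (X_i, y_i) blocks of several subjects into one data set."""
    return (np.vstack([X for X, _ in panels]),
            np.concatenate([y for _, y in panels]))

def cluster_subjects(panels, n_clusters, n_iter=50, seed=0):
    """Alternate between per-cluster OLS fits and subject reassignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=len(panels))
    for _ in range(n_iter):
        betas = []
        for c in range(n_clusters):
            members = [p for p, lab in zip(panels, labels) if lab == c]
            if not members:  # re-seed an empty cluster with a random subject
                members = [panels[rng.integers(len(panels))]]
            betas.append(fit_ols(*pool(members)))
        new_labels = np.array(
            [int(np.argmin([sse(X, y, b) for b in betas])) for X, y in panels]
        )
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def partition_sse(panels, labels):
    """Total squared error when each cluster gets its own OLS fit."""
    total = 0.0
    for c in np.unique(labels):
        Xc, yc = pool([p for p, lab in zip(panels, labels) if lab == c])
        total += sse(Xc, yc, fit_ols(Xc, yc))
    return total

# Synthetic panel (assumed for illustration): 6 subjects, 30 periods,
# 4 regressors, of which only columns 0 and 2 drive the response.
rng = np.random.default_rng(0)
true_coefs = [(1.0, 2.0, 1.0)] * 3 + [(-1.0, 4.0, 3.0)] * 3
panels_full = []
for a, b0, b2 in true_coefs:
    X = rng.normal(size=(30, 4))
    y = a + b0 * X[:, 0] + b2 * X[:, 2] + 0.1 * rng.normal(size=30)
    panels_full.append((X, y))

X_all, y_all = pool(panels_full)
selected = select_variables(X_all, y_all, k=2)
panels = [(X[:, selected], y) for X, y in panels_full]
labels = cluster_subjects(panels, n_clusters=2)

X_sel, y_sel = pool(panels)
pooled_sse = sse(X_sel, y_sel, fit_ols(X_sel, y_sel))
clustered_sse = partition_sse(panels, labels)
print("selected:", selected, "labels:", labels)
print("pooled SSE:", pooled_sse, "clustered SSE:", clustered_sse)
```

For any partition of the subjects, fitting a separate least-squares model per cluster cannot give a larger in-sample squared error than the single pooled fit, so the comparison printed at the end only shows how much heterogeneity the pooled model leaves unexplained. The alternating procedure can stop at a local optimum, which is one reason a sketch like this is no substitute for a dedicated algorithm.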
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Lu, H., Huang, S., Li, Y., Yang, Y. (2014). Panel Data Analysis Via Variable Selection and Subject Clustering. In: Yada, K. (eds) Data Mining for Service. Studies in Big Data, vol 3. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45252-9_5
DOI: https://doi.org/10.1007/978-3-642-45252-9_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45251-2
Online ISBN: 978-3-642-45252-9
eBook Packages: Engineering, Engineering (R0)