Abstract
The complexity and variety of bibliographic data are growing, and efforts to define new methodologies and techniques for bibliometric analysis are intensifying. In this complex scenario, one of the most crucial issues is the quality of data and the capability of bibliometric analysis to cope with multiple data dimensions. Although the problem of enforcing a multidimensional approach to the analysis and management of bibliographic data is not new, a reference design pattern and a specific conceptual model for multidimensional analysis of bibliographic data are still missing. In this paper, we discuss ten of the most relevant challenges for bibliometric analysis when dealing with multidimensional data, and we propose a reference data model that, according to different goals, can help analysis designers and bibliographic experts in working with large collections of bibliographic data.
Notes
A detailed description of each fact schema according to DFM is given in the following sections.
The model has also been tested on a collection of about 8,000 publications in the research area of databases and data modeling.
A very important contribution about the statistical issues in comparing institutional performance is [22].
References
Agrawal, R., Gupta, A., Sarawagi, S. (1997). Modeling multidimensional databases. In: Proceedings of the Thirteenth International Conference on Data Engineering, ICDE ’97, (pp. 232–243). Washington, DC, USA: IEEE Computer Society. http://portal.acm.org/citation.cfm?id=645482.653299.
Bakkalbasi, N., Bauer, K., Glover, J., Wang, L. (2006). Three options for citation tracking: Google Scholar, Scopus and Web of Science. Biomedical Digital Libraries, 3(1), 7.
Benito, M., Romera, R. (2011). Improving quality assessment of composite indicators in university rankings: A case study of French and German universities of excellence. Scientometrics, 89, 153–176.
Blei, D., Lafferty, J. (2006). Dynamic topic models. In: Proceedings of the 23rd international conference on Machine learning (pp. 113–120). New York: ACM.
Blei, D., Lafferty, J. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1), 17–35.
Blei, D., Lafferty, J. (2009). Topic models. Text mining: classification, clustering, and applications, 10, 71.
Blei, D., Ng, A., Jordan, M. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.
Borg, I., Groenen, P. (2005). Modern multidimensional scaling: Theory and applications. Berlin: Springer.
Brockwell, P., Davis, R. (2002). Introduction to time series and forecasting. Berlin: Springer.
Bryk, A., Raudenbush, S. (1992). Hierarchical linear models: Applications and data analysis methods. New York: Sage Publications, Inc.
Castano, S., Ferrara, A., Lorusso, D., Montanelli, S. (2008). On the Ontology Instance Matching Problem. In: Proceedings of the 7th DEXA Workshop on Web Semantics (WebS 08) (pp. 180–184). Turin, Italy.
Coates, H. (2007). Universities on the catwalk: Models for performance ranking in Australia. Higher Education Management and Policy, 19(2), 69.
Codd, E., Codd, S., Salley, C. (1993). Providing OLAP to user-analysts: An IT mandate. Tech. rep.
DeBattisti, F., Salini, S. (2010). Bibliometric indicators for statisticians: critical assessment in the Italian context. Università di Firenze, Firenze. http://air.unimi.it/handle/2434/152106.
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Falagas, M., Pitsouni, E., Malietzis, G., Pappas, G. (2008). Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses. The FASEB Journal, 22(2), 338.
Franceschet, M. (2009). A cluster analysis of scholar and journal bibliometric indicators. Journal of the American Society for Information Science and Technology, 60(10), 1950–1964.
Friedman, J., Tibshirani, R., Hastie, T. (2009). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.
Geraci, M., Degli Esposti, M. (2011). Where do Italian universities stand? An in-depth statistical analysis of national and international rankings. Scientometrics, 87(3), 667–681.
Glänzel, W., Schubert, A. (2003). A new classification scheme of science fields and subfields designed for scientometric evaluation purposes. Scientometrics, 56(3), 357–367.
Goldstein, H. (2010). Multilevel statistical models, 4th edn. New York: Wiley.
Goldstein, H., Spiegelhalter, D. (1996). League tables and their limitations: Statistical issues in comparisons of institutional performance. Journal of the Royal Statistical Society. Series A (Statistics in Society), 385–443.
Golfarelli, M., Rizzi, S. (2009). Data Warehouse design: Modern principles and methodologies. Maidenhead: McGraw-Hill.
Greenacre, M., Blasius, J. (2006). Multiple correspondence analysis and related methods. Boca Raton: Chapman & Hall/CRC.
Hirsch, J. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 50–57). New York: ACM.
Hubert, J. (1977). Bibliometric models for journal productivity. Social Indicators Research, 4(1), 441–473.
Hudomalj, E., Vidmar, G. (2003). OLAP and bibliographic databases. Scientometrics, 58(3), 609–622.
Irvine, J., Martin, B. (1984). Foresight in science: picking the winners. London.
Jensen, F. (1996). An introduction to Bayesian networks, vol. 210. London: UCL press.
Kenett, R., Salini, S. (2011). Modern analysis of customer satisfaction surveys: comparison of models and integrated analysis. Applied Stochastic Models in Business and Industry, 27(5), 465–475.
Kolaczyk, E. (2009). Statistical analysis of network data: methods and models. Berlin: Springer.
Mallig, N. (2010). A relational database for bibliometric analysis. Journal of Informetrics, 4(4), 564–580.
Mann, G., Mimno, D., McCallum, A. (2006). Bibliometric impact measures leveraging topic analysis. In: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries (pp. 65–74). New York: ACM.
Meho, L., Yang, K. (2007). Impact of data sources on citation counts and rankings of LIS faculty: Web of Science versus Scopus and Google Scholar. Journal of the American Society for Information Science and Technology, 58(13), 2105–2125.
Molinari, J., Molinari, A. (2008). A new methodology for ranking scientific institutions. Scientometrics, 75(1), 163–174.
Nigam, K., McCallum, A., Thrun, S., Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2), 103–134.
Steyvers, M., Griffiths, T. (2007). Probabilistic topic models. Handbook of latent semantic analysis, 427(7), 424–440.
Tapper, T., Filippakou, O. (2009). The world-class league tables and the sustaining of international reputations in higher education. Journal of Higher Education Policy and Management, 31(1), 55–66.
Teh, Y., Jordan, M., Beal, M., Blei, D. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.
Vassiliadis, P. (1998). Modeling multidimensional databases, cubes and cube operations. In: Scientific and Statistical Database Management, International Conference on, (p. 53). IEEE Computer Society, Los Alamitos, CA, USA. http://doi.ieeecomputersociety.org/10.1109/SSDM.1998.688111.
Vassiliadis, P., Sellis, T. (1999). A survey of logical models for OLAP databases. SIGMOD Rec. 28, 64–69. http://doi.acm.org/10.1145/344816.344869.
Vinkler, P. (2010). The evaluation of research by scientometric indicators. London: Chandos Publishing.
Wolfram, D. (2006). Applications of SQL for informetric frequency distribution processing. Scientometrics, 67(2), 301–313.
Yu, H., Davis, M., Wilson, C., Cole, F. (2008). Object-relational data modelling for informetric databases. Journal of Informetrics, 2(3), 240–251.
Acknowledgments
We would like to thank the UNIMIVAL group of the University of Milan (http://www.unimi.it/cataloghi/nucelo_valutazione/ricercatori_in_breve.pdf). In the last year, they worked with us on the subject of bibliometrics; many of our ideas come from our common work and from our many fruitful discussions.
Cite this article
Ferrara, A., Salini, S. Ten challenges in modeling bibliographic data for bibliometric analysis. Scientometrics 93, 765–785 (2012). https://doi.org/10.1007/s11192-012-0810-x
DOI: https://doi.org/10.1007/s11192-012-0810-x