Ten challenges in modeling bibliographic data for bibliometric analysis

Scientometrics

Abstract

The complexity and variety of bibliographic data are growing, and efforts to define new methodologies and techniques for bibliometric analysis are intensifying. In this scenario, among the most crucial issues are the quality of the data and the capability of bibliometric analysis to cope with multiple data dimensions. Although the problem of enforcing a multidimensional approach to the analysis and management of bibliographic data is not new, a reference design pattern and a specific conceptual model for multidimensional analysis of bibliographic data are still missing. In this paper, we discuss ten of the most relevant challenges for bibliometric analysis when dealing with multidimensional data, and we propose a reference data model that, depending on the goal of the analysis, can help analysis designers and bibliographic experts work with large collections of bibliographic data.

Notes

  1. A detailed description of each fact schema according to DFM is given in the following sections.

  2. The model has also been tested on a collection of about 8,000 publications in the research area of databases and data modeling.

  3. An important contribution on the statistical issues in comparing institutional performance is [22].

  4. For a detailed overview of the international rankings, see [19]; for the quality assessment of composite indicators, see [3].

  5. http://bulletin.imstat.org/2011/09/presidential-address-peter-hall/.

References

  1. Agrawal, R., Gupta, A., Sarawagi, S. (1997). Modeling multidimensional databases. In: Proceedings of the Thirteenth International Conference on Data Engineering, ICDE ’97, (pp. 232–243). Washington, DC, USA: IEEE Computer Society. http://portal.acm.org/citation.cfm?id=645482.653299.

  2. Bakkalbasi, N., Bauer, K., Glover, J., Wang, L. (2006). Three options for citation tracking: Google Scholar, Scopus and Web of Science. Biomedical Digital Libraries, 3(1), 7.

  3. Benito, M., Romera, R. (2011). Improving quality assessment of composite indicators in university rankings: A case study of French and German universities of excellence. Scientometrics, 89, 153–176.

  4. Blei, D., Lafferty, J. (2006). Dynamic topic models. In: Proceedings of the 23rd international conference on Machine learning (pp. 113–120). New York: ACM.

  5. Blei, D., Lafferty, J. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1), 17–35.

  6. Blei, D., Lafferty, J. (2009). Topic models. Text mining: classification, clustering, and applications, 10, 71.

  7. Blei, D., Ng, A., Jordan, M. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.

  8. Borg, I., Groenen, P. (2005). Modern multidimensional scaling: Theory and applications. Berlin: Springer.

  9. Brockwell, P., Davis, R. (2002). Introduction to time series and forecasting. Berlin: Springer.

  10. Bryk, A., Raudenbush, S. (1992). Hierarchical linear models: Applications and data analysis methods. New York: Sage Publications, Inc.

  11. Castano, S., Ferrara, A., Lorusso, D., Montanelli, S. (2008). On the Ontology Instance Matching Problem. In: Proceedings of the 7th DEXA Workshop on Web Semantics (WebS 08) (pp. 180–184). Turin, Italy.

  12. Coates, H. (2007). Universities on the catwalk: Models for performance ranking in Australia. Higher Education Management and Policy, 19(2), 69.

  13. Codd, E., Codd, S., Salley, C. (1993). Providing OLAP to user-analysts: An IT mandate. Tech. rep.

  14. DeBattisti, F., Salini, S. (2010). Bibliometric indicators for statisticians: critical assessment in the Italian context. Università di Firenze, Firenze. http://air.unimi.it/handle/2434/152106.

  15. Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

  16. Falagas, M., Pitsouni, E., Malietzis, G., Pappas, G. (2008). Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses. The FASEB Journal, 22(2), 338.

  17. Franceschet, M. (2009). A cluster analysis of scholar and journal bibliometric indicators. Journal of the American Society for Information Science and Technology, 60(10), 1950–1964.

  18. Friedman, J., Tibshirani, R., Hastie, T. (2009). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.

  19. Geraci, M., Degli Esposti, M. (2011). Where do Italian universities stand? An in-depth statistical analysis of national and international rankings. Scientometrics, 87(3), 667–681.

  20. Glänzel, W., Schubert, A. (2003). A new classification scheme of science fields and subfields designed for scientometric evaluation purposes. Scientometrics, 56(3), 357–367.

  21. Goldstein, H. (2010). Multilevel statistical models, 4th edn. New York: Wiley.

  22. Goldstein, H., Spiegelhalter, D. (1996). League tables and their limitations: Statistical issues in comparisons of institutional performance. Journal of the Royal Statistical Society. Series A (Statistics in Society), 385–443.

  23. Golfarelli, M., Rizzi, S. (2009). Data Warehouse design: Modern principles and methodologies. Maidenhead: McGraw-Hill.

  24. Greenacre, M., Blasius, J. (2006). Multiple correspondence analysis and related methods. Boca Raton: Chapman & Hall/CRC.

  25. Hirsch, J. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569.

  26. Hofmann, T. (1999). Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 50–57). New York: ACM.

  27. Hubert, J. (1977). Bibliometric models for journal productivity. Social Indicators Research, 4(1), 441–473.

  28. Hudomalj, E., Vidmar, G. (2003). OLAP and bibliographic databases. Scientometrics, 58(3), 609–622.

  29. Irvine, J., Martin, B. (1984). Foresight in science: picking the winners. London.

  30. Jensen, F. (1996). An introduction to Bayesian networks, vol. 210. London: UCL Press.

  31. Kenett, R., Salini, S. (2011). Modern analysis of customer satisfaction surveys: comparison of models and integrated analysis. Applied Stochastic Models in Business and Industry, 27(5), 465–475.

  32. Kolaczyk, E. (2009). Statistical analysis of network data: methods and models. Berlin: Springer.

  33. Mallig, N. (2010). A relational database for bibliometric analysis. Journal of Informetrics, 4(4), 564–580.

  34. Mann, G., Mimno, D., McCallum, A. (2006). Bibliometric impact measures leveraging topic analysis. In: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries (pp. 65–74). New York: ACM.

  35. Meho, L., Yang, K. (2007). Impact of data sources on citation counts and rankings of LIS faculty: Web of Science versus Scopus and Google Scholar. Journal of the American Society for Information Science and Technology, 58(13), 2105–2125.

  36. Molinari, J., Molinari, A. (2008). A new methodology for ranking scientific institutions. Scientometrics, 75(1), 163–174.

  37. Nigam, K., McCallum, A., Thrun, S., Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2), 103–134.

  38. Steyvers, M., Griffiths, T. (2007). Probabilistic topic models. Handbook of latent semantic analysis, 427(7), 424–440.

  39. Tapper, T., Filippakou, O. (2009). The world-class league tables and the sustaining of international reputations in higher education. Journal of Higher Education Policy and Management, 31(1), 55–66.

  40. Teh, Y., Jordan, M., Beal, M., Blei, D. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.

  41. Vassiliadis, P. (1998). Modeling multidimensional databases, cubes and cube operations. In: Scientific and Statistical Database Management, International Conference on, (p. 53). IEEE Computer Society, Los Alamitos, CA, USA. http://doi.ieeecomputersociety.org/10.1109/SSDM.1998.688111.

  42. Vassiliadis, P., Sellis, T. (1999). A survey of logical models for OLAP databases. SIGMOD Rec. 28, 64–69. http://doi.acm.org/10.1145/344816.344869.

  43. Vinkler, P. (2010). The evaluation of research by scientometric indicators. London: Chandos Publishing.

  44. Wolfram, D. (2006). Applications of SQL for informetric frequency distribution processing. Scientometrics, 67(2), 301–313.

  45. Yu, H., Davis, M., Wilson, C., Cole, F. (2008). Object-relational data modelling for informetric databases. Journal of Informetrics, 2(3), 240–251.


Acknowledgments

We would like to thank the UNIMIVAL group of the University of Milan (http://www.unimi.it/cataloghi/nucelo_valutazione/ricercatori_in_breve.pdf). In the last year, they worked with us on the subject of bibliometrics; many of our ideas come from our common work and from our many fruitful discussions.

Author information

Corresponding author

Correspondence to Silvia Salini.

About this article

Cite this article

Ferrara, A., Salini, S. Ten challenges in modeling bibliographic data for bibliometric analysis. Scientometrics 93, 765–785 (2012). https://doi.org/10.1007/s11192-012-0810-x

  • DOI: https://doi.org/10.1007/s11192-012-0810-x
