
Introduction: Tracing the History of a Discipline Through Quantitative and Qualitative Analyses of Scientific Literature

Tracing the Life Cycle of Ideas in the Humanities and Social Sciences

Abstract

The chapters of this book are concerned with learning about the evolution of ideas (theories, concepts, methods, and application domains) and about the history of a discipline by means of the temporal evolution of word occurrences in papers published in scientific journals. The work carried out for each of the areas involved in the project (philosophy, sociology, psychology, linguistics, statistics) pursued different objectives: to obtain a first overview of the relationship between time and contents in order to observe latent temporal patterns; to identify relevant keywords; to cluster keywords portraying similar temporal patterns; to identify latent dynamics of keyword clusters; and to identify relevant topics as groups of related words. The contributions identified and analysed the main subject matters that, at the time of publication, were considered relevant by mainstream journals, and they offer new viewpoints from which to read and understand the evolution of a discipline. The interdisciplinary debate triggered by this research is innovative because quantitative methods for text analysis have been applied in areas of the human and social sciences that are traditionally studied through qualitative approaches. It also represents a positive experience, since new paths have been explored by pooling the qualitative and quantitative research methods, traditions, and expertise of different disciplines.



Acknowledgements

My respect and gratitude go to the members of the research team and co-authors of this book, which I had the honour to lead and coordinate, for having chosen to follow me in this challenging adventure and to join the small group of brave researchers who for some time have shared my interest in this matter. I would like to recognize the open minds of our most senior colleagues, and their vision and willingness to get involved on truly exceptional, unfamiliar terrain. I am also very satisfied with the work of my younger colleagues: for the desire to learn that they have shown, for the great enthusiasm they dedicated to the project, and for having become the real "research engine" of the group.


Corresponding author

Correspondence to Arjuna Tuzzi.


Appendix

1.1.1 A Brief Overview of Correspondence Analysis

Correspondence Analysis (CA) is an exploratory data analysis (EDA) technique that has proven useful in studying the joint distribution of two (or more) categorical variables. CA portrays the structure of association between the variables by means of simple plots that position the categories of the variables on a plane.

The quantitative perspective adopted by the contributions of this volume is based on words and word counts, i.e. on the observation of occurrences of relevant keywords over time. In this perspective, CA can be exploited to achieve a content mapping, as it is useful for representing the system of relationships among years (e.g. volumes of a journal), among words (e.g. relevant keywords), and between years and words. Although CA cannot describe all the relevant linguistic features of a set of texts, it helps to highlight latent patterns. In our case, for example, it makes it possible to verify whether the volumes of a journal expressed a clear temporal pattern in their main contents.

In its simplest version, CA works on a two-way contingency table in which the rows represent keywords (e.g. m word-types w1, …, wm) and the columns represent the volumes of the journal (e.g. p time-points t1, …, tp). Each cell of this (lexical) contingency table contains the number nij of occurrences of the i-th keyword (the i-th row) in the volume published at the j-th time-point (the j-th column) (Table 1.1).
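As a minimal sketch (not the authors' code), such a lexical contingency table can be assembled from per-volume token lists; the years and tokens below are invented purely for illustration:

```python
from collections import Counter

# Hypothetical input: one token list per journal volume (time-point).
volumes = {
    1990: ["regression", "model", "regression"],
    1991: ["sampling", "survey", "model"],
    1992: ["sampling", "sampling", "regression"],
}

years = sorted(volumes)                                          # columns: time-points
vocab = sorted({w for toks in volumes.values() for w in toks})   # rows: word-types
counts = {y: Counter(volumes[y]) for y in years}

# table[i][j] = n_ij, the occurrences of word i in the volume of year j.
table = [[counts[y][w] for y in years] for w in vocab]
```

Each row of `table` is then the temporal profile of one word, which is exactly the object compared in the distance computations that follow in the text.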

Table 1.1 Example of (lexical) contingency table words × time-points

CA provides the best simultaneous representation of row profiles and column profiles on each axis (and on each plane generated by a pair of axes). The purpose of CA is to translate the similarities between categories (words and volumes) into a graph in which the most similar categories are placed in adjacent positions in the space defined by the Cartesian axes. Looking at the words, it is fairly intuitive to think that the similarity between two words depends on how much the occurrences in the two rows of the table "resemble each other", that is, how similar they are in terms of presence, absence, or frequency across the journal volumes: if two words tend to be used in the same volumes with similar frequency, they have a similar profile over time. Two words with an identical profile will have zero distance between them, that is, they will be represented on a graph as two overlapping points.

The intuitive notion of similarity between the profiles of two words wi and wk is translated into a distance (chi-square distance) that can be calculated for each pair of words:

\( {d}_{ik}^2=\sum \limits_{j=1}^p\frac{n}{n_{.j}}{\left(\frac{n_{ij}}{n_{i.}}-\frac{n_{kj}}{n_{k.}}\right)}^2 \)
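The formula can be sketched directly in code; the small words × time-points table below is invented for illustration:

```python
# Hypothetical words × time-points table; each row is a word's profile.
table = [
    [4, 1, 0],   # word w_i
    [3, 2, 1],   # word w_k
    [1, 1, 4],   # a third word (it still contributes to the column totals n_.j)
]
n = sum(map(sum, table))                                            # grand total n
col = [sum(row[j] for row in table) for j in range(len(table[0]))]  # n_.j

def chi2_dist_sq(i, k):
    """Squared chi-square distance between the profiles of rows i and k."""
    ni, nk = sum(table[i]), sum(table[k])                           # n_i., n_k.
    return sum(n / col[j] * (table[i][j] / ni - table[k][j] / nk) ** 2
               for j in range(len(col)))
```

Note that each squared difference of relative frequencies is weighted by n / n.j, so rare columns weigh more: this is what distinguishes the chi-square distance from a plain Euclidean distance between profiles.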

The same reasoning can be repeated for pairs of volumes by considering the profiles of the two corresponding columns. Two volumes of the journal (time-points tj and tk) resemble each other if they have a similar lexical profile, i.e. if they include the same words with similar relative frequencies (Fig. 1.1).

Fig. 1.1
figure 1

Profiles in terms of relative frequencies and positions on the plane of three time-points

The distance between two time-points tj and tk is given as:

\( {d}_{jk}^2=\sum \limits_{i=1}^m\frac{n}{n_{i.}}{\left(\frac{n_{ij}}{n_{.j}}-\frac{n_{ik}}{n_{.k}}\right)}^2 \)

From another viewpoint, the rows and the columns of this matrix are considered as vectors, i.e. as points in a multidimensional space, and the distance between two vectors is measured through a weighted Euclidean distance that compares the corresponding lexical profiles, taking into account the size of the subcorpora (volumes) at each time-point and the occurrences of each word in the corpus as a whole.

Following the calculation of the pairwise distances for words and for volumes, the next step is to transform the space generated by the original variables into a Euclidean space generated by new orthogonal variables (components or axes). The multidimensional space generated by the matrix is reduced to orthogonal dimensions (axes) that are displayed as Cartesian axes. The number of dimensions of this new space (i.e. the number of orthogonal axes) is equal to the number of linearly independent variables (the rank of the matrix), which, in our context, is the number of time-points minus one (p − 1; more generally, min(m, p) − 1).

The starting points of this transformation are the square m × m matrix containing the pairwise distances between words and the square p × p matrix containing the pairwise distances between volumes. The calculation of the coordinates on each axis is based on the singular value decomposition (SVD). The orthogonal factorial axes are sorted according to the amount of inertia they collect (i.e. according to the degree of association), so they are in order of relevance: the first axis is the most important and collects the largest portion of the information contained in the contingency table, the second axis collects the largest portion of the information not explained by the first, and so on. The Cartesian plane constructed with the first two factorial axes is therefore the two-dimensional space that best represents, in a low-dimensional Euclidean space, the structure of association shown in the contingency table.
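This step can be sketched compactly using the standard CA construction via SVD of the matrix of standardized residuals (a textbook formulation, not the authors' own code); the count matrix N is invented for the sketch:

```python
import numpy as np

# Hypothetical words × volumes count table N.
N = np.array([[4., 1., 0.],
              [3., 2., 1.],
              [1., 1., 4.]])

P = N / N.sum()                          # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)      # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
U, s, Vt = np.linalg.svd(S, full_matrices=False)

inertia = s ** 2                         # principal inertia collected by each axis
rows = (U * s) / np.sqrt(r)[:, None]     # principal coordinates of the words
cols = (Vt.T * s) / np.sqrt(c)[:, None]  # principal coordinates of the volumes
```

The singular values arrive sorted, so the axes come out in order of collected inertia exactly as described above; and because S is doubly centred, at most min(m, p) − 1 inertias are non-zero, matching the rank argument in the text.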

Unlike other analyses that start from a cases × variables matrix, in CA the contingency table can be read in two ways: as m row vectors in the (p − 1)-dimensional space generated by the columns, i.e. m words in the space of the p time-points (volumes), and as p column vectors in the (m − 1)-dimensional space generated by the rows, i.e. p time-points in the space of the m words. From this observation follows the immediate possibility of obtaining two separate graphs: one for the words and one for the volumes. Owing to the geometric properties of the two spaces (duality), the dimensions are the same and the two graphs can be overlaid. This makes it possible to observe the system of relations between all the categories in play, although we must be very careful in interpreting the joint graphical representation of the two variables. To briefly summarize the elements for reading the graphs obtained from CA, we should remember that the position of a word or a volume assumes a role only in the global context of the graph: it has no meaning by itself, but acquires meaning in comparison with the positions taken by all the other points in the solution, with respect to the barycentre at the origin of the axes. If two words are close on the graph, they have similar profiles; analogously, if two volumes are close, they have similar lexical profiles. The mutual position of a word and a volume cannot be evaluated directly and must be assessed with reference to the positions of all the other elements. In this sense, it is useful to consider the quadrants of the Cartesian plane and, thanks to the axes, to evaluate proximity by taking into account the angles formed with the axes (the more similar the angles two points form with the axes, the more the corresponding categories can be considered associated).

The words or volumes that contributed the most to the solution, and which can therefore be considered the most important in the reconstructed context of the graph, are those far from the origin of the axes. A densification of categories in an area of the graph that stands out from the rest as a cluster might be interpreted as a semantic area, and for this purpose one often chooses to partition the points into clusters. The clusters of words or volumes should be as homogeneous as possible within groups and as heterogeneous as possible between groups. In the analysis of the lexical contingency table, a cluster analysis based on the CA groups the volumes together on the basis of lexical similarity (which is usually also visible as proximity of the points on the graph).

1.1.2 An Example

To understand how CA works, a worked example on a very simple fictional corpus might be useful. Suppose we have 11 texts that represent topics of a journal in the statistical field and constitute a small text corpus:

  • text01 regression analysis; linear regression

  • text02 regression model; linear and non-linear model

  • text03 generalized linear model; parameter estimation

  • text04 sampling methods; random sampling; survey design and sampling methods

  • text05 survey design; finite populations

  • text06 methods for sampling elusive populations

  • text07 Normal distribution

  • text08 z-scores and Normal distribution

  • text09 Gamma distribution

  • text10 p-value: Normal distribution and Gamma—exponential family

  • text11 regression analysis; Normal distribution

There are 53 word-tokens and 25 word-types in the corpus. Taking into account only the words that occur at least twice, namely distribution (5 occurrences); and, linear, Normal, regression, and sampling (4 each); methods and model (3 each); and analysis, design, Gamma, populations, and survey (2 each), we can construct a contingency table words × texts (Table 1.2), in which we see, for example, that the word survey was used once each in texts 04 and 05.

Table 1.2 Contingency table words × texts

The CA of the contingency table results in 10 factorial axes. The first two axes collect 55% of the information (explained inertia), and the first factorial plane is shown in Fig. 1.2.

Fig. 1.2
figure 2

First plane of correspondence analysis. Visualization of texts (a) and of both texts and words with frequency ≥2 (b)

Figure 1.2 clearly shows the three latent patterns present in the texts, which refer to linear models (regression, analysis), sampling methods (survey design, populations), and distributions (Normal, Gamma). Texts 01, 02, and 03 are found together in the area of linear models (second quadrant, upper left), while texts 07, 08, 09, and 10 lie in the area of distributions (third quadrant, bottom left). Text 11 falls somewhere between the linear-model and distribution areas because it includes both topics. In the area of sampling methods (first quadrant, on the left), there are texts 04, 05, and 06. It is interesting to note the conjunction and, which is found near the origin of the axes because it has been used in different contexts (though slightly more often by those who talked about distributions).
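As a quick check, most of the word counts stated above can be reproduced from the 11 texts. The tokenization below is an assumption (hyphenated forms such as non-linear, z-scores, and p-value are kept whole, which affects a couple of the counts), so only the counts that are unambiguous under this scheme are verified:

```python
import re
from collections import Counter

# The 11 fictional texts of the example.
texts = [
    "regression analysis; linear regression",
    "regression model; linear and non-linear model",
    "generalized linear model; parameter estimation",
    "sampling methods; random sampling; survey design and sampling methods",
    "survey design; finite populations",
    "methods for sampling elusive populations",
    "Normal distribution",
    "z-scores and Normal distribution",
    "Gamma distribution",
    "p-value: Normal distribution and Gamma—exponential family",
    "regression analysis; Normal distribution",
]

# Assumed tokenization: word characters and internal hyphens form one token.
tokens = [w for t in texts for w in re.findall(r"[\w-]+", t)]
counts = Counter(tokens)
```

With these counts, distribution appears 5 times; regression, sampling, Normal, and the conjunction and appear 4 times each; methods and model 3 times each, matching the figures given in the text.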


Copyright information

Ā© 2018 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Tuzzi, A. (2018). Introduction: Tracing the History of a Discipline Through Quantitative and Qualitative Analyses of Scientific Literature. In: Tuzzi, A. (eds) Tracing the Life Cycle of Ideas in the Humanities and Social Sciences. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-97064-6_1
