
Introduction: Tracing the History of a Discipline Through Quantitative and Qualitative Analyses of Scientific Literature

Tracing the Life Cycle of Ideas in the Humanities and Social Sciences

Abstract

The chapters of this book are concerned with learning about the evolution of ideas (theories, concepts, methods, and application domains) and about the history of a discipline by means of the temporal evolution of word occurrences in papers published in scientific journals. The work carried out for each of the areas involved in the project (philosophy, sociology, psychology, linguistics, statistics) pursued different objectives: to obtain a first overview of the relationship between time and contents in order to observe latent temporal patterns; to identify relevant keywords; to cluster keywords portraying similar temporal patterns; to identify latent dynamics of keyword clusters; and to identify relevant topics as groups of related words. The contributions identified and analysed the main subject matters that, at the time of publication, were considered relevant by mainstream journals, and they offer new viewpoints from which to read and understand the evolution of a discipline. The interdisciplinary debate triggered by this research is innovative because quantitative methods for text analysis have been applied in areas of the human and social sciences that are traditionally studied through qualitative approaches. It also represents a positive experience, since new paths have been explored by pooling the qualitative and quantitative research methods, traditions, and expertise of different disciplines.



Acknowledgements

My respect and gratitude go to the members of the research team and co-authors of this book, which I had the honour to lead and coordinate, for having chosen to follow me in this challenging adventure and to join the small group of brave researchers who for some time have shared my interest in this matter. I would like to recognize the open minds of our most senior colleagues, and their vision and willingness to get involved on truly exceptional, unfamiliar terrain. I am also very satisfied with the work of my younger colleagues: for the desire to learn that they have shown, for the great enthusiasm they dedicated to the project, and for having become the real "research engine" of the group.


Corresponding author

Correspondence to Arjuna Tuzzi.


Appendix

1.1.1 A Brief Overview of Correspondence Analysis

Correspondence Analysis (CA) is an exploratory data analysis (EDA) technique that has proven useful in studying the joint distribution of two (or more) categorical variables. CA portrays the structure of association between the variables by means of simple plots that position the categories of the variables on a plane.

The quantitative perspective adopted by the contributions of this volume is based on words and word counts, i.e. on the observation of occurrences of relevant keywords over time. In this perspective, CA can be exploited to achieve a content mapping, as it is useful for representing the system of relationships among years (e.g. volumes of a journal), among words (e.g. relevant keywords), and between years and words. Although CA cannot describe all the relevant linguistic features of a set of texts, it helps to highlight latent patterns. In our case, for example, it makes it possible to verify whether the volumes of a journal expressed a clear temporal pattern in their main contents.

In its simplest version, CA works on a two-way contingency table in which the rows represent keywords (e.g. m word-types w1, …, wm) and the columns represent the volumes of the journal (e.g. p time-points t1, …, tp). Each cell of this (lexical) contingency table contains the number nij of occurrences of the i-th keyword (the i-th row) in the volume published at the j-th time-point (the j-th column) (Table 1.1).
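As a minimal sketch (not the authors' code), such a lexical contingency table can be assembled from per-volume token lists; the years and tokens below are invented purely for illustration:

```python
from collections import Counter

# Hypothetical input: one token list per journal volume (time-point).
volumes = {
    1990: ["regression", "model", "regression"],
    1991: ["sampling", "survey", "model"],
    1992: ["sampling", "sampling", "regression"],
}

years = sorted(volumes)                                          # columns: time-points
vocab = sorted({w for toks in volumes.values() for w in toks})   # rows: word-types
counts = {y: Counter(volumes[y]) for y in years}

# table[i][j] = n_ij, the occurrences of word i in the volume of year j.
table = [[counts[y][w] for y in years] for w in vocab]
```

Each row of `table` is then the temporal profile of one word, which is exactly the object compared in the distance computations that follow in the text.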

Table 1.1 Example of (lexical) contingency table words × time-points

CA provides the best simultaneous representation of row profiles and column profiles on each axis (and on each plane generated by a pair of axes). The purpose of CA is to translate the similarities between categories (words and volumes) into a graph in which the most similar categories are placed in adjacent positions in the space defined by the Cartesian axes. Looking at the words, it is fairly intuitive to think that the similarity between two words depends on how much the occurrences in the two rows of the table "resemble each other", that is, how similar they are in terms of presence, absence, or frequency across the journal volumes: if two words tend to be used in the same volumes with similar frequency, they have a similar profile over time. Two words with an identical profile will have zero distance between them, that is, they will be represented on a graph as two overlapping points.

The intuitive notion of similarity between the profiles of two words wi and wk is translated into a distance (chi-square distance) that can be calculated for each pair of words:

\( {d}_{ik}^2=\sum \limits_{j=1}^p\frac{n}{n_{.j}}{\left(\frac{n_{ij}}{n_{i.}}-\frac{n_{kj}}{n_{k.}}\right)}^2 \)
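The formula can be sketched directly in code; the small words × time-points table below is invented for illustration:

```python
# Hypothetical words × time-points table; each row is a word's profile.
table = [
    [4, 1, 0],   # word w_i
    [3, 2, 1],   # word w_k
    [1, 1, 4],   # a third word (it still contributes to the column totals n_.j)
]
n = sum(map(sum, table))                                            # grand total n
col = [sum(row[j] for row in table) for j in range(len(table[0]))]  # n_.j

def chi2_dist_sq(i, k):
    """Squared chi-square distance between the profiles of rows i and k."""
    ni, nk = sum(table[i]), sum(table[k])                           # n_i., n_k.
    return sum(n / col[j] * (table[i][j] / ni - table[k][j] / nk) ** 2
               for j in range(len(col)))
```

Note that each squared difference of relative frequencies is weighted by n / n.j, so rare columns weigh more: this is what distinguishes the chi-square distance from a plain Euclidean distance between profiles.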

The same reasoning can be repeated for pairs of volumes by considering the profiles of the two corresponding columns. Two volumes of the journal (time-points tj and tk) resemble each other if they have a similar lexical profile, i.e. if they include the same words with similar relative frequencies (Fig. 1.1).

Fig. 1.1
figure 1

Profiles in terms of relative frequencies and positions on the plane of three time-points

The distance between two time-points tj and tk is given as:

\( {d}_{jk}^2=\sum \limits_{i=1}^m\frac{n}{n_{i.}}{\left(\frac{n_{ij}}{n_{.j}}-\frac{n_{ik}}{n_{.k}}\right)}^2 \)

From another viewpoint, the rows and the columns of this matrix are considered as vectors, i.e. as points in a multidimensional space, and the distance between two vectors is measured through a weighted Euclidean distance that compares the corresponding lexical profiles, taking into account the size of the subcorpora (volumes) at each time-point and the occurrences of each word in the corpus as a whole.

Following the calculation of the pairwise distances for words and for volumes, the next step is to transform the space generated by the original variables into a Euclidean space generated by new orthogonal variables (components or axes). The multidimensional space generated by the matrix is reduced to orthogonal dimensions (axes) that are displayed as Cartesian axes. The number of dimensions of this new space (i.e. the number of orthogonal axes) is equal to the number of linearly independent variables (the rank of the matrix), which, in our context, is the number of time-points minus one (p − 1; more generally, min(m, p) − 1).

The starting points of this transformation are the square m × m matrix containing the pairwise distances between words and the square p × p matrix containing the pairwise distances between volumes. The calculation of the coordinates on each axis is based on the singular value decomposition (SVD). The orthogonal factorial axes are sorted according to the amount of inertia they collect (i.e. according to the degree of association), so they are in order of relevance: the first axis is the most important and collects the largest portion of the information contained in the contingency table, the second axis collects the largest portion of the information not explained by the first, and so on. The Cartesian plane constructed with the first two factorial axes is therefore the two-dimensional space that best represents, in a low-dimensional Euclidean space, the structure of association shown in the contingency table.
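This step can be sketched compactly using the standard CA construction via SVD of the matrix of standardized residuals (a textbook formulation, not the authors' own code); the count matrix N is invented for the sketch:

```python
import numpy as np

# Hypothetical words × volumes count table N.
N = np.array([[4., 1., 0.],
              [3., 2., 1.],
              [1., 1., 4.]])

P = N / N.sum()                          # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)      # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
U, s, Vt = np.linalg.svd(S, full_matrices=False)

inertia = s ** 2                         # principal inertia collected by each axis
rows = (U * s) / np.sqrt(r)[:, None]     # principal coordinates of the words
cols = (Vt.T * s) / np.sqrt(c)[:, None]  # principal coordinates of the volumes
```

The singular values arrive sorted, so the axes come out in order of collected inertia exactly as described above; and because S is doubly centred, at most min(m, p) − 1 inertias are non-zero, matching the rank argument in the text.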

Unlike other analyses that start from a cases × variables matrix, in CA the contingency table can be read in two ways: as m row vectors in the (p − 1)-dimensional space generated by the columns, i.e. m words in the space of the p time-points (volumes), and as p column vectors in the (m − 1)-dimensional space generated by the rows, i.e. p time-points in the space of the m words. From this observation follows the immediate possibility of obtaining two separate graphs: one for the words and one for the volumes. Owing to the geometric properties of the two spaces (duality), the dimensions are the same and the two graphs can be overlaid. This makes it possible to observe the system of relations between all the categories in play, although we must be very careful in interpreting the joint graphical representation of the two variables. To briefly summarize the elements for reading the graphs obtained from CA, we should remember that the position of a word or a volume assumes a role only in the global context of the graph: it has no meaning by itself, but acquires meaning in comparison with the positions taken by all the other points in the solution, with respect to the barycentre at the origin of the axes. If two words are close on the graph, they have similar profiles; analogously, if two volumes are close, they have similar lexical profiles. The mutual position of a word and a volume cannot be evaluated directly and must be assessed with reference to the positions of all the other elements. In this sense, it is useful to consider the quadrants of the Cartesian plane and, thanks to the axes, to evaluate proximity by taking into account the angles formed with the axes (the more similar the angles two points form with the axes, the more the corresponding categories can be considered associated).

The words or volumes that contributed the most to the solution, and which can therefore be considered the most important in the reconstructed context of the graph, are those far from the origin of the axes. A densification of categories in an area of the graph that stands out from the rest as a cluster might be interpreted as a semantic area, and for this purpose one often chooses to partition the points into clusters. The clusters of words or volumes should be as homogeneous as possible within groups and as heterogeneous as possible between groups. In the analysis of the lexical contingency table, a cluster analysis based on the CA groups the volumes together on the basis of lexical similarity (which is usually also visible as proximity of the points on the graph).

1.1.2 An Example

To understand how CA works, a worked example on a very simple fictional corpus might be useful. Suppose we have 11 texts that represent topics of a journal in the statistical field and constitute a small text corpus:

  • text01 regression analysis; linear regression

  • text02 regression model; linear and non-linear model

  • text03 generalized linear model; parameter estimation

  • text04 sampling methods; random sampling; survey design and sampling methods

  • text05 survey design; finite populations

  • text06 methods for sampling elusive populations

  • text07 Normal distribution

  • text08 z-scores and Normal distribution

  • text09 Gamma distribution

  • text10 p-value: Normal distribution and Gamma—exponential family

  • text11 regression analysis; Normal distribution

There are 53 word-tokens and 25 word-types in the corpus. Taking into account only the words that occur at least twice, namely distribution (5 occurrences); and, linear, Normal, regression, and sampling (4 each); methods and model (3 each); and analysis, design, Gamma, populations, and survey (2 each), we can construct a contingency table words × texts (Table 1.2), in which we see, for example, that the word survey was used once each in texts 04 and 05.

Table 1.2 Contingency table words × texts

The CA of the contingency table results in 10 factorial axes. The first two axes collect 55% of the information (explained inertia), and the first factorial plane is shown in Fig. 1.2.

Fig. 1.2
figure 2

First plane of correspondence analysis. Visualization of texts (a) and of both texts and words with frequency ≥2 (b)

Figure 1.2 clearly shows the three latent patterns present in the texts, which refer to linear models (regression, analysis), sampling methods (survey design, populations), and distributions (Normal, Gamma). Texts 01, 02, and 03 are found together in the area of linear models (second quadrant, upper left), while texts 07, 08, 09, and 10 lie in the area of distributions (third quadrant, bottom left). Text 11 falls somewhere between the linear-model and distribution areas because it includes both topics. In the area of sampling methods (first quadrant, on the left), there are texts 04, 05, and 06. It is interesting to note the conjunction and, which is found near the origin of the axes because it has been used in different contexts (though slightly more often by those who talked about distributions).
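As a quick check, most of the word counts stated above can be reproduced from the 11 texts. The tokenization below is an assumption (hyphenated forms such as non-linear, z-scores, and p-value are kept whole, which affects a couple of the counts), so only the counts that are unambiguous under this scheme are verified:

```python
import re
from collections import Counter

# The 11 fictional texts of the example.
texts = [
    "regression analysis; linear regression",
    "regression model; linear and non-linear model",
    "generalized linear model; parameter estimation",
    "sampling methods; random sampling; survey design and sampling methods",
    "survey design; finite populations",
    "methods for sampling elusive populations",
    "Normal distribution",
    "z-scores and Normal distribution",
    "Gamma distribution",
    "p-value: Normal distribution and Gamma—exponential family",
    "regression analysis; Normal distribution",
]

# Assumed tokenization: word characters and internal hyphens form one token.
tokens = [w for t in texts for w in re.findall(r"[\w-]+", t)]
counts = Counter(tokens)
```

With these counts, distribution appears 5 times; regression, sampling, Normal, and the conjunction and appear 4 times each; methods and model 3 times each, matching the figures given in the text.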


Copyright information

Ā© 2018 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Tuzzi, A. (2018). Introduction: Tracing the History of a Discipline Through Quantitative and Qualitative Analyses of Scientific Literature. In: Tuzzi, A. (eds) Tracing the Life Cycle of Ideas in the Humanities and Social Sciences. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-97064-6_1
