Introduction

The World Values Survey (WVS) is “an international research program devoted to the scientific and academic study of social, political, economic, religious and cultural values of people in the world”. In order to sustain such an ambitious goal, a representative comparative social survey (Haerpfer et al. 2022) is conducted globally every 5 years. The survey data is freely available and, by embracing an effective open data philosophy (Murray-Rust 2008), the WVS has progressively become one of the most authoritative and widely used cross-national surveys in the broad field of social sciences.

The popularity and relevance of WVS within the scientific community are evident from the very large number of studies in the literature based on its data. To provide just a few significant examples among many, WVS allows cross-national comparison (Alemán and Woods 2016), agile analysis (e.g. MacIntosh 1998; Silver and Dowley 2000; Bruni and Stanca 2006; Amoranto et al. 2010; Cowley and Smith 2014) and easy extension or deepening (e.g. Johnson and Mislin 2012). Because of the multidimensionality of the dataset, resulting studies may have a more holistic focus (e.g. Fleche et al. 2012), or they can be framed within a more specific domain, such as education (Koshy et al. 2023) or religion (Freese 2004).

In this study, clustering techniques (Rui and Wunsch 2005) have been applied to discover patterns among a number of selected features from WVS. Clustering analysis has been extensively used in scientific contexts and keeps evolving in response to a constantly changing environment (Wierzchoń and Kłopotek 2018).

WVS deals with a large number of values but, rather than a formal multi-perspective classification, it provides a structure organised in thematic sections. Additionally, the cross-cultural focus, which is unquestionably one of its major strengths, should be properly considered when addressing holistic studies. A basic classification to distinguish between perceptions and more deep-rooted values can be established, at least in generic terms, according to straightforward and relatively objective criteria, in line with definitions and studies in the literature. However, such a simplified approach hides an intrinsic underlying complexity. For instance, some perceptions (e.g. “happiness” or “satisfaction”) are relatively easy to identify, while others (e.g. “perception of security”) may depend on environmental and contextual factors. The same applies to the different values, whose interpretation and consequent classification can vary considerably depending on the analysis context. In this specific work, as an assumption, features have been softly classified as values, opinions and perceptions, based on their theoretical likelihood to change over time. Such an assumption enables a semantically enriched conceptual framework that does not affect numerical results but enhances qualitative and critical analysis. Additionally, the focus is on beliefs and opinions that are more likely to have a concrete social impact, such as those that are potentially discriminatory or divisive.

The proposed approach, which combines clustering techniques with a semantically enriched framework, enables a further level of abstraction for critical analysis. At the same time, it facilitates continuity with possible future work and provides a natural bridge to predictive models. Indeed, the actual scientific value of the identified patterns depends on the capability to interpret them in context through an enhanced knowledge engineering process.

From a more philosophical perspective, this work can be framed within a hybrid context as there is no pre-formulated theory but rather an attempt to discover patterns and new knowledge from data by adopting hybrid practices (Tuunainen 2005). Given the relatively manageable dimensionality of the input dataset, the feature selection has been performed according to an application-oriented approach (rather than driven by statistical analysis) to establish a more comprehensive and consistent research framework.

The study primarily aims at a self-contained analysis based on the identification of critical patterns. On the other hand, the application of unsupervised learning techniques (Alloghani et al. 2020), which by definition work on unlabeled data, provides potential for classification and for a natural evolution towards predictive models. The main findings are briefly discussed through a concise dialogue with the literature. This allows an interpretation in context, as well as the identification of possible gaps.

From a methodological perspective, the intended holistic analysis has been conducted according to a systematic approach, which is described in detail in the next section together with the main research design decisions. As explained later in the paper, the most critical and sensitive parameter is the set of thresholds used to identify patterns. Indeed, since patterns are defined on relative differences among clusters, thresholds become a determinant. This applies to the analysis itself, given the methodology adopted, as well as to the support of additional Machine Learning steps through automated/semi-automated data labelling (e.g. Willemink et al. 2020) and related further applications, such as predictive (e.g. Di Francescomarino et al. 2016, among many others) and ontological modelling (for example Pileggi 2023a). Instead of treating such a parameter as a variable, with a consequent impact on the clarity of the analysis, we have defined thresholds by looking at the adopted numerical scale and setting regular numerical steps accordingly. We believe that such an approach fosters a more transparent and understandable communication of the key findings.

Looking holistically at this research, the exploratory focus and its inherent simplicity naturally enable insight and added value for future theoretical development. The latter can be translated into enhanced capabilities to bridge existing knowledge gaps, as well as to identify additional gaps and outline research directions accordingly.

Research question(s)

In more formal terms, the paper addresses the following research questions:

  • How to systematically enable hybrid science looking at large datasets?

  • How can such methods be applied to complex case studies to generate insight and added value?

Structure of the paper

The core part of the paper is structured according to a classic schema: design and methodological aspects are presented in section “Research design, methodology and approach”, computation results are briefly summarised in section “Computation results” and discussed in section “Discussion”.

Research design, methodology and approach

As previously introduced, the study adopts clustering techniques and has been conducted by following the process depicted in Fig. 1. This section explains such a process in detail with a focus on the research design and key related decisions.

Fig. 1 Overview of the process

Dataset

This study is based on the seventh wave of the WVS survey (2017–2021) (JD Systems Institute & WVSA 2022), which is composed of 290 core questions (including also demographics) in addition to a number of extra modules. Core questions are structured in 13 different thematic sections, in addition to demographics. The original dataset includes 94,278 answers to the survey.

As a rich data source, WVS supports studies with a broad range of scopes, purposes and focuses, including, among others, cross-cultural or cross-country analysis and long-term trend analysis. At the same time, WVS supports more specific studies, as many contributions in the literature, including the previously mentioned works, demonstrate.

This study keeps a generic and holistic focus in principle by dealing with multiple values, related dependencies and social implications. However, the analysis is conducted in the context of an abstracted framework which distinguishes between deep-rooted values, beliefs/opinions and perceptions.

Feature selection

In general terms, given the dimensional richness of the original dataset, statistical analysis can potentially provide insight and contribute to identifying key patterns. However, rather than on a holistic analysis of the dataset, this work focuses on the analysis of specific features of interest that define a subset \(f\) of the original dataset \(F\) (Eq. 1).

$$ {f = \left[{f_0},{f_1},{f_2},\ldots,{f_i}\right]\quad\quad\quad\quad\quad{f_k} \in F\;\;\forall \;k \in \left[0,\ldots,i\right]\;\;\implies {f \subseteq F}} $$
(1)

Feature selection plays a critical, if not determinant, role in many contexts and studies. In a computational world that normally wants to take advantage of data with very high dimension, systematic approaches are usually required and, indeed, a number of consolidated algorithms for feature selection have been developed by the research community (Chandrashekar and Sahin 2014). Additionally, certain disciplines and applications may even suggest a more domain-specific approach (e.g. bio-informatics (Saeys et al. 2007)).

In this specific context, the goal is to maximize the application value rather than the number of identified statistical correlations. Therefore, taking advantage of their manageable number, features have been selected as part of a modelling process according to the following criteria:

  • Selected features should independently model stand-alone attributes. As the original survey is quite structured and questions are often grouped, this criterion becomes critical and enables relatively clear boundaries.

  • Features are selected trying to minimize potential overlap. For the same reasons previously mentioned, certain questions address similar concepts.

  • The priority is on generalizable attributes, i.e. those attributes that are more likely to reflect generic concepts at a social level. Indeed, the abstracted conceptualization introduced by the analysis framework needs to be consistent with the underpinning data and semantically consolidated. In line with the first criterion, selected questions should be on aspects of general relevance at a social level, for instance because they are potentially discriminatory or divisive.

The application of such criteria has led to the selection of 16 features. The original IDs and related survey questions are reported in Table 1. Because of the holistic research focus, no demographic feature has been considered. Demographics could be considered to address more specific studies.

Table 1 Features and semantic characterization

Because of the generality of the selection principles, the feature selection process presents a certain degree of subjectivity on the potential relevance of the different features in the context of the proposed study. This is a relatively common situation, for instance when composite indicators are generated from more specific ones (e.g. in Pileggi 2019, 2022, 2023b). However, in this specific case, we believe that the bias is limited by the scope of the analysis conducted.

Semantic characterization

Each selected feature defines a concept as reported in Table 1. Additionally, as shown in the same table, concepts are classified as follows:

  • Demographic as in a common meaning.

  • Value, understood as a principle or standard of behaviour. In this context, the key assumption is that, in very generic terms, values are unlikely to change very much at an individual level as they normally result from the cultural background and other deep-rooted factors.

  • Opinion/Belief, as in a common meaning. The assumption is that an opinion is still a firm individual belief but it is somewhat more likely to change than a value.

  • Perception, a belief or opinion that significantly differs from the previous category as it is based on a very personal, often temporary, understanding or interpretation of a given reality or situation. It is assumed to be much more volatile than the previous category as a perception can change relatively often in response to happenings or changes of circumstances.

That is understood to be a very soft classification (summarised in Fig. 2) as there is no clear boundary among value, opinion/belief and perceptions. Additionally, such concepts are adopted across different studies with slightly different semantics.

Fig. 2 Soft classification of concepts

As previously discussed, a completely objective classification is unrealistic. However, a number of driving criteria have been applied in this specific study.

First of all, even in a context of continuous cultural change and evolution (Inglehart and Baker 2000), traditional values are more likely to preserve their deep-rooted character. This is definitely the case for religion (Roccas 2005; Luckmann et al. 2022), family (Hakim 2018) (often extended to friends (Pahl and Pevalin 2005)) and related opinions (Knafo and Galansky 2008), work culture (Casey 2013) and its balance with personal life (Brough et al. 2020), as well as engagement/interest in politics (Bowler et al. 2007). Such considerations have driven the identification of a corpus of values/principles (Q1-Q6 and Q27 in Table 1).

On the other hand, a set of perceptions has been identified assuming the commonly accepted definition and their potential social relevance (e.g. Oppenheimer 2006). While perceptions may present some dependency on cultural and deep-rooted factors, their subjectivity and potential likelihood to change in response to personal and/or environmental factors define a trend in contrast with, if not opposite to, the previous category, which is mostly characterized by “stability”. The features associated with this class in the study (Q46, Q49, Q50, Q112 and Q131 in Table 1) are considered as perceptions in a large number of studies in the literature. Happiness (e.g. Robert Cloninger and Zohar 2011) and life satisfaction (e.g. Miller et al. 2019) are constant objects of study within the scientific community, as are financial well-being (e.g. Ponchio et al. 2019), corruption (Melgar et al. 2010) and security in its many possible meanings (e.g. Greco and Polli 2021). Note that, in general terms, there is a significant difference between a perception and a more or less official measure or estimation. A typical example is corruption (Olken 2009).

The additional category (opinion/belief) presents hybrid characteristics as it includes features that are reasonably related to the cultural background and education but also sensitive to the social environment. The underlying complexity is extensively discussed in the literature. This is the case for the different aspects of homosexuality acceptance (e.g. La Roi and Mandemakers 2018), the many kinds of gender discrimination (for instance at work (Cleveland et al. 2013)), social trust (Holmberg and Rothstein 2017) and confidence in authorities (Tyler 2001). These features have been selected in this study (Q29, Q36, Q60 and Q69 in Table 1).

Filtering and scaling

Incomplete data entries are not suitable for the target analysis. Since the original dataset encodes missing information with negative values, filtering by accepting only positive values ensures that only complete lines are included. Filtering is formalised in Eq. (2), where \({X^f}\) represents the set of multi-dimensional data points \(x\), dimensionally restricted to the set of selected features \(f\). A given data point \({x_j}\) is considered in the study only if its value is positive for each selected feature. After filtering, there are 72,810 valid data entries.

$$\eqalign{{x_{j}^{f}} &= \left[{x_{j}^{f_0}},{x_{j}^{f_1}},{x_{j}^{f_2}},\ldots,{x_{j}^{f_i}}\right] \in {X^f} \quad\quad\quad\quad{f_0},\ldots,{f_i} \in f \\ &\quad\quad\quad\quad\quad\quad\quad\quad{x_{j}^{f_i}}> 0\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad{\forall i,j}}$$
(2)
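As an illustration of Eq. (2), the filter can be sketched in a few lines of plain Python. The rows below are toy values rather than WVS records, and the three feature identifiers are only a hypothetical subset of the selection; the single property carried over from the text is that missing answers are coded as negative values.

```python
# Hypothetical subset of the selected feature IDs (illustration only).
SELECTED = ["Q1", "Q2", "Q46"]

# Toy survey rows; negative codes stand for missing answers.
rows = [
    {"Q1": 1, "Q2": 3, "Q46": 2},
    {"Q1": 2, "Q2": -1, "Q46": 1},   # missing Q2, dropped
    {"Q1": 4, "Q2": 2, "Q46": -2},   # missing Q46, dropped
    {"Q1": 3, "Q2": 1, "Q46": 4},
]


def keep_complete(rows, features):
    """Keep only rows whose value is strictly positive for every selected feature."""
    return [row for row in rows if all(row[f] > 0 for f in features)]


valid = keep_complete(rows, SELECTED)
print(len(valid))  # 2
```

A row is discarded as soon as one selected feature is missing, which is why the number of valid entries can be noticeably lower than the number of collected answers.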

Because of the different scales (Table 1), the formula in Eq. (3) is applied to obtain a uniform representation between a minimum value \({f^{Min}} = 1\) and a maximum value \({f^{Max}} = 4\). That is probably the most natural choice because most selected features (12 out of 16) are expressed according to that scale in the original dataset as reported in Table 1.

$$ {x_j^{{f_i}} = {f^{Min}} + \frac{{(x_j^{{f_i}} - min({x^{{f_i}}}))*({f^{Max}} - {f^{Min}})}}{{max({x^{{f_i}}}) - min({x^{{f_i}}})}}\quad\quad\quad\quad[{f^{Min}},{f^{Max}}] = [1,4]} $$
(3)
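The rescaling in Eq. (3) is a standard min-max transformation. A minimal sketch in plain Python follows; the function name and the toy column are assumptions made for illustration, and the minimum and maximum are taken from the observed values of the column, as the equation is written.

```python
F_MIN, F_MAX = 1.0, 4.0  # target range stated in the text


def rescale(values, f_min=F_MIN, f_max=F_MAX):
    """Min-max rescale one feature column to [f_min, f_max], as in Eq. (3)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant column: Eq. (3) is undefined")
    return [f_min + (v - lo) * (f_max - f_min) / (hi - lo) for v in values]


# Toy column answered on a 1-10 scale (illustrative values).
ten_point = [1, 4, 7, 10]
print(rescale(ten_point))  # [1.0, 2.0, 3.0, 4.0]
```

A column that already spans the 1 to 4 scale is left unchanged by this transformation, which is consistent with choosing the most common scale in the dataset as the target.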

Clustering

Unsupervised Learning (Alloghani et al. 2020) is a branch of Machine Learning that learns from unlabeled data and, therefore, without human supervision. It is often adopted to explore data, discover patterns and new knowledge, as well as to prepare further learning steps. Clustering algorithms (Rui and Wunsch 2005) partition data whose structure is not known in advance, according to different metrics and processes (Xu and Tian 2015).

The different clustering techniques are typically classified depending on their underlying approach. An effective categorization assumes five major classes of solutions (hierarchical, partitional, grid, density-based and model-based) (Saxena et al. 2017). An exhaustive discussion of the different solutions is out of the scope of this paper.

The analysis conducted is based on the classic k-means algorithm for clustering (Sinaga and Yang 2020), which belongs to the partitional clustering category as no hierarchical structure is assumed. K-means is probably the most popular and benchmarked algorithm in the field (Saxena et al. 2017). Because of its simplicity, which fosters transparency, it is well suited to this specific context, where generic clustering is needed without specifically critical requirements. More concretely, the computations adopt the Scikit-learn Python package (Pedregosa et al. 2011), which provides an implementation of k-means within an integrated framework. Such a software library is very popular within the scientific community as it is freely available and considered to be highly reliable. Similar criteria apply to the determination of the optimal number of clusters, which has been estimated heuristically by adopting the popular elbow method (Thorndike 1953) with the support of the Yellowbrick package (Bengfort and Bilbro 2019). By providing a kind of saturation value, i.e. a point where diminishing returns are no longer worth the corresponding increase in cost or complexity, the method is simple and intuitive, and it is also freely available as a computational resource.
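The clustering step and the elbow heuristic can be sketched with Scikit-learn as follows. This is a minimal sketch under stated assumptions rather than the configuration used in the study: the data are synthetic blobs, the values of `n_init` and `random_state` are arbitrary, and the curve is inspected numerically through the `inertia_` attribute instead of through Yellowbrick.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the filtered and rescaled data (not WVS data):
# three well-separated blobs in a 4-dimensional space.
rng = np.random.default_rng(0)
centres = np.array([[1.0] * 4, [2.5] * 4, [4.0] * 4])
X = np.vstack([c + 0.1 * rng.standard_normal((50, 4)) for c in centres])


def distortion_curve(X, k_values, seed=0):
    """Fit k-means for each k and return the within-cluster sum of squares."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        scores[k] = model.inertia_
    return scores


scores = distortion_curve(X, range(1, 7))
for k, score in scores.items():
    print(k, round(score, 2))

# Refit with the k chosen by inspecting the curve (3 for this toy data).
final = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centroids = final.cluster_centers_  # one row per cluster, one column per feature
```

On this toy data the score drops sharply up to k = 3 and flattens afterwards, which is the "elbow" the heuristic looks for. Yellowbrick's elbow visualizer plots the same kind of curve and marks the candidate point.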

Cluster analysis

The analysis is based on centroids, which are understood as the centres of the identified clusters. Centroids are vectors whose values are the mean of each feature over the members of the cluster. Because of its characteristics, a centroid can be understood as a kind of representative of a cluster.

In order to facilitate the analysis, each centroid is normalised by feature as per Eq. (4), where \({C^f}\) is the set of centroids and \({x^k}\) is a single centroid associated with the cluster \(k\). The maximum “interest”/“consideration” is associated with the value 0, while higher values indicate a proportional decrease of the associated relevance. This simple transformation allows reasoning in terms of relative differences.

$$\eqalign{&{x_{k}^{f}} \in {C^f}:\quad\quad{x_k} = centroid(cluste{r_k}) \\ {x_{k}^{f_i}} = {x_{k}^{f_i}} - min &(x_k^{{f_i}})\quad\quad\quad\quad\quad\quad\forall {x_k} \in {C^f}\quad,\,\,\,{\forall {f_i} \in f}}$$
(4)
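The normalisation in Eq. (4) can be sketched as below. The sketch assumes that the minimum is taken per feature across all centroids, which is our reading of "normalised by feature"; the function name and the toy centroids are illustrative.

```python
def normalise_by_feature(centroids):
    """Subtract, for each feature (column), its minimum across all centroids."""
    n_features = len(centroids[0])
    col_min = [min(c[i] for c in centroids) for i in range(n_features)]
    return [[c[i] - col_min[i] for i in range(n_features)] for c in centroids]


# Toy centroids: 3 clusters x 2 features (illustrative values).
toy_centroids = [
    [1.25, 3.0],
    [1.75, 2.0],
    [3.25, 2.5],
]
relative = normalise_by_feature(toy_centroids)
print(relative)  # [[0.0, 1.0], [0.5, 0.0], [2.0, 0.5]]
```

After the transformation, each feature has at least one cluster with value 0, namely the cluster that values that feature the most, and the other entries express the distance from it.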

The matrix resulting from merging the different centroids is provided according to two different views that are numerically equivalent but adopt a different visualization technique:

  • the holistic view aims to identify overall key patterns. Negligible values (<0.5) are represented in white; values between 0.5 and 1 in yellow; values between 1 and 1.5 in orange and, finally, values higher than 1.5 in red. Such a representation allows the intuitive identification of different levels of consideration or relevance, namely high (white), moderate (yellow), low (orange) and very low (red). Such qualitative thresholds are summarised in Table 2.

  • the feature view enables a kind of local view to systematically analyse the key patterns related to single features. A gradient scale is applied by feature to put emphasis on the distribution of each feature across the different centroids. Darker colors are associated with higher numerical values and, therefore, with less consideration/importance.

Table 2 Qualitative thresholds for centroid analysis
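The qualitative thresholds of the holistic view can be expressed as a small lookup. In this sketch, each band is assumed to include its lower bound and exclude its upper bound, since the text does not specify how values falling exactly on a threshold are treated; the names `BANDS` and `classify` are illustrative.

```python
# (upper bound, colour, level of consideration), following the holistic view.
BANDS = [
    (0.5, "white", "high"),
    (1.0, "yellow", "moderate"),
    (1.5, "orange", "low"),
]


def classify(value):
    """Map a normalised centroid value to a (colour, level) pair."""
    for upper, colour, level in BANDS:
        if value < upper:  # assumption: upper bound excluded
            return colour, level
    return "red", "very low"


for v in (0.0, 0.5, 1.2, 2.0):
    print(v, classify(v))
```

Because the bands are regular steps of 0.5 on the adopted scale, changing the step width is a one-line edit, which makes it straightforward to check how sensitive a pattern is to the thresholds.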

Because of the nature of the proposed views, which express relative patterns, they are considered looking also at major statistics (typically mean and standard deviation) to provide a more consistent interpretation in context. Moreover, qualitative views are always considered in the quantitative context in which they are generated. In other words, the qualitative characterization resulting from a more or less systematic partition of the evaluation space is useful to provide abstraction, while the associated interpretations are still quantitative.

Computation results

Major statistics on the input data are reported in Table 3, which shows the mean, the standard deviation, skewness and kurtosis for each selected feature.

Table 3 Main statistics on the input data (mean, standard deviation, skewness and kurtosis)
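For reference, the four statistics can be computed as sketched below. The text does not state which estimators were used, so this sketch assumes population (biased) moments and excess kurtosis (a normal distribution scores 0); the toy column is illustrative.

```python
import math


def describe(values):
    """Return mean, standard deviation, skewness and excess kurtosis.

    Assumption: population (biased) central moments are used throughout.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,
    }


# Toy right-skewed column: most answers at 1, a few higher (illustrative).
toy = [1, 1, 1, 1, 1, 1, 2, 2, 3, 4]
stats = describe(toy)
print({name: round(value, 2) for name, value in stats.items()})
```

A column concentrated on the lowest answer with a thin tail of higher answers, as in the toy example, yields a low mean and a positive skewness.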

Most features classified as values (Family, Friends, Leisure, Work and Parents Opinion) are highly considered (low mean), in addition to a generalised perception of high happiness (mean = 1.83). Family (mean = 1.12) is the most valued by far. Politics, Religion, Gender Discrimination, Homosexuality Acceptance, Trusting in Others, Confidence in Authorities, Security Perception and financial stability are in a medium range (mean between 2 and 3), with a relatively high standard deviation. At the other extreme, there is a very low perception of corruption (mean = 3.2) and of satisfaction (mean = 3.06).

In terms of skewness and kurtosis, the input dataset presents positive and negative values in a relatively limited range. The only outstanding exception is family, which has a remarkably right-skewed distribution and a high kurtosis value.

The number of clusters to consider is an input for the K-Means algorithm. The optimal number of clusters (k = 9) has been heuristically estimated by applying the elbow method. In this specific case, such an optimization point is considered to be reasonable in context, looking at the actual dimensionality, i.e. at the number of considered features. A visualization of the outcome based on the distortion score is reported in Fig. 3.

Fig. 3 Heuristic estimation of the optimal number of clusters to consider

The computation outcome is visualized in Fig. 4: a bi-dimensional representation is proposed in Fig. 4a, while Fig. 4b presents the size of the different clusters in percentage. As shown in the figure, clusters are relatively balanced in size, each accounting for between 7% and 16% of the data.

Fig. 4 Visualization of clusters. (a) Bi-dimensional visualization of clusters. (b) Clusters size (%)

The Holistic View and the Feature View, as previously defined, are reported in Fig. 5a and 5b respectively. They are two-dimensional structures, where columns are clusters and rows are features. Therefore, a column indicates the values of the different features for a given centroid. Key findings are identified on these views and are discussed in detail in the following section.

Fig. 5 Holistic and feature view. (a) Holistic view. (b) Feature view

Discussion

A cluster analysis facilitated by the provided views allows the identification of several patterns of potential interest. Such patterns largely depend on relative thresholds and their interpretation, as well as on the specificity of the focus of the study. In this section, major patterns, identified assuming a relatively low number of thresholds (Table 2), are described and discussed in context with a holistic focus. Such findings are summarised in Table 4.

Table 4 Summary of the major identified patterns by cluster analysis

There is an evident symbiotic relationship between the perception of satisfaction and that of financial stability, as the patterns of the two features (Fig. 5) present a very high degree of similarity. A possible correlation is suggested by the similar mean and standard deviation (Table 3), which on average point to very low levels of perceived satisfaction and financial stability. The cluster analysis provides further insight and is largely supported by the literature, since the relationship between financial and life satisfaction has often been an object of study. For instance, Medgyesi and Zólyomi (2016) address the explicit impact of job and financial satisfaction on the overall satisfaction with life, Christoph (2010) suggests a more accurate analysis by adopting alternative measures, Gray (2014) adopts a different analysis strategy focusing on financial concerns, Boes and Winkelmann (2010) put specific emphasis on the impact of income, Diener and Biswas-Diener (2002) take a more holistic focus, while Frijters et al. (2004) provide empirical evidence.

A similar relationship also exists between the perception of happiness and that of security (Fig. 5a), in this case with absolute values that may be considered in a medium range (Table 3). In general terms, the definition of security is contextual. In the original survey, security is approached holistically, both in the formulation of the question itself and in the context of the corresponding section, which includes multiple questions. Also in this case, navigating the literature may be a very articulated process involving many factors (e.g. Ouweneel 2002). Likewise, the self-assessed perception of happiness (Dolan et al. 2008) is far from being uniquely understood and may depend on different determinants (Schimmel 2009).

Additionally, the conducted analysis allows some considerations about the relationship between happiness and satisfaction. Although in certain contexts the two concepts are used interchangeably or, in any case, considered to be similar, they have different definitions, as “happiness is a momentary experience that arises spontaneously, while life satisfaction is a long-term feeling based on achieving life-long goals” (Badri et al. 2022). In this specific case, there is on average a much higher level of perceived happiness (mean = 1.83) than of satisfaction (mean = 3.06). Cluster analysis has pointed out different patterns (Fig. 5). The existing literature demonstrates the complexity of the relationship, which normally requires a multi-domain analysis to be properly addressed (Michalos 1980). For example, the specific role of work has been investigated by Dockery et al. (2003), while a more generic study (Peiró 2006) frames the analysis within the broad socio-economic conditions.

Religion presents by far the most polarised patterns, with six out of the nine clusters considering it to be a relevant value and the remaining clusters associating it with a low relevance (Fig. 5a). In numerical terms, that is consistent with the main statistics reported in Table 3 (mean in a medium range and high standard deviation). This in a way reiterates the relevance and the role of religion in a constantly evolving society (Turner 2011; Luckmann et al. 2022).

Politics, perception of corruption and homosexuality acceptance present the opposite trend, as they are the most regularly distributed across the different clusters. That is evident in Fig. 5a since all the qualitative characterizations reported in Table 2 are present for these features. The associated mean is high for the three attributes, meaning that on average there is a low interest in politics, a low level of homosexuality acceptance and a high perception of corruption. Looking at the relationship among these features, the most evident pattern (cluster 4) associates a maximum interest in politics with a very high perception of corruption and a very low acceptance of homosexuality. While the perception of corruption is extensively addressed in the literature, as far as we know, there is much more limited knowledge on the relationship between political interest and perception of corruption (e.g. Dong and Torgler 2009). Similarly, we could not identify any specific study on the relationship between interest in politics and homosexuality acceptance.

Focusing more specifically on homosexuality acceptance, a very low level of acceptance (clusters 1, 4 and 5) is related, among others, to a high consideration of religion, while the higher levels of acceptance (clusters 2 and 7) are associated with both a low (cluster 7) and a high (cluster 2) consideration of religion. Such a relationship is extensively addressed in the literature (Adamczyk and Pitt 2009; Whitehead and Baker 2012; Jäckle and Wenzelburger 2015; Xie and Peng 2018).

Finally, family is the value appreciated the most. Other values (work, friends, leisure and parents opinion) are also generally appreciated, although at a slightly lower level.

Conclusions and future work

The application of unsupervised learning techniques and enriched semantics to a large-scale survey has enabled a dynamic analysis framework according to a hybrid science approach aimed at discovering patterns among features of interest. The analysis of the resulting clusters has provided insight and, more generally, added value. A number of critical patterns have been identified accordingly and discussed by facilitating a dialogue with the literature. Because of the high dimensionality of the input dataset, which reflects a variety of related social aspects, a clear identification of the most critical patterns may be challenging, especially in a cross-cultural context. On the one hand, the methodology adopted and the associated framework provide a computational resource to enable a systematic analysis; on the other hand, they allow customisation through semantics so that they can be re-used in a different context.

Furthermore, the hybrid approach characterising this study has contributed to defining and progressively consolidating a few research gaps and possible future research directions in the field of social sciences, as not all identified patterns seem to be extensively addressed in the literature. Overall, the exploratory focus has proved effective in the context of a complex case study by leveraging the flexibility of clustering techniques, as well as the underlying conceptualisation to support future theoretical development.

The major limitation of the analysis conducted is related to a fundamental sensitivity to the adopted thresholds, which have been intuitively defined by looking at numerical scales. As previously discussed, other potentially biasing factors in the research design can be considered in the context of the theoretical trade-off needed to establish a hybrid approach in practice. In the specific use case addressed in this study, human-driven (feature selection) and computation-driven (number of clusters) design decisions converge towards a heuristic optimization of such a trade-off, which assumes, ideally, low dimensionality and a proportional number of clusters. The dimensionality reduction performed by conceptualization finally resulted in 16 selected features associated with 9 clusters. Therefore, the features/clusters ratio is close to 2 (16/9 ≈ 1.8). It can be intuitively considered to be a good configuration for the proposed analysis framework at a low scale.

The holistic focus of the analysis intrinsically suggests more specific studies, as well as a consolidation of the main findings, for instance by considering a multi-feature approach for the conceptualization. Additional statistical tests to confirm the strength of the identified relationships should be conducted on more extended data spaces resulting from the integration of different datasets. The most natural direction for future work is probably the generation of predictive models based on the provided classification, which allows data labelling and the consequent closing of the learning loop.