1 Introduction

ML is an automatic learning process that takes place through the processing of usually very large data sets. Earlier procedures, grouped under the label of "symbolic artificial intelligence", relied on algorithms consisting of logical sets of instructions in which a given output (usually called the target) was encoded for every possible input. By contrast, ML systems "learn" directly from data: they estimate mathematical functions that discover representations of some input, or learn to link one or more inputs to one or more outputs so as to formulate predictions on new data (Jordan and Mitchell 2015).

In recent years, ML has begun to be applied in academic research across the human sciences, including economics (Varian 2014; Blumenstock et al. 2015; Athey and Imbens 2017; Mullainathan and Spiess 2017), political science (Baldassarri and Goldberg 2014; Bonikowski and DiMaggio 2016), sociology (Barocas and Selbst 2016; Evans and Aceves 2016; Baldassarri and Abascal 2017), and communication science (Hopkins and King 2010; Grimmer and Stewart 2013; Bail 2014), as well as in the management of services provided by public administrations (Athey 2017; Berk et al. 2018) or by private companies.

Overall, many different approaches and tools are included under the ML label (Kleinberg et al. 2015). Here we consider only ANNs that use supervised ML algorithms. In supervised ML, the algorithm observes an output for each input; this output gives the algorithm a target to predict and acts as a "teacher". Unsupervised ML algorithms, by contrast, observe only the input, and their task is to compute a function independently, without a predetermined target (Hastie et al. 2009; Molina and Garip 2019). The goal of this paper is to apply ANNs to sociological data, comparing the results with those of traditional statistical techniques in order to evaluate their strengths and weaknesses.
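To make the distinction concrete, here is a minimal sketch in Python using scikit-learn; the data and all names are synthetic and purely illustrative, not taken from the study.

```python
# Minimal sketch of the supervised/unsupervised distinction with
# scikit-learn; data and names are synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # inputs
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Supervised: the observed output y acts as the "teacher".
supervised = LinearRegression().fit(X, y)
print(supervised.predict(X[:2]))                    # predictions on inputs

# Unsupervised: only X is observed; structure is found without a target.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(unsupervised.labels_[:10])                    # cluster assignments
```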

2 Short illustration of artificial intelligence and machine learning based on artificial neural networks

Artificial intelligence (AI) is a branch of computer science that encompasses a huge variety of computational operations, ranging from classical algorithm design to ML and deep learning (DL) techniques (Russell and Norvig 2010; Kitchin 2014b). The substantial difference between these approaches is that, while traditional AI problem-solving methods are based on if-then rules, ML and DL iteratively evolve an understanding of data sets without the need to explicitly code any rules. This allows the computing system on which they are implemented to learn automatically and make predictions from a set of input data, adjusting its parameters by optimizing a performance criterion defined on the data and reducing the error rate at each stage of the learning process (Alpaydin 2016; Goodfellow et al. 2016).

In other words, the aim of ML is to construct a software program that adapts and learns independently, that is, without a pre-programmed system establishing how it should behave. Algorithms can learn from their mistakes thanks to training data used as examples. Accordingly, how much a model learns depends on the quality and amount of example data to which it has been exposed (Nilsson 2010; Dong 2017).

The considerable availability of information, due to the deluge of big data gathered from all kinds of specialized sensors and digital devices, together with the rapid growth in parallel and distributed computing systems, made possible by faster CPUs, the advent of general-purpose GPUs, faster network connectivity and better software infrastructure for distributed computing, has given a boost to this sector (National Research Council 2013; Schmidhuber 2015; Goodfellow et al. 2016). AI applications are constantly evolving, reaching high levels of complexity and fascinating results in many different tasks: language translation, speech recognition, visual processing, spam filtering, and so on.

It is intuitive that companies capable of correctly collecting and storing data are candidates to lead the AI sector. Many applications of DL are highly profitable (Goodfellow et al. 2016; Zuboff 2019). Indeed, despite the emphasis around the state of the art, most big tech companies still use traditional ML models instead of more advanced DL, and depend on a traditional infrastructure of tools poorly suited to ML (Dong 2017).

The early roots of DL date back at least to the 1960s, when it was intended as a computational model of biological learning, that is, a model of how learning happens or could happen in the brain. As a result, one of the names DL has gone by is ANNs (Schmidhuber 2015; Goodfellow et al. 2016). The two terms are often used as synonyms; to be precise, DL is a subfield of ANNs that uses multi-layered neural networks to process information. The idea behind deep neural networks is that, starting from the raw input, each hidden layer (so named because its values are not given in the data) combines the values of its preceding layer and learns more complicated functions of the input. It is difficult for a computer to understand the meaning of raw input data; DL resolves this difficulty by breaking the desired task into a series of nested concepts, each described by a different layer of the model (LeCun et al. 2015; Alpaydin 2016; Goodfellow et al. 2016).

There is no consensus about how much depth a model requires to qualify as deep. Discussions with DL experts have not yet yielded a conclusive response to this question. However, DL can be safely understood as the set of models that involve a greater amount of composition of either learned functions or learned concepts than traditional ML does (Schmidhuber 2015; Goodfellow et al. 2016).

DL is not a breakthrough in the scientific sense; rather, it is a relevant breakthrough in efficient coding that makes a difference in several contexts. In practical applications, DL achieves higher accuracy on more complex tasks than traditional ANNs, although it requires more computational resources. Furthermore, DL needs less manual intervention to craft the right features or the suitable transformations of data, and it performs exceptionally precise operations on data coming from different modalities, such as images, texts and videos (Schmidhuber 2015; Alpaydin 2016; Goodfellow et al. 2016).

In summary, ML offers numerous mathematical tools to deal with a wide variety of problems. The main tools, very popular nowadays, are ANNs, which are trained to solve a particular task. Neurons are organized into groups called layers and connected to each other to form a network. As mentioned, when the number of layers is high, the neural network is called deep. The DL approach attempts to mathematically model the way the human brain processes information in vision and hearing: the stimuli from eyes and ears, passing through the brain, are initially broken down into simple concepts and gradually reconstructed into increasingly complex and abstract representations (Russell and Norvig 2010; Alpaydin 2016; Goodfellow et al. 2016).

Similarly, in a deep network a face is presented as an array of pixel values. The first layer can easily identify edges of different orientations; subsequent layers combine these to form corners and extended contours; further layers can detect entire parts of specific objects by finding specific collections of contours and corners. Finally, these are in turn combined by a few more layers of processing, allowing the network to represent the faces we want it to learn (Nilsson 2010; LeCun et al. 2015; Alpaydin 2016; Goodfellow et al. 2016).
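The following toy sketch illustrates this layer-by-layer composition. The weights here are random placeholders rather than a trained face detector, so only the structure of the computation is shown.

```python
# Toy forward pass of a three-hidden-layer network; the random weights
# stand in for trained ones, so only the structure of the computation
# (simple features composed into more abstract ones) is shown.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(42)
pixels = rng.random(64 * 64)              # a "face" as a flat pixel array

W1 = rng.normal(size=(128, 64 * 64))      # would learn edge-like features
W2 = rng.normal(size=(32, 128))           # corners and extended contours
W3 = rng.normal(size=(8, 32))             # parts of objects

h1 = relu(W1 @ pixels)                    # simple functions of raw input
h2 = relu(W2 @ h1)                        # combinations of those functions
h3 = relu(W3 @ h2)                        # increasingly abstract representation
print(h3.shape)                           # (8,)
```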

So, the choice between ML and DL algorithms depends on the problem to be analyzed: if the problem is relatively simple, it is preferable to use ML based on ANNs with few layers of hidden units; if it is complex or requires very specific and rigorous objectives, it is more useful to resort to DL.

3 Methodology

The starting point of our experiments is to evaluate whether, in the typical data analysis operations of the social sciences, ML techniques based on ANNs can constitute an alternative to, or at least an integration of, the traditional data analysis tools, which basically consist of linear and logistic regression models.

As is known, multivariate data analysis models generally perform an empirical test of one or more hypotheses derived from a theory, and the results consist in the comparison between the so-called expected, or theoretical, data and the empirical data. If the discrepancy revealed by this comparison is attributable to random effects, the model is said to fit, or be compatible with, the data; otherwise the model must be revised or, if this is impossible, rejected (Di Franco 2017). This is the so-called confirmatory-explanatory approach.

Starting from the work of data analysis pioneers such as Fisher (1925, 1935), Galton (1869, 1886), Spearman (1904, 1927) and many others, data analysis in the social sciences has for many decades been characterized by this approach, which fundamentally seeks to identify, from associations between a set of empirically observed variables, causal links between those variables. In this context the model (i.e. the theory) is prevalent, and the data are used to evaluate the goodness of fit of the model, expressed by the value of a coefficient of statistical significance (p-value).

Alternative approaches to data analysis, based on induction, exploration-description, simulation, etc., which have also been proposed over time (among others by Benzécri 1969, 1992; Benzécri et al. 1973a, b; Tukey 1977; Gifi 1981, 1990), have received less interest among social science researchers. The characteristic of these alternative approaches is the inversion of the relationship between data and theory: data are more important than the model. This means that, starting from the data, one must find the model that best represents them, whereas in the causal approach the starting point is a model and the data are used to test it.

Thanks to recent developments in different disciplines such as applied mathematics, statistics, and information technology, approaches based on the prevalence of data have become established, or are emerging, in many areas of the natural and biometric sciences. Over time these approaches have taken on different names, such as data mining, statistical learning, machine learning (ML) and deep learning (DL).

In addition to the innovations just mentioned, and starting from developments in information and communication technologies and web platforms, the current historical period is strongly characterized by so-called big data and by their management through mathematical algorithms able to process them independently and extract information useful for various purposes. As a result, many ML techniques exist today. A common feature of these techniques is that they are exploratory and rely on computer-assisted analysis.

One large subdivision of these techniques uses a single outcome and tries to make an optimal prediction of this outcome from multiple predictor variables (supervised learning techniques). The second subdivision does not require any outcome and merely classifies inputs into subgroups based on similarities among a set of variables (unsupervised learning techniques).

For the purpose of our experiments we will use ML based on ANNs whose units are arranged in three layers (input, hidden and output), with unidirectional connections between each unit of one layer and all the units of the next layer.

Being essentially a distributed processor built in analogy with the human central nervous system, an ANN is generally composed of elementary computational units called neurons, conceivable as interconnected nodes of a network with certain processing capacities. Artificial neurons receive a combination of signals from the outside or from other neurons and transform them through a particular function called the activation function, thus storing data in the network parameters, in particular in the weights associated with each connection.

The network then returns an output: a result that generally depends on the purpose for which the ANN was built (classification, recognition, approximation, etc.).

The relationship between incoming and outgoing data is generally determined by three factors; a minimal sketch of a single neuron follows the list:

  • The type of elementary units used: the complexity of their internal structure and the class of activation function used;

  • The formal structure of the network: the number and arrangement of the nodes and the direction of their connections, which can be represented with the tools of graph theory;

  • The values of the internal parameters associated with the neurons and their interconnections, to be determined using appropriate learning algorithms.
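As announced above, here is a minimal sketch of a single artificial neuron; the weights, bias and sigmoid activation function are illustrative choices, not values from the study.

```python
# A single artificial neuron: a weighted sum of incoming signals plus a
# bias, passed through an activation function. All values illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, -1.2, 3.0])    # signals from outside or other neurons
weights = np.array([0.8, 0.1, -0.4])   # one weight per incoming connection
bias = 0.2                             # threshold term

output = sigmoid(weights @ inputs + bias)  # the neuron's outgoing signal
print(output)
```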

The question we ask is whether ANNs can be usefully applied in social research, not only as a complex of nonlinear data processing algorithms but also as a tool to simulate social phenomena (Capecchi 1996).

It is difficult to assimilate social phenomena to neurophysiological ones; for this reason, the analogies between the nodes of an ANN and neurons, between its connections and synapses, etc., which are possible in the study of the brain, are not possible in these other cases. However, the question is whether the abstractness of the structures and processes postulated in ANNs, understood as models of complex nonlinear dynamic systems, allows their application also to the study of social phenomena. In this case it is necessary to determine the interpretation to be given to concepts such as node, connection, excitation/inhibition, connection weight, learning rule, equilibrium and so on.

On the other hand, the use of ANNs makes it possible to partially overcome some limitations of analyses conducted with traditional statistical techniques. For example, ANNs require no hypotheses about the distributions of the system variables or their reciprocal associations; for this reason, cardinal, ordinal and/or categorical variables can all be treated (Di Franco 2017). In this approach the actual analysis of the system is left to the network, which alone creates its own criteria to reproduce the system's behaviour and consequently becomes able to formulate predictions about the system itself. In Fabbri and Orsini's (1993) judgement, this is both a strength and a weakness of ANNs: a strength because the researcher is not conditioned by a priori hypotheses in the choice of the units of the network; a weakness because the network can do nothing other than reproduce, in a phenomenological manner, the behaviour of the analysed system, without contributing to the knowledge of the internal relationships between the single parts of the system. This problem, however, can be partially overcome, since devices that allow us to interrogate the network about what it has learned to reproduce have been fine-tuned (Di Franco 1998).

If the simulation approach of ANNs to social phenomena proved possible and useful (Capecchi et al. 2010), this would allow significant progress in the social disciplines, because it would also contribute to founding a consistent basis of simulation concepts, models and techniques. If social phenomena can be thought of as complex dynamic systems, then it is necessary to accept the possibility of simulating them on a computer, with more meaningful results than those obtainable with traditional data analysis tools.

We now describe the methodology used in the examples whose results are presented in the next section. The data used in the three examples come from a matrix containing information on the electoral polls published in Italy by the mass media from 1 January 2017 to 29 February 2020. The information relating to these polls was downloaded from the institutional website of the Presidency of the Council of Ministers: www.sondaggipoliticoelettorali.it.

In the period indicated above we collected 825 polls focused on voting intentions for the next general elections. As mentioned, the results of these polls were disseminated by the mass media, and their publication is governed by rules that require a methodological note presenting information useful for assessing the correctness of the polls carried out by the various agencies (Di Franco 2018).

The Italian regulation on the publication and dissemination of electoral polls in the mass media lists the information that must compulsorily be included in the document published on the institutional website. These are the fifteen items:

  1. Title of the poll;

  2. Subject who carried out the poll;

  3. Client;

  4. Buyer;

  5. Date or period in which the poll was carried out;

  6. Name of the mass media in which the poll is published or disseminated;

  7. Date of publication or diffusion;

  8. Topics covered by the poll;

  9. Reference population;

  10. Territorial extension of the poll;

  11. Sampling method;

  12. Representativeness of the sample, including indication of sampling error;

  13. Method of collecting information;

  14. Sample size, number and percentage of non-respondents and replacements made;

  15. Full text of all questions and percentage of people who answered each.

From our analysis it emerged that many documents show important gaps with respect to what is required by current legislation, especially regarding purely methodological information.

To assess the overall quality of the documents, we developed a completeness index of the poll information by summing the presence of the following six elements, on which we identified the most critical issues:

  1. The proportions between the breakdown of interviews conducted with mixed interview methods;

  2. The confidence interval for the estimates;

  3. The number of subjects contacted;

  4. The number of refusals and replacements for the interviews carried out;

  5. The percentage of undecided respondents;

  6. The percentage of voters or abstainers.

We coded each of the six elements with the value one when present and zero when absent. We then normalized the index by dividing the sum by the number of elements, obtaining values in the range between zero (absence of all six elements) and one (all six elements present).
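A minimal sketch of how such an index can be computed, assuming a hypothetical data frame of 0/1 presence codes; the three rows and the column names are invented for illustration.

```python
# Computing ind-1: six 0/1 presence codes summed and divided by six.
# The three polls below are invented; column names are ours.
import pandas as pd

polls = pd.DataFrame({
    "mixed_breakdown":       [1, 0, 1],
    "conf_interval":         [1, 1, 1],
    "n_contacted":           [0, 0, 1],
    "refusals_replacements": [0, 0, 1],
    "pct_undecided":         [1, 0, 1],
    "pct_abstainers":        [1, 0, 1],
})

elements = list(polls.columns)
polls["ind_1"] = polls[elements].sum(axis=1) / len(elements)  # range [0, 1]
print(polls["ind_1"].round(2))   # 0.67, 0.17, 1.00
```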

The minimum value found on the completeness index (labelled ind-1) is .17 (i.e. only one of the six elements present); the maximum of 1 is recorded in only 9 polls out of 825. The mean score is .56 and the standard deviation .18. Most of the polls analyzed (58.5%) obtain a value on ind-1 equal to or less than .5.

By analyzing the mean values of ind-1 across interview techniques we can see where the most critical issues arise.

Let us first analyze the polls conducted with a single data collection technique. When the interviews are carried out on a panel of respondents, the average ind-1 is .69; polls carried out with the CATI (computer-assisted telephone interviewing) technique have an average ind-1 of .63; polls carried out with the CAWI (computer-assisted web interviewing) technique have an average ind-1 of .33.

When the polls use mixed data collection techniques, the average values of ind-1 are: .65 for CATI-CAWI; .60 for CATI-CAMI (computer-assisted mobile interviewing); .49 for CATI-CAMI-CAWI.

We computed the eta coefficient to quantify the strength of the association between ind-1 and the interviewing technique. The value obtained (.641) indicates a strong association between the two variables.
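For readers who want to reproduce this kind of measure, here is a sketch of the eta (correlation ratio) coefficient, computed as the square root of the between-group sum of squares over the total sum of squares; the six observations are invented.

```python
# Eta (correlation ratio): square root of the between-group sum of
# squares over the total sum of squares. Values below are invented.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "technique": ["CATI", "CATI", "CAWI", "CAWI", "panel", "panel"],
    "ind_1":     [0.67,   0.50,   0.33,   0.17,   0.83,    0.67],
})

grand_mean = df["ind_1"].mean()
groups = df.groupby("technique")["ind_1"]
ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
ss_total = ((df["ind_1"] - grand_mean) ** 2).sum()
eta = np.sqrt(ss_between / ss_total)
print(round(eta, 3))
```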

By examining the information notes of each poll, we considered all the other information provided with respect to the current legislation, focusing our attention on elements concerning aspects that are very important for evaluating the results of a poll: the days on which it was carried out, the sample size, the sampling error, the confidence interval, the number and percentage of unavailable subjects, non-respondents and replacements made, and the full text of all questions together with the percentage of interviewees who answered each of them.

Following the descriptive analysis, we found a second serious gap in the methodological notes, concerning the information on the number of contacts and the number of refusals: 128 of the 825 published polls (15.5%) do not report this information. To take this important information into account, we designed a second index (ind-2), defined as the ratio between the number of people contacted and the number of interviews carried out. Thanks to this index, we can evaluate for each poll how many subjects had to be contacted to obtain one valid interview. The mean value of ind-2 is 5.313 (standard deviation 3.768), which indicates that to carry out a valid interview it was necessary to contact just over five subjects. In other words, on average more than four refusals were registered for each interview carried out.
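A companion sketch for ind-2, again on invented numbers, including the per-technique means discussed just below.

```python
# Computing ind-2 (contacts per completed interview) and its mean by
# data collection technique; the four rows are invented.
import pandas as pd

df = pd.DataFrame({
    "technique":    ["CATI", "CATI", "panel", "CATI-CAWI"],
    "n_contacts":   [5200,   6100,   1180,    9500],
    "n_interviews": [1000,   1200,   1000,    1000],
})

df["ind_2"] = df["n_contacts"] / df["n_interviews"]
print(df.groupby("technique")["ind_2"].mean())
```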

By analyzing the mean values of ind-2 across interview techniques we can see where the most critical issues arise.

Let us first analyze the polls conducted with a single data collection technique. When the interviews are carried out on a panel of respondents, the average ind-2 is 1.178; polls carried out with CATI have an average ind-2 of 5.23; almost all polls carried out with the CAWI technique do not provide this information. The problem clearly concerns, in particular, the CAWI data collection technique, suggesting that the institutes using it do not in fact carry out probability sampling but rely on some form of convenience selection, if not the self-selection of subjects who frequent the web.

When the polls use mixed data collection techniques, the average values of ind-2 are: 9.494 for CATI-CAWI; 5.119 for CATI-CAMI; 5.690 for CATI-CAMI-CAWI.

We computed the eta coefficient to quantify the strength of the association between ind-2 and the interviewing technique. The value obtained (.647) indicates a strong association between the two variables.

By examining the differences between the ind-2 values, a significant effect of the data collection technique on the ratio between the number of contacts and the number of interviews carried out can be found. Undoubtedly, the polls that resort to web and mobile interviews have significantly higher ratios than those conducted only with CATI and those that resort to the combination of CATI and CAWI. Polls conducted on a panel are an exception: because the sample is composed of subjects who agree to be interviewed repeatedly over time, they have very low values on ind-2.

In Table 1 we show the descriptive statistics for the following variables: duration of the survey in days (label days), sample size (n-sample), sampling error (error), number of subjects contacted (n. contacts), number of subjects who refused the interview (n. of refusals), value recorded on ind-1 (ind-1), value recorded on ind-2 (ind-2), and percentage of respondents who declared their intention not to vote or who declared themselves undecided (no-vot).

Table 1 Descriptive statistics of the main variables available in the data matrix

On average, the polls analyzed were carried out in just over two days (mean 2.6 days; standard deviation 1.966; minimum 1; maximum 25).

The sample sizes vary in a range from 500 to 16,000 cases; the average is 1243.62, the standard deviation is 779.537.

Linked to the sample size is the level of sampling error, which in the analyzed polls varies between 1.3% and 4.4%; the average error is 3%.

Finally, with regard to the requirement to report the number and percentage of subjects who did not answer the poll questions, our analysis considered only the question on voting intentions, whose wording is: “if you voted today [or, if you had voted yesterday] for the Chamber of Deputies, which party would you vote for [or, would you have voted for]?”. For this question we took into consideration the presence of the percentages of the undecided and of those who intend to abstain from voting.

In 28.36% of the polls (234 cases), neither the percentage of the undecided nor that of abstainers was reported.

4 Results and discussion

The first example consists of a comparison between a multiple linear regression model and a Multilayer Perceptron ANN.

We first present the results of the multiple linear regression. The dependent variable is the percentage of voters who declared their intention to abstain or their indecision regarding the electoral choice (label 'no-vot'). The four independent variables are: the duration of the poll in days ('days'); the sample size ('n-sample'); the completeness index of the poll information ('ind-1'); and the ratio between interview attempts and interviews carried out ('ind-2').
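A hedged sketch of such a model with statsmodels; since the actual matrix is not reproduced here, the data are synthetic stand-ins, and the column names (with underscores instead of hyphens) and coefficients are ours, mimicking only the structure described in the text.

```python
# Fitting the regression with statsmodels on a synthetic stand-in for
# the poll matrix; column names (with underscores) and coefficients are
# ours and only mimic the structure described in the text.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 506
polls = pd.DataFrame({
    "days":     rng.integers(1, 26, n),
    "n_sample": rng.integers(500, 16001, n),
    "ind_1":    rng.uniform(0.17, 1.0, n),
    "ind_2":    rng.uniform(1.0, 20.0, n),
})
polls["no_vot"] = (30 + 10 * polls["ind_1"] + 0.001 * polls["n_sample"]
                   - 0.5 * polls["ind_2"] - 0.3 * polls["days"]
                   + rng.normal(scale=5, size=n))

model = smf.ols("no_vot ~ days + n_sample + ind_1 + ind_2", data=polls).fit()
print(model.summary())        # coefficients, R-square, residual diagnostics
print(model.rsquared_adj)     # adjusted R-square
```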

Table 2 presents the fitting results of the multiple regression model. Considering the adjusted R-square, we find that the four independent variables reproduce a little less than a third (31.1%) of the variance of the dependent variable. Table 3 shows the regression coefficients and Table 4 the residual statistics.

Table 2 Multiple regression model summary
Table 3 Multiple regression coefficients
Table 4 Multiple regression residual statistics

The analysis of the beta weights confirms that each of the four independent variables contributes significantly to explaining the variance of the dependent one. Of the four, ind-1 (.348) and n-sample (.284) have positive beta weights; ind-2 (−.226) and days (−.105) have negative beta weights.

In other words, the percentage of non-voters (the dependent variable) is directly proportional to the completeness of the poll information and to the poll sample size, and inversely proportional to the ratio between attempts and completed interviews and to the duration in days of the poll.

The analysis of the residual statistics also shows a good fit of the model to the data (Table 4).

Let us now evaluate the results obtained with the ANN, comparing them with those obtained with the multiple linear regression (Fig. 1).

Fig. 1 The architecture of the ANN

The cases submitted to the network are of course the same 506 used in the regression. In this case, however, 70% of the cases (359) were used as the training set and the remaining 30% (147) as the testing set. Table 5 presents the model summary. In the training set the relative error was .225; in the testing set it grows slightly, reaching .327. Recall that in the testing set the network predicts the value of the dependent variable using the weights computed on the cases observed during training; in essence, we assess the network's ability to generalize what it has learned in training.
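A sketch of the corresponding ANN with scikit-learn's MLPRegressor, continuing the synthetic polls frame from the regression sketch above. The "relative error" computed at the end follows one common definition (sum of squared errors over the sum of squared deviations from the mean), which we assume matches the one reported; the last line computes the predicted/actual correlation discussed further below.

```python
# The ANN counterpart with scikit-learn's MLPRegressor, continuing the
# synthetic polls frame from the regression sketch; the relative error
# here is SSE over the sum of squared deviations from the mean.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

X = polls[["days", "n_sample", "ind_1", "ind_2"]].values
y = polls["no_vot"].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
scaler = StandardScaler().fit(X_tr)

net = MLPRegressor(hidden_layer_sizes=(4,), max_iter=5000, random_state=0)
net.fit(scaler.transform(X_tr), y_tr)

pred = net.predict(scaler.transform(X_te))
rel_error = ((y_te - pred) ** 2).sum() / ((y_te - y_te.mean()) ** 2).sum()
print(rel_error)                        # testing-set relative error
print(np.corrcoef(pred, y_te)[0, 1])    # correlation predicted vs. actual
```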

Table 5 ANN model summary

We do not report the parameter estimates (i.e. the weights calculated for each node of the network) as their examination does not clarify the impact of each independent variable in the estimate of the dependent one.

The comparison between the results of the multiple regression and those of the ANN leaves no doubt about the better predictive performance of the network (Table 6). The correlation between the values predicted by the multiple regression and the actual values of the dependent variable is .563; the correlation between the values predicted by the ANN and the actual values is thirty points higher, rising to .866.

Table 6 Correlations between predicted values of regression and ANN and values of dependent variable

Evidently, in the relationship between the independent variables and the dependent one, the network managed to capture nonlinear trends that allow a better estimate of the values.

In the second example we compare a binary logistic regression model with a network. The data come from the same matrix used in the previous example. The dependent variable is in this case a dichotomy representing the data collection method used in the poll (label 'met'): the first category comprises polls that use a single technique (CATI, CAWI or panel); the second comprises polls carried out with more than one technique (e.g. CATI and CAWI; CATI, CAMI and CAWI; etc.). The independent variables are the same four used in the first example, plus the sampling error (label 'err') and the percentage of non-voters (label 'no-vot').
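A sketch of the binary logistic model in the same running synthetic example; the 'met' and 'err' columns are generated here purely for illustration, so the estimates are not those of the study.

```python
# Binary logistic model on the running synthetic example; "err" and the
# 0/1 flag "met" (single vs. mixed technique) are generated here purely
# for illustration.
import statsmodels.formula.api as smf

polls["err"] = rng.uniform(1.3, 4.4, n)
polls["met"] = (polls["ind_2"] + rng.normal(size=n) > 10).astype(int)

logit = smf.logit("met ~ days + n_sample + ind_1 + ind_2 + err + no_vot",
                  data=polls).fit()
print(logit.summary())      # coefficients and their significance
print(logit.prsquared)      # McFadden pseudo R-square (one of several variants)
```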

Table 7 reports the coefficients of the logistic regression. In this case too, the coefficients of all six independent variables are significant.

Table 7 Binary logistic regression coefficients

Table 8 presents the fitting results of the binary logistic regression model, which express the statistical significance of the model in estimating the dependent variable.

Table 8 Binary logistic regression model summary

The values of the pseudo R-square (.527 for Cox & Snell and .772 for Nagelkerke) show the good fit of the model in reproducing the dependent variable.

Table 9 presents the classification table, which allows us to evaluate in a simple way the predictive power of the logistic regression model. Overall, the model registers 95.3% correct classifications; by category of the dependent variable, the first registers 93% and the second 96.1%.

Table 9 Binary logistic regression classification table

We now consider the results obtained with the ANN by comparing them with those obtained with the binary logistic regression (Fig. 2).

Fig. 2 The architecture of the ANN

The network architecture in this example consists of an input layer of six nodes (one for each independent variable), a hidden layer of six nodes and an output layer of two nodes (one for each of the two categories of the dependent variable).
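A possible rendering of this architecture on the running synthetic example; note that scikit-learn uses a single output unit for a binary target rather than two, a minor implementation difference from the architecture just described.

```python
# A 6-input, 6-hidden-unit network for the binary target; scikit-learn
# uses a single output unit for binary classification rather than two,
# a minor implementation difference from the architecture in Figure 2.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

features = ["days", "n_sample", "ind_1", "ind_2", "err", "no_vot"]
X_tr, X_te, y_tr, y_te = train_test_split(
    polls[features].values, polls["met"].values,
    test_size=0.30, random_state=0)
scaler = StandardScaler().fit(X_tr)

clf = MLPClassifier(hidden_layer_sizes=(6,), max_iter=5000, random_state=0)
clf.fit(scaler.transform(X_tr), y_tr)
print(clf.score(scaler.transform(X_tr), y_tr))   # training accuracy
print(clf.score(scaler.transform(X_te), y_te))   # testing accuracy
```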

The comparison between the results of the binary logistic regression model and the ANN leaves no doubt about the better predictive performance of the network (Table 10). The network correctly classifies 99.4% of the polls used in the training set and 98.1% of the polls used in the testing set.

Table 10 ANN model summary

Examining the network's results in detail (Table 11): in the training set, 97.8% of the first category and 100% of the second category of the dependent variable are correctly classified; in the testing set, 97.5% of the first category and 98.3% of the second are correctly classified.

Table 11 ANN classification table

In the third example we compare a multinomial logistic regression model with an ANN. The dependent variable is this time a polytomous variable with five categories representing the interviewing technique used in the polls: 1 = CATI, 2 = CATI-CAMI, 3 = CATI-CAMI-CAWI, 4 = CATI-CAWI, 5 = panel. The independent variables are the same as in the previous example.
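A sketch of the multinomial model on the running synthetic example; the five-category 'technique' column is drawn at random here, so the fit it yields is not comparable to the one reported below.

```python
# Multinomial logistic model on the running synthetic example; the
# five-category "technique" column is drawn at random here, so the
# resulting fit is not comparable to the one reported in the text.
from sklearn.linear_model import LogisticRegression

features = ["days", "n_sample", "ind_1", "ind_2", "err", "no_vot"]
polls["technique"] = rng.integers(1, 6, n)   # 1=CATI ... 5=panel

mlogit = LogisticRegression(max_iter=5000)   # multinomial for >2 classes
mlogit.fit(polls[features], polls["technique"])
print(mlogit.score(polls[features], polls["technique"]))  # share correct
```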

Tables 12 and 13 present the goodness of fit statistics of the multinomial logistic regression model; Table 14 shows the values of the pseudo R2, which in this case are very good, demonstrating the excellent fit of the model to the data.

Table 12 Multinomial logistic regression model fitting information
Table 13 Multinomial logistic regression model goodness of fit statistics
Table 14 Multinomial logistic regression model pseudo R-square statistics

As is known (Di Franco 2017), the model produces a different parameter for each independent variable for each category of the dependent one, except the last category, which is set as the reference category (in our example, 'panel'). We omit the parameter estimates.

Table 15 reports the classification table of the multinomial logistic regression model. Overall, the model correctly reproduces 92.3% of the cases. Considering the single categories of the dependent variable, the best performance is obtained with the panel category (100% correct classifications) and with the CATI-CAMI-CAWI category (96.1%). For the other three categories (CATI, CATI-CAMI and CATI-CAWI), the percentage of correct classifications varies from 80.4% to 87%.

Table 15 Multinomial logistic regression classification table

Let us now see the results of the ANN, applied of course to the same data and variables used for the multinomial logistic regression model. Figure 3 shows the architecture of the network, which is composed of six input nodes, four hidden nodes on a single layer, and five output nodes (a code sketch follows the figure).

Fig. 3 The architecture of the ANN
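The corresponding network can be sketched as before, now with four hidden units and five output classes, again on the synthetic data.

```python
# The same classifier pattern as before, now with four hidden units and
# five output classes, mirroring the 6-4-5 architecture of Figure 3.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

X_tr, X_te, y_tr, y_te = train_test_split(
    polls[features].values, polls["technique"].values,
    test_size=0.30, random_state=0)
scaler = StandardScaler().fit(X_tr)

net3 = MLPClassifier(hidden_layer_sizes=(4,), max_iter=5000, random_state=0)
net3.fit(scaler.transform(X_tr), y_tr)
print(net3.score(scaler.transform(X_tr), y_tr))  # training classifications
print(net3.score(scaler.transform(X_te), y_te))  # testing classifications
```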

Table 16 shows the ANN model summary. As usual, the data were divided into training set (70%) and testing set (30%).

Table 16 ANN model summary

Finally, Table 17 reports the classification table obtained with ANN in the training set and in the testing set.

Table 17 ANN classification table

In the third example, too, the ANN results are better than those of the multinomial logistic regression model. Overall, considering the training results, the ANN reaches 94.5% correct classifications against 92.3% for the multinomial model. Even the ANN testing results are, albeit slightly, better (93.2%). As for the single categories of the dependent variable, the ANN achieves its best performance with the panel category (100% correct classifications in both training and testing sets) and with the CATI-CAMI-CAWI category (96.9% for the training set and 94.1% for the testing set).

For the other three categories of the dependent variable (CATI, CATI-CAMI and CATI-CAWI), the percentage of correct classifications varies from 86.5% to 91.2% in the training set and from 82.6% to 93.1% in the testing set.

5 Conclusions

At the end of this excursus on feedforward ANNs we can summarize the most important aspects by highlighting their strengths and weaknesses.

As the phenomenon of generalization demonstrates, ANNs are capable of learning; that is, they allow problems to be solved by associating the sought solution with the data. Indeed, network learning techniques are applications of known statistical methods (stochastic approximation) to a new class of nonlinear regression models. In this sense the determination of the network weights can be interpreted as a nonlinear regression applied to an ANN function. The advantage is having an extremely flexible function that avoids the subjective components of specification error, since the parameters implicitly determine the latent function that the network approximates.

If the analytical form of the function underlying the problem under study is known, or can be assimilated to a known form, parameter estimation reduces to nonlinear least squares and the use of ANNs is not justified; it becomes justified when one cannot formulate reliable conjectures about that form. In this case, the use of networks is easier and more productive than other complex procedures resting on restrictive assumptions. ANNs are therefore effective as a tool for identifying hidden nonlinear relationships.

The ability to learn is related to the ability to forecast. ANNs perform well both in univariate forecasting, when one wants to predict the behaviour of a variable of a system evolving over time on the basis of its past trend, and in multivariate forecasting, when one tries to predict the trend of a variable by observing the past behaviour of several variables of the evolving system. Many studies have highlighted how ANNs allow good approximations and extrapolations. Since a forecasting problem can be recast as a problem of approximation and extrapolation, networks can be used to approximate the regularities present in the variations over time of the variable to be predicted. ANNs adapt flexibly to complex situations that change over time: directly, if learning is unsupervised; by re-training, if learning is supervised. They are also suitable for processing data that are incomplete or affected by noise or biases. By virtue of this ability to adapt to the data, ANNs are very robust, that is, highly resistant to failures and malfunctions. Another important feature is their computational speed, which derives from their parallelism and from the very rapid input-output association, since the computations to be performed are weighted sums and threshold selections; they therefore constitute a valid alternative to traditional techniques for performing complex computations.

Obviously, ANNs are not magical boxes. As we have made clear, ANNs can achieve better performance than linear methods if there are nonlinearities and interactions in the input data. It should be kept in mind that the mere availability of data does not mean that there are underlying rules that can be learned. ANNs offer an approach to analysis that is data-intensive and exploratory; the focus of these methods is on computational efficiency, not modelling. Of course, the results will not be good unless the variables are. As the old adage of computer science goes: "garbage in, garbage out".

The critical points of ANNs are, first of all, the slow and scarcely incremental learning: besides requiring a large number of epochs before the error is significantly reduced, learning must be repeated when the situation represented by the patterns undergoes substantial changes, unless learning is continuous or unsupervised.

Obviously for ANNs too, as in any other case, it is necessary to have a data set that is rich and representative of the problem under study, so that the training set and the testing set are effectively controllable.

Other problems may arise from the low accuracy and the uncertain reliability of the results provided by ANNs: the past performance of a network does not guarantee its future performance. There is a risk that the generalization is incomplete, so that many inputs do not recall correct outputs. Furthermore, there are no strict criteria for designing the network most suitable for a given problem; one must proceed by trial and error with, as mentioned, numerous degrees of freedom in the choice of each parameter. Moreover, each network has its own specificity: if the same experiment is repeated on another network, the results will not be identical, although in most cases they tend to converge. This is another interesting feature of ANNs: they are able to provide similar performance with a variety of weight settings. Clearly, what matters is not the value of a single weight but the overall set of all connection weights.

Finally, the criticism most frequently raised against the usefulness of ANNs is that, even when they succeed in the assigned task, they do not allow us to explain their operation on a cognitive level (in sociological research, we could say on the level of the analysis of relationships between variables). We expect from a model not only that it predicts or reproduces its referent, but also that it is transparent, that is, that it makes us understand how it works and what mechanisms, processes and principles lie behind it. ANNs, according to this criticism, risk achieving the first goal but not the second. A network that has learned a certain task and can extend its performance to new situations, showing that it has incorporated the mechanisms and principles underlying that task, may nevertheless remain opaque as to those mechanisms and principles, failing to make them emerge clearly and thus not allowing a full explanation of the phenomenon in question. Their strictly quantitative nature, the interweaving of links, the connection weights, and the effects of a local activation on the rest of the network are all factors that make the behaviour of networks obscure as tools for explaining relationships between variables.