1 Introduction

Instagram, WhatsApp, Facebook and Twitter have become part of everyday life [25]. They are tools of participation, born of changes in the development and spread of the World Wide Web. They transformed the flow of communication from “one to many” to “many to many”, in which stakeholders (users, companies, institutions, etc.) can interact with each other by exchanging a huge amount of content [15]. This content grows in a heterogeneous environment made of digital tools and can take different formats, such as sentences, words, images, pictures, videos and numbers. From this content it is possible to derive information, which leads to new challenges in terms of analysis, management and comprehension. In this view, information is the final result of a complex process: starting from the single informative unit (e.g. a Facebook Like, a Twitter hashtag), a process of research, data acquisition and extraction of value takes place.

Among the alternatives used to express a reaction to social media content there is the Like, which is common to different platforms such as YouTube, Google, Facebook, Instagram, Twitter and Tumblr. These platforms offer companies business opportunities: by reading the preferences of the users, companies can listen to their audience and shape products and services accordingly. They make it possible to profile users and to generate new markets while leading the existing ones [16, 20].

The Like harmonizes comments and supports the use of sentiment analysis tools by providing a simple synthesis [23]. Facebook uses this evaluation mechanism to give users a simple measure of popularity, indicating the number of people who responded positively to shared information or posts [11]. The Like system measures the interaction between brand and service, user and consumer.

The Like presents some features that complicate its treatment: the lack of a temporal reference for when it was given; its permanence ad infinitum; its concentration in some categories, such as music and sport; and the awareness that it was not always given after visiting similar pages. These features distinguish its use from that of survey research. In a survey, in order to understand whether a product or a service is appreciated, stakeholders are asked for their point of view by selecting a representative sample and distributing a questionnaire for data collection [19]. The survey questions admit a positive answer (Like), a negative one (Dislike) or a lack of knowledge of the product. With Likes, by contrast, it is not possible to size the target universe of respondents, which calls for caution in the management of missing values. The main hypothesis is that all users have had access to the information and were potentially exposed to the event.

Each user can access a page at different times, giving a Like on the first or the last visit. There is no information about the time at which the Like was placed, and the page could have changed after the Like. It is not possible to know whether the evaluation was given “ceteris paribus”, after visiting different pages about the same topic. From the data point of view, the resulting data matrix is sparse [2].

If the presence of a Like represents an appreciation, that is, a clear behaviour, some considerations are needed about the absence of a Like. In statistical analysis, the treatment of missing data is a relevant problem. The quality of prediction or classification models depends on the presence or absence of missing values and on how they are treated. In the pre-processing phase, it is necessary to understand the meaning of the absence of data. This could be related to an effective lack of knowledge, or it could carry information about a particular category of subjects. The pattern has to be detected by defining the structure of the missing data, the observations and the mechanism that relates them [17].

But what happens in the presence of such a huge quantity of missing data? The first option is to attribute a zero value to missing data. In this way, the zero is defined as a modality that could be interpreted as a Dislike.

However, there is no indication that a single user also visited the other pages, and consequently the generalisation of missing data into a Dislike is not plausible. The zero value, potentially considered a Dislike, could instead be a Nothing, that is, a total lack of awareness on the part of a user who does not know the specific social media page. In other words, if the \(n\)-th user does not give a Like to a given social media page, the missing value can be separated into two distinct cases: “I know the page but I do not give a Like because I am not interested” or “I do not know the social media page”. This leads to a substantial difference in the interpretation of the missing response associated with each page.

Several works have inspected the impact of Facebook in social media analysis: some focused on users’ features [4], others on the role of the platform in social interactions [6], and others on the growing interest in Facebook as a segmentation tool for detecting users’ behaviour [2]. A Like represents a quantitative way to express a reaction to content [3]. The contribution of this study is manifold: to give a practical strategy to extract more knowledge from social media data; to reduce, through disambiguation, the presence of missing data, overcoming the problem of a sparse matrix; and to propose a pre-processing technique that limits the noise and increases the signal in social media data [10, 12, 23].

The rest of the paper is organized as follows. Section 2 introduces the methodology based on the generation mechanism of missing data. Section 3 describes the proposed approach to discern missing values from negative opinions. Section 4 presents some preliminary results. Finally, Sect. 5 is reserved for discussion and final remarks.

2 The generation mechanism of missing data

Missing values are usually treated as a modality, but another important issue is the causality of the missing values. If it is possible to detect an association between these values, then the presence of causality in missing values can be hypothesised. The statistical literature on missing value analysis is recent. Many methods have been proposed for imputation, but only a few have investigated the generation mechanism. Little and Rubin made a major contribution to this topic [17].

Let \(Y = (y_{ij})\) be the complete dataset, where i refers to the observations and j to the variables, and let \(M = (m_{ij})\) be the indicator matrix, which takes the value 1 if \(y_{ij}\) is missing and 0 otherwise. Y depends on a parameter vector \(\theta \), while M depends on a parameter vector \(\psi \) describing the relationship between M and Y. Let \(Y_{obs}\) denote the data effectively observed and \(Y_{mis}\) the missing component. The nature of the missing values is characterised by the conditional distribution of M given Y, that is \(f(M|Y, \psi )\).
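
As a minimal illustration of this notation, the indicator matrix M can be derived from a small dataset in which missing entries are encoded as `None`. The toy values and the helper name `missing_indicator` are assumptions introduced for the example and are not part of the original study.

```python
def missing_indicator(Y):
    """Return M with m_ij = 1 where y_ij is missing (None) and 0 otherwise."""
    return [[1 if y is None else 0 for y in row] for row in Y]


# Toy dataset (assumed values): 3 observations (rows) and 3 variables (columns).
Y = [
    [1.2, None, 3.0],
    [None, 0.5, 2.1],
    [4.4, 1.1, None],
]
M = missing_indicator(Y)

# Y_obs and Y_mis are the two parts of Y induced by M.
Y_obs = [(i, j, Y[i][j]) for i in range(len(Y)) for j in range(len(Y[i])) if M[i][j] == 0]
Y_mis = [(i, j) for i in range(len(Y)) for j in range(len(Y[i])) if M[i][j] == 1]

print(M)
```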

Little and Rubin defined three mechanisms of missing data generation:

  • Missing Completely At Random (MCAR);

  • Missing At Random (MAR);

  • Missing Not At Random (MNAR).

In the first case:

$$\begin{aligned} f\left( M|Y_{obs}, Y_{mis}, \psi \right) = f (M|\psi ) \quad \text{for each } Y, \psi , \end{aligned}$$

the lack of data in Y does not depend on the observed or unobserved values, but only on \(\psi \) (distinct from \(\theta \)). This means that the values do not help to understand why the data are missing. For example, if some questionnaires get lost by chance after a survey, the probability that a single questionnaire gets lost is equal for all the questionnaires. In the presence of a single variable, when data are MCAR:

$$\begin{aligned} \Pr \left( M_i=1 |y_i, \psi \right) = c \quad \text{for each } y_i, \psi \end{aligned}$$

To know whether the observed values can explain the cause of the missing values, it is generally necessary to resort to explanatory variables \(X_1,\ldots , X_k\) with completely observed values. When the probability that the target variable Y is missing does not depend on the values of X or Y, the data are MCAR. No explanatory variable explains why some values of Y are missing, nor is the variable Y itself capable of giving an explanation. There is no cause to look for, because data are missing completely at random.

In the second case:

$$\begin{aligned} f \left( M |Y_{obs}, Y_{mis}, \psi \right) = f \left( M |Y_{obs}, \psi \right) \quad \text{for each } Y_{mis}, \psi , \end{aligned}$$

missing observations depend on \(\psi \) and on the observed values. It is an intermediate case in which randomness holds within sub-groups. For example, if in a survey only data about income are missing and a correlation exists between the presence of missing data and the job of the respondents, then the profession influences the presence of missing values. It is possible to resort to a variable X and show that the probability that Y is missing depends on X. Unlike the previous case, the probability that Y is missing is not equal for all respondents, but it is equal within the sub-groups [8, 14].

In the third case, the distribution of M given Y also depends on \(Y_{mis}\). The parameters \(\theta \) and \(\psi \) are not distinct because M is conditional on both parameters. When data are MNAR, the probability that the target variable Y is missing does not depend on X, but on Y itself [1, 13, 21].
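
The three mechanisms can be contrasted with a small simulation inspired by the income example above. This is only a sketch under stated assumptions: the function names `simulate` and `missing_rate`, the two job labels, the income distribution and all the probabilities are hypothetical choices made for illustration.

```python
import random


def simulate(mechanism, n=20000, seed=0):
    """Draw (x, y, m) triples; m = 1 marks y as missing under the given mechanism."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.choice(["employee", "self-employed"])  # fully observed covariate X
        y = rng.gauss(30000, 8000) + (10000 if x == "self-employed" else 0)  # target Y (income)
        if mechanism == "MCAR":
            p = 0.2                                    # constant c, same for everyone
        elif mechanism == "MAR":
            p = 0.5 if x == "self-employed" else 0.1   # depends on the observed X only
        elif mechanism == "MNAR":
            p = 0.5 if y > 35000 else 0.1              # depends on the value of Y itself
        else:
            raise ValueError("unknown mechanism: " + mechanism)
        rows.append((x, y, 1 if rng.random() < p else 0))
    return rows


def missing_rate(rows, keep):
    """Share of missing Y among the rows selected by the predicate keep(x, y)."""
    selected = [m for x, y, m in rows if keep(x, y)]
    return sum(selected) / len(selected)


for mech in ("MCAR", "MAR", "MNAR"):
    rows = simulate(mech)
    by_job = missing_rate(rows, lambda x, y: x == "self-employed")
    by_income = missing_rate(rows, lambda x, y: y > 35000)
    print(mech, round(by_job, 3), round(by_income, 3))
```

Under MCAR the missing rate is about the same in every sub-group; under MAR it differs between jobs but is constant within each job; under MNAR it changes with the (possibly unobserved) income itself.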

Little and Rubin also proved that the generation mechanism can be ignored only if data are MCAR or MAR, in which case the specification of a model for the mechanism is not required to obtain valid inferences for \(\theta \); if data are MNAR, instead, the generation mechanism cannot be ignored. To understand this mechanism of missing values, they introduced the Little test, which applies only to quantitative variables [17].

After the identification of the missing data generation mechanism, it is necessary to choose the method of treating these missing observations. The statistical literature for the incomplete dataset analysis offers different solutions to the issue of missing values [7, 9, 22, 24]. These methods could be divided into three macro-classes:

  • deletion methods;

  • single imputation methods;

  • multiple imputation methods.

The deletion methods are the simplest solution: the missing values are deleted (case deletion). However, they are valid only if the data generation mechanism is MCAR. In particular, there are two deletion methods: listwise deletion and pairwise deletion. The imputation methods substitute each missing value with a plausible one to obtain a complete data matrix. This is an advantageous but dangerous solution [5].
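
The difference between the two deletion methods can be sketched on a toy dataset. The helper names `listwise_deletion` and `pairwise_cases` and the data are assumptions made for the example: listwise deletion keeps only fully observed cases, whereas pairwise deletion keeps, for each pair of variables, every case observed on both.

```python
def listwise_deletion(rows):
    """Keep only the cases (rows) with no missing value on any variable."""
    return [r for r in rows if all(v is not None for v in r)]


def pairwise_cases(rows, j, k):
    """Keep the cases observed on both variables j and k, whatever the other variables."""
    return [(r[j], r[k]) for r in rows if r[j] is not None and r[k] is not None]


# Toy dataset (assumed values): 4 cases, 3 variables, None marks a missing value.
rows = [
    (1.0, 2.0, 3.0),
    (1.5, None, 2.5),
    (None, 1.0, 4.0),
    (2.0, 3.0, None),
]

complete = listwise_deletion(rows)    # only the first case survives
pair_01 = pairwise_cases(rows, 0, 1)  # cases observed on variables 0 and 1
pair_02 = pairwise_cases(rows, 0, 2)  # cases observed on variables 0 and 2

print(len(complete), len(pair_01), len(pair_02))
```

Pairwise deletion retains more information, but each statistic is then computed on a different subset of cases.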

These methods are solutions for MCAR and MAR data. The biggest disadvantage of this approach is that the parameter estimates are biased if data are MCAR. This happens because this imputation does not take into account the uncertainty component underestimating the variability of the estimates.

Multiple imputation is a technique that imputes a set of plausible values for each missing observation; it therefore takes the uncertainty into consideration, leading to valid inferential results. The idea is to impute \(m\ge 2\) values for each missing datum in order to obtain m complete datasets.

3 A proposal for missing imputation in social data

The methods presented above are applicable when data are MCAR and MAR, that is, when the hypothesis of random missingness is tenable, but not for MNAR data. If the cause of missingness is known, there are no problems in the imputation phase. When the cause is unknown, a possible solution is to treat the missing values as a modality. It is accepted that data are missing for an unknown cause, without identifying it and recognizing its existence.

In this section, a new approach to discern a missing value from a behaviour is proposed. It is possible to hypothesise different situations: on the one hand, a missing Like could mean that the content of the page was not appreciated and, if possible, the user would express a Dislike; on the other hand, the Like is missing because the user did not know of the existence of the page. Dislike expresses a negative position because the user chooses not to give a Like. Nothing conveys a neutral opinion or a lack of knowledge.

The imputation method is based on the following hypothesis: when the number of Likes grows for a category, the probability that a not-observed value corresponds to a missing value decreases. For each category, it is necessary to compare \(q_i\) and b.

\(q_i\) represents the percentage of Liked pages for the user i in a given category of s alternatives:

$$\begin{aligned} q_i = \frac{\#\,\text{Likes of user}_i}{s} \quad \text{for each } i=1,2,\ldots ,n \end{aligned}$$

It takes values between 0 and 1, where 0 means that the user is not interested in that category, because there is no Like to any page of that category, and 1 indicates that the category is well known by the user, because he/she likes all the pages in that category.

b is a weighted average:

$$\begin{aligned} b = \frac{\sum _{s=1}^{S} p_s \times frequency_s }{\sum _{s=1}^{S} frequency_s} \end{aligned}$$

where \(p_s\) is the proportion of Likes in a group of s alternatives and \(frequency_s\) is the number of times that \(p_s\) is repeated, that is, the number of users showing that proportion. The b value represents a threshold to discern the missing values, and three scenarios are possible:

  • if \(q_i > b\), the empirical evidence suggests that user i is interested in that category, so he/she probably knows the other pages of the category; the missing value is therefore considered a behaviour and imputed as a Dislike;

  • if \(q_i \le b\), the user i is classified as not interested in that category. Since it is unlikely that the user knows the other pages of the category, the missing value is imputed as a Nothing;

  • if \(q_i \approx b\), more information is necessary to solve the uncertainty.
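
The three scenarios above can be sketched on a toy Like matrix. Several choices here are assumptions rather than part of the method as stated: the names `threshold_b` and `classify_missing`, the tolerance `tol` used to decide when \(q_i \approx b\), the labels for users with no Like (treated as not imputable) or with all Likes (nothing to impute), and the toy data.

```python
from collections import Counter


def threshold_b(likes, s):
    """Weighted average of the proportions k/s, weighted by the number of users with k Likes (k >= 1)."""
    freq = Counter(k for k in likes if k > 0)
    return sum((k / s) * f for k, f in freq.items()) / sum(freq.values())


def classify_missing(k, s, b, tol=0.0):
    """Label the missing Likes of a user who gave k Likes out of s pages."""
    if k == 0:
        return "not imputable"  # no Like in the category (assumed handling)
    if k == s:
        return "complete"       # nothing is missing for this user
    q = k / s
    if tol > 0 and abs(q - b) <= tol:
        return "uncertain"      # q_i close to b: more information is needed
    return "Dislike" if q > b else "Nothing"


# Toy 0/1 matrix (assumed values): 6 users (rows) and s = 3 pages (columns).
X = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 1],
]
s = 3
likes = [sum(row) for row in X]
b = threshold_b(likes, s)
labels = [classify_missing(k, s, b) for k in likes]

print(round(b, 3), labels)
```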

The value of the threshold b is sensitive to the width of the category: the bigger the category, the smaller the threshold. The problem of sensitivity to dimensionality is not new in this field. In MCA, the absolute contribution tends to be greater for variables with more modalities. The same situation is present in Conjoint Analysis for the computation of the importance index, which is sensitive to the number of attributes [18].

To explain this step, it is necessary to build a matrix of distances between the modalities, considering the presence and the absence of a Like for each page. For each category, the two closest modalities are joined and the first threshold is calculated. Each time a modality is added, a new threshold is computed.
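
A possible reading of this stepwise procedure is sketched below. The text does not specify the distance measure or the rule used to add a modality, so both are assumptions here: the Hamming distance between the 0/1 columns of two pages, and the addition of the page closest to any page already in the group. The function names and the toy matrix are also hypothetical.

```python
def hamming(X, a, b):
    """Number of users whose Like status differs between pages a and b (assumed distance)."""
    return sum(1 for row in X if row[a] != row[b])


def threshold_b(X, pages):
    """Threshold b restricted to `pages`, over users with at least one Like among them."""
    counts = [sum(row[p] for p in pages) for row in X]
    active = [k for k in counts if k > 0]
    return sum(k / len(pages) for k in active) / len(active)


def stepwise_thresholds(X):
    """Join the two closest pages, then add one page at a time, recomputing b at each step."""
    s = len(X[0])
    pairs = [(hamming(X, a, b), a, b) for a in range(s) for b in range(a + 1, s)]
    _, first, second = min(pairs)
    group = [first, second]
    steps = [(list(group), threshold_b(X, group))]
    while len(group) < s:
        rest = [p for p in range(s) if p not in group]
        # Assumed rule: add the page with the smallest distance to any page in the group.
        nearest = min(rest, key=lambda p: min(hamming(X, p, g) for g in group))
        group.append(nearest)
        steps.append((list(group), threshold_b(X, group)))
    return steps


# Toy 0/1 matrix (assumed values): 6 users (rows) and 4 pages (columns).
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
]
steps = stepwise_thresholds(X)
for group, b in steps:
    print(group, round(b, 3))
```

On this toy matrix the threshold decreases as pages are added, but the sketch does not guarantee that behaviour for arbitrary data.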

Once the disambiguation of the missing values has been carried out, a new index of Informative Earning (IE) is introduced for each category according to the following formula:

$$\begin{aligned} IE = \frac{\sum _{i=1}^{n} Dislike_i}{\sum _{i=1}^{n} \left( Dislike_i + Nothing_i\right) } \times 100, \end{aligned}$$

where n is the width of the category. This index shows the percentage gain in terms of non-appreciation compared to the missing expression on such a category.
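
As a small numerical illustration of the index, the sketch below computes IE from per-page counts of imputed Dislike and Nothing values. The function name `informative_earning` and the counts are hypothetical and do not come from the dataset analysed in this paper.

```python
def informative_earning(dislike, nothing):
    """IE = 100 * sum(Dislike_i) / sum(Dislike_i + Nothing_i) over the pages of a category."""
    total_missing = sum(d + n for d, n in zip(dislike, nothing))
    return 100.0 * sum(dislike) / total_missing


# Hypothetical imputation counts for a category with three pages.
dislike = [10, 25, 40]
nothing = [90, 125, 210]
ie = informative_earning(dislike, nothing)  # 75 Dislike out of 500 missing values

print(ie)
```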

4 An application of missing Like on celebrities’ social media pages

This study refers to Italian social media users that gave at least a Like to a set of social media pages related to pharmaceutical companies and institutions of Public Health. It is a small subset that highlights the capability of this kind of data to capture aspects that would be difficult to detect with other data collection techniques. The statistical analysis is descriptive and does not refer to a probabilistic sample, because users are self-selected.

Cubeyou hunts the social web and all interactions between people and brands, products and services (shares, likes, tweets, pins, posts) on Facebook, Twitter, Google+, Pinterest and Instagram and classifies them. Data have been shared for research purposes only. Users and characters are anonymous and not identifiable, and no sensitive data have been processed. Results are published in aggregate form, and the focus is on the technique and its potential applications. The dataset is the outcome of an unsupervised extraction by the authors.

The analysed dataset contains 5651 rows and 19 columns: each row corresponds to a social media user and each column to a celebrity's social media page. Values in each column are 0 for the absence and 1 for the presence of a Like. The result is a sparse matrix in which \(14.66\%\) of the entries are 1 and \(85.34\%\) are 0; the 0 values can be considered missing values to disambiguate. The concept of sparsity is directly connected to density.
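
The link between sparsity and density can be made explicit with a short sketch: the density is the share of entries equal to 1 and the sparsity is its complement. The function names and the toy matrix are assumptions; only the matrix dimensions and the two percentages are taken from the text.

```python
def density(X):
    """Share of entries equal to 1 in a 0/1 matrix."""
    cells = sum(len(row) for row in X)
    return sum(sum(row) for row in X) / cells


def sparsity(X):
    """Share of entries equal to 0, i.e. the complement of the density."""
    return 1.0 - density(X)


# Toy 0/1 matrix (assumed values): 4 users and 5 pages, with 3 Likes in total.
X = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
print(density(X), sparsity(X))

# Figures reported for the analysed dataset.
n_cells = 5651 * 19            # total number of user-page entries
reported_density = 14.66       # percentage of 1 values
reported_sparsity = 85.34      # percentage of 0 values
print(n_cells, reported_density + reported_sparsity)
```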

Table 1 displays the row totals: 1288 users did not give any Like, 1102 gave a single Like and 2 users gave a Like to 17 of the 19 pages; no user gave a Like to all the social media pages. Even if in different terms, there is a strong disproportion between the presence and the absence of a Like. A high quantity of missing values is a very common situation in the presence of social network data.

Table 1 Sum of Likes for user, Pharma, Italy, 2015

The 19 celebrity social media pages have been divided into 5 categories according to common factors: 3 celebrities are related to beauty and style (B&S), 2 presenters are linked to journalistic inquiry (JI), 2 characters belong to the spiritual field (SPI), 6 celebrities can be classified as politicians (POL), and 6 as TV entertainers (ENT).

Since one of the objectives of this study is to verify how the proposed threshold b depends on the size of the category s, the results of the method to discern between Dislike and Nothing are presented here only for the beauty and style and the TV entertainers groups. These 2 groups have been selected because they have different values of s. Table 2 contains the initial information for the application of this approach. In the B&S group, there are 2685 users with 0 Likes, 1191 who placed a Like for only one character, 396 for two characters and 91 for all characters.

Table 2 Number of Like in each group for user, Pharma, Italy, 2015

Table 3 describes the framework of Likes for the B&S group. The analysis is limited to 1678 users with at least a Like in the category. For users with 0 Like, data are missing but not imputable. From now on, B&S1 stands for character 1 in B&S group and so on. B&S1 had 1157 Likes, B&S2 619 and B&S3 480. Total number of Likes is 2256 and the missing values to disambiguate are 2778.

Table 3 Likes and missing values in the B&S group, Pharma, Italy, 2015

Table 4 explains the structure of Likes for each character in the B&S group. Users that gave a Like only to B&S1 are 721 (254 for B&S2 and 216 for B&S3), users that gave a Like to B&S1 and one other character (B&S2 or B&S3) are 345, while only 91 users placed a Like on all 3 celebrities of the group. B&S1 has the highest percentage of users with a single Like, while B&S2 and B&S3 tend to have a higher percentage than B&S1 among users with 2 or 3 Likes.

Table 4 Number of Likes in the B&S group, Pharma, Italy, 2015

Using the approach proposed in Sect. 3, the b threshold could be computed:

$$\begin{aligned} b = \frac{1/3 \times 1191 + 2/3 \times 396+ 3/3 \times 91}{1678} = 0.448 \end{aligned}$$
  • if \(q_i> 0.448\) the missing values are substituted with a Dislike

  • if \(q_i \le 0.448\) the missing values are substituted with a Nothing
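
The computation above can be reproduced from the counts reported for the B&S group. The sketch also shows that the weighted average coincides with the total number of Likes divided by s times the number of users with at least one Like, which follows algebraically from the definition of b; the variable names are assumptions.

```python
s = 3  # number of pages in the B&S group
users_by_likes = {1: 1191, 2: 396, 3: 91}  # users with exactly k Likes (from the text)

active = sum(users_by_likes.values())                        # users with at least one Like
total_likes = sum(k * f for k, f in users_by_likes.items())  # Likes placed in the group

# Weighted average of the proportions k/s, as in the formula for b.
b = sum((k / s) * f for k, f in users_by_likes.items()) / active

# Equivalent form: share of Liked entries among the users with at least one Like.
b_alt = total_likes / (s * active)

# Imputation label for users with k Likes and at least one missing value.
labels = {k: ("Dislike" if k / s > b else "Nothing") for k in users_by_likes if k < s}

print(active, total_likes, round(b, 3), labels)
```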

If a user gave a Like only to B&S3, \(q_i = 1/3 = 0.333 < 0.448\), so the missing values are imputed as Nothing. The Dislike imputation occurs only when there is empirical evidence to support the hypothesis that the Like was not placed even though the user could know the content of the other pages in the category. On the other hand, even when the user knows the category, having placed a Like in the group, it is not possible to be sure that the absence of a Like corresponds to a negative opinion.

Table 5 shows the imputation results: the lowest percentage of Dislike is for B&S1 (\(3.0\%\)), the highest for B&S3 (\(13.2\%\)). Although B&S3 was the celebrity with the lowest number of Likes, it is plausible that he/she was simply less known, having the highest number of Nothing imputations, while B&S1 was the most appreciated, receiving the lowest number of Dislikes.

Table 5 Results of imputation technique for B&S group, Pharma, Italy, 2015

The same analysis can be replicated for the other groups of celebrities: it is interesting to verify the accuracy of the proposed approach for a category with more modalities. The ENT group is composed of 6 celebrities. The users who placed at least one Like are 3192, and the total number of Likes is 6725 (see Table 6).

Table 6 Likes and missing values in the ENT group, Pharma, Italy, 2015

The most popular celebrities receive a higher percentage of Likes from users who placed 1 or 2 Likes. By contrast, the least known characters tend to collect Likes together with the most popular ones. However, this does not hold for ENT6: although he/she is the celebrity with the lowest number of Likes, he/she has a high percentage of Likes from users who placed only 1 or 2 Likes (\(20.5\%\) and \(25.5\%\)) (see Table 7).

Table 7 Number of Likes in the ENT group, Pharma, Italy, 2015

The threshold b is computed for the ENT group:

$$\begin{aligned} b = \frac{1/6 \times 1366 + 2/6 \times 850+ 3/6 \times 485 + 4/6 \times 302 + 5/6 \times 138+ 6/6 \times 51}{3192} = 0.351 \end{aligned}$$

Imputation results are presented in Table 8. The percentage of Nothing imputation tends to increase when the number of Likes decreases. Compared to the B&S group, the growth of the Dislike imputation is less evident. The lowest value belongs to ENT3 (\(8.3\%\)), followed by ENT2 (\(9.4\%\)) and ENT1 (\(10.0\%\)).

Table 8 Results of imputation technique for ENT group, Pharma, Italy, 2015

As proposed in Sect. 3, to show that in this application the method allows more Dislike imputations when the category is wider, it is necessary to build a matrix of distances between the celebrities, considering the presence and the absence of a Like for each page (see Table 9).

Table 9 Distance matrix for ENT celebrities, Pharma, Italy, 2015

For each group, the two closest celebrities are joined and the first threshold is calculated. Each time a modality is added, a new threshold is computed. According to the values in Table 9, ENT2 and ENT3 are joined first; for a hypothetical group with only ENT2 and ENT3, the threshold is equal to 0.627. At the second step, ENT1 is joined to ENT2 and ENT3, and the second threshold is 0.465. Each time a new celebrity is added, the threshold decreases, until the entire group is reached with the final threshold of 0.351 (see Table 10).

Table 10 Size of categories and thresholds for ENT celebrities, Pharma, Italy, 2015

The inverse correlation between the width of the category s and the proposed threshold b is also shown in Fig. 1, in which a regression model is applied.

Fig. 1 Inverse relationship between size and threshold for ENT group in social media data

The IE index considers how much the category was not appreciated compared to the total of users that did not express a preference. The IE index is higher in the ENT group, with \(17.6\%\), compared with \(14.2\%\) in the B&S group. This is consistent with the hypothesis that the bigger the category, the greater the informative gain. The biggest categories are more informative because they contain more celebrities with shared characteristics.
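
The two values of the Informative Earning index can be cross-checked from the counts of users by number of Likes reported for the two groups. The sketch below assumes that a user with k Likes contributes s - k missing entries, all imputed as Dislike when \(k/s > b\) and as Nothing otherwise; the function name `impute_summary` is hypothetical. Under these assumptions the index is about 14.3 for B&S and 17.7 for ENT, close to the reported \(14.2\%\) and \(17.6\%\) up to rounding.

```python
def impute_summary(users_by_likes, s):
    """Return (b, dislike, nothing, IE) for a category of s pages.

    users_by_likes maps k (number of Likes, k >= 1) to the number of users with k Likes.
    """
    active = sum(users_by_likes.values())
    likes = sum(k * f for k, f in users_by_likes.items())
    b = likes / (s * active)        # equivalent to the weighted average of k/s
    missing = s * active - likes    # entries to disambiguate
    # A user with k Likes has s - k missing entries; they become Dislike when k/s > b.
    dislike = sum((s - k) * f for k, f in users_by_likes.items() if k / s > b)
    nothing = missing - dislike
    return b, dislike, nothing, 100.0 * dislike / missing


bs = impute_summary({1: 1191, 2: 396, 3: 91}, s=3)
ent = impute_summary({1: 1366, 2: 850, 3: 485, 4: 302, 5: 138, 6: 51}, s=6)

for name, (b, dislike, nothing, ie) in (("B&S", bs), ("ENT", ent)):
    print(name, round(b, 3), dislike, nothing, round(ie, 2))
```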

5 Conclusions

In a context made of digital tools and a continuous exchange of content and interactions among stakeholders, social networks have become crucial in leading the markets. Data collected through these platforms can express users’ preferences by means of the Like as a way to show appreciation. The Like makes it possible to express a positive viewpoint on posts, pictures and pages, and many social media platforms use this model to evaluate the popularity of information. From a methodological point of view, the Like system presents some complications and issues that are very often the object of analysis by researchers. The most discussed issue is the lack of a corresponding tool to express a negative opinion. This means that every time a user does not express a preference using a Like, this could be interpreted as a missing value, creating a sparse data matrix.

The first issue regarding missing values is to understand their generation mechanism and the causality, that is to understand whether the lack of data of Y does not depend on observed and not observed values. Based on this mechanism, missing observations could be classified as MCAR, MAR and MNAR data. Once this mechanism is known, it is possible to proceed with some imputation methods. When the cause is unknown, a possible solution is to treat the missing values as a modality. It is accepted that data are missing for an unknown cause, without identifying it and recognising its existence.

The present study aimed to introduce an exploratory approach capable of detecting the presence of a behaviour when considering missing data in social media analysis. In particular, the concept of missing observations is presented here in relation to the possibility for a user to declare an appreciation using the Like instrument. Since social media do not give a chance to express a lack of appreciation, a missing value could be interpreted as a lack of appreciation for a social media page. But a user may also not give a Like because he/she does not know the page; therefore, the approach looks for a delimitation rule to discriminate between a Dislike and a Nothing. Dislike expresses a negative position because the user chooses not to give a Like. Nothing conveys a neutral opinion or a lack of knowledge.

The case study used a dataset with the presence or absence of a Like for celebrities divided into several categories for some social media users. Results were presented here for two categories of different size: Beauty and Style with three celebrities and Entertainment with six characters. The delimitation rule, obtained using a threshold b based on the total number of Likes in the category, tends to turn the missing value into a Dislike more frequently for celebrities with a lower number of appreciations. Moreover, the rule appears to be sensitive to the size of the group: the larger the category, the lower the threshold. This tends to be confirmed even when the b value is iteratively computed within the same category. Finally, an index of informative gain was introduced to measure the percentage of missing values that are transformed into a Dislike. It was shown that this index is higher for the group with more celebrities.

In conclusion, this study proposed a method of imputation capable of discerning a missing observation from a behaviour. The informative gain index yields satisfactory results, increasing the quantity of valuable observations. This method also represents a data pre-processing technique that can be replicated in similar situations involving social media platforms. It should be recalled that this approach is data-driven. The threshold b depends on the size s of the category and on the number of Likes in the group of social media pages; therefore, the validity of this technique is strictly related to the collected data.

Future work should focus on the application of this technique to other social media platforms or to different kinds of data. From a methodological point of view, some corrections could be applied to the delimitation rule to improve the validity of the approach. Another direction could concern the study of the structure of the graph underlying the interactions among users.