Survey of existing literature
To guide the overall literature review, we refer to the survey of network-based marketing methodologies conducted by Hill et al.2 The authors provide an overview of key strategies for identifying likely adopters through consumer networks. Their research provides a detailed overview of existing work in this domain, and highlights the challenges and issues encountered when dealing with econometric models, network classification models, surveys, designed experiments, diffusion models and collaborative filtering systems. Briefly, econometric models provide empirical estimations of economic relationships through statistical methods. Network classification models aim to quantify interest between entities in a network (eg Google's PageRank algorithm). Traditional surveys enable data on consumers' word-of-mouth behaviour to be collected. Designed experiments provide controlled settings through which one can observe interactions between subjects. Diffusion models provide tools for assessing the likely rate of diffusion of a technology or product in networks. Finally, collaborative filtering systems make personalized recommendations to individuals based on demographic, content and link data.
Moreover, they present empirical evidence that statistical models built from large amounts of geographic, demographic and prior purchase data are significantly improved through the inclusion of network link information. Such models can be used to target potential consumers more effectively when relying on network-based marketing methodologies.
This survey allows us to broadly identify the approaches most likely to be suited to Facebook, and the metrics most relevant in the context of each approach. However, the survey does not take into account the specific nature of social media platforms in general, and Facebook in particular, which may differ from ‘real-world’ social environments and may present other characteristics and data sets relevant to this work, such as a wealth of text-based conversations. As such, it is necessary for us to broaden the literature review to encapsulate additional research in this domain. We categorize existing work according to the following research areas: network diffusion models (section ‘Information diffusion’), collaborative filtering and recommender systems (section ‘Recommender systems’) and semiotic and linguistic analysis (section ‘Semiotic and semantic analysis’). For each category, we provide a brief overview, analyse the work most relevant to our problem domain and detail the potential approaches suited to Facebook, along with their caveats.
Information diffusion
Diffusion theory provides tools to assess the likely rate of diffusion of a technology or product. Understanding individual-level data and the connections between users of a social network has become of great value to marketers, owing to the impact of the network structure on the diffusion of a marketing message.
We begin this section with an overview of network statistics and of information diffusion and its relation to social media environments and marketing. This overview aims to provide clear definitions for notions such as diffusion and adoption, and the necessary background for this work. We then examine related work focused specifically on social media platforms such as Facebook and Twitter, and finally identify limitations of this work and potentially applicable approaches.
Statistics of network structure
Any social network can be described as a graph G=(V, E), where V is a non-empty set of nodes (or actors) and E is a set of edges (or links) that connect pairs of nodes. The actors correspond to individual persons in the social network and the links correspond to some type of social relationship between the actors (eg friendship). The number of actors n=|V| is the order of the graph, while the number of links m=|E| is the size of the graph. The adjacency matrix A is an n × n matrix that indicates which links exist between the actors. If the actors v, w are connected by a directed link (from v to w), then the element $a_{v,w}$ is equal to 1; otherwise it is equal to 0. If the edge is undirected then $a_{v,w} = a_{w,v} = 1$. In a social network information may flow in a bi-directional manner (Facebook friendship links must be reciprocated), so we consider here primarily undirected graphs. However, other social networks such as Twitter can have uni-directional links, where following a user does not have to be reciprocated. In addition, there are no links that have the same actor as both endpoints (self-loops), so the adjacency matrix of a social graph is symmetric and its diagonal elements $a_{i,i}$ are equal to 0.
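To ground these definitions, the following is a minimal sketch in Python (assuming the networkx and numpy libraries, with toy friendship data) that builds an undirected social graph and checks the stated properties of its adjacency matrix:

```python
import networkx as nx
import numpy as np

G = nx.Graph()  # undirected: Facebook friendships must be reciprocated
G.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "alice"), ("carol", "dave")])

n = G.number_of_nodes()   # order of the graph, n = |V|
m = G.number_of_edges()   # size of the graph, m = |E|

A = nx.to_numpy_array(G)          # adjacency matrix
assert np.array_equal(A, A.T)     # symmetric: links are undirected
assert np.all(np.diag(A) == 0)    # no self-loops
print(n, m)
```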
A number of statistical measurements have been proposed to quantitatively characterize complex networks and study their topological properties. These measurements can be used to analyse and understand the topological characteristics of social networks and validate the fidelity of synthetic models that try to reproduce their function. We describe below the most important topological metrics and analyse their physical significance. An extensive survey of such measurements is provided in Costa et al.8
The degree $k_i$ of a node $i$ is the most basic metric for the importance of a node and indicates the number of links (eg acquaintances) adjacent to it. For undirected graphs:

$$k_i = \sum_{j \in V} a_{i,j}$$

The average degree $\langle k \rangle$ of a network is the average of all $k_i$ in the network:

$$\langle k \rangle = \frac{1}{n} \sum_{i \in V} k_i$$
High average degree means that the network is densely connected. However, it is a very coarse metric and cannot be used for detailed analysis as graphs with the same average degree may have radically different topology.
The degree distribution P(k) of a graph is the probability that a randomly selected node has degree k. If n(k) is the number of nodes with degree k, then the degree distribution can be calculated as

$$P(k) = \frac{n(k)}{n}$$

The degree distribution is a more informative characteristic, from which we can also calculate $\langle k \rangle$ as:

$$\langle k \rangle = \sum_{k} k \, P(k)$$
For graphs where any two nodes are connected with equal probability, the degree distribution is binomial, or Poisson for sufficiently large graph size. A well-known example of such a graph is the Erdös-Rényi random graph model.9 On the contrary, real graphs have long tails, and many of them follow the power-law distribution:

$$P(k) \sim k^{-\gamma}$$
Graphs with power-law degree distribution are also called scale-free.10 The actual meaning of a power-law degree distribution is that most of the nodes have relatively small degree, but a very small number of nodes have disproportionately larger degree. Naturally, these ‘rich’ actors function as information hubs for the poorly connected nodes.
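As an illustration of these definitions, the sketch below computes the empirical degree distribution P(k)=n(k)/n and derives ⟨k⟩ from it. It assumes networkx, and uses a synthetic preferential-attachment (scale-free) graph in place of real social network data:

```python
import collections
import networkx as nx

# Synthetic scale-free graph grown by preferential attachment.
G = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)

degrees = [d for _, d in G.degree()]
n = G.number_of_nodes()
counts = collections.Counter(degrees)                # n(k)
P = {k: nk / n for k, nk in sorted(counts.items())}  # P(k) = n(k)/n

avg_k = sum(k * p for k, p in P.items())             # <k> = sum_k k P(k)
print(f"<k> = {avg_k:.2f}, max degree = {max(degrees)}")
# In a heavy-tailed graph, the maximum degree sits far above <k>.
```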
If there is a positive correlation between the degrees of the connected actors, the network is assortative while if there is a negative correlation the network is disassortative. Thus, in assortative networks the actors prefer to connect to other actors of similar degree while in disassortative networks the actors seek connections with nodes of dissimilar degree.
The clustering coefficient $c_i$ provides information on the density of connections in the neighbourhood of an actor. Clustering indicates the path diversity around an actor; therefore, when some information reaches an actor with high clustering, the information will spread with high probability throughout the cluster. If two neighbours of a node are connected, then a triangle (or 3-cycle) is formed between these nodes. The maximum number of triangles for a given node with degree $k_i$ is $k_i(k_i-1)/2$. Therefore, the clustering coefficient can be expressed as the fraction of the number of triangles over the total number of possible triangles:

$$c_i = \frac{2\,\delta_i}{k_i(k_i - 1)}$$

where $\delta_i$ is the number of triangles that node $i$ shares.
Shortest path distribution: The shortest path length distribution P(h) is defined as the probability that two random actors are at a minimum distance of h hops from each other. A summary statistic is the average shortest path length:

$$\langle h \rangle = \sum_{h} h \, P(h)$$

The maximum shortest path length between any pair of actors is called the diameter of the network.
For a graph with fixed average degree, if the average shortest path grows logarithmically or even more slowly with the growth of the graph order, then the graph exhibits the small-world property.11 Graphs with a power-law degree distribution (see above) with exponent γ=3 have mean diameter asymptotically log n/log log n,12 while for exponent 2<γ<3, 〈h〉∼log log n, and for γ>3, 〈h〉∼log n. The small-world property affects many basic properties of the graph, such as the spread of information or epidemics.
Betweenness centrality: The importance of an actor or a link in a graph is usually defined by the number of shortest paths in which this actor (link) participates. When an actor (link) participates in many shortest paths, this node is said to be closer to the centre of the graph. Betweenness $B_v$ is the measurement that quantifies the centrality of an actor:

$$B_v = \sum_{i \neq v \neq j} \frac{\sigma_{i,j}(v)}{\sigma_{i,j}}$$

where $\sigma_{i,j}$ is the total number of shortest paths between actors i, j and $\sigma_{i,j}(v)$ is the number of shortest paths between i, j that pass through actor (link) v. To normalize the actor betweenness in order to compare different graphs, we divide $B_v$ by (n−1)(n−2)/2, which is the maximum possible value of actor betweenness in a graph.13
Centrality is important in analysing information flows in a network and in discovering which links can serve as bridges between different clusters.
Coreness: The k-core of a network can be defined as its maximal subgraph in which each vertex has at least degree k. The k-core of a graph can be formed by recursively deleting all nodes with degree smaller than k. The cores of a graph form layers in which the (k+1)-core is always a subgraph of the k-core. k-core decomposition can be used to analyse the cohesiveness of a network.
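The metrics discussed in this subsection are all available in standard graph libraries. As a brief illustration (assuming networkx, and a synthetic small-world graph rather than real Facebook data), the following computes clustering, average shortest path length, normalized betweenness and coreness:

```python
import networkx as nx

# Synthetic small-world graph; guaranteed connected so <h> is defined.
G = nx.connected_watts_strogatz_graph(n=1_000, k=10, p=0.05, seed=1)

c = nx.average_clustering(G)             # mean clustering coefficient
h = nx.average_shortest_path_length(G)   # <h>
B = nx.betweenness_centrality(G)         # normalized by (n-1)(n-2)/2
core = nx.core_number(G)                 # coreness (k-shell index) per node

hub = max(B, key=B.get)
print(f"clustering={c:.3f}, <h>={h:.2f}, "
      f"most central node={hub}, its coreness={core[hub]}")
```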
Theoretical information diffusion models
Network epidemics: The majority of research efforts on modelling the flow of information and influence throughout networks has been conducted in the context of epidemiology, where the idea of an epidemic is not limited to the spread of a virus but includes the spread of ideas, news, products or behaviour. The latter are also described as social contagion.
The classical disease propagation models are based on the stages of a disease in a host: susceptible, infected and recovered. After recovery an actor can become immune (SIR) or once again susceptible (SIS). The initially infected nodes in our case would correspond to people who adopt a product without first receiving recommendations. In the SIR model, a susceptible person has a uniform probability β per unit time of becoming infected from any infected contact, while the infective individuals recover at some rate γ. The fractions s, i and r of individuals in the states S, I and R are determined by the differential equations:

$$\frac{ds}{dt} = -\beta s i, \qquad \frac{di}{dt} = \beta s i - \gamma i, \qquad \frac{dr}{dt} = \gamma i$$
In the case of the SIS model the ‘cured’ individual goes back to the susceptible pool, thus the above differential equations are revised as follows:

$$\frac{ds}{dt} = \gamma i - \beta s i, \qquad \frac{di}{dt} = \beta s i - \gamma i$$
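For illustration, these equations can be integrated numerically. The sketch below (assuming scipy, with arbitrary illustrative values for β and γ rather than fitted parameters) traces the fraction of ‘infected’ adopters over time in the SIR model:

```python
import numpy as np
from scipy.integrate import odeint

beta, gamma = 0.3, 0.1            # infection and recovery rates (illustrative)

def sir(y, t):
    # Right-hand side of the SIR differential equations above.
    s, i, r = y
    return [-beta * s * i, beta * s * i - gamma * i, gamma * i]

t = np.linspace(0, 100, 1_000)
s0, i0, r0 = 0.99, 0.01, 0.0      # 1 per cent initial adopters/infected
s, i, r = odeint(sir, [s0, i0, r0], t).T
print(f"peak infected fraction: {i.max():.3f}, final reached: {r[-1]:.3f}")
```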
The epidemic threshold of the SIS and SIR models determines whether a disease can dominate or die out. In networks with power-law degree distribution there is no non-zero epidemic threshold so long as the exponent of the power-law is less than 3. Since most power-law networks satisfy this condition we expect diseases always to propagate in these networks, regardless of the transmission probability between individuals.14 However, there may be a non-zero threshold if the network is low-dimensional (rather than infinite) or if the network has high clustering coefficient.
Although these epidemiological models are useful in understanding the basic dynamics, their central problem is the assumption that spreading depends only on a single parameter specifying the infectiousness of the disease. This would mean that the entire population is equally susceptible to an idea or product purchase. This assumption is unrealistic, since diseases and ideas spread only between individuals who are in actual contact.
Adoption of ideas or products: One of the first product diffusion models was proposed by Bass.15 The Bass model predicts the number of people who will adopt an innovation over time. It is agnostic of the underlying network structure and assumes that the adoption rate depends on the current proportion of the population who have already adopted the innovation. The diffusion equation models the cumulative proportion of adopters in the population as a function of three parameters: the potential market, a coefficient of innovation p that expresses the intrinsic adoption rate, and a coefficient of imitation q that determines the influence of social contagion. The cumulative adoption follows an S-curve, which means that adoption is initially slow, then increases exponentially and flattens at the end, as shown in Figure 2.
The Bass model can provide a useful forecast at the aggregate level, but it does not provide insights into the diffusion process.
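For illustration, the commonly used closed-form solution of the Bass diffusion equation can be evaluated directly. The sketch below is a minimal example; the values of p, q and the market size M are purely illustrative, not estimates from data:

```python
import numpy as np

p, q, M = 0.03, 0.38, 1_000_000   # innovation, imitation, potential market

t = np.linspace(0, 20, 200)
# Closed-form cumulative adoption fraction F(t) of the Bass model.
F = (1 - np.exp(-(p + q) * t)) / (1 + (q / p) * np.exp(-(p + q) * t))
adopters = M * F                  # cumulative adopters: the S-curve of Figure 2
print(f"adopters after 5 periods: {M * np.interp(5, t, F):,.0f}")
```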
On the other hand, the Linear Threshold Model16 tries to predict the spread of products at the individual level. Each actor u in the network is assigned a threshold $t_u \in [0, 1]$ drawn from some probability distribution. Each actor u is influenced by each neighbour w according to a weight $b_{u,w}$, such that $\sum_w b_{u,w} \leqslant 1$. An actor adopts a product if the sum of the connection weights of its neighbours who have already adopted the product is greater than the threshold of the actor, namely $t_u \leqslant \sum_{w \in \text{adopters}} b_{u,w}$. After specifying the weights and the threshold for each actor, the diffusion process evolves deterministically in discrete steps.
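A minimal simulation of this process might look as follows. This is a sketch under stated assumptions: networkx, a synthetic random graph, uniform random thresholds, weights normalized by degree and five randomly chosen initial adopters:

```python
import random
import networkx as nx

random.seed(0)
G = nx.erdos_renyi_graph(200, 0.05, seed=0)

threshold = {u: random.random() for u in G}                  # t_u ~ U[0, 1]
weight = {u: {w: 1 / G.degree(u) for w in G[u]} for u in G}  # sum_w b_{u,w} <= 1

active = set(random.sample(list(G), 5))    # initial adopters
changed = True
while changed:                             # deterministic discrete steps
    changed = False
    for u in G:
        if u in active:
            continue
        influence = sum(weight[u][w] for w in G[u] if w in active)
        if influence >= threshold[u]:      # adopt when t_u <= sum of weights
            active.add(u)
            changed = True
print(f"final adopters: {len(active)} of {G.number_of_nodes()}")
```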
A similar approach is followed by the Independent Cascade model.17 We start again with a set of initial adopters and the process unfolds in discrete steps. Whenever a neighbour v of an actor u adopts a product in a step t, it has a single chance to also activate the actor u, with probability $p_{u,v}$. If v succeeds, then u will become active in step t+1; but whether or not v succeeds, it cannot make any further attempts to activate u in subsequent steps.
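A comparable sketch of the Independent Cascade process is shown below; a single uniform activation probability for all links and a synthetic graph are assumed for simplicity:

```python
import random
import networkx as nx

random.seed(1)
G = nx.erdos_renyi_graph(200, 0.05, seed=0)
p_uv = 0.1                                 # uniform activation probability

active = set(random.sample(list(G), 5))    # initial adopters
frontier = list(active)
while frontier:
    next_frontier = []
    for v in frontier:                     # v adopted at step t
        for u in G[v]:
            if u not in active and random.random() < p_uv:
                active.add(u)              # u becomes active at step t + 1
                next_frontier.append(u)
    frontier = next_frontier               # failed attempts are never retried
print(f"cascade size: {len(active)}")
```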
The last two diffusion models are not independent of the network structure, and their behaviour depends on the selection of the initial adopters. These models give rise to the question of which individuals are more influential and better suited for spreading a product.
As described by the diffusion models, the decision an individual takes on whether to adopt a product can be influenced, explicitly or implicitly, by their social contacts, as will be seen in the following section. In order to effectively employ ‘word-of-mouth’ recommendations, it is thus essential for companies to identify the so-called opinion leaders to target, assuming that influencing them will lead to a large cascade of further recommendations. More formally, the influence maximization problem is the following: given a probabilistic model for influence, can we determine a set of individuals that will yield the largest expected cascade?
Early studies of scale-free networks identified the most connected people (hubs) as the key players in large-scale spreading processes.14 Furthermore, in the context of social networks it is believed that the actors with the highest betweenness centrality exercise the largest interpersonal influence.18 However, a recent study19 revealed that in many cases the most influential spreaders do not correspond to the best connected or most central individuals. Instead, by applying the SIR and SIS models to a series of online communities (livejournal.com, the UCL CS email contact network, imdb.com), the authors find that a better metric for discovering the most influential spreaders is how close an individual is located to the core of the network. The core is defined by k-shell decomposition (coreness). They also discover that when there is more than one initial spreader, the distance between them is crucial to extending the cascade. Figure 3 shows how a hub produces a smaller cascade than an individual with fewer links but strategically placed close to the core of the network.
An important contribution in understanding the spread of information is the theory of ‘the strength of weak ties’.20 This theory claims that individuals are often influenced by others with whom they have sparse or even random interactions. These interactions are labelled as ‘weak ties’ in contrast to the strong ties that an individual has with his close friends or family members. The significance of weak ties is based on the fact that they have the potential to bring together communities that otherwise are isolated from each other. Strong ties affect the spread of the information inside a closed circle of acquaintances, but weak ties facilitate the wider diffusion. This theory is also confirmed by Goldenberg et al.17 where the authors — based on the Independent Cascade Model — show that weak ties are at least as influential as the strong ties. A direct indication is that in marketing systems based on referrals, the participants should be encouraged to refer people with whom they have a weak relationship, otherwise there is the risk that referrals will be limited to a small network subgroup.
Relevant research
The models presented in the previous section address the problem of how influence spreads in a network from a theoretical point of view, and are based on assumed influence effects rather than actual data. A number of experimental studies have been conducted to track the diffusion of information in online social networks. The difficulties involved in mining Facebook data, mainly due to privacy issues, have resulted in a relatively small set of studies directly tied to the platform, albeit ones with particularly important insights.
A critical question is whether Facebook exhibits the topological characteristics of other social networks, which exhibit power-law degree distributions. Since it is not possible to obtain complete connectivity information, sampling techniques are used to obtain unbiased topological measurements.21 Sampling based on the Metropolis-Hastings random walk and the Re-weighted Random Walk reveals that the node degree distribution exhibits a heavy tail (with high degree nodes occurring with higher probability) but not a power-law distribution, unlike most offline social networks. These results are also confirmed by Sala et al.,22 who propose the use of a Pareto-Lognormal degree distribution for more accurate modelling of the Facebook network.
One of the first large-scale measurement studies of the Facebook OSN is presented in Nazir et al.23 To obtain a rich dataset, the authors developed three Facebook applications whose popularity reached the top 1 per cent of Facebook applications, with a combined user base of 8 million users. According to their results, Facebook applications exhibit the preferential attachment property,10 according to which the probability that a user will install an application is proportional to the current popularity of the application. This explains the fact that application installations follow a power-law distribution. The preferential attachment derives directly from the news feed of the Facebook platform, which informs a user of the activities of his friends. A second important observation is that the popularity of applications that cross a certain threshold of installations is not affected by sharp daily drops. In contrast, even small changes in the daily usage of less popular applications lead to significant changes in an application's rank. The authors also observe that applications that fail to gain popularity during their initial deployment stage find it very difficult to reach a high ranking, indicating that this first stage is decisive for the future user base of an application. Finally, the authors observe that user interactions also follow a power-law distribution. This result implies that a small number of ‘power users’ dominate the user interactions of applications, and are responsible for sustaining an application at a high rank.
Another important study of the diffusion of applications on the Facebook platform is presented in Onnela and Reed-Tsochas.24 The authors study how social influence and collective behaviour affect the decision of individual users on whether to install an application. Their analysis is based on hourly data gathered between June and August 2007 for 2,720 Facebook applications with a total of 104 million installations. Their results are surprising and show the existence of two distinct regimes of behaviour. Once applications cross a particular threshold of popularity, social influence becomes a highly determinant factor in user behaviour, and leads some applications to extraordinary levels of popularity. Below this threshold, the collective effect of social influence appears to disappear almost entirely. These results demonstrate that social influence can spontaneously assume an on/off nature in an online social network, in a manner not observed in the offline world. Additional simulation results with synthetic time series show that this transition is not equivalent to a standard epidemic threshold.
Sun et al.25 explore a dataset of 262,985 Facebook pages to provide an empirical investigation of diffusion and the effect of news feed broadcasting. This unique insight into diffusion on Facebook, built on data not normally directly accessible, offers conclusions similar to those stated previously. While long chains of up to 82 levels have been observed, diffusion on Facebook is rarely the result of a single large cascade. Instead, it exhibits diffusion patterns characterized by large-scale collisions of shorter chains: diffusion events are often related to publicly visible pieces of content that are introduced into a network from multiple sources, often merging into one large group of friends. Start nodes, at the source of the chains, constitute an average of 14.8 per cent of users for pages with over 1,000 fans. Using zero-inflated negative binomial regressions, the authors also find that maximum diffusion chains cannot be predicted based on demographics or number of friends.
Other approaches, not necessarily tied to Facebook, have aimed to uncover the structure of a social network through the use of penetration data,26 relying on dissemination patterns to estimate properties of the network (eg the type of degree distribution, such as Gaussian/Poisson, uniform, scale-free or lognormal). Using this information as a basis, the authors define a growth model that more precisely estimates the magnitude of the contagion process. Through their experimental set-up, which relies on CD sales data, online movie data based on search query volume, and Friendster data, they conclude that underlying network structures can be correctly identified through the use of their model and observed ‘infection’ rates, such as search volumes over time.
Identified methodologies and limitations
There are a few key conclusions to take away from this initial set of related work. First, content disseminated on Facebook appears to follow a pattern of diffusion similar to that of product diffusion: viral processes or word of mouth do not take place until a certain threshold of deployment has been reached. Only then can we observe the impact of social influence. This impact may be limited in its depth: for fan pages, it appears that a significant number of start nodes will have found the page independently before some form of cascading takes place. Whether these patterns of use persist past the initial introduction of the content, or once the content appears to be no longer relevant, is not clear from these studies. Indeed, further work is needed to understand the complete content life cycle, from introduction to decline in popularity, in order to correctly determine the level of investment required throughout: this ‘viral life cycle’ is discussed in more detail in the section ‘Challenges and research opportunities’.
While work has focused on independent subsets of the Facebook feature space, such as the ‘fanning’ of pages or application installations, little work has compared the impact of these features on the overall diffusion process. Actively allowing users to invite others to use an application has, for example, a different impact from passive broadcasting via the news feed, and this may affect the marketing process in a way that is just as significant as the network structure. As such, it is important to examine closely other approaches, which look either at the operation of specific Facebook features, such as news feed population through ranking-based recommendations (section ‘Recommender systems’), or at the nature of the content itself (section ‘Semiotic and semantic analysis’).
Recommender systems
Overview
Recommender systems aim at identifying interesting items (eg books, movies, websites, conversations) for a given user based on their previously expressed interests and other data such as demographics and social links. Beyond the obvious use of recommendation systems to suggest items to purchase, they can form an integral part of the social media experience: Facebook's news feed algorithm EdgeRank, detailed in the section ‘Background: Facebook’, is an obvious example of a ranking-based approach to recommendation, aiming to highlight conversations or other posts that a user is most likely to be interested in. It is for this reason that we examine in this section recommender systems, and their operation and use in social media environments.
Collaborative filtering is one of the most popular techniques for this purpose,27 relying on the basic idea that users who have had similar preferences in the past are likely to have similar preferences in the future. From there, various approaches can be adopted: model-based approaches provide item recommendations by first developing a model of user ratings. The training samples are used to generate an abstraction, which is then used to predict ratings for items that are unknown to the user. Examples of probabilistic models proposed in this regard include the work of Breese et al.,28 who model item correlation using a Bayesian Network in which the conditional probabilities between items are maintained, and a model29 that computes the probability that users belong to particular personality types, as well as the probability that certain personality types like new items.
Memory-based approaches on the other hand are the most widely adopted. In these approaches, all user ratings are indexed and stored into memory. In the rating prediction phase, similar users or items are sorted based on the memorized ratings. Relying on the ratings of these similar users or items, a prediction of an item rating for a test user can be generated. Examples of memory-based collaborative filtering include user-based methods,27 item-based methods30 and combined methods.31
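To make the memory-based idea concrete, the following minimal sketch predicts a missing rating as a similarity-weighted average of other users' ratings. The toy rating matrix and cosine similarity over co-rated items are illustrative choices, not a reproduction of any of the cited systems:

```python
import numpy as np

R = np.array([[5, 3, 0, 1],    # rows: users, cols: items, 0 = unrated
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 4, 4]], dtype=float)

def cosine(a, b):
    mask = (a > 0) & (b > 0)               # compare co-rated items only
    if not mask.any():
        return 0.0
    return a[mask] @ b[mask] / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask]))

def predict(user, item):
    # Similarity-weighted average over users who have rated the item.
    sims = np.array([cosine(R[user], R[v]) if v != user and R[v, item] > 0 else 0.0
                     for v in range(len(R))])
    return sims @ R[:, item] / sims.sum() if sims.sum() > 0 else 0.0

print(f"predicted rating of user 1 for item 2: {predict(1, 2):.2f}")
```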
It is also worth mentioning content-based approaches, which take into account domain-specific knowledge during the process of generating prediction. Content-based32 recommender systems deal with domain specific knowledge in order to build on semi-structured information (eg the genre of the movie, the location of the user) and infer relationships between these structures, rather than limiting the analysis to the relationship between user and item.
Researchers have also argued, on the other hand, that framing collaborative filtering as a rating prediction problem has significant drawbacks. In many practical scenarios, such as Amazon's book recommender system,33 a better view of the task is the generation of a top-N list of items that the user is most likely to like. The simple reason is that users only check the correctness of the predictions at the top positions of the recommended list; they are not interested in items predicted to be uninteresting to them, and therefore ranked low on the list. Such ranking-based approaches have been explored in a wide range of works.34, 35
Recommender systems are tightly related to network-based marketing for the fundamental reason that both benefit from exploiting underlying affinities between users. As pointed out by Hill et al.,2 however, few of these works take into account explicit links between users: users are instead linked based on shared purchases or similar ratings. Nevertheless, understanding and analysing affinities between users is key to building marketing messages that resonate with specific users and improving the targeting process, which allows us to identify likely adopters.
We examine in the following subsection research that explores the use of recommendation systems in social media environments.
Relevant research
The inclusion of social media links to improve collaborative filtering has been explored in a number of works. Zheng et al.,36 for example, introduce social network links to complement traditional collaborative filtering approaches and predict ratings of items obtained from online communities, concluding that this provides a net advantage over simple by-item or by-user approaches. The method simply relies on weighted averages of ratings of the same items from self-selected friends. In addition, they conclude that there is little evidence of peer influence.
Guo et al.37 examine the inclusion and impact of social network features in the e-commerce world, alongside the recommendations, product reviews and search features that are now standard. They aim to determine whether social networks can shape purchasing decisions, the factors that influence the success of a recommendation, and the impact of social influence and reputation on commercial activity. The data used as a basis for the research are obtained from a Chinese online marketplace called Taobao, which, alongside the standard functionality required to facilitate user-to-user sales, also includes an instant messaging platform. The analysis of trades, message exchanges and contact lists highlights a number of key findings. In particular, there exists a relationship between social proximity (measured by the number of mutual contacts), the frequency of message exchanges and the likelihood of trades. Using an approach based on directed triadic closure, they further refine their analysis to investigate the process of information passing, looking specifically at message exchanges between buyers prior to purchase from a seller. The overall analysis is used as a basis for a consumer choice prediction algorithm that uses a ranking-based machine learning approach to determine the likelihood that a buyer relies on one particular seller over another for the same product, with a 42 per cent likelihood of picking the correct seller out of ten potential options. The overall conclusion is that social network features are of greater value than other variables, such as product price and seller rating, in predicting purchase behaviour, and that buyer-to-buyer communication is one of the primary drivers of purchasing.
The conclusions from this work most relevant here are that social proximity and frequency of exchanges are key indicators with a potential impact on any marketing strategy. There remain, however, key questions regarding the applicability of these findings to networks not directly related to e-commerce (eg Facebook), and whether similar conclusions can be drawn on the virality of message diffusion.
Chen et al.38 define a combinational method of collaborative filtering, which aims to perform personalized community recommendations based on semantic and user information, using a hybrid training strategy that combines Gibbs sampling and Expectation-Maximization. While we will discuss semantic approaches in more detail in the following section, the combined approach effectively relies on combining collections of users and the words describing a community to identify content similarities. Using the social network Orkut as a source of community and user data, the authors collect information for over 300,000 users and 100,000 communities, though they gain limited insight into friend graphs due to privacy issues. The precision of the recommendations increases from 15 per cent to 27 per cent for users who have joined over 100 communities when using the hybrid approach, with further improvements as the number of communities joined increases.
Though the experimental approach relies on Orkut rather than Facebook, the result of these experiments highlights that where limitations exist with regard to accessing social graphs, recommendation-based and linguistic analysis approaches may serve to compensate for the lack of knowledge of the social graph. Communities and community descriptions serve as a mechanism for identifying potential topics of interest to users, and further communities that could or should be targeted, whether for the purpose of advertising or otherwise.
Chen et al.39 examine the use of ranking-based recommendation systems for recommending conversations in online social streams, focusing particularly on Twitter, with obvious comparisons to EdgeRank. Their approach relies on thread length, topic relevance and tie strength as the main factors for ranking exchanges. Thread length is included primarily on the basis that the longer the conversation, and hence the richer its content, the more likely a user is to be interested. Topic relevance relies on a bag-of-words approach and a tf–idf (term frequency–inverse document frequency) weighting scheme (described in the following subsection) to identify topic interests in the form of frequently mentioned entities or terms. Finally, tie strength is a characterization of the relationship between users, estimated by examining the frequency of past bi-directional communications between two users or their common friends. A survey showed that users expressed much more interest in algorithms that include tie strength, as a measure of how effective the approach was at identifying conversations and content relevant to them.
Finally, Zaman et al.40 use collaborative filtering to predict the spread of information in social networks, looking specifically at Twitter data. Using data on who shared what in the form of retweets (ie the forwarding of a message posted by a user), the authors use a probabilistic collaborative filtering model to predict future retweets at the micro-level. The features used as a basis for analysis were the name, number of followers, time frame and tweet content (main tweet words extracted). They obtained for this purpose 102 million retweets, forming a network of 50 million edges and 7.3 million distinct users, and concluded that their model can predict retweets up to a day in advance.
Identified methodologies and limitations
While our ambition is not to build a recommender system, we can conclude that the approaches exploited by recommender systems are very likely to provide insights into users, such as their interest networks. With a sufficient data set, models built on such systems are likely to help us identify not only items that are likely to appear in a user's news feed, but also those that are most likely to generate some sort of response from that user. Alternatively, these data may help us identify potential ‘likes’ and user interests to complement missing demographics, which would serve to refine the marketing message. This is of particular worth as Facebook social plug-ins are increasingly deployed throughout the web.
Following Zaman's work,40 it remains an open question whether such models can be used to determine the likelihood of sharing at a micro and macro-level in Facebook based on past interactions.
Semiotic and semantic analysis
Overview
As we have briefly touched upon in the previous section, while much work on contagion and recommendation systems examines the how of message propagation in networks, it does not examine the nature of the particular message that facilitated its spread. As highlighted by Berger et al.:41
Focusing on network structure […] and on the influence of social people provides little insight into why certain cultural items become viral while others do not.
To address this issue, various works have aimed to apply semiotic and semantic analysis to social networks in order to gain a clearer understanding of what may become viral, exploiting a wide range of techniques in the process, such as information retrieval and sentiment analysis.
Information retrieval involves the automatic indexing and retrieval of information in some shape or form. Automated classification and analysis techniques are a must considering the volume of human-generated data that can be obtained from social networks. Most of the popular search engines use techniques like term frequency weighting to extract important terms out of documents, and vector space approaches to rank the results according to their relevance.42 tf–idf weighting is a method to calculate the relevance of a term to a document. The measure assumes that the importance of a term in a document corresponds to its number of appearances in that document, as well as in all documents of the whole corpus: a term is more important than another if it occurs more often within the document. The term frequency is combined with the inverse of the document frequency by multiplication, which allows common non-contextual words to be eliminated. Such information may be complemented with part-of-speech information, using algorithms that associate terms with descriptive tags representing their relationship to other adjacent terms (eg verbs, nouns, adjectives, etc). In this way, a document, such as a conversation or wall post, can be represented by or classified according to the main themes and topics that emerge.
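As a concrete illustration of the weighting scheme, the following hand-rolled sketch computes tf–idf scores over a toy corpus of wall-post-like documents; the whitespace tokenization is a simplifying assumption (production systems would use proper tokenizers and far larger corpora):

```python
import math
from collections import Counter

docs = ["great new phone love the camera",
        "the battery of the phone is poor",
        "love the new album great songs"]
tokenized = [d.split() for d in docs]

df = Counter(term for doc in tokenized for term in set(doc))  # document frequency
N = len(docs)

def tf_idf(doc):
    # tf(t) * idf(t): term frequency times log of inverse document frequency.
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf}

scores = tf_idf(tokenized[0])
# Words common to every document (eg 'the') get weight 0; distinctive
# terms rank highest.
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])
```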
Opinion mining and sentiment analysis build on similar concepts,43 though they are less concerned with the topic or factual content of a document than with the opinion it expresses. While issues such as subjectivity and viewpoint all come into play, opinion classification is usually framed as a two-way classification into positive and negative sentiment, which can then be applied at different levels: phrases, sentences, documents and collections of documents. Most sentiment analysis algorithms fall into two types: sentiment-lexicon-based algorithms and machine learning-based algorithms. Sentiment-lexicon-based algorithms construct functions based on features provided by a sentiment lexicon, such as term positivity scores, to calculate the polarity of the tested review. Machine learning-based algorithms typically include support vector machines:44 these rely on training data labelled with sentiment values, which can take sentiment lexicon features into account to direct the support vector machines.45 It should be noted that such approaches have been used to track a wider range of moods beyond the usual binary approach to positive and negative sentiment (eg calm, alert, sure, vital, kind and happy).46
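To illustrate the lexicon-based family in its simplest form, the sketch below scores polarity with a tiny hand-written lexicon and a single negation rule; both are illustrative assumptions rather than a real lexical resource of the kind used in the cited work:

```python
LEXICON = {"great": 1.0, "love": 1.0, "happy": 0.8,
           "poor": -1.0, "hate": -1.0, "boring": -0.6}
NEGATIONS = {"not", "never", "no"}

def polarity(text: str) -> float:
    tokens = text.lower().split()
    score, flip = 0.0, 1.0
    for tok in tokens:
        if tok in NEGATIONS:
            flip = -1.0                 # flip the next sentiment-bearing word
        elif tok in LEXICON:
            score += flip * LEXICON[tok]
            flip = 1.0
    return score

print(polarity("love the new camera"))         # positive score
print(polarity("not happy with the battery"))  # negative score
```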
Beyond automated language analysis, several pieces of work have looked at pulling data in various forms (eg pictures) from social media networks to obtain insights into what draws a response from a user community. White,47 for example, examines travel images on Facebook surrounding tourism encounters, drawing conclusions on which particular shots proved more popular in order to inform travel photography.
Though this may touch upon a wide range of fields (sociolinguistics, psychology, etc), we focus here on the work most relevant to social networks.
Relevant research
Greetham et al.48 examine the spread of positive and negative affect in social networks, in an experiment involving 100 university students logging all interactions with peers and their general positive and negative affect. An analysis of network dynamics based on a stochastic actor-based model concludes that a relationship exists between better health, greater negative affect and a greater number of interactions.
Similarly, Berger et al.41 examine more generally the type of content that is shared in social networks and the psychological processes that act as a selection mechanism and shape the process of virality. For this purpose, they examine the sharing of 7,000 New York Times articles by email, relying on automated sentiment analysis and human coders to identify more complex feelings associated with particular articles (utility, anger, anxiety, etc). Beyond content characteristics, they also control for external factors such as the appearance of the article on the homepage, release timing, writing complexity and article length. The results indicate that articles that were surprising, practically useful or evoked an emotional response were more likely to be emailed. They also conclude, however, that articles that spent more time on the homepage were more likely to be shared, as were those written by more famous authors. In short, we may conclude that both affect and the mechanisms behind social transmission (the specific sharing mechanism and the features of the product being shared) have an impact on the virality of an item.
Doshi et al.49 use a combination of social network analysis and sentiment analysis to attempt to predict trends, focusing particularly on movie revenues. By comparing buzz, represented by blog betweenness and other social network analysis metrics, sentiment metrics obtained from discussions in forums, and box office performance data, they use multilinear and non-linear regression to attempt to predict final box office returns. Though the social networks are constructed from semantic networks, using links and references between bloggers, rather than the standard mutual links established in most social media platforms, the combined use of social network analysis and semantic analysis provides an interesting basis for predicting content that will prove popular. Sentiment analysis relies on a bag-of-words approach, dynamically adapted to different movie genres. Such an approach results in a correct prediction of movie success or flop 80 per cent of the time. The research provides an interesting insight into the popularity of online content, and the overall strategy allows us to infer a potential link between content and diffusion in social networks.
Alternatively, Bollen et al.46 rely on Twitter data to determine whether expressed mood correlates with stock market activity. Using mood tracking tools that measure moods across dimensions such as calm, alert, sure, vital, kind and happy, based on expressions in the Twitter message content (eg ‘I am feeling’) and a large data set (9,853,498 tweets posted by approximately 2.7 million users), they establish a relationship between public mood states and the Dow Jones Industrial Average. Using Granger causality analysis and a Self-organizing Fuzzy Neural Network, they achieve an accuracy of 87.6 per cent in predicting the daily up and down changes in the closing values of the Dow Jones Industrial Average.
A similar approach has been used to forecast audience increase on YouTube.50 On the basis of the notion that the standing and perception of a user can be quantified by measuring his/her in-degree on a given social web platform, Rowe posits that a relationship exists between the behaviour of the community and subscriber counts. Using a multiple linear regression approach, and a data set of 2,000 uploaded videos monitored on a regular basis, the author examines whether the current user in-degree (subscriptions to the user), out-degree (channels followed by the user), view count, post count and favourite count have a bearing on the increase in subscribers. The results indicate that the post view count has a significant correlation with in-degree, as does the number of times the content is favourited. An observation is made, however, that there appears to be a negative correlation between increases in the user's participation in the community (more content uploaded, more video views) and subscriber growth, implying that excessive participation may have an adverse effect on reputation. The author believes that additional conclusions may be drawn from this work through the linguistic analysis of comments and titles.
Identified methodologies and limitations
As organizations aim to use social media to increase customer engagement, the effectiveness of these approaches hinges on creating content that people want to share. While some have argued that there may be no formula for what becomes viral,51 the work discussed here demonstrates that both the nature of the content and the process by which the content is introduced (eg the position of links on a website) may have a high impact on the propagation process. It is hence crucial to understand the type of content that is likely to be shared by consumers, and whether we can infer from the initial response whether this content is likely to propagate further or reach new customers. Fan pages and wall posts on Facebook provide a wealth of textual content that can be analysed, both for sentiment and, as described previously, for the purpose of recommendations and user interest analysis. Exploiting such findings may serve to construct user engagement processes that are much more effective than current approaches.
In this section, we have explored a wide range of methodologies and approaches that allow us to measure the impact of the structure of the Facebook social graph on the diffusion of messages, and to identify key influencers. We have also explored techniques that aim to provide key insights into the demographic and interest profiles of users, examining the use of recommendation systems and linguistic analysis in social media environments. Though some work has explored the reasons why particular items propagate and the specific characteristics exhibited by the types of messages more likely to spread, we find that there is still little insight on this topic, particularly with respect to the specific nature of the Facebook features that may lead to messages being shared via the news feed or otherwise.
Monitoring Facebook metrics, described in the section ‘Background: Facebook’, allows us to examine the effects of the passive broadcasting of actions via the news feed, as well as the effects of the active involvement of users, and to empirically evaluate the impact of continuous customer engagement beyond initial customer acquisition. Observations of repeat interactions with an application or fan page may reveal triggered interest, or dislike, in a way that might have a further effect on the propagation of content in the networked environment. In this respect, we may begin to model the complete life cycle of viral content on Facebook, as will be discussed in the following section.