1 Introduction

A network is a system of elements that interact with or regulate each other. Networks can be represented mathematically as graphs. Typically, the term ‘graph’ refers either to a visual representation of the variation of one variable against another, to the mathematical concept of a set of vertices connected by edges, or to data structures based on that mathematical concept; the term ‘network’, by contrast, typically refers to interconnected systems of things (inanimate objects or people), or to specialised types of the mathematical concept of a graph. Throughout this work, we use the terms ‘graph’ and ‘network’ interchangeably.

Network representations are used widely in many areas of science to model various physical or abstract systems that form web-like interwoven structures, such as the World Wide Web (Web) (Albert et al. 1999), the Internet (Faloutsos et al. 1999), social networks (Barabási et al. 2002; Girvan and Newman 2002; Krapivsky et al. 2000; Moore and Newman 2000; Newman 2001; Wasserman and Faust 1994), metabolic networks (Feinberg 1980; Jeong et al. 2000; Lemke et al. 2004), food webs (Barrat et al. 2004; McCann et al. 1998; Polis 1998), neural networks (Latora and Marchiori 2003; Sporns 2002; Sporns et al. 2002), transportation networks (Berlow 1999; Guimera et al. 2005; Li and Cai 2004), disease and rumour spreading (Moore and Newman 2000; Pastor-Satorras and Vespignani 2001), or urban infrastructure (Scellato et al. 2005) (see Albert 2005; Albert and Barabási 2002; Boccaletti et al. 2006; Christensen and Albert 2007 for overviews). Key to understanding such systems are the mechanisms that determine the topology of their resulting networks. Once the topology of a network is determined, the network can be quantitatively described with measures that capture its most salient properties, by making an analogy between mathematical properties of network topology, such as diameter or density, and real properties of the system being modelled as a network, e.g. Web connectivity or clustering. This allows us to draw parallels between the structural mechanics of a physical (or abstract) system and the connectivity of a mathematical object. Based on such parallels, estimates can be made about the system. For instance, in bibliometrics, citation networks are used to estimate the scientific productivity of a field (based on its rate of growth, or its citation preferences; Belew 2005) or the importance of an author (based on how many other authors cite him/her, and how important these authors are themselves; Newman 2001). Similarly, in Web IR, graph representations of hyperlinked Webpages are used to compute how important a Webpage is, on the basis of how many other Webpages point to it, and how important those Webpages are (Page et al. 1998). In such graph theoretic representations of networks, the notion of an ‘important’ vertex being linked to another vertex is called recommendation.

This work models text as a network of associations that connect words (text graph). We apply this text graph representation to IR, by deriving graph-based term weights, and by drawing an analogy between topological properties of the graph and discourse properties of the text being modelled as a graph. Text graphs are a well explored area in linguistics (overview in Sect. 2.2). The underlying motivation behind text graphs is that they can represent the non-linear, non-hierarchical structure formation of language in a mathematically tractable formalism. This representation is powerful because it can integrate several aspects of language analysis (topological, statistical, grammatical, or other) seamlessly into the model. We focus on text graphs that model term co-occurrence and grammatical modification, and we analyse their topology to gain insights into discourse aspects of the text being modelled. We posit that term co-occurrence and grammatical modification reflect language organisation in a subtle manner that can be described in terms of a graph of word interactions. The underlying hypothesis is that in a cohesive text fragment, related words tend to form a network of connections that approximates the model humans build about a given context in the process of discourse understanding (Halliday and Hasan 1976). This position is well established in linguistics, supported for instance by Deese’s (1965) and Cramer’s (1968) earlier work on the relation between word association and the structured patterns of relations that exist among concepts, by Hobbs’ work on word relationships that ‘knit the discourse together’, and by Firth’s (1968b) well-known work about seeking and finding the meaning of words ‘in the company they keep’.

Our study of text graphs can be split into two parts:

  (a) how the text graph is built,

  (b) what computations are done on the text graph.

Regarding (a), we build a graph where vertices denote terms, and where edges denote co-occurrence and grammatical relations between terms. Regarding (b), we perform two different types of computations on this graph, aiming either to rank the vertices, or to measure properties of graph topology. The former (vertex ranking) computations allow us to rank each vertex based on properties of the whole graph. The latter (measuring properties of graph topology) allows us to enhance the previously computed term weights with information about topological properties of the graph (and hence of the text), which represent discourse properties of the text.

We apply these term weights and discourse properties to IR in a series of experiments involving building two different types of text graphs, computing four different graph-based term weights, and using these weights to rank documents against queries. We reason that our graph-based term weights do not necessarily need to be normalised by document length (unlike existing term weights) because they are already scaled by their graph-ranking computation. This is a departure from existing IR ranking functions, and we experimentally show that it can perform comparably to established, highly tuned ranking functions, such as BM25 (Robertson et al. 1995). In addition, we integrate into ranking graph-dependent properties, such as the average path length or clustering coefficient of the graph. These properties represent different aspects of the topology of the graph, and by extension of the document represented as a graph. Integrating such properties into ranking in practice allows us to consider issues such as discourse coherence, flow and density when retrieving documents with respect to queries. These combinations are evaluated on three different Text REtrieval Conference (TREC; Voorhees and Harman 2005) collections (with 350 queries) against a BM25 baseline. We measure retrieval performance using three different standard TREC evaluation measures, and we tune the parameters of all our ranking functions (both the baseline and the graph-based ones) separately for each evaluation measure. In addition, we carry out an extra ‘split-train’ parameter tuning study, which confirms the stability of our approach across different parameter settings.

There exist numerous uses of graph theory in IR (overviewed in Sect. 2.1). This work contributes an alternative approach, which allows us to model term co-occurrence and grammatical relations in retrieval as an integral part of the term weight. In addition, this work contributes a ranking function that contains no document length normalisation and that can perform comparably to a baseline that contains tuned document length normalisation. To our knowledge, this is novel. The final contribution of this work is the analogy drawn between graph topology properties and aspects of text discourse, which enables us to numerically approximate discourse aspects and successfully integrate them into retrieval. Even though this analogy between graph properties and discourse aspects is not novel in linguistics (see discussion in Sect. 2.2), its implementation in IR is novel, and we experimentally show that it is effective.

The remainder of this paper is organised as follows. Section 2 overviews related work on graph theory in IR (Sect. 2.1), and on graph representations of text (Sect. 2.2). Section 3 introduces some graph theory preliminaries and presents the properties of graph topology used in this work. Section 4 presents the two text graphs we build in this work, and the graph-based term weights that we compute from them. Section 5 presents IR ranking functions that use these graph-based term weights, firstly without normalisation (Sect. 5.1), and secondly enhanced with properties of graph topology (Sect. 5.2). Section 6 describes and discusses the experimental evaluation of these graph-based term weights in IR. Section 7 discusses issues pertaining to the implementation and efficiency of our approach. Section 8 summarises this article and suggests future research directions.

2 Related work

2.1 Graphs in information retrieval

Graph theoretic approaches to IR can be traced back to the early work of Minsky on semantic IR (Minsky 1969), which was followed by several variants of conceptual IR and knowledge-based IR. Numerous variants of graph formalisms have since been used in connectionist approaches to IR (e.g., frame networks, neural networks, spreading activation models, associative networks, conceptual graphs, ontologies; Doszkocs et al. 1990), and numerous approaches have been proposed to shift weights between vertices in the graphs (such as the Inference Network IR model (Turtle and Croft 1991), based on the formalism of Bayesian networks (Pearl 1988), or Logical Imaging formalisms (Crestani and van Rijsbergen 1998)). Such connectionist approaches provide a convenient knowledge representation for IR applications in which vertices typically represent IR objects such as keywords, documents, authors, and/or citations, and in which bidirectional links represent their weighted associations or relevance (approximated in terms of semantic or statistical similarity). The propagated learning and search properties of such networks provide the means for identifying relevant information items. One of the main attractions of these models is that, in contrast to more conventional information processing models, connectionist models are ‘self-processing’ in the sense that no external program operates on the network: the network literally ‘processes itself’, with ‘intelligent behaviour’ emerging from the local interactions that occur concurrently between the vertices through their connecting edges.

For instance, one of the earliest neural networks used to model information is the Hopfield net (Hopfield 1982; Hopfield and Tank 1986), in which information was stored in single-layered interconnected neurons (vertices) and weighted synapses (edges). Information was then retrieved based on the network’s parallel relaxation method: vertices were activated in parallel and were traversed until the network reached a stable state (convergence). Another early connectionist model explicitly adapted to IR was Belew’s AIR (1989), a three-layer neural network of authors, index terms, and documents, which used relevance feedback from users to change the representation of authors, index terms, and documents over time through an iterative learning process. The result was a representation of the consensual meaning of keywords and documents shared by some group of users.

Such connectionist networks have been found to fit well with conventional vector space and probabilistic retrieval models. For instance, Kwok’s early three-layer network of queries, index terms, and documents used a modified Hebbian learning rule to reformulate probabilistic IR (Kwok 1989). Similarly, Wilkinson and Hingston’s neural network representations of vector space retrieval used spreading activation through related terms to improve retrieval performance (Wilkinson and Hingston 1991). The above models represent IR applications in terms of their main components of documents, queries, index terms, authors, etc. Network models have also been used in other IR representations, for instance to model the semantic relations between documents as a self-organising Kohonen network (Lin et al. 1991), or to cluster documents (Macleod and Robertson 1991). In addition, similar connectionist approaches have also been used for various classification and optimisation tasks, starting from the early work of Huang and Lippmann (1987).

More recently, the appearance and rapid spread of the Web has caused a resurgence of graph theoretic representations in applications of Web search. Starting with the seminal work of Page et al. (1998) and Kleinberg (1999), the main idea is to draw direct analogies between hypertext connectivity on the Web and vertex connectivity in graphs. Page and Brin proposed the PageRank vertex ranking computation. PageRank uses random walks, which are a way of ranking the salience of a vertex by taking into account global information recursively computed from the entire graph, rather than relying only on local vertex-specific information. In the context of the Web, where the graph is built out of Webpages (nodes) and their hyperlinks (links), PageRank applies a ‘random Web surfer’ model, where the user jumps from one Webpage to another randomly. The aim is to estimate the probability of the user ending at a given Webpage. There are several alternatives and extensions of PageRank, for instance HITS (Kleinberg 1999), which applies the same idea, but distinguishes the nodes between ‘hubs’ and ‘authorities’, where a hub is a Webpage with many outgoing links to authorities, and an authority is a Webpage with many incoming links from hubs. More elaborate ranking algorithms have also been proposed that incorporate information about the node’s content into the ranking, for instance anchor text (Chakrabarti et al. 1998), or that involve computationally lighter processes (Lempel and Moran 2001). Such ranking algorithms are used for various tasks, such as Web page clustering (Bekkerman et al. 2007), or document classification (Zhou et al. 2004).

More recently, graph theoretic approaches have been used for other tasks within IR, for instance IR evaluation measures (Mizzaro and Robertson 2007), and re-ranking (Zhang et al. 2005). Furthermore, an increasingly popular recent application of graph theoretic approaches to IR is in the context of social or collaborative networks and recommender systems (Craswell and Szummer 2007; Kleinberg 2006; Konstas et al. 2009; Noh et al. 2009; Schenkel et al. 2008).

In the above approaches, the graph is usually built out of the main components of an IR process (e.g. documents and/or queries and/or users). Our work differs because we build the graph out of the individual terms contained in a document. Hence, the object of our representation is not the IR process as such.

2.2 Text as graph

Text can be represented as a graph in various ways, for instance in graphs where vertices denote syllables (Soares et al. 2005), terms (in their raw lexical form, or part-of-speech (POS) tagged, or as senses), or sentences, and where edges denote some meaningful relation between the vertices. This relation can be statistical (e.g. simple co-occurrence (Ferrer i Cancho and Solé 2001), or collocation (Bordag et al. 2003; Dorogovtsev and Mendes 2001; Ferrer i Cancho and Solé 2001)), syntactic (Ferrer i Cancho et al. 2007; Ferrer i Cancho et al. 2004; Widdows and Dorow 2002), semantic (Kozareva et al. 2008; Leicht et al. 2006; Motter et al. 2011; Sigman and Cecchi 2002; Steyvers and Tenenbaum 2005), phonological (Vitevitch and Rodríguez 2005), orthographic (Choudhury et al. 2007), discourse (Somasundaran et al. 2009), or cognitive (e.g. free-association relations observed in experiments involving humans; Sigman and Cecchi 2002; Steyvers and Tenenbaum 2005). There exist numerous variants of such text graphs. For instance, graphs representing semantic relations between terms can be further subcategorised into thesaurus graphs (Leicht et al. 2006; Motter et al. 2011; Sigman and Cecchi 2002; Steyvers and Tenenbaum 2005) and concept graphs (Sigman and Cecchi 2002; Steyvers and Tenenbaum 2005). In thesaurus graphs, vertices denote terms, and edges denote sense relations, e.g. synonymy or antonymy. In concept graphs, vertices denote concepts, and edges denote conceptual relations, e.g. hypernymy or hyponymy. Furthermore, in text graphs, the exact definition of the relations that build the graph can vary. For instance, Mihalcea and Tarau (2004) remove stopwords, and Hoey (1991) links sentences that share at least two lexically cohesive words. Moreover, edge relations can further combine two or more different statistical, linguistic or other criteria, for instance in syntactic-semantic association graphs (Nastase et al. 2006; Widdows and Dorow 2002). Edge relations can also be further refined, for instance in co-occurrence graphs which define co-occurrence either within a fixed window (Ferrer i Cancho and Solé 2001; Masucci and Rodgers 2006; Milo et al. 2004), or within the same sentence (Antiqueira et al. 2009; Caldeira et al. 2005). Optionally, meaningless edge-relations can be filtered out, under any statistical or linguistic interpretation of this (Pado and Lapata 2007). The vertices themselves can be optionally weighted, in line with some statistical or linguistic criterion (e.g. term frequency, or rank in some semantic hierarchy). Such text graphs have been built for several languages, e.g. German, Czech and Romanian (Ferrer i Cancho et al. 2004), Spanish (Vitevitch and Rodríguez 2005), Hindi and Bengali (Choudhury et al. 2007), Japanese (Joyce and Miyake 2008), and even undeciphered Indus script (Sinha et al. 2009).

The applications of such text graphs are numerous. For instance, in the graph representation of the undeciphered Indus script, graph structure is used to detect patterns indicative of syntactic structure manifested as dense clusters of highly frequent vertices (Sinha et al. 2009). Simply speaking, this means being able to detect patterns of syntax in an undeciphered language. Another example can be found in collocation graphs. In such graphs, edges model an idea central to collocation analysis (Sinclair 1991), namely that collocations are manifestations of lexical semantic affinities beyond grammatical restrictions (Halliday and Hasan 1976). Analysing such graphs allows us to discover semantically related words based on functions of their co-occurrences (e.g. similarity). A further example is the case of phonological and free-association graphs, which are used to explain human perceptual and cognitive processes in terms of the organisation of the human lexicon (Vitevitch and Rodríguez 2005). The existence of many local clusters in those graphs is seen as a necessary condition of effective associations, while the existence of short paths is linked to fast information search in the brain. Such findings are used to improve navigation methods of intelligent systems (Sigman and Cecchi 2002; Steyvers and Tenenbaum 2005), and also in medicine, where it has been suggested that the effects of disconnecting the most connected vertices of cognitive association graphs can be identified in some language disorders (Motter et al. 2011), and that specific topological properties of these graphs can be quantitatively associated with anomalous structure and organisation of the brain (namely decreased synaptic density) (Ferrer-i-Cancho 2005). Another example of text graph applications can be found in more formal approaches to linguistics, and specifically in transitive reasoning, which uses graph representations of language for logical reasoning. For instance, the Piagetian idea contends that transitive inference is logical reasoning in which relationships between adjacent terms (represented as vertices) figure as premises (Reynal and Brainerd 2005). Yet another example can be found in orthographic association graphs (Choudhury et al. 2007), which study graph topology with the aim of identifying aspects of spelling difficulty. In this case, topological properties of the graph are interpreted as difficulties related to spell-checking. Specifically, the average degree of the graph is seen as proportional to the probability of making a spelling mistake, and the average clustering coefficient of the graph is related to the hardness of correcting a spelling error.

More popular applications of text graphs include building and studying dictionary graphs (of semantically-related senses), which are then used for automatic synonym extraction (Blondel et al. 2004; Ho and Fairon 2004; Muller et al. 2006), or word sense disambiguation (Agirre and Soroa 2009; Gaume 2008; Kozima 1993; Mihalcea and Tarau 2004; Véronis and Ide 1990). In such graphs, topological properties are interpreted as indicators of dictionary quality or consistency (Sigman and Cecchi 2002). The popularity of this type of graphs has grown significantly, following the appearance of resources such as WordNet or Wikipedia, which are themselves network-based and hence naturally amenable to graph modelling (see Minkov and Cohen 2008; Pedersen et al. 2004; Widdows and Dorow 2002 for text graph applications using these resources). Another popular and more recent application is opinion graphs, representing opinions or sentiments linked by lexical similarities (Takamura et al. 2007), morphosyntactic similarities (Popescu and Etzioni 2005), term weights like TF-IDF (Goldberg and Zhu 2006), or discourse relations (Somasundaran et al. 2009). Finally, in more mainstream applications of linguistics, text graphs are commonly used to observe the evolution rate and patterns of language (Dorogovtsev and Mendes 2001), while in applications of automatic language processing, text graphs are commonly used to estimate text quality (Antiqueira et al. 2007).

The type of application of text graphs used in this work consists in ranking the graph vertices using random walk computations, like PageRank. Two well-known implementations of random walks on text graphs are TextRank (Mihalcea and Tarau 2004) and LexRank (Erkan and Radev 2004), variants and extensions of which have been applied to keyword detection, word sense disambiguation, text classification (Hassan and Banea 2006), summarisation (extractive, query-biased (Esuli and Sebastiani 2007), or ontology-based (Plaza et al. 2008)), novelty detection (Gamon 2006), lexical relatedness (Hughes and Ramage 2007), and semantic similarity estimation (Ramage et al. 2009). To our knowledge, the only application of random-walk term weights to IR has been our poster study (Blanco and Lioma 2007), which we extend in this work. Specifically, this work extends Blanco and Lioma (2007) in three ways (also discussed in Sect. 4). Firstly, in Blanco and Lioma (2007), we derived term weights from graphs that modelled solely term co-occurrence. In this work, we compute term weights from graphs that model not only term co-occurrence, but also grammatical modification. We do so in a theoretically principled way, in accordance with Jespersen’s Rank Theory (Jespersen 1929), a well-known grammatical theory. Secondly, in Blanco and Lioma (2007), we used graph-based term weights for retrieval by plugging them into the ranking function without closely considering document length normalisation (we simply used pivoted document length normalisation at the time). In this work, we look at the issue of document length normalisation very closely, by studying the use of our graph-based term weights in ranking functions without normalisation. Our motivation is that our graph-based term weights do not necessarily need to be normalised by document length because they are already scaled by their graph-ranking computation. Our here-used ranking functions simply combine graph-based term weights with inverse document frequency (idf) (Sparck Jones 1972), a well-known statistic of term specificity, and are shown to be comparable to BM25 (Robertson et al. 1995), an established and robust retrieval model that includes an explicit parameterised document length normalisation component. Thirdly, in Blanco and Lioma (2007) we did not consider graph topology at all. In this work, we study topological properties of the text graphs, which we consider analogous to discourse aspects of the text, and we integrate these properties into the ranking process to enhance retrieval.

3 Graph theory preliminaries

3.1 Networks as graphs

Graphs are mathematical representations of networks. The node and link of a network are referred to as vertex (V) and edge (E) respectively in graph theory. Formally, an undirected graph G is a pair \(G=(\mathcal{V},\mathcal{E})\), where \(\mathcal{V}\) is the set of vertices and \(\mathcal{E}\) is the set of edges, such that \(\mathcal{E} \subseteq \mathcal{V}^2\) (see Fig. 1). Edges can be either directed or undirected, as necessitated by the type of interaction they represent. For instance, the edges of gene-regulatory networks and the Internet are directed, since they depict relationships for which the source and target of the interaction are known; conversely, protein-protein interaction networks and social networks are typically undirected, since these relationships tend to be more mutual (Christensen and Albert 2007). A directed graph is a pair \((\mathcal{V},\mathcal{E})\) whose edges are ordered pairs of vertices, pointing toward and away from them. For a vertex \(v_i \in \mathcal{V}\), \(In(v_i)\) is the set of vertices that point to it and \(Out(v_i)\) is the set of vertices that \(v_i\) points to, such that:

$$ In(v_i) = \{ v_j \in {\mathcal{V}} \mid (v_j, v_i) \in {\mathcal{E}} \} $$
(1)
$$ Out(v_i) = \{ v_j \in {\mathcal{V}} \mid (v_i, v_j) \in {\mathcal{E}} \} $$
(2)
$$ {\mathcal{V}}(v_i) = In(v_i) \cup Out(v_i) $$
(3)

The order of a graph G is the number of its vertices.
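To make the notation concrete, the following is a minimal Python sketch (ours, purely illustrative; all names are our own) of a directed graph stored as a set of edges, with the sets of Equations (1)–(3):

```python
# Minimal sketch (illustrative, not part of the original formalism): a
# directed graph as a set of (source, target) edges.

class DiGraph:
    def __init__(self, edges):
        self.edges = set(edges)                         # E: (source, target) pairs
        self.vertices = {v for e in self.edges for v in e}

    def in_set(self, v):
        # Equation (1): In(v) = { u in V | (u, v) in E }
        return {u for (u, w) in self.edges if w == v}

    def out_set(self, v):
        # Equation (2): Out(v) = { w in V | (v, w) in E }
        return {w for (u, w) in self.edges if u == v}

    def neighbours(self, v):
        # Equation (3): V(v) = In(v) union Out(v)
        return self.in_set(v) | self.out_set(v)

g = DiGraph([("a", "b"), ("b", "c"), ("a", "c")])
assert g.in_set("c") == {"a", "b"} and g.out_set("a") == {"b", "c"}
```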

Fig. 1

Left: an undirected graph (\(G_1\)) with eight vertices and seven undirected edges. Right: a directed graph (\(G_2\)) with eight vertices and eleven directed edges

3.2 Graph topology properties

Early work on random graphs (Erdős and Rényi 1959, 1960, 1961), i.e. on networks for which the connections among nodes have been randomly chosen, pioneered the basic measures of graph topology that would later be extended to the analysis of non-random networks. It was soon established that networks with similar functions have similar graph-theoretical properties (see Albert 2005; Albert and Barabási 2002; Boccaletti et al. 2006; Dorogovtsev and Mendes 2002; Newman 2003 for reviews). Three of the main properties of a graph’s topology are its degree distribution, average path length and clustering coefficient.

3.2.1 Degree distribution

The degree δ(v_i) of a vertex v_i is the number of edges adjacent to v_i. If the directionality of interaction is important (in directed graphs), the degree of a vertex can be broken into an indegree and an outdegree, quantifying the number of incoming and outgoing edges adjacent to the vertex.

The degree of a specific vertex is a local topological measure, and we usually synthesise this local information into a global description of the graph by reporting the degrees of all vertices in the graph in terms of a degree distribution, P(k), which gives the probability that a randomly-selected vertex will have degree k. The degree distribution is obtained by first counting the number of vertices with k = 1, 2, 3, … edges, and then dividing this number by the total number of vertices in the graph (Christensen and Albert 2007). Often the degree distribution is approximated as the average degree of a graph, computed as (Mehler 2007):

$$ \delta(G) = 2\frac{\left | {\mathcal{E}}(G) \right |}{\left | {\mathcal{V}}(G) \right |} $$
(4)

where δ(G) denotes the average degree of graph \(G\), \(|\mathcal{E}(G)|\) denotes the cardinality of edges in G, and \(|\mathcal{V}(G)|\) denotes the cardinality of vertices in G.
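Reusing the DiGraph sketch from Sect. 3.1, Equation (4) can be computed directly (our illustrative transcription, not code from any cited source):

```python
def average_degree(g):
    # Equation (4): delta(G) = 2 |E(G)| / |V(G)|
    return 2 * len(g.edges) / len(g.vertices)
```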

Extensive work on graph theory in the last decade (reviewed in Albert 2005; Albert and Barabási 2002; Boccaletti et al. 2006; Dorogovtsev and Mendes 2002; Newman 2003) has demonstrated that graphs of similar types of systems tend to have similar degree distributions, and that the vast majority of them have degree distributions that are scale-free (reviewed in Albert and Barabási 2002). The scale-free form of the degree distribution indicates that there is a high diversity of vertex degrees and no typical vertex in the graph that could be used to characterise the rest of the vertices (Christensen and Albert 2007). Practically this means that the degree distribution gives valuable insight into the heterogeneity of vertex interactivity levels within a graph. For instance, in directed graphs, information regarding local organisational schemes can be gleaned by identifying vertices that have only incoming or outgoing edges. These sinks and sources, respectively, are very likely to have specialised functions. For example, if the graph describes a system of information flow, such as a signal transduction network within a cell, the sources and sinks of the graph will represent the initial and terminal points of the flow. In this case, the injection points for chemical signals will be sources, and the effectors at which the chemical signals terminate will be sinks (Ma’ayan et al. 2004). Another example is the case of vehicular traffic networks, where sources and sinks take the form of on-ramps and off-ramps in highway systems (Knospe et al. 2002). In these cases, looking at the degree of the respective graph offers insight into how heterogeneous the objects modelled as vertices are, and to what extent some of them (if any) can be considered to be discriminative with respect to the rest.

3.2.2 Average path length

Given an edge adjacent to a vertex, if one traces a path along consecutive distinct edges, possibly only a fraction of the vertices in the graph will be accessible from the starting vertex (Bollobás 1979, 1985). This is often the case in directed graphs, since whether two edges are consecutive depends on their directions. If a path does exist between every pair of vertices in a graph, the graph is said to be connected (or strongly connected if the graph is directed). The average number of edges in the shortest path between any two vertices in a graph is called the average path length. Based on this, the average path length of the graph l(G) can be estimated as the ratio of the natural logarithm of its number of vertices to the natural logarithm of its average degree (Albert and Barabási 2001):

$$ l(G) \approx \frac{\ln(|{\mathcal{V}}(G)|)} { \ln(\delta(G))} $$
(5)

where \(|\mathcal{V}(G)|\) is the cardinality of vertices in G and δ(G) is the average degree of G.
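As an illustrative sketch, Equation (5) follows directly from the average degree computed above (the sketch assumes the average degree is greater than 1, so that the denominator is positive):

```python
import math

def average_path_length(g):
    # Equation (5): l(G) ~ ln |V(G)| / ln delta(G)
    return math.log(len(g.vertices)) / math.log(average_degree(g))
```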

For most real networks, the average path length is seen to scale with the natural logarithm of the number of vertices in the graph. In addition, for most real networks, average path length remains small, even if the networks become very large (small world property; Watts and Strogatz 1998). Average path length (or the closely related average of inverse distances) can be indicative of the graph’s global efficiency, in the sense that it can indicate how long it takes to traverse a graph (Latora and Marchiori 2001, 2003).

3.2.3 Clustering coefficient

The clustering coefficient of a vertex measures the proportion of its neighbours that are themselves neighbours. By averaging the clustering coefficients of all vertices in a graph we can obtain an average clustering coefficient of the graph, which is indicative of the strength of connectivity within the graph. Most real networks, including, for example, protein-protein interaction networks, metabolic networks (Wagner and Fell 2001), or collaboration networks in academia and the entertainment industry (Christensen and Albert 2007), exhibit large average clustering coefficients, indicating a high level of redundancy and cohesiveness.

Mathematically, the local clustering coefficient of \(v_i\) is given by

$$ c(v_i) = \frac{2{\mathcal{E}}(v_i)}{\delta(v_i) (\delta(v_i) - 1)} $$
(6)

where \(\mathcal{E}(v_i)\) is the number of edges connecting the immediate neighbours of node \(v_i\), and δ(v_i) is the degree of node \(v_i\). Alternatively, the average clustering coefficient of a graph can be approximated as the average proportion of the neighbours of vertices that are themselves neighbours in a graph (Albert and Barabási 2001):

$$ c(G) \approx \frac{\delta(G)}{|{\mathcal{V}}(G)|} $$
(7)

where δ(G) denotes the average degree of G, and \(|\mathcal{V}(G)|\) denotes the cardinality of vertices in G.
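As a sketch of Equations (6) and (7), again reusing the DiGraph helper from Sect. 3.1 (illustrative only; \(\mathcal{E}(v_i)\) is counted here over edge tuples whose endpoints are both neighbours of \(v_i\)):

```python
def local_clustering(g, v):
    # Equation (6): c(v) = 2 E(v) / (delta(v) * (delta(v) - 1)), where E(v)
    # counts edges linking the immediate neighbours of v.
    nbrs = g.neighbours(v)
    k = len(nbrs)
    if k < 2:
        return 0.0                                      # undefined; 0 by convention
    e_v = sum(1 for (u, w) in g.edges if u in nbrs and w in nbrs)
    return 2 * e_v / (k * (k - 1))

def average_clustering_approx(g):
    # Equation (7): c(G) ~ delta(G) / |V(G)|
    return average_degree(g) / len(g.vertices)
```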

The average clustering coefficient can be used to identify connected graph partitions. By identifying such connected partitions within a network, one may be able to gain a sense of how the network is functionally organised at an intermediate level of complexity. For example, often we come across directed graphs having one or several strongly-connected components. These strongly-connected components are subgraphs whose vertex pairs are connected in both directions. Each strongly-connected component is associated with an in-component (vertices that can reach the strongly-connected component, but that cannot be reached from it) and an out-component (the converse). It has been suggested that the vertices of each of these components share a component-specific task within a given network (Christensen and Albert 2007). For example, in biological networks of cell signal transduction, the vertices of the in-component tend to be involved in ligand-receptor binding, while the vertices of the out-component are responsible for the transcription of target genes and for phenotypic changes (Ma’ayan et al. 2005).

Finally, note that properties of graph topology are connected to each other. For instance, a high average clustering coefficient often indicates a high abundance of triangles (three-vertex cliques) in the graph, which causes short paths to emerge. This has been observed in text graphs across languages, i.e. for German, Czech and Romanian (Ferrer i Cancho et al. 2004).

4 Graph based term weights

This section presents the two different text graphs we build, and the term weights we compute from them. Given a collection of documents, we build text graphs separately for each document. Hence, for the remainder of this article, text graph denotes document-based graphs, not collection-based graphs.

4.1 Co-occurrence text graph (undirected)

We represent text as a graph, where vertices correspond to terms, and edges correspond to co-occurrence between the terms. Specifically, edges are drawn between vertices if the vertices co-occur within a ‘window’ of at most N terms. The underlying assumption is that all words in the text have some relationship to all other words in the text, modulo window size, outside of which the relationship is not taken into consideration. This approach is statistical, because it links all co-occurring terms, without considering their meaning or function in text. This graph is undirected, because the edges denote only that terms co-occur, without any further distinction regarding their role. We represent each word in the text as a vertex in the graph. We do not filter out any words, such as stopwords, nor do we focus solely on content-bearing words, such as nouns, when building the graph (unlike Mihalcea and Tarau 2004). An example of such a co-occurrence graph is shown in Fig. 2, which uses a very short text borrowed from Mihalcea and Tarau (2004), with a window of N = 4 terms; a sketch of the construction follows the figure. The main topological properties of average path length, average degree, and clustering coefficient are also shown.

Fig. 2

Co-occurrence graph (undirected) of a short sample text borrowed from Mihalcea and Tarau (2004). Vertices denote terms, and edges denote co-occurrence within a window of 4 terms. Graph topology properties: average degree (δ), average path length (l) and clustering coefficient (c)
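The construction itself is simple; the following is a hedged sketch of our description above (the tokenisation and toy input are illustrative, not the exact preprocessing used in our experiments):

```python
# Hedged sketch of the undirected co-occurrence graph of Sect. 4.1: every
# pair of distinct terms co-occurring within a window of N terms is joined
# by an undirected edge. No stopword filtering is applied.

def cooccurrence_graph(tokens, n=4):
    vertices = set(tokens)
    edges = set()
    for i, t in enumerate(tokens):
        for u in tokens[i + 1 : i + n]:                 # within a window of N terms
            if u != t:
                edges.add(frozenset((t, u)))            # undirected edge
    return vertices, edges

tokens = "compatibility of systems of linear constraints".split()
vertices, edges = cooccurrence_graph(tokens, n=4)
```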

We derive vertex weights (term weights) from this graph in two different ways:

  • Using a standard graph ranking approach that considers global information from the whole graph when computing term weights (Sect. 4.1.1).

  • Using a link-based approach that considers solely local vertex-specific information when computing term weights (Sect. 4.1.2).

4.1.1 Graph ranking weight: TextRank

Given a text graph, we set the initial score of each vertex to 1, and run the following ranking algorithm on the graph for several iterations (Mihalcea and Tarau 2004; Page et al. 1998):

$$ S(v_i) = (1-\phi) + \phi \sum_{j \in {\mathcal{V}}(v_i)} \frac{S(v_j)}{|{\mathcal{V}}(v_j)|} \;\;\; (0 \le \phi \le 1) $$
(8)

S(v_i) and S(v_j) denote the scores of vertices v_i and v_j respectively, \(\mathcal{V}(v_i)\) and \(\mathcal{V}(v_j)\) denote the sets of vertices connecting with v_i and v_j respectively, and ϕ is a damping factor that integrates into the computation the probability of jumping from a given vertex to another random vertex in the graph. We run (8) iteratively for a maximum number of iterations (100 in this work). Alternatively, the iteration can run until convergence below a threshold is achieved (Mihalcea and Tarau 2004). When (8) converges, a score is associated with each vertex, which represents the importance of the vertex within the graph. This score includes the concept of recommendation: the score of a vertex that is recommended by another highly scoring vertex will be boosted even more. The reasoning behind using such vertex scores as term weights is that the higher the number of different words that a given word co-occurs with, and the higher the weight of these words (the more salient they are), the higher the weight of this word.

Equation (8) implements the ‘text surfing model’ of Mihalcea and Tarau (2004), which itself is a variation of the original PageRank (Page et al. 1998). The sole difference between (8) and the formula proposed in Mihalcea and Tarau (2004) (and applied to IR in Blanco and Lioma 2007) is that the latter distinguishes between inlinking and outlinking vertices, whereas (8) considers the set of all connected vertices without any sense of direction. We refer to this weight as TextRank, which is the original name used in Mihalcea and Tarau (2004).
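A direct transcription of (8) as a sketch (the damping value 0.85 below is the one conventionally used in the literature; the neighbours mapping is assumed symmetric, since the graph is undirected):

```python
def textrank(neighbours, phi=0.85, iterations=100):
    # Equation (8): neighbours maps each vertex to its set V(v) of connected
    # vertices (no direction). All scores are initialised to 1.
    score = {v: 1.0 for v in neighbours}
    for _ in range(iterations):
        score = {
            v: (1 - phi)
            + phi * sum(score[u] / len(neighbours[u]) for u in neighbours[v])
            for v in neighbours
        }
    return score

weights = textrank({"a": {"b"}, "b": {"a", "c"}, "c": {"b"}})
```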

4.1.2 Link-based weight: TextLink

Given a graph of words, we derive a weight for each vertex directly from the number of its edges:

$$ S(v_i) = \delta(v_i) $$
(9)

where δ(v_i) is the degree of vertex v_i, as defined in Sect. 3.2.1. The reasoning behind using such vertex scores as term weights is that the higher the number of different words that a given word co-occurs with, the higher the weight of this word. We refer to this weight as TextLink.

This weight is a simple approximation, which for this graph is closely linked to term frequency (because in this graph edges are drawn if terms co-occur). This is not necessarily the case for other graphs, however, where edges are defined not only on co-occurrence grounds, but also on the basis of grammatical modification, like the graph discussed in the next section.

4.2 Co-occurrence text graph with grammatical constraints (directed)

In our second graph, terms are related according to their co-occurrence as before, but also according to their grammatical modification. Grammatical modification establishes a scheme of subordination between words or, otherwise stated, different hierarchical levels. In linguistics these hierarchical levels are called ranks, and the special terms primary, secondary and tertiary refer to the first three ranks, which are typically considered to be semantically more important than the others (Jespersen 1929). Schemes of subordination or different hierarchical levels mean that one word is defined (or modified) by another word, which in its turn may be defined (or modified) by a third word, etc., and that only the word occupying the highest level does not depend on, or does not require the presence of, another word. An example follows.

Example 1

Some (4) furiously (3) speeding (2) cars (1).

In the above example, the numerals 4, 3, 2, 1 denote the quaternary, tertiary, secondary and primary rank of the words respectively. The primary is typical of nouns, the secondary is typical of adjectives and verbs, the tertiary is typical of adverbs, and the quaternary is typical of the remaining POS, although there can be exceptions. These ranks can be generalised to the case of sentences; for example, the primary rank, noun, can be taken to be the subject of a sentence. Under this light, syntactic relations between words imply hierarchical jumps between words. The point to remember is that a word can modify other words of the same or lower rank only. This is one of the principles of Jespersen’s Rank Theory (Jespersen 1929), which practically means that a noun can only modify another noun, whereas a verb can modify a noun, verb, or adjective, but not an adverb. It is exactly this modification that we use to build a text graph. This modification has been used in IR applications, for instance in Lioma and van Rijsbergen (2008) and Lioma and Blanco (2009).

Specifically, we build the graph as described in Sect. 4.1, with the difference that we now define vertices pointing to or being pointed to by other vertices (outlinking and inlinking respectively). A prerequisite for building this graph is to have the text previously grammatically annotated, which we do automatically using a POS tagger.

The resulting graph is directed, where the direction of the edges represents the grammatical modification between terms. For example, given two grammatically dependent terms t_i and t_j, where t_i modifies t_j, we consider the edge direction to be from t_i to t_j, hence t_i points to t_j.

Figure 3 graphically displays an example of such a directed graph for the same short text used in Fig. 2, using the same window of co-occurrence (N = 4 terms). Note how the main topological properties of average degree, average path length, and clustering coefficient of this graph differ from the ones computed for the undirected co-occurrence graph in Fig. 2: even though both graphs represent the exact same text, the directed graph has a lower average degree, a higher average path length and a lower clustering coefficient. This is due to the restrictions imposed when drawing edges between vertices, with the practical result of rendering the graph in Fig. 3 less densely linked, longer to traverse, and less clustered.

Fig. 3

Co-occurrence graph with grammatical constraints (directed): vertices denote POS-tagged terms, the POS of which is ranked from 1 (most salient term) up to 4 (least salient term). Edges denote co-occurrence and grammatical modification between terms. The main topological properties of average degree (δ), average path length (l) and clustering coefficient (c) are lower in value than their equivalent properties computed for Fig. 2, indicating that this graph is less dense, longer to traverse, and less clustered than the undirected graph in Fig. 2
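Before turning to the weights, the following hedged sketch gives one possible reading of this construction; the POS tag set, the POS-to-rank mapping, and the handling of equal ranks are illustrative assumptions, not the exact rules of our implementation:

```python
# Hedged sketch of the directed construction of Sect. 4.2, with an
# illustrative rank mapping in the spirit of Jespersen's ranks
# (noun -> 1, adjective/verb -> 2, adverb -> 3, all remaining POS -> 4).

RANK = {"NOUN": 1, "ADJ": 2, "VERB": 2, "ADV": 3}       # hypothetical tag set

def pos_graph(tagged, n=4):
    # tagged: list of (term, pos) pairs produced by a POS tagger
    edges = set()
    for i, (t, pos_t) in enumerate(tagged):
        for (u, pos_u) in tagged[i + 1 : i + n]:        # co-occurrence window
            if t == u:
                continue
            rt, ru = RANK.get(pos_t, 4), RANK.get(pos_u, 4)
            if ru <= rt:
                edges.add((t, u))                       # t can modify u: t points to u
            else:
                edges.add((u, t))                       # u can modify t: u points to t
    return edges
```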

Having constructed such a graph, we compute the score of a vertex (term weight) in two different ways, similarly to the case of the undirected graph, namely by using a graph ranking weight (Sect. 4.2.1) and a link-based weight (Sect. 4.2.2).

4.2.1 Graph ranking weight: PosRank

Since our graph is directed, we compute the original PageRank (Page et al. 1998):

$$ S(v_i) = (1-\phi) + \phi \sum_{j \in In(v_i)} \frac{S(v_j)}{|Out(v_j)|} \;\;\; (0 \le \phi \le 1) $$
(10)

where In(v_i) denotes the set of vertices that modify v_i (in (8) this was simply \(\mathcal{V}(v_i)\), i.e. all vertices linking to v_i were considered), and Out(v_j) denotes the set of vertices that v_j modifies (in (8) this was simply \(\mathcal{V}(v_j)\), i.e. all vertices linking to v_j were considered). All other notation is as defined for Equation (8). The reasoning behind using such vertex scores as term weights is that the higher the number of different words that a given word co-occurs with and is grammatically dependent on, and the higher the weight of these words (the more salient they are), the higher the weight of this word. We refer to this weight as PosRank, to stress the fact that it is computed from a graph where words are linked according to their grammatical (or POS) dependence.
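A sketch of (10), analogous to the TextRank sketch in Sect. 4.1.1 but restricted to the In and Out sets of the directed graph:

```python
def posrank(in_set, out_set, phi=0.85, iterations=100):
    # Equation (10): original PageRank over the directed modification graph.
    # If u is in in_set[v], then v is in out_set[u], so len(out_set[u]) >= 1
    # by construction and no division by zero occurs.
    score = {v: 1.0 for v in in_set}
    for _ in range(iterations):
        score = {
            v: (1 - phi)
            + phi * sum(score[u] / len(out_set[u]) for u in in_set[v])
            for v in in_set
        }
    return score
```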

4.2.2 Link-based weight: PosLink

Similarly to the case of the undirected graph, we approximate a vertex weight directly from the vertex degree. However, in order to take into account the aspect of grammatical modification, we consider solely the vertex indegree |In(v_i)|:

$$ S(v_i) = |In(v_i)| $$
(11)

In our case, the indegree corresponds to how many words modify a word. The more salient a word (e.g., a noun), the higher its indegree. The reasoning behind using such vertex scores as term weights is that the higher the number of different words that a given word co-occurs with and is grammatically dependent on, the higher the weight of this word. We refer to this weight as PosLink.
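Both link-based weights reduce to set cardinalities; as an illustrative sketch, reusing the DiGraph helper of Sect. 3.1:

```python
def textlink(g, v):
    # Equation (9): degree of the vertex in the undirected graph
    return len(g.neighbours(v))

def poslink(g, v):
    # Equation (11): indegree, i.e. how many words modify this word
    return len(g.in_set(v))
```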

To recapitulate, in Sect. 4 we have presented two ways of representing text as a graph of words, one undirected and one directed. In each case, we have suggested two different ways of assigning scores to vertices (term weights), one based on graph ranking, and one based on links alone. This results in four graph-based term weights, which are summarised in Fig. 4.

Fig. 4

Overview of our four graph-based term weights, with their respective Equation numbers and underlying intuitions

5 Graph-based term weights for ranking

We use our four graph-based term weights for retrieval by integrating them into the ranking function that ranks documents with respect to queries. This is a standard use of term weights in IR, and there exist several different ways of doing so, e.g. (Ponte and Croft 1998; Robertson et al. 1995). Most, if not all, of these approaches include an idf-like component, which represents the inverse document frequency of a term, defined as the ratio of the total number of documents in a collection over the number of documents that contain the term (Sparck Jones 1972). In addition, most, if not all, of these approaches also include some normalisation component, which re-adjusts the final ranking, typically according to document length, in order to avoid biasing the ranking in favour of longer documents. This re-adjustment, better known as document length normalisation, is an important aspect of ranking, especially for realistic collections that contain documents of varying lengths: without document length normalisation, longer documents tend to score higher since they contain more words and word repetitions (Singhal 2001; Singhal et al. 1996).

More specifically, a typical ranking function estimates the relevance R(d, q) between a document d and a query q as:

$$ R(d,q) \approx \sum_{t \in q} w(t,q) \cdot w(t,d) $$
(12)

where w(t, q) is the weight of term t in q, usually computed directly from the frequency of the query terms. Since most queries in standard ad-hoc IR are short and keyword-based, the frequency of each term in the query is 1, hence w(t, q) = 1. The second component of (12), w(t, d), is the weight of term t in document d, and, typically, this is the weight that primarily influences the final ranking. There exist various different ways of computing w(t, d), all of which use the frequency of a term in a document (tf) one way or another. For example, the BM25 probabilistic ranking function computes w(t, d) as:

$$ w(t,d) = w^{(1)} \cdot \frac{(k_3+1) \cdot qtf}{k_3 + qtf} \cdot tfn $$
(13)

where \(w^{(1)}\) is an idf-like component, \(k_3\) is a parameter, qtf is the query term frequency, and tfn is the normalised term frequency in a document (see Robertson et al. 1995 for their definitions). The normalised term frequency in the document is adjusted according to document length as follows:

$$ tfn = \frac{(k_1 + 1) \cdot tf}{tf + k_1 \cdot \left(1-b + b \cdot \frac{l}{avl}\right)} $$
(14)

where \(k_1\) and b are parameters, and l and avl are the actual and average document length respectively.
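As an illustrative sketch of (13) and (14) (the idf-like variant and default parameter values below follow common practice and are not necessarily those of Robertson et al. 1995):

```python
import math

def bm25_term_weight(tf, qtf, df, num_docs, dl, avdl, k1=1.2, b=0.75, k3=8.0):
    # w1 below is one common idf-like variant; see Robertson et al. (1995)
    # for the exact definitions.
    w1 = math.log((num_docs - df + 0.5) / (df + 0.5))
    tfn = ((k1 + 1) * tf) / (tf + k1 * (1 - b + b * dl / avdl))   # Equation (14)
    return w1 * ((k3 + 1) * qtf) / (k3 + qtf) * tfn               # Equation (13)
```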

Following this practice, we propose two different uses of our graph-based term weights for ranking:

  • With idf, but without document length normalisation (Sect. 5.1)

  • With idf, without document length normalisation, but enhanced with properties of graph topology, which represent discourse properties of the text modelled as a graph (Sect. 5.2)

5.1 Raw model (no document length normalisation)

We start off with a general ranking function for estimating the relevance R(d, q) between a document d and a query q, like the one shown in (12). Our goal is to modify the weight of a document with respect to a term, w(t, d), by considering our graph-based term weights. We estimate w(t, d) as:

$$ w(t,d) = \log{idf} \cdot \log{tw} $$
(15)

where tw denotes any one of our four proposed graph-based term weights (namely TextRank (Equation 8), TextLink (Equation 9), PosRank (Equation 10), PosLink (Equation 11)).

Equation (15) is very similar to the classical TF-IDF formula (Robertson and Sparck Jones 1976), with the sole difference that we replace term frequency (tf) with a graph-based term weight. This replacement is not without grounds: we have found our graph-based term weights to be overall correlated with tf; for instance, for the top most relevant document retrieved in experiments with all queries in one of our TREC collections (Disk4&5), Pearson’s correlation coefficient is 0.953 between TextRank and tf, and 0.714 between PosRank and tf.

Equation (15) is reminiscent of the ranking formula we used in our earlier poster work (Blanco and Lioma 2007); however, there is an important difference between the two. Whereas in Blanco and Lioma (2007) we applied pivoted document length normalisation to this formula, here we do not apply any sort of normalisation or re-adjustment. We use the ‘raw’ idf and tw scores exactly as measured. Doing so with conventional TF-IDF (i.e. applying it without document length normalisation) is detrimental to retrieval. However, in Sect. 6.2 we show that our graph-based term weights can be used without document length normalisation and still perform comparably to BM25 (with tuned document length normalisation), and outperform normalised TF-IDF.

We refer to Equation (15) as our ‘raw ranking model’, because it contains no normalisation component. The only parameters involved in this ranking function are the window size N of term co-occurrence (for building the text graph) and the damping factor of the iteration for TextRank and PosRank. In Sect. 6.2 we experimentally show the range of N values within which performance is relatively stable. The value of the damping factor ϕ is also typically fixed in the literature without major shortcomings (Mihalcea and Tarau 2004; Page et al. 1998). A sketch of this scoring follows.
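The following minimal sketch scores one document against a query as in (12) with w(t, q) = 1 (the dictionary inputs are illustrative; tw maps each document term to one of our four graph-based weights):

```python
import math

def raw_score(query_terms, tw, idf):
    # Equation (15) summed over query terms: no document length
    # normalisation is applied. idf maps each term to its inverse
    # document frequency; tw maps each document term to its TextRank,
    # TextLink, PosRank or PosLink weight.
    return sum(
        math.log(idf[t]) * math.log(tw[t])
        for t in query_terms
        if t in tw
    )
```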

5.2 Model enhanced with graph topological properties (no document length normalisation)

We present a ranking function that contains three components: (1) an idf component, (2) a graph-based term weight component, and (3) a discourse aspect of the document, which is represented as a topological property of the text graph. Specifically, we experiment with three different topological properties, each of which contributes a different discourse aspect into ranking (Sects. 5.2.1–5.2.3). Section 5.2.4 illustrates these graph topological properties for two real sample texts. Section 5.2.5 describes how we integrate these graph topological properties into the ranking formula.

5.2.1 Average degree as a property for ranking documents

The first topological property we use is the average degree of the graph. In Sect. 3.2.1 we discussed how the degree distribution can give valuable insight into the heterogeneity of node interactivity levels within a network. Simply speaking, the higher the average degree of a graph, the more heterogeneous the interactivity levels of its vertices. In our text graph analogy, heterogeneous vertex interaction can be seen as heterogeneous term interaction in a document. For instance, recall Figs. 2 and 3, which represent two different text graphs of the same sample text. The average degree of the first graph is higher than the average degree of the second graph, because the first graph models solely term co-occurrence, whereas the second graph models term co-occurrence with grammatical modification. Hence, the interaction between the terms is more heterogeneous in the first graph (all co-occurring words are connected) than in the second graph (only co-occurring words that are grammatically dependent are connected).

More generally, in language, words interact in sentences in non-random ways, which allows humans to construct an astronomical variety of sentences from a limited number of discrete units (words). One aspect of this construction process is the co-occurrence and grammatical modification of words, which in our text graph analogy we model as vertex interactions. The more homogeneous these word interactions are, the more cohesive the text is. Cohesiveness is directly linked to discourse understanding, since humans process, understand (and remember) by association (Ruge 1995). More simply, a document that keeps on introducing new concepts without linking them to previous context will be less cohesive and more difficult for humans to understand than a document that introduces new concepts while also linking them to previous context.

We integrate the average degree of a text graph into ranking with the aim of modelling the cohesiveness of the document being ranked. Our reasoning is that a more cohesive document is likely to be more focused in its content than a less cohesive document, and hence might make a better candidate for retrieval (this point is illustrated in Sect. 5.2.4). In line with this reasoning, we propose an integration of the average degree of the text graph into ranking, which boosts the retrieval score of lower-degree documents and conversely penalises the retrieval score of higher-degree documents. The exact formula of the integration of the average degree (and of the remaining topological properties we study) is presented at the end of this section (Sect. 5.2.5), so that we do not interrupt the flow of the discussion regarding our reasoning for using properties of graph topology in ranking and their analogy to discourse aspects.

5.2.2 Average path length as a property for ranking documents

The second topological property we use is the average path length of the text graph. As discussed in Sect. 2.2, in graphs where edges represent sense relations, shorter path length has been associated to faster information search in the brain. Similarly, in graphs where edges represent term co-occurrence, longer paths result from less term co-occurrence. For example, think of a text graph that has two completely disconnected graph partitions. These partitions correspond to regions in the text that do not share any words at all. If we introduce the same word in both regions of the text, then the graph partitions will become connected and the average path length of the graph will be reduced.

An underlying assumption behind looking at average path length in text graphs that model term co-occurrence and grammatical modification is that the closer two vertices (words) are to each other, the stronger their connection tends to be. This is a generally accepted assumption, supported for instance by studies examining dependency structures derived from the Penn Tree Bank and mapping the probability of dependence to the distance between words (as noted by Gamon (2006), based on Eisner and Smith (2005)). Moreover, this combination of distance and co-occurrence information is also generally accepted (for instance, this can be seen as what Pointwise Mutual Information applies at a high level), and specifically in IR it is reminiscent of the decaying language models proposed by Gao et al. (2005), in the sense that they combine term distance and co-occurrence information into ranking.

We integrate the average path length of a text graph into ranking with the aim of modelling the discourse dependence of the document being ranked. Our reasoning is that the lower the average path length of a document, the higher its discourse dependence, or more simply the more tightly knit its discourse is (this point is illustrated in Sect. 5.2.4). In line with this reasoning, we propose an integration of the average path length of the text graph into ranking, which boosts the retrieval score of documents that have lower average path length values, and conversely penalises the retrieval score of documents that have higher average path length values (described in Sect. 5.2.5).

5.2.3 Clustering coefficient as a property for ranking documents

In Sect. 2.2 we saw that graph clustering is typically seen as an indicator of the hierarchical organisation of the network being modelled. For instance, in syntactic graphs (Ferrer i Cancho et al. 2007), clusters are seen as core vocabularies surrounded by more special vocabularies. Similarly, in free-association graphs, clusters are seen as supporter vertices (response words) that gather around one leader vertex (stimulus word), forming a kind of small conceptual community (Jung et al. 2008). In the co-occurrence and grammatical modification graphs used in this work, clustering can be seen as an indication of contextually-bounded ‘discourse hubs’, in the sense that the clustered terms may very likely share some meaning, which may be more specific, specialised, or contextually-bounded than the general topic(s) of the text. A text graph exhibiting low clustering may indicate a document characterised by discourse drifting, i.e. a document mentioning several topics in passing, but without having a clear discriminative topic.

We integrate the clustering coefficient of a text graph into ranking with the aim to model the discourse drifting of the document being ranked. Our reasoning is that the higher the clustering coefficient of a document, the more clustered its discourse is with salient and discriminative topics (this point is illustrated in Sect. 5.2.4). In line with this reasoning, we propose an integration of the clustering coefficient of the text graph into ranking, which boosts the retrieval score of documents that have higher clustering coefficient values, and conversely penalises the retrieval score of documents that have lower clustering coefficient values (described in Sect. 5.2.5).
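The three properties can then be collected for a document's text graph as in the sketch below (building on the helpers above); nx.average_clustering computes the mean, over all vertices, of the fraction of a vertex's neighbour pairs that are themselves connected.

```python
def topological_properties(g):
    # Summarise the text graph with the three properties used for ranking.
    return {
        "avg_degree": average_degree(g),
        "avg_path_length": average_path_length(g),
        "clustering_coefficient": nx.average_clustering(g),
    }
```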

5.2.4 Illustration

For the illustration of graph topological properties in text graphs, we consider two different sample texts: a BBC News online article on astronomy and a Wikipedia entry on Bill Bryson’s book A Short History of Nearly Everything. These sample texts, which are freely available online, are displayed in “Appendices 1 and 2” respectively. The BBC article is selected because it discusses a specific topic of a scientific nature in lay terms. The Wikipedia entry is selected because it discusses a topic-rich book, which covers a multitude of scientific topics, also in lay terms. The two sample texts are of approximately the same length: the BBC text is 537 terms long, and the Wikipedia entry is 479 terms long.

For each sample text separately, we build two different text graphs: an undirected co-occurrence graph as described in Sect. 4.1, and a directed co-occurrence graph with grammatical constraints as described in Sect. 4.2 (both graphs use a co-occurrence window size of N = 4). Figure 5 displays the undirected co-occurrence graph of the BBC sample text, and Fig. 6 displays the undirected co-occurrence graph of the Wikipedia sample text. Table 1 displays the topological properties of the two graphs built for each of the two sample texts. Comparing the topological properties of the BBC and the Wikipedia graphs, we observe that the BBC graphs have slightly higher average degree, slightly lower average path length, and slightly higher clustering coefficient than the respective graphs of the Wikipedia text.

Fig. 5 Undirected co-occurrence text graph for the sample BBC text displayed in “Appendix 1”. This graph is built as described in Sect. 4.1, with a co-occurrence window size of N = 4. Graph properties are displayed in Table 1

Fig. 6 Undirected co-occurrence text graph for the sample Wikipedia entry displayed in “Appendix 2”. This graph is built as described in Sect. 4.1, with a co-occurrence window size of N = 4. Graph properties are displayed in Table 1

Table 1 Topological properties of the undirected co-occurrence graph (TextGraph) and the directed co-occurrence graph with grammatical constraints (PosGraph) built for the two sample texts (BBC NEWS, WIKIPEDIA)

Regarding the average degree, we reasoned in Sect. 5.2.1 that it can represent document cohesiveness, and that a document that keeps on introducing new concepts without linking them to previous context will be less cohesive than a document that introduces new concepts while also linking them to previous context. The higher the average degree of a text graph, the lower the cohesion of the respective text. Indeed, the discourse of the BBC sample text has less cohesion than the discourse of the Wikipedia sample text. Specifically, the BBC text discourse shifts across several third-person entities (for instance, Hubble Space Telescope, Professor Richard Bouwens, Dr Olivia Johnson, he, Dr Robert Massey, a NASA team, the research team, astronomers), impersonal structures (for instance, it is thought, it’s very exciting to, there are many), and repeatedly switches between direct and indirect narration (for instance, we’re seeing, you start out, he compares, we can use). This is typical of journalistic writing, which generally favours including statements from different agents about some fact, and referring to those agents in different ways. The Wikipedia sample text is more cohesive, because the discourse shifts across fewer entities, mainly Bryson and the book (for instance, Bryson, Bryson relies, Bryson describes, Bryson also speaks, he states, he then explores, he discusses, he also focuses, this is a book about, the book does). The discourse is mainly built around Bryson and the book, as opposed to the three different named scientists and the several other entities of the BBC sample text.

Regarding the average path length, we reasoned in Sect. 5.2.2 that it can represent how tightly knit the discourse is, and that the lower it is, the more tightly knit the discourse is. We observe in Table 1 that the graphs of the BBC sample text have slightly lower average path length than the graphs of the Wikipedia sample text. Indeed, the BBC text discusses only the topic of galaxies, and specifically the discovery of possibly the oldest galaxy ever observed. The text does not discuss any other topics, but focuses on different aspects of the discovery and its significance. Hence, the discourse of the BBC text is tightly knit around this topic. The Wikipedia sample text, however, discusses a book that covers a wide range of topics pertaining to the history of science, across a plethora of fields and aspects (as the title of the book suggests). In this respect, the discourse of the Wikipedia sample text is not as tightly knit around a single specific topic.

Finally, regarding the average clustering coefficient, we reasoned in Sect. 5.2.3 that it can represent topical clustering, and that the higher it is, the more clustered the discourse is with salient and discriminative topics (not just passing mentions of topics, i.e. topical drifting, but clusters or hubs of topics that are sufficiently discussed). We observe in Table 1 that the graphs of the BBC sample text have a slightly higher clustering coefficient than the graphs of the Wikipedia sample text. Indeed, the BBC sample text discusses in more depth the topics it covers (how the oldest galaxy was discovered, and what the significance of this finding is), compared to the Wikipedia text, which simply mentions in passing some of the topics covered in Bryson’s book (for instance, Newton, Einstein, Darwin, Krakatoa, Yellowstone National Park), without elaborating enough to create semantic hubs or clusters around these topics.

5.2.5 Integration into ranking

We integrate the graph topological properties discussed above into ranking in the same way that query-independent indicators of document retrieval quality are typically integrated into ranking. This use is reminiscent of the approaches presented in Sect. 2, where such graph properties are treated as indicators of thesaurus quality in semantic graphs, for instance. In addition to the above three graph measures, we also use the sum of the graph-based term weights in a graph as a graph-based equivalent of document length: in the same way that the sum of term frequencies in a document is seen as an indicator of document length, the sum of the graph-based term weights over all vertices indicates the total amount of weight carried by the graph.

We integrate the above graph properties into ranking using the satu integration approach, initially suggested in Craswell et al. (2005) for integrating PageRank scores into BM25:

$$ w(t,d) = \log(idf) \cdot \log(tw) + \psi \frac{P_d}{\kappa + P_d} \quad (16) $$

where tw is any of our four graph-based term weights, P_d is the graph property of document d to be integrated, and ψ, κ are parameters. These parameters can be tuned via extensive 2-dimensional exploration (Craswell et al. 2005), but in our case we fix one and tune only the other (discussed in Sect. 6.1). We use (16) exactly as given to integrate the estimated clustering coefficient into the ranking (P_d = c(G)), but for the remaining three properties we invert P_d in (16) (i.e. we replace P_d by 1/P_d), because we assume that the higher these properties are, the lower the relevance ranking should be, as discussed in Sects. 5.2.1, 5.2.2 and 5.2.4. We use the satu integration because we wish to integrate into the ranking function graph properties which we assume to be query-independent indicators of the retrieval quality of a document seen as a graph. Other ways of integrating these properties into retrieval are also possible, for instance any of the alternatives presented in Craswell et al. (2005).
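The following is a minimal sketch of the satu integration in (16); the helper names and the invert flag are ours, with the flag standing in for the 1/P_d replacement described above.

```python
import math

def satu_term_score(idf, tw, p_d, psi, kappa=1.0, invert=False):
    # Eq. (16): w(t,d) = log(idf) * log(tw) + psi * P_d / (kappa + P_d),
    # with P_d replaced by 1/P_d for the properties where higher values
    # should lower the ranking (degree, path length, sum of weights).
    p = 1.0 / p_d if invert else p_d
    return math.log(idf) * math.log(tw) + psi * p / (kappa + p)

def document_score(query_terms, doc_tw, idf, p_d, psi, invert):
    # Document score: sum term scores over query terms present in the document.
    return sum(satu_term_score(idf[t], doc_tw[t], p_d, psi, invert=invert)
               for t in query_terms if t in doc_tw)
```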

6 Experiments

We use the two ranking functions with graph-based term weights presented in Sects. 5.1 and 5.2, respectively, to match documents to queries, and we compare performance against a BM25 baseline.

6.1 Experimental settings

We use the Terrier IR system (Ounis et al. 2007), and extend it to accommodate graph-based computations. For the directed graphs, which require POS tagging, we use the freely available TreeTagger (Schmid 1994) on default settings. We do not filter out stopwords, nor do we stem words, when we build the text graphs and during retrieval. This choice is motivated by findings showing that stemming in general is not consistently beneficial to IR (Harman 1991; Krovetz 2000), and that the overall performance of IR systems is not expected to benefit significantly from stopword removal or stemming (Baeza-Yates and Ribeiro-Neto 1999).

6.1.1 Datasets

For our retrieval experiments we use standard TREC (Voorhees and Harman 2005) settings. Specifically, we use three TREC collections, details of which are displayed in Table 2: Disk4&5 (minus the Congressional Record, as used in TREC), WT2G, and BLOG06. Disk4&5 contains news releases from mostly homogeneous printed media. WT2G consists of crawled pages from the Web. BLOG06 is a crawl of blog feeds and associated documents. These collections belong to different domains (journalistic, everyday Web, blog) and differ in size and statistics (Disk4&5 has almost twice as many documents as WT2G, but notably fewer unique terms). For each collection, we use the associated set of TREC queries shown in Table 2. We experiment with short queries (title only), because they are more representative of real Web queries (Ozmutlu et al. 2004). We evaluate retrieval performance in terms of Mean Average Precision (MAP), Precision at 10 (P@10), and binary Preference (BPREF), and report the results of statistical significance testing with respect to the baseline, using the Wilcoxon matched-pairs signed-ranks test.
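As an illustration of the significance testing, the sketch below runs a Wilcoxon matched-pairs signed-ranks test over per-query average precision scores of a run against the baseline; the scores shown are toy values, and scipy is assumed.

```python
from scipy.stats import wilcoxon

baseline_ap = [0.31, 0.12, 0.45, 0.27, 0.38]  # per-query AP of BM25 (toy values)
run_ap      = [0.35, 0.10, 0.52, 0.30, 0.41]  # per-query AP of a graph-based run

stat, p_value = wilcoxon(baseline_ap, run_ap)
significant = p_value < 0.05  # the significance level used in Tables 3, 4 and 5
```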

Table 2 Dataset features. DISK4&5 exclude the Congressional Record subset, and contain the following: Federal Register (1994), Financial Times (1992–1994), Los Angeles Times (1989–1990)

6.1.2 Parameter tuning

There is one parameter involved in the computation of our graph-based term weights, namely the window size N of term co-occurrence when building the text graph. We vary N within N = [2, 3, 4, 5, 10, 20, 25, 30] and report retrieval performance for each of these values. Note that there is another parameter involved in the computation of TextRank and PosRank only (i.e. the term weights that use recommendation only), namely the damping factor ϕ specified in (8) and (10). We do not tune ϕ, but set it to ϕ = 0.85 following Mihalcea and Tarau (2004) and Page et al. (1998).
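For reference, the sketch below implements a TextRank-style recommendation update with the damping factor fixed at ϕ = 0.85, in the spirit of Mihalcea and Tarau (2004); the exact formulation used in this work is (8), which is not reproduced here.

```python
import networkx as nx

def textrank_weights(g: nx.Graph, phi=0.85, iterations=20):
    # Synchronous power-iteration update S(v) = (1 - phi) + phi *
    # sum_{u in adj(v)} S(u) / deg(u); about 20 iterations suffice in
    # practice (see the convergence analysis in Sect. 7).
    s = {v: 1.0 for v in g}
    for _ in range(iterations):
        s = {v: (1.0 - phi) + phi * sum(s[u] / g.degree(u) for u in g[v])
             for v in g}
    return s
```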

In addition, there are two parameters involved in the satu integration of the graph topological properties into our second ranking function only. The parameters of the satu integration are ψ and κ. We fix κ = 1 and tune only ψ within ψ = [1–300] in steps of 3. We do not tune both parameters simultaneously, because our aim is to study whether our graph-based term weights are beneficial to retrieval performance, not to fine-tune their performance in a competitive setting. Even so, our graph-based term weights perform very well, as shown next, which indicates that the performance reported here may be further improved with full tuning.

Our retrieval baseline, BM25 (Equation 13), includes three tunable parameters: k_1 and k_3, which have little effect on retrieval performance, and b, a document length normalisation parameter. We tune b to optimise retrieval performance by varying its value within b = [0–1] in steps of 0.05.
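For completeness, the sketch below gives a standard BM25 formulation; the article's exact variant is its Equation 13, which is not reproduced here, and the parameter defaults shown are conventional values rather than the tuned ones.

```python
import math

def bm25(tf, qtf, df, n_docs, doc_len, avg_doc_len, k1=1.2, k3=1000.0, b=0.75):
    # Robertson/Okapi BM25: an IDF component, a saturating document TF
    # component with length normalisation (parameter b), and a query TF
    # component (parameter k3, which matters only for repeated query terms).
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    tf_part = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
    qtf_part = qtf * (k3 + 1.0) / (qtf + k3)
    return idf * tf_part * qtf_part
```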

6.2 Experimental results

To recapitulate, we have proposed four graph-based term weights: TextRank, TextLink, PosRank, and PosLink. TextRank and TextLink compute term weights from a graph of word co-occurrence; PosRank and PosLink compute term weights from a graph of word co-occurrence and grammatical modification. We use these for ranking without document length normalisation, either ‘raw’ (i.e. combined only with IDF), or enhanced with the following graph properties: average degree, average path length, clustering coefficient, and the sum of the vertex weights in the graph.

6.2.1 Retrieval precision

Tables 3, 4 and 5 present the retrieval performance of our graph-based term weights and ranking functions against BM25, separately per collection and evaluation measure. The column entitled ‘raw’ refers to our first, ‘raw’ ranking function; the remaining columns refer to our second ranking function, which is enhanced with the graph property mentioned in the header. Specifically, ‘+Degree, +Path, +Cl. coef., +Sum’ refer respectively to the average degree, average path length, clustering coefficient, and sum of graph-based term weights. Note that +Sum is not the sum of the graph properties (i.e. it is not a summation over +Degree and +Path, for instance), but the sum of the graph-based term weights of a document, which we use here as a graph-based equivalent to document length (this point is explained in Sect. 5.2.5). The baseline is BM25 (TF-IDF (Robertson and Sparck Jones 1976) scores are also included for reference, but we use BM25, which performs much better than TF-IDF, as the baseline). Tables 3, 4 and 5 display the best scores for each model, after tuning.

Table 3 Mean average precision (MAP) of retrieval results of our ranking with our four graph-based term weights (TextRank, TextLink, PosRank, PosLink) compared to baseline ranking with BM25 (TFIDF is displayed for reference). Raw denotes ranking without graph topological properties. +Degree, +Path, +Cl. coef., +Sum denote ranking with the respective graph topological properties. Bold font marks MAP ≥ baseline. Large font marks best overall MAP, and * marks statistical significance at p < 0.05 with respect to the baseline. All scores are tuned as specified in Sect. 6.1.2
Table 4 Precision at 10 retrieved results (P@10) of our ranking with our four graph-based term weights (TextRank, TextLink, PosRank, PosLink) compared to baseline ranking with BM25 (TFIDF is displayed for reference). Raw denotes ranking without graph topological properties. +Degree, +Path, +Cl. coef., +Sum denote ranking with the respective graph topological properties. Bold font marks P@10 ≥ baseline. Large font marks best overall P@10, and * marks statistical significance at p < 0.05 with respect to the baseline. All scores are tuned as specified in Sect. 6.1.2
Table 5 Binary Preference (BPREF) of the retrieved results of our ranking with our four graph-based term weights (TextRank, TextLink, PosRank, PosLink) compared to baseline ranking with BM25 (TFIDF is displayed for reference). Raw denotes ranking without graph topological properties. +Degree, +Path, +Cl. coef., +Sum denote ranking with the respective graph topological properties. Bold font marks BPREF ≥ baseline. Large font marks best overall BPREF, and * marks statistical significance at p < 0.05 with respect to the baseline. All scores are tuned as specified in Sect. 6.1.2

Regarding our first ‘raw’ ranking function, we see that all four of our weights are comparable to the baseline, for all collections and evaluation measures. By comparable, we mean that their performance is within −0.081 and +0.048 of the baseline; hence we are not looking at any significant loss or gain in retrieval performance. This is noteworthy, considering that the baseline is tuned with respect to document length, whereas our ‘raw’ ranking is not. The largest gains in retrieval performance associated with our ‘raw’ ranking are noted with BLOG06, for which our graph-based weights outperform the baseline for MAP (0.3947), P@10 (0.7160), and BPREF (0.4551).

The term weights of the directed graphs (PosRank and PosLink) do not seem to make a significant contribution to retrieval performance in comparison to the term weights of the undirected graphs (TextRank and TextLink). Similar findings have been reported in other tasks that use graph-based term weights, for instance in keyword extraction (Mihalcea and Tarau 2004), where undirected graphs achieve higher F-measures than directed graphs. Overall, our graph-based weights in the ‘raw’ ranking function perform consistently across collections, apart from TextLink and PosLink, which underperform on Disk4&5. A possible reason may be that the Disk4&5 collection contains fewer unique terms in proportion to its number of documents than the other collections, which implies high term repetition. However, when we build graphs from text, we link co-occurring terms only once, no matter how many times they actually co-occur. This affects TextLink and PosLink, because these weights rely solely on the average degree of the graph (i.e. the links between terms).

Our second type of ranking, enhanced with graph topological properties (columns ‘+Degree, +Path, +Cl. coef., +Sum’), performs better than the ‘raw’ ranking function, and also comparably to BM25. In fact, in every case, the best overall score per collection and evaluation measure is achieved by one of our enhanced graph-based term weights. This indicates that the discourse aspects that we integrated into ranking can benefit retrieval performance, not only by bringing more relevant documents into the lower precision ranks, but also by re-ranking the top ranks of the retrieved documents (as shown in the improvements to the P@10 measure). All graph properties seem to work equally well overall, without any significant differences in their measured retrieval performance.

Note that most of our graph-based weights in Tables 3, 4 and 5 are statistically significant with respect to TF-IDF, but only a few of them with respect to BM25 (marked * in the tables).

6.2.2 Parameter sensitivity

The performance of our graph-based term weights depends to an extent on the value of the window size N of term co-occurrence. Specifically, the value of the co-occurrence window N affects how the text graph is built (which edges are drawn), and hence it is critical to the computation of the graph-based term weights. Figure 7 plots N against retrieval performance, in order to check the stability of the latter across the range of the former. For brevity we show the plots of our first ‘raw’ ranking only, but we can report that the same behaviour holds for our second, enhanced ranking that includes graph properties. In Fig. 7 we observe that performance is overall consistent across collections, evaluation measures, and graph-based term weights. N values between 5 and 30 seem to perform better, and we can confirm that this also holds for our enhanced ranking on these collections. Among the N values between 5 and 30 that perform well, N = 10 performs well in terms of MAP and BPREF. A window value of N = 10 can be practically interpreted as the sub-sentence context that our approach considers when weighting term salience. Given that the average sentence length for English is 20 words (Sigurd et al. 2004), this choice of context seems reasonable; in fact, N = 10 has been used in other text processing tasks that consider context within sentences, e.g. the context-based word sense disambiguation of Schütze and Pedersen (1995).

Fig. 7 Retrieval performance (measured in MAP, P@10, BPREF) across the full value range of parameter N (the ‘window size’ of term co-occurrence)

Regarding our second, enhanced ranking, the performance reported in Tables 3, 4 and 5 depends to an extent on the value of parameter ψ, which is used when integrating the topological graph properties into ranking. Specifically, the value of ψ controls the influence of the graph property upon the final weight. To this end, we conduct additional experiments in order to test the parameter stability of our graph-based term weights, in a split-train scenario. We focus on our second, enhanced ranking function, which contains the integration parameter ψ. We split the Disk4&5 queries into two sets (as split by TREC), and we use queries 301–450 for tuning the parameter ψ and queries 601–700 for testing. The aim is to check whether the scores reported so far are the product of fine tuning, or whether we can expect relatively similar performance with parameter values that are not necessarily optimal.

Table 6 shows retrieval performance for Disk4&5, where we see that the scores obtained by parameters trained on a different query set (column ‘train’) are not far off the actual best scores (column ‘best’). The value of ψ used in these runs is displayed separately in Table 7. We see that in most cases the trained value is very close to the best value. These results indicate that the value of ψ can be estimated with relatively little tuning, because it is overall stable across different query sets and evaluation measures.

Table 6 Retrieval performance measured in MAP, P@10, BPREF using the TextRank graph-based term weight only on Disk4&5. TextRank is enhanced with four different graph properties (in separate rows). The parameter of this integration is tuned on one subset of the queries, and the tuned parameter values are used with a different subset of the queries (‘train’ column). Column ‘best’ reports the actual best performance. There is no notable or significant difference between ‘train’ and ‘best’, indicating that parameter tuning can be ported to different query sets, hence it is potentially robust. The actual parameter values are shown in Table 7
Table 7 Parameter ψ values of the integration of graph properties into ranking, as explained in the caption of Table 6

7 Implementation and efficiency

This section discusses issues pertaining to the implementation and efficiency of our graph-based term weights in an IR system. Graph-based term weights can be computed at indexing time, not at querying time; hence they add no delay to the IR system’s response time to the user. Typically, when a document is fed into the system for indexing, it has to be read from disk for cleaning (extracting the content from Web pages or removing page templates in news articles, for instance), tokenising, parsing, etc. It is at that time that the additional overhead of the graph-based term weight computation is incurred. Specifically, this overhead is introduced by the algorithm that iterates over (8)–(11) for each of our four graph-based term weights respectively. Figure 8 displays the time overhead in milliseconds associated with the computation of (8), as a function of the number of iterations (varied from 0 to 1,000), for computing the TextRank term weights of the Disk4&5 collection. Figure 8 plots the running time for the different numbers of iterations averaged over the whole document set, and its variance, computed using an Intel Core 2 Duo 3 GHz Linux box with 4 GB of RAM. We observe that running time increases approximately linearly with the number of iterations, although the overhead is negligible when the number of iterations is less than 20 (<1 ms).

Fig. 8 Overhead in milliseconds introduced by the algorithm. The x-axis represents the number of iterations over (8)

The random-walk based approaches introduced in Sect. 4 approximate an infinite Markov chain in a finite number of steps. Hence, a second question is how many iterations the algorithm requires to converge. To answer this question, we compare the weights obtained by iterating over (8) a certain number of times vs. iterating a very large number of times (ideally, an infinite number). This provides insight into the number of iterations we need for convergence, i.e. the point at which the weights produced by the algorithm become indistinguishable. Specifically, we compute the weights that result after iterating the algorithm in (8) 10K times, and treat these as ground truth. We then report the mean squared error of the weights that the algorithm produces after iterating a lower number of times. Figure 9 plots the mean squared error (MSE) and variance for a given number of iterations. We observe that the MSE decreases exponentially with the number of iterations, and that it is close to zero after 100 iterations. In general the error is very low (<10^−6) after a few iterations (≈20). Hence, we can conclude that the algorithm requires a low number of iterations to produce weights that are indistinguishable from those produced using a much higher number of iterations. This computation can be performed offline and independently for every document, thus term weights can be computed efficiently using parallel programming paradigms such as Hadoop.
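The sketch below mirrors this convergence check, reusing the textrank_weights helper sketched in Sect. 6.1.2: weights after n iterations are compared against a 10K-iteration run treated as ground truth, and the mean squared error is reported per checkpoint (the checkpoint values are illustrative).

```python
def convergence_mse(g, checkpoints=(1, 5, 10, 20, 50, 100)):
    truth = textrank_weights(g, iterations=10_000)  # treated as ground truth
    mse = {}
    for n in checkpoints:
        approx = textrank_weights(g, iterations=n)
        mse[n] = sum((approx[v] - truth[v]) ** 2 for v in g) / g.number_of_nodes()
    return mse
```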

Fig. 9 Average difference between the weights computed at a given number of iterations and the weights computed using 10K iterations, over (8)

Overall, the low number of iterations and the limited running time imply that our graph-based term weighting algorithm can process documents with minimal overhead. Additionally, the fact that the whole process can be performed offline and fully distributed allows us to conclude that the computation of such weights could scale up to Web-sized collections.

Using graph-based weights for document ranking incurs no overhead at query time, compared to other efficient approaches like Anh and Moffat’s impact-sorted posting lists (Anh and Moffat 2005). These posting lists store term information as impacts, values that represent a term’s score contribution. In Anh and Moffat (2005) an impact is a combination of term frequency and a document length normalisation component. Graph-based term weights could therefore be stored and indexed in the same way, allowing for efficient high-throughput query processing, as fast as other impact-based posting list indexing approaches.
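As a sketch of how such weights could be stored as impacts, the quantisation below maps a real-valued graph-based term weight into a small integer range suitable for impact-sorted posting lists; the 8-level linear quantisation is an illustrative choice of ours, not the scheme of Anh and Moffat (2005).

```python
def quantise_impact(weight, w_min, w_max, levels=8):
    # Linearly map a weight in [w_min, w_max] to an integer impact in [1, levels].
    if w_max <= w_min:
        return 1
    scaled = (weight - w_min) / (w_max - w_min)
    return min(levels, int(scaled * levels) + 1)
```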

8 Conclusions

Starting from a graph representation of text, where nodes denote words, and links denote co-occurrence and grammatical relations, we used graph ranking computations that take into account topological properties of the whole graph to derive term weights, in an extension of the initial proposal of Mihalcea and Tarau (2004). We built two different text graphs, an undirected one (for term co-occurrence) and a directed one (for term co-occurrence and grammatical dependence), and we extracted four graph-based term weights from them. These weights encoded co-occurrence and grammatical information as an integral part of their computation. We used these weights to rank documents against queries without normalising them by document length. In addition, we integrated into ranking graph topological properties (average degree, path length, clustering coefficient). An analogy was made between properties of the graph topology and discourse aspects of the text modelled as a graph (such as cohesiveness or topical drift). Hence, integrating these graph topological properties into ranking practically meant considering discourse aspects of the documents being ranked for retrieval.

Experiments with three TREC datasets showed that our graph-based term weights performed comparably to an established, robust retrieval baseline (BM25), and consistently across different datasets and evaluation measures. Further investigation into the parameters involved in our approach showed that performance was relatively stable for different collections and measures, meaning that parameter training on one collection could be ported to other datasets without significant or notable loss in retrieval performance.

A possible future extension of this work is to use the graph-based term weights presented in this article to rank sentences. This application is reminiscent of the LexRank approach (Erkan and Radev 2004), which computes sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. Specifically, in LexRank, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences. Sentence salience is typically defined in terms of the presence of particular important words, or in terms of similarity to a centroid pseudo-sentence. A possible extension of our work could be to use the term salience computed from our graph-based term weights in order to deduce measures of sentence salience. Such an extension would in fact combine our graph-based term weights with the LexRank approach, and could potentially offer valuable insights into the application of graph-based term weights for sentence extraction.

Further future work includes using other graph ranking computations, e.g. HITS (Kleinberg 1999), to derive term weights, which implies defining vertices as hubs and authorities from a linguistic viewpoint. We also intend to experiment with weighted text graphs, in order to better represent meaningful linguistic relations between words modelled as vertices. This would involve determining how to learn the edge weights from training data in order to perform a weighted or personalised PageRank (Chakrabarti 2007). Our use of such graph-based term weights for IR may be further improved by applying other ranking functions (the two functions presented here had a straightforward TF-IDF format), or re-ranking functions, such as the non-symmetric similarity functions used by Kurland and Lee (2010). Lastly, the encouraging results of the graph topological properties reported here invite further investigation, with respect to their integration into ranking and the potential combination of more than one property.