
1 Introduction

Social media is now the most popular platform for creating and exchanging user-generated content [15]. According to the Pew Research Center [24], over 70% of internet users used social media sites as of January 2014. Another report by eMarketer [12] projected that, by the end of 2013, 163.5 million people in the U.S. (more than two-thirds of internet users) would be social media users. Moreover, Facebook, the leading global social networking service provider, had 1.35 billion monthly active users as of the third quarter of 2014 [13]. As of May 2013, 4.5 billion “Likes” were generated and 4.75 billion pieces of content were shared daily. These statistics indicate the appeal of social media and user-generated content, as well as the value of acquiring information from social media to facilitate the development of novel products and services and the improvement of existing ones.

Various social media websites have been established, such as wikis (e.g., Wikipedia), blogs and microblogs (e.g., Twitter), media sharing (e.g., YouTube, Flickr), social news (e.g., Digg, Reddit), social bookmarking (e.g., Delicious, CiteULike), and social networking (e.g., Facebook, Google+). The knowledge (aka “wisdom of crowds”) gained from social media sites can not only meet the objectives of the businesses offering them but also help the development of novel and effective services that are better tailored to users’ needs. In this study, we focus on analyzing a specific mechanism, i.e., the social tagging system (aka folksonomy), which is commonly supported by numerous social media sites (e.g., YouTube, Flickr, Delicious), for enhancing the effectiveness of personalized information management. A folksonomy is a system of classification [29] that allows users to attach self-defined keywords (or tags) to describe resources [21, 27]. A folksonomy generally consists of a set of users, a set of self-defined tags, a set of resources, and a set of tag assignments (i.e., a set of user-tag-resource triple relationships) [8]. Semantically, tags in a folksonomy reflect users’ collaborative cognition of information. They can reveal both the users’ behavior and the resources’ properties [34].

The knowledge gained from folksonomy is valuable for supporting various applications, such as Web page classification [1], recommendation [22, 37], and information retrieval [3, 6]. In this study, we attempt to apply the wisdom of crowds of folksonomy to a novel document management task, namely personalized document clustering. Specifically, we adopt the CAC technique proposed by Yang and Wei [36] as our underlying personalized document clustering algorithm. The CAC technique takes into consideration a user’s categorization preference (expressed as a list of anchoring terms) and subsequently generates a set of document clusters from this specific preferential perspective. To do so, it exploits the World Wide Web as an information source to construct a statistical-based thesaurus, which serves to expand the initial set of anchoring terms provided by the target user. The expanded set is then used to represent the source documents, and clustering is performed on this representation to generate document clusters in accordance with the input preferential context. In this study, we want to understand the effectiveness of folksonomy, in comparison with a general-purpose search engine (i.e., Google in Yang and Wei’s study), in constructing a statistical-based thesaurus for supporting personalized document clustering. We select Delicious (https://delicious.com/), a leading social bookmarking site, as the folksonomy for our social-tagging-based CAC technique (ST-CAC). We also conduct experiments to evaluate the effectiveness of the ST-CAC technique against its benchmark approaches.

The remainder of this paper is organized as follows. Section 2 reviews existing document clustering techniques relevant to this study. In Sect. 3, we describe the detailed design of the proposed ST-CAC technique. Subsequently, we depict our experimental design and discuss important evaluation results in Sect. 4. Finally, we conclude with a summary and some future research directions in Sect. 5.

2 Literature Review

Document clustering entails the automatic organization of a large document collection into distinct groups of similar documents that reflect general themes hidden within the corpus [23, 32]. The documents in the resultant clusters exhibit maximal similarity to those in the same cluster and, at the same time, share minimal similarity with documents in other clusters. However, according to the context theory of classification, document clustering behaviors of individuals not only involve the attributes (including contents) of documents but also depend on who is performing the task and in what context [2, 7, 17]. As a result, document clustering is an intentional act that should reflect individuals’ preferences with regard to the semantic coherency or relevant categorization of documents [26] and should conform to the context of a target task under investigation.

Most existing document clustering techniques are anchored in document content analysis. The overall process of a content-based document clustering technique generally comprises three main phases: feature extraction and selection, document representation, and clustering [14, 32, 33]. The purpose of feature extraction and selection is to extract and select from the target document corpus a set of representative features to represent the documents in the document representation phase. Subsequently, the clustering phase applies a clustering technique to group the target documents into distinct clusters.

Feature extraction begins with the parsing of each source document to produce a set of nouns and noun phrases and exclude a list of prespecified “stop words” that are non-semantic-bearing words. Subsequently, representative features are selected from the set of extracted features. Feature selection is important for clustering efficiency and effectiveness, because it not only condenses the size of the extracted feature set, but also reduces the potential biases embedded in the original (i.e., nontrimmed) feature set [25, 35]. Commonly used feature selection metrics include: TF, TF × IDF, and their hybrids [4, 19].

On the basis of a particular feature selection metric, the k features with the highest selection metric scores then are selected to represent each source document in the document representation phase. Based on the chosen representation scheme, each document is described in the k-dimensional space and represented as a feature vector. Commonly employed document representation schemes include binary (presence or absence of a feature in a document), within-document TF, and TF × IDF [4, 19, 23, 25, 32].

In the final phase of document clustering, source documents are grouped into distinct clusters on the basis of the selected features and their respective values in each document. Common clustering approaches include partitioning-based [4, 9, 19], hierarchical [11, 25, 30, 32], and Kohonen neural network [18, 20, 25].

As mentioned, content-based document clustering techniques rely on an objective feature-selection metric (e.g., TF or TF × IDF) that considers only document content. As a result, existing content-based techniques generate an identical set of document clusters for all users from a given document collection and thus are unable to support personalized document clustering. In response to this limitation, prior research has proposed several extended approaches that might support personalized document clustering. For example, Deogun and Raghavan [10] propose a user-oriented document clustering technique that considers only document relevance to user queries. Kim and Lee [16] propose a semi-supervised document clustering technique to improve clustering effectiveness. Their approach is essentially a hybrid one that considers not only content similarity but also a user’s perception of document similarity, obtained through a relevance-feedback mechanism. Wei et al. [32] instead propose a personalized document clustering (PEC) approach to support personalization in document categorization. In addition to the contents of the documents to be clustered, the PEC approach takes a target user’s partial clustering as input, because it reflects his or her categorization preference. Last, Yang and Wei [36] propose a context-aware document-clustering (CAC) technique that takes into consideration a user’s categorization preference (expressed as a list of anchoring terms) and subsequently generates a set of document clusters from this specific preferential perspective.

The abovementioned extended document clustering techniques can, to some degree, support the desired personalized document clustering task. According to Yang and Wei’s study [36], the CAC technique outperforms the other extended approaches in supporting personalized document clustering. Thus, we adopt the CAC technique as the underlying algorithm for personalized document clustering. The CAC technique uses a general-purpose search engine (i.e., Google) to construct a statistical-based thesaurus, which serves as the basis for generating a set of document clusters that fits the categorization preference of a specific user. In this study, we adopt social media (more specifically, a social tagging system) as an alternative information source for statistical-based thesaurus construction. The rationale is that the information in a folksonomy has been processed by crowds and reflects users’ collaborative cognition. Such collaborative wisdom should be better at supporting personalized document clustering.

3 Proposed Method

The context-aware document-clustering (CAC) technique, proposed by Yang and Wei [36], takes into consideration a user’s categorization preference (expressed as a list of anchoring terms) and then generates a set of document clusters from this specific preferential perspective. For example, given a set of research articles related to “data mining,” a person interested in developing new data mining techniques may prefer document categories anchored on the techniques under discussion and thus provide anchoring terms such as classification analysis, clustering analysis, association rules, and sequential patterns. On the other hand, another person, who is applying data mining techniques to real-world business applications, may prefer a different set of categories based on the application domains involved (e.g., banking, retailing, health care, telecommunications). Given the set of user-provided anchoring terms, which represent the specific user’s categorization preference, the CAC technique first constructs a statistical-based thesaurus and subsequently expands the given set of anchoring terms by adding their relevant terms. The expanded set of anchoring terms is adopted as the representative features for performing personalized document clustering.

The major difference between the CAC technique and our extended social-tagging-based CAC technique (ST-CAC) lies in how the statistical-based thesaurus is constructed. As shown in Fig. 1, the ST-CAC technique consists of five main phases: (1) feature extraction and selection; (2) statistical-based thesaurus construction; (3) anchoring term expansion; (4) document representation; and (5) document clustering. The detailed design of each phase is described in this section.

Fig. 1. Overall process of the ST-CAC technique

3.1 Feature Extraction and Selection

This phase aims at extracting and selecting a set of meaningful features (specifically, nouns and noun phrases) from the target document corpus. We adopt the part-of-speech (POS) tagger developed by Brill [5] to syntactically tag each word in the target documents and then employ Voutilainen’s approach [31] to implement a noun-phrase parser for extracting noun phrases from each tagged document. Furthermore, we remove features that appear infrequently in the target document corpus. In particular, we retain only those features whose document frequency (df) is no less than a prespecified threshold \( \delta_{DF} \).
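The document-frequency filter at the end of this phase can be sketched as follows. This is a minimal illustration, assuming the noun phrases have already been extracted into one list per document; the function name `filter_by_document_frequency` and the toy corpus are hypothetical and not part of the original design.

```python
from collections import Counter

def filter_by_document_frequency(doc_features, delta_df):
    """Keep features whose document frequency is at least delta_df."""
    df = Counter()
    for features in doc_features:
        df.update(set(features))  # count each feature once per document
    return {f for f, n in df.items() if n >= delta_df}

# Toy corpus (hypothetical): one list of extracted noun phrases per document.
docs = [
    ["data mining", "clustering", "clustering"],
    ["data mining", "xml"],
    ["data mining", "clustering", "robotics"],
]
kept = filter_by_document_frequency(docs, delta_df=2)
print(sorted(kept))  # ['clustering', 'data mining']
```

Repeated occurrences within a single document do not raise the document frequency, which is why each document's features are deduplicated before counting.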

3.2 Statistical-Based Thesaurus Construction

The purpose of this phase is to automatically construct a statistical-based thesaurus that will be used for expanding the user-provided anchoring terms. We adopt the folksonomy of the Delicious website as the corpus for constructing the thesaurus. A folksonomy generally consists of a set of users (U), a set of self-defined tags (T), a set of resources (R), and a set of tag assignments A ⊆ U × T × R (i.e., a set of user-tag-resource triple relationships). A bookmark in Delicious is a triple \( (u, T_{ur}, r) \) with u ∈ U, r ∈ R, and a set of tags \( T_{ur} = \{t \in T \mid (u, t, r) \in A\} \).
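To make the folksonomy structure concrete, the following sketch stores the tag assignments A as a set of (user, tag, resource) triples and derives the bookmarks from them. The user names, tags, and resource identifiers are toy values, and the helper `bookmarks` is a hypothetical name introduced only for illustration.

```python
from collections import defaultdict

# Tag assignments A as (user, tag, resource) triples; toy data.
A = {
    ("alice", "python", "r1"),
    ("alice", "tutorial", "r1"),
    ("bob", "python", "r1"),
    ("bob", "clustering", "r2"),
}

def bookmarks(assignments):
    """Group tag assignments into bookmarks: (u, r) -> tag set T_ur."""
    t_ur = defaultdict(set)
    for u, t, r in assignments:
        t_ur[(u, r)].add(t)
    return dict(t_ur)

# The sets U, T, and R follow directly from the triples.
U = {u for u, _, _ in A}
T = {t for _, t, _ in A}
R = {r for _, _, r in A}
B = bookmarks(A)
```

Here `B[("alice", "r1")]` is the tag set alice attached to resource `r1`, which corresponds to one bookmark in the sense defined above.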

For each anchoring term \( q_i \) pertaining to the categorization preference of a target user and every feature \( f_j \) representative of the target document corpus, the proposed ST-CAC technique calculates the relevance weight between \( q_i \) and \( f_j \) by the pointwise mutual information (PMI) measure [28] as follows:

$$ rw_{q_i,f_j} = \log_2\left(\frac{p(q_i \wedge f_j)}{p(q_i)\,p(f_j)}\right) = \log_2\left(\frac{N \times \text{hits}(q_i \wedge f_j)}{\text{hits}(q_i)\,\text{hits}(f_j)}\right), $$
(1)

where \( rw_{q_i,f_j} \) denotes the relevance weight between \( q_i \) and \( f_j \), p(query) is the probability that query (i.e., \( q_i \) or \( f_j \)) has been used as a tag to annotate some resources (i.e., \( p(query) = |R_{query}|/|R| \), where \( R_{query} \) is the set of resources annotated with the tag query), N is the total number of resources in the folksonomy (i.e., |R|), and hits(query) is the number of resources annotated with the tag query (i.e., \( |R_{query}| \)).
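A minimal sketch of Eq. (1) over a toy folksonomy follows. It assumes that hits(q ∧ f) counts the resources annotated with both tags (by any users) and that a pair that never co-occurs receives a weight of negative infinity; both choices, as well as the function names, are assumptions made for illustration.

```python
import math

def resources_by_tag(assignments):
    """Map each tag to the set of resources annotated with it (R_query)."""
    index = {}
    for _, t, r in assignments:
        index.setdefault(t, set()).add(r)
    return index

def pmi(q, f, tag_resources, n_resources):
    """Eq. (1): log2(N * hits(q and f) / (hits(q) * hits(f)))."""
    rq = tag_resources.get(q, set())
    rf = tag_resources.get(f, set())
    joint = len(rq & rf)  # assumption: resources carrying both tags
    if joint == 0:
        return float("-inf")  # assumption: tags that never co-occur
    return math.log2(n_resources * joint / (len(rq) * len(rf)))

# Toy tag assignments (user, tag, resource).
A = {
    ("u1", "mining", "r1"), ("u2", "mining", "r2"),
    ("u1", "clustering", "r1"), ("u2", "clustering", "r2"),
    ("u3", "clustering", "r3"), ("u3", "xml", "r4"),
}
tag_resources = resources_by_tag(A)
N = len({r for _, _, r in A})  # |R| = 4
rw = pmi("mining", "clustering", tag_resources, N)  # log2(4 * 2 / (2 * 3))
```

In this toy data, "mining" and "clustering" co-occur on two of four resources, so their weight is positive, whereas "mining" and "xml" never share a resource.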

We extend the standard PMI measure by incorporating the number of users who use a tag t to annotate at least one resource, where the set of such users is \( U_t = \{u \in U \mid \exists\, r \in R : (u, t, r) \in A\} \). A tag that is commonly used to annotate the same resources should have a higher weight than one that is infrequently adopted. Accordingly, the weighted PMI is defined as:

$$ weighted\_rw_{q_i,f_j} = \log_2\left(\frac{W\_N \times sum\_u(q_i \wedge f_j)}{sum\_u(q_i)\,sum\_u(f_j)}\right), $$
(2)

where sum_u(query) is the sum of the numbers of users who use the tag query to annotate resources (i.e., \( sum\_u(query) = \sum_{query \in R_{query}} \left| U_{query} \right| \)) and W_N is the total number of tag assignments (i.e., |A|). We employ the weighted PMI measure for our proposed ST-CAC technique.
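The text leaves some room for interpretation in how sum_u is evaluated, so the sketch below fixes one possible reading and should be treated as an assumption rather than the authors' exact procedure. Here sum_u(t) sums, over every resource annotated with t, the number of users who applied t to that resource, and sum_u(q ∧ f) counts, per resource, the users who applied both tags to it. The function name `weighted_pmi` is hypothetical.

```python
import math
from collections import defaultdict

def weighted_pmi(q, f, assignments):
    """Eq. (2) under one assumed reading of sum_u (see the note above)."""
    # users_of[t][r] = set of users who annotated resource r with tag t
    users_of = defaultdict(lambda: defaultdict(set))
    for u, t, r in assignments:
        users_of[t][r].add(u)

    def sum_u(tag):
        # Sum over resources of the number of users applying `tag` to each.
        return sum(len(users) for users in users_of[tag].values())

    # ASSUMPTION: the joint count uses users who applied both tags to the
    # same resource.
    joint = sum(
        len(users_of[q][r] & users_of[f][r])
        for r in users_of[q]
        if r in users_of[f]
    )
    if joint == 0:
        return float("-inf")  # assumption: tags that never co-occur
    w_n = len(assignments)  # W_N = |A|
    return math.log2(w_n * joint / (sum_u(q) * sum_u(f)))

# Toy tag assignments (user, tag, resource).
A = {
    ("u1", "mining", "r1"), ("u2", "mining", "r1"),
    ("u1", "clustering", "r1"), ("u2", "clustering", "r2"),
    ("u3", "xml", "r3"),
}
w = weighted_pmi("mining", "clustering", A)  # log2(5 * 1 / (2 * 2))
```

Under this reading, sum_u(t) equals the number of tag assignments that use t, so the ratio sum_u(t) / W_N plays the role that p(t) plays in Eq. (1).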

3.3 Anchoring Term Expansion

On the basis of the constructed statistical-based thesaurus, this phase expands a given set of anchoring terms AT by including additional relevant terms. An anchoring term \( q_i \) in AT is expanded with a set of terms \( E_{q_i} \) whose relevance weights to \( q_i \), measured by the weighted PMI, are greater than a prespecified threshold α. The expanded set of anchoring terms \( RF = \left( \bigcup_{q_i \in AT} E_{q_i} \right) \cup AT \) is formed for the succeeding document clustering task.
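The expansion step can be sketched as a threshold filter over the thesaurus. In this illustration the thesaurus is a hypothetical lookup table of precomputed weights; the function `expand_anchoring_terms` and all values are assumptions for demonstration only. Alongside RF, the sketch records which anchoring terms expand to each added term.

```python
def expand_anchoring_terms(anchoring_terms, features, relevance, alpha):
    """Return (RF, expanded_by), where expanded_by maps each added term
    to the set of anchoring terms whose weight to it exceeds alpha."""
    anchors = set(anchoring_terms)
    expanded_by = {}
    for q in anchoring_terms:
        for f in features:
            if f in anchors:
                continue  # anchoring terms are already part of RF
            if relevance(q, f) > alpha:
                expanded_by.setdefault(f, set()).add(q)
    rf = anchors | set(expanded_by)
    return rf, expanded_by

# Hypothetical thesaurus entries: (anchoring term, feature) -> weight.
weights = {
    ("clustering", "k-means"): 3.1,
    ("clustering", "xml"): 0.4,
    ("classification", "decision tree"): 2.6,
    ("classification", "k-means"): 2.2,
}

def relevance(q, f):
    return weights.get((q, f), float("-inf"))

AT = ["clustering", "classification"]
features = ["k-means", "xml", "decision tree", "clustering"]
RF, expanded_by = expand_anchoring_terms(AT, features, relevance, alpha=2.0)
```

With α = 2.0, "k-means" is added through both anchoring terms, "decision tree" through "classification" only, and "xml" is rejected.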

Because RF consists of the anchoring terms originally provided by the target user and relevant terms expanded from the anchoring terms, the importance of the terms in RF should not be identical when they are used to represent each document to be clustered. Accordingly, a TF × IDF-like scheme is adopted to estimate the weight of each expanded term \( f_j \) (i.e., in RF but not in AT) as:

$$ w_j = \sum_{q_i \in ET_j} rw_{q_i,f_j} \times \log\left(\frac{|AT|}{|ET_j|} + \varepsilon\right), $$
(3)

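where \( ET_j \subseteq AT \) is the set of anchoring terms whose expansion includes \( f_j \), and ε is a small positive value that prevents the log component from being 0 (which would otherwise occur when \( |ET_j| = |AT| \)). On the other hand, if \( f_j \in AT \), \( w_j \) is set to the largest weight across all expanded terms derived previously.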
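A small sketch of Eq. (3) follows. The base of the logarithm is not specified in the text, so the natural logarithm is assumed here; the value of ε, the fallback weight used when no term was expanded, and the name `term_weights` are likewise assumptions for illustration.

```python
import math

def term_weights(anchoring_terms, expanded_by, rw, eps=1e-6):
    """Eq. (3) for expanded terms; anchoring terms get the largest weight."""
    n_at = len(anchoring_terms)
    weights = {}
    for f, et in expanded_by.items():
        # IDF-like factor: terms reached from fewer anchoring terms are
        # more discriminative; eps keeps the factor nonzero.
        idf_like = math.log(n_at / len(et) + eps)  # assumption: natural log
        weights[f] = sum(rw[(q, f)] for q in et) * idf_like
    # Assumption: fall back to 1.0 if no term was expanded at all.
    top = max(weights.values(), default=1.0)
    for q in anchoring_terms:
        weights[q] = top
    return weights

# Toy inputs (hypothetical relevance weights).
AT = ["clustering", "classification", "association rules"]
expanded_by = {
    "k-means": {"clustering"},
    "decision tree": {"classification"},
    "pattern": {"clustering", "classification"},
}
rw = {
    ("clustering", "k-means"): 3.0,
    ("classification", "decision tree"): 2.5,
    ("clustering", "pattern"): 2.1,
    ("classification", "pattern"): 2.2,
}
w = term_weights(AT, expanded_by, rw)
```

In this toy case "pattern" is reached from two of the three anchoring terms, so its IDF-like factor is smaller than that of "k-means", which is reached from only one.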

3.4 Document Representation

Subsequently, each document to be clustered is represented using the expanded set of anchoring terms RF. ST-CAC employs the weighted TF × IDF scheme for document representation. Specifically, each document \( d_l \) is described by a feature vector \( \overrightarrow{d_l} \) as:

$$ \overrightarrow{d_l} = \left\langle v_{l1} \times w_1, v_{l2} \times w_2, \ldots, v_{lm} \times w_m \right\rangle, $$
(4)

where m is the total number of terms in RF, \( v_{lj} \) is the standard TF × IDF value of \( f_j \) in \( d_l \), and \( w_j \) is the weight of the term \( f_j \) in RF.
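The weighted representation of Eq. (4) can be sketched as below. Several TF × IDF variants exist; this illustration assumes raw term counts for TF and log(N/df) for IDF, which may differ from the exact variant used in the study. The function names and the toy corpus are hypothetical.

```python
import math

def tfidf(term, doc, corpus):
    """Assumed variant: raw term count times log(N / df)."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

def represent(doc, corpus, rf_terms, weights):
    """Eq. (4): one component v_lj * w_j per term in RF."""
    return [tfidf(t, doc, corpus) * weights[t] for t in rf_terms]

# Toy corpus: each document is a list of extracted terms.
corpus = [
    ["clustering", "k-means", "k-means"],
    ["clustering", "xml"],
    ["xml", "xml"],
]
rf_terms = ["clustering", "k-means"]  # fixed order defines the dimensions
weights = {"clustering": 2.0, "k-means": 1.5}
vec0 = represent(corpus[0], corpus, rf_terms, weights)
vec2 = represent(corpus[2], corpus, rf_terms, weights)
```

A document that contains none of the terms in RF, such as the third toy document, maps to the zero vector and therefore carries no information for the preference-specific clustering.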

3.5 Document Clustering

Finally, the target documents are grouped into distinct clusters on the basis of the expanded set of anchoring terms (i.e., RF) and their respective representation values in each document. ST-CAC adopts the hierarchical clustering approach (specifically, the HAC algorithm with the cosine measure for the similarity estimation between two documents and the group-average link method for similarity measurement between two clusters) as the underlying clustering algorithm.
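The clustering step can be illustrated with a small pure-Python agglomerative procedure. This is a sketch under stated assumptions, not the implementation used in the study: the group-average link is taken as the mean cosine similarity over all cross-cluster document pairs (one common definition), and merging stops once the best intercluster similarity falls below a threshold. The function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def group_average(c1, c2, vectors):
    """Mean pairwise cosine similarity between members of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)

def hac(vectors, threshold):
    """Merge the most similar pair of clusters until none reaches threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        (a, b), best = max(
            (((i, j), group_average(clusters[i], clusters[j], vectors))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters

# Toy document vectors in a two-term RF space.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = hac(vectors, threshold=0.5)
```

Varying the threshold yields coarser or finer clusterings of the same documents; a threshold of 0 merges everything into one cluster in this toy example.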

4 Empirical Evaluation

This section reports our empirical evaluation of the proposed ST-CAC technique using a traditional content-based document clustering technique and the CAC technique as performance benchmarks. In the following, we discuss the evaluation design (including data collection and evaluation criteria), parameter tuning experiments, and important evaluation results.

4.1 Data Collection

The document corpus for our evaluation consists of 434 research articles related to information systems and technologies that were collected through keyword searches (e.g., XML, data mining, robotics) from a scientific literature digital library website (i.e., CiteSeer, http://citeseerx.ist.psu.edu/). For each article in our literature corpus, only the abstract and keywords were used in this evaluation study.

To evaluate the effectiveness of a personalized document clustering technique, we need to categorize our literature corpus from different users’ preferential perspectives. We developed a system to collect individuals’ preferred clustering for the literature corpus. Each experimental subject was asked to subjectively categorize the entire literature corpus manually on the basis of his/her own preference. After clustering, the subject was asked to assign a label for each category. These category labels are then considered as the set of anchoring terms of the subject which will be used as the input to the ST-CAC technique. A total of 33 subjects accomplished the manual clustering of the literature corpus. According to the self-reported estimates of the subjects, each subject spent a minimum of eight hours performing manual document clustering. A summary of the document categories generated by the subjects is provided in Table 1.

Table 1. Summary of subjects’ categories for the literature corpus

4.2 Evaluation Criteria

We employ cluster recall and cluster precision [25], defined according to the concept of associations, to measure the effectiveness of the ST-CAC technique and its benchmark techniques. An association refers to a pair of documents that belong to the same cluster. Accordingly, the cluster recall (CR) and cluster precision (CP) from the viewpoint of a subject \( u_a \) are defined as:

$$ CR = \frac{|CA_a|}{|T_a|} \quad \text{and} \quad CP = \frac{|CA_a|}{|G_a|}, $$
(5)

where \( T_a \) is the set of associations in the categories manually produced by the subject \( u_a \), \( CA_a \) is the set of correct associations that exist in both the clusters generated by a document-clustering technique and the categories produced by \( u_a \), and \( G_a \) is the set of associations in the clusters generated by the document-clustering technique.
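The two measures in Eq. (5) can be computed directly from sets of document pairs. The sketch below uses hypothetical function names and toy document identifiers; returning 0.0 when a denominator is empty is an assumption made to keep the example total.

```python
from itertools import combinations

def associations(clusters):
    """All unordered document pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}

def cluster_recall_precision(generated, manual):
    g = associations(generated)  # G_a
    t = associations(manual)     # T_a
    ca = g & t                   # CA_a: correct associations
    cr = len(ca) / len(t) if t else 0.0  # assumption: 0.0 if no associations
    cp = len(ca) / len(g) if g else 0.0
    return cr, cp

# Toy example with five documents.
manual = [[1, 2, 3], [4, 5]]     # categories produced by a subject
generated = [[1, 2], [3, 4, 5]]  # clusters produced by a technique
cr, cp = cluster_recall_precision(generated, manual)
print(cr, cp)  # 0.5 0.5
```

The subject's categories contain four associations and the generated clusters also contain four; the pairs {1, 2} and {4, 5} appear in both, giving a recall and a precision of 0.5 each.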

To address the inevitable trade-offs between cluster recall and cluster precision, precision/recall trade-off (PRT) curves are employed. A PRT curve represents the effectiveness of a document clustering technique with different intercluster similarity thresholds.

4.3 Parameter Tuning

We randomly select 10 users from the 33 subjects to determine the appropriate value of each parameter involved in the three document clustering techniques (i.e., a traditional content-based approach, the CAC approach, and our proposed ST-CAC approach) examined. The overall clustering effectiveness of each technique in the tuning experiments is calculated by averaging the cluster recall and cluster precision obtained from the ten subjects.

The traditional content-based document clustering (TCC) approach involves one parameter, the number of features (k) used for document representation. We vary k from 200 to 2,000 in increments of 200 and obtain the best performance when k is equal to 2,000. On the other hand, both the CAC and ST-CAC techniques include the parameters \( \delta_{DF} \) (the threshold for removing infrequent features in the feature extraction and selection phase) and α (the threshold for determining whether a term should be expanded in the anchoring term expansion phase). We first investigate α from 1 to 10 in increments of 0.5. The best values of α for CAC and ST-CAC are 2.5 and 2, respectively. Subsequently, we examine \( \delta_{DF} \) from 3 to 10 in increments of 1 and obtain the best \( \delta_{DF} \) values of 10 and 9 for CAC and ST-CAC, respectively.

4.4 Comparative Evaluation Results

Using the parameter values determined previously, we evaluate the effectiveness of the ST-CAC technique and its benchmark techniques. In this experiment, all 33 subjects are used for evaluation. The comparative evaluation results are shown in Fig. 2. The proposed ST-CAC technique achieves better clustering effectiveness than the TCC and CAC techniques do. Moreover, the CAC technique also outperforms the TCC technique. These results suggest that both the ST-CAC and CAC techniques are able to generate personalized document clusters according to the target user’s preference expressed as a set of anchoring terms. Furthermore, a statistical-based thesaurus constructed from social media yields better clustering performance than one constructed from a general-purpose search engine.

Fig. 2. Comparative evaluation results

5 Conclusion and Future Research Directions

Social media is now an excellent source for gathering user intelligence to support various business intelligence applications. Motivated by this observation, this paper investigates the effectiveness of the social tagging system (aka folksonomy) in enhancing an important document management task, i.e., personalized document clustering. Specifically, we adopt the CAC technique proposed by Yang and Wei [36] as our underlying algorithm and incorporate a leading social bookmarking site (i.e., Delicious) to design the ST-CAC technique, which uses the folksonomy in Delicious to construct a statistical-based thesaurus for personalized document clustering. According to our empirical evaluation results, the ST-CAC and CAC techniques are better able to generate personalized document clusters than a traditional content-based approach. Moreover, the statistical-based thesaurus constructed from social media also slightly outperforms that generated from a general-purpose search engine.

Some ongoing and future research directions are briefly discussed as follows. First, Delicious, which is a social bookmarking service for webpages, is adopted as the social media source for statistical-based thesaurus construction. Since our evaluation corpus is collected from a scientific literature database, it is essential to evaluate the performance of the proposed ST-CAC technique with an alternative social bookmarking service (e.g., CiteULike) that allows users to share citations to academic papers. Second, only the PMI measure is applied for statistical-based thesaurus construction. It would be interesting to implement and empirically test other measures for statistical-based thesaurus construction.