1 Introduction

Data clustering techniques are widely used for organizing data into meaningful partitions based on the similarity features of data objects. The tweet data clustering problem applies clustering to tweets (the data objects) whose labels are initially unknown. Well-known topic clustering techniques (Xu and Wunsch 2005; Yi et al. 2020; Jose and Babu 2019) describe the modelling of tweets and the generation of tweet features as a bag-of-words over topics rather than terms (Pessiot et al. 2010). Similarity among tweets is measured using distance measures such as Euclidean and cosine (Shirkhorshidi et al. 2015). VAT (Bezdek and Hathaway 2002) and cVAT (Prasad et al. 2019b, a) use the Euclidean and cosine measures, respectively, when evaluating the clustering tendency of tweet data. Cosine measures the similarity of text documents more accurately than Euclidean (Oghbaie and Zanjireh 2018); thus, cVAT performs better than VAT for the assessment of clusters. In cVAT, cosine similarity among tweets is computed with a single viewpoint. MVS-VAT (Prasad et al. 2019b, a) is a pre-cluster assessment technique that computes the cosine similarity among tweets using multiple viewpoints.

MVS-VAT is a more promising technique for the assessment of clusters than VAT and cVAT (Therese and Lingam 2017). For a set of N tweet documents, MVS-VAT computes the similarity between two tweet documents as the average similarity with respect to (N-2) viewpoints. It is therefore computationally more expensive than other visual methods. To reduce the computational requirements, a sampling technique is proposed that selects a few viewpoints rather than all (N-2) viewpoints. The sample viewpoints are selected from inter-clusters of data objects: the cosine similarity between two tweets (t1, t2) is measured with respect to sample viewpoints that belong to clusters other than those of t1 and t2. Because this selection depends on the inter-cluster viewpoints of data objects, it is known as sampling inter-cluster viewpoints (SICV), and the resulting sampling-based extension of MVS-VAT is known as SICV-MVS-VAT. The processing steps of the proposed SICV-MVS-VAT are illustrated in Fig. 1. After pre-processing the tweet documents, cosine similarity features for every pair of tweet documents are computed with the selected sample viewpoints.
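
The averaging described above can be written out as a hedged sketch. The notation below is an assumption made for illustration (the symbols MVS, S and the plain cosine form are not taken from the paper; some formulations of multi-viewpoint similarity also weight each term by the norms of the difference vectors). The first line averages over all (N-2) viewpoints, and the second replaces them with a small sample S of inter-cluster viewpoints.

```latex
\[
\mathrm{MVS}(t_i, t_j) \;=\; \frac{1}{N-2} \sum_{\substack{h=1 \\ h \neq i,\, j}}^{N}
\cos\!\left(t_i - t_h,\; t_j - t_h\right),
\qquad
\cos(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}
\]
\[
\mathrm{MVS}_{S}(t_i, t_j) \;=\; \frac{1}{|S|} \sum_{t_h \in S}
\cos\!\left(t_i - t_h,\; t_j - t_h\right),
\qquad
S \subset \{\, t_h : h \neq i, j \,\}, \quad |S| \ll N-2
\]
```

Under this reading, evaluating all N(N-1)/2 pairs needs on the order of N^3 cosine evaluations with (N-2) viewpoints, but only on the order of N^2 |S| with sampled viewpoints.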

Fig. 1 Processing Steps of Proposed SICV-MVS-VAT

The contributions are highlighted as follows:

  1. Develop a sampling technique for the selection of inter-cluster viewpoints of data objects

  2. Estimate the clustering tendency of the tweet dataset with the selected sample viewpoints

  3. Effectively visualize the cluster images of tweet documents

  4. Develop SICV-MVS-VAT for the best visual assessment of clusters of social data (or tweets)

  5. Derive the crisp partitions of the visual clusters to find the cluster labels of tweet objects and thereby generate the social data clusters

The remainder of the paper is organized as follows: Sect. 2 presents the related visual methods, Sect. 3 describes the proposed sampling multi-viewpoint-based visual method, Sect. 4 discusses the experimental study and performance, and Sect. 5 presents the conclusion and scope.

2 Related visual methods

For tweet data clustering (Sechelea et al. 2016; Rehioui and Idrissi 2019), topic models (Amelio and Pizzuti 2015; Xu et al. 2019; Hu et al. 2012; Ismail et al. 2018) are required for modelling tweets with respect to topics instead of terms, which avoids the problem of data sparsity. Topic models such as non-negative matrix factorization (NMF) (Lee and Seung 2001), Latent Dirichlet Allocation (LDA) (Blei et al. 2003), Latent Semantic Indexing (LSI) (Deerwester et al. 1990), and Probabilistic Latent Semantic Indexing (PLSI) (Hofmann 1999) are widely used for modelling tweets and extracting their features in bag-of-words (Wallach 2006) format. The features of each tweet document are treated as a single data object, and these data objects are used for clustering the tweet documents. Clustering of text documents (or tweet documents) can be performed through TF-IDF analysis (Neogi et al. 2020); however, such analysis suffers from the data sparsity problem. Topic models are well suited to text clustering because they derive topics instead of TF-IDF weights, and they are less sensitive to data sparsity than TF-IDF-based clustering approaches. Topic models are recommended for big text data clustering problems (Sukhija et al. 2016). LDA is a probabilistic, generative topic model that enables unsupervised clustering of large sets of text documents; each document is represented as a mixture of topics, and each topic as a distribution of word probabilities (Ramage et al. 2009). NMF is a promising model when compared to other topic models. It is capable of extracting document information in terms of valid topics without any prior information (Suri and Roy 2017). The key objective of NMF is to decompose the input matrix (the document-term matrix) into two matrices, namely the document-topic matrix and the topic-term matrix, so that it extracts the topic information for the set of documents.
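
The NMF decomposition described above can be stated compactly. The symbols are assumptions introduced for illustration: V is the document-term matrix for N documents and M terms, W is the document-topic matrix, H is the topic-term matrix, and K is the chosen number of topics. The squared Frobenius objective shown is one of the standard NMF objectives; the paper does not say which objective it uses.

```latex
\[
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{N \times M},\;
W \in \mathbb{R}_{\ge 0}^{N \times K},\;
H \in \mathbb{R}_{\ge 0}^{K \times M}
\]
\[
\min_{W \ge 0,\; H \ge 0} \; \lVert V - W H \rVert_F^{2}
\]
```

Row i of W then serves as the topic-based feature vector of tweet i, which is the data object passed on to the visual methods.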

Cluster tendency is one of the crucial problems in the data clustering (Tang et al. 2011; Suleman Basha et al. 2019; Pushpalatha et al. 2020; Devisetty et al. 2019) of text documents. Many pre-cluster tendency methods have been surveyed (Mahallati et al. 2019; Hu and Hathaway 2008; Park et al. 2016; Kumar et al. 2015; Kumar and Bezdek 2020; Varish et al. 2020) for finding the cluster tendency. VAT and iVAT (Havens and Bezdek 2011) are recognized as the best choices for assessing cluster tendency; VAT and iVAT are suited to normal and path-shaped datasets, respectively. They first measure the dissimilarity features using distance metrics (Blei et al. 2003) and then re-order the indices of the data objects (or tweets) to obtain their visual images, as shown in Fig. 2.

Fig. 2 Visual Method: VAT Visual Images

Fig. 2 illustrates the clusters as dark blocks in the corresponding VAT images (Srinivas 2018). The cluster tendency (or prior information about the clusters) is determined by counting the dark blocks along the diagonal of the visual images. Based on spectral features, the affinity values are computed for the dissimilarity features of data objects.
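
The re-ordering and block counting described above can be sketched as follows. This is a minimal illustration of the VAT ordering (a nearest-neighbour chaining similar to Prim's algorithm), not the authors' code. The helper `count_diagonal_blocks` and its `threshold` parameter are hypothetical simplifications: they only count jumps between consecutive re-ordered objects and do not reproduce the image-based assessment used in the paper.

```python
def vat_order(D):
    """Return the VAT ordering of object indices for a dissimilarity matrix D."""
    n = len(D)
    # Start from an object that takes part in the largest dissimilarity.
    start = max(range(n), key=lambda i: max(D[i]))
    order = [start]
    remaining = set(range(n)) - {start}
    # nearest[j]: smallest dissimilarity from j to any object already ordered.
    nearest = {j: D[start][j] for j in remaining}
    while remaining:
        # Append the unordered object closest to the ordered set.
        j = min(remaining, key=lambda q: (nearest[q], q))
        order.append(j)
        remaining.remove(j)
        for q in remaining:
            nearest[q] = min(nearest[q], D[j][q])
    return order


def reorder(D, order):
    """Build the re-ordered dissimilarity matrix (RDM)."""
    return [[D[a][b] for b in order] for a in order]


def count_diagonal_blocks(R, threshold):
    """Hypothetical heuristic: a new block starts at each large jump
    between consecutive objects of the re-ordered matrix R."""
    return 1 + sum(1 for i in range(len(R) - 1) if R[i][i + 1] > threshold)


# Toy data: five points on a line forming two groups, around 0 and around 10.
values = [0.0, 10.0, 0.5, 10.5, 1.0]
D = [[abs(a - b) for b in values] for a in values]
order = vat_order(D)
R = reorder(D, order)
blocks = count_diagonal_blocks(R, threshold=5.0)
```

On this toy input the three points near 0 are placed first and the two points near 10 last, so the re-ordered matrix shows two dark diagonal blocks.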

VAT and cVAT are procedurally the same; however, they compute the dissimilarity features with the Euclidean and cosine measures, respectively. Cosine measures the angle between two data object vectors after normalizing by their magnitudes, whereas Euclidean measures the magnitude of the difference between the two vectors. Thus, cVAT gives a better assessment of clusters with cosine-based dissimilarities than VAT.
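
A small numerical illustration of this difference, using made-up topic-count vectors (the three vectors and their names are assumptions for the example only): two tweets with the same topic proportions but different lengths are identical under cosine, yet far apart under Euclidean.

```python
import math


def cosine_dissimilarity(u, v):
    """1 - cos(angle between u and v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Length of the difference vector u - v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


short_tweet = [1.0, 2.0, 0.0]   # same topic mix as long_tweet, fewer words
long_tweet = [3.0, 6.0, 0.0]
other_topic = [0.0, 1.0, 2.0]   # different topic mix, similar length

cos_same = cosine_dissimilarity(short_tweet, long_tweet)    # approximately 0
cos_other = cosine_dissimilarity(short_tweet, other_topic)  # approximately 0.6
euc_same = euclidean_distance(short_tweet, long_tweet)      # sqrt(20)
euc_other = euclidean_distance(short_tweet, other_topic)    # sqrt(6)
```

Here Euclidean ranks the off-topic tweet as closer to `short_tweet` than the same-topic tweet, while cosine ranks them the other way round, which is the behaviour that motivates cVAT for text.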

Spectral features-based VAT is known as SpecVAT (Wang et al. 2008). SpecVAT finds the affinity matrix for the set of data objects and then derives the Laplacian matrix to determine the spectral features (eigen features) of the data objects. The cluster tendency is assessed from the VAT image of the spectral features, which also shows the clusters effectively for the set of data objects. The cosine measure is more effective than Euclidean for measuring the distance between data objects in text data clustering (Nguyen et al. 2010). It computes the cosine similarity among the objects with reference to a single viewpoint. cVAT uses the cosine measure for the assessment of cluster tendency and therefore assesses the cluster tendency using a single viewpoint only. Recently, a multi-viewpoint-based cosine similarity was developed and used in MVS-VAT (Reddy and Prasad 2016). Unlike cVAT, MVS-VAT assesses the cluster tendency using multiple viewpoints, and it also produces efficient data clustering results for tweet datasets. MVS-VAT was effectively applied to tweet datasets of different topics in (Prasad et al. 2019b, a) for experiments on health data clustering. However, it suffers from high computational time because it measures the similarity among tweets with respect to (N-2) viewpoints, where N is the total number of tweet documents. The state of the art in visual methods therefore calls for a sampling technique that selects a few viewpoints rather than all (N-2) viewpoints, and a sampling-based visual method needs to be developed to overcome the problem of high computational time. The proposed work, including the sampling procedure for the selection of viewpoints, is discussed in the following section.

3 Proposed sampling based multi viewpoints visual method

Cluster tendency is a major issue in tweet data clustering. This problem is effectively handled by the visual methods VAT, cVAT, and MVS-VAT. In this paper, a sampling strategy is developed that selects the best samples of inter-cluster viewpoints (SICV) as an extension of MVS-VAT, referred to as the SICV-MVS-VAT method. Algorithm 1 shows the procedural steps of SICV-MVS-VAT.

Algorithm 1 Procedural steps of SICV-MVS-VAT

Samples of data objects are selected from the inter-cluster data objects. The inter-clusters are identified by selecting the most dissimilar data objects as their centroids; next, the nearest data objects of each inter-cluster are identified as those least dissimilar to its centroid. These inter-clusters are generated for the selection of the best samples in the proposed work.
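
One way to read this pre-estimation is as a farthest-point (maximin) selection of centroids followed by nearest-centroid assignment. The sketch below follows that reading and is not the authors' Algorithm 1: the function name `pre_estimate_inter_clusters`, the parameter `k` (assumed number of pre-clusters), the `seed`, and the use of cosine dissimilarity are all assumptions made for illustration.

```python
import math
import random


def cosine_dissimilarity(u, v):
    """1 - cos(angle between u and v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def pre_estimate_inter_clusters(X, k, seed=0):
    """Pick k mutually distant centroids and assign every object to its
    nearest centroid. Returns (centroid indices, cluster label per object)."""
    rng = random.Random(seed)
    n = len(X)
    start = rng.randrange(n)  # a random object is used only as a reference
    # dist[i]: dissimilarity of object i to its closest reference so far.
    dist = [cosine_dissimilarity(X[start], x) for x in X]
    centroids, labels = [], [0] * n
    for c in range(k):
        # The most deviated object becomes the next centroid.
        far = max(range(n), key=lambda i: dist[i])
        centroids.append(far)
        # Update distances; an object joins cluster c if centroid c is nearer.
        for i in range(n):
            d = cosine_dissimilarity(X[far], X[i])
            if c == 0 or d < dist[i]:
                dist[i], labels[i] = d, c
    return centroids, labels


# Toy topic-feature vectors forming three groups of two tweets each.
X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
     [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
     [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
centroids, labels = pre_estimate_inter_clusters(X, k=3)
```

A chosen centroid has (near) zero distance to itself, so it is not selected again, which corresponds to excluding centroids from further processing.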

Feature extraction of tweets is performed by deriving a bag-of-features. The topic model derives the tweet features with respect to topics rather than terms. Thus, the data sparsity problem is handled and simplified by topic-based feature extraction for a wide range of tweet documents. This process is denoted in Step 1, and the features of the N tweets are derived as {T1, T2, …, TN}. Pre-estimations of the inter-clusters are derived in Step 2 to Step 6. These steps select the most deviated data objects based on distances: a random object is selected in Step 2, and the most deviated data object (considered as the centroid of one of the assumed clusters) is selected in Step 3 and Step 4 by finding the data object at the maximum distance. Step 5 updates the distances of the data objects to the corresponding centroid. In Step 6, the selected centroids are excluded from further processing, and new centroids are considered from the remaining data objects. Step 7 shows the multi-viewpoint-based cosine similarity (MVS) computation with respect to a few selected viewpoints, as per the sample size ‘np sample’. MVS is applied to every pair of tweet documents (or data objects) (Di, Dj); the viewpoints vp are selected from the inter-clusters with respect to Di and Dj, i.e. clusters other than those containing Di or Dj, using the inter-cluster information obtained in the earlier steps. The MVS of a data pair gives the similarity value of that pair, which is normalized between 0 and 1; the dissimilarity matrix (M) is obtained in Step 8 by subtracting the normalized similarity value from 1. In Step 9, the visual method VAT is applied to M to obtain the re-ordered dissimilarity matrix (RDM), which re-aligns the data according to the ordering of the dissimilarity features. Step 10 displays the image of the RDM for visualizing the cluster tendency (‘k’). Finally, the cluster labels of the tweet documents are computed in Step 11 from the crisp partition matrix of the visual clusters. The crisp partitions are derived from the squareness properties of the dark blocks that appear along the diagonal of the VAT image. The squareness property of the dark blocks is derived from the difference in pixel values between dark and non-dark blocks. The significance of the proposed work is that it assesses the cluster tendency well and generates efficient clustering results with the technique of sampling inter-cluster viewpoints (SICV). The efficiency of the proposed methodology (SICV-MVS-VAT) for tweet data clustering is demonstrated experimentally in the following section.
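
Steps 7 and 8 can be sketched as below, assuming the pre-cluster labels are already available. This is an illustrative reading rather than the authors' implementation: the function name, the `n_sample` and `seed` parameters, the random choice of viewpoints, the min-max normalization, and the fallback to the plain cosine when no other cluster exists are all assumptions.

```python
import math
import random


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def sicv_mvs_dissimilarity(X, labels, n_sample, seed=0):
    """Dissimilarity matrix M = 1 - normalized sampled multi-viewpoint similarity."""
    rng = random.Random(seed)
    n = len(X)
    S = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Candidate viewpoints lie outside the clusters of both objects.
            pool = [h for h in range(n)
                    if labels[h] != labels[i] and labels[h] != labels[j]]
            viewpoints = rng.sample(pool, min(n_sample, len(pool)))
            if viewpoints:
                total = 0.0
                for h in viewpoints:
                    di = [a - b for a, b in zip(X[i], X[h])]
                    dj = [a - b for a, b in zip(X[j], X[h])]
                    total += cosine(di, dj)
                sim = total / len(viewpoints)
            else:
                sim = cosine(X[i], X[j])  # assumed fallback: origin as viewpoint
            S[i][j] = S[j][i] = sim
    # Assumed min-max normalization of the similarities to [0, 1].
    pairs = [S[i][j] for i in range(n) for j in range(i + 1, n)]
    lo, hi = min(pairs), max(pairs)
    span = (hi - lo) or 1.0
    return [[0.0 if i == j else 1.0 - (S[i][j] - lo) / span for j in range(n)]
            for i in range(n)]


X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
     [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
     [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
labels = [0, 0, 1, 1, 2, 2]  # pre-estimated inter-cluster labels
M = sicv_mvs_dissimilarity(X, labels, n_sample=1)
```

Even with a single sampled viewpoint per pair, tweets from the same group come out less dissimilar than tweets from different groups on this toy input; the resulting M is what a VAT re-ordering would then take as input.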

4 Experimental study and performance discussion

The existing visual methods VAT, cVAT, and MVS-VAT and the proposed method SICV-MVS-VAT are evaluated on tweet documents extracted using standard health keywords and the TREC2015 and TREC2017 keywords. Dataset descriptions and the performance evaluation of the visual methods are presented in this section to demonstrate the efficiency of the proposed method in assessing cluster tendency and in its data partitioning results.

4.1 Tweet datasets description

Table 1 describes the tweets in terms of the total number of topics and the keywords or health topics. In the experiments, various subsets of tweets are merged according to the mentioned keywords. The standard health keywords used in the experiments correspond to trending or most discussed tweets. The benchmark TREC2015 and TREC2017 keywords are also used for tweet extraction.

Table 1 Description of Tweet Datasets for the Topics 2-Topics to 15-Topics

4.2 Topic models

The popular topic models NMF, LDA, LSI, and PLSI are used to model the extracted tweets. During modelling, the bag-of-words of the tweets is extracted for the purpose of feature extraction. After feature extraction, the dissimilarity matrix and the re-ordered dissimilarity matrix are computed in the existing and proposed visual methods. The tweet data clustering results are explored in the following sub-section in a comparative study of the visual methods for the four topic models.

4.3 Performance discussion

The comparative visualizations of cluster tendency are shown in Figs. 3, 4, 5, and 6 for TREC2015 (two topics), TREC2017 (three topics), standard health tweets (10 topics), and standard health tweets (15 topics), respectively. The figures show the resulting visual images of the existing methods and the proposed SICV-MVS-VAT. Each figure has three variations of visual images, derived from the VAT, MVS-VAT, and SICV-MVS-VAT methods, respectively. The key difference among these images is the visual appearance and clarity of the square-shaped dark blocks along the diagonal for the three methods in Figs. 3 and 4. From the experiments with VAT, MVS-VAT, and SICV-MVS-VAT, it is observed that the proposed SICV-MVS-VAT effectively determines the similarity features among the data objects with inter-cluster viewpoints.

Fig. 3 Cluster Tendency Assessment for Visual Methods (TREC2015-Two Topics)

Fig. 4 Cluster Tendency Assessment for Visual Methods (TREC2017-Three Topics)

Fig. 5 Cluster Tendency Assessment for Visual Methods (Standard Health Tweets-Ten Topics)

Fig. 6 Cluster Tendency Assessment for Visual Methods (Standard Health Tweets-Fifteen Topics)

Experiments are conducted under different topic models, with similarity features computed using the cosine and Euclidean distance metrics. Cosine-based assessment of text data clusters proved to be the best when compared to Euclidean for every topic model. The experimental results of the VAT-based, MVS-VAT-based, and SICV-MVS-VAT-based topic models under the Euclidean and cosine metrics are illustrated in Figs. 5 and 6. These results indicate that cosine-based cluster assessment is better than Euclidean under every topic model, and that SICV-MVS-VAT visualizes the best cluster assessment compared to the other methods.

The similarity among the tweet documents is computed with the sampling of inter-cluster viewpoints. The cluster tendency is obtained by counting the dark blocks in the visual method images. Greater clarity of the visual clusters indicates proper and accurate social data clusters; for text tweets, cosine shows better cluster quality than Euclidean.

The cosine-based assessment of cluster tendency is performed with a single viewpoint, namely the origin, whereas the proposed methodology uses multiple viewpoints during the assessment of cluster tendency. A more informative assessment is obtained with multiple viewpoints than with a single viewpoint.

Therefore, SICV-MVS-VAT shows better cluster quality than the other visual methods, as observed in Figs. 3 to 6. With the derivation of the crisp partitions, the complete clustering results are obtained. Table 2 shows the sample tweet data. Tables 3 and 4 present the performance evaluation of the visual methods using five performance measures mentioned in (Datta et al. 2018; Vinh et al. 2010; Bhatnagar et al. 2018).

Table 2 Sample Tweets (2-topics to 4-Topics)
Table 3 Performance Evaluation of Visual Methods Using CA and NMI
Table 4 Performance Evaluation of Visual Methods Using Precision, Recall and F-Measure
Table 5 Computation Time (in Seconds) for SICV-MVS-VAT and MVS-VAT
Table 6 Memory Allocation (in kbs) for SICV-MVS-VAT and MVS-VAT

Five performance measures are used for the evaluation: cluster accuracy (CA) (Pattanodom et al. 2016), normalized mutual information (NMI) (Amelio and Pizzuti 2015), precision (P), recall (R), and F-measure (F) (Xu et al. 2019; Li et al. 2016).
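
As an illustration of one of these measures, NMI between the true topic labels and the labels derived from the visual clusters can be computed as sketched below. The normalization by the arithmetic mean of the two entropies is an assumption (other normalizations, such as the geometric mean, are also in use), and the paper does not state which variant it adopts.

```python
import math
from collections import Counter


def normalized_mutual_information(true_labels, pred_labels):
    """NMI = 2 * I(T; P) / (H(T) + H(P)), using natural logarithms."""
    n = len(true_labels)
    count_t = Counter(true_labels)
    count_p = Counter(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information from the joint and marginal label frequencies.
    mi = sum((c / n) * math.log((c / n) / ((count_t[t] / n) * (count_p[p] / n)))
             for (t, p), c in joint.items())
    total_entropy = entropy(count_t) + entropy(count_p)
    # Both partitions trivial (one cluster each): treat as perfect agreement.
    return 1.0 if total_entropy == 0 else 2.0 * mi / total_entropy


same_up_to_renaming = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI does not depend on how the clusters are numbered, so a clustering that matches the topics up to renaming scores 1, while an unrelated split scores 0.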

With cosine similarity, the dark blocks of the visual images show the clusters with greater visual clarity than with Euclidean. In the cosine-based visual method, the similarity features are computed with reference to a single viewpoint (i.e. the origin).

The recent visual method MVS-VAT uses (N-2) viewpoints to measure the cosine similarity between any two documents among the ‘N’ documents. Hence, it shows greater clarity of visual clusters than the single-viewpoint cosine metric. However, it takes high computation time because the similarity of each pair among the N documents is computed with respect to (N-2) documents. The proposed method (SICV-MVS-VAT) also uses multiple viewpoints for a good assessment of visual clusters, but it uses sample viewpoints from pre-generated inter-clusters instead of (N-2) viewpoints. Thus, it takes less computational time and memory than MVS-VAT.

Figures 7 and 8 visualize the performance comparison using the CA and NMI measures, respectively. They show that the proposed visual method with NMF (SICV-MVS-VAT-NMF) achieves excellent performance across the different topic datasets, i.e. 2-topics to 15-topics.

Fig. 7 Performance Comparison of Visual Methods for the Topics Datasets (Using Clustering Accuracy)

Fig. 8 Performance Comparison of Visual Methods for the Topics Datasets (Using NMI)

The NMF topic model supports the proposed sampling-based visual method better than the other topic models. The corresponding performance values are obtained experimentally and presented in the performance comparison charts.

Based on the observed time and space values of SICV-MVS-VAT and MVS-VAT (Tables 5 and 6), the proposed SICV-MVS-VAT takes less computational time and less memory than MVS-VAT because it uses selected sample viewpoints instead of all (N-2) viewpoints. Hence, the proposed SICV-MVS-VAT outperforms the other visual methods with respect to the quality of the tweet data clustering results and the time and space requirements.

5 Conclusion and scope

Visual methods are well suited for assessing pre-cluster information (the cluster tendency ‘k’) and generating tweet data clustering results. Cosine-based pre-cluster estimation proved more reasonable than cluster assessment with Euclidean in text clustering. However, it considers only a single viewpoint when computing similarity features among tweets. Another visual method, MVS-VAT, assesses the cluster tendency with multiple viewpoints rather than a single viewpoint, and thus produces more effective clustering results than cosine-based visual methods. To overcome the time and space issues of MVS-VAT, the proposed work uses the sampling of inter-cluster viewpoints. The in-depth experimental analysis showed that the proposed SICV-MVS-VAT is an efficient technique for tweet data clustering.