1 Introduction

Many Artificial Intelligence (AI) and Machine Learning (ML) problems involve very high-dimensional datasets. This high dimensionality leads to the curse of dimensionality, where the performance of ML models degrades as the number of data dimensions increases [9]. The curse of dimensionality is particularly severe when, in addition to the large number of features, only a small number of samples is available [5, 9, 10, 16]. Specifically, as theoretically established in [5, 9, 10, 16], obtaining high classification performance in high-dimensional spaces requires a very large number of samples (e.g., \(\sim \)10\(^5\) samples). Therefore, for problems with few samples (say, 100–1500), extracting a set of useful features from the high-dimensional data is very challenging [23]. Consequently, in this paper, we address the well-known dimensionality reduction problem specifically in the high-dimension, low-sample-size setting. This important class of problems has many engineering and scientific applications such as on-device mobile applications, remote sensing, fMRI processing, sustainability, finance, and biological systems.

Dimensionality reduction can also be seen as an automatic feature extraction problem which yields low-dimensional features from the initial high-dimensional raw data. Indeed, a major goal in AI and representation learning is to enable machines to learn such useful, low-dimensional features automatically from the raw data rather than using manually engineered features. Towards this automatic feature learning, deep learning has emerged for vision, speech, and natural language processing (NLP) applications [3]. However, deep learning often needs enormous training datasets (\(\sim \)10\(^5\)–10\(^6\) samples) and overfits significantly on problems with small sample sizes and high dimensions [4]. Therefore, for small sample-size problems, new dimensionality reduction techniques are needed to automatically extract useful features from the initial high-dimensional data.

Several outstanding dimensionality reduction techniques exist, ranging from linear methods such as Principal Component Analysis (PCA) and Probabilistic PCA (PPCA) to graph-based non-linear techniques like Maximum Variance Unfolding (MVU) [21] and Isomap [19] (see [12] for a review). Other techniques include Kernel PCA, deep learning autoencoders [8], stochastic proximity embedding (SPE) [1], stochastic neighbor embedding (SNE) [7], and t-distributed SNE (t-SNE) [11]. The graph-based techniques build a neighborhood graph to learn a lower-dimensional embedding [12]. For example, Isomap builds a neighborhood graph based on a fixed parameter which controls the neighborhood size, i.e., the number of nearest neighbors of each node in the graph.

In the real world, however, networks constructed from raw data often possess complex characteristics such as communities (i.e., groups of tightly connected nodes) and structural equivalence (i.e., nodes with similar roles in the network, e.g., hubs) [17]. Such network characteristics must therefore be accounted for when computing network neighborhoods for dimensionality reduction, as they can lead to more accurate feature learning. Specifically, the neighborhood of a given node must depend on the community it belongs to. By contrast, prior techniques like Isomap assume a rigid (fixed) neighborhood for all nodes in the network. Similarly, stochastic graph-based methods such as SNE/t-SNE use a fixed parameter called perplexity, which measures the effective number of neighbors of each node. Hence, the complex community structure hidden within the raw data has not been explicitly taken into account in prior dimensionality reduction methods.

Recently, representation learning has been proposed in the context of learning features on networks while accounting for community structure, e.g., node2vec [6], DeepWalk [15], community preserving embedding [20], LINE [18], etc. We refer to this problem space as “Representation Learning on Networks” throughout the paper. However, the networks considered in this prior art do not come from high-dimensional raw data, but rather from social networks (e.g., blogs, YouTube, Flickr), authorship networks, or Wikipedia webpage networks. Hence, prior research on representation learning on networks does not directly address the problem of dimensionality reduction. In contrast, we argue that, by capturing communities and structural equivalence, ideas from the “representation learning on networks” problem space can have significant implications for dimensionality reduction. Therefore, in this paper, we address the following two key questions:

  1.

    Can representation learning on networks have more general implications in dimensionality reduction if we leverage the hidden communities in raw data?

  2.

    If so, how can we best construct a network from high-dimensional data to optimally capture its latent communities for dimensionality reduction?

To answer these questions, we propose FeatureNet, a new community-based dimensionality reduction framework. We further contribute a new method to construct a network directly from the raw data while explicitly revealing its hidden communities; this enables us to employ network representation learning ideas to learn low-dimensional community- and structural equivalence-based features from this network, thereby reducing the dimensions of the dataset.

We evaluate our proposed approach on five very diverse application areas: handwritten digit recognition, biology, physical sciences, NLP, and computational sustainability. As mentioned earlier, our datasets are relatively small, with sizes mostly between 100 and 1500 samples. This is because automatic feature engineering for relatively small datasets remains an important problem, as deep learning models often overfit such datasets.

To summarize, we make the following key contributions:

  1.

    We propose FeatureNet, a novel community-based dimensionality reduction framework. We also propose a new method to construct a network directly from high-dimensional raw data, thereby revealing its hidden communities explicitly. To the best of our knowledge, we are the first to employ community-based representation learning ideas for dimensionality reduction.

  2.

    We evaluate FeatureNet on eight datasets spanning five diverse real-world applications: handwritten digit recognition, biology, physical science, NLP, and computational sustainability. Our new sustainability datasets can be used by the research community to further benchmark dimensionality reduction.

  3.

    We further compare FeatureNet against ten of the most notable dimensionality reduction techniques, such as PCA, deep learning autoencoders, t-SNE, and Isomap. Overall, the proposed FeatureNet significantly outperforms (in terms of accuracy) all of these techniques on the above diverse datasets, by 3%–40%.

  4.

    Finally, we introduce a new challenging computational sustainability problem as a case study: Given high-dimensional Carbon Emissions data, how can we learn optimal low-dimensional features to best classify the GDP growth of nations? Again, FeatureNet achieves state-of-the-art performance.

Next, we review the related work reported in the literature.

2 Related Work

As mentioned earlier, prior techniques such as node2vec, DeepWalk, community-preserving embedding, and LINE explicitly require a network as input [6, 15, 18, 20]. In contrast, we start with high-dimensional (raw) data that does not exist in a predefined network form and learn the network structure directly from the raw data to explicitly reveal its latent communities. For example, consider Arcene, a high-dimensional cancer benchmark dataset, where each sample comes from a patient and the features specify the abundance of certain proteins. This dataset does not have a predefined network structure like social networks do. Hence, methods like node2vec, LINE, etc. are not a natural choice. This is where our key contribution lies: representing any kind of dataset as a network that reveals the hidden communities in the raw data; we then use this network for dimensionality reduction using ideas from community-based feature learning.

To summarize, prior work on representation learning on networks focuses only on network-based classification tasks (e.g., classifying the interests of a blogger based on communities/homophily in a blog social network). Our work, however, generalizes this “network representation learning” space to any classification problem with high-dimensional data, rather than restricting it to network classification tasks (see Supplementary Sect. 1 for detailed related work).

3 Proposed Approach

Given a classification problem \(\{X,y\}\), let \(X \in \mathbb {R}^{n \times p}\) denote the original dataset with n samples and p features, and let \(y \in \mathbb {R}^{n \times 1}\) denote the labels. Also, let \(x^{(i)} \in \mathbb {R}^{p \times 1}\) be the i-th sample in X. Then, dimensionality reduction is a function \(f: \mathbb {R}^{n\times p}\rightarrow \mathbb {R}^{n \times d}\), where d is the number of features in the reduced space (\(d \ll p\)). Since prior neighborhood graph-based methods [11, 19] do not take communities into account, they lose important network information.

Let \(\mathcal {X}\) be the low-dimensional mapping of X, and let \(\mathcal {X}^{(i)}\) be the i-th sample of \(\mathcal {X}\) (i.e., the reduced representation of the initial \(x^{(i)}\)). Then, the problem is to find \(\mathcal {X}\) which accounts for the latent community structure and structural equivalences hidden within the raw data. Hence, unlike established techniques such as Isomap, the network neighborhood for each sample in our approach is not fixed, but rather takes communities and structural equivalences into account. To find such a mapping, therefore, we maximize the probability of observing a certain neighborhood \(\mathcal {N}\) of sample \(x^{(i)}\), conditional on its low-dimensional representation, as well as on its latent community structure and structural equivalences:

$$\begin{aligned} \max _{\mathcal {X}} \sum _{i=1}^{n} \log \Pr \left( \mathcal {N}(x^{(i)}) \mid \mathcal {X}^{(i)}, \mathcal {C}(x^{(i)}), \mathcal {S}(x^{(i)})\right) \end{aligned}$$
(1)

where, \(\mathcal {C}(x^{(i)})\) and \(\mathcal {S}(x^{(i)})\) are latent variables containing information about communities and structural equivalence of sample \(x^{(i)}\) hidden within the raw data.

To solve problem (1), we propose FeatureNet, a two-stage solution for dimensionality reduction which: (i) transforms the raw data into a network space to explicitly reveal the data’s inherent communities, and (ii) performs representation learning on this network (see Fig. 1). Next, we present our proposed network construction technique, which reveals hidden communities naturally.

Fig. 1. Complete flow of FeatureNet: (a) First, construct a network of samples using the proposed correlation-based method to explicitly reveal hidden communities in the raw data (Sect. 3: Step 1). (b) Next, use representation learning on this network to find the community-based low-dimensional features (Sect. 3: Step 2).

Step 1: Proposed \(\varvec{K}\)-\(\varvec{\tau }\) Method for Network Construction

We create a network directly from raw data as follows: (i) construct a correlation-based network (\(\tau \)-step), and (ii) improve the density of network communities (K-step). For the \(\tau \)-step, we transform the initial high-dimensional data into a correlation-based network of samples (i.e., each sample now becomes a node). This first step is a mapping \(l: \mathbb {R}^{n\times p}\rightarrow \mathbb {R}^{n \times n}\), which yields \(\mathcal {G}=l(X)\). Here, \(\mathcal {G}\in \mathbb {R}^{n\times n}\) is the adjacency matrix of the network of samples with elements:

$$\begin{aligned} \mathcal {G}^{ij} = {\left\{ \begin{array}{ll} c(x^{(i)},x^{(j)}) &{}\text {if}~c(x^{(i)},x^{(j)})\ge \tau \\ 0 &{}\text {if}~c(x^{(i)},x^{(j)})<\tau ~\mathrm{or}~i=j \end{array}\right. } \end{aligned}$$
(2)

where \(c(\cdot ,\cdot )\) is the Pearson correlation function, and \(\tau \) is a threshold on \(c(\cdot ,\cdot )\) used to remove weakly correlated links from the network.

Threshold (\(\varvec{\tau }\)-step): Setting a higher \(\tau \) removes noise from the network by encouraging connections only among samples of the same class (intra-class links) and not among samples of different classes (inter-class links). Ideally, a network should consist mostly of intra-class links and contain few inter-class links. To elaborate, we consider the MNIST handwritten digit dataset, where each sample has 784 features. Since our focus is on relatively small datasets, we randomly select 1000 samples (100 images for each digit 0–9) from the MNIST database.

Next, we create a Pearson correlation-based network of samples from this \({1000\times 784}\) dataset using Eq. 2 and a threshold \(\tau =0.7\). Figure 2(a) illustrates the adjacency matrix of this network. Clearly, the ten diagonal clusters in Fig. 2(a) represent the intra-class links, thus revealing the hidden communities of each digit. Moreover, with a high threshold of 0.7, most of the noisy links in the network (i.e., the inter-class links) are removed. We note, however, that too high a threshold can also leave some samples completely disconnected from the network and produce very sparse communities (see digits 2 and 5 in the zoomed inset of Fig. 2(a)). To overcome this problem, we introduce a network density parameter, K.

Fig. 2. Adjacency matrix for the MNIST dataset network. (a) Threshold \(\tau =0.7\) removes noise from the network and reveals a clear community structure (i.e., the diagonal clusters). (b) Introducing a density parameter K fixes the problem of sparse communities without adding significant noise and yields reliable low-dimensional representations.

Network Density (\(\varvec{K}\)-step): To connect the disconnected nodes and increase the density of network communities, we next connect each sample to its K most-correlated samples; i.e., after the thresholding step, if a sample \(x^{(i)}\) has fewer than K links, we connect it to samples \(x^{(j)}\) until it has K links, where the \(x^{(j)}\) are selected based on the K highest correlations. This step is a variant of the K-nearest-neighbor approach adapted to correlations rather than Euclidean distances (i.e., instead of the K neighbors with minimum distances, we use the K neighbors with maximum correlations). As shown in Fig. 2(b), introducing a network density parameter of \(K=7\) (i) connects all disconnected nodes, and (ii) increases the density of the diagonal clusters significantly without adding too much noise (see the zoomed-in inset of Fig. 2(b)). Hence, the threshold and density steps yield the K-\(\tau \) network of samples, \(\mathcal {G}\).

To summarize, our proposed approach creates a network from raw data using two parameters: the threshold \(\tau \) and the density K, which provide a tradeoff between noise and the density of communities. The best K and \(\tau \) can be selected via cross-validation and a simple grid search, which generates an optimal neighborhood for each sample. Hence, in our approach, the neighborhood of each sample is not rigid; rather, it is determined automatically by its community structure. Explicitly revealing these hidden communities in raw data, therefore, enables the use of community-based representation learning in dimensionality reduction.
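To make Step 1 concrete, below is a minimal NumPy sketch of the K-\(\tau \) construction (the reference implementation is in MATLAB; the function name k_tau_network and the tie-handling details are ours):

```python
import numpy as np

def k_tau_network(X, tau=0.7, K=7):
    """Build the K-tau network of samples from raw data X (n x p).

    Returns the n x n adjacency matrix G of Eq. (2), densified so that
    every node has at least K links (K-step).
    """
    n = X.shape[0]
    C = np.corrcoef(X)                 # pairwise Pearson correlations between samples
    np.fill_diagonal(C, 0.0)           # no self-loops (the i = j case in Eq. 2)

    # tau-step: keep only strongly correlated links
    G = np.where(C >= tau, C, 0.0)

    # K-step: connect each under-connected sample to its highest-correlated neighbors
    for i in range(n):
        missing = K - np.count_nonzero(G[i])
        if missing <= 0:
            continue
        for j in np.argsort(C[i])[::-1]:           # most correlated first
            if G[i, j] == 0.0 and C[i, j] > 0.0:
                G[i, j] = G[j, i] = C[i, j]
                missing -= 1
                if missing == 0:
                    break
    return G
```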

Step 2: Community-Based Representation Learning

The network of samples, \(\mathcal {G}\), often possesses characteristics such as communities and structural equivalence. Once in the network space, problem (1) reduces to:

$$\begin{aligned} \max _{\mathcal {X}} \sum _{v \in \mathcal {V}(\mathcal {G})} \log \Pr \left( \mathcal {N}_\mathcal {G}^R(v) \mid \mathcal {X}^{(i)}\right) \end{aligned}$$
(3)

where \(\mathcal {V}(\mathcal {G})\) denotes the set of nodes in network \(\mathcal {G}\), and sample \(x^{(i)}\) is now represented by a node v in the network of samples. Finding the neighborhood of \(x^{(i)}\), \(\mathcal {N}(x^{(i)})\), now becomes the problem of finding the network neighborhood \(\mathcal {N}_\mathcal {G}^R(v)\) of node v in \(\mathcal {G}\). This network neighborhood can be found using a strategy R, which can account for the latent community structure \(\mathcal {C}(x^{(i)})\) and structural equivalence \(\mathcal {S}(x^{(i)})\). Note that this is precisely the skip-gram objective which recent research on network representation learning aims to optimize [6, 13, 15]. Therefore, once the high-dimensional raw data is transformed into a network which explicitly reveals the community structure, the final low-dimensional representation can be learned using techniques such as node2vec [6]. Specifically, node2vec acts as a mapping \(h: \mathbb {R}^{n\times n}\rightarrow \mathbb {R}^{n \times d}\), which yields \(\mathcal {X}=h(\mathcal {G})\). The \(n \times d\) matrix \(\mathcal {X}\) contains the final low-dimensional features based on the hidden communities in the raw data. Algorithm 1 shows these two stages of FeatureNet. For more information on the classic word2vec skip-gram objective [13] and the node2vec search strategy [6], please refer to Supplementary Sect. 2.

Algorithm 1. The two stages of FeatureNet: K-\(\tau \) network construction (Step 1) followed by community-based representation learning (Step 2).
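As a hedged illustration of Step 2, the sketch below runs plain uniform random walks on the K-\(\tau \) network and feeds them to a gensim skip-gram model; the paper uses node2vec’s biased second-order walks (controlled by p and q), so this is a simplified stand-in, and the parameter names assume gensim ≥ 4. The function name featurenet_embed is ours.

```python
import numpy as np
from gensim.models import Word2Vec

def featurenet_embed(G, d=16, num_walks=10, walk_length=80, window=10, seed=0):
    """Learn d-dimensional features from the K-tau adjacency matrix G.

    Uniform random walks + skip-gram: a simplified stand-in for node2vec.
    """
    rng = np.random.default_rng(seed)
    n = G.shape[0]
    neighbors = [np.flatnonzero(G[i]) for i in range(n)]

    walks = []
    for _ in range(num_walks):
        for start in range(n):
            walk = [start]
            while len(walk) < walk_length and len(neighbors[walk[-1]]) > 0:
                walk.append(int(rng.choice(neighbors[walk[-1]])))
            walks.append([str(v) for v in walk])     # gensim expects token sequences

    model = Word2Vec(walks, vector_size=d, window=window,
                     min_count=0, sg=1, workers=4, epochs=5)
    # Assemble the n x d low-dimensional feature matrix
    return np.vstack([model.wv[str(v)] for v in range(n)])
```

In the full method, the walk-bias parameters p and q of node2vec control the trade-off between community structure and structural equivalence (see Sect. 4.1).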

4 Experimental Setup and Results

4.1 Experimental Setup

We implement the K-\(\tau \) method in MATLAB, while the node2vec neighborhood search, optimization, and the subsequent classification are all carried out in Python. We use one-vs-rest logistic regression with L2 regularization and a broad range of the inverse regularization strength parameter, \(C\in \{10^{-2},10^{-1},\ldots ,10^4\}\), for multi-class classification. Of note, the node2vec parameters (return parameter p and in-out parameter q), which control the trade-off between communities and structural equivalence, are optimized via a grid search over \(p,q \in \{0.25,0.75,0.9,1.5,2,4\}\). Finally, the two parameters of FeatureNet (K, \(\tau \)) are also optimized using a grid search: \(\tau \) is varied in steps of 0.05 from 0.6 to 0.95, while K varies from 1 to 9. The best parameter values are selected using 10-fold cross-validation (CV).
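For reference, here is a minimal scikit-learn sketch of this classification and cross-validation protocol (the helper name evaluate_features is ours, and the pipeline used in the paper may differ in details):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.multiclass import OneVsRestClassifier

def evaluate_features(X_reduced, y):
    """10-fold CV F1-Micro/F1-Macro for one-vs-rest L2-regularized logistic regression."""
    ovr = OneVsRestClassifier(LogisticRegression(penalty="l2", max_iter=1000))
    grid = GridSearchCV(ovr,
                        {"estimator__C": [10.0 ** k for k in range(-2, 5)]},
                        cv=10, scoring="f1_micro")
    grid.fit(X_reduced, y)
    best = grid.best_estimator_
    f1_micro = cross_val_score(best, X_reduced, y, cv=10, scoring="f1_micro").mean()
    f1_macro = cross_val_score(best, X_reduced, y, cv=10, scoring="f1_macro").mean()
    return f1_micro, f1_macro

# An outer grid over FeatureNet's (K, tau) and node2vec's (p, q) would wrap this call,
# using k_tau_network and featurenet_embed from the sketches above.
```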

To show the effectiveness of FeatureNet on many applications, we conduct experiments on eight datasets coming from five very different application areas, as summarized in Table 1. Our focus in this paper is on dimensionality reduction for relatively small datasets, which explains why the sample sizes in Table 1 are mostly between 100 and 1500. The Reuters subset is used to analyze the scalability of our approach. Table 1 contains five benchmarks from the UCI ML repository.

Table 1. Characteristics of the datasets
Table 2. 10-fold CV F\(_1\)-Macro and F\(_1\)-Micro (Accuracy) scores for UCI benchmarks (\(d=16\)): Best six prior methods shown.

Table 1 also shows three datasets from the computational sustainability domain in which quantitatively inferring economic growth from anthropogenic carbon emissions remains an active area of research [14]. Here, we make a twofold contribution: First, we propose the following new computational sustainability problem: “Given multiple years of daily carbon emissions (CE) data across the world, can we correctly classify the Gross Domestic Product (GDP) growth of different regions?” Second, we contribute three new datasets to further benchmark dimensionality reduction. The datasets are compiled using a carbon dioxide database [2] and the World Bank [22] data (see Supplementary Sect. 3).

Finally, we compare our approach against ten well-established dimensionality reduction techniques: (1) PCA, (2) PPCA, (3) Polynomial Kernel PCA (KPCA – Poly.), (4) KPCA with a Gaussian kernel, (5) Linear Discriminant Analysis (LDA), (6) SPE, (7) Deep Autoencoders, (8) SNE, (9) t-SNE, and (10) Isomap. We use a dimensionality reduction toolbox [12] for these techniques.

4.2 Results

UCI Machine Learning Repository Benchmarks. In our experiments, we reduce the dimensions of each dataset from the initial p features to \(d=16\) features. We then conduct logistic regression on the reduced features and report its 10-fold CV F\(_1\)-Macro and F\(_1\)-Micro scores. Note that F\(_1\)-Micro scores have the same interpretation as classification accuracy for multiclass classification problems. Table 2 presents these results for FeatureNet and the six best traditional techniques for all UCI datasets. As shown, our proposed FeatureNet significantly outperforms all six (and, implicitly, all ten!) prior techniques.

Fig. 3. F\(_1\)-Micro for varying FeatureNet parameters (K, \(\tau \)): (a) Arcene, (b) MNIST, and (c) CNAE-9. Red (blue) indicates higher (lower) accuracy. For all datasets, FeatureNet outperforms prior methods for many combinations of K and \(\tau \). (Color figure online)

For Arcene, FeatureNet achieves an F\(_1\)-Micro of 0.82, improving over the best-performing prior method (PPCA) by 6.5%. Arcene is a challenging dataset because 3000 of its 10000 features are ‘probes’ with no predictive power; this shows that FeatureNet is able to handle such noisy datasets. Next, for Musk1, we achieve an improvement of 10.27% in F\(_1\)-Micro over the best traditional methods, PCA and PPCA. Similarly, for MNIST, we observe an improvement of 3.28% in F\(_1\)-Micro over the best-performing t-SNE technique. Recall that we use only 1000 samples for MNIST, not all 60,000 training images. In fact, all datasets used in the present work are “relatively small”, with sample sizes mostly between 100 and 1500. This is why deep learning-based autoencoders do not perform well and, as expected, overfit the data.

Finally, for the CNAE-9 dataset (NLP), we improve the F\(_1\)-Micro by 5.83% over the best-performing SNE method. CNAE-9 contains business-description text for companies classified according to economic sectors. Each document is processed using standard NLP techniques (e.g., stop-word removal, stemming, etc.) and is converted to a term frequency vector. This results in a very sparse dataset, in which 99.22% of the raw data entries are zeros. In summary, our results demonstrate that FeatureNet can handle dimensionality reduction problems on very diverse applications, including noisy and sparse datasets. Similar improvements are observed for F\(_1\)-Macro scores.

Empirical Evaluation of FeatureNet in the \(\varvec{K}\)-\(\varvec{\tau }\) Parameter Space. Figure 3 shows the impact of varying the density K (y-axis) and threshold \(\tau \) (x-axis) for various UCI datasets (see Supplementary Fig. S2(a) for the Musk1 dataset). As shown, FeatureNet outperforms the traditional methods for several combinations of K and \(\tau \) (see the orange/red portions in Fig. 3). For MNIST, CNAE-9 (Fig. 3(b, c)), and Musk1 (Fig. S2(a)), almost any combination of parameters gives a high classification accuracy. By contrast, for Arcene (Fig. 3(a)), only a few parameter combinations give high performance (e.g., \(\tau =0.95\) with any value of K). A possible reason for this behavior on Arcene is the additional noise in this dataset. We leave a theoretical analysis of the stability of FeatureNet (e.g., its sensitivity to noise) for future work.

Why Do We Achieve Performance Gains? As mentioned before, the parameters \(\tau \) and K control the tradeoff between noise in the network and the density of communities. Consider the case \(\tau =0.85\) with varying K for MNIST (i.e., the rightmost column of Fig. 3(b)). For a high threshold of 0.85, the diagonal communities are even sparser than those shown in Fig. 2, where the threshold was only 0.7 (see also Supplementary Fig. S1). If we now increase the density K, the F\(_1\)-Micro rises from 0.873 for \(K=2\) to 0.904 for \(K=5\), and then dips slightly to 0.902 for \(K=9\) (likely due to the extra noise added at \(K=9\)). This clearly demonstrates the tradeoff between noise and density, and how it affects model performance. Our K-\(\tau \) method for network construction successfully captures the best tradeoff and thus yields high classification accuracy. Hence, our results show that choosing a good network construction approach for revealing the hidden communities in data is very important for obtaining high performance.

Table 3. 10-fold CV F\(_1\)-Micro (Accuracy) for CE-GDP problems (\(d=16\)): Six best prior methods are shown.

Computational Sustainability – A Case Study and New AI Datasets. Table 3 shows F\(_1\)-Micro for the competitive methods across the three years of the CE-GDP datasets. As is evident, FeatureNet significantly outperforms the best prior method (PPCA) by 40.13%, 22.51%, and 27.26% for 1980, 1990, and 2000, respectively. We also observe results similar to Fig. 3 for the CE-GDP datasets (see Supplementary Fig. S2(b)). Moreover, Fig. S3 shows results for varying the number of target dimensions from \(d=16\) to 32; again, FeatureNet outperforms the other techniques for all d. Therefore, the CE-GDP datasets can also be used by the ML community to benchmark dimensionality reduction.

Fig. 4. (a) K-\(\tau \) network for CE-GDP 2000 shows communities with very different sizes that are accurately modeled by FeatureNet. (b) Varying the fixed neighborhood size in Isomap and other methods cannot capture such variable-size communities (\(d=32\)).

Finally, the K-\(\tau \) network shown in Fig. 4(a) for the CE-GDP 2000 dataset demonstrates that FeatureNet models hidden communities of significantly different sizes very accurately, which explains its excellent performance (see Fig. S4 for the 1980 network). By contrast, fixed neighborhood-size or fixed-perplexity methods (e.g., Isomap, t-SNE) cannot capture such strong heterogeneity in the raw data’s community structure. To show this, we vary the fixed neighborhood size for Isomap in Fig. 4(b). As shown, FeatureNet (with F\(_1\)-Micro of nearly 0.9 for \(d=32\)) is far superior to Isomap for all neighborhood sizes.
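For completeness, a hedged sketch of this fixed-neighborhood Isomap sweep (the paper uses the toolbox of [12]; here we use scikit-learn’s Isomap and reuse the hypothetical evaluate_features helper sketched earlier, so exact numbers may differ):

```python
from sklearn.manifold import Isomap

def isomap_sweep(X, y, d=32, neighbor_sizes=(5, 10, 15, 20, 30)):
    """F1-Micro of the downstream classifier for several fixed Isomap neighborhood sizes."""
    scores = {}
    for k in neighbor_sizes:
        X_low = Isomap(n_neighbors=k, n_components=d).fit_transform(X)
        scores[k] = evaluate_features(X_low, y)[0]   # F1-Micro only
    return scores
```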

Note on Scalability. To analyze the scalability of FeatureNet, we consider a subset of the Reuters-21578 dataset in which documents with multiple category labels were removed. This yields 8293 documents from 65 classes with 18933 distinct terms. Of the 8293 documents, we focus on the given training set of 5946 documents and report the 10-fold CV classification F\(_1\)-Micro after reducing its dimensions from 18933 to 16. We compare FeatureNet with some of the top performers from the above experiments – SPE, PCA, and t-SNE – as these were among the few techniques able to finish execution in a reasonable time (about 2–4 h) with reasonable computational resources (an 8-core Intel i7 desktop).

For relatively small datasets like MNIST, the number of links is not very large (e.g., 6669 links for 1000 nodes). However, for larger datasets like Reuters, the number of links can grow rapidly: 719,080 links for the (\(\tau =0.7\), \(K=30\)) case and 2.1 million links for the (\(\tau =0.5\), \(K=50\)) case (see Table S1). Figure S5 shows the diagonal communities for the Reuters \(\tau =0.7\), \(K=30\) case (\(\approx \)700,000 links), whereas Fig. S6 shows the same for \(\tau =0.5\), \(K=50\) (>2.1M links; MATLAB takes only 10 s and up to 7 GB of memory to create this network). Clearly, the diagonal communities of the former are significantly sparser than those of the latter. Consequently, our proposed FeatureNet successfully reduced the dimensions and finished executing for the former case but not for the latter. In terms of classification accuracy, the F\(_1\)-Micro scores for SPE, PCA, and t-SNE were 0.725, 0.82, and 0.823, respectively, whereas FeatureNet again significantly outperformed these techniques with an F\(_1\)-Micro of 0.867 (a 5.34% improvement). These results demonstrate that FeatureNet can currently scale to large datasets whose networks contain up to several hundred thousand links; optimizing FeatureNet to handle datasets that produce several million links or more is left for future research.

5 Conclusion and Future Work

We have proposed FeatureNet, a new community-based dimensionality reduction framework for small-sample problems. To this end, we have proposed a new technique to construct a network from any general raw data while revealing its hidden communities. Community-based low-dimensional features are then learned using a representation learning framework. We have demonstrated the effectiveness of FeatureNet across five very different application domains: handwritten digit recognition, biology, physical science, NLP, and computational sustainability. We have further shown that FeatureNet significantly outperforms many well-known dimensionality reduction techniques such as PCA, PPCA, deep autoencoders, t-SNE, and Isomap. This ultimately shows how representation learning ideas can have significant implications for dimensionality reduction.

As future work, we plan to develop stronger algorithms and parallelization techniques to scale FeatureNet to hundreds of thousands of samples/features. Finally, we plan to provide an in-depth theoretical analysis of FeatureNet.