1 Introduction

Many Artificial Intelligence (AI) and Machine Learning (ML) problems involve very high-dimensional datasets. This high dimensionality leads to the curse of dimensionality, where the performance of ML models degrades as the number of data dimensions increases [9]. The curse of dimensionality is particularly severe when, in addition to the large number of features, only a small number of samples is available [5, 9, 10, 16]. Specifically, as theoretically established in [5, 9, 10, 16], obtaining high classification performance in high-dimensional spaces requires a very large number of samples (e.g., \(\sim \)10\(^5\) samples). Therefore, for problems with few samples (say, 100–1500), extracting a set of useful features from the high-dimensional data is very challenging [23]. Consequently, in this paper, we address the well-known dimensionality reduction problem specifically in the high-dimension, low-sample-size setting. This important class of problems has many engineering and scientific applications such as on-device mobile applications, remote sensing, fMRI processing, sustainability, finance, and biological systems.

Dimensionality reduction can also be seen as an automatic feature extraction problem which yields low-dimensional features from the initial high-dimensional raw data. Indeed, a major goal in AI and representation learning is to enable machines to learn such useful, low-dimensional features automatically from the raw data rather than using manually engineered features. Towards this automatic feature learning, deep learning has emerged for vision, speech, and natural language processing (NLP) applications [3]. However, deep learning often needs enormous training datasets (\(\sim \)10\(^5\)–10\(^6\) samples) and overfits significantly on problems with small sample sizes and high dimensions [4]. Therefore, for small sample-size problems, new dimensionality reduction techniques are needed to automatically extract useful features from the initial high-dimensional data.

Several outstanding dimensionality reduction techniques exist, ranging from linear methods such as Principal Component Analysis (PCA) and Probabilistic PCA (PPCA) to graph-based non-linear techniques like Maximum Variance Unfolding (MVU) [21] and Isomap [19] (see [12] for a review). Other techniques include Kernel PCA, deep learning autoencoders [8], stochastic proximity embedding (SPE) [1], stochastic neighbor embedding (SNE) [7], and t-distributed SNE (t-SNE) [11]. The graph-based techniques build a neighborhood graph to learn a lower-dimensional embedding [12]. For example, Isomap builds a neighborhood graph based on a fixed parameter which controls the neighborhood size, i.e., the number of nearest neighbors of each node in the graph.

In the real world, however, networks constructed from raw data often possess complex characteristics such as communities (i.e., groups of tightly connected nodes) and structural equivalence (i.e., nodes with similar roles in the network, e.g., hubs) [17]. Such network characteristics must therefore be accounted for when computing network neighborhoods for dimensionality reduction, as they can lead to more accurate feature learning. Specifically, the neighborhood of a given node must depend on the community it belongs to. By contrast, prior techniques like Isomap assume a rigid (fixed) neighborhood for all nodes in the network. Similarly, stochastic graph-based methods such as SNE/t-SNE use a fixed parameter called perplexity, which measures the effective number of neighbors of each node. Hence, the complex community structure hidden within the raw data has not been explicitly taken into account in prior dimensionality reduction methods.

Recently, representation learning has been proposed in the context of learning features on networks while accounting for community structure, e.g., node2vec [6], DeepWalk [15], community preserving embedding [20], LINE [18], etc. We refer to this problem space as “Representation Learning on Networks” throughout the paper. However, the networks considered in this prior art do not come from high-dimensional raw data, but rather from social networks (e.g., blogs, YouTube, Flickr), authorship networks, or Wikipedia webpage networks. Hence, prior research on representation learning on networks does not directly address the problem of dimensionality reduction. In contrast, we argue that, by capturing communities and structural equivalence, ideas from the “representation learning on networks” problem space can have significant implications for dimensionality reduction. Therefore, in this paper, we address the following two key questions:

  1.

    Can representation learning on networks have more general implications in dimensionality reduction if we leverage the hidden communities in raw data?

  2.

    If so, how can we best construct a network from high-dimensional data to optimally capture its latent communities for dimensionality reduction?

To answer these questions, we propose FeatureNet, a new community-based dimensionality reduction framework. We further contribute a new method to construct a network directly from the raw data while explicitly revealing its hidden communities; this enables us to employ network representation learning ideas to learn low-dimensional community- and structural equivalence-based features from this network, thereby reducing the dimensions of the dataset.

We evaluate our proposed approach on five very diverse application areas: handwritten digit recognition, biology, physical sciences, NLP, and computational sustainability. As mentioned earlier, our datasets are relatively small, with sizes mostly between 100 and 1500 samples. This is because automatic feature engineering for relatively small datasets remains an important problem, as deep learning models often overfit such datasets.

To summarize, we make the following key contributions:

  1.

    We propose FeatureNet, a novel community-based dimensionality reduction framework. We also propose a new method to construct a network directly from high-dimensional raw data, thereby revealing its hidden communities explicitly. To the best of our knowledge, we are the first to employ community-based representation learning ideas for dimensionality reduction.

  2.

    We evaluate FeatureNet on eight datasets spanning five diverse real-world applications: handwritten digit recognition, biology, physical science, NLP, and computational sustainability. Our new sustainability datasets can be used by the research community to further benchmark dimensionality reduction.

  3.

    We further compare FeatureNet against ten of the most notable dimensionality reduction techniques, such as PCA, deep learning autoencoders, t-SNE, and Isomap. Overall, the proposed FeatureNet significantly outperforms (in terms of accuracy) all of these techniques on the above diverse datasets, by 3%–40%.

  4.

    Finally, we introduce a new challenging computational sustainability problem as a case study: Given high-dimensional Carbon Emissions data, how can we learn optimal low-dimensional features to best classify the GDP growth of nations? Again, FeatureNet achieves state-of-the-art performance.

Next, we review the related work reported in the literature.

2 Related Work

As mentioned earlier, prior techniques such as node2vec, DeepWalk, community-preserving embedding, and LINE explicitly require a network as input [6, 15, 18, 20]. In contrast, we start with high-dimensional (raw) data that does not exist in a predefined network form and learn the network structure directly from the raw data to explicitly reveal its latent communities. For example, consider Arcene, a high-dimensional cancer benchmark dataset, where each sample comes from a patient and the features specify the abundance of certain proteins. This dataset does not have a predefined network structure like social networks do. Hence, methods like node2vec, LINE, etc. are not a natural choice. This is where our key contribution lies: representing any kind of dataset as a network that reveals the hidden communities in the raw data; we then use this network for dimensionality reduction using ideas from community-based feature learning.

To summarize, prior work on representation learning on networks focuses only on network-based classification tasks (e.g., classifying the interests of a blogger based on communities/homophily in a blog social network). Our work, however, generalizes this “network representation learning” space to any classification problem with high-dimensional data, rather than restricting it to network classification tasks (see Supplementary Sect. 1 for detailed related work).

3 Proposed Approach

Given a classification problem \(\{X,y\}\), let \(X \in \mathbb {R}^{n \times p}\) denote the original dataset with n samples and p features, and let \(y \in \mathbb {R}^{n \times 1}\) denote the labels. Also, let \(x^{(i)} \in \mathbb {R}^{p \times 1}\) be the i-th sample in X. Then, dimensionality reduction is a function \(f: \mathbb {R}^{n\times p}\rightarrow \mathbb {R}^{n \times d}\), where d is the number of features in the reduced space (\(d \ll p\)). Since prior neighborhood graph-based methods [11, 19] do not take communities into account, they lose important network information.

Let \(\mathcal {X}\) be the low-dimensional mapping of X, and let \(\mathcal {X}^{(i)}\) be the i-th sample of \(\mathcal {X}\) (i.e., the reduced representation of the initial \(x^{(i)}\)). Then, the problem is to find \(\mathcal {X}\) which accounts for the latent community structure and structural equivalences hidden within the raw data. Hence, unlike established techniques such as Isomap, the network neighborhood for each sample in our approach is not fixed, but rather takes communities and structural equivalences into account. To find such a mapping, therefore, we maximize the probability of observing a certain neighborhood \(\mathcal {N}\) of sample \(x^{(i)}\), conditional on its low-dimensional representation, as well as on its latent community structure and structural equivalences:

$$\begin{aligned} \max _{\mathcal {X}} \sum _{i=1}^{n} \log \Pr \left( \mathcal {N}(x^{(i)}) \mid \mathcal {X}^{(i)}, \mathcal {C}(x^{(i)}), \mathcal {S}(x^{(i)})\right) \end{aligned}$$
(1)

where, \(\mathcal {C}(x^{(i)})\) and \(\mathcal {S}(x^{(i)})\) are latent variables containing information about communities and structural equivalence of sample \(x^{(i)}\) hidden within the raw data.

To solve problem (1), we propose FeatureNet, a two-stage solution for dimensionality reduction which: (i) transforms the raw data into a network space to explicitly reveal the data’s inherent communities, and (ii) performs representation learning on this network (see Fig. 1). Next, we present our proposed network construction technique, which reveals hidden communities naturally.

Fig. 1. Complete flow of FeatureNet: (a) First, construct a network of samples using the proposed correlation-based method to explicitly reveal hidden communities in the raw data (Sect. 3: Step 1). (b) Next, use representation learning on this network to find the community-based low-dimensional features (Sect. 3: Step 2).

Step 1: Proposed \(\varvec{K}\)-\(\varvec{\tau }\) Method for Network Construction

We create a network directly from raw data as follows: (i) construct a correlation-based network (\(\tau \)-step), and (ii) improve the density of network communities (K-step). For the \(\tau \)-step, we transform the initial high-dimensional data into a correlation-based network of samples (i.e., each sample now becomes a node). This first step is a mapping \(l: \mathbb {R}^{n\times p}\rightarrow \mathbb {R}^{n \times n}\), which yields \(\mathcal {G}=l(X)\). Here, \(\mathcal {G}\in \mathbb {R}^{n\times n}\) is the adjacency matrix of the network of samples with elements:

$$\begin{aligned} \mathcal {G}^{ij} = {\left\{ \begin{array}{ll} c(x^{(i)},x^{(j)}) &{}\text {if}~c(x^{(i)},x^{(j)})\ge \tau \\ 0 &{}\text {if}~c(x^{(i)},x^{(j)})<\tau ~\mathrm{or}~i=j \end{array}\right. } \end{aligned}$$
(2)

where \(c(\cdot ,\cdot )\) is the Pearson correlation function, and \(\tau \) is a threshold on \(c(\cdot ,\cdot )\) used to remove weakly correlated links from the network.

Threshold (\(\varvec{\tau }\)-step): Setting a higher \(\tau \) removes noise from the network by encouraging connections only among samples of the same class (intra-class links) and not among samples of different classes (inter-class links). Ideally, a network should consist mostly of intra-class links and contain few inter-class links. To elaborate, we consider the MNIST handwritten digit dataset, where each sample has 784 features. Since our focus is on relatively small datasets, we randomly select 1000 samples (100 images for each digit 0–9) from the MNIST database.

Next, we create a Pearson correlation-based network of samples from this \({1000\times 784}\) dataset using Eq. 2 and a threshold \(\tau =0.7\). Figure 2(a) illustrates the adjacency matrix of this network. Clearly, the ten diagonal clusters in Fig. 2(a) represent the intra-class links, thus revealing the hidden communities of each digit. Moreover, with a high threshold of 0.7, most of the noisy links in the network (i.e., the inter-class links) are removed. We note, however, that too high a threshold can also leave some samples completely disconnected from the network and produce very sparse communities (see digits 2 and 5 in the zoomed inset of Fig. 2(a)). To overcome this problem, we introduce a network density parameter, K.

Fig. 2. Adjacency matrix for the MNIST dataset network. (a) Threshold \(\tau =0.7\) removes noise from the network and reveals a clear community structure (i.e., the diagonal clusters). (b) Introducing a density parameter K fixes the problem of sparse communities without adding significant noise and yields reliable low-dimensional representations.

Network Density (\(\varvec{K}\)-step): To connect the disconnected nodes and increase the density of network communities, we next connect each sample to its K most-correlated samples; i.e., after the thresholding step, if a sample \(x^{(i)}\) has fewer than K links, we connect it to samples \(x^{(j)}\) until it has K links, where the \(x^{(j)}\) are selected based on the K highest correlations. This step is a variant of the K-nearest-neighbor approach adapted to correlations rather than Euclidean distances (i.e., instead of the K neighbors with minimum distances, we use the K neighbors with maximum correlations). As shown in Fig. 2(b), introducing a network density parameter of \(K=7\) (i) connects all disconnected nodes, and (ii) increases the density of the diagonal clusters significantly without adding too much noise (see the zoomed-in inset of Fig. 2(b)). Hence, the threshold and density steps yield the K-\(\tau \) network of samples, \(\mathcal {G}\).

To summarize, our proposed approach creates a network from raw data using two parameters: the threshold \(\tau \) and the density K, which provide a tradeoff between noise and the density of communities. The best K and \(\tau \) can be selected via cross-validation and a simple grid search, which generates an optimal neighborhood for each sample. Hence, in our approach, the neighborhood of each sample is not rigid; rather, it is determined automatically by its community structure. Explicitly revealing these hidden communities in raw data, therefore, enables the use of community-based representation learning in dimensionality reduction.
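To make Step 1 concrete, below is a minimal NumPy sketch of the K-\(\tau \) construction (the reference implementation is in MATLAB; the function name k_tau_network and the tie-handling details are ours):

```python
import numpy as np

def k_tau_network(X, tau=0.7, K=7):
    """Build the K-tau network of samples from raw data X (n x p).

    Returns the n x n adjacency matrix G of Eq. (2), densified so that
    every node has at least K links (K-step).
    """
    n = X.shape[0]
    C = np.corrcoef(X)                 # pairwise Pearson correlations between samples
    np.fill_diagonal(C, 0.0)           # no self-loops (the i = j case in Eq. 2)

    # tau-step: keep only strongly correlated links
    G = np.where(C >= tau, C, 0.0)

    # K-step: connect each under-connected sample to its highest-correlated neighbors
    for i in range(n):
        missing = K - np.count_nonzero(G[i])
        if missing <= 0:
            continue
        for j in np.argsort(C[i])[::-1]:           # most correlated first
            if G[i, j] == 0.0 and C[i, j] > 0.0:
                G[i, j] = G[j, i] = C[i, j]
                missing -= 1
                if missing == 0:
                    break
    return G
```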

Step 2: Community-Based Representation Learning

The network of samples, \(\mathcal {G}\), often possesses characteristics such as communities and structural equivalence. Once in the network space, problem (1) reduces to:

$$\begin{aligned} \max _{\mathcal {X}} \sum _{v \in \mathcal {V}(\mathcal {G})} \log \Pr \left( \mathcal {N}_\mathcal {G}^R(v) \mid \mathcal {X}^{(i)}\right) \end{aligned}$$
(3)

where \(\mathcal {V}(\mathcal {G})\) denotes the set of nodes in network \(\mathcal {G}\), and sample \(x^{(i)}\) is now represented by a node v in the network of samples. Finding the neighborhood of \(x^{(i)}\), \(\mathcal {N}(x^{(i)})\), now becomes the problem of finding the network neighborhood \(\mathcal {N}_\mathcal {G}^R(v)\) of node v in \(\mathcal {G}\). This network neighborhood can be found using a strategy R, which can account for the latent community structure \(\mathcal {C}(x^{(i)})\) and structural equivalence \(\mathcal {S}(x^{(i)})\). Note that this is precisely the skip-gram objective which recent research on network representation learning aims to optimize [6, 13, 15]. Therefore, once the high-dimensional raw data is transformed into a network which explicitly reveals the community structure, the final low-dimensional representation can be learned using techniques such as node2vec [6]. Specifically, node2vec acts as a mapping \(h: \mathbb {R}^{n\times n}\rightarrow \mathbb {R}^{n \times d}\), which yields \(\mathcal {X}=h(\mathcal {G})\). The \(n \times d\) matrix \(\mathcal {X}\) contains the final low-dimensional features based on the hidden communities in the raw data. Algorithm 1 shows these two stages of FeatureNet. For more information on the classic word2vec skip-gram objective [13] and the node2vec search strategy [6], please refer to Supplementary Sect. 2.

Algorithm 1. The two stages of FeatureNet: K-\(\tau \) network construction (Step 1) followed by community-based representation learning (Step 2).
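As a hedged illustration of Step 2, the sketch below runs plain uniform random walks on the K-\(\tau \) network and feeds them to a gensim skip-gram model; the paper uses node2vec’s biased second-order walks (controlled by p and q), so this is a simplified stand-in, and the parameter names assume gensim ≥ 4. The function name featurenet_embed is ours.

```python
import numpy as np
from gensim.models import Word2Vec

def featurenet_embed(G, d=16, num_walks=10, walk_length=80, window=10, seed=0):
    """Learn d-dimensional features from the K-tau adjacency matrix G.

    Uniform random walks + skip-gram: a simplified stand-in for node2vec.
    """
    rng = np.random.default_rng(seed)
    n = G.shape[0]
    neighbors = [np.flatnonzero(G[i]) for i in range(n)]

    walks = []
    for _ in range(num_walks):
        for start in range(n):
            walk = [start]
            while len(walk) < walk_length and len(neighbors[walk[-1]]) > 0:
                walk.append(int(rng.choice(neighbors[walk[-1]])))
            walks.append([str(v) for v in walk])     # gensim expects token sequences

    model = Word2Vec(walks, vector_size=d, window=window,
                     min_count=0, sg=1, workers=4, epochs=5)
    # Assemble the n x d low-dimensional feature matrix
    return np.vstack([model.wv[str(v)] for v in range(n)])
```

In the full method, the walk-bias parameters p and q of node2vec control the trade-off between community structure and structural equivalence (see Sect. 4.1).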

4 Experimental Setup and Results

4.1 Experimental Setup

We implement the K-\(\tau \) method in MATLAB, while the node2vec neighborhood search, optimization, and the subsequent classification are all carried out in Python. We use one-vs-rest logistic regression with L2 regularization and a broad range of the inverse regularization strength parameter, \(C\in \{10^{-2},10^{-1},\ldots ,10^4\}\), for multi-class classification. Of note, the node2vec parameters (return parameter p and in-out parameter q), which control the trade-off between communities and structural equivalence, are optimized via a grid search over \(p,q \in \{0.25,0.75,0.9,1.5,2,4\}\). Finally, the two parameters of FeatureNet (K, \(\tau \)) are also optimized using a grid search: \(\tau \) is varied in steps of 0.05 from 0.6 to 0.95, while K varies from 1 to 9. The best parameter values are selected using 10-fold cross-validation (CV).
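For reference, here is a minimal scikit-learn sketch of this classification and cross-validation protocol (the helper name evaluate_features is ours, and the pipeline used in the paper may differ in details):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.multiclass import OneVsRestClassifier

def evaluate_features(X_reduced, y):
    """10-fold CV F1-Micro/F1-Macro for one-vs-rest L2-regularized logistic regression."""
    ovr = OneVsRestClassifier(LogisticRegression(penalty="l2", max_iter=1000))
    grid = GridSearchCV(ovr,
                        {"estimator__C": [10.0 ** k for k in range(-2, 5)]},
                        cv=10, scoring="f1_micro")
    grid.fit(X_reduced, y)
    best = grid.best_estimator_
    f1_micro = cross_val_score(best, X_reduced, y, cv=10, scoring="f1_micro").mean()
    f1_macro = cross_val_score(best, X_reduced, y, cv=10, scoring="f1_macro").mean()
    return f1_micro, f1_macro

# An outer grid over FeatureNet's (K, tau) and node2vec's (p, q) would wrap this call,
# using k_tau_network and featurenet_embed from the sketches above.
```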

To show the effectiveness of FeatureNet on many applications, we conduct experiments on eight datasets coming from five very different application areas, as summarized in Table 1. Our focus in this paper is on dimensionality reduction for relatively small datasets, which explains why the sample sizes in Table 1 are mostly between 100 and 1500. The Reuters subset is used to analyze the scalability of our approach. Table 1 contains five benchmarks from the UCI ML repository.

Table 1. Characteristics of the datasets
Table 2. 10-fold CV F\(_1\)-Macro and F\(_1\)-Micro (Accuracy) scores for UCI benchmarks (\(d=16\)): Best six prior methods shown.

Table 1 also shows three datasets from the computational sustainability domain in which quantitatively inferring economic growth from anthropogenic carbon emissions remains an active area of research [14]. Here, we make a twofold contribution: First, we propose the following new computational sustainability problem: “Given multiple years of daily carbon emissions (CE) data across the world, can we correctly classify the Gross Domestic Product (GDP) growth of different regions?” Second, we contribute three new datasets to further benchmark dimensionality reduction. The datasets are compiled using a carbon dioxide database [2] and the World Bank [22] data (see Supplementary Sect. 3).

Finally, we compare our approach against ten well-established dimensionality reduction techniques: (1) PCA, (2) PPCA, (3) Polynomial Kernel PCA (KPCA – Poly.), (4) KPCA with a Gaussian kernel, (5) Linear Discriminant Analysis (LDA), (6) SPE, (7) Deep Autoencoders, (8) SNE, (9) t-SNE, and (10) Isomap. We use a dimensionality reduction toolbox [12] for these techniques.

4.2 Results

UCI Machine Learning Repository Benchmarks. In our experiments, we reduce the dimensions of each dataset from the initial p features to \(d=16\) features. We then conduct logistic regression on the reduced features and report its 10-fold CV F\(_1\)-Macro and F\(_1\)-Micro scores. Note that F\(_1\)-Micro scores have the same interpretation as classification accuracy for multiclass classification problems. Table 2 presents these results for FeatureNet and the six best traditional techniques for all UCI datasets. As shown, our proposed FeatureNet significantly outperforms all six (and, implicitly, all ten!) prior techniques.

Fig. 3. F\(_1\)-Micro for varying FeatureNet parameters (K, \(\tau \)): (a) Arcene, (b) MNIST, and (c) CNAE-9. Red (blue) indicates higher (lower) accuracy. For all datasets, FeatureNet outperforms prior methods for many combinations of K and \(\tau \). (Color figure online)

For Arcene, FeatureNet achieves an F\(_1\)-Micro of 0.82, improving over the best-performing prior method (PPCA) by 6.5%. Arcene is a challenging dataset because 3000 of its 10000 features are ‘probes’ with no predictive power; this shows that FeatureNet is able to handle such noisy datasets. Next, for Musk1, we achieve an improvement of 10.27% in F\(_1\)-Micro over the best traditional methods, PCA and PPCA. Similarly, for MNIST, we observe an improvement of 3.28% in F\(_1\)-Micro over the best-performing t-SNE technique. Recall that we use only 1000 samples for MNIST, not all 60,000 training images. In fact, all datasets used in the present work are “relatively small”, with sample sizes mostly between 100 and 1500. This is why deep learning-based autoencoders do not perform well and, as expected, overfit the data.

Finally, for the CNAE-9 dataset (NLP), we improve the F\(_1\)-Micro by 5.83% over the best-performing SNE method. CNAE-9 contains business-description text for companies classified according to economic sectors. Each document is processed using standard NLP techniques (e.g., stop-word removal, stemming, etc.) and is converted to a term frequency vector. This results in a very sparse dataset, in which 99.22% of the raw data entries are zeros. In summary, our results demonstrate that FeatureNet can handle dimensionality reduction problems on very diverse applications, including noisy and sparse datasets. Similar improvements are observed for F\(_1\)-Macro scores.

Empirical Evaluation of FeatureNet in the \(\varvec{K}\)-\(\varvec{\tau }\) Parameter Space. Figure 3 shows the impact of varying the density K (y-axis) and threshold \(\tau \) (x-axis) for various UCI datasets (see Supplementary Fig. S2(a) for the Musk1 dataset). As shown, FeatureNet outperforms the traditional methods for several combinations of K and \(\tau \) (see the orange/red portions in Fig. 3). For MNIST, CNAE-9 (Fig. 3(b, c)), and Musk1 (Fig. S2(a)), almost any combination of parameters gives a high classification accuracy. By contrast, for Arcene (Fig. 3(a)), only a few parameter combinations give high performance (e.g., \(\tau =0.95\) with any value of K). A possible reason for this behavior on Arcene is the additional noise in this dataset. We leave a theoretical analysis of the stability of FeatureNet (e.g., its sensitivity to noise) for future work.

Why Do We Achieve Performance Gains? As mentioned before, the parameters \(\tau \) and K control the tradeoff between noise in the network and the density of communities. Consider the case \(\tau =0.85\) with varying K for MNIST (i.e., the rightmost column of Fig. 3(b)). For a high threshold of 0.85, the diagonal communities are even sparser than those shown in Fig. 2, where the threshold was only 0.7 (see also Supplementary Fig. S1). If we now increase the density K, the F\(_1\)-Micro rises from 0.873 for \(K=2\) to 0.904 for \(K=5\), and then dips slightly to 0.902 for \(K=9\) (likely due to the extra noise added at \(K=9\)). This clearly demonstrates the tradeoff between noise and density, and how it affects model performance. Our K-\(\tau \) method for network construction successfully captures the best tradeoff and thus yields high classification accuracy. Hence, our results show that choosing a good network construction approach for revealing the hidden communities in data is very important for obtaining high performance.

Table 3. 10-fold CV F\(_1\)-Micro (Accuracy) for CE-GDP problems (\(d=16\)): Six best prior methods are shown.

Computational Sustainability – A Case Study and New AI Datasets. Table 3 shows F\(_1\)-Micro for the competitive methods across the three years of the CE-GDP datasets. As is evident, FeatureNet significantly outperforms the best prior method (PPCA) by 40.13%, 22.51%, and 27.26% for 1980, 1990, and 2000, respectively. We also observe results similar to Fig. 3 for the CE-GDP datasets (see Supplementary Fig. S2(b)). Moreover, Fig. S3 shows results for varying the number of target dimensions from \(d=16\) to 32; again, FeatureNet outperforms the other techniques for all d. Therefore, the CE-GDP datasets can also be used by the ML community to benchmark dimensionality reduction.

Fig. 4. (a) K-\(\tau \) network for CE-GDP 2000 shows communities with very different sizes that are accurately modeled by FeatureNet. (b) Varying the fixed neighborhood size in Isomap and other methods cannot capture such variable-size communities (\(d=32\)).

Finally, the K-\(\tau \) network shown in Fig. 4(a) for the CE-GDP 2000 dataset demonstrates that FeatureNet models hidden communities of significantly different sizes very accurately, which explains its excellent performance (see Fig. S4 for the 1980 network). By contrast, fixed neighborhood-size or fixed-perplexity methods (e.g., Isomap, t-SNE) cannot capture such strong heterogeneity in the raw data’s community structure. To show this, we vary the fixed neighborhood size for Isomap in Fig. 4(b). As shown, FeatureNet (with F\(_1\)-Micro of nearly 0.9 for \(d=32\)) is far superior to Isomap for all neighborhood sizes.
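For completeness, a hedged sketch of this fixed-neighborhood Isomap sweep (the paper uses the toolbox of [12]; here we use scikit-learn’s Isomap and reuse the hypothetical evaluate_features helper sketched earlier, so exact numbers may differ):

```python
from sklearn.manifold import Isomap

def isomap_sweep(X, y, d=32, neighbor_sizes=(5, 10, 15, 20, 30)):
    """F1-Micro of the downstream classifier for several fixed Isomap neighborhood sizes."""
    scores = {}
    for k in neighbor_sizes:
        X_low = Isomap(n_neighbors=k, n_components=d).fit_transform(X)
        scores[k] = evaluate_features(X_low, y)[0]   # F1-Micro only
    return scores
```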

Note on Scalability. To analyze the scalability of FeatureNet, we consider a subset of the Reuters-21578 dataset in which documents with multiple category labels were removed. This yields 8293 documents from 65 classes with 18933 distinct terms. Of the 8293 documents, we focus on the given training set of 5946 documents and report the 10-fold CV classification F\(_1\)-Micro after reducing its dimensions from 18933 to 16. We compare FeatureNet with some of the top performers from the above experiments – SPE, PCA, and t-SNE – as these were among the few techniques able to finish execution in a reasonable time (about 2–4 h) with reasonable computational resources (an 8-core Intel i7 desktop).

For relatively small datasets like MNIST, the number of links is not very large (e.g., 6669 links for 1000 nodes). However, for larger datasets like Reuters, the number of links can grow rapidly: 719,080 links for the (\(\tau =0.7\), \(K=30\)) case and 2.1 million links for the (\(\tau =0.5\), \(K=50\)) case (see Table S1). Figure S5 shows the diagonal communities for the Reuters \(\tau =0.7\), \(K=30\) case (\(\approx \)700,000 links), whereas Fig. S6 shows the same for \(\tau =0.5\), \(K=50\) (>2.1M links; MATLAB takes only 10 s and up to 7 GB of memory to create this network). Clearly, the diagonal communities of the former are significantly sparser than those of the latter. Consequently, our proposed FeatureNet successfully reduced the dimensions and finished executing for the former case but not for the latter. In terms of classification accuracy, the F\(_1\)-Micro scores for SPE, PCA, and t-SNE were 0.725, 0.82, and 0.823, respectively, whereas FeatureNet again significantly outperformed these techniques with an F\(_1\)-Micro of 0.867 (a 5.34% improvement). These results demonstrate that FeatureNet can currently scale to large datasets whose networks contain up to several hundred thousand links; optimizing FeatureNet to handle datasets that produce several million links or more is left for future research.

5 Conclusion and Future Work

We have proposed FeatureNet, a new community-based dimensionality reduction framework for small-sample problems. To this end, we have proposed a new technique to construct a network from any general raw data while revealing its hidden communities. Community-based low-dimensional features are then learned using a representation learning framework. We have demonstrated the effectiveness of FeatureNet across five very different application domains: handwritten digit recognition, biology, physical science, NLP, and computational sustainability. We have further shown that FeatureNet significantly outperforms many well-known dimensionality reduction techniques such as PCA, PPCA, deep autoencoders, t-SNE, and Isomap. This ultimately shows how representation learning ideas can have significant implications for dimensionality reduction.

As future work, we plan to develop stronger algorithms and parallelization techniques to scale FeatureNet to hundreds of thousands of samples/features. Finally, we plan to provide an in-depth theoretical analysis of FeatureNet.