Boolean autoencoders and hypercube clustering complexity

We introduce and study the properties of Boolean autoencoder circuits. In particular, we show that the Boolean autoencoder circuit problem is equivalent to a clustering problem on the hypercube. We show that clustering m binary vectors on the n-dimensional hypercube into k clusters is NP-hard as soon as the number of clusters scales like m^ε (ε > 0), and thus the general Boolean autoencoder problem is also NP-hard. We prove that the linear Boolean autoencoder circuit problem is also NP-hard, and so are several related problems such as: subspace identification over finite fields, linear regression over finite fields, even/odd set intersections, and parity circuits. The emerging picture is that autoencoder optimization is NP-hard in the general case, with a few notable exceptions, including the linear cases over infinite fields and the Boolean case with a fixed-size hidden layer. However, learning can be tackled by approximate algorithms, including alternate optimization, suggesting a new class of learning algorithms for deep networks, including deep networks of threshold gates or artificial neurons.

hidden-to-output transformation functions so as to minimize a distortion measure between input vectors and the corresponding output vectors produced by the circuit. Autoencoders were introduced in the 1980s by the Parallel Distributed Processing group [23] as a way to address the problem of unsupervised learning, learning from data in the absence of any targets, by using the data itself as the output target. When p < n, the hidden layer acts as a bottleneck, forcing the circuit to extract features and produce a compressed representation of the input data X. More recently, autoencoders have been used extensively in the "deep architecture" approach [16,17,5,10], where autoencoders in the form of Restricted Boltzmann Machines (RBMs) are stacked and trained bottom-up in unsupervised fashion to extract hidden features and efficient representations that can then be used to address supervised classification or regression tasks.
In spite of the interest they have generated, and with few exceptions [3,26], little theoretical understanding of autoencoders and deep architectures has been obtained to date, especially in the non-linear case. Here we derive a better theoretical understanding of autoencoders, in part by studying an extreme form of non-linear autoencoder, namely the Boolean autoencoder.

Autoencoder problem
We begin with a fairly general definition of the autoencoder problem. An n/p/n autoencoder (Fig. 1) is defined by a tuple (n, p, m, F, G, A, B, X, Δ) where:
1. n, p and m are positive integers. Here we consider primarily the case where 0 < p < n.
2. F and G are sets.
3. A is a class of functions from G^p to F^n.
4. B is a class of functions from F^n to G^p.
5. X = {x_1, ..., x_m} is a set of m training vectors in F^n. When external targets are present, we let Y = {y_1, ..., y_m} denote the corresponding set of target vectors in F^n.
6. Δ is a distance or distortion function (e.g. L_p norm, Hamming distance) defined over F^n.
For any A ∈ A and B ∈ B, the autoencoder transforms an input vector x ∈ F^n into an output vector A ∘ B(x) ∈ F^n (Fig. 1). The corresponding autoencoder problem is to find A ∈ A and B ∈ B that minimize the overall distortion or error function:

E(A, B) = Σ_{t=1}^{m} Δ(A ∘ B(x_t), x_t).

In the non auto-associative case, when external targets y_t are provided, the minimization problem becomes:

E(A, B) = Σ_{t=1}^{m} Δ(A ∘ B(x_t), y_t).

Here we consider primarily the auto-associative case with p < n, where the goal of the autoencoder is to find a way to compress the data. The regime p ≥ n is also of interest, but in general requires additional assumptions beyond the scope of this article (see Discussion). Obviously, from this general framework, different kinds of autoencoders can be derived depending, for instance, on the choice of sets F and G, transformation classes A and B, distortion function Δ, as well as the presence of any additional constraints (Fig. 2). Linear autoencoders correspond to the case where F and G are fields and A and B are the classes of linear transformations; hence A and B can be represented by matrices of size n × p and p × n respectively. In this case, the autoencoder problem can be viewed essentially as the problem of finding a rank-p approximation to the identity function. The linear real-valued case where F = G = R and Δ is the squared Euclidean distance was addressed in [3]. Similar results were obtained more recently for the linear complex-valued case [4]. The main goal of this article is to study Boolean autoencoders where F = G = {0, 1}, including linear autoencoders over GF(2) = F_2.
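The objective E can be written down directly for any concrete choice of A, B, and Δ. A minimal sketch in Python (the function names and the toy data are ours, purely illustrative, with the Hamming distance as distortion):

```python
def hamming(u, v):
    """Hamming distance, the distortion used for Boolean autoencoders."""
    return sum(a != b for a, b in zip(u, v))

def distortion(A, B, X, delta):
    """Overall autoencoder error E(A, B) = sum_t delta(A(B(x_t)), x_t)."""
    return sum(delta(A(B(x)), x) for x in X)

# Toy n/p/n = 3/1/3 Boolean example: B keeps the first bit, A repeats it.
B = lambda x: (x[0],)
A = lambda h: (h[0], h[0], h[0])
X = [(0, 0, 0), (1, 1, 1), (1, 1, 0)]
print(distortion(A, B, X, hamming))  # -> 1 (only (1,1,0) is distorted)
```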

Autoencoder properties
From the study of different kinds of autoencoders [4,2] emerges a set of basic properties that ought to be investigated for each class of autoencoders.
(1) Invariances. What are the relevant group actions for the problem? What are the transformations of F^n and G^p, or A and B, that leave the problem invariant?
(2) Fixed layer solutions. Is it possible to optimize A (resp. B), fully or partially, while B (resp. A) is held constant?
(3) Problem complexity. How complex is the autoencoder optimization problem? Is there an overall analytical solution? Is the corresponding decision problem NP-complete?
(4) Landscape of E. What is the landscape of the overall error E? Are there any symmetries, local minima, critical points, and how can they be characterized?
(5) Clustering. Especially in the case where p < n, what is the relationship to clustering?
(6) Transposition. Is there a notion of symmetry or transposition between the transformations A and B, in particular around critical points?
(7) Recycling. What happens if the values from the output layer are recycled into the input layer, in particular around critical points?
(8) Learning algorithms. What are the learning algorithms and their properties? In particular, can A and B be fully, or partially, optimized in alternation? And if so, is the algorithm convergent? And if so, at what speed, and what are the properties of the corresponding limit points?
(9) Generalization. What are the generalization properties of the autoencoder after learning? In other words, what are the properties of the distortion function (e.g. its average) on vectors in F^n − X?
(10) External targets. How does the problem change if external targets are provided?
(11) Composition. Autoencoder circuits can be stacked vertically, using the hidden layer of the autoencoder at one level of the stack as the input layer for the autoencoder at the next level of the stack (Fig. 4). What is the overall effect of such composition? Autoencoders can also be composed horizontally [2]. What is the overall effect of such composition?
Most of these questions can be addressed analytically in the case of real-valued or complex-valued autoencoders with the squared Euclidean distance (Δ = L 2 2 ) as the distortion function. For completeness, we restate without proof the main results derived in [3] for the linear real-valued case and generalized in [4] for the complex-valued case.

Linear autoencoders over the real or complex numbers
We use A t to denote the transpose of any matrix A in the real-valued case, or its conjugate transpose in the complex-valued case.
(1) Invariances. (a) Change of coordinates in the hidden layer. Note that for any invertible p × p matrix C,

W = AB = (AC)(C^{-1}B).

Thus all the properties of the linear autoencoder are fundamentally invariant with respect to any change of coordinates in the hidden layer. (b) Change of coordinates in the input/output layers. Consider an orthonormal change of coordinates in the output space defined by an orthogonal (or unitary) n × n matrix D, and any change of coordinates in the input space defined by an invertible n × n matrix C. This leads to a new autoencoder problem with input vectors Cx_1, ..., Cx_m and target output vectors of the form Dy_1, ..., Dy_m, with reconstruction error of the form

E(A′, B′) = Σ_{t=1}^{m} ||Dy_t − A′B′Cx_t||².

If we use the one-to-one mapping between pairs of matrices (A, B) and (A′, B′) defined by A′ = DA and B′ = BC^{-1}, we have

E(A′, B′) = Σ_{t=1}^{m} ||Dy_t − DABx_t||² = Σ_{t=1}^{m} ||y_t − ABx_t||² = E(A, B),

the last equality using the fact that D is an isometry and preserves distances and angles. Thus, using the transformation A′ = DA and B′ = BC^{-1}, the original problem and the transformed problem are equivalent, and the functions E(A, B) and E(A′, B′) have the same landscape.
In particular, in the auto-associative case, we can take C = D to be a unitary matrix. This leads to an equivalent autoencoder problem with input vectors Cx_t and covariance matrix CΣ_{XX}C^t, where Σ_{XX} = Σ_t x_t x_t^t. For the proper choice of C, there is an equivalent problem where the basis of the space is provided by the eigenvectors of the covariance matrix, and the covariance matrix is a diagonal matrix with diagonal entries equal to the eigenvalues of the original covariance matrix.
(2) Fixed layer solutions. The problem becomes convex if A is fixed, or if B is fixed. When A is fixed, assuming A has rank p and that the data covariance matrix Σ_{XX} = Σ_t x_t x_t^t is invertible, then at the optimum B* = (A^t A)^{-1} A^t. When B is fixed, assuming B has rank p and that Σ_{XX} is invertible, then at the optimum A* = Σ_{XX} B^t (B Σ_{XX} B^t)^{-1}.
(3) Problem complexity. While the cost function is quadratic and all the operations are linear, the overall problem is not convex, because the hidden layer limits the rank of the overall transformation to be at most p, and the set of matrices of rank p or less is not convex. However, the linear autoencoder problem over R and C can be solved analytically.
(4) Landscape of E. The overall landscape of E has no local minima. All the critical points, where the gradient of E is zero, correspond to projections onto subspaces associated with subsets of eigenvectors of the covariance matrix Σ_{XX}. Projections onto the subspace associated with the p largest eigenvalues correspond to the global minimum and to Principal Component Analysis. All other critical points, corresponding to projections onto subspaces associated with other sets of eigenvalues, are saddle points (Fig. 3). More precisely, if I = {i_1, ..., i_p} (1 ≤ i_1 < ... < i_p ≤ n) is any ordered list of indices, let U_I = [u_{i_1}, ..., u_{i_p}] denote the matrix formed by the orthonormal eigenvectors of Σ_{XX} associated with the eigenvalues λ_{i_1}, ..., λ_{i_p}. Then two matrices A and B of rank p define a critical point if and only if there is a set I and an invertible p × p matrix C such that A = U_I C, B = C^{-1} U_I^t, and W = AB = P_{U_I}, where P_{U_I} is the orthogonal projection onto the subspace spanned by the columns of U_I. At the global minimum, taking C = I, the activities in the hidden layer are given by the dot products u_1^t x, ..., u_p^t x, and correspond to the coordinates of x along the first p eigenvectors of Σ_{XX}.

[Fig. 3 caption: Landscape of E in the linear real-valued and complex-valued case with squared Euclidean distance. All critical points are associated with projections onto subspaces spanned by the eigenvectors of the covariance matrix of X. All critical points are saddle points, except those associated with the projection onto the subspace corresponding to the largest p eigenvalues.]

for comparison with cases where a closed-form solution does not exist, or for potential embodiments that may not have direct access to the global minimum. Various descent algorithms can be used, including gradient descent and alternate partial, or full, optimization of A and B, possibly combined with transposition (see [4] for details). These algorithms are convergent and generally converge to the global minimum when initialized randomly. In this linear case, alternate minimization of A and B can be viewed as an instance of the EM algorithm [8] under a standard Gaussian model.
(9) Generalization. At any critical point, for any x, AB(x) is equal to the projection of x onto the corresponding subspace, and the corresponding error can be expressed easily as the squared distance of x to the projection space.
(10) External targets. With the proper adjustments, the results above remain similar if a set of target output vectors y_1, ..., y_m is provided, instead of x_1, ..., x_m serving as the targets (see [3,4]).
(11) Composition. The global minimum of E remains the same if additional matrices of rank greater than or equal to p are introduced between the input layer and the hidden layer, or between the hidden layer and the output layer. Thus there is no reduction in overall distortion by introducing such matrices. However, if such matrices are introduced for other reasons, there is a composition law, so that the optimal solution for a deep autoencoder with a stack of matrices can be obtained by combining the optimal solutions of the shallow autoencoders. More precisely, consider an autoencoder network with layers of size n/p_1/p/p_1/n (Fig. 4), with n > p_1 > p.
Then the optimal solution of this network can be obtained by first computing the optimal solution for an n/p_1/n autoencoder network, and combining it with the optimal solution of a p_1/p/p_1 autoencoder network, using the activities in the hidden layer of the first network as the training set for the second network, exactly as in the case of stacked RBMs [16,17]. This is because the projection onto the subspace spanned by the top p eigenvectors can be decomposed into a projection onto the subspace spanned by the top p_1 eigenvectors, followed by a projection onto the subspace spanned by the top p eigenvectors.

The Boolean autoencoder

Here we first study the unrestricted case, where A and B contain all possible Boolean functions of the right dimensions, and then the linear case where A and B are represented by matrices over GF(2). Another class, the Boolean threshold gate autoencoder, where the Boolean functions are restricted to being threshold gates, is considered in the Discussion.
The following definitions and notations will be useful in the statement of the main theorem. Given k binary column vectors p_1, ..., p_k in the n-dimensional hypercube H_n, we define the corresponding binary majority vector Majority(p) in H_n by taking in each row j the majority of the corresponding components p_{j1}, ..., p_{jk}. When k is even, there can be ties, in which case one can flip a fair coin to assign the corresponding value.
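A sketch of the majority vector in Python (function names are ours; ties are broken deterministically toward 1 here, which the definition permits), together with a brute-force check of the lemma that follows on a small example:

```python
from itertools import product

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def majority(vectors):
    """Component-wise majority of k binary vectors; ties (k even) -> 1."""
    k = len(vectors)
    return tuple(int(2 * sum(col) >= k) for col in zip(*vectors))

# Sanity check on H_3: the majority vector attains the minimum total
# Hamming distance to the given vectors, over all vertices of the cube.
P = [(0, 1, 1), (1, 1, 0), (1, 1, 1)]
best = min(product((0, 1), repeat=3),
           key=lambda v: sum(hamming(v, p) for p in P))
assert sum(hamming(majority(P), p) for p in P) == sum(hamming(best, p) for p in P)
```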

Lemma 1 The vector Majority(p) is a vector in H_n closest to the center of gravity of p_1, ..., p_k, and it minimizes over all v ∈ H_n the sum of Hamming distances Σ_{i=1}^{k} Δ(v, p_i).
Proof The center of gravity is the vector c in R^n with coordinates c_j = Σ_{i=1}^{k} p_{ji}/k. For any j, Majority(p)_j is the closest binary value to c_j. Furthermore, for any v ∈ H_n,

Σ_{i=1}^{k} Δ(v, p_i) = Σ_{i=1}^{k} Σ_{j=1}^{n} |v_j − p_{ji}| = Σ_{j=1}^{n} Σ_{i=1}^{k} |v_j − p_{ji}|,

and each term in the last sum is minimized by the majority vector.
A Voronoi partition of H_n generated by the vectors p_s is a partition C_Vor(p_1), ..., C_Vor(p_k) of H_n into k sets such that for any x in H_n: x ∈ C_Vor(p_s) implies Δ(x, p_s) ≤ Δ(x, p_t) for every t. Points that are equidistant from two or more of the p_s can again be assigned arbitrarily to a unique closest center.
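On small hypercubes the Voronoi partition can be enumerated directly. A sketch (names are ours; ties are resolved to the first closest center, one of the arbitrary assignments allowed above):

```python
from itertools import product

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def voronoi_partition(centers, n):
    """Partition H_n among the given centers (first closest center wins ties)."""
    clusters = {c: [] for c in centers}
    for x in product((0, 1), repeat=n):
        nearest = min(centers, key=lambda c: hamming(x, c))
        clusters[nearest].append(x)
    return clusters

parts = voronoi_partition([(0, 0, 0), (1, 1, 1)], 3)
# Each vertex of H_3 joins the center sharing the majority of its bits.
assert len(parts[(0, 0, 0)]) == 4 and len(parts[(1, 1, 1)]) == 4
```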

Theorem 1 (Boolean autoencoder) (1) Invariances. (a) Permutations of hidden layer activities. The properties of the Boolean autoencoder are invariant under any one-to-one map from H_p to H_p. Thus every solution is defined up to a permutation of the 2^p points of the hypercube H_p. (b) Isometric change of coordinates in the input/output layers. Any isometry of the hypercube H_n preserving all the Hamming distances (hence also the Euclidean distances) between the points in X leads to an equivalent autoencoder problem.
(2) Fixed layer solution. If the A mapping is fixed, then the optimal mapping B* is given by B*(x) = h_i whenever x belongs to the Voronoi cluster of A(h_i) (see the proof). (3) Problem complexity. In general, the overall optimization problem is NP-hard. More precisely, the optimization problem is NP-hard in the regime where p ∼ ε log_2 m with ε > 0.

(4) The landscape of E. In general E has many local minima (e.g., with respect to the Hamming distance applied to the lookup tables of A and B). (5) Clustering. The overall optimization problem is a problem of optimal clustering, where the clustering is defined by the transformation B.
Proof (1) For the hidden layer corresponding to H_p, the Boolean functions are unrestricted and therefore their lookup tables can accommodate any such permutation, or relabeling, of the hidden states: if C is a one-to-one map from H_p to H_p, replacing A by A ∘ C and B by C^{-1} ∘ B leaves the overall transformation unchanged. For the input layer corresponding to H_n, the property is obvious. Note that such isometries are generated by permutations and inversions of coordinates.
(2) Assume first that A is fixed. Then for each of the 2^p possible Boolean vectors h_1, ..., h_{2^p} of the hidden layer, consider the corresponding images A(h_1), ..., A(h_{2^p}) in H_n. One can build the corresponding Voronoi partition by assigning each point of H_n to its closest centroid, breaking ties arbitrarily, thus forming a partition of H_n into 2^p corresponding clusters C_1, ..., C_{2^p}. The optimal mapping B* is then easily defined by setting B*(x) = h_i for every x in C_i (Fig. 5). Conversely, assume that B is fixed. Then for each of the 2^p possible Boolean vectors h_1, ..., h_{2^p} of the hidden layer, let X ∩ B^{-1}(h_i) denote the set of training vectors mapped onto h_i. To minimize the reconstruction error, A* must map h_i onto a point y of H_n minimizing the sum of Hamming distances to the points in X ∩ B^{-1}(h_i). By Lemma 1, the minimum is realized by the component-wise majority vector Majority(X ∩ B^{-1}(h_i)), breaking ties arbitrarily (Fig. 6). Note that this solution minimizes the distortion on the training set. The generalization or total distortion, however, is minimized by Majority(B^{-1}(h_i)). In some situations, one may have the additional constraint that the output vectors must belong to the training set. With this additional constraint, the optimal solution A*(h_i) should be the vector of X that is closest to the vector Majority(X ∩ B^{-1}(h_i)).
(3) To be more precise, one must specify the regime of interest, characterized by which variables among n, m, and p are going to infinity. Obviously one must have n → ∞ and m > 2^p. If p does not go to infinity, then the problem can be polynomial, for instance when the centroids must belong to the training set. If p → ∞ and m is a polynomial in n, which is the case of interest in machine learning where typically m is a low-degree polynomial in n, then the problem of finding the best Boolean mapping (i.e. the Boolean mapping that minimizes the distortion E associated with the Hamming distance on the training set) is NP-hard, and the corresponding decision problem is NP-complete. In fact, the optimal clustering problem is NP-hard when the number of clusters scales like 2^p ∼ m^ε (ε > 0).
The complexity proof is given in the next section.
(4) At a critical point, the distortion cannot be decreased by reassigning any training vector from its cluster to another cluster. Such critical points are local or global minima. The existence of local minima is not surprising since the optimization problem is NP-complete. Simple examples of local minima can be constructed (not shown). (5) This is obvious from the proof of (2). Note that approximate solutions can be sought by several algorithms, such as k-means, belief propagation [11], minimum spanning paths and trees [25], hierarchical clustering, and alternate optimization of A and B. Alternate optimization of A and B yields an approximate optimization algorithm for all autoencoders, since the distortion is positive and must decrease or stay constant at each optimization step. In the purely linear case over R or C, or in the mixed case where the hidden layer is binary-valued but the output is real, the alternate optimization is closely related to the EM algorithm [8] and to the k-means algorithm [9], with the proper probabilistic interpretations.
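The alternate optimization of A and B described above amounts to a k-means-style loop on the hypercube, with majority vectors as centroids. A minimal sketch (names are ours; A is stored as a lookup table of centroids, ties are broken deterministically, and the loop, whose distortion can only decrease or stay constant, terminates on small examples such as the one below):

```python
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def majority(vectors):
    k = len(vectors)
    return tuple(int(2 * sum(col) >= k) for col in zip(*vectors))  # ties -> 1

def boolean_autoencoder(X, centroids):
    """Alternate optimization of A and B on the hypercube.

    A is its lookup table, a list of 2**p centroids in H_n; B maps each x
    to the index of its nearest centroid."""
    while True:
        # Optimize B with A fixed: Voronoi assignment of the training set.
        clusters = [[] for _ in centroids]
        for x in X:
            i = min(range(len(centroids)), key=lambda j: hamming(x, centroids[j]))
            clusters[i].append(x)
        # Optimize A with B fixed: replace each centroid by its cluster majority.
        new = [majority(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            E = sum(hamming(x, centroids[i])
                    for i, c in enumerate(clusters) for x in c)
            return centroids, E
        centroids = new

X = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0)]
centroids, E = boolean_autoencoder(X, [X[0], X[2]])  # p = 1: two clusters
print(E)  # -> 2: each vector ends within one bit of its cluster centroid
```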

The complexity of clustering on the hypercube
In this section, we briefly review some results on clustering complexity and then prove that the hypercube clustering decision problem is in general NP-complete. The complexity of various clustering problems, in different spaces or with different objective functions, has been studied in the literature. There are primarily two kinds of results: (1) graphical results, derived on graphs G = (V, E, Δ) where the dissimilarity Δ is not necessarily a distance; and (2) geometric results, derived in the Euclidean space R^d where Δ = L_2^2, L_2, or L_1. In general, the clustering decision problem is NP-complete and the clustering optimization problem is NP-hard, except in some simple cases involving either a constant number k of clusters or clustering in the 1-dimensional Euclidean space. In general, the results in Euclidean spaces are harder to derive than the results on graphs. When polynomial-time algorithms exist, geometric problems tend to have faster solutions, taking advantage of the geometric properties. However, none of the existing complexity theorems directly addresses the problem of clustering on the hypercube with respect to the Hamming distance.
To deal with the hypercube clustering problem one must first understand which quantities are allowed to go to infinity. If n is not allowed to go to infinity, then the number m of training examples is also bounded by 2 n and, since we are assuming p < n, there is no quantity that can scale. Thus by necessity we must have n → ∞. We must also have m → ∞. The case of interest for machine learning is when m is a low degree polynomial of n. Obviously the hypercube clustering problem is in NP, and it is a special case of clustering in R n . Thus the only important problem to be addressed is the reduction of a known NP-complete problem to a hypercube clustering problem.
For the reduction, it is natural to start from a known NP-complete graphical or geometric clustering problem. In both cases, one must find a way to embed the original problem, with its original metric, into the hypercube with the Hamming distance. There are theorems for homeomorphic or squashed embeddings of graphs into the hypercube [14,29]; however, these embeddings do not map the original dissimilarity function onto the Hamming metric. Thus here we prefer to start from some of the known geometric results and use a strict cubical graph embedding. A graph is cubical if it is a subgraph of some hypercube H_d for some d [13,18]. Although deciding whether a graph is cubical is NP-complete [1], there is a theorem [15] providing a necessary and sufficient condition for a graph to be cubical. A graph G(V, E) is cubical and embeddable in H_d if and only if it is possible to color the edges of G with d colors such that: (1) all edges incident with a common vertex have different colors; (2) in each path of G, some color appears an odd number of times; and (3) in each cycle of G, no color appears an odd number of times. We can now state and prove the following theorem.
Theorem 2 (Hypercube clustering) The HYPERCUBE CLUSTERING problem is NP-hard when n → ∞ and k ∼ m^ε (ε > 0).
Proof To sketch the reduction, we start from the problem of clustering m points in the plane R^2 using cluster centroids and the L_1 distance, which is NP-complete [21] by reduction from 3-SAT [12] when k ∼ m^ε (ε > 0) (see also related results in [19] and [28]). Without any loss of generality, we can assume that the points in these problems lie on the vertices of a square lattice. Using the theorem in [15], one can show that an n × m square lattice in the plane can be embedded into H_{n+m}. In fact, an explicit embedding is given in Fig. 7. It is easy to check that the L_1 or Manhattan distance between any two points on the square lattice is equal to the corresponding Hamming distance in H_{n+m}. This polynomial reduction completes the proof that if the number of clusters satisfies k = 2^p ∼ m^ε, or equivalently p ∼ ε log_2 m ∼ εC log_2 n (the latter when m ∼ n^C), then the hypercube clustering problem associated with the Boolean autoencoder is NP-hard, and the corresponding decision problem NP-complete. If the number k of clusters is fixed and the centroids must belong to the training set, there are only (m choose k) ∼ m^k possible choices for the centroids, inducing the corresponding Voronoi clusters. This yields a trivial, albeit not efficient, polynomial-time algorithm. When the centroids are not required to be in the training set, we conjecture the existence of polynomial-time algorithms as well, by adapting the corresponding theorems in Euclidean space.
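The explicit embedding of Fig. 7 is not reproduced here, but a standard "thermometer" (unary) encoding has the required property: it maps an a × b lattice into H_{a+b} so that the Manhattan distance becomes the Hamming distance. A sketch with exhaustive verification on a small grid (function names are ours):

```python
from itertools import product

def unary(i, d):
    """Thermometer code of i in {0, ..., d}: i ones followed by d - i zeros."""
    return (1,) * i + (0,) * (d - i)

def embed(point, a, b):
    """Embed a lattice point (i, j), 0 <= i <= a, 0 <= j <= b, into H_{a+b}."""
    i, j = point
    return unary(i, a) + unary(j, b)

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

# Hamming distance in the hypercube equals L1 distance on the lattice.
for p, q in product(product(range(4), range(5)), repeat=2):
    l1 = abs(p[0] - q[0]) + abs(p[1] - q[1])
    assert hamming(embed(p, 3, 4), embed(q, 3, 4)) == l1
```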

Linear autoencoders over finite fields
Here we focus on linear autoencoders over GF(2), although we expect the results over other finite fields to be similar. If B is fixed, optimizing A is a linear regression problem from the hidden activities B(x_i) to the targets x_i; clearly this leads to n separate regression problems, one for each output unit (or component). If we fix A, the situation is similar. Consider the 2^p possible Boolean vectors {h_1, ..., h_{2^p}} in F_2^p associated with the hidden layer, the corresponding images {A(h_1), ..., A(h_{2^p})} under A, and the Voronoi partition of F_2^n induced by these images (again breaking ties arbitrarily). For any input vector x_i, there is at least one vector h_k such that x_i is in the same Voronoi region as A(h_k), in which case we should have B(x_i) = h_k. Thus the optimal B is again defined by a regression problem, with input-output pairs of the form (x_i, h_k) and the error function associated with Eq. 1 rather than the Hamming distance in F_2^p. (Unlike the case where B is fixed, this does not correspond to p independent regression problems.) Thus, in short, to make progress on this issue one must study the complexity of the linear regression problem over finite fields. The next sections show that linear regression over GF(2) and several other related decision problems are NP-complete.

Known preliminary results
We begin by stating five NP-complete problems that are closely related to the autoencoder problem over GF(2). Not surprisingly, these problems come from the coding literature.

Problem: MAXIMUM-LIKELIHOOD DECODING
Instance: A binary m × n matrix H, a vector s ∈ F_2^m, and an integer w > 0. Question: Is there a vector x ∈ F_2^n of weight ≤ w such that Hx = s? The weight of a binary vector is defined as the number of 1-bits it contains. This problem was proved to be NP-complete using a reduction from THREE-DIMENSIONAL MATCHING [6]. From the original proof, the problem remains NP-complete if the vector s is the vector of all 1s.

Problem: WEIGHT DISTRIBUTION
Instance: A binary m × n matrix H and an integer w > 0. Question: Is there a vector x ∈ F_2^n of weight w such that Hx = 0? This problem was also proved to be NP-complete using a reduction from THREE-DIMENSIONAL MATCHING [6]. Note that any of these problems becomes solvable in polynomial time if either n or m is bounded.
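The WEIGHT DISTRIBUTION question can be decided naively by trying all C(n, w) supports, which is exponential in w in general but polynomial when w is fixed. A brute-force sketch for tiny instances (names are ours):

```python
from itertools import combinations

def weight_distribution(H, n, w):
    """Search for x of weight exactly w with Hx = 0 over GF(2).

    H is a list of m rows (tuples of bits). Tries all C(n, w) supports."""
    for support in combinations(range(n), w):
        x = tuple(1 if i in support else 0 for i in range(n))
        if all(sum(h * xi for h, xi in zip(row, x)) % 2 == 0 for row in H):
            return x
    return None

H = [(1, 1, 0, 0), (0, 0, 1, 1)]
print(weight_distribution(H, 4, 2))  # -> (1, 1, 0, 0)
```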

Problem: MINIMUM DISTANCE
Instance: A binary m × n matrix H and an integer w > 0. Question: Is there a nonzero vector x ∈ F_2^n of weight ≤ w such that Hx = 0? This decision problem was conjectured to be NP-complete in [6] and was subsequently proved to be so [27]. The following two variants are also known to be NP-complete [22].

Theorem 3 (Kernel over GF(2)) Given a binary m × n matrix H, the problems of determining whether the kernel of H contains a non-zero vector of weight equal to w, less than or equal to w, greater than or equal to w, or in the range [w_1, w_2], are all NP-complete.

As noted in [6,27], these problems remain NP-complete when the rows of H are independent, i.e. when H has rank m and m ≤ n. These problems also remain NP-complete if one of the components of x is constrained to be 1. To see this, first note that they obviously remain in the class NP. Furthermore, any unconstrained kernel problem defined by an m × n matrix H can be reduced to a constrained problem with an m × (n + 1) matrix, by adding a column of 0s to the matrix H and adding a corresponding component fixed to 1 to the vector x. This implies immediately that the four forms of MAXIMUM-LIKELIHOOD DECODING, of finding a vector x of weight equal to w, less than or equal to w (minimal weight), greater than or equal to w (maximal weight), or in the range [w_1, w_2], are all NP-complete, yielding the following theorem.

Theorem 4 (General maximum likelihood decoding) Given a binary m × n matrix H and a vector s ∈ F_2^m, the problems of determining whether there exists a vector x satisfying Hx = s of weight equal to w, less than or equal to w, greater than or equal to w, or in the range [w_1, w_2], are all NP-complete.
For any linear operator H, any vector x ∈ F_2^n can be decomposed uniquely as x = u + v, with u ∈ Ker H and v ∈ J, where Ker H is the kernel of H and J is a subspace of F_2^n isomorphic to the image Im H. Furthermore, it is easy to find a basis of Ker H or J in polynomial time, using standard algebraic manipulations, and to create a new matrix G such that Im G = Ker H (for instance, take the n × n matrix whose columns form a basis of Ker H, padded with 0s as needed) or Ker G = Im H. Thus the equivalent image problems, of finding a non-zero vector in the image of H of weight equal to w, less than or equal to w (minimal weight), greater than or equal to w (maximal weight), or in the range [w_1, w_2], are all NP-complete, yielding the following theorem.

Theorem 5 (Image over GF(2)) Given a binary m × n matrix H, the problems of determining whether the image of H contains a non-zero vector of weight equal to w, less than or equal to w, greater than or equal to w, or in the range [w_1, w_2], are all NP-complete.
More generally, any subspace can be viewed as the image of a matrix H comprising a basis of the subspace as its column vectors. Thus the subspace problems, of finding a non-zero vector of weight equal to w, less than or equal to w (minimal weight), greater than or equal to w (maximal weight), or in the range [w_1, w_2] in a given subspace, are all NP-complete, yielding the following theorem.

Theorem 6 (Subspace over GF(2)) Given a binary m × n matrix H, the problems of determining whether the subspace generated by the columns of H contains a non-zero vector of weight equal to w, less than or equal to w, greater than or equal to w, or in the range [w_1, w_2], are all NP-complete.
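The polynomial-time "standard algebraic manipulations" invoked above are Gaussian elimination over GF(2). A sketch computing a kernel basis (names are ours; the matrix G with these basis vectors as columns then satisfies Im G = Ker H):

```python
def kernel_basis(H, n):
    """Basis of {x in F_2^n : Hx = 0}; H is a list of m rows (tuples of bits).

    Gaussian elimination over GF(2), polynomial time, in contrast with the
    weight-constrained kernel problems, which are NP-complete."""
    rows = [list(r) for r in H]
    pivots, r = [], 0
    for c in range(n):
        # Find a row with a 1 in column c, at position r or below.
        for i in range(r, len(rows)):
            if rows[i][c]:
                rows[r], rows[i] = rows[i], rows[r]
                break
        else:
            continue
        # Eliminate column c from every other row (XOR = addition mod 2).
        for i in range(len(rows)):
            if i != r and rows[i][c]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    # One basis vector per free column: set it to 1 and back-substitute.
    basis = []
    for f in (c for c in range(n) if c not in pivots):
        x = [0] * n
        x[f] = 1
        for i, c in enumerate(pivots):
            x[c] = rows[i][f]
        basis.append(tuple(x))
    return basis

H = [(1, 1, 0), (0, 1, 1)]
print(kernel_basis(H, 3))  # -> [(1, 1, 1)]
```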

Linear regression
We can now consider the general problem of linear regression over GF(2). We first consider exact (i.e. zero-distortion) linear regression, and then approximate linear regression, where the goal is to minimize the distortion.

Problem: EXACT LINEAR REGRESSION OVER GF(2)
Instance: Two integers m and n, an m × n binary matrix H, an integer w, and a vector y ∈ F_2^m. Question: Is there a vector x ∈ F_2^n of weight ≤ w such that Hx = y? Note that in this formulation the m rows of H are the input vectors and the components of y are the corresponding targets. This decision problem is of course NP-complete, since it is the same problem as MAXIMUM-LIKELIHOOD DECODING viewed from a linear regression perspective. Note that the same decision problem without the restriction wt(x) ≤ w can be solved in polynomial time, using standard algebraic manipulations to solve the corresponding linear system. The versions of the problem where the vector x is required to have weight equal to w, less than or equal to w (minimal weight), greater than or equal to w (maximal weight), or in the range [w_1, w_2], are all NP-complete, yielding the following theorem.
Theorem 7 (Exact linear regression over GF(2)) Given an m × n binary matrix H and a vector y ∈ F_2^m, the problems of finding a vector x satisfying Hx = y of weight equal to w, less than or equal to w, greater than or equal to w, or in the range [w_1, w_2], are all NP-complete.

We now turn to approximate linear regression, where we measure the distortion Δ(Hx, y) with the Hamming distance Δ. For any two binary vectors u and v, note that Δ(u, v) = wt(u + v), since u_i + v_i = 0 if and only if u_i = v_i. We have the following result.

Theorem 8 (Approximate linear regression over GF(2)) The decision problem

Problem: APPROXIMATE LINEAR REGRESSION OVER GF(2)
Instance: Two integers m and n, an m × n binary matrix H, an integer w, and a vector y ∈ F_2^m.
Question: Is there a vector x ∈ F_2^n such that Δ(Hx, y) = wt(Hx + y) ≤ w?

is NP-complete.
Proof To see this, we first construct a new m × (n + 1) binary matrix H′ by appending y to H as an extra column. Consider the vector x′ obtained by appending a 1 to x. Then H′x′ = Hx + y, so the problem becomes a MAXIMUM-LIKELIHOOD DECODING-type problem with the matrix H′ and the last component of the vector x′ constrained to be 1. By the results above on MAXIMUM-LIKELIHOOD DECODING, the linear regression problems over GF(2) with distortion equal to w, less than or equal to w (minimal weight), greater than or equal to w (maximal weight), or in the range [w_1, w_2], are all NP-complete, yielding the following theorem.
Theorem 9 (Approximate linear regression over GF(2)) Given an m × n binary matrix H and a vector y ∈ F_2^m, the problems of finding a vector x such that the distortion Δ(Hx, y) is: (1) equal to w; (2) less than or equal to w; (3) greater than or equal to w; or (4) in the range [w_1, w_2], are all NP-complete.
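As a concrete illustration of the remark above that the system Hx = y without a weight restriction is easy, the following is a minimal sketch of Gaussian elimination over GF(2); the function and variable names are ours, for illustration only.

```python
def solve_gf2(H, y):
    """Return some x with H x = y over GF(2), or None if no solution exists.

    H: list of m rows, each a list of n bits; y: list of m bits.
    Runs in polynomial time, illustrating why only the weight-constrained
    versions of the problem are hard.
    """
    m, n = len(H), len(H[0])
    M = [row[:] + [y[i]] for i, row in enumerate(H)]  # augmented [H | y]
    pivots = []
    r = 0
    for c in range(n):
        # Find a row with a 1 in column c, at or below row r.
        pr = next((i for i in range(r, m) if M[i][c]), None)
        if pr is None:
            continue
        M[r], M[pr] = M[pr], M[r]
        # Eliminate column c from every other row (XOR = addition in GF(2)).
        for i in range(m):
            if i != r and M[i][c]:
                M[i] = [a ^ b for a, b in zip(M[i], M[r])]
        pivots.append((r, c))
        r += 1
    # Inconsistent system: a zero row with a nonzero target bit.
    if any(M[i][n] for i in range(r, m)):
        return None
    # Set free variables to 0; pivot variables read off the target column.
    x = [0] * n
    for i, c in pivots:
        x[c] = M[i][n]
    return x
```

A solution returned by `solve_gf2` need not have low weight; deciding whether a solution of weight ≤ w exists is exactly the NP-complete problem above.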

Linear autoencoders over GF(2)
We can now address the complexity of linear autoencoders over GF(2).

Problem: LINEAR AUTOENCODER OVER GF(2)
Instance: Four integers n, p, m, and w. A set of m vectors x_1, ..., x_m in F_2^n. Question: Is there a p × n matrix B and an n × p matrix A such that ∑_{i=1}^m wt(ABx_i + x_i) ≤ w?

Note again that wt(ABx + x) is simply the Hamming distance between ABx and x. By taking m = n, the problem is at least as hard as LINEAR REGRESSION and therefore it is NP-complete. More generally, by the same argument, we see that the LINEAR AUTOENCODER OVER GF(2) problems with distortion equal to w, less than or equal to w, greater than or equal to w, or in the range [w_1, w_2], are all NP-complete, yielding the following theorem.
Theorem 10 (Linear autoencoders over GF(2)) Given m binary vectors x_1, ..., x_m in F_2^n, the problems of finding an n × p matrix A and a p × n matrix B such that the overall distortion ∑_i Δ(ABx_i, x_i) is: (1) equal to w; (2) less than or equal to w; (3) greater than or equal to w; or (4) in the range [w_1, w_2], are all NP-complete. The problem remains NP-complete if one adds the constraint p < n.

It is reasonable to conjecture that all the KERNEL, IMAGE, SUBSPACE, EXACT LINEAR REGRESSION, LINEAR REGRESSION, and AUTOENCODER problems described above remain NP-complete over finite fields other than GF(2). Some of the results above can be restated almost immediately in terms of set intersections or parity gates.
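While finding an optimal pair (A, B) is hard, evaluating the objective for a candidate pair is immediate. A minimal sketch of computing the total distortion ∑_i wt(ABx_i + x_i); all helper names are illustrative.

```python
def matvec_gf2(M, v):
    """Matrix-vector product over GF(2): each output bit is a parity."""
    return [sum(a & b for a, b in zip(row, v)) % 2 for row in M]

def distortion(A, B, data):
    """Total Hamming distortion of the linear autoencoder x -> A B x.

    A: n x p binary matrix, B: p x n binary matrix,
    data: list of length-n binary vectors x_1, ..., x_m.
    """
    total = 0
    for x in data:
        h = matvec_gf2(B, x)   # p-dimensional hidden code Bx
        z = matvec_gf2(A, h)   # n-dimensional reconstruction ABx
        total += sum(zi ^ xi for zi, xi in zip(z, x))  # wt(ABx + x)
    return total

# Toy instance with n = 2, p = 1: the rank-1 map AB copies the first
# bit and zeroes the second, so each vector with a second bit of 1
# contributes one unit of distortion.
A = [[1], [0]]
B = [[1, 0]]
data = [[1, 0], [0, 1], [1, 1]]
```

With p = n and A = B = I the distortion is 0, matching the remark that the expansive or equal-size case is only interesting under additional constraints.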

Complexity of set intersections
This section shows that some of the problems examined in the section on linear autoencoders over finite fields can be restated in terms of set intersections or parity gates.

Problem: EVEN SET INTERSECTIONS (≤ w)
Instance: A collection C of m subsets of a set S of size n, and an integer w > 0. Question: Is there a non-empty subset of S of size ≤ w that has an even intersection with all the members of C?
Obviously this is equivalent to MINIMUM DISTANCE OR KERNEL OVER GF(2) (≤ w). Thus, using the obvious notation, we have the following theorem.
Theorem 11 (Even set intersections) Given a collection C of m subsets of a set S of size n, the problems of finding a non-empty subset of S with an even intersection with all the subsets in C and size: (1) equal to w; (2) less than or equal to w; (3) greater than or equal to w; or (4) in the range [w_1, w_2], are all NP-complete.

The results are similar for odd intersections. In fact, consider the following problem.

Problem: ODD SET INTERSECTIONS (≤ w)
Instance: A collection C of m subsets of a set S of size n, and an integer w > 0. Question: Is there a subset of S of size ≤ w that has an odd intersection with all the members of C? This is exactly equivalent to MAXIMUM-LIKELIHOOD DECODING with the target y being the vector of all 1s, and similarly for the other versions, yielding the following theorem.

Theorem 12 (Odd set intersections) Given a collection C of m subsets of a set S of size n, the problems of finding a subset of S with an odd intersection with all the subsets in C and size: (1) equal to w; (2) less than or equal to w; (3) greater than or equal to w; or (4) in the range [w_1, w_2], are all NP-complete.
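The equivalence with linear algebra over GF(2) can be checked directly: encoding each C_i and each candidate subset X by indicator vectors, the parity of |C_i ∩ X| is the GF(2) inner product of the corresponding vectors, so the odd-intersection constraints read Hx = (1, ..., 1). A small sketch, with illustrative names.

```python
def intersection_parities(collection, subset, n):
    """Parity of |C_i ∩ subset| for each C_i, via GF(2) inner products.

    collection: list of subsets of {0, ..., n-1}; subset: a set of indices.
    """
    # Incidence matrix H: row i is the indicator vector of C_i.
    H = [[1 if j in Ci else 0 for j in range(n)] for Ci in collection]
    x = [1 if j in subset else 0 for j in range(n)]
    return [sum(hj & xj for hj, xj in zip(row, x)) % 2 for row in H]

C = [{0, 1}, {1, 2}, {0, 2}]
# X = {0, 1, 2} meets each pair-set in exactly 2 elements: all even.
# X = {0} meets {0, 1} and {0, 2} oddly, and {1, 2} evenly.
```

EVEN SET INTERSECTIONS asks for a non-zero x with all parities 0 (a kernel element), ODD SET INTERSECTIONS for an x with all parities 1.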

Complexity of parity gates
Some of these problems can also be restated in terms of parity gates. A parity gate, or parity Boolean function, f with binary inputs u_1, ..., u_n and support u_{i_1}, ..., u_{i_k} can be defined as the Boolean function that produces a 0 when the number of 1-bits in its support is even, and a 1 otherwise. Thus f can be seen as a linear operator on F_2^n associated with the vector whose components are 1 on the support and 0 elsewhere. By adding an input u_0 which is always equal to 1, it is easy to construct a gate that behaves like the negation of f, i.e. produces a 1 when the number of 1-bits in the support of f is even. For simplicity, we also call such a gate a parity gate. Thus the problems of finding a parity function implementing a set of input-output relationships, exactly or approximately with respect to the Hamming distance, with a support of size equal to w, less than or equal to w, greater than or equal to w, or in the range [w_1, w_2], are all NP-complete, yielding the following theorem.
Theorem 13 (Parity gates) Given m input-output pairs, the problems of finding a parity gate implementing them, exactly or approximately with respect to the Hamming distance, with a support of size equal to w, less than or equal to w, greater than or equal to w, or in the range [w_1, w_2], are all NP-complete.
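To make the definition concrete: a parity gate with support I is the GF(2)-linear map u ↦ ⟨v, u⟩, where v is the indicator vector of I, and prepending a constant input equal to 1 (and adding it to the support) negates the gate. A minimal sketch with illustrative names.

```python
def parity_gate(support, u):
    """Output 1 iff the number of 1-bits of u on the support is odd."""
    return sum(u[i] for i in support) % 2

def negated_parity_gate(support, u):
    """Negation of parity_gate, built by adding a constant input u_0 = 1.

    The extended input is (1, u) and the constant input joins the support,
    which flips the parity of the 1-count and hence the output.
    """
    return parity_gate([0] + [i + 1 for i in support], [1] + list(u))

u = [1, 0, 1, 1]
# Support {0, 2}: bits 1 and 1, an even count, so the gate outputs 0.
# Support {0, 2, 3}: three 1-bits, an odd count, so the gate outputs 1.
```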

Discussion and conclusion
Here we provide a brief discussion of other autoencoders and autoencoder regimes, which either can be folded under the previous analyses or lie beyond the scope and space of this article.

Mixed autoencoders
First, one can consider mixed autoencoders with different constraints on F and G, or different constraints on A and B. A simple example is when the input and output layers are real (F = R), the hidden layer is binary (G = {0, 1}), and Δ is the squared Euclidean distance. It is easy to check that in this case, as long as 2^p = k < m, the autoencoder aims at clustering the real data into k clusters, and all the results obtained in the Boolean case apply with the proper adjustments. For instance, the centroid associated with a hidden state h should be the center of mass of the input vectors mapped onto h. In general, the optimization decision problem for these autoencoders is also NP-complete and, more importantly, from a probabilistic viewpoint they correspond exactly to a mixture of k Gaussians model. Other interesting mixed examples can be derived when A and B correspond to different classes of Boolean functions, such as: (1) unrestricted; (2) linear over GF(2); (3) threshold gates; and (4) monotone functions, in all possible combinations.
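The centroid remark above can be sketched directly: given an assignment of inputs to binary hidden states, the optimal decoder maps each hidden state to the center of mass of its inputs, exactly as in a k-means step with k = 2^p centers. A minimal illustration (names ours; a full solver would also re-optimize the encoder, alternating the two steps).

```python
def decoder_centroids(assignments, data):
    """Optimal decoder under squared-error distortion.

    assignments: one hidden state (a tuple of bits) per input vector;
    data: list of real input vectors.  Returns, for each hidden state h,
    the center of mass of the inputs mapped onto h.
    """
    groups = {}
    for h, x in zip(assignments, data):
        groups.setdefault(h, []).append(x)
    return {h: [sum(col) / len(xs) for col in zip(*xs)]
            for h, xs in groups.items()}

# p = 1, so k = 2^1 = 2 possible hidden states (0,) and (1,).
data = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
cents = decoder_centroids([(0,), (0,), (1,)], data)
```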

Threshold gate autoencoders
Threshold gate autoencoders are a special class of Boolean autoencoders where all the Boolean functions are threshold gate functions. A threshold gate function defined on n inputs u_1, ..., u_n with weights w_1, ..., w_n and threshold t is the function f satisfying f(u_1, ..., u_n) = 1 if ∑_i w_i u_i ≥ t, and f(u_1, ..., u_n) = 0 otherwise. These functions are of interest because they can be viewed as the limiting case of standard sigmoidal neural network functions when the gain of the sigmoidal functions is taken to infinity. They are also closely related to Support Vector Machines [24].
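The definition above is one line of code; for instance, AND and OR of two bits are threshold gates, while XOR (a parity gate) is not, since it is not linearly separable. A minimal sketch with illustrative names.

```python
def threshold_gate(w, t, u):
    """f(u) = 1 if sum_i w_i u_i >= t, and 0 otherwise."""
    return 1 if sum(wi * ui for wi, ui in zip(w, u)) >= t else 0

# AND of two bits: weights (1, 1), threshold 2.
# OR of two bits: weights (1, 1), threshold 1.
# No choice of weights and threshold realizes XOR on two bits,
# which is one way the parity gates above differ from threshold gates.
```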
We conjecture that the threshold gate autoencoder decision problem is NP-complete. This is a reasonable conjecture based on the result that training a 3-node neural network is NP-complete [7]. This result is obtained by showing that training a neural network of threshold gates with n inputs, two hidden nodes, and one output node is NP-complete, by reduction from the SET SPLITTING (or HYPERGRAPH 2-COLORABILITY) problem. In fact, this problem can be embedded into a threshold gate autoencoder problem as follows. Consider an autoencoder of threshold gates with n + 1 inputs, n + 2 hidden units, and n + 1 outputs. The n + 1 inputs correspond to the same x_1, ..., x_n inputs of [7], augmented with the corresponding target value x_{n+1} = t taken from [7]. The inputs x_1, ..., x_n are fully connected to the hidden units h_1, ..., h_n, which in turn are fully connected to the outputs z_1, ..., z_n (allowing for an easy implementation of the identity function). The inputs x_1, ..., x_n are also fully connected to the two hidden units h_{n+1}, h_{n+2}, which in turn are connected to the output z_{n+1}, thereby implementing the circuit used in [7]. These two hidden units are also fully connected to the outputs z_1, ..., z_n. The input x_{n+1} = t is fully connected to the hidden units h_1, ..., h_n. Thus the problem used in [7] becomes a sub-problem of a threshold gate autoencoder problem where some of the connections are missing or forced to be 0; in the construction above, n + 2 connections are forced to be 0 to avoid any communication between x_{n+1} and z_{n+1}. Thus this construction proves the following slightly weaker theorem.
Theorem 14 (Threshold gate autoencoder) The threshold gate autoencoder decision problem where up to a linear number of connections are set to 0 is NP-complete.
Although the threshold gate autoencoder optimization problem may be NP-hard, there are interesting optimization strategies that can be applied to it, in particular alternate optimization. This is because, for fixed A (resp. fixed B), the theory of the unrestricted Boolean autoencoder developed here tells us which Boolean function B (resp. A) ought to implement. Thus we know what the inputs and outputs of this function ought to be, and we are therefore reduced to the problem of training a single-layer threshold gate network, or perceptron, with known inputs and output targets. When the data is linearly separable, this can be solved using the well-known perceptron algorithm. When the data is not linearly separable, maximum-margin approximate solutions can be found using standard Support Vector Machine algorithms [24]. Thus, in either case, there are efficient polynomial-time algorithms for optimizing B (resp. A) individually and proceeding with alternate optimization. This yields a novel approximate algorithm for training threshold gate autoencoders. Furthermore, the algorithm can immediately be extended to autoencoders with multiple layers, as well as to the non-associative case, providing a novel algorithm for training multilayer networks of threshold gates, or other functions, with pre-defined inputs and outputs. Basically, the idea of the algorithm is to train one layer, or fraction of a layer, at a time, using the analyses above to provide suitable output targets for training. Simulations of this approach are in progress and will be presented elsewhere.
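The inner step of this alternate-optimization scheme, once the analysis has fixed the input/target pairs for one layer, is perceptron training of each unit. A minimal sketch of that step (names and the update rule details are ours, not from the paper; by the perceptron convergence theorem it terminates on linearly separable data).

```python
def train_perceptron(samples, targets, epochs=100):
    """Learn weights w and threshold t so that f(u) = [w.u >= t] fits
    the given binary targets, using the classical perceptron rule."""
    n = len(samples[0])
    w, t = [0.0] * n, 0.0
    for _ in range(epochs):
        errors = 0
        for u, y in zip(samples, targets):
            pred = 1 if sum(wi * ui for wi, ui in zip(w, u)) >= t else 0
            if pred != y:
                errors += 1
                delta = y - pred                      # +1 or -1
                w = [wi + delta * ui for wi, ui in zip(w, u)]
                t -= delta                            # threshold acts as -bias
        if errors == 0:                               # converged
            break
    return w, t

# A linearly separable target (logical OR) is learned exactly.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [0, 1, 1, 1]
w, t = train_perceptron(X, Y)
preds = [1 if sum(wi * ui for wi, ui in zip(w, u)) >= t else 0 for u in X]
```

In the autoencoder setting, `samples` are the layer's inputs and `targets` the per-unit outputs prescribed by the fixed complementary layer; non-separable cases would fall back to a maximum-margin solver as noted above.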

Autoencoders with p ≥ n
When the hidden layer is larger than the input layer and F = G, there is an optimal 0-distortion solution involving the identity function. Thus this case is interesting only if additional constraints are added to the problem. These can come in many forms, for instance restrictions on the classes of functions A and B, or regularization, for instance to ensure sparsity of the hidden-layer representation. When these constraints force the hidden layer to assume only k different values with k < m, for instance in the case of a sparse Boolean hidden layer, then some of the previous analyses hold and the problem reduces to a k-clustering problem. Two important concepts that can lead to large hidden layers are the horizontal composition of autoencoders [2] and the presence of "communication" noise, as discussed below.

Autoencoders, coding, and information theory
Not surprisingly, autoencoders are closely related to the theory of information and coding [20]. The autoencoders studied in this paper (p < n) implement a form of compression, which in general is bound to be lossy if the entropy of the input data exceeds the entropy of what can be represented in the hidden layer. Linear autoencoders over GF(2) can be viewed as linear codes for compression. In the dual case of communication over noisy channels (or storage in noisy media), messages are in general expanded before transmission (or storage), with encoding schemes that include, for instance, parity bits, and then decoded at the receiving end. This case corresponds to autoencoders with expansive hidden layers (p > n) (Fig. 8), with the linear case encompassing standard linear codes for communication.
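A toy instance of this expansive regime is the (3, 1) repetition code: each input bit (n = 1) is expanded to three hidden bits (p = 3), and majority decoding recovers it through any single bit flip per block. This is a simple illustrative example of ours, not a construction from the paper, but it is a linear code over GF(2) and hence a linear expansive autoencoder.

```python
def encode(bits):
    """Expansive 'hidden layer': three copies of each input bit."""
    return [b for b in bits for _ in range(3)]

def decode(code):
    """Majority vote over each block of three bits."""
    return [1 if sum(code[i:i + 3]) >= 2 else 0
            for i in range(0, len(code), 3)]

msg = [1, 0, 1]
code = encode(msg)      # length 9: expansion, the dual of compression
code[4] ^= 1            # channel noise: flip one bit of the middle block
```

With no noise, encode/decode acts as a 0-distortion autoencoder; with one flipped bit per block, decoding still recovers the message, which is precisely the point of the expansion.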

Open questions
In closing, autoencoders have a rich mathematical structure and lead to several additional problems that have not been addressed here. Examples of such problems include: (1) the theory of linear autoencoders over finite fields other than GF(2); (2) the derivation of exact and efficient algorithms for the unrestricted Boolean autoencoder, or the linear autoencoder over GF(2), when the number of clusters is fixed (p fixed); (3) the derivation of efficient approximate algorithms for all the NP-hard problems described here; (4) the study of other properties of linear autoencoders over GF(2) (e.g. transposition); and (5) the study of the family of learning algorithms for deep architectures briefly mentioned above.