Abstract
Proposed in the 1940s as a simplified model of the elementary computing unit in the human cortex, artificial neural networks (ANNs) have since been an active research area. Among the many evolutions of ANN, deep neural networks (DNNs) (Hinton, Osindero, and Teh 2006) stand out as a promising extension of the shallow ANN structure. The best demonstration thus far of hierarchical learning based on DNN, along with other Bayesian inference and deduction reasoning techniques, has been the performance of the IBM supercomputer Watson in the legendary tournament on the game show Jeopardy!, in 2011.
Keywords
Graphics Processing Unit, Deep Neural Network, Restricted Boltzmann Machine, Boltzmann Machine, Sentiment Classification
This chapter starts with some basic introductory information about ANN then outlines the DNN structure and learning scheme.
Introducing ANNs
ANNs have been successfully used in many real-life applications, especially in supervised-learning modes. However, ANNs have been plagued by a number of notable challenges and shortcomings. Among the many challenges in supervised learning is the curse of dimensionality (Arnold et al. 2011), which occurs when the number of features and training points becomes significantly large. Big data thus makes ANN learning more difficult, owing to the overwhelming amount of data to process and the consequent memory and computational requirements. Another challenge in classification is the data nonlinearity that characterizes the feature overlap of different classes, making the task of separating the classes more difficult. Primarily for these reasons, and because the network architecture had to be selected heuristically, ANNs lagged through the 1990s and 2000s behind the widely adopted support vector machines (SVMs), which proved to be, in many respects, superior to ANNs.
Note
SVM offers a principled approach to machine learning problems because of its mathematical foundations in statistical learning theory. SVM constructs solutions as a weighted sum of support vectors, which are only a subset of the training input. Like ANN, SVM minimizes a particular error cost function, based on the training data set, and relies on an empirical risk model. Additionally, SVM uses structural risk minimization and imposes an additional constraint on the optimization problem, forcing the optimization step to find a model that will eventually generalize better as it is situated at an equal and maximum distance between the classes.
With advancements in hardware and computational power, DNNs have been proposed as an extension of ANN shallow architectures. Some critics consider deep learning just another “buzzword for neural nets” (Collobert 2011). Although they borrow the concept of neurons from the biological brain, DNNs do not attempt to model it as cortical algorithms (CAs) or other biologically inspired machine learning approaches do. DNN concepts stem from the neocognitron model proposed by Fukushima (1980). Broadly defined as a consortium of machine learning algorithms that aims to learn in a hierarchical manner and that involves multiple levels of abstraction for knowledge representation, DNN architectures are intended to realize strong artificial intelligence (AI) models. These architectures accumulate knowledge as information propagates through higher levels in a manner such that the learning at the higher level is defined by and built on the statistical learning that happens at the lower-level layers.
With such a broad definition of deep learning in mind, we can construe the combinations of the backpropagation algorithm (available since 1974) with recurrent neural networks and convolutional neural networks (introduced in the 1980s) as being the predecessors of deep architectures. However, it is only with the advent of Hinton, Osindero, and Teh’s (2006) contribution to deep learning training that research on deep architectures has picked up momentum. The following sections give a brief overview of ANN, along with introducing in more detail deep belief networks (DBNs) and restricted Boltzmann machines (RBMs).
Early ANN Structures
One of the first ANN attempts dates back to the late 1940s, when the psychologist Donald Hebb (Hebb 1949) introduced what is known today as Hebbian learning, based on the plasticity feature of neurons: when neurons situated on either side of a synapse are stimulated synchronously and recurrently, the synapse’s strength is increased in a manner proportional to the respective outputs of the firing neurons (Brown et al. 1990), such that
$$w_{ij}(t+1) = w_{ij}(t) + \eta_{ij}\, x_i(t)\, x_j(t)$$
where $t$ represents the training epoch, $w_{ij}$ is the weight of the connection between the ith and the jth neurons, $x_i$ is the output of the ith neuron, and $\eta_{ij}$ is a learning rate specific to the synapse concerned.
The Hebbian rule is an unsupervised-learning scheme that updates the weights of a network locally; that is, the training of each synapse depends only on the outputs of the two neurons it connects. With its simple implementation, the Hebbian rule is considered the first ANN learning rule, from which multiple variants have stemmed. The first implementations of this algorithm were in 1954, at the Massachusetts Institute of Technology, using computational machines (Farley and Clark 1954).
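The local character of the rule can be illustrated with a minimal sketch in which each weight grows by the product of the outputs of the two neurons it connects, scaled by a per-synapse learning rate. The function name `hebbian_update`, the three-neuron setup, and the rate of 0.1 are illustrative assumptions, not part of the original formulation.

```python
def hebbian_update(weights, outputs, rates):
    """Return the weights after one Hebbian step.

    Each weight w[i][j] is increased by rates[i][j] * outputs[i] * outputs[j],
    so the update for a synapse uses only the two neurons it connects.
    """
    n = len(outputs)
    return [
        [weights[i][j] + rates[i][j] * outputs[i] * outputs[j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical example: neurons 0 and 1 fire together, neuron 2 stays silent.
w = [[0.0] * 3 for _ in range(3)]
eta = [[0.1] * 3 for _ in range(3)]
x = [1.0, 1.0, 0.0]
for _ in range(5):  # five training epochs with the same activity pattern
    w = hebbian_update(w, x, eta)

print(w[0][1], w[0][2])  # strengthened synapse versus untouched synapse
```

Only the connection between the two co-active neurons is strengthened; connections to the silent neuron keep their initial value, which is the sense in which the update is local.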
Classical ANN
Note

Cell body: Can have a variety of sizes and shapes

Dendrites: Numerous, tree-like structures that extend from the cell body and that constitute the receptive portion of the neuron (i.e., the input site)

Axon: A long, slender structure, with relatively few branches, that transmits electrical signals to connected areas
The inputs (X) are connected to the neuron through weighted connections emulating the dendrite’s structure, whereas the summation, the bias (b), and the activation function (θ) play the role of the cell body, and the propagation of the output is analogous to the axon in a biological neuron.
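Mathematically, a neuron with $n$ inputs is equivalent to the function:
$$Y = \theta\left(\sum_{i=1}^{n} W_i X_i + b\right)$$
which can be conveniently modeled, using a matrix form,
$$Y = \theta(W \cdot X + b)$$
where $W = [W_1 \; W_2 \; \cdots \; W_n]$, and $X = [X_1 \; X_2 \; \cdots \; X_n]^T$.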

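Hard limiter: $\theta(x) = 1$ if $x \ge 0$, and $\theta(x) = 0$ otherwise

Saturating linear function: $\theta(x) = 0$ if $x < 0$; $\theta(x) = x$ if $0 \le x \le 1$; $\theta(x) = 1$ if $x > 1$

Log-sigmoid function: $\theta(x) = \dfrac{1}{1 + e^{-x}}$

Hyperbolic tangent sigmoid function: $\theta(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$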
The bias shifts the activation function to the right or the left, as necessary for learning, and can in some cases be omitted.
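The four activation functions and the shifting effect of the bias can be sketched as follows. The definitions use common textbook conventions (for example, a hard limiter with outputs 0 and 1), which is an assumption; the helper name `neuron` and the sample values are illustrative.

```python
import math


def hard_limiter(x):
    """Step function: 1 for non-negative input, 0 otherwise (assumed 0/1 convention)."""
    return 1.0 if x >= 0 else 0.0


def saturating_linear(x):
    """Identity on [0, 1], clipped to 0 below and to 1 above."""
    return min(1.0, max(0.0, x))


def log_sigmoid(x):
    """Smooth squashing function with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_sigmoid(x):
    """Smooth squashing function with outputs in (-1, 1)."""
    return math.tanh(x)


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)


# The bias moves the point at which the neuron switches on.
print(neuron([1.0], [1.0], 0.0, hard_limiter))   # threshold at input 0
print(neuron([1.0], [1.0], -2.0, hard_limiter))  # threshold shifted right to input 2
```

With a bias of -2, the same input no longer reaches the switching point, which is the rightward shift described above.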
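A layer of $m$ neurons, each receiving the same $n$ inputs, can be conveniently represented, using matrix notation, as follows:
$$W = \begin{bmatrix} W_{11} & W_{12} & \cdots & W_{1n} \\ W_{21} & W_{22} & \cdots & W_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ W_{m1} & W_{m2} & \cdots & W_{mn} \end{bmatrix}$$
The row index in each element of this matrix represents the destination neuron of the corresponding connection, whereas the column index refers to the input source of the connection.
Designating by Y the output of the layer, you can write
$$Y = \theta(W \cdot X + B)$$
where $B = [b_1 \; b_2 \; \cdots \; b_m]^T$.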
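The function achieved by this network is
$$Y = \theta\big(W_L \cdot \theta(W_{L-1} \cdots \theta(W_1 \cdot X + B_1) \cdots + B_{L-1}) + B_L\big)$$
where $W_k$ and $B_k$ denote the weight matrix and the bias vector of layer $k$, and layer $L$ is the output layer.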
Note
For the sake of simplicity, the same activation function θ has been adopted in all layers. However, multiple activation functions can be used in different layers in a network. Also, the number of neurons per layer may not be constant throughout the network.
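A minimal sketch of how such a network computes its output: each layer applies the activation function to its weighted inputs plus biases, and the result feeds the next layer. The names `layer_forward` and `network_forward`, the 2-3-1 layout, and all weight values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(W, B, X, activation):
    """Output of one layer; row j of W holds the weights into neuron j."""
    return [
        activation(sum(w * x for w, x in zip(row, X)) + b)
        for row, b in zip(W, B)
    ]


def network_forward(layers, X, activation=sigmoid):
    """Feed X through a list of (W, B) pairs, one pair per layer."""
    for W, B in layers:
        X = layer_forward(W, B, X, activation)
    return X


# Hypothetical 2-3-1 network: the layer sizes differ, the activation is shared.
layers = [
    ([[0.5, -0.5], [0.3, 0.8], [-0.2, 0.1]], [0.0, 0.1, -0.1]),  # 3 neurons, 2 inputs
    ([[1.0, -1.0, 0.5]], [0.2]),                                   # 1 neuron, 3 inputs
]
Y = network_forward(layers, [1.0, 2.0])
print(Y)
```

Passing a different `activation` per layer would only require storing it alongside each `(W, B)` pair; the shared function is kept here for brevity.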
The optimal number of layers and neurons for best performance is a question yet to be answered decisively, because this number is application dependent. A layer of hidden neurons divides the input space into regions whose boundaries are defined by the hyperplanes associated with each neuron.
The smaller the number of hidden neurons, the fewer the subregions created and the more the network tends to cluster points and map them to the same output. The output of each neuron is a nonlinear transformation of a hyperplane. In the case of classification, this separating curve formed by weighted inputs coming from the previous layer contributes, with other neurons in the same layer, to defining the final classification boundary. With a large number of neurons, the risk of overfitting increases, and the generalized performance decreases, because of overtraining. The network must be trained with enough data points to ensure that the partitions obtained at each hidden layer correctly separate the data.
ANN Training and the Backpropagation Algorithm
To enable an ANN to recognize patterns belonging to different classes, training on an existing dataset seeks to obtain iteratively the set of weights and biases that achieves the highest performance of the network (Jain, Mao, and Mohiuddin 1996).
In a network with M inputs, N output neurons, and L hidden layers, and given a set of labeled data (that is, a set of P pairs (X, T), where X is an M-dimensional vector, and T is an N-dimensional vector), the learning problem is reduced to finding the optimal weights, such that a cost function is optimized. The output of the network should match the target $T_i$ and minimize the mean squared error,
$$E = \frac{1}{P} \sum_{i=1}^{P} \lVert T_i - Y_i \rVert^2$$
where $Y_i$ is the output obtained by propagating input $X_i$ through the network.
Backpropagation training proceeds in three steps:
1. Initialization: This step initializes the weights of the network in a random, weak manner; that is, it assigns random values close to 0 to the connections’ weights.
2. Feedforward: The input $X_i$ is fed into the network and propagated to the output layer. The resultant error is computed.
3. Feedback: The weights and biases are updated with:
$$W \leftarrow W - \alpha\, \frac{\partial E}{\partial W}, \qquad b \leftarrow b - \alpha\, \frac{\partial E}{\partial b}$$
where $E$ is the error (the cost function) and $\alpha$ is a positive, tunable learning rate. The choice of $\alpha$ affects whether the backpropagation algorithm converges and how fast it converges. A large learning rate may cause the algorithm to oscillate, whereas a small learning rate may lead to a very slow convergence.
Because the update of the weights necessitates computing the gradient of the error (the cost function), it is essential for the cost function to be differentiable. Failure to satisfy this condition prevents the use of the backpropagation algorithm.
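The gradient is obtained by propagating the error backward through the network. Here $y_i^{k}$ denotes the output of unit $i$ in layer $k$ (with $y^{0} = X$), $a_i^{k}$ its net input, and $w_{ij}^{k}$ the weight of the connection from unit $j$ in layer $k-1$ to unit $i$ in layer $k$:
1. For each output unit $y_i^{L}$ (in output layer L of Figure 7-4), the backpropagated error is computed, using
$$\delta_i^{L} = (T_i - y_i^{L})\, \theta'(a_i^{L})$$
where $T_i$ is the desired output; and, for the sigmoidal function,
$$\theta'(a) = \theta(a)\,\big(1 - \theta(a)\big)$$
resulting in the following expression:
$$\delta_i^{L} = (T_i - y_i^{L})\, y_i^{L}\, (1 - y_i^{L})$$
2. For each hidden unit $y_j^{k}$ (in a hidden layer k with $N_k$ hidden units), and moving from layer L–1 backward to the first layer, the backpropagated error can be computed as shown:
$$\delta_j^{k} = \theta'(a_j^{k}) \sum_{i=1}^{N_{k+1}} w_{ij}^{k+1}\, \delta_i^{k+1}$$
3. The weights and biases are updated according to the following gradient descent, in which constant factors from differentiating the squared error are absorbed into $\alpha$:
$$w_{ij}^{k} \leftarrow w_{ij}^{k} + \alpha\, \delta_i^{k}\, y_j^{k-1}, \qquad b_i^{k} \leftarrow b_i^{k} + \alpha\, \delta_i^{k}$$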
The network error is eventually reduced via this gradient-descent approach. For instance, considering a one-dimensional training point that belongs to class 1 (+1) and that is wrongly classified as class 2 (–1), the hyperplane should be moved away from class 1. Because each parameter $w$ is updated by $-\alpha\, \partial E / \partial w$, the hyperplane will be shifted to the left (decrease in $w$) if $\partial E / \partial w > 0$, and it will be shifted to the right (increase in $w$) if $\partial E / \partial w < 0$.
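The initialization, feedforward, and feedback steps can be put together in a short sketch. It assumes a single hidden layer, sigmoid activations everywhere, per-sample updates, and a toy logical-AND data set; the function names, the network size, the learning rate of 0.5, and the epoch count are illustrative choices rather than recommendations.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(net, x):
    """Return the hidden activations and the single output for input x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    out = sigmoid(sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"])
    return hidden, out


def mse(net, data):
    """Mean squared error over all (input, target) pairs."""
    return sum((t - forward(net, x)[1]) ** 2 for x, t in data) / len(data)


def train(data, n_hidden=2, alpha=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Initialization: small random weights close to 0.
    net = {
        "W1": [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)],
        "b2": 0.0,
    }
    for _ in range(epochs):
        for x, t in data:
            # Feedforward: propagate the input to the output layer.
            hidden, y = forward(net, x)
            # Feedback: output error first, then hidden errors (using the
            # weights as they were before this update), then the updates.
            delta_out = (t - y) * y * (1.0 - y)
            delta_hid = [h * (1.0 - h) * net["W2"][j] * delta_out
                         for j, h in enumerate(hidden)]
            for j, h in enumerate(hidden):
                net["W2"][j] += alpha * delta_out * h
                for i, xi in enumerate(x):
                    net["W1"][j][i] += alpha * delta_hid[j] * xi
                net["b1"][j] += alpha * delta_hid[j]
            net["b2"] += alpha * delta_out
    return net


# Logical AND as a toy labeled data set of (X, T) pairs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]
untrained = train(data, epochs=0)
trained = train(data)
print(mse(untrained, data), mse(trained, data))
```

Comparing the two printed errors shows the effect of training; rerunning with a much larger or much smaller `alpha` is a quick way to observe the oscillation and slow-convergence behavior discussed above.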
DBN Overview
Although a DBN can be viewed as an ANN with more hidden layers, training a DBN, using backpropagation, does not produce a good machine learning model, because the explaining-away phenomenon makes inference more difficult in deep models. When training a network, the simplifying assumption is made that layers are independent. Explaining away (also called Berkson’s paradox or selection bias) makes this assumption invalid; the hidden nodes become anticorrelated. For example, if an output node can be activated by two equally rare and independent events with an even smaller chance of occurring simultaneously (because the probability of two independent events’ occurring simultaneously is the product of both probabilities), then the occurrence of one event negates (“explains away”) the occurrence of the other, such that a negative correlation is obtained between the two events. As a result of the difficulty of training deep architectures, DBNs lost popularity until Hinton and Salakhutdinov (2006) proposed a greedy training algorithm to train them efficiently. This algorithm broke down DBNs into sequentially stacked RBMs, each of which is a two-layer network constrained to contain only interlayer neuron connections, that is, connections between neurons that do not belong to the same layer.
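The explaining-away effect can be made concrete with a small numerical sketch. It assumes, for illustration only, that the output node fires deterministically whenever at least one of two independent causes A and B occurs; the function name and the probability of 0.01 are hypothetical.

```python
def explaining_away(p_a, p_b):
    """Posterior probability of cause A, before and after B is also observed.

    Assumes the output is on exactly when A or B (or both) occurred.
    """
    p_out = 1.0 - (1.0 - p_a) * (1.0 - p_b)  # P(output on)
    p_a_given_out = p_a / p_out               # A always switches the output on
    p_a_given_out_and_b = p_a                 # B alone already explains the output
    return p_a_given_out, p_a_given_out_and_b


before, after = explaining_away(0.01, 0.01)
print(before, after)
```

Once the output is observed, learning that B occurred makes A far less likely than it was before B was known, even though A and B are independent a priori. This induced negative correlation is what invalidates the independence assumption.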
DNN Nomenclature 

$w_{ij}^{r}$: Weight of the edge connecting neuron i in layer r to neuron j in layer r + 1; r is suppressed when there are only two layers in the network
$w_{i}^{r}$: Weight vector of all connections leaving neuron i in layer r
$W_{r}$: Weight vector connecting layer r to layer r + 1
μ: Learning rate
k: Number of Gibbs sampling steps performed in contrastive divergence
n: Total number of hidden layer neurons
m: Total number of input layer neurons
$p(\cdot \mid \cdot)$: Conditional probability distribution
$h^{r}$: Binary configuration of layer r
$p(h^{r})$: Prior probability of $h^{r}$ under the current weight values
$v^{0}$: Input layer data point
$v_j^{t}$: Binary configuration of neuron j in the input layer at sampling step t
$H_i$: Binary configuration variable of neuron i in the hidden layer at sampling step t
$h_i^{t}$: Binary configuration value of neuron i in the hidden layer at sampling step t
$b_j$: Bias term for neuron j in the input layer
$c_i$: Bias term for neuron i in the hidden layer
Restricted Boltzmann Machines
Boltzmann machines (BMs) are two-layer neural network architectures composed of neurons connected in an interlayer and intralayer fashion. Restricted Boltzmann machines (RBMs), first introduced under the name Harmonium, by Smolensky (1986), are constrained to form a bipartite graph. A bipartite graph is a two-layer graph, in which the nodes of the two layers form two disjoint sets of neurons. This is achieved by restricting intralayer connections, such that connections between nodes in the same layer are not permitted. This restriction is what distinguishes BMs from RBMs and makes RBMs simpler to train. An RBM with undirected connections between neurons of the different layers forms an autoassociative memory, analogous to neurons in the human brain. Autoassociative memory is characterized by feedback connections that allow the exchange of information between neurons in both directions (Hawkins 2007).
RBM Training Algorithm Workflow, Using CD (Fischer and Igel, 2012)
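Based on the Gibbs distribution, the energy function or loss function used to describe the joint probability distribution is denoted in Equation 7-1, where $w_{ij}$, $b_j$, and $c_i$ are real-valued weights, and $h_i$ and $v_j$ can take values in the set $\{0, 1\}$ (Aleksandrovsky et al. 1996):
$$E(v, h) = -\sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij}\, h_i\, v_j - \sum_{j=1}^{m} b_j\, v_j - \sum_{i=1}^{n} c_i\, h_i \qquad (7\text{-}1)$$
The joint probability distribution is thus computed using Equation 7-2:
$$p(v, h) = \frac{e^{-E(v, h)}}{\sum_{v', h'} e^{-E(v', h')}} \qquad (7\text{-}2)$$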
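For a very small RBM, the energy and the joint distribution can be computed by brute-force enumeration, as in the sketch below. It assumes binary units, a weight matrix indexed as `w[i][j]` with `i` over hidden neurons and `j` over input neurons, and hypothetical parameter values; enumeration is only feasible for a handful of neurons because the number of configurations grows exponentially.

```python
import itertools
import math


def energy(v, h, w, b, c):
    """Energy of one joint configuration (v, h) of a binary RBM."""
    interaction = sum(w[i][j] * h[i] * v[j]
                      for i in range(len(h)) for j in range(len(v)))
    return (-interaction
            - sum(bj * vj for bj, vj in zip(b, v))
            - sum(ci * hi for ci, hi in zip(c, h)))


def joint_distribution(w, b, c):
    """Probability of every binary (v, h) pair under the Gibbs distribution."""
    n, m = len(c), len(b)  # n hidden neurons, m input neurons
    states = [(v, h)
              for v in itertools.product((0, 1), repeat=m)
              for h in itertools.product((0, 1), repeat=n)]
    unnormalized = {s: math.exp(-energy(s[0], s[1], w, b, c)) for s in states}
    z = sum(unnormalized.values())  # normalizing constant (partition function)
    return {s: u / z for s, u in unnormalized.items()}


# Hypothetical parameters: 2 hidden neurons (rows of w), 3 input neurons (columns).
w = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.0]]
b = [0.1, -0.1, 0.0]
c = [0.2, -0.2]
p = joint_distribution(w, b, c)
print(len(p), sum(p.values()))
```

Low-energy configurations receive high probability. The normalizing sum over all configurations is what makes exact training intractable for realistic sizes and motivates sampling-based methods such as CD.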
DNN Training Algorithms
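Backpropagation is one of the most popular algorithms used to train ANNs (Werbos 1974). Equation 7-3 displays a simple formulation of the weight update rule, used in backpropagation:
$$w_{ij}(t+1) = w_{ij}(t) - \mu\, \frac{\partial E}{\partial w_{ij}} \qquad (7\text{-}3)$$
where $\mu$ is the learning rate and $E$ is the network error.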
However, as the depth of the network increases, backpropagation’s performance degrades, making it unsuitable for training general deep architectures. This is due to the vanishing gradient problem (Hochreiter 1991; Hochreiter et al. 2001; Hinton 2007; Bengio 2009), a training issue in which the error propagated back in the network shrinks as it moves from layer to layer, becoming negligible in deep architectures and making it almost impossible for the weights in the early layers to be updated. As a result, training a DNN this way is too slow to yield meaningful results.
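The shrinking of the backpropagated error can be illustrated with a deliberately simplified sketch: one sigmoid neuron per layer, with the same weight and the same net input in every layer. These simplifications, and the function name, are assumptions made for illustration; real networks behave less uniformly but follow the same multiplicative pattern.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def backpropagated_signal(depth, weight=1.0, pre_activation=0.0):
    """Magnitude of a unit error signal after passing back through `depth` layers.

    Each layer multiplies the signal by weight * sigmoid'(pre_activation),
    and the sigmoid derivative never exceeds 0.25.
    """
    s = sigmoid(pre_activation)
    factor = abs(weight) * s * (1.0 - s)
    return factor ** depth


for depth in (1, 5, 10, 20):
    print(depth, backpropagated_signal(depth))
```

Even in this best case for the sigmoid derivative (net input of 0), the signal reaching the early layers falls off geometrically with depth, so their weight updates become negligible.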
Because of backpropagation’s shortcomings, many attempts were made to develop a fast training algorithm for deep networks. Schmidhuber’s algorithm (Schmidhuber 1992) trained a multilevel hierarchy of recurrent neural networks by using unsupervised pretraining on each layer and then fine-tuning the resulting weights via backpropagation.
1. A greedy layer-wise training to learn the weights by
   a. Tying the weights of the unlearned layers.
   b. Applying CD to learn the weights of the current layer.
2. An up-down algorithm for fine-tuning the weights
This process of tying, learning, and untying weights is repeated until all layers have been processed. DBNs with tied weights resemble RBMs. Therefore, as mentioned earlier, each RBM is learned, using CD learning. However, this algorithm can only be applied if the first two layers form an undirected graph, and the remaining hidden layers form a directed, acyclic graph.
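A single CD update for one RBM can be sketched as follows, loosely following the workflow described by Fischer and Igel (2012). The sketch assumes binary units, one training point per update, and hypothetical names and parameter values; `w[i][j]` connects hidden neuron `i` to input neuron `j`, `mu` is the learning rate, and `k` is the number of Gibbs sampling steps.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def p_hidden(v, w, c):
    """Probability that each hidden neuron is on, given the input layer v."""
    return [sigmoid(sum(w[i][j] * v[j] for j in range(len(v))) + c[i])
            for i in range(len(c))]


def p_visible(h, w, b):
    """Probability that each input neuron is on, given the hidden layer h."""
    return [sigmoid(sum(w[i][j] * h[i] for i in range(len(h))) + b[j])
            for j in range(len(b))]


def sample(probabilities, rng):
    """Draw a binary configuration from independent on-probabilities."""
    return [1 if rng.random() < p else 0 for p in probabilities]


def cd_k_update(v0, w, b, c, mu=0.1, k=1, rng=random):
    """Apply one contrastive-divergence update in place; return the final sample."""
    vk = list(v0)
    for _ in range(k):  # k steps of alternating Gibbs sampling
        hk = sample(p_hidden(vk, w, c), rng)
        vk = sample(p_visible(hk, w, b), rng)
    ph0, phk = p_hidden(v0, w, c), p_hidden(vk, w, c)
    for i in range(len(c)):
        for j in range(len(b)):
            # Data-driven term minus sample-driven term.
            w[i][j] += mu * (ph0[i] * v0[j] - phk[i] * vk[j])
        c[i] += mu * (ph0[i] - phk[i])
    for j in range(len(b)):
        b[j] += mu * (v0[j] - vk[j])
    return vk


rng = random.Random(0)
n, m = 2, 4  # n hidden neurons, m input neurons
w = [[0.0] * m for _ in range(n)]
b = [0.0] * m
c = [0.0] * n
for _ in range(100):
    cd_k_update([1, 1, 0, 0], w, b, c, mu=0.1, k=1, rng=rng)
print(b)
```

In the greedy layer-wise scheme, an update of this kind would be repeated over the training set for the current layer, after which the learned layer's hidden activations serve as the input data for the next RBM.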
The energy of the directed model is computed using Equation 7-4, which is bounded by Equation 7-5. Tying the weights produces equality in Equation 7-5 and renders the approximating posterior $Q(\cdot \mid v^0)$ and the conditional $p(v^0 \mid h^0)$ constant. The derivative of Equation 7-5 is then simplified and equal to Equation 7-6. Therefore, tying the weights leads to a simpler objective function to maximize. Applying this rule recursively allows the training of a DBN (Hinton, Osindero, and Teh 2006).
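As a sketch of what the energy, its bound, and the simplified derivative look like, the relations can be written in the notation of Hinton, Osindero, and Teh (2006). The symbols $Q(h^0 \mid v^0)$ (an approximating posterior over the first hidden layer) and $\mathcal{L}$ (the bound) follow that paper or are introduced here for illustration; they are not part of the nomenclature above, and the numbering of the chapter's own equations is not assumed.

```latex
\begin{align}
E(v^0, h^0) &= -\left[\log p(h^0) + \log p(v^0 \mid h^0)\right] \\
\log p(v^0) \;\ge\; \mathcal{L}(v^0)
  &= \sum_{h^0} Q(h^0 \mid v^0)\left[\log p(h^0) + \log p(v^0 \mid h^0)\right]
   - \sum_{h^0} Q(h^0 \mid v^0)\,\log Q(h^0 \mid v^0) \\
\frac{\partial \mathcal{L}(v^0)}{\partial w}
  &= \sum_{h^0} Q(h^0 \mid v^0)\,\frac{\partial \log p(h^0)}{\partial w}
\end{align}
```

The last line holds when $Q(\cdot \mid v^0)$ and $p(v^0 \mid h^0)$ are held fixed, so that only the prior term $\log p(h^0)$ depends on the higher-level weights $w$ being learned.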
Up–Down Algorithm Workflow (Hinton and Salakhutdinov 2006)
Despite its limitations when applied to DNNs, interest in the backpropagation algorithm was renewed, because of the surge in graphics processing unit (GPU) computational power. Ciresan et al. (2010) investigated the performance of the backpropagation algorithm on deep networks. It was observed that, even with the vanishing gradient problem, given enough epochs, backpropagation can achieve results comparable to those of other, more complex training algorithms.
It is to be noted that supervised learning with deep architectures has been reported as performing well on many classification tasks. However, when the network is pretrained in an unsupervised fashion, it almost always performs better than when the pretraining phase is omitted (Erhan et al. 2010). Several theories have been proposed to explain this phenomenon, such as that the pretraining phase acts as a regularizer (Bengio 2009; Erhan et al. 2009) and an aid (Bengio et al. 2007) for the supervised optimization problem.
DNN-Related Research
The use of DBN in various machine learning applications has flourished since the introduction of Hinton’s fast, greedy training algorithm. Furthermore, many attempts have been made to speed up DBN and address its weaknesses. The following sections offer a brief survey of the most recent and relevant applications of DBN, a presentation of research aimed at speeding up training, and a discussion of several DBN variants and DNN architectures.
DNN Applications
DNN has been applied to many machine learning applications, including feature extraction, feature reduction, and classification problems, to name a few.
Feature extraction involves transforming raw input data to feature vectors that represent the input; raw data can be audio, image, or text. For example, DBN has been applied to discrete Fourier transform (DFT) representation of music audio (Hamel and Eck 2010) and found to outperform Mel frequency cepstral coefficients (MFCCs), a widely used method of music audio feature extraction.
Once features are extracted from raw data, the high-dimensional data representation may have to be reduced to alleviate the memory and computational requirements of classification tasks as well as enable better visualization of the data and decrease the memory needed to store the data for future use. Hinton and Salakhutdinov (Hinton and Salakhutdinov 2006; Salakhutdinov and Hinton 2007) used a stack of RBMs to pretrain the network and then employed autoencoder networks to learn the low-dimensional features.
Extracting expressive and low-dimensional features, using DBN, was shown to be possible for fast retrieval of documents and images, as tested on some ever-growing databases. Ranzato and Szummer (2008) were able to produce compact representations of documents to speed up search engines, while outperforming shallow machine learning algorithms. Applied to image retrieval from large databases, DBN produced results comparable to state-of-the-art algorithms, including latent Dirichlet allocation and probabilistic latent semantic analysis (Hörster and Lienhart 2008).
Transferring learned models from one domain to another has always been an issue for machine learning algorithms. However, DNN was able to extract domain-independent features (Bengio and Delalleau 2011), making transfer learning possible in many applications (Collobert and Weston 2008; Glorot, Bordes, and Bengio 2011; Bengio 2012; Ciresan, Meier, and Schmidhuber 2012; Mesnil et al. 2012). DNNs have also been used for curriculum learning, in which data are learned in a specific order (Bengio et al. 2009).
DBN has been applied to many classification tasks in fields such as vision, speech, medical ailments, and natural language processing (NLP). Object recognition from images has been widely addressed, and DBN’s performance exceeded state-of-the-art algorithms (Desjardins and Bengio 2008; Uetz and Behnke 2009; Ciresan et al. 2010; Ciresan, Meier, and Schmidhuber 2012). For instance, Ciresan et al. (2010) achieved an error rate of 0.35 percent on the Modified National Institute of Standards and Technology (MNIST) database. Nair and Hinton (2009) outperformed shallow architectures, including SVM, on three-dimensional object recognition, achieving a 6.5 percent error rate, on the New York University Object Recognition Benchmark (NORB) dataset, compared with SVM’s 11.6 percent. Considering speech recognition tasks, deep architectures have improved acoustic modeling (Mohamed et al. 2011; Hinton et al. 2012), speech-to-text transcription (Seide, Li, and Yu 2011), and large-vocabulary speech recognition (Dahl et al. 2012; Jaitly et al. 2012; Sainath et al. 2011). On phone recognition tasks, DBN achieved an error rate of 23 percent on the TIMIT database, better than reported errors, ranging from 24.4 percent to 36 percent, using other machine learning algorithms (Mohamed, Yu, and Deng 2010).
DBN produced classification results comparable to other machine learning algorithms in seizure prediction, using electroencephalography (EEG) signals, but reached those results in significantly faster times—between 1.7 and 103.7 times faster (Wulsin et al. 2011). McAfee (2008) adopted DBN for document classification and showed promise for succeeding on such databases.
Generating synthetic images (specifically facial expressions) from a high-level description of human emotion is another area in which DBN has been successfully applied, producing a variety of realistic facial expressions (Susskind et al. 2008).
NLP, in general, has also been investigated with deep architectures to improve on state-of-the-art results. Such applications include machine transliteration (Deselaers et al. 2009), sentiment analysis (Zhou, Chen, and Wang 2010; Glorot, Bordes, and Bengio 2011), and language modeling (Collobert and Weston 2008; Weston et al. 2012), including part-of-speech tagging, similar-word recognition, and chunking. The complexity of these problems requires a machine learning algorithm with more depth (Bengio and Delalleau 2011) to produce meaningful results. For example, machine transliteration poses a challenge to machine learning algorithms, because the words do not have a unified mapping, which leads to a many-to-many mapping that does not exist in dictionaries. Additionally, the large number of source-to-target language-pair character symbols and different sound structures leading to missing sounds are just a few properties of transliteration that make it difficult for machines to do well.
Parallel Implementations to Speed Up DNN Training
Sequentially training a DBN layer by layer becomes more time-consuming as the layer and network sizes increase. Stacking the layers to form networks, called deep-stacking networks, and training the network on CPU clusters, as opposed to one supercomputer (Deng, Hutchinson, and Yu 2012), exploit the inherent parallelism in the greedy training algorithm to achieve significant training-time savings.
However, this method does not speed up the training time per layer. This can be achieved by parallelizing the training algorithm for the individual RBM layers, using GPUs (Cai et al. 2012).
However, use of the large and sparse data commonly employed to train RBMs creates challenges for parallelizing this algorithm. Modifying the computations for matrix-based operations and optimizing the matrix–matrix multiplication code for sparse matrices make GPU implementation much faster than CPU implementation.
As opposed to speeding up training via software, attempts have been made to speed up training via hardware, using field-programmable gate arrays (FPGAs). Ly and Chow (2010) mapped RBMs to FPGAs and achieved significant speedup over the optimized software code. This work was extended to investigate the scalability of the approach by Lo (2010).
Deep Networks Similar to DBN
One variation of DBN, called modular DBN (MDBN), trains different parts of the network separately, while adjusting the learning rate as training progresses (Pape et al. 2011), as opposed to using one training set for the whole network. This allows MDBN to avoid forgetting features learned early in training, a weakness of DBN that can hinder its performance in online learning applications in which the data distribution changes dynamically over time.
Sparse DBN learns sparse features—unlike Hinton’s DBN, which learns nonsparse data representations—by adding a penalty in the objective function for deviations from the expected activation of hidden units in the RBM formulation (Lee, Ekanadham, and Ng 2007).
Convolutional DBN integrates translation invariance into the image representations by sharing weights between locations in an image, allowing inference to be done when the image is scaled up by using convolution (Lee et al. 2009). Therefore, convolutional DBN scales better to real-world-sized images without suffering from computational intractability as a result of the high dimensionality of these images.
DBNs are not the only deep architectures available. Sum-product network (SPN) is a deep architecture represented as a graph with directed and weighted edges. SPN is acyclic (contains no loops), with variables on the leaves of the graph, and its internal nodes consist of sum and product operations (Poon and Domingos 2011). SPN is trained using backpropagation and expectation maximization (EM) algorithms. These simple operations result in a network that is more accurate, faster to train, and more tractable than DBN.
Deep Boltzmann machines (DBMs) are similar to but have a more general deep architecture than DBNs. They are composed of BMs stacked on top of each other (Salakhutdinov and Hinton 2009). Although more complex and slower to train than DBNs, owing to the symmetrical connections between all neurons in the BM network, the two-way edges let DBMs propagate input uncertainty better than DBNs, making their generative models more robust. The more complex architecture requires an efficient training algorithm to make training feasible. The DBN greedy training algorithm was modified to achieve a more efficient training algorithm for DBM by using an approximate inference algorithm. However, this rendered DBM training approximately three times slower than DBN training (Salakhutdinov and Larochelle 2010).
References
Aleksandrovsky, Boris, James Whitson, Gretchen Andes, Gary Lynch, and Richard Granger. “Novel Speech Processing Mechanism Derived from Auditory Neocortical Circuit Analysis.” In Proceedings of the Fourth International Conference on Spoken Language, edited by H. Timothy Bunnell and William Idsardi, 558–561. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 1996.
Arnold, Ludovic, Sébastien Rebecchi, Sylvain Chevallier, and Hélène Paugam-Moisy. “An Introduction to Deep Learning.” In Proceedings of the 19th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, April 27–29, 2011, edited by Michel Verleysen, 477–488. Leuven, Belgium: Ciaco, 2011.
Bengio, Yoshua. “Learning Deep Architectures for AI.” In Foundations and Trends in Machine Learning 2, no. 1 (2009): 1–127.
Bengio, Yoshua. “Deep Learning of Representations for Unsupervised and Transfer Learning.” In ICML 2011: Proceedings of the International Conference on Machine Learning Unsupervised and Transfer Learning Workshop, edited by Isabelle Guyon, Gideon Dror, Vincent Lemaire, Graham Taylor, and Daniel Silver, 17–36. 2012. http://jmlr.csail.mit.edu/proceedings/papers/v27/bengio12a/bengio12a.pdf .
Bengio, Yoshua, and Olivier Delalleau. “On the Expressive Power of Deep Architectures.” In Algorithmic Learning Theory, edited by Jyrki Kivinen, Csaba Szepesvári, Esko Ukkonen, and Thomas Zeugmann, 18–36. Berlin: Springer, 2011.
Bengio, Yoshua, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. “Greedy Layer-Wise Training of Deep Networks.” In NIPS ’06: Proceedings of Advances in Neural Information Processing Systems 19, edited by Bernhard Schölkopf, John Platt, and Thomas Hofmann, 153–160. Cambridge, MA: Massachusetts Institute of Technology Press, 2007.
Bengio, Yoshua, Jérôme Louradour, Ronan Collobert, and Jason Weston. “Curriculum Learning.” In ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning, edited by Léon Bottou and Michael Littman, 41–48. New York: ACM, 2009.
Brown, Thomas H., Edward W. Kairiss, and Claude L. Keenan. “Hebbian Synapses: Biophysical Mechanisms and Algorithms.” Annual Review of Neuroscience 13, no. 1 (1990): 475–511.
Cai, Xianggao, Zhanpeng Xu, Guoming Lai, Chengwei Wu, and Xiaola Lin. “GPU-Accelerated Restricted Boltzmann Machine for Collaborative Filtering.” In Algorithms and Architectures for Parallel Processing: Proceedings of the 12th International ICA3PP Conference, Fukuoka, Japan, September 2012, edited by Yang Xiang, Ivan Stojmenović, Bernady O. Apduhan, Guojun Wang, Koji Nakano, and Albert Zomaya, 303–316. Berlin: Springer, 2012.
Ciresan, Dan Claudiu, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. “Deep, Big, Simple Neural Nets for Handwritten Digit Recognition.” Neural Computation 22, no. 12 (2010): 3207–3220.
Ciresan, Dan Claudiu, Ueli Meier, and Jürgen Schmidhuber. “Transfer Learning for Latin and Chinese Characters with Deep Neural Networks.” In Proceedings of the 2012 International Joint Conference on Neural Networks, 1–6. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2012.
Collobert, Ronan. “Deep Learning for Efficient Discriminative Parsing.” Recorded April 2011. AISTATS video, 21:16. Posted May 6, 2011. http://videolectures.net/aistats2011_collobert_deep/ .
Collobert, Ronan, and Jason Weston. “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning.” In ICML ’08: Proceedings of the 25th International Conference on Machine Learning, edited by Andrew McCallum and Sam Roweis, 160–167. New York: ACM, 2008.
Dahl, George E., Dong Yu, Li Deng, and Alex Acero. “Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition.” IEEE Transactions on Audio, Speech, and Language Processing 20, no. 1 (2012): 30–42.
Deng, Li, Brian Hutchinson, and Dong Yu. “Parallel Training for Deep Stacking Networks.” In Interspeech 2012: Proceedings of the 13th Annual Conference of the International Speech Communication Association. 2012. www.iscaspeech.org/archive/interspeech_2012 .
Deselaers, Thomas, Saša Hasan, Oliver Bender, and Hermann Ney. “A Deep Learning Approach to Machine Transliteration.” In Proceedings of the Fourth Workshop on Statistical Machine Translation, 233–241. Stroudsburg, PA: Association for Computational Linguistics, 2009.
Desjardins, Guillaume, and Yoshua Bengio. “Empirical Evaluation of Convolutional RBMs for Vision.” Technical report, Université de Montréal, 2008.
Erhan, Dumitru, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. “Why Does Unsupervised Pre-Training Help Deep Learning?” Journal of Machine Learning Research 11 (2010): 625–660.
Erhan, Dumitru, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent. “The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training.” In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, edited by David van Dyk and Max Welling, 153–160. 2009. http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS09_ErhanMBBV.pdf .
Farley, B. G., and W. Clark. “Simulation of Self-Organizing Systems by Digital Computer.” IEEE Transactions of the IRE Professional Group on Information Theory 4, no. 4 (1954): 76–84.
Fischer, Asja, and Christian Igel. “An Introduction to Restricted Boltzmann Machines.” In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: Proceedings of the 17th Iberoamerican Congress, CIARP 2012, Buenos Aires, Argentina, September 3–6, 2012, edited by Luis Alvarez, Marta E. Mejail, Luis E. Gomez, and Julio E. Jacobo, 14–36. Berlin: Springer, 2012.
Fukushima, Kunihiko. “Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position.” Biological Cybernetics 36 (1980): 193–202.
Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. “Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach.” In ICML ’11: Proceedings of the 28th International Conference on Machine Learning, 513–520. 2011. www.icml2011.org/papers/342_icmlpaper.pdf .
Hamel, Philippe, and Douglas Eck. “Learning Features from Music Audio with Deep Belief Networks.” In ISMIR 2010: Proceedings of the 11th International Society for Music Information Retrieval Conference, August 9–13, 2010, Utrecht, the Netherlands, edited by J. Stephen Downie and Remco C. Veltkamp, 339–344. International Society for Music Information Retrieval, 2010. http://ismir2010.ismir.net/proceedings/ISMIR2010_complete_proceedings.pdf .
Hawkins, Jeff, and Sandra Blakeslee. On Intelligence. New York: Macmillan, 2007.
Haykin, Simon. Neural Networks. Upper Saddle River, NJ: Prentice Hall, 1994.
Hebb, Donald. The Organization of Behavior. New York: Wiley, 1949.
Hinton, Geoffrey E. “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14, no. 8 (2002): 1771–1800.
Hinton, Geoffrey E. “To Recognize Shapes, First Learn to Generate Images.” Progress in Brain Research 165 (2007): 535–547.
Hinton, Geoffrey E. “A Practical Guide to Training Restricted Boltzmann Machines.” Momentum 9, no. 1 (2010).
Hinton, Geoffrey E., Li Deng, Dong Yu, George E. Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew Senior, et al. “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups.” IEEE Signal Processing Magazine 29, no. 6 (2012): 82–97.
Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. “A Fast Learning Algorithm for Deep Belief Nets.” Neural Computation 18, no. 7 (2006): 1527–1554.
Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. “Reducing the Dimensionality of Data with Neural Networks.” Science 313, no. 5786 (2006): 504–507.
Hochreiter, Sepp. “Untersuchungen zu dynamischen neuronalen Netzen.” Master's thesis, Technical University of Munich, 1991.
Hochreiter, Sepp, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. “Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies.” In A Field Guide to Dynamical Recurrent Neural Networks, edited by John F. Kolen and Stefan C. Kremer, 237–244. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2001.
Hörster, Eva, and Rainer Lienhart. “Deep Networks for Image Retrieval on Large-Scale Databases.” In Proceedings of the 16th ACM International Conference on Multimedia, 643–646. New York: ACM, 2008.
Jain, Anil K., Jianchang Mao, and K. M. Mohiuddin. “Artificial Neural Networks: A Tutorial.” Computer 29, no. 3 (1996): 31–44.
Jaitly, Navdeep, Patrick Nguyen, Andrew W. Senior, and Vincent Vanhoucke. “Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition.” In Interspeech 2012: Proceedings of the 13th Annual Conference of the International Speech Communication Association. 2012. www.iscaspeech.org/archive/interspeech_2012/ .
Lee, Honglak, Chaitanya Ekanadham, and Andrew Y. Ng. “Sparse Deep Belief Net Model for Visual Area V2.” In Proceedings of NIPS 2007: Advances in Neural Information Processing Systems, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis. 2008. http://papers.nips.cc/paper/3313sparsedeepbeliefnetmodelforvisualareav2.pdf .
Lee, Honglak, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. “Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations.” In ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning, edited by Léon Bottou and Michael Littman, 609–616. New York: ACM, 2009.
Lo, Charles. “A FPGA Implementation of Large Restricted Boltzmann Machines.” In Proceedings of the 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2–4, 2010, Charlotte, NC, 201–208. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2010.
Ly, Daniel L., and Paul Chow. “High-Performance Reconfigurable Hardware Architecture for Restricted Boltzmann Machines.” IEEE Transactions on Neural Networks 21, no. 11 (2010): 1780–1792.
McAfee, Lawrence. “Document Classification Using Deep Belief Nets,” 2008.
Mesnil, Grégoire, Yann Dauphin, Xavier Glorot, Salah Rifai, Yoshua Bengio, Ian J. Goodfellow, Erick Lavoie, et al. “Unsupervised and Transfer Learning Challenge: A Deep Learning Approach.” In ICML 2011: Proceedings of the International Conference on Machine Learning Unsupervised and Transfer Learning Workshop, edited by Isabelle Guyon, Gideon Dror, Vincent Lemaire, Graham Taylor, and Daniel Silver, 97–110. 2012. http://jmlr.csail.mit.edu/proceedings/papers/v27/mesnil12a/mesnil12a.pdf .
Mohamed, Abdelrahman, Tara N. Sainath, George Dahl, Bhuvana Ramabhadran, Geoffrey E. Hinton, and Michael A. Picheny. “Deep Belief Networks Using Discriminative Features for Phone Recognition.” In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, 5060–5063. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2011.
Mohamed, Abdelrahman, Dong Yu, and Li Deng. “Investigation of Full-Sequence Training of Deep Belief Networks for Speech Recognition.” In Interspeech 2010: Proceedings of the 11th Annual Conference of the International Speech Communication Association, edited by Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura, 2846–2849. 2010. www.iscaspeech.org/archive/interspeech_2010/i10_2846.html .
Nair, Vinod, and Geoffrey E. Hinton. “3D Object Recognition with Deep Belief Nets.” In NIPS ’09: Proceedings of Advances in Neural Information Processing Systems 22, edited by Yoshua Bengio, Dale Schuurmans, John Lafferty, Chris Williams, and Aron Culotta, 1339–1347. 2009. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2009_0807.pdf .
Pape, Leo, Faustino Gomez, Mark Ring, and Jürgen Schmidhuber. “Modular Deep Belief Networks That Do Not Forget.” In Proceedings of the 2011 International Joint Conference on Neural Networks, 1191–1198. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2011.
Poon, Hoifung, and Pedro Domingos. “Sum-Product Networks: A New Deep Architecture.” In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops, 689–690. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2011.
Ranzato, Marc’Aurelio, and Martin Szummer. “Semi-Supervised Learning of Compact Document Representations with Deep Networks.” In ICML ’08: Proceedings of the 25th International Conference on Machine Learning, edited by Andrew McCallum and Sam Roweis, 792–799. New York: ACM, 2008.
Rosenblatt, Frank. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review 65, no. 6 (1958): 386–408.
Sainath, Tara N., Brian Kingsbury, Bhuvana Ramabhadran, Petr Fousek, Petr Novak, and Abdelrahman Mohamed. “Making Deep Belief Networks Effective for Large Vocabulary Continuous Speech Recognition.” In Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, edited by Thomas Hain and Kai Yu, 30–35. Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2011.
Salakhutdinov, Ruslan, and Geoffrey Hinton. “Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure.” In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, edited by Marina Meila and Xiaotong Shen, 412–419. 2007. http://jmlr.csail.mit.edu/proceedings/papers/v2/salakhutdinov07a/salakhutdinov07a.pdf .
Salakhutdinov, Ruslan, and Geoffrey Hinton. “Deep Boltzmann Machines.” In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, edited by David van Dyk and Max Welling, 448–455. 2009. www.jmlr.org/proceedings/papers/v5/salakhutdinov09a/salakhutdinov09a.pdf .
Salakhutdinov, Ruslan, and Hugo Larochelle. “Efficient Learning of Deep Boltzmann Machines.” In Proceedings of the 13th Annual International Conference on Artificial Intelligence and Statistics, edited by Yee Whye Teh and Mike Titterington, 693–700. 2010. www.dmi.usherb.ca/∼larocheh/publications/aistats_2010_dbm_recnet.pdf .
Schmidhuber, Jürgen. “Learning Complex, Extended Sequences Using the Principle of History Compression.” Neural Computation 4 (1992): 234–242.
Seide, Frank, Gang Li, and Dong Yu. “Conversational Speech Transcription Using Context-Dependent Deep Neural Networks.” In Interspeech 2011: Proceedings of the 12th Annual Conference of the International Speech Communication Association, edited by Piero Cosi, Renato De Mori, Giuseppe Di Fabbrizio, and Roberto Pieraccini, 437–440. 2011. www.iscaspeech.org/archive/interspeech_2011 .
Smolensky, Paul. “Information Processing in Dynamical Systems: Foundations of Harmony Theory.” In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1, edited by David E. Rumelhart, James L. McClelland, and the PDP Research Group, 194–281. Cambridge, MA: Massachusetts Institute of Technology Press, 1986.
Susskind, Joshua M., Geoffrey E. Hinton, Javier R. Movellan, and Adam K. Anderson. “Generating Facial Expressions with Deep Belief Nets.” In Affective Computing: Focus on Emotion Expression, Synthesis and Recognition, edited by Jimmy Or, 421–440. Vienna: I-Tech, 2008.
Uetz, Rafael, and Sven Behnke. “Locally-Connected Hierarchical Neural Networks for GPU-Accelerated Object Recognition.” In Proceedings of the NIPS 2009 Workshop on Large-Scale Machine Learning Parallelism and Massive Datasets. 2009.
Werbos, Paul. “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences.” PhD thesis, Harvard University, 1974.
Weston, Jason, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. “Deep Learning via Semi-Supervised Embedding.” In Neural Networks: Tricks of the Trade, Second Edition, edited by Grégoire Montavon, Geneviève Orr, and Klaus-Robert Müller, 639–655. Berlin: Springer, 2012.
Wulsin, D. F., J. R. Gupta, R. Mani, J. A. Blanco, and B. Litt. “Modeling Electroencephalography Waveforms with Semi-Supervised Deep Belief Nets: Fast Classification and Anomaly Measurement.” Journal of Neural Engineering 8, no. 3 (2011): 036015.
Zhou, Shusen, Qingcai Chen, and Xiaolong Wang. “Active Deep Networks for Semi-Supervised Sentiment Classification.” In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, edited by Chu-Ren Huang and Dan Jurafsky, 1515–1523. Stroudsburg, PA: Association for Computational Linguistics, 2010.