Abstract
In this chapter, we introduce the background of neural networks and review related literature. Section 2.1 introduces the general neural network and its learning algorithm, backpropagation. Section 2.2 addresses the particularities of natural language processing, and introduces neural language models and word embedding learning. Section 2.3 introduces existing structure-sensitive neural networks, including the convolutional neural network, the recurrent neural network, and the recursive neural network.
Notes
- 1. The orthodox perceptron, introduced by Rosenblatt [40], uses only thresholding as the activation function; that is, if the weighted sum of the input is less than a threshold, the perceptron outputs 0, and 1 otherwise (a small sketch follows these notes). In this sense, the perceptron is a special type of neuron. However, we do not distinguish these two terminologies as they are very similar.
- 2. The font unambiguously indicates whether the target label is represented by the index or the one-hot vector. Therefore, it is common to omit the superscripts "id" and "onehot." (A small conversion example follows these notes.)
- 3. The assumption is in fact trivial because every finite, discrete distribution is a multinomial distribution (spelled out briefly after these notes).
- 4. The backpropagation equations are useful only when we implement backpropagation manually. Nowadays, mature auto-differentiation tools such as TensorFlow and PyTorch are available, where backpropagation is handled automatically. However, it is still interesting to understand backpropagation from a mathematical perspective, and manual implementation is also a fun exercise (a minimal example follows these notes).
- 5. An interesting terminological abuse is that textbook stochastic gradient descent (SGD) usually refers to updating with a single data point, i.e., a batch size of 1, whereas in research papers it may refer to mini-batch gradient descent with a batch size greater than 1. In this book, we follow the convention of the literature and abuse the two terminologies when needed (a sketch of the mini-batch loop follows these notes).
- 6. We denote \(w_i, w_{i+1}, \ldots , w_j\) by \(\boldsymbol{w}_i^j\) for short.
- 7. Subscripts I and O represent input and output, respectively.
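To make note 1 concrete, here is a minimal NumPy sketch of the thresholding rule; the input, weights, and threshold below are illustrative assumptions, not values from the text.

```python
import numpy as np

def perceptron(x, w, threshold):
    """Rosenblatt-style perceptron: output 0 if the weighted sum of the
    input is below the threshold, and 1 otherwise."""
    return 1 if np.dot(w, x) >= threshold else 0

# Illustrative values only.
x = np.array([1.0, 0.5, -0.2])
w = np.array([0.4, 0.3, 0.8])
print(perceptron(x, w, threshold=0.5))  # 0, since the weighted sum 0.39 < 0.5
```

Replacing the hard threshold with a smooth activation (e.g., the sigmoid) turns this special case into a neuron in the general sense.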
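As a small companion to note 2, the sketch below converts an index label into its one-hot counterpart; the class count and index are arbitrary assumptions.

```python
import numpy as np

def to_onehot(label_id, num_classes):
    """Map an index label to the equivalent one-hot vector."""
    onehot = np.zeros(num_classes)
    onehot[label_id] = 1.0
    return onehot

# The index 2 and the vector (0, 0, 1, 0) carry exactly the same information.
print(to_onehot(2, num_classes=4))  # [0. 0. 1. 0.]
```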
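To spell out note 3: a distribution over \(K\) discrete outcomes is fully specified by probabilities \(p_1, \ldots , p_K\) with \(p_k \ge 0\) and \(\sum_{k=1}^{K} p_k = 1\), which is exactly a categorical (single-draw multinomial) distribution, so assuming a multinomial output imposes no real restriction.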
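To complement note 4, here is a minimal NumPy sketch of manual backpropagation for a single sigmoid neuron with squared error; the data, initialization, and learning rate are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])   # input (illustrative)
t = 1.0                     # target
w = np.array([0.1, 0.2])    # weights
b = 0.0                     # bias
lr = 0.1                    # learning rate

for _ in range(100):
    z = w @ x + b
    y = sigmoid(z)
    # Chain rule from the squared-error loss 0.5 * (y - t)**2 back to the parameters.
    delta = (y - t) * y * (1.0 - y)   # dL/dz = dL/dy * dy/dz
    w -= lr * delta * x               # dL/dw = dL/dz * dz/dw
    b -= lr * delta                   # dL/db = dL/dz * dz/db
```

With an auto-differentiation framework such as PyTorch, the same gradients would be obtained by building the loss from differentiable operations and calling loss.backward() instead of deriving them by hand.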
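Finally, as an illustration of note 5, the sketch below implements a generic mini-batch gradient descent loop; grad_fn and all hyperparameters are hypothetical placeholders, and setting batch_size=1 recovers textbook SGD.

```python
import numpy as np

def minibatch_sgd(X, y, grad_fn, theta, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch gradient descent; grad_fn(theta, X_batch, y_batch) is
    assumed to return the gradient of the loss with respect to theta."""
    n = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(n)            # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - lr * grad_fn(theta, X[idx], y[idx])
    return theta
```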
References
Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: Advances in Neural Information Processing Systems, pp. 153–160 (2007)
Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
Bordes, A., Weston, J., Collobert, R., Bengio, Y.: Learning structured embeddings of knowledge bases. In: Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 301–306 (2011)
Cho, K., van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder-decoder approaches (2014). arXiv preprint arXiv:1409.1259
Collobert, R., Weston, J.: A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160–167 (2008)
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signal Syst. 2(4), 303–314 (1989)
Fu, R., Guo, J., Qin, B., Che, W., Wang, H., Liu, T.: Learning semantic hierarchies via word embeddings. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 1199–1209 (2014)
Graves, A., Mohamed, A., Hinton, G.: Speech recognition with deep recurrent neural networks. In: Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649 (2013)
Grbovic, M., Radosavljevic, V., Djuric, N., Bhamidipati, N., Savla, J., Bhagwan, V., Sharp, D.: E-commerce in your inbox: Product recommendations at scale. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1809–1818 (2015)
Guo, J., Che, W., Wang, H., Liu, T.: Revisiting embedding features for simple semi-supervised learning. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 110–120 (2014)
Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res. 13(1), 307–361 (2012)
Hastie, T., Tibshirani, R., Friedman, J., Franklin, J.: The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer (2009)
Haykin, S.S.: Neural Networks and Learning Machines. Pearson Education (2009)
He, H., Gimpel, K., Lin, J.: Multi-perspective sentence similarity modeling with convolutional neural networks. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1576–1586 (2015)
Hermann, K., Blunsom, P.: The role of syntax in vector space models of compositional semantics. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pp. 894–904 (2013)
Hinton, G., Osindero, S., Teh, Y.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Hu, B., Lu, Z., Li, H., Chen, Q.: Convolutional neural network architectures for matching natural language sentences. In: Advances in Neural Information Processing Systems, pp. 2042–2050 (2014)
Proakis, J.G., Manolakis, D.G.: Digital Signal Processing: Principles, Algorithms, and Applications. Prentice Hall (1996)
Ji, Y., Eisenstein, J.: One vector is not enough: Entity-augmented distributed semantics for discourse relations. Trans. Assoc. Comput. Linguist. 3, 329–344 (2015)
Jurafsky, D., Martin, J.: Speech and Language Processing. Pearson Education (2000)
Kalchbrenner, N., Grefenstette, E., Blunsom, P.: A convolutional neural network for modelling sentences. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 655–665 (2014)
Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: Generalization gap and sharp minima. In: Proceedings of the International Conference on Learning Representations (2017)
Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746–1751 (2014)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (2015)
Le, P., Zuidema, W.: Compositional distributional semantics with long short term memory (2015). arXiv preprint arXiv:1503.02510
Le, Q.V.: Building high-level features using large scale unsupervised learning. In: Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8595–8598 (2013)
LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., Drucker, H., Guyon, I., Muller, U., Sackinger, E., et al.: Comparison of learning algorithms for handwritten digit recognition. In: Proceedings of the International Conference on Artificial Neural Networks, pp. 53–60 (1995)
Lei, T., Barzilay, R., Jaakkola, T.: Molding CNNs for text: Non-linear, non-consecutive convolutions. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1565–1575 (2015)
Li, W.: Random texts exhibit Zipf’s-law-like word frequency distribution. IEEE Trans. Inf. Theory 38(6), 1842–1845 (1992)
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., Khudanpur, S.: Recurrent neural network based language model. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association, pp. 1045–1048 (2010)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (2013)
Mou, L., Peng, H., Li, G., Xu, Y., Zhang, L., Jin, Z.: Discriminative neural sentence modeling by tree-based convolution. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2315–2325 (2015)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning, pp. 807–814 (2010)
Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks (2012). arXiv preprint arXiv:1211.5063
Peng, H., Mou, L., Li, G., Chen, Y., Lu, Y., Jin, Z.: A comparative study on regularization strategies for embedding-based neural networks. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2106–2111 (2015)
Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710 (2014)
Rosenblatt, F.: The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 65(6), 386 (1958)
Socher, R., Huval, B., Manning, C., Ng, A.: Semantic compositionality through recursive matrix-vector spaces. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1201–1211 (2012)
Socher, R., Karpathy, A., Le, Q., Manning, C., Ng, A.: Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. 2, 207–218 (2014)
Socher, R., Pennington, J., Huang, E., Ng, A., Manning, C.: Semi-supervised recursive autoencoders for predicting sentiment distributions. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 151–161 (2011)
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642 (2013)
Song, Y., Mou, L., Yan, R., Yi, L., Zhu, Z., Hu, X., Zhang, M.: Dialogue session segmentation by embedding-enhanced TextTiling. In: Proceedings of Interspeech, pp. 2706–2710 (2016)
Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: Proceedings of the 30th International Conference on Machine Learning, pp. 1139–1147 (2013)
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)
Szeliski, R.: Computer Vision: Algorithms and Applications. Springer Science & Business Media (2010)
Tai, K., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pp. 1556–1566 (2015)
Tan, J., Wan, X., Xiao, J.: Abstractive document summarization with a graph-based attentional neural model. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1171–1181 (2017)
Tang, D., Qin, B., Liu, T.: Document modeling with gated recurrent neural network for sentiment classification. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1422–1432 (2015)
Laurent, T., von Brecht, J.: A recurrent neural network without chaos. In: Proceedings of the International Conference on Learning Representations (2017). https://openreview.net/forum?id=S1dIzvclg
Vincent, P.: A connection between score matching and denoising autoencoders. Neural Comput. 23(7), 1661–1674 (2011)
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103 (2008)
Webb, A.: Statistical Pattern Recognition. Wiley (2003)
Xu, Y., Mou, L., Li, G., Chen, Y., Peng, H., Jin, Z.: Classifying relations via long short term memory networks along shortest dependency paths. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1785–1794 (2015)
Zaremba, W., Sutskever, I.: Learning to execute (2014). arXiv preprint arXiv:1410.4615
Zeiler, M.D.: AdaDelta: An adaptive learning rate method (2012). arXiv preprint arXiv:1212.5701
Zhu, X., Sobhani, P., Guo, Y.: Long short-term memory over tree structures. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 1604–1612 (2015)