Towards human-like and transhuman perception in AI 2.0: a review

  • Yong-hong Tian
  • Xi-lin Chen
  • Hong-kai Xiong
  • Hong-liang Li
  • Li-rong Dai
  • Jing Chen
  • Jun-liang Xing
  • Jing Chen
  • Xi-hong Wu
  • Wei-min Hu
  • Yu Hu
  • Tie-jun Huang
  • Wen Gao


Perception is the interaction interface between an intelligent system and the real world. Without sophisticated and flexible perceptual capabilities, it is impossible to create advanced artificial intelligence (AI) systems. One of the most significant features of the next-generation AI, called ‘AI 2.0’, will be that AI is empowered with intelligent perceptual capabilities, which can simulate the mechanisms of the human brain and are likely to surpass the human brain in terms of performance. In this paper, we briefly review the state-of-the-art advances across different areas of perception, including visual perception, auditory perception, speech perception, and perceptual information processing and learning engines. On this basis, we envision several R&D trends in intelligent perception for the forthcoming era of AI 2.0, including: (1) human-like and transhuman active vision; (2) auditory perception and computation in an actual auditory setting; (3) speech perception and computation in a natural interaction setting; (4) autonomous learning of perceptual information; (5) large-scale perceptual information processing and learning platforms; and (6) urban omnidirectional intelligent perception and reasoning engines. We believe these research directions should be highlighted in the future plans for AI 2.0.

Key words

Intelligent perception · Active vision · Auditory perception · Speech perception · Autonomous learning



Copyright information

© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2017

Authors and Affiliations

  • Yong-hong Tian (1)
  • Xi-lin Chen (2)
  • Hong-kai Xiong (3)
  • Hong-liang Li (4)
  • Li-rong Dai (5)
  • Jing Chen (1)
  • Jun-liang Xing (6)
  • Jing Chen (7)
  • Xi-hong Wu (1)
  • Wei-min Hu (6)
  • Yu Hu (5)
  • Tie-jun Huang (1)
  • Wen Gao (1)
  1. School of Electronics Engineering and Computer Science, Peking University, Beijing, China
  2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  3. Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China
  4. School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China
  5. Department of Electronic Engineering and Information Sciences, University of Science and Technology of China, Hefei, China
  6. Institute of Automation, Chinese Academy of Sciences, Beijing, China
  7. School of Optoelectronics, Beijing Institute of Technology, Beijing, China