Part of the book series: Lecture Notes in Economics and Mathematical Systems (LNE, volume 83)

Abstract

Visual and speech perception tasks, which can be performed with no apparent effort by people, have proved to be difficult for machines. This may be in part due to the absence of cognitive models of perception of the type proposed above by Jakobson. In this paper we attempt to give a unified view of the research in machine perception of speech and vision in the hope that a clear appreciation of similarities and differences may lead to better information processing models of perception. Being active in research in both computer vision and speech, we have found it useful to look at the problems that have arisen in one domain and anticipate corresponding problems in the other (Reddy, 1969). Thus, this paper represents a comparative study of the issues, systems and unsolved problems that are, at present, of interest to visual and speech recognition research.

It is clear that all the (visual and speech) phenomena occur in both space and time. In visual signs it is the spatial dimension which takes priority, whereas the temporal dimension takes priority in auditory signs...what is the substantial difference between spatial and auditory signs? We observe a strong tendency to reify visual signs, to connect them with objects, to ascribe mimesis to such signs, and to view them as elements of an “imitative art”. ... On the other hand verbal and musical signs show us two essential features. First, both music and language present a consistently hierarchized structure, and, second, both are resolvable into ultimate, discrete, rigorously patterned components which, as such, have no existence in nature but are built ad hoc.

One should not draw the frequently suggested but over-simplified conclusion that speech displays a purely linear character or that visual perception is performed by purely simultaneous synthesis. Luria shows that in our perception of a painting, we first deploy step-by-step efforts to go over from certain selected details from parts to the whole, and for the contemplator of a painting the integration follows as a further phase, as a goal. In the fifth century, Bhartrhari, the great master of Indic linguistic theory, distinguished three stages in a speech event, conceptualization, production and audition, and comprehension. While production and audition are naturally sequential, both conceptualization and comprehension of the whole message is done at one and the same time. This conception is akin to the modern psychological problem of “short-term memory”.

Jakobson (1964)

This research was supported in part by the Advanced Research Projects Agency of the Department of Defense under contract no. F44620-70-C-0107 and monitored by Air Force Office of Scientific Research.

References

  • Barnett, J. (1972), A Vocal Data Management System, International Conference on Speech Communication and Processing, Boston, 340–343.

  • Chase, W.G. and H.A. Simon (1971), Perception in Chess, CIP-182, Dept. of Psychology, Carnegie-Mellon Univ., Pittsburgh, Pa.

  • Chomsky, N. and M. Halle (1968), The Sound Pattern of English, Harper and Row, New York.

  • Erman, L.D. (1973), An Environment and System for Machine Recognition of Continuous Speech, Ph.D. Thesis, Computer Science Dept., Stanford Univ., to appear as a Technical Report, Computer Science Dept., Carnegie-Mellon Univ., Pittsburgh, Pa.

  • Fant, G. (1960), Acoustic Theory of Speech Production, Mouton and Company: The Hague.

  • Fant, G. (1970), Automatic Recognition and Speech Research, Quarterly Progress Report, 16–31, Dept. of Speech Communication, KTH, Stockholm.

  • Feldman, J.A. et al. (1969), The Stanford Hand Eye Project, Proc. IJCAI, May 7–9, Washington, D.C.

  • Feldman, J.A. et al. (1971), The Use of Vision and Manipulation to Solve the “Instant Insanity” Puzzle, Proc. Second IJCAI, London, 359–365.

  • Fikes, R.E. and N.J. Nilsson (1971), STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving, Proc. Second IJCAI, London, 608–621.

  • Flanagan, J.L. (1965), Speech Analysis, Synthesis, and Perception, Academic Press: New York. Second edition, 1971.

  • Forgie, J. (1972), Personal Communication, MIT Lincoln Laboratories, Lexington, Mass.

  • Fry, D.B. and P.B. Denes (1959), The Design and Operation of a Mechanical Speech Recognizer, J. British IRE, 19, 211–229.

  • Gillogly, J.J. (1972), The TECHNOLOGY Chess Program, Artificial Intelligence, 3, 145–163.

  • Hughes, G.W. and J.F. Hemdal (1965), Speech Analysis, Tech. Rept. AFCRL-65-681, Purdue Univ., Lafayette, Ind.

  • Jakobson, R. (1964), About the Relation between Visual and Auditory Signs, Models for the Perception of Speech and Visual Form (Ed. Wathen-Dunn), MIT Press, Cambridge, Mass., 1–7.

  • Kelly, M.D. (1970), Visual Identification of People by Computers, AIM-130, Ph.D. Thesis, Computer Science Dept., Stanford Univ., Stanford, Ca.

  • Krakauer, L.J. (1971), Computer Analysis of Visual Properties of Curved Objects, Ph.D. Thesis, Electrical Engineering Dept., MIT, Cambridge, Mass.

  • Lehiste, I. (1967), Readings in Acoustic-Phonetics, MIT Press, Cambridge, Mass.

  • Minsky, M. and S. Papert (1972), Artificial Intelligence, Technical Report, AI Group, MIT, Cambridge, Mass.

  • Narasimhan, R. (1966), Syntax-Directed Interpretation of Classes of Pictures, CACM, 9, 3, 166–173.

  • Neely, R.B. (1973), On the Use of Syntax and Semantics in a Speech Understanding System, Ph.D. Thesis, Stanford Univ., to appear as a Technical Report, Computer Science Dept., Carnegie-Mellon Univ., Pittsburgh, Pa.

  • Newell, A. (1970), Remarks on the Relationship between Artificial Intelligence and Cognitive Psychology, in Banerji and Mesarovic (eds.), Non-Numerical Problem Solving, 363–400, Springer-Verlag.

  • Newell, A. and H.A. Simon (1972), Human Problem Solving, Prentice-Hall.

  • Newell, A. et al. (1973), Visualization, unpublished research, Carnegie-Mellon Univ., Pittsburgh, Pa.

  • Nilsson, N.J. (1969), A Mobile Automaton: An Application of Artificial Intelligence Techniques, Proc. IJCAI, May 7–9, Washington, D.C.

  • Pierce, J.R. (1969), Whither Speech Recognition, J. Acoust. Soc. Am., 46, 1049–1051.

  • Reddy, D.R. (1967), Computer Recognition of Connected Speech, J. Acoust. Soc. Am., 42, 2, 329–347.

  • Reddy, D.R. (1969), On the Use of Environmental, Syntactic, and Probabilistic Constraints in Vision and Speech, AIM 78, Computer Science Dept., Stanford Univ., Stanford, Ca.

  • Reddy, D.R., L.D. Erman, and R.B. Neely (1972), A Model and A System for Machine Recognition of Speech (to be published in IEEE Trans. on Audio and Electroacoustics, 1973).

  • Reddy, D.R., W.J. Davis, R.B. Ohlander, and D.J. Bihary (1972a), Computer Analysis of Neuronal Structure, Technical Report, Computer Science Dept., Carnegie-Mellon Univ., Pittsburgh, Pa.

  • Reddy, D.R., B. Broadley, L. Erman, R. Johnsson, J. Newcomer, G. Robertson, and J. Wright (1972b), XCRIBL, A Hardcopy Scan Line Graphics System for Document Generation, Technical Report, Computer Science Dept., Carnegie-Mellon Univ., Pittsburgh, Pa.

  • Reddy, D.R., L.D. Erman, R. Fennell, and R.B. Neely (1973), The HEARSAY Speech Understanding System, to be published.

  • Rosenfeld, A. (1969), Picture Processing by Computer, Academic Press, N.Y.

  • Rosenfeld, A. (1973), Progress in Picture Processing: 1969–71, Computing Surveys, 5, in press.

  • Simon, H.A. and M. Barenfeld (1969), Information Processing Analysis of Perceptual Processes in Problem Solving, Psychological Review, 76, 473–483.

  • Tenenbaum, J.M. (1970), Accommodation in Computer Vision, Ph.D. Thesis, Computer Science Dept., Stanford Univ., Stanford, Ca.

  • Vicens, P.J. (1969), Aspects of Speech Recognition by a Computer, Ph.D. Thesis, AIM 85, Computer Science Dept., Stanford Univ., Stanford, Ca.

  • Walker, D. (1972), Personal Communication, Stanford Research Institute, Menlo Park, Ca.

  • Winston, P.H. (1971), Learning Structural Descriptions from Visual Scenes, Ph.D. Thesis, Electrical Engineering Dept., MIT, Cambridge, Mass.

  • Woods, W. (1972), Personal Communication, Bolt, Beranek and Newman, Cambridge, Mass.

© 1973 Springer-Verlag Berlin · Heidelberg

About this paper

Cite this paper

Reddy, R. (1973). Eyes and Ears for Computers. In: Einsele, T., Giloi, W., Nagel, HH. (eds) NTG/GI Gesellschaft für Informatik Nachrichtentechnische Gesellschaft Fachtagung „Cognitive Verfahren und Systeme“. Lecture Notes in Economics and Mathematical Systems, vol 83. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-80749-7_1

  • DOI: https://doi.org/10.1007/978-3-642-80749-7_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-06268-4

  • Online ISBN: 978-3-642-80749-7

  • eBook Packages: Springer Book Archive
