Abstract
Visual and speech perception tasks, which can be performed with no apparent effort by people, have proved to be difficult for machines. This may be in part due to the absence of cognitive models of perception of the type proposed above by Jakobson. In this paper we attempt to give a unified view of the research in machine perception of speech and vision in the hope that a clear appreciation of similarities and differences may lead to better information processing models of perception. Being active in research in both computer vision and speech, we have found it useful to look at the problems that have arisen in one domain and anticipate corresponding problems in the other (Reddy, 1969). Thus, this paper represents a comparative study of the issues, systems and unsolved problems that are, at present, of interest to visual and speech recognition research.
It is clear that all the (visual and speech) phenomena occur in both space and time. In visual signs it is the spatial dimension which takes priority, whereas the temporal dimension takes priority in auditory signs ... what is the substantial difference between spatial and auditory signs? We observe a strong tendency to reify visual signs, to connect them with objects, to ascribe mimesis to such signs, and to view them as elements of an “imitative art”. ... On the other hand verbal and musical signs show us two essential features. First, both music and language present a consistently hierarchized structure, and, second, both are resolvable into ultimate, discrete, rigorously patterned components which, as such, have no existence in nature but are built ad hoc.
One should not draw the frequently suggested but over-simplified conclusion that speech displays a purely linear character or that visual perception is performed by purely simultaneous synthesis. Luria shows that in our perception of a painting, we first deploy step-by-step efforts to go over from certain selected details from parts to the whole, and for the contemplator of a painting the integration follows as a further phase, as a goal. In the fifth century, Bhartrhari, the great master of Indic linguistic theory, distinguished three stages in a speech event, conceptualization, production and audition, and comprehension. While production and audition are naturally sequential, both conceptualization and comprehension of the whole message is done at one and the same time. This conception is akin to the modern psychological problem of “short-term memory”.
Jakobson (1964)
This research was supported in part by the Advanced Research Projects Agency of the Department of Defense under contract no. F44620-70-C-0107 and monitored by Air Force Office of Scientific Research.
References
Barnett, J. (1972), A Vocal Data Management System, International Conference on Speech Communication and Processing, Boston, 340–343.
Chase, W.G. and H.A. Simon (1971), Perception in Chess, CIP-182, Dept. of Psychology, Carnegie-Mellon Univ., Pittsburgh, Pa.
Chomsky, N. and M. Halle (1968), The Sound Pattern of English, Harper and Row, New York.
Erman, L.D. (1973), An Environment and System for Machine Recognition of Continuous Speech, Ph.D. Thesis, Computer Science Dept., Stanford Univ., to appear as a Technical Report, Computer Science Dept., Carnegie-Mellon Univ., Pittsburgh, Pa.
Fant, G. (1960), Acoustic Theory of Speech Production, Mouton and Company: The Hague.
Fant, G. (1970), Automatic Recognition and Speech Research, Quarterly Progress Report, 16–31, Dept. of Speech Communication, KTH, Stockholm.
Feldman, J.A., et al. (1969), The Stanford Hand Eye Project, Proc. IJCAI, May 7–9, Washington, D.C.
Feldman, J.A. et al. (1971), The Use of Vision and Manipulation to Solve the “Instant Insanity” Puzzle, Proc. Second IJCAI, London, 359–365.
Fikes, R.E. and N.J. Nilsson (1971), STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving, Proc. Second IJCAI, London, 608–621.
Flanagan, J.L. (1965), Speech Analysis, Synthesis, and Perception, Academic Press: New York. Second edition, 1971.
Forgie, J. (1972), Personal Communication, MIT Lincoln Laboratories, Lexington, Mass.
Fry, D.B. and P.B. Denes (1959), The Design and Operation of a Mechanical Speech Recognizer, J. British IRE, 19, 211–229.
Gillogly, J.J. (1972), The TECHNOLOGY Chess Program, Artificial Intelligence, 3, 145–163.
Hughes, G.W. and J.F. Hemdal (1965), Speech Analysis, Tech. Rept., AFCRL-65-681, Purdue Univ., Lafayette, Ind.
Jakobson, R. (1964), About the Relation between Visual and Auditory Signs, Models for the Perception of Speech and Visual Form (Ed. Wathen-Dunn), MIT Press, Cambridge, Mass., 1–7.
Kelly, M.D. (1970), Visual Identification of People by Computers, AIM-130, Ph.D. thesis, Computer Science Dept., Stanford Univ., Stanford, Ca.
Krakauer, L.J. (1971), Computer Analysis of Visual Properties of Curved Objects, Ph.D. Thesis, Electrical Engineering Dept., MIT, Cambridge, Mass.
Lehiste, I. (1967), Readings in Acoustic-Phonetics, MIT Press, Cambridge, Mass.
Minsky, M. and S. Papert (1972), Artificial Intelligence, Technical Report, AI Group, MIT, Cambridge, Mass.
Narasimhan, R. (1966), Syntax-Directed Interpretation of Classes of Pictures, CACM, 9, 3, 166–173.
Neely, R.B. (1973), On the Use of Syntax and Semantics in a Speech Understanding System, Ph.D. Thesis, Stanford Univ., to appear as a Technical Report, Computer Science Dept., Carnegie-Mellon Univ., Pittsburgh, Pa.
Newell, A. (1970), Remarks on the Relationship between Artificial Intelligence and Cognitive Psychology, in Banerji and Mesarovic (eds.), Non-Numerical Problem Solving, 363–400, Springer-Verlag.
Newell, A. and H.A. Simon (1972), Human Problem Solving, Prentice-Hall.
Newell, A. et al. (1973), Visualization, unpublished research, Carnegie-Mellon Univ., Pittsburgh, Pa.
Nilsson, N.J. (1969), A Mobile Automaton: An Application of Artificial Intelligence Techniques, Proc. IJCAI, May 7–9, Washington, D.C.
Pierce, J.R. (1969), Whither Speech Recognition, J. Acoust. Soc. Am., 46, 1049–1051.
Reddy, D.R. (1967), Computer Recognition of Connected Speech, J. Acoust. Soc. Am., 42, 2, 329–347.
Reddy, D.R. (1969), On the Use of Environmental, Syntactic, and Probabilistic Constraints in Vision and Speech, AIM 78, Computer Science Dept., Stanford Univ., Stanford, Ca.
Reddy, D.R., L.D. Erman, and R.B. Neely (1972), A Model and A System for Machine Recognition of Speech (to be published in IEEE Trans. on Audio and Electroacoustics, 1973).
Reddy, D.R., W.J. Davis, R.B. Ohlander, and D.J. Bihary (1972a), Computer Analysis of Neuronal Structure, Technical Report, Computer Science Dept., Carnegie-Mellon Univ., Pittsburgh, Pa.
Reddy, D.R., B. Broadley, L. Erman, R. Johnsson, J. Newcomer, G. Robertson, and J. Wright (1972b), XCRIBL, A Hardcopy Scan Line Graphics System for Document Generation, Technical Report, Computer Science Dept., Carnegie-Mellon Univ., Pittsburgh, Pa.
Reddy, D.R., L.D. Erman, R. Fennell, R.B. Neely (1973), The HEARSAY Speech Understanding System, to be published.
Rosenfeld, A. (1969), Picture Processing by Computer, Academic Press, N.Y.
Rosenfeld, A. (1973), Progress in Picture Processing: 1969–71. Computing Surveys 5, in press.
Simon, H.A., and M. Barenfeld (1969), Information Processing Analysis of Perceptual Processes in Problem Solving, Psychological Review, 76, 473–483.
Tenenbaum, J.M. (1970), Accommodation in Computer Vision, Ph.D. Thesis, Computer Science Dept., Stanford Univ., Stanford, Ca.
Vicens, P.J. (1969), Aspects of Speech Recognition by a Computer, Ph.D. Thesis, AIM 85, Computer Science Dept., Stanford Univ., Stanford, Ca.
Walker, D. (1972), Personal Communication, Stanford Research Institute, Menlo Park, Ca.
Winston (1971), Learning Structural Descriptions from Visual Scenes, Ph.D. Thesis, Electrical Engineering Dept., MIT, Cambridge, Mass.
Woods, W. (1972), Personal Communication, Bolt, Beranek and Newman, Cambridge, Mass.
© 1973 Springer-Verlag Berlin · Heidelberg
About this paper
Cite this paper
Reddy, R. (1973). Eyes and Ears for Computers. In: Einsele, T., Giloi, W., Nagel, HH. (eds) NTG/GI Gesellschaft für Informatik Nachrichtentechnische Gesellschaft Fachtagung „Cognitive Verfahren und Systeme“. Lecture Notes in Economics and Mathematical Systems, vol 83. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-80749-7_1
DOI: https://doi.org/10.1007/978-3-642-80749-7_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-06268-4
Online ISBN: 978-3-642-80749-7