Massively parallel symbolic induction of protein structure/function relationships

  • Richard H. Lathrop
  • Teresa A. Webster
  • Temple F. Smith
  • Patrick H. Winston
Part of the Lecture Notes in Computer Science book series (LNCS, volume 661)


We have described a running system that embodies efficient parallel implementations of several symbolic machine learning induction operators. It functions as an “Induction Assistant” to a domain expert. First, we developed an efficient, noise-tolerant, similarity-based parallel matching algorithm, which should also apply to other graph-based representations of domains possessing an embedding in which the low-level features (relations or groupings) are mostly local. Next, the matching algorithm was used as infrastructure to construct the parallel implementations of the induction operators. Finally, the induction operators were sandwiched together with sets of filters (both syntactic and empirical) to compose a crude form of induction scripts, which are invoked by a domain expert. The matching algorithm has very attractive scaling properties as the size of the problem and/or the number of processors increases, and hardware usage is efficient. The results reported in this article were obtained on an 8K CM-2 Connection Machine. The implemented system was used to discover something previously unknown to the domain expert [47].

For us, the key contribution of this work is its demonstration of the scalability of the algorithms involved: the time complexity of every algorithm reported here is nearly independent of the size of the data, provided sufficient parallel hardware is available (subject to the discussion of the instance-ID and characteristic-set bit-vectors).
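The bit-vector caveat can be illustrated with a small sketch. It assumes, as the name suggests, that a pattern's characteristic set records which instance IDs the pattern matches, with one bit per instance. The function names below are hypothetical, and ordinary Python integers stand in for the parallel hardware; this is not the authors' implementation.

```python
from typing import Iterable, Tuple


def characteristic_set(matches: Iterable[bool]) -> int:
    """Pack per-instance match results into one bit-vector (bit i = instance ID i)."""
    bits = 0
    for instance_id, matched in enumerate(matches):
        if matched:
            bits |= 1 << instance_id
    return bits


def coverage(char_set: int, positives: int, negatives: int) -> Tuple[int, int]:
    """Count how many positive and negative instances a pattern matches."""
    return bin(char_set & positives).count("1"), bin(char_set & negatives).count("1")


def conjoin(a: int, b: int) -> int:
    """Characteristic set of the conjunction of two patterns (assumed: set intersection)."""
    return a & b


# Hypothetical data: instance IDs 0-2 are positive, 3-5 are negative.
positives = 0b000111
negatives = 0b111000
pattern_a = characteristic_set([True, True, True, True, False, False])
pattern_b = characteristic_set([True, True, False, False, False, True])
both = conjoin(pattern_a, pattern_b)
```

Under this assumption, each bit-vector holds one bit per instance, so its length grows with the data set even when the matching itself is spread across processors.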


Domain Expert; Induction Operator; Negative Instance; Pattern Space; Parallel Hardware
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




References

  1. Abarbanel, R. M. (1984), Protein Structural Knowledge Engineering, Ph.D. thesis, University of California, San Francisco.
  2. Bradley, M., T. Smith, R. Lathrop, D. Livingston, and T. Webster (1987), “Consensus Topography in the ATP Binding Site of the Simian Virus 40 and Polyomavirus Large Tumor Antigens,” Proc. Natl. Acad. Sciences USA, 84:4026–4030.
  3. Cohen, F. E., R. M. Abarbanel, I. D. Kuntz, and R. J. Fletterick (1986), “Turn Prediction in Proteins Using a Pattern-Matching Approach,” Biochemistry, 25:266–275.
  4. Cohen, F. E., and I. D. Kuntz (1989), “Tertiary Structure Predictions,” in Prediction of Protein Structure and the Principles of Protein Conformation, G. D. Fasman (ed.), Plenum Press, New York, pp. 647–706.
  5. Collins, J. F., and A. F. Coulson (1984), “Applications of Parallel Processing Algorithms for DNA Sequence Analysis,” Nucl. Acids Res., 12:181–192.
  6. Drescher, G. L. (1989), A Mechanism for Early Piagetian Learning, Ph.D. thesis, Massachusetts Institute of Technology, Cambridge.
  7. Farmer, J. and N. Packard (1986), “The Immune System, Adaptation, and Machine Learning,” Physica, 22D:187–204.
  8. Figge, J., T. Webster, T. Smith, and E. Paucha (1988), “Prediction of Similar Transforming Region in Simian Virus 40 Large T, Adenovirus E1A, and myc Oncoproteins,” J. Virology, 62(5):1814–1818.
  9. Figge, J., and T. Smith (1988), “Cell-Division Sequence Motif,” Nature, 334:109.
  10. Friedland, P., and L. Kedes (1985), “Discovering the Secrets of DNA,” Computer, 18(11):49–69.
  11. Friedrichs, M., and P. Wolynes (1989), “Toward Protein Tertiary Structure Recognition by Means of Associative Memory Hamiltonians,” Science, 246:371–373.
  12. Gascuel, O., and A. Danchin (1986), “Protein Export in Prokaryotes and Eukaryotes: Indications of a Difference in the Mechanism of Exportation,” J. Mol. Evol., 24:130–142.
  13. Goldsborough, M. D., D. DiSilvestre, G. F. Temple, and A. T. Lorincz (1989), “Nucleotide Sequence of Human Papilloma Virus Type 31: A Cervical Neoplasia-Associated Virus,” Virology, 171:306–311.
  14. Hayes-Roth, B., et al. (1986), “PROTEAN: Deriving Protein Structure from Constraints,” in Proc. Fifth Natl. Conf. on Artificial Intelligence, pp. 904–909.
  15. Hillis, W. D. (1986), The Connection Machine, MIT Press, Cambridge, MA.
  16. Holland, J., K. Holyoak, R. Nisbett, and P. Thagard (1986), Induction: Processes of Inference, Learning, and Discovery, MIT Press, Cambridge, MA, USA.
  17. Holley, L. H. and M. Karplus (1989), “Protein Structure Prediction With a Neural Network,” Proc. Natl. Acad. Sciences USA, 86:152–156.
  18. Hunter, L. E. (1989), Knowledge Acquisition Planning: Gaining Expertise Through Experience, Ph.D. thesis, Yale University.
  19. Karp, P. and P. Friedland (1989), “Coordinating the Use of Qualitative and Quantitative Knowledge in Declarative Device Modeling,” in Widman, L. E., D. H. Helman, and K. Loparo (eds.), Artificial Intelligence, Modeling and Simulation, John Wiley and Sons.
  20. Koile, K. and C. Overton (1989), “A Qualitative Model for Gene Expression,” Proc. 1989 Summer Computer Simulation Conf., Soc. for Computer Simulation.
  21. Kolata, G. (1986), “Trying to Crack the Second Half of the Genetic Code,” Science, 233:1037–1039.
  22. Lander, E., J. Mesirov, and W. Taylor (1988), “Study of Protein Sequence Comparison Metrics on the Connection Machine CM-2,” Proc. Supercomputing'88.
  23. Lathrop, R. H. (1990), Efficient Methods For Massively Parallel Symbolic Induction: Algorithms and Implementation, Ph.D. thesis, Massachusetts Institute of Technology.
  24. Lathrop, R. H., T. A. Webster, and T. F. Smith (1987a), “ARIADNE: Pattern-Directed Inference and Hierarchical Abstraction in Protein Structure Recognition,” Comm. of the ACM, 30(11):909–921.
  25. Maryanski, F. J., and T. L. Booth (1977), “Inference of Finite-State Probabilistic Grammars,” IEEE Trans. on Computers, C-26(6):521–536.
  26. Michalski, R. S., J. G. Carbonell, and T. M. Mitchell (1983), (eds.) Machine Learning: An Artificial Intelligence Approach, (first in a series), Tioga Press, Palo Alto, CA.
  27. Minsky, M. (1986), The Society of Mind, Simon and Schuster.
  28. Mitchell, T. M. (1977), “Version Spaces: A Candidate Elimination Approach to Rule Learning,” Proc. Fifth Intl. Joint Conf. on Artificial Intelligence, Cambridge, MA, pp. 305–310.
  29. Qian, N. and T. Sejnowski (1988), “Predicting the Secondary Structure of Globular Proteins Using Neural Network Models,” J. Mol. Biol., 202:865–884.
  30. Quinlan, J. R., and R. L. Rivest (1989), “Inferring Decision Trees Using the Minimum Description Length Principle,” Information and Computation, March, 80(3):227–248.
  31. Richardson, J. (1981), “The Anatomy and Taxonomy of Protein Structure,” Advances in Protein Chemistry, 34:167–339.
  32. Rumelhart et al. (1986), Parallel Distributed Processing, volume 1, MIT Press, Cambridge, MA.
  33. Sankoff, D. and J. B. Kruskal (1983), (eds.) Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA, USA.
  34. Searls, D. B. (1988), “Representing Genetic Information with Formal Grammars,” in Proc. of the Seventh Natl. Conf. on Artificial Intelligence, pp. 386–391.
  35. Smith, R. F. and T. F. Smith (1990), “Automatic Generation of Primary Sequence Patterns from Sets of Related Protein Sequences,” Proc. Natl. Acad. Sci. USA, 87:118–122, Jan.
  36. Smith, T. F. and M. S. Waterman (1981), “Identification of Common Molecular Subsequences,” J. Mol. Biol., 147:195–197.
  37. Steele, G. L. (1984), Common LISP: The Manual, Digital Press, Billerica, MA, USA.
  38. Tambe, M., D. Kalp, A. Gupta, C. Forgy, B. Milnes, and A. Newell (1988), “Soar/PSM-E: Investigating Match Parallelism in a Learning Production System,” Proc. Parallel Programming Environments Applications Languages and Systems.
  39. Taylor, W. (1987), “Identification of Protein Sequence Homology by Consensus Template Alignment,” J. Mol. Biol., 188:233–258.
  40. Thinking Machines Corp. (1988), Paris Reference Manual, Cambridge, MA, USA.
  41. Valiant, L. G. (1984), “A Theory of the Learnable,” Comm. of the ACM, 27(11):1134–1142.
  42. Vitter, J. S. and J. H. Lin (1988), “Learning in Parallel,” in Proc. 1988 Workshop on Computational Learning Theory (COLT'88), pp. 106–124, ed. D. Haussler and L. Pitt.
  43. Waterman, M. S. (1984), “General Methods of Sequence Comparison,” Bull. of Math. Biol., 46:473–500.
  44. Webster, T. A., R. H. Lathrop, and T. F. Smith (1987), “Prediction of a Common Structural Domain in Aminoacyl-tRNA Synthetases Through Use of a New Pattern-Directed Inference System,” Biochemistry, 26:6950–6957.
  45. Webster, T. A., R. H. Lathrop, and T. F. Smith (1988), “Pattern Descriptors and the Unidentified Reading Frame 6 Human mtDNA Dinucleotide-Binding Site,” Proteins, 3(2):97–101.
  46. Webster, T. A., R. Patarca, R. H. Lathrop, and T. F. Smith (1989), “Potential Structural Motifs in Reverse Transcriptases,” Mol. Biol. Evol., 6(3):317–320.
  47. Webster, T. A., R. H. Lathrop, P. H. Winston, and T. F. Smith (1990), “DNA- and RNA-directed DNA Polymerase Common Structural Motif,” (submitted).
  48. Winston, P. H., T. O. Binford, B. Katz, and M. Lowry (1983), “Learning Physical Descriptions from Functional Descriptions, Examples, and Precedents,” in Proc. of the Natl. Conf. on Artificial Intelligence, (Washington, D. C., Aug. 22–26), William Kaufmann, Los Altos, CA, pp. 433–439.
  49. Winston, P. H. (1984), Artificial Intelligence, 2nd ed., Addison-Wesley, Reading, MA, USA.
  50. Winston, P. H., and Rao, S. (1990), “Repairing Learned Knowledge using Experience,” in Artificial Intelligence at MIT: Expanding Frontiers, edited by Patrick H. Winston with Sarah A. Shellard, MIT Press, Cambridge, MA, in press.
  51. Zhang, X., D. Waltz, and J. Mesirov (1989), “Protein Structure Prediction by a Data-level Parallel Algorithm,” Proc. Supercomputing'89, Nov. 13–17, Reno, NV, USA, pp. 215–223.
  52. Zhu, Q., T. F. Smith, R. H. Lathrop, and J. Figge (1990), “The Acid Helix-Turn Activator Motif,” Proteins, 8:156–163.

Copyright information

© Springer-Verlag Berlin Heidelberg 1993

Authors and Affiliations

  • Richard H. Lathrop (1)
  • Teresa A. Webster (2)
  • Temple F. Smith (3)
  • Patrick H. Winston (1)

  1. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge
  2. Computational Development, ARRIS Pharmaceutical Corporation, South San Francisco
  3. Molecular Biology Computer Research Resource, Dana Farber Cancer Institute, Harvard School of Public Health, Boston
