A process for predicting manhole events in Manhattan


We present a knowledge discovery and data mining process developed as part of the Columbia/Con Edison project on manhole event prediction. This process can assist with real-world prioritization problems that involve raw data in the form of noisy documents requiring significant amounts of pre-processing. The documents are linked to a set of instances to be ranked according to prediction criteria. In the case of manhole event prediction, which is a new application for machine learning, the goal is to rank the electrical grid structures in Manhattan (manholes and service boxes) according to their vulnerability to serious manhole events such as fires, explosions and smoking manholes. Our ranking results are currently being used to help prioritize repair work on the Manhattan electrical grid.


Author information



Corresponding author

Correspondence to Cynthia Rudin.

Additional information

This work was done while Cynthia Rudin was at the Center for Computational Learning Systems at Columbia University.

Editor: Carla Brodley.

About this article

Cite this article

Rudin, C., Passonneau, R.J., Radeva, A. et al. A process for predicting manhole events in Manhattan. Mach Learn 80, 1–31 (2010).


  • Manhole events
  • Applications of machine learning
  • Ranking
  • Knowledge discovery