Machine Learning, Volume 80, Issue 1, pp 1–31

A process for predicting manhole events in Manhattan

  • Cynthia Rudin
  • Rebecca J. Passonneau
  • Axinia Radeva
  • Haimonti Dutta
  • Steve Ierome
  • Delfina Isaac


We present a knowledge discovery and data mining process developed as part of the Columbia/Con Edison project on manhole event prediction. This process can assist with real-world prioritization problems that involve raw data in the form of noisy documents requiring significant amounts of pre-processing. The documents are linked to a set of instances to be ranked according to prediction criteria. In the case of manhole event prediction, which is a new application for machine learning, the goal is to rank the electrical grid structures in Manhattan (manholes and service boxes) according to their vulnerability to serious manhole events such as fires, explosions and smoking manholes. Our ranking results are currently being used to help prioritize repair work on the Manhattan electrical grid.


Manhole events, Applications of machine learning, Ranking, Knowledge discovery



Copyright information

© The Author(s) 2010

Authors and Affiliations

  • Cynthia Rudin (1)
  • Rebecca J. Passonneau (2)
  • Axinia Radeva (2)
  • Haimonti Dutta (2)
  • Steve Ierome (3)
  • Delfina Isaac (3)

  1. MIT Sloan School of Management, E53-323, Massachusetts Institute of Technology, Cambridge, USA
  2. Center for Computational Learning Systems, Columbia University, New York, USA
  3. Consolidated Edison Company of New York, New York, USA
