Data Mining and Knowledge Discovery, Volume 22, Issue 1–2, pp 259–290

Detecting and ordering salient regions

  • Larry Shoemaker (corresponding author)
  • Robert E. Banfield
  • Lawrence O. Hall
  • Kevin W. Bowyer
  • W. Philip Kegelmeyer

Abstract

We describe an ensemble approach to learning salient regions from arbitrarily partitioned data. The partitioning comes from the distributed processing requirements of large-scale simulations. The volume of the data is such that classifiers can train only on data local to a given partition. Since the data partition reflects the needs of the simulation, the class statistics can vary from partition to partition. Some classes will likely be missing from some or even most partitions. We combine a fast ensemble learning algorithm with scaled probabilistic majority voting to learn an accurate classifier from such data. Since some simulations are difficult to model without a considerable number of false-positive errors, and since we are essentially building a search engine for simulation data, we order predicted regions to increase the likelihood that most of the top-ranked predictions are correct (salient). Results from simulation runs of a canister being torn and of a casing being dropped show that regions of interest are successfully identified despite the class imbalance in the individual training sets. Lift-curve analysis shows that data-driven ordering methods provide a statistically significant improvement over the default natural time-step ordering. This saves the end user significant time by focusing attention on areas of interest without the need to search all of the data conventionally.
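To make the combination and ordering steps concrete, below is a minimal Python sketch, not the paper's implementation. It trains one random forest per data partition (so each classifier sees only local, possibly skewed data), combines their class probabilities with a vote scaled by each partition's local class prior, and then ranks test regions by salient-class confidence so the regions most likely to be salient come first. The synthetic partitions, the inverse-prior scaling rule, and the precision-at-top-k summary are illustrative assumptions, not the authors' exact formulas.

```python
# Sketch of partition-local training + scaled probabilistic voting + ordering.
# Assumptions: binary labels (1 = salient), inverse-prior vote scaling.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_partition(n, salient_frac):
    """Toy stand-in for one simulation partition with its own class skew;
    a partition with salient_frac == 0.0 is missing the salient class."""
    X = rng.normal(size=(n, 5))
    y = (rng.random(n) < salient_frac).astype(int)
    X[y == 1] += 1.5          # make the salient class roughly separable
    return X, y

partitions = [make_partition(400, f) for f in (0.30, 0.05, 0.0)]

# One classifier per partition, trained only on data local to that partition.
models, priors = [], []
for X, y in partitions:
    clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
    models.append(clf)
    priors.append(np.bincount(y, minlength=2) / len(y))

def scaled_vote(X):
    """Sum per-model class probabilities rescaled by each partition's local
    class prior, so partitions that rarely (or never) saw a class do not
    drown out those that did. The rescaling rule is an assumption here."""
    votes = np.zeros((len(X), 2))
    for clf, p in zip(models, priors):
        proba = np.zeros((len(X), 2))
        proba[:, clf.classes_] = clf.predict_proba(X)  # handles missing classes
        votes += proba / np.maximum(p, 1e-6)
    return votes / votes.sum(axis=1, keepdims=True)

# Order test regions by salient-class confidence, most confident first.
X_test, y_test = make_partition(500, 0.15)
conf = scaled_vote(X_test)[:, 1]
order = np.argsort(-conf)
top_k = 25
print(f"precision among top {top_k} ranked regions: "
      f"{y_test[order[:top_k]].mean():.2f}")
```

Ranking by confidence rather than by natural time-step order is what turns the classifier into a search engine over the simulation: the analyst inspects the top-ranked regions first, and the lift curves discussed in the paper measure how much that ordering helps.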

Keywords

Random forest · Saliency · Probabilistic voting · Imbalanced training data · Lift

Copyright information

© The Author(s) 2010

Authors and Affiliations

  • Larry Shoemaker (1, corresponding author)
  • Robert E. Banfield (1)
  • Lawrence O. Hall (1)
  • Kevin W. Bowyer (2)
  • W. Philip Kegelmeyer (3)

  1. Computer Science and Engineering, University of South Florida, Tampa, USA
  2. Computer Science and Engineering, University of Notre Dame, South Bend, USA
  3. Computer and Information Sciences, Sandia National Labs, Livermore, USA
