Integration of Semantics Information and Clustering in Binary-Class Classification for Handling Imbalanced Multimedia Data

  • Chao Chen
  • Mei-Ling Shyu


The data imbalance issue is widely acknowledged as one of the major challenges in classification: the ratio of positive data instances to negative data instances is very small, which is especially common in multimedia data. One solution is to use clustering in binary-class classification to partition the majority class (also called the negative class) into several subsets, each of which is merged with the minority class (also called the positive class) to form a much more balanced subset of the original data set. However, one major drawback of clustering is that constructing the clusters is time-consuming. Because multimedia data (such as video and image data) carry rich semantics, using video semantics (i.e., semantic concepts as class labels) to form the negative subsets can (i) effectively construct several groups whose data instances are semantically related, and (ii) significantly reduce the number of data instances participating in the clustering step. Therefore, this chapter proposes a novel binary-class classification framework that integrates video semantics information with the clustering technique to address the data imbalance issue. Experiments compare the proposed framework with other techniques commonly used to learn from imbalanced data sets. The experimental results on several highly imbalanced video data sets show that the proposed classification framework outperforms these comparative classification approaches by about 3–16 %.
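The subset-construction idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the features, or how the resulting classifiers are combined, so everything below is an assumption made for illustration: plain k-means stands in for the clustering step, and the names `kmeans`, `balanced_subsets`, and `max_group` are hypothetical and not taken from the chapter.

```python
import random
from collections import defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns a cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: mean of each cluster's members (keep old center if empty).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def balanced_subsets(positives, negatives, concepts, max_group=4, seed=0):
    """Return training subsets, each holding all positives plus one negative group.

    Negatives are first grouped by semantic concept label (assumed to be given).
    Only groups larger than `max_group` (a hypothetical threshold) are split
    further with k-means, so fewer instances take part in clustering.
    """
    by_concept = defaultdict(list)
    for x, c in zip(negatives, concepts):
        by_concept[c].append(x)

    groups = []
    for concept in sorted(by_concept):
        members = by_concept[concept]
        if len(members) <= max_group:
            groups.append(members)
            continue
        k = -(-len(members) // max_group)  # ceiling division
        assign = kmeans(members, k, seed=seed)
        for c in range(k):
            cluster = [m for m, a in zip(members, assign) if a == c]
            if cluster:
                groups.append(cluster)

    # Label 1 marks the positive (minority) class, 0 the negative (majority) class.
    return [[(x, 1) for x in positives] + [(x, 0) for x in g] for g in groups]
```

Each returned subset could then be used to train one binary classifier. How the per-subset predictions are fused into a final ranking is left open here, since the abstract does not describe it.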


Keywords (machine-generated, not supplied by the authors): Data instance; Ranking score; Semantic concept; Minority class; Positive class



Copyright information

© Springer-Verlag Wien 2013

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, University of Miami, Coral Gables, USA
