Multimedia Tools and Applications, Volume 68, Issue 3, pp 641–657

Enhanced and hierarchical structure algorithm for data imbalance problem in semantic extraction under massive video dataset

  • Zan Gao
  • Long-fei Zhang
  • Ming-yu Chen
  • Alexander Hauptmann
  • Hua Zhang
  • An-Ni Cai


Data imbalance is common in real-world datasets, especially in massive video datasets. Traditional machine learning algorithms, however, assume a balanced data distribution and equal misclassification costs, so they struggle to describe the true data distribution accurately, which leads to misclassification. This paper studies the data imbalance problem in semantic extraction from massive video datasets and proposes an enhanced and hierarchical structure (EHS) algorithm. The proposed algorithm integrates data sampling, filtering, and model training compactly in a hierarchical structure, so that model performance improves step by step and remains robust and stable as features and datasets change. Experiments on TRECVID2010 Semantic Indexing demonstrate that the proposed algorithm performs much better than traditional machine learning algorithms and stays stable and robust when different kinds of features are employed. Extended experiments on TRECVID2010 Surveillance Event Detection further show that the EHS algorithm is efficient and effective, reaching top performance in four of seven events.
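The abstract names three ingredients (sampling, filtering, and model training arranged hierarchically) without giving details. The sketch below is not the EHS algorithm. It is a generic cascade written only to illustrate how such stages could fit together. It assumes that each level under-samples the negatives to the size of the positive set, trains a model, and passes on only the negatives that model fails to reject. The nearest-centroid base learner, the function names (`train_level`, `train_cascade`, `predict`), and the toy data are all hypothetical stand-ins.

```python
import random


def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_level(pos, neg_sample):
    """Toy base learner (assumption): score > 0 means x is closer to the
    positive centroid than to the sampled-negative centroid."""
    cp, cn = centroid(pos), centroid(neg_sample)
    return lambda x: sq_dist(x, cn) - sq_dist(x, cp)


def train_cascade(pos, neg, levels=3, seed=0):
    """Train up to `levels` models, each on a balanced subset."""
    rng = random.Random(seed)
    models, remaining = [], list(neg)
    for _ in range(levels):
        if not remaining:
            break
        # Sampling: draw at most as many negatives as there are positives.
        k = min(len(pos), len(remaining))
        model = train_level(pos, rng.sample(remaining, k))
        models.append(model)
        # Filtering: drop negatives this level already rejects, so the
        # next level trains only on the harder ones.
        remaining = [x for x in remaining if model(x) > 0]
    return models


def predict(models, x):
    """Label x positive only if every level accepts it."""
    return all(m(x) > 0 for m in models)


# Toy imbalanced data: 5 positives, 200 easy negatives, 3 hard negatives.
pos = [[5.0 + d, 5.0 - d] for d in (-0.2, -0.1, 0.0, 0.1, 0.2)]
easy_neg = [[(i % 10) / 10.0, (i // 10 % 10) / 10.0] for i in range(200)]
hard_neg = [[4.0, 4.0], [4.2, 3.8], [3.8, 4.2]]
models = train_cascade(pos, easy_neg + hard_neg)
```

On this toy data the first level discards the easy negatives and the second level is trained on the hard ones alone, which is the step-by-step refinement the sketch is meant to show. A real system would replace the centroid rule with a stronger classifier and tune the acceptance threshold per level.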


Data imbalance; Enhanced and hierarchical structure (EHS); Semantic indexing; Surveillance event detection; Massive video dataset; TRECVID



This material is based in part upon work supported by the National Science Foundation under Grants No. 0624236 and 0751185. Zan Gao is partially supported by the NSFC (No. 90920001) and the Key Project in the Science and Technology Pillar Program of Tianjin, P.R. China (10ZCKFGX00400). We also thank the anonymous reviewers for their valuable suggestions.



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Zan Gao (1, 2, 4, 5)
  • Long-fei Zhang (3, 4)
  • Ming-yu Chen (4)
  • Alexander Hauptmann (4)
  • Hua Zhang (1, 2)
  • An-Ni Cai (5)

  1. Key Laboratory of Computer Vision and System, Tianjin University of Technology, Ministry of Education, Tianjin, People’s Republic of China
  2. Tianjin Key Laboratory of Intelligence Computing and Novel Software Technology, Tianjin University of Technology, Tianjin, People’s Republic of China
  3. School of Software, Beijing Institute of Technology, Beijing, People’s Republic of China
  4. School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
  5. School of Information and Telecommunication Engineering, Beijing University of Posts and Telecommunications, Beijing, People’s Republic of China
