Multimedia Tools and Applications, Volume 75, Issue 15, pp 9461–9487

Feature pattern based representation of multimedia documents for efficient knowledge discovery



The rapid growth of multimedia documents has created a strong demand for sophisticated multimedia knowledge discovery systems. Knowledge extraction from such documents relies mainly on the data representation model and the document representation model. Because a multimedia document comprises multimodal multimedia objects, the data representation depends on the modality of each object. Multimodal objects require distinct processing and feature extraction methods, which produce different features with different dimensionalities. Managing multiple types of features is challenging for knowledge extraction tasks. A unified representation of multimedia documents benefits the knowledge extraction process because all objects are represented by the same type of features. An appropriate document representation benefits the overall decision-making process by reducing search time and memory requirements. In this paper, we propose a domain conversion method, the Multimedia to Signal converter (MSC), which represents a multimodal multimedia document in a unified form by converting its multimodal objects into signal objects. We also propose a tree-based approach, the Multimedia Feature Pattern (MFP) tree, for the compact representation of multimedia documents in terms of the features of their multimedia objects. The effectiveness of the proposed framework is evaluated through experiments on four multimodal datasets. Experimental results show that the unified representation of multimedia documents improves document classification accuracy. The MFP tree based representation not only reduces search time and memory requirements but also outperforms competing approaches for search and retrieval of multimedia documents.
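The abstract does not describe how the MFP tree is constructed, so the following is only a generic prefix-tree sketch of the underlying idea: documents whose feature patterns share a prefix can share tree nodes, which is one way a tree can save memory and shorten lookups. The class and method names (`FeaturePatternTree`, `insert`, `find_exact`) and the use of integer feature ids are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One feature in a pattern; children extend the pattern by one feature."""
    children: dict = field(default_factory=dict)  # feature id -> Node
    doc_ids: set = field(default_factory=set)     # documents whose pattern ends here


class FeaturePatternTree:
    """Generic prefix tree over sorted feature-id patterns (hypothetical, not the paper's MFP tree)."""

    def __init__(self):
        self.root = Node()
        self.node_count = 0  # number of feature nodes, excluding the root

    def insert(self, doc_id, features):
        """Store a document under its sorted, de-duplicated feature pattern."""
        node = self.root
        for f in sorted(set(features)):
            if f not in node.children:
                node.children[f] = Node()
                self.node_count += 1
            node = node.children[f]
        node.doc_ids.add(doc_id)

    def find_exact(self, features):
        """Return the ids of documents whose pattern equals the given features."""
        node = self.root
        for f in sorted(set(features)):
            node = node.children.get(f)
            if node is None:
                return set()
        return set(node.doc_ids)


# Three documents with assumed integer feature ids; two share a full pattern.
tree = FeaturePatternTree()
tree.insert("doc1", [3, 1, 7])
tree.insert("doc2", [1, 3, 9])
tree.insert("doc3", [1, 3, 7])
```

In this toy example the three documents contain nine feature occurrences in total, yet the tree holds only four nodes because the prefix `1, 3` is stored once. A lookup walks one path from the root, so its cost depends on the pattern length and not on the number of stored documents.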


Keywords: Multimedia document representation; Domain conversion; Unified representation; Feature pattern; Multimedia search and retrieval



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. National Institute of Technology Karnataka, Surathkal, India
