
Using visual and text features for direct marketing on multimedia messaging services domain


Traditionally, direct marketing companies have relied on pre-testing to select the best offers to send to their audience. Companies systematically dispatch the offers under consideration to a limited sample of potential buyers, rank them with respect to their performance and, based on this ranking, decide which offers to send to the wider population. Though this pre-testing process is simple and widely used, recently the industry has been under increased pressure to further optimize learning, in particular when facing severe time and learning space constraints. The main contribution of the present work is to demonstrate that direct marketing firms can exploit the information on visual content to optimize the learning phase. This paper proposes a two-phase learning strategy based on a cascade of regression methods that takes advantage of the visual and text features to improve and accelerate the learning process. Experiments in the domain of a commercial Multimedia Messaging Service (MMS) show the effectiveness of the proposed methods and a significant improvement over traditional learning techniques. The proposed approach can be used in any multimedia direct marketing domain in which offers comprise both a visual and text component.
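The traditional pre-testing procedure described above (send each candidate offer to a small sample, rank the offers by performance, then send the best ones to the wider population) can be illustrated with a minimal sketch. It is not the authors' system: the function names, offer identifiers, and counts are hypothetical, and performance is assumed to be measured as click-through rate (users who clicked divided by users exposed).

```python
from typing import Dict, List, Tuple


def click_through_rate(clicks: int, exposures: int) -> float:
    """CTR = users who clicked an offer / users exposed to that offer."""
    if exposures <= 0:
        raise ValueError("exposures must be positive")
    return clicks / exposures


def rank_offers(pretest: Dict[str, Tuple[int, int]]) -> List[Tuple[str, float]]:
    """Rank offers by the CTR measured on the pre-test sample, best first."""
    scored = [
        (offer, click_through_rate(clicks, exposures))
        for offer, (clicks, exposures) in pretest.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def select_offers(pretest: Dict[str, Tuple[int, int]], k: int) -> List[str]:
    """Keep the k best-ranked offers for delivery to the wider population."""
    return [offer for offer, _ in rank_offers(pretest)[:k]]


# Hypothetical pre-test results: offer -> (clicks, exposures)
pretest_results = {
    "ringtone": (30, 1000),
    "wallpaper": (12, 1000),
    "game": (45, 1000),
}
best = select_offers(pretest_results, k=2)
```

In this sketch every offer needs its own pre-test sample before it can be ranked, which is the cost the abstract refers to when time and the number of available recipients are limited.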




Notes

  1. Click-through rate, or CTR, is a common measure of success for an advertising campaign targeted at mobile devices. In this paper it is the ratio of the number of users who clicked a specific offer to the total number of users who were exposed to that offer.

  2. For mobile operators, sending commercial messages to their customers is very cost-effective: operators can easily reach millions of potential buyers at little cost, making the profit potential of these advertising-related services very high. In addition, for mobile phone operators, market saturation and fierce competition [23] have turned value added services (VAS), like the ones these commercial messages advertise, into a significant revenue source and in some cases the only opportunity for revenue growth. Because these services are now central to profitability, mobile phone operators and independent production companies are becoming increasingly creative in generating and proposing new services and offers. The result is a rapidly growing set of available services.

  3. In the following section we explain in more detail what commercial mobile multimedia messages are and present several examples.

  4. Given the speed of offer production in our application, even with daily contact (e.g., daily messages sent to mobile phone users), the number of offers to be tested grows faster than a traditional pre-testing system is able to learn (while at the same time keeping enough potential customers for optimized delivery).

  5. The real targeting system could reach millions of users, but large segments of users would have to receive the same message. Only a maximum of 20 messages could be sent daily.

  6. By definition, a holistic cue is one that is processed over the entire human visual field and does not require attention to analyze local features [29].

  7. By using the Bayesian classifier one can infer the presence of faces in an image from the skin appearance in the pixel domain; likewise, an outdoor context can be inferred from sky and/or vegetation appearance [21]. We used these three types of visual information in our system as proposed by [4], describing each image by the percentage of pixels that the Bayesian classifier assigns to each of these appearance classes. The disadvantage of this method is that it requires hand-labeling of a training set.

  8. Taking into account the overall simulation settings, 30 offers per day is an arrival rate comparable to the mean arrival rate observed in the real system.
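The image descriptor mentioned in note 7 (the percentage of pixels assigned to each appearance class) can be illustrated with a short sketch. It assumes that a per-pixel label map has already been produced by some classifier, which is not shown here; the integer class codes, the `CLASSES` table, and the function name are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical integer codes for the per-pixel labels.
CLASSES = {0: "other", 1: "skin", 2: "sky", 3: "vegetation"}


def class_percentages(label_map: List[List[int]]) -> Dict[str, float]:
    """Percentage of pixels in each appearance class (skin, sky, vegetation)."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("label map is empty")
    return {
        name: 100.0 * counts.get(code, 0) / total
        for code, name in CLASSES.items()
        if name != "other"
    }


# Toy 4x4 label map standing in for a classified image.
label_map = [
    [2, 2, 2, 2],
    [2, 2, 1, 1],
    [3, 3, 1, 0],
    [3, 3, 3, 0],
]
features = class_percentages(label_map)
```

The resulting three numbers form a compact, fixed-length description of an image regardless of its size, which is what allows them to be fed to a regression method alongside text features.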


References

  1. Alpaydin E (2004) Introduction to machine learning. MIT, Cambridge

  2. Barnard K, Forsyth DA (2001) Learning the semantics of words and pictures. In: ICCV, Vancouver, 7–14 July 2001, pp 408–415

  3. Battiato S, Farinella GM, Gallo G, Ravì D (2008) Scene categorization using bag of textons on spatial hierarchy. In: International conference on image processing (ICIP), San Diego, 12–15 October 2008

  4. Battiato S, Farinella G, Giuffrida G, Tribulato G (2007) Data mining learning bootstrap through semantic thumbnail analysis. In: SPIE-IS&T 19th annual symposium electronic imaging science and technology 2007: multimedia content access: algorithms and systems, Orlando, 9–13 April 2007

  5. Bergen JR, Julesz B (1983) Rapid discrimination of visual patterns. IEEE Trans Syst Man Cybern 13:857–863

  6. Biederman I (1987) Recognition by components: a theory of human image interpretation. Psychol Rev 94:115–148

  7. Biederman I, Mezzanotte R, Rabinowitz J (1982) Scene perception: detecting and judging objects undergoing relational violations. Cogn Psychol 14:143–177

  8. Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth and Brooks, Monterey

  9. Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm

  10. Cleveland WS, Devlin SJ, Grosse E (1988) Regression by local fitting: methods, properties, and computational algorithms. J Econom 37(1):87–114

  11. Comaniciu D, Ramesh V, Meer P (2003) Kernel-based object tracking. IEEE Trans Pattern Anal Mach Intell 25(5):564–575

  12. Direct Marketing Association (2007) The power of direct marketing: ROI, sales, expenditures and employment in the U.S., 2006–2007 edn. Direct Marketing Association, Washington, DC

  13. Perronnin F (2008) Universal and adapted vocabularies for generic visual categorization. IEEE Trans Pattern Anal Mach Intell 30(7):1243–1256

  14. Hull D (1996) Stemming algorithms: a case study for detailed evaluation. J Am Soc Inf Sci 47:70–84

  15. Julesz B (1981) Textons, the elements of texture perception, and their interactions. Nature 290:91–97

  16. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE conference on computer vision and pattern recognition, vol II. IEEE, Piscataway, pp 2169–2178

  17. Li FF, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: CVPR ’05: Proceedings of the 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol 2. IEEE Computer Society, Los Alamitos, pp 524–531

  18. Lim JH (1999) Categorizing visual contents by matching visual “keywords”. In: VISUAL, pp 367–374

  19. Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A (2008) Discriminative learned dictionaries for local image analysis. In: IEEE conference on computer vision and pattern recognition

  20. Moosmann F, Triggs B, Jurie F (2007) Fast discriminative visual codebooks using randomized clustering forests. In: Schölkopf B, Platt J, Hoffman T (eds) Advances in neural information processing systems, vol 19. MIT, Cambridge, pp 985–992

  21. Naccari F, Battiato S, Bruna A, Capra A, Castorina A (2005) Natural scene classification for color enhancement. IEEE Trans Consum Electron 5:234–239

  22. Nash E (2000) Direct marketing. McGraw-Hill, New York

  23. Netsize (2007) Convergence: everything is going mobile. The Netsize Guide 2007. Netsize, Levallois Perret

  24. Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42:145–175

  25. Oren N (2002) Reexamining tf.idf based information retrieval with genetic programming. In: SAICSIT 2002, South African Institute for Computer Scientists and Information Technologists, Republic of South Africa, pp 224–234

  26. Oza NC (2005) Online bagging and boosting. In: Systems, man and cybernetics, 2005 IEEE international conference on. IEEE, Piscataway, pp 2340–2345

  27. Potter M (1975) Meaning in visual search. Science 187:965–966

  28. Prinzie A, Van Den Poel D (2005) Constrained optimization of data-mining problems to improve model performance: a direct-marketing application. Expert Syst Appl 29(3):630–640

  29. Renninger LW, Malik J (2004) When is scene recognition just texture recognition? Vis Res 44:2301–2311

  30. Roberts M, Berger PD (1989) Direct marketing management. Prentice-Hall, New York

  31. Schapire R (2001) The boosting approach to machine learning: an overview. Kluwer, Boston

  32. Schölkopf B, Smola AJ, Williamson RC, Bartlett PL (2000) New support vector algorithms. Neural Comput 12(5):1207–1245

  33. Shawe-Taylor J, Cristianini N (2000) Support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge

  34. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: Proceedings of the international conference on computer vision, vol 2. IEEE, Piscataway, pp 1470–1477

  35. Taylor P, Caley R, Black AW, King S (1999) Wagon, Edinburgh Speech Tools Library

  36. Varma M, Zisserman A (2005) A statistical approach to texture classification from single images. Int J Comput Vis 62(1–2):61–81

  37. Winn J, Criminisi A, Minka T (2005) Object categorization by learned universal visual dictionary. In: ICCV ’05: proceedings of the tenth IEEE international conference on computer vision. IEEE Computer Society, Washington, DC, pp 1800–1807

  38. Yang J, Jiang YG, Hauptmann AG, Ngo CW (2007) Evaluating bag-of-visual-words representations in scene classification. In: MIR ’07: proceedings of the international workshop on multimedia information retrieval. ACM, New York, pp 197–206



The authors would like to thank Daniele Ravì for helping in the implementation of the simulation studies. The authors would also like to thank Neodata Group for giving access to the mobile messaging dataset, and for helping in the implementation and testing of the proposed approach.

Author information

Correspondence to Sebastiano Battiato.


About this article

Cite this article

Battiato, S., Farinella, G.M., Giuffrida, G. et al. Using visual and text features for direct marketing on multimedia messaging services domain. Multimed Tools Appl 42, 5–30 (2009). https://doi.org/10.1007/s11042-008-0250-z



  • Visual and text features
  • Learning in time and space constrained domains
  • Multimedia messaging services
  • Direct marketing