Content Extraction from Marketing Flyers

  • Ignazio Gallo
  • Alessandro Zamberletti
  • Lucia Noce
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9256)

Abstract

The rise of online shopping has hurt physical retailers, which struggle to persuade customers to buy products in stores rather than online. Marketing flyers are a great means of increasing the visibility of physical retailers, but the unstructured offers in those documents cannot easily be compared with similar online deals, making it hard for a customer to judge whether it is more convenient to order a product online or to buy it from the physical shop. In this work we tackle this problem by introducing a content extraction algorithm that automatically extracts structured data from flyers. Unlike competing approaches, which mainly focus on textual content or simply analyze font type, color and text positioning, we propose novel and more advanced visual features that capture the properties of the graphic elements typically used in marketing materials to draw readers' attention to specific deals, obtaining excellent results and a high degree of language and genre independence.

Keywords

Content extraction, Portable document format, Visual features, Marketing flyers

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Ignazio Gallo (1), email author
  • Alessandro Zamberletti (1)
  • Lucia Noce (1)
  1. Department of Theoretical and Applied Science, University of Insubria, Varese, Italy
