Detecting and Dismantling Composite Visualizations in the Scientific Literature

  • Po-Shen Lee
  • Bill Howe
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9493)

Abstract

We are analyzing the visualizations in the scientific literature to enhance search services, detect plagiarism, and study bibliometrics. An immediate problem is the ubiquitous use of multi-part figures: single images with multiple embedded sub-visualizations. Such figures account for approximately 35% of the figures in the scientific literature. Conventional image segmentation techniques and other existing approaches have been shown to be ineffective for parsing visualizations. We propose an algorithm to automatically recognize multi-chart visualizations and segment them into a set of single-chart visualizations, thereby enabling downstream analysis. Our approach first splits an image into fragments based on background color and layout patterns. An SVM-based binary classifier then distinguishes complete charts from auxiliary fragments such as labels, ticks, and legends, achieving an average accuracy of 98.1%. Next, we recursively merge fragments to reconstruct complete visualizations. Finally, a scoring function chooses between alternative merge trees. For multi-chart figure detection, we use the output of the splitting algorithm as image features to train a classifier, which avoids the cost of running the complete decomposition algorithm merely to determine whether a figure is a multi-chart visualization. To evaluate our approach, we randomly collected 880 single-chart and 1067 multi-chart scientific figures from the PubMed database. For detection, we achieve 90.2% accuracy via 10-fold cross-validation on the entire corpus. To evaluate the decomposition algorithm, we randomly extracted 261 multi-chart figures as a test set. Our algorithm achieves 80% recall and 85% precision of perfect extractions for the common case of eight or fewer sub-figures per figure. Further, even imperfect extractions are shown to be sufficient for most chart classification and reasoning tasks associated with bibliometrics and academic search applications.
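
To make the first step concrete, the following is a minimal sketch of splitting an image into fragments along background-colored gaps. It is not the authors' implementation: it substitutes a generic recursive XY-cut, and it assumes a grayscale image stored as nested lists with a uniform background value. The names `content_runs` and `xy_cut` and the `min_gap` threshold are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale pixels, row-major (assumed format)
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def content_runs(is_blank: List[bool], min_gap: int) -> List[Tuple[int, int]]:
    """Return [start, end) runs of non-blank lines.

    Runs are separated only where at least `min_gap` consecutive blank
    lines occur; leading and trailing blank lines are trimmed.
    """
    runs = []
    start = None
    gap = 0
    for i, blank in enumerate(is_blank):
        if not blank:
            if start is None:
                start = i
            elif gap >= min_gap:
                runs.append((start, i - gap))
                start = i
            gap = 0
        elif start is not None:
            gap += 1
    if start is not None:
        runs.append((start, len(is_blank) - gap))
    return runs


def xy_cut(img: Image, box: Box, background: int = 255, min_gap: int = 2,
           by_rows: bool = True, stuck: bool = False) -> List[Box]:
    """Recursively split `box` along background-only rows, then columns.

    Recursion stops once neither direction yields a further split.
    """
    top, left, bottom, right = box
    if by_rows:
        blank = [all(img[r][c] == background for c in range(left, right))
                 for r in range(top, bottom)]
        boxes = [(top + s, left, top + e, right)
                 for s, e in content_runs(blank, min_gap)]
    else:
        blank = [all(img[r][c] == background for r in range(top, bottom))
                 for c in range(left, right)]
        boxes = [(top, left + s, bottom, left + e)
                 for s, e in content_runs(blank, min_gap)]
    if len(boxes) == 1 and stuck:
        return boxes  # no split in either direction: this is a fragment
    fragments = []
    for sub in boxes:
        fragments.extend(xy_cut(img, sub, background, min_gap,
                                not by_rows, stuck=(len(boxes) == 1)))
    return fragments


# Toy 6x9 image: two dark blocks separated by a three-column white gap.
img = [[255] * 9 for _ in range(6)]
for r in range(1, 5):
    for c in range(1, 3):
        img[r][c] = 0
for r in range(2, 4):
    for c in range(6, 8):
        img[r][c] = 0

print(xy_cut(img, (0, 0, 6, 9)))  # [(1, 1, 5, 3), (2, 6, 4, 8)]
```

In this sketch every returned box is only a candidate fragment. Deciding whether a fragment is a complete chart or an auxiliary piece such as a label or legend, and merging fragments back into whole sub-figures, are the later steps described in the abstract and are not covered here.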

Keywords

Visualization · Multi-chart figure · Chart segmentation · Chart recognition and understanding · Scientific literature retrieval · Content-based image retrieval

Notes

Acknowledgements

The authors wish to thank the authors of the papers from which we drew the examples in this paper. This work is sponsored in part by the National Science Foundation through S2I2 award 1216879 and IIS award III-1064505, the University of Washington eScience Institute, and an award from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Department of Electrical Engineering, University of Washington, Seattle, USA
  2. Department of Computer Science and Engineering, University of Washington, Seattle, USA
