A generalized topic modeling approach for automatic document annotation

International Journal on Digital Libraries

Abstract

Ecological and environmental sciences have become more advanced and complex, requiring observational and experimental data from multiple places, times, and thematic scales to verify hypotheses. Over time, such data have increased not only in volume but also in the diversity and heterogeneity of their sources, which are spread throughout the world. This heterogeneity poses a major challenge for scientists, who must manually search for the data they need. ONEMercury has recently been implemented as part of the DataONE project to alleviate this problem and to serve as a portal for accessing environmental and observational data across the globe. ONEMercury harvests metadata records from multiple archives and repositories and makes them searchable. However, harvested metadata records are sometimes poorly annotated or lack meaningful keywords, which can impede effective retrieval. We propose a methodology that learns annotations from well-annotated collections of metadata records in order to automatically annotate poorly annotated ones. The problem is first transformed into a tag recommendation problem with a controlled tag library. Then, two variants of an algorithm for automatic tag recommendation are presented. Experiments on four datasets of environmental science metadata records show that our methods perform well and also shed light on the nature of the different datasets. We also discuss related topics such as using topical coherence to fine-tune parameters and experiments on cross-archive annotation.
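
To make the tag recommendation framing concrete, the following is a minimal sketch and not the authors' implementation. It assumes a TF-IDF document representation with cosine similarity, and it scores each tag in the controlled library by summing the similarity between the unannotated record and every annotated record that carries that tag. The toy records, tags, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a real system would also stem and drop stop words)."""
    return text.lower().split()


def tfidf_vectors(docs, query):
    """Build sparse TF-IDF vectors for the annotated docs and for the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of annotated docs containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    def idf(term):
        # Smoothed IDF, so terms unseen in the annotated docs stay finite.
        return math.log((1 + n) / (1 + df[term])) + 1.0

    def vec(toks):
        return {t: c * idf(t) for t, c in Counter(toks).items()}

    return [vec(t) for t in tokenized], vec(tokenize(query))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recommend_tags(annotated, query, k=3):
    """annotated: list of (text, tags). Return up to k tags ranked by summed similarity."""
    texts = [text for text, _ in annotated]
    doc_vecs, q_vec = tfidf_vectors(texts, query)
    scores = defaultdict(float)
    for (_, tags), d_vec in zip(annotated, doc_vecs):
        sim = cosine(q_vec, d_vec)
        for tag in tags:
            scores[tag] += sim
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, score in ranked[:k] if score > 0]


# Hypothetical well-annotated records: (description, tags from the controlled library).
ANNOTATED = [
    ("global land cover classification from satellite imagery", ["land cover", "remote sensing"]),
    ("soil carbon flux measurements in boreal forest", ["carbon", "soil"]),
    ("phylogenetic tree of flowering plants", ["phylogeny"]),
]
```

With this toy library, `recommend_tags(ANNOTATED, "satellite land cover map")` returns only the tags of the first record, because the query shares no terms with the other two. Restricting candidates to tags already present in the annotated collection is what keeps the output inside the controlled tag library.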

Notes

  1. https://cn.dataone.org/onemercury/.

  2. http://mercury.ornl.gov/.

  3. http://daac.ornl.gov/.

  4. http://datadryad.org/.

  5. http://knb.ecoinformatics.org/index.jsp.

  6. http://treebase.org/treebase-web/home.html.

  7. http://daac.ornl.gov/.

  8. http://datadryad.org/.

  9. http://knb.ecoinformatics.org/index.jsp.

  10. http://treebase.org/treebase-web/home.html.

  11. http://earthdata.nasa.gov/esdis.

  12. http://www.phylofoundation.org/.

  13. http://snowball.tartarus.org/algorithms/english/stemmer.html.

  14. http://alias-i.com/lingpipe/.

  15. http://nlp.stanford.edu/software/tmt/tmt-0.4/.

  16. http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=930.

    Table 3 Comparison of the recommended keywords by the TF-IDF, TM, and KEA (baseline) algorithms on a sample document “ISLSCP II IGBP DISCOVER AND SIB LAND COVER, 1992–1993”

References

  1. AlSumait, L., Barbará, D., Domeniconi, C.: On-line LDA: adaptive topic models for mining text streams with applications to topic detection and tracking. In: IEEE Computer Society ICDM, pp. 3–12 (2008)

  2. Asuncion, A., Welling, M., Smyth, P., Teh, Y.W.: On smoothing and inference for topic models. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pp. 27–34. AUAI Press, Arlington (2009)

  3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  4. Bron, M., Huurnink, B., de Rijke, M.: Linking archives using document enrichment and term selection. In: Proceedings of the 15th International Conference on Theory and Practice of Digital Libraries: Research and Advanced Technology for Digital Libraries, TPDL’11, pp. 360–371 (2011)

  5. Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’04, pp. 25–32. ACM, New York (2004)

  6. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

  7. Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. JASIS 41(6), 391–407 (1990)

  8. Heymann, P., Ramage, D., Garcia-Molina, H.: Social tag prediction. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pp. 531–538. ACM, New York (2008)

  9. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42(1–2), 177–196 (2001)

  10. Huang, W., Kataria, S., Caragea, C., Mitra, P., Giles, C.L., Rokach, L.: Recommending citations: translating papers into references. In: CIKM ’12, pp. 1910–1914. ACM, New York (2012)

  11. Iwata, T., Yamada, T., Sakurai, Y., Ueda, N.: Online multiscale dynamic topic models. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10, pp. 663–672. ACM, New York (2010)

  12. Kataria, S., Mitra, P., Bhatia, S.: Utilizing context in generative Bayesian models for linked corpus. In: AAAI’10, p. 1 (2010)

  13. Krestel, R., Fankhauser, P., Nejdl, W.: Latent Dirichlet allocation for tag recommendation. In: Proceedings of the Third ACM Conference on Recommender Systems, RecSys ’09, pp. 61–68. ACM, New York (2009)

  14. Liu, Z., Chen, X., Sun, M.: A simple word trigger method for social tag suggestion. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pp. 1577–1588. Association for Computational Linguistics, Stroudsburg (2011)

  15. Liu, Z., Huang, W., Zheng, Y., Sun, M.: Automatic keyphrase extraction via topic decomposition. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pp. 366–376. Association for Computational Linguistics, Stroudsburg (2010)

  16. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)

  17. Medelyan, O., Witten, I.H.: Thesaurus based automatic keyphrase indexing. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’06, pp. 296–297. ACM, New York (2006)

  18. Michener, W., Vieglais, D., Vision, T., Kunze, J., Cruse, P., Janée, G.: DataONE: data observation network for earth—preserving data and enabling innovation in the biological and environmental sciences. DLib Mag. 17(1/2), 1–12 (2011)

  19. Mishne, G.: Autotag: a collaborative approach to automated tag assignment for weblog posts. In: Proceedings of the 15th International Conference on World Wide Web, WWW ’06, pp. 953–954. ACM, New York (2006)

  20. Newman, D., Hagedorn, K., Chemudugunta, C., Smyth, P.: Subject metadata enrichment using statistical topic models. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital libraries, JCDL ’07, pp. 366–375. ACM, New York (2007)

  21. Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic coherence. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, pp. 100–108. Association for Computational Linguistics, Stroudsburg (2010)

  22. Newman, D., Smyth, P., Welling, M., Asuncion, A.U.: Distributed inference for latent Dirichlet allocation. In: Advances in Neural Information Processing Systems, pp. 1081–1088 (2007)

  23. Song, Y., Zhuang, Z., Li, H., Zhao, Q., Li, J., Lee, W.-C., Giles, C.L.: Real-time automatic tag recommendation. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pp. 515–522 (2008)

  24. Steyvers, M., Griffiths, T.: Probabilistic topic models. Handb. Latent Semant. Anal. 427(7), 424–440 (2007)

  25. Tuarob, S., Pouchard, L.C., Giles, C.L.: Automatic tag recommendation for metadata annotation using probabilistic topic modeling. In: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’13, pp. 239–248. ACM, New York (2013)

  26. Tuarob, S., Pouchard, L.C., Noy, N., Horsburgh, J.S., Palanisamy, G.: ONEMercury: towards automatic annotation of earth science metadata. In: AGU Fall Meeting Abstracts, vol. 1, p. 1482 (2012)

  27. Tuarob, S., Pouchard, L.C., Noy, N., Horsburgh, J.S., Palanisamy, G.: ONEMercury: towards automatic annotation of environmental science metadata. In: Proceedings of the 2nd International Workshop on Linked Science (2012)

  28. Tuarob, S., Tucker, C.S.: Fad or here to stay: predicting product market adoption and longevity using large scale, social media data. In: Proceedings ASME 2013 International Design Engineering Technical Conference Computers and Information in Engineering Conference, IDETC/CIE ’13 (2013)

  29. Tuarob, S., Tucker, C.S.: Discovering next generation product innovations by identifying lead user preferences expressed through large scale social media data. In: Proceedings ASME 2014 International Design Engineering Technical Conference Computers and Information in Engineering Conference, IDETC/CIE ’14 (2014)

  30. Tuarob, S., Tucker, C.S.: Automated discovery of lead users and latent product features by mining large scale social media networks. J. Mech. Des. (2015, accepted)

  31. Tuarob, S., Tucker, C.S.: Quantifying product favorability and extracting notable product features using large scale social media data. J. Comput. Inf. Sci. Eng. (2015). doi:10.1115/1.4029562

  32. Tuarob, S., Tucker, C.S., Salathe, M., Ram, N.: Discovering health-related knowledge in social media using ensembles of heterogeneous features. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM ’13, pp. 1685–1690. ACM, New York (2013)

  33. Tuarob, S., Tucker, C.S., Salathe, M., Ram, N.: An ensemble heterogeneous classification methodology for discovering health-related knowledge in social media messages. J. Biomed. Inf. 49, 255–268 (2014)

  34. Voorhees, E.M.: The TREC-8 question answering track report. In: Proceedings of TREC-8, pp. 77–82 (1999)

  35. Widdows, D., Ferraro, K.: Semantic vectors: a scalable open source package and online technology management application. In: LREC (2008)

  36. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G.: KEA: practical automatic keyphrase extraction. In: Proceedings of the Fourth ACM Conference on Digital Libraries, DL ’99, pp. 254–255. ACM, New York (1999)

  37. Wu, L., Yang, L., Yu, N., Hua, X.-S.: Learning to tag. In: Proceedings of the 18th International Conference on World Wide Web, WWW ’09, pp. 361–370 (2009)

  38. Zhou, T., Ma, H., Lyu, M., King, I.: UserRec: a user recommendation framework in social tagging systems. In: Proceedings of AAAI, pp. 1486–1491 (2010)

Acknowledgments

We gratefully acknowledge useful comments from Natasha Noy, Jeffery S. Horsburgh, and Giri Palanisamy. This work has been supported in part by the National Science Foundation (DataONE: Grant #OCI-0830944) and Oak Ridge National Laboratory Contract No. DE-AC05-00OR22725 of the US Department of Energy.

Author information

Corresponding author

Correspondence to Suppawong Tuarob.

Additional information

This manuscript is an extension of the authors’ earlier work presented at the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2013) [25].

About this article

Cite this article

Tuarob, S., Pouchard, L.C., Mitra, P. et al. A generalized topic modeling approach for automatic document annotation. Int J Digit Libr 16, 111–128 (2015). https://doi.org/10.1007/s00799-015-0146-2

  • DOI: https://doi.org/10.1007/s00799-015-0146-2
