A Weighted Density-Based Approach for Identifying Standardized Items that are Significantly Related to the Biological Literature

Data Mining for Service

Part of the book series: Studies in Big Data (SBD, volume 3)

Abstract

A large part of scientific knowledge is confined to the text of publications. An algorithm is presented for distinguishing those pieces of information that can be predicted from the text of publication abstracts from those for which successes in prediction are spurious. The significance of relationships between textual data and information that is represented in standardized ontologies and protein domains is evaluated using a density-based approach. The approach also integrates a weighting system to account for many-to-many relationships between the abstracts and the genes they represent, as well as between genes and the items that describe them. We evaluate the approach using data from the model species yeast, and show that our results are in better agreement with biological expectations than those of a comparison algorithm.

Supported by the National Science Foundation under Grant No. IDM-0415190.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. IDM-0415190.

Corresponding author

Correspondence to Omar Al-Azzam.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Al-Azzam, O., Wu, J., Al-Nimer, L., Chitraranjan, C., Denton, A.M. (2014). A Weighted Density-Based Approach for Identifying Standardized Items that are Significantly Related to the Biological Literature. In: Yada, K. (eds) Data Mining for Service. Studies in Big Data, vol 3. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45252-9_6

  • DOI: https://doi.org/10.1007/978-3-642-45252-9_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-45251-2

  • Online ISBN: 978-3-642-45252-9

  • eBook Packages: Engineering, Engineering (R0)
