Abstract
A large part of scientific knowledge is confined to the text of publications. An algorithm is presented for distinguishing those pieces of information that can be predicted from the text of publication abstracts from those for which successes in prediction are spurious. The significance of relationships between textual data and information that is represented in standardized ontologies and protein domains is evaluated using a density-based approach. The approach also integrates a weighting system to account for many-to-many relationships between the abstracts and the genes they represent, as well as between genes and the items that describe them. We evaluate the approach using data from the model species yeast, and show that our results are in better agreement with biological expectations than those of a comparison algorithm.
Supported by the National Science Foundation under Grant No. IDM-0415190.
Acknowledgments
This material is based upon work supported by the National Science Foundation under Grant No. IDM-0415190.
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Al-Azzam, O., Wu, J., Al-Nimer, L., Chitraranjan, C., Denton, A.M. (2014). A Weighted Density-Based Approach for Identifying Standardized Items that are Significantly Related to the Biological Literature. In: Yada, K. (eds) Data Mining for Service. Studies in Big Data, vol 3. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45252-9_6
DOI: https://doi.org/10.1007/978-3-642-45252-9_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45251-2
Online ISBN: 978-3-642-45252-9
eBook Packages: Engineering, Engineering (R0)