Text Mining with the Stanford CoreNLP


Abstract

Text mining techniques have been widely employed to analyze texts ranging from large-scale social media content to scientific publications and patents. As a bibliographic analysis tool, text mining offers the opportunity for large-scale topical analysis of papers covering an entire domain, country, institution, or specific journal. For this project, we chose the Stanford CoreNLP parser for its extensibility and rich functionality, which can be applied to bibliometric research. The current version includes a suite of processing tools that take raw English-language text as input and output a complete textual analysis and linguistic annotation suitable for higher-level textual analysis. The data for this project comprise the titles and abstracts of all articles published in the Journal of the American Society for Information Science and Technology (JASIST) in 2012 (n = 177). Our process provides an overview of the concepts represented in the journal that year and highlights the most frequent concepts to establish an overall trend for the year.
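
The frequency step mentioned in the abstract can be illustrated with a minimal sketch. This is not the chapter's pipeline: it swaps the Stanford CoreNLP annotators for a plain regular-expression tokenizer, and the names `STOPWORDS`, `tokenize`, `top_terms`, and the sample records are all hypothetical, chosen only to show how terms from titles and abstracts might be counted.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration; a real study would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "on", "with", "by", "is", "are"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def top_terms(records, k=3):
    """Count term frequency over (title, abstract) pairs and return the k most common."""
    counts = Counter()
    for title, abstract in records:
        counts.update(tokenize(title + " " + abstract))
    return counts.most_common(k)


# Made-up sample records, not data from the study.
sample = [
    ("Citation analysis of journals", "We study citation patterns in journals."),
    ("Mining citation networks", "Citation networks reveal research trends."),
]
print(top_terms(sample, k=1))
```

In a fuller workflow, the tokenizer here would presumably be replaced by lemmas or noun phrases produced by a linguistic annotator, so that surface variants of the same concept are counted together.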

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Department of Library and Information Science, Yonsei University, Seoul, South Korea
  2. Department of Information and Library Science, School of Informatics and Computing, Indiana University, Bloomington, USA
