Information Retrieval, Volume 13, Issue 2, pp 101–131

Document clustering of scientific texts using citation contexts

  • Bader Aljaber
  • Nicola Stokes
  • James Bailey
  • Jian Pei


Document clustering has many important applications in the area of data mining and information retrieval. Many existing document clustering techniques use the “bag-of-words” model to represent the content of a document. However, this representation is only effective for grouping related documents when these documents share a large proportion of lexically equivalent terms. In other words, instances of synonymy between related documents are ignored, which can reduce the effectiveness of applications using a standard full-text document representation. To address this problem, we present a new approach for clustering scientific documents, based on the utilization of citation contexts. A citation context is essentially the text surrounding the reference markers used to refer to other scientific works. We hypothesize that citation contexts will provide relevant synonymous and related vocabulary which will help increase the effectiveness of the bag-of-words representation. In this paper, we investigate the power of these citation-specific word features, and compare them with the original document’s textual representation in a document clustering task on two collections of labeled scientific journal papers from two distinct domains: High Energy Physics and Genomics. We also compare these text-based clustering techniques with a link-based clustering algorithm which determines the similarity between documents based on the number of co-citations, that is, in-links represented by citing documents and out-links represented by cited documents. Our experimental results indicate that the use of citation contexts, when combined with the vocabulary in the full-text of the document, is a promising alternative means of capturing critical topics covered by journal articles. More specifically, this document representation strategy, when used by the clustering algorithm investigated in this paper, outperforms both the full-text clustering approach and the link-based clustering technique on both scientific journal datasets.
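
As a rough illustration of the idea described in the abstract, the following minimal Python sketch shows how adding citation-context vocabulary to a bag-of-words vector can create term overlap between two related documents whose full texts share no words. It is not the authors' implementation: the function names (`bag_of_words`, `cosine`, `represent`), the whitespace tokenizer, the use of raw term counts, the cosine measure and the toy texts are all assumptions made for illustration only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term counts from lowercased, whitespace-split text (no stemming or stop-word removal)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def represent(full_text, citation_contexts):
    """Full-text term counts plus the counts from each citation context referring to the document."""
    vector = bag_of_words(full_text)
    for context in citation_contexts:
        vector.update(bag_of_words(context))  # Counter.update adds counts
    return vector


# Hypothetical toy data: two related documents with no shared full-text terms.
doc_a_text = "tumour suppressor gene expression"
doc_b_text = "cancer inhibiting genetic activity"
# Hypothetical text surrounding reference markers in papers that cite each document.
doc_a_contexts = ["cancer gene study"]
doc_b_contexts = ["tumour gene analysis"]

full_text_sim = cosine(bag_of_words(doc_a_text), bag_of_words(doc_b_text))
combined_sim = cosine(
    represent(doc_a_text, doc_a_contexts),
    represent(doc_b_text, doc_b_contexts),
)

print(f"full text only:       {full_text_sim:.3f}")
print(f"with citation context: {combined_sim:.3f}")
```

In this toy case the full-text vectors have no terms in common, while the combined vectors share "tumour", "gene" and "cancer", so a clustering algorithm that relies on such a similarity measure would have some evidence that the two documents are related.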


Citation contexts, Document clustering, Text categorization



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Bader Aljaber (1)
  • Nicola Stokes (2)
  • James Bailey (3)
  • Jian Pei (4)

  1. Department of Computer Science and Software Engineering, The University of Melbourne, Melbourne, Australia
  2. School of Computer Science and Informatics, University College Dublin, Dublin, Ireland
  3. NICTA Victoria Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Melbourne, Australia
  4. School of Computing Science, Simon Fraser University, Burnaby, Canada
