Automatic identification of cited text spans: a multi-classifier approach over imbalanced dataset

Abstract

Recently, a new form of structured summary of scientific papers has been explored, built by grouping the cited text spans of the reference paper. Its primary goal is to generate summaries based on the cited paper itself. Traditional scientific summarization focused on citation-based methods that aggregate all citances citing a given paper without content-based citation analysis, even though citations may differ between researchers or over time. By examining the original text spans that scholars cited, the new method reflects the actual contributions of reference papers more precisely. Identifying cited text spans accurately is therefore the first important problem to solve. In general, it can be cast as finding the sentences in the reference paper that are most similar to the citation sentences. Treating it as a classification task, we investigate the potential of four actions to improve identification performance. First, feature selection is conducted carefully for the multiple classifiers. Second, we apply sampling-based algorithms to preprocess the class-imbalanced datasets. Because we integrate results through a weighted voting system, the third action is tuning parameters, such as the voting weights for multi-classifier integration or the running settings, to see whether performance can be improved further. Evaluation results show the effectiveness of each action and demonstrate that researchers can take these actions to identify cited text spans more accurately when doing scientific summarization.
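
As a rough illustration of the pipeline the abstract outlines, the sketch below scores a citance against a candidate reference sentence, balances a class-imbalanced training set, and merges several classifiers' votes by weight. It is not the authors' implementation: the Jaccard word-overlap feature, the random oversampling (a simple stand-in for the unspecified sampling-based algorithms), the 0.5 threshold, and all function names are assumptions chosen for illustration.

```python
import random
from collections import Counter


def jaccard(citance: str, sentence: str) -> float:
    """Word-overlap similarity between a citance and a reference sentence.

    Assumed feature; the abstract does not say which similarity is used.
    """
    a, b = set(citance.lower().split()), set(sentence.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size.

    Stand-in for the sampling-based preprocessing mentioned in the abstract.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def weighted_vote(votes, weights, threshold=0.5):
    """Combine binary votes (0 or 1) from several classifiers into one label.

    The threshold and the weights are hypothetical tuning parameters.
    """
    score = sum(w * v for v, w in zip(votes, weights)) / sum(weights)
    return 1 if score >= threshold else 0


# One cited sentence among three candidates: a 1:2 class imbalance.
features = [[0.10], [0.20], [0.90]]
labels = [0, 0, 1]
balanced_rows, balanced_labels = random_oversample(features, labels)

# Three hypothetical classifiers vote on one candidate sentence.
decision = weighted_vote([1, 0, 1], weights=[0.5, 0.3, 0.2])
```

Tuning in this sketch would mean varying `weights` and `threshold` on held-out data and keeping the setting that identifies cited text spans most accurately.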

Notes

  1. Available at: https://tac.nist.gov/2014/BiomedSumm/index.html.

  2. Available at: http://wing.comp.nus.edu.sg/cl-scisumm2016/.

  3. Available at: http://wing.comp.nus.edu.sg/~cl-scisumm2017/.

  4. Available at: https://tac.nist.gov//2014/BiomedSumm/.

  5. Available at: http://wing.comp.nus.edu.sg/cl-scisumm2016/.

  6. Available at: http://wing.comp.nus.edu.sg/~birndl-sigir2017/.

  7. Available at: https://github.com/WING-NUS/scisumm-corpus.

  8. Available at: http://www.nist.gov/tac/2014.

  9. Available at: http://knowtator.sourceforge.net/.

  10. Available at: http://protege.stanford.edu/about.php.

  11. Available at: https://tac.nist.gov/2014/BiomedSumm/guidelines/BiomedSumm2014TaskDescription.pdf.

  12. Available at: http://tartarus.org/~martin/PorterStemmer/.

  13. Available at: http://radimrehurek.com/gensim/index.html.

  14. Available at: https://pypi.python.org/pypi/lda.

  15. Available at: http://scikit-learn.org/stable/index.html.

Acknowledgements

This work is supported by Major Projects of National Social Science Fund (No. 17ZDA291), Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (No. MJUKF201704) and Qing Lan Project.

Author information

Corresponding author

Correspondence to Chengzhi Zhang.

About this article

Cite this article

Ma, S., Xu, J. & Zhang, C. Automatic identification of cited text spans: a multi-classifier approach over imbalanced dataset. Scientometrics 116, 1303–1330 (2018). https://doi.org/10.1007/s11192-018-2754-2

  • DOI: https://doi.org/10.1007/s11192-018-2754-2
