Estimating Keyphrases Popularity in Sampling Collections

  • Svetlana Popova
  • Gabriella Skitalinskaya
  • Ivan Khodyrev
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9416)


The problem of structured representation of data has high practical value and is particularly relevant because of the growth of data volumes. Means of data representation such as topic graphs and concept trees are a convenient way to represent information retrieved from a collection of documents. In this paper, we study some aspects of using a sample of a collection to evaluate the popularity of concepts. This popularity can be used to visualize concept significance and to rank concepts in structured-representation tasks.

Multi-word phrases are treated as concepts. We address the case in which these phrases are extracted automatically from the processed document collection. The popularity of a concept (which can be shown visually, e.g., as the size of a vertex in the topic graph) is judged by the number of documents containing the phrase. We elaborate on the case in which a sample from the document collection is used to estimate concept popularity. For this case, we estimate how permissible such a representation of the data is, that is, how well it reflects the proportions of the numbers of documents containing specific concepts. A frequency-based criterion and a procedure for calculating it are described in the paper. This helps to estimate whether it is expedient to represent the popularity of a concept relative to the popularity of other concepts. The main goal is to establish criteria under which the relations between concept popularity values in a sample are the same as in the population, and to establish a criterion for selecting the n high-frequency concepts that have the same rank and frequency distributions in the sample as in the population.
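
The idea of comparing concept popularity in a sample with that in the population can be illustrated with a minimal Python sketch. It is not the criterion described in the paper, which the abstract does not specify. It only counts the documents that contain each phrase and checks whether the n most frequent phrases appear in the same order in both collections. The function names, the toy documents, and the use of plain substring matching are all assumptions made for illustration.

```python
import random
from collections import Counter


def document_frequency(docs, phrases):
    """Count, for each phrase, the number of documents that contain it.

    Assumption: a phrase "occurs" in a document if it is a
    case-insensitive substring of the document text.
    """
    counts = Counter({p: 0 for p in phrases})
    for doc in docs:
        text = doc.lower()
        for p in phrases:
            if p in text:
                counts[p] += 1
    return counts


def top_n(counts, n):
    """Return the n most frequent phrases; ties are broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, _ in ranked[:n]]


def same_top_n_order(population, sample, phrases, n):
    """Check whether the top-n phrases are ranked identically in both sets."""
    pop_rank = top_n(document_frequency(population, phrases), n)
    sample_rank = top_n(document_frequency(sample, phrases), n)
    return pop_rank == sample_rank


# Hypothetical toy collection: 100 short documents, three candidate phrases.
population = (
    ["topic graph of search results"] * 60
    + ["keyphrase extraction from short texts"] * 30
    + ["information extraction pipeline"] * 10
)
phrases = ["topic graph", "keyphrase extraction", "information extraction"]

# Draw a random sample of 40 documents and compare the rankings.
sample = random.Random(0).sample(population, 40)
print(document_frequency(sample, phrases))
print(same_top_n_order(population, sample, phrases, n=3))
```

With well-separated frequencies, as in this toy collection, a moderately sized sample tends to preserve the ranking; phrases with close population frequencies are the ones most likely to swap ranks in a sample.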


Keywords: Key phrase, Topic graph, Search result, Information extraction, Short texts, Sampling





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Svetlana Popova (1, 2)
  • Gabriella Skitalinskaya (3)
  • Ivan Khodyrev (1)

  1. ITMO University, Saint-Petersburg, Russia
  2. Saint-Petersburg State University, Saint-Petersburg, Russia
  3. Russian Presidential Academy of National Economy and Public, Moscow, Russia
