How Document Properties Affect Document Relatedness Measures

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8404)

Abstract

We address the question of how document properties (word count, term frequency, cohesiveness, genre) affect the quality of unsupervised document relatedness measures (Google trigram model and vector space model). We use three genres of documents: aviation safety reports, medical equipment failure descriptions, and biodiversity heritage library text. The quality of document relatedness is assessed by the accuracy of a classification task using the kNN method. Our experiments reveal correlations between document property values and document relatedness quality, and we discuss how one approach may perform better than the other depending on the property values of the dataset.
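
The evaluation idea in the abstract (score a relatedness measure by the accuracy of a kNN classifier built on it) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes raw term-frequency vectors with cosine similarity as the vector space model, leave-one-out evaluation, and a tiny invented toy corpus. The function names (`tf_vector`, `cosine`, `knn_predict`, `leave_one_out_accuracy`) are hypothetical, and the Google trigram measure is not reproduced here; any other measure with the same call signature could be passed in as `relatedness`.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(query, labelled, k=3, relatedness=cosine):
    """Label `query` by majority vote among its k most related documents."""
    ranked = sorted(labelled,
                    key=lambda pair: relatedness(query, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(docs, labels, k=3, relatedness=cosine):
    """Classify each document using all the others; return the hit rate."""
    vectors = [tf_vector(d) for d in docs]
    correct = 0
    for i, (vec, gold) in enumerate(zip(vectors, labels)):
        rest = [(v, l) for j, (v, l) in enumerate(zip(vectors, labels))
                if j != i]
        if knn_predict(vec, rest, k, relatedness) == gold:
            correct += 1
    return correct / len(docs)


# Invented toy documents, for illustration only (not from the paper's data).
toy_docs = [
    "engine failure during takeoff roll",
    "engine fire warning after takeoff",
    "pilot reported engine vibration on climb",
    "infusion pump alarm failed to sound",
    "pump delivered incorrect dose to patient",
    "patient monitor alarm battery failure",
]
toy_labels = ["aviation"] * 3 + ["medical"] * 3

accuracy = leave_one_out_accuracy(toy_docs, toy_labels, k=3)
print(f"leave-one-out kNN accuracy: {accuracy:.2f}")
```

Because the relatedness function is a parameter, two measures can be compared on the same corpus by running `leave_one_out_accuracy` once per measure and comparing the resulting accuracies.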



Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Perrie, J., Islam, A., Milios, E. (2014). How Document Properties Affect Document Relatedness Measures. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_33

  • DOI: https://doi.org/10.1007/978-3-642-54903-8_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54902-1

  • Online ISBN: 978-3-642-54903-8

  • eBook Packages: Computer Science, Computer Science (R0)
