How Document Properties Affect Document Relatedness Measures

Perrie, Jessica; Islam, Aminul; Milios, Evangelos

doi:10.1007/978-3-642-54903-8_33

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8404)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

We address the question of how document properties (word count, term frequency, cohesiveness, genre) affect the quality of unsupervised document relatedness measures (Google trigram model and vector space model). We use three genres of documents: aviation safety reports, medical equipment failure descriptions, and Biodiversity Heritage Library text. The quality of a document relatedness measure is assessed by the accuracy of a classification task using the kNN method. Our experiments reveal correlations between document property values and document relatedness quality, and we discuss how one measure may perform better than the other depending on the property values of the dataset.
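The evaluation idea in the abstract (scoring a relatedness measure by the accuracy of a kNN classifier built on it) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a vector space model with raw term-frequency weights and whitespace tokenization, uses leave-one-out evaluation as a stand-in for whatever protocol the paper uses, and leaves out the Google trigram model because that measure depends on the Google n-gram corpus. All function names are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Raw term-frequency vector for a lowercased, whitespace-tokenized text (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, labeled, k=3):
    """Label the query by majority vote among its k most related labeled documents."""
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(docs, labels, k=3):
    """Fraction of documents whose label is recovered from the remaining documents."""
    vectors = [tf_vector(d) for d in docs]
    correct = 0
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        # Hold out document i and classify it against all the others.
        rest = [(v, l) for j, (v, l) in enumerate(zip(vectors, labels)) if j != i]
        if knn_predict(vec, rest, k) == label:
            correct += 1
    return correct / len(docs)
```

Calling `leave_one_out_accuracy(docs, labels, k)` on a labeled collection yields one accuracy figure per relatedness measure. Swapping `cosine` for another measure would allow the kind of comparison the abstract describes, with the accuracy then related to properties such as word count.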

Author information

Authors and Affiliations

Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada, B3H 4R2
Jessica Perrie, Aminul Islam & Evangelos Milios

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Av. Juan Dios Bátiz, Col. Nueva Industrial Vallejo, 07738, Mexico D.F, Mexico
Alexander Gelbukh

Cite this paper

Perrie, J., Islam, A., Milios, E. (2014). How Document Properties Affect Document Relatedness Measures. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_33

DOI: https://doi.org/10.1007/978-3-642-54903-8_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54902-1
Online ISBN: 978-3-642-54903-8
eBook Packages: Computer Science (R0)
