Abstract
We examine how document properties (word count, term frequency, cohesiveness, genre) affect the quality of unsupervised document relatedness measures, specifically the Google trigram model and the vector space model. We use documents from three genres: aviation safety reports, medical equipment failure descriptions, and Biodiversity Heritage Library text. The quality of a document relatedness measure is assessed by the accuracy of a classification task that uses the kNN method. Our experiments reveal correlations between document property values and document relatedness quality, and we discuss how one approach may outperform the other depending on the property values of the dataset.
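The evaluation idea in the abstract, scoring a relatedness measure by the accuracy of a kNN classifier built on it, can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenization, the raw term-frequency weighting, the value of k, and the toy documents are all assumptions made for illustration, and only a simple vector space measure (cosine similarity) is shown, not the Google trigram measure.

```python
import math
from collections import Counter


def tf_vector(text):
    """Map a document to raw term-frequency counts (assumed: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(query, train, k, relatedness):
    """Label `query` by majority vote among its k most related training documents."""
    ranked = sorted(train, key=lambda item: relatedness(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train, test, k, relatedness):
    """Fraction of test documents whose predicted label matches the true label."""
    correct = sum(knn_predict(vec, train, k, relatedness) == label for vec, label in test)
    return correct / len(test)


# Hypothetical toy documents; the genre labels only loosely echo the paper's datasets.
train = [
    (tf_vector("engine failure during takeoff"), "aviation"),
    (tf_vector("runway incursion reported by pilot"), "aviation"),
    (tf_vector("infusion pump alarm failure"), "medical"),
    (tf_vector("pump battery fault in ward"), "medical"),
]
test = [
    (tf_vector("pilot reported engine vibration"), "aviation"),
    (tf_vector("infusion pump battery alarm"), "medical"),
]

accuracy = knn_accuracy(train, test, k=1, relatedness=cosine)
```

Because `relatedness` is passed in as a function, a different measure could be substituted and compared on the same split, which is the kind of comparison the abstract describes.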
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Perrie, J., Islam, A., Milios, E. (2014). How Document Properties Affect Document Relatedness Measures. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8404. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54903-8_33
DOI: https://doi.org/10.1007/978-3-642-54903-8_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54902-1
Online ISBN: 978-3-642-54903-8
eBook Packages: Computer Science, Computer Science (R0)