Abstract
In previous work, we introduced a way of encoding free-form documents called the bigram proximity matrix (BPM). When this encoding was used on a corpus of documents, where each document is tagged with a topic label, results showed that the documents could be classified based on their tagged meaning. In this paper, we investigate methods of weighting the elements of the BPM, analogous to the weighting schemes found in natural language processing. These include logarithmic weights, augmented normalized frequency, inverse document frequency and pointwise mutual information. Results presented in this paper show that some of the weights increased the proportion of correctly classified documents.
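To make the weighting schemes named above concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes the BPM element for an ordered word pair is the number of times the first word is immediately followed by the second, and it uses common textbook forms of each weight, which may differ from the exact definitions the authors adopt. The BPM is stored sparsely as a dictionary keyed by word pair; all function names are hypothetical.

```python
import math
from collections import Counter


def bigram_counts(text):
    """Sparse BPM (assumed form): counts of ordered adjacent word pairs."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def log_weight(counts):
    """Logarithmic weight, here log(1 + f) for each nonzero count f."""
    return {pair: math.log(1 + f) for pair, f in counts.items()}


def augmented_normalized(counts):
    """Augmented normalized frequency: 0.5 + 0.5 * f / max_f for nonzero f."""
    if not counts:
        return {}
    max_f = max(counts.values())
    return {pair: 0.5 + 0.5 * f / max_f for pair, f in counts.items()}


def inverse_document_frequency(corpus_counts):
    """IDF per bigram: log(N / n), where n is the number of the N documents
    whose BPM has a nonzero entry for that bigram."""
    n_docs = len(corpus_counts)
    doc_freq = Counter(pair for counts in corpus_counts for pair in counts)
    return {pair: math.log(n_docs / n) for pair, n in doc_freq.items()}


def pointwise_mutual_information(counts):
    """PMI of each observed bigram (a, b): log(p(a, b) / (p(a) * p(b))),
    with probabilities estimated from the row and column sums of the BPM."""
    total = sum(counts.values())
    first, second = Counter(), Counter()
    for (a, b), f in counts.items():
        first[a] += f   # row sum: a appears as the first word
        second[b] += f  # column sum: b appears as the second word
    return {
        (a, b): math.log(f * total / (first[a] * second[b]))
        for (a, b), f in counts.items()
    }


# Toy corpus (illustrative only, not data from the paper).
docs = ["the dog chased the cat", "the cat sat"]
bpms = [bigram_counts(d) for d in docs]
idf = inverse_document_frequency(bpms)
weighted = [{pair: w * idf[pair] for pair, w in log_weight(b).items()} for b in bpms]
```

In this sketch a global weight (IDF) is multiplied by a local weight (the logarithmic one) entry by entry, which mirrors how such weights are usually combined for term-document matrices; whether the paper combines them this way is an assumption.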
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Martinez, A.R., Wegman, E.J., Martinez, W.L. (2004). Using Weights with a Text Proximity Matrix. In: Antoch, J. (eds) COMPSTAT 2004 — Proceedings in Computational Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-2656-2_26
DOI: https://doi.org/10.1007/978-3-7908-2656-2_26
Publisher Name: Physica, Heidelberg
Print ISBN: 978-3-7908-1554-2
Online ISBN: 978-3-7908-2656-2
eBook Packages: Springer Book Archive