Abstract
The number of electronic documents such as SGML/HTML/XML files and LaTeX files has been increasing rapidly, owing to the rapid progress of network and storage technologies. Many electronic documents have no rigid structure and are called semistructured documents. Since many semistructured documents contain large amounts of plain text, we focus on the structural characteristics among words in semistructured documents. The aim of this paper is to present a text mining technique for semistructured documents. We consider the problem of finding all frequent structured patterns among words in semistructured documents. Let k ≥ 2 be an integer and let (W_1, W_2, ..., W_k) be a list of words sorted in lexicographical order. First, we define a tree-association pattern on (W_1, W_2, ..., W_k). A tree-association pattern on (W_1, W_2, ..., W_k) is a sequence 〈t_1; t_2; ...; t_{k-1}〉 of labeled rooted trees such that, for i = 1, 2, ..., k-1, either (1) t_i consists of a single node labeled with the pair of words W_i and W_{i+1}, or (2) t_i is a labeled rooted tree that has exactly two leaves, labeled W_i and W_{i+1}, respectively. Next, we present a text mining algorithm for finding all frequent tree-association patterns in semistructured documents. Finally, we report experimental results showing that our algorithm is effective for extracting structural characteristics in semistructured documents.
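The definition of a tree-association pattern can be made concrete with a small validity check. The following Python code is only an illustrative sketch of the definition as stated in the abstract, not the mining algorithm of the paper. The names Node, leaves, is_valid_component and is_tree_association_pattern are hypothetical, as are the example words and the inner node labels "section" and "para". The sketch also assumes that the two leaves in case (2) may appear in either order, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple, Union

# A label is either a single string or a pair of words (case 1 of the definition).
Label = Union[str, Tuple[str, str]]


@dataclass
class Node:
    """A node of a labeled rooted tree (hypothetical representation)."""
    label: Label
    children: List["Node"] = field(default_factory=list)


def leaves(t: Node) -> List[Node]:
    """Return the leaves of the tree rooted at t, from left to right."""
    if not t.children:
        return [t]
    found: List[Node] = []
    for child in t.children:
        found.extend(leaves(child))
    return found


def is_valid_component(t: Node, w1: str, w2: str) -> bool:
    """Check one tree t_i against the two cases of the definition."""
    if not t.children:
        # Case (1): a single node labeled with the pair (W_i, W_{i+1}).
        return t.label == (w1, w2)
    # Case (2): exactly two leaves, labeled W_i and W_{i+1}.
    # Assumption: the left-to-right order of the two leaves is not constrained.
    leaf_labels = [leaf.label for leaf in leaves(t)]
    return len(leaf_labels) == 2 and set(leaf_labels) == {w1, w2}


def is_tree_association_pattern(words: Sequence[str], trees: Sequence[Node]) -> bool:
    """Check that trees is a tree-association pattern on the word list."""
    k = len(words)
    # The word list must have k >= 2 entries in lexicographical order,
    # and the pattern must consist of exactly k - 1 trees.
    if k < 2 or list(words) != sorted(words) or len(trees) != k - 1:
        return False
    return all(
        is_valid_component(t, words[i], words[i + 1])
        for i, t in enumerate(trees)
    )


# Hypothetical example with k = 3 words.
words = ["data", "mining", "text"]
# t_1: a single node labeled with the pair ("data", "mining").
t1 = Node(("data", "mining"))
# t_2: a tree with exactly two leaves, "mining" and "text";
# the inner labels stand in for tag names and are made up for illustration.
t2 = Node("section", [Node("mining"), Node("para", [Node("text")])])

pattern_ok = is_tree_association_pattern(words, [t1, t2])
```

In this sketch a pattern on k words always has k - 1 components, one for each pair of adjacent words in the sorted list, so the check rejects any sequence of a different length before inspecting the trees.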
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Furukawa, K., Uchida, T., Yamada, K., Miyahara, T., Shoudai, T., Nakamura, Y. (2002). Extracting Characteristic Structures among Words in Semistructured Documents. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_36
DOI: https://doi.org/10.1007/3-540-47887-6_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43704-8
Online ISBN: 978-3-540-47887-4
eBook Packages: Springer Book Archive