U-STRUCT: A Framework for Conversion of Unstructured Text Documents into Structured Form

Jindal, Rajni; Taneja, Shweta

doi:10.1007/978-3-642-36321-4_6


Part of the book series: Communications in Computer and Information Science ((CCIS,volume 361))

Included in the following conference series:

International Conference on Advances in Computing, Communication and Control


Abstract

The term Text Mining, or Text Analytics, refers to the process of extracting useful patterns or knowledge from text. The data in textual documents can be of two types: unstructured or semi-structured. Unstructured data is free, naturally occurring text, whereas web document data (HTML or XML) is semi-structured. Because natural language text is not organized and does not represent context, it needs to be converted into a structured form before data analysis can be performed and useful patterns mined from it. The field of text mining deals with mining useful patterns or knowledge from unstructured text.

In this paper, we propose a generalized framework called U-STRUCT, which converts unstructured text documents into a structured form. The framework analyses the text documents from three views (lexical, syntactic and semantic) and produces a generalized intermediate form of the documents. We also discuss the opportunities and challenges in the field of text mining.
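As a rough illustration of what converting free text into a structured record can look like, the following minimal Python sketch covers only a lexical view: it tokenizes a document, drops stop words, and emits a flat record of term counts. It is not the U-STRUCT framework itself, whose details are not given in the abstract. The function names, the record layout, and the stop-word list are assumptions made for this example, and the syntactic and semantic views are not modelled.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only (not taken from the paper).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "from"}


def tokenize(text):
    """Lexical step: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def to_structured(doc_id, text):
    """Turn one free-text document into a flat record of term counts.

    The record layout (doc_id, n_tokens, terms) is an assumed intermediate
    form chosen for this sketch.
    """
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return {
        "doc_id": doc_id,
        "n_tokens": len(tokens),
        "terms": dict(Counter(tokens)),
    }


record = to_structured("d1", "Text mining is the mining of patterns from text.")
print(record)
```

A record of this kind can be stored as a table row or fed to a conventional data mining algorithm, which is the motivation the abstract gives for producing an intermediate structured form.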



Author information

Authors and Affiliations

Dept. of Computer Engineering, Delhi Technological University Formerly Delhi College of Engineering (DCE), Bawana Road, Delhi, 42, India
Rajni Jindal & Shweta Taneja


Editor information

Editors and Affiliations

Fr. Conceicao Rodrigues College of Engineering, Bandstand, Bandra (W), 400 050, Mumbai, Maharashtra, India
Srija Unnikrishnan
Fr. Conceicao Rodrigues College of Engineering, Bandstand, Bandra (W), 400 050, Mumbai, India
Sunil Surve
Dept. of Electronics Engineering, Fr. Conceicao Rodrigues College of Engineering, Bandstand, Bandra (West), 400 050, Mumbai, India
Deepak Bhoir


Cite this paper

Jindal, R., Taneja, S. (2013). U-STRUCT: A Framework for Conversion of Unstructured Text Documents into Structured Form. In: Unnikrishnan, S., Surve, S., Bhoir, D. (eds) Advances in Computing, Communication, and Control. ICAC3 2013. Communications in Computer and Information Science, vol 361. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36321-4_6


DOI: https://doi.org/10.1007/978-3-642-36321-4_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36320-7
Online ISBN: 978-3-642-36321-4
eBook Packages: Computer Science, Computer Science (R0)
