Corpus Building for Corporate Knowledge Discovery and Management: A Case Study of Manufacturing

  • Ying Liu
  • Han Tong Loh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4692)


Building a collection of electronic documents, i.e. a corpus, is a cornerstone of research in information retrieval, text mining and knowledge management. In the literature, very few papers have discussed the concerns involved in building a corpus or explained the building process systematically. In this paper, we explain our work on building an enterprise corpus, called manufacturing corpus version 1 (MCV1), for corporate knowledge management purposes. Relevant issues, e.g. input texts, category labels and policies, as well as its parallel coding process and quality measurements, are discussed. Real-world automated text classification experiments based on MCV1 show the soundness of its coding process. Finally, suggestions are made on how the proposed approach can be implemented in a more economical manner.


Information Retrieval, Knowledge Management, Code Operator, Category Label, Code Process
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Ying Liu (1)
  • Han Tong Loh (2)
  1. Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR, China
  2. Department of Mechanical Engineering, National University of Singapore, 21 Lower Kent Ridge Road, Singapore 119077
