
Enriching Domain-Specific Language Models Using Domain Independent WWW N-Gram Corpus

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 7268)

Abstract

This paper describes new techniques for extracting and computing the domain-specific knowledge implicitly embedded in a highly structured, ontology-based information system for a TV Electronic Programming Guide (EPG). The domain knowledge is represented as a set of mutually related n-gram data sets. It is then enriched by exploring the explicit structural dependencies and implicit semantic associations between the data entities in the domain and domain-independent texts from the Google 1-trillion-word 5-gram corpus, which was created from general WWW documents. This knowledge-based enrichment process produces the language models required for a natural-language-based EPG search system. These models outperform the baseline model, which is built only from the original EPG data source, by a significant margin: an absolute improvement of 14.1% in model coverage (recall accuracy), measured on large-scale test data collected from a real-world EPG search application.
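To make the coverage metric concrete, here is a minimal Python sketch of how n-gram coverage of a query set could be compared before and after enrichment. It is not the paper's method: the filter in `enrich` (keep a general-corpus n-gram if it shares at least one word with the domain vocabulary) is a simple stand-in for the structural and semantic association the abstract describes. All function names, the toy EPG phrases, the counts, and the test queries are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> List[NGram]:
    """Split text on whitespace and return its word n-grams."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_model(phrases: Iterable[str], n: int = 2) -> Set[NGram]:
    """Baseline 'model': the set of n-grams seen in the domain phrases."""
    model: Set[NGram] = set()
    for phrase in phrases:
        model.update(ngrams(phrase, n))
    return model


def enrich(model: Set[NGram], general: Dict[NGram, int],
           min_count: int = 1) -> Set[NGram]:
    """Add general-corpus n-grams that share a word with the domain.

    Assumption: word overlap is a toy proxy for the paper's
    knowledge-based selection, which the abstract does not specify.
    """
    vocab = {word for gram in model for word in gram}
    enriched = set(model)
    for gram, count in general.items():
        if count >= min_count and any(word in vocab for word in gram):
            enriched.add(gram)
    return enriched


def coverage(model: Set[NGram], queries: Iterable[str], n: int = 2) -> float:
    """Fraction of query n-grams that the model contains."""
    total = 0
    hits = 0
    for query in queries:
        for gram in ngrams(query, n):
            total += 1
            hits += gram in model
    return hits / total if total else 0.0


# Hypothetical toy data, for illustration only.
epg_phrases = ["star trek", "evening news", "nature documentary"]
general_counts = {
    ("watch", "star"): 500,
    ("the", "evening"): 900,
    ("news", "tonight"): 300,
    ("stock", "market"): 1000,  # no domain word, so it is not added
}
test_queries = ["watch star trek", "the evening news tonight"]

baseline = build_model(epg_phrases)
enriched = enrich(baseline, general_counts)

print(f"baseline coverage: {coverage(baseline, test_queries):.2f}")
print(f"enriched coverage: {coverage(enriched, test_queries):.2f}")
```

On this toy data the baseline covers 2 of the 5 query bigrams and the enriched set covers all 5. The numbers only illustrate how an absolute coverage gain is computed and say nothing about the 14.1% figure reported above.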

Keywords

  • Knowledge engineering
  • natural language processing
  • text mining
  • WWW
  • speech recognition

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chang, H. (2012). Enriching Domain-Specific Language Models Using Domain Independent WWW N-Gram Corpus. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2012. Lecture Notes in Computer Science, vol 7268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29350-4_5

  • DOI: https://doi.org/10.1007/978-3-642-29350-4_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29349-8

  • Online ISBN: 978-3-642-29350-4

  • eBook Packages: Computer Science, Computer Science (R0)