Conquering Language: Using NLP on a Massive Scale to Build High Dimensional Language Models from the Web

  • Gregory Grefenstette
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4394)


Dictionaries contain only part of the information we need to know about a language. The growth of the Web, the maturation of linguistic processing tools, and the falling price of storage allow us to envision descriptions of languages that are much larger than before. We can conceive of building a complete model of a language from all the text found on the Web in that language. This article describes our current project to do just that.


Keywords: Search Engine, Language Model, Word Form, Language Identification, Computational Linguistics





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Gregory Grefenstette
    Commissariat à l’Energie Atomique, CEA LIST, SRCI, BP 6, 92265 Fontenay aux Roses Cedex, France
