Knowledge and Information Systems, Volume 7, Issue 1, pp 56–83

Building Minority Language Corpora by Learning to Generate Web Search Queries

  • Rayid Ghani
  • Rosie Jones
  • Dunja Mladenic


The Web is a source of valuable information, but the process of collecting, organizing, and effectively utilizing the resources it contains is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries for collecting documents matching a minority concept. The concept used for this paper is that of text documents belonging to a minority natural language on the Web. Individual documents are automatically labeled as relevant or nonrelevant using a language filter, and the feedback is used to learn what query lengths and inclusion/exclusion term-selection methods are helpful for finding previously unseen documents in the target language. Our system learns to select good query terms using a variety of term scoring methods. Using odds ratio scores calculated over the documents acquired was one of the most consistently accurate query-generation methods. To reduce the number of estimated parameters, we parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user. Experiments applying the same approach to multiple languages are also presented showing that our approach generalizes well across several languages regardless of the initial conditions.
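The odds-ratio term scoring mentioned in the abstract can be illustrated with a minimal sketch. This is not the CorpusBuilder implementation: the function names, the whitespace tokenization, the additive smoothing constant, and the `+term` / `-term` query syntax are all assumptions made for illustration. It scores each term by the log odds ratio of its document frequency in relevant versus nonrelevant documents, then builds a query from the highest-scoring terms (inclusion) and the lowest-scoring terms (exclusion).

```python
import math
from collections import Counter


def odds_ratio_scores(relevant_docs, nonrelevant_docs, smoothing=0.5):
    """Log odds ratio per term, from smoothed document frequencies.

    The smoothing constant is an illustrative assumption; it keeps the
    estimated probabilities strictly between 0 and 1.
    """
    def doc_freq(docs):
        counts = Counter()
        for doc in docs:
            # Count each term once per document (document frequency).
            counts.update(set(doc.lower().split()))
        return counts

    pos, neg = doc_freq(relevant_docs), doc_freq(nonrelevant_docs)
    n_pos, n_neg = len(relevant_docs), len(nonrelevant_docs)
    scores = {}
    for term in set(pos) | set(neg):
        p = (pos[term] + smoothing) / (n_pos + 2 * smoothing)  # P(term | relevant)
        q = (neg[term] + smoothing) / (n_neg + 2 * smoothing)  # P(term | nonrelevant)
        scores[term] = math.log(p * (1 - q) / ((1 - p) * q))
    return scores


def build_query(scores, n_include=2, n_exclude=2):
    """Build a query string; the +/- syntax is a hypothetical search-engine convention."""
    # Ties are broken alphabetically so the output is deterministic.
    best_first = sorted(scores, key=lambda t: (-scores[t], t))
    worst_first = sorted(scores, key=lambda t: (scores[t], t))
    include = [t for t in best_first if scores[t] > 0][:n_include]
    exclude = [t for t in worst_first if scores[t] < 0][:n_exclude]
    return " ".join(["+" + t for t in include] + ["-" + t for t in exclude])


# Toy data: two documents in the target language, two in another language.
relevant = ["dober dan svet", "dober večer svet"]
nonrelevant = ["good day world", "good evening world"]
scores = odds_ratio_scores(relevant, nonrelevant)
query = build_query(scores)
```

In this sketch the query lengths (`n_include`, `n_exclude`) are fixed parameters; the abstract describes learning them from feedback on past queries, which is not modeled here.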


Web mining · Online learning · Query generation · Corpus construction





Copyright information

© Springer-Verlag 2004

Authors and Affiliations

  1. Accenture Technology Labs, Chicago, USA
  2. Carnegie Mellon University, Pittsburgh, USA
  3. J. Stefan Institute, Ljubljana, Slovenia
