
Constructing specialised corpora through analysing domain representativeness of websites

Original Paper · Language Resources and Evaluation

Abstract

The role of the Web in text corpus construction is becoming increasingly significant. However, the contribution of the Web is largely confined to building a general virtual corpus or low-quality specialised corpora. In this paper, we introduce a new technique called SPARTAN for constructing specialised corpora from the Web by systematically analysing website contents. Our evaluations show that the corpora constructed using our technique are independent of the search engines employed. In particular, SPARTAN-derived corpora outperform all corpora based on existing techniques for the task of term recognition.
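To make the idea concrete, the following minimal Python sketch illustrates one crude notion of domain representativeness: the share of a website's indexed pages that match a set of domain seed terms, used to filter candidate sites. This is an illustrative sketch only, not the authors' SPARTAN algorithm; the page_count callable and the 0.5 threshold are hypothetical stand-ins for a search-engine hit-count API and a tuned cut-off.

    from typing import Callable, Iterable

    def representativeness(site: str, seeds: Iterable[str],
                           page_count: Callable[[str], int]) -> float:
        """Fraction of a site's indexed pages matching at least one seed term."""
        total = page_count(f"site:{site}")  # all pages indexed for this site
        if total == 0:
            return 0.0
        hits = page_count("(" + " OR ".join(seeds) + f") site:{site}")
        return hits / total

    def select_domain_sites(candidates: Iterable[str], seeds: Iterable[str],
                            page_count: Callable[[str], int],
                            threshold: float = 0.5) -> list[str]:
        """Keep sites whose contents are predominantly on-domain; heterogeneous
        sites (e.g. news or hosting sites) score low and are dropped."""
        return [s for s in candidates
                if representativeness(s, seeds, page_count) >= threshold]

Pages from the retained sites would then be downloaded and cleaned to form the specialised corpus.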


Notes

  1. Google’s Web search interface serves up to 1,000 results; automated crawling and scraping of the result pages for URLs, however, leads to the blocking of IP addresses. Google’s SOAP Search API, which allowed up to 1,000 queries per day, was permanently phased out in August 2009. Refer to http://www.googleajaxsearchapi.blogspot.com/2007/12/search-result-limit-increase.html for more information.

  2. http://www.webcorp.org.uk.

  3. Certain websites, such as news sites and hosting sites, have contents that are heterogeneous in nature. Such sites are, however, automatically and systematically identified and removed by the proposed technique during the corpus construction process.

  4. A generalised version of the Normalised Google Distance (NGD) by Cilibrasi and Vitanyi (2007).
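     For reference, the original (non-generalised) NGD of Cilibrasi and Vitanyi (2007) between two terms x and y, where f(x), f(y) and f(x, y) are page counts and N is the number of pages indexed by the search engine, is

         NGD(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log N - \min\{\log f(x), \log f(y)\}}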

  5. This page count and all subsequent page counts derived from Google and Yahoo were obtained on 2 April 2009.

  6. Other commonly-used search engines such as AltaVista and AlltheWeb were not included in the comparison since they use the same search index as Yahoo.

  7. http://www.cleaneval.sigwac.org.uk/devset.html.

  8. http://www.search.cpan.org/stro/Text-Compare-1.03/lib/Text/Compare.pm.

  9. A demo is available at http://www.ontology.csse.uwa.edu.au/research/algorithm_hercules.pl.

  10. http://www.sslmit.unibo.it/baroni/bootcat.html.

  11. The terms are ranked using the technique by Basili et al. (2001).

  12. The download speed was tested using http://www.ozspeedtest.com/.

  13. More information on Yahoo! Search, including API key registration, is available at http://www.developer.yahoo.com/search/web/V1/webSearch.html.

  14. A demo is available at http://www.ontology.csse.uwa.edu.au/research/data_virtualcorpus.pl.

  15. A demo is available at http://www.ontology.csse.uwa.edu.au/research/data_localcorpus.pl.

  16. Note that this estimate is highly conjectural but serves as an interesting point of discussion and future work. If linear extrapolation were used instead, a precision of 99.21% may require only 85 seeds. Linear extrapolation, however, is unlikely considering that with zero seeds, in other words an empty corpus, the precision would still be at an improbably high 94.12%.
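     For concreteness, the linear model implied by the two figures quoted in this note (a back-of-the-envelope illustration, not a fit reported in the paper) is

         P(n) \approx 94.12 + \frac{99.21 - 94.12}{85}\, n \approx 94.12 + 0.06\, n

     with P the precision in percent and n the number of seeds, which recovers P(0) = 94.12 and P(85) ≈ 99.21.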

References

  • Adamic, L., & Huberman, B. (2002). Zipf’s law and the internet. Glottometrics, 3(1), 143–150.

  • Agbago, A., & Barriere, C. (2005). Corpus construction for terminology. In Proceedings of the corpus linguistics conference, Birmingham, UK.

  • Baroni, M., & Bernardini, S. (2004). Bootcat: Bootstrapping corpora and terms from the web. In Proceedings of the 4th language resources and evaluation conference (LREC), Lisbon, Portugal.

  • Baroni, M., & Bernardini, S. (2006). Wacky! Working papers on the web as corpus. Bologna, Italy: GEDIT.

  • Baroni, M., & Ueyama, M. (2006). Building general- and special-purpose corpora by web crawling. In Proceedings of the 13th NIJL international symposium on language corpora: Their compilation and application.

  • Baroni, M., Kilgarriff, A., Pomikalek, J., & Rychly, P. (2006). Webbootcat: Instant domain-specific corpora to support human translators. In Proceedings of the 11th annual conference of the European association for Machine Translation (EAMT), Norway.

  • Basili, R., Moschitti, A., Pazienza, M., & Zanzotto, F. (2001). A contrastive approach to term extraction. In Proceedings of the 4th terminology and artificial intelligence conference (TIA), France.

  • Blair, I., Urland, G., & Ma, J. (2002). Using internet search engines to estimate word frequency. Behavior Research Methods Instruments & Computers, 34(2), 286–290.

  • Cavaglia, G., & Kilgarriff, A. (2001). Corpora from the web. In Proceedings of the 4th annual CLUCK colloquium, Sheffield, UK.

  • Cilibrasi, R., & Vitanyi, P. (2007). The google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3), 370–383.

  • Evert, S. (2007). Stupidos: A high-precision approach to boilerplate removal. In Proceedings of the 3rd web as corpus workshop, Belgium.

  • Evert, S. (2008). A lightweight and efficient tool for cleaning web pages. In Proceedings of the 4th web as corpus workshop (WAC), Morocco.

  • Fetterly, D., Manasse, M., Najork, M., & Wiener, J. (2003). A large-scale study of the evolution of web pages. In Proceedings of the 12th international conference on world wide web, Budapest, Hungary.

  • Fletcher, W. (2007). Implementing a bnc-comparable web corpus. In Proceedings of the 3rd web as corpus workshop, Belgium.

  • Francis, W., & Kucera, H. (1979). Brown corpus manual. http://icame.uib.no/brown/bcm.html.

  • Girardi, C. (2007). Htmlcleaner: Extracting the relevant text from the web pages. In Proceedings of the 3rd web as corpus workshop, Belgium.

  • Halliday, M., Teubert, W., Yallop, C., & Cermakova, A. (2004). Lexicology and corpus linguistics: An introduction. London: Continuum.

  • Henzinger, M., & Lawrence, S. (2004). Extracting knowledge from the world wide web. PNAS, 101(1), 5186–5191.

  • Jock, F. (2009). An overview of the importance of page rank. http://www.associatedcontent.com/article/1502284/an_overview_of_the_importance_of_page.html?cat=15. Accessed 9 March 2009.

  • Keller, F., Lapata, M., & Ourioupina, O. (2002). Using the web to overcome data sparseness. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), Philadelphia.

  • Kida, M., Tonoike, M., Utsuro, T., & Sato, S. (2007). Domain classification of technical terms using the web. Systems and Computers in Japan, 38(14), 11–19.

  • Kilgarriff, A. (2001). Web as corpus. In Proceedings of the corpus linguistics (CL), Lancaster University, UK.

  • Kilgarriff, A. (2007). Googleology is bad science. Computational Linguistics, 33(1), 147–151.

  • Kilgarriff, A., & Grefenstette, G. (2003). Web as corpus. Computational Linguistics, 29(3), 1–15.

  • Kim, J., Ohta, T., Teteisi, Y., & Tsujii, J. (2003). Genia corpus: A semantically annotated corpus for bio-textmining. Bioinformatics, 19(1), 180–182.

  • Lapata, M., & Keller, F. (2005). Web-based models for natural language processing. ACM Transactions on Speech and Language Processing, 2(1), 1–30.

  • Liberman, M. (2005). Questioning reality. http://www.itre.cis.upenn.edu./myl/languagelog/archives/001837.html. Accessed 26 March 2009.

  • Liu, V., & Curran, J. (2006). Web text corpus for natural language processing. In Proceedings of the 11th conference of the European chapter of the association for computational linguistics (EACL), Italy.

  • McEnery, T., Xiao, R., & Tono, Y. (2005). Corpus-based language studies: An advanced resource book. London, UK: Taylor & Francis Group Plc.

  • Nakov, P., & Hearst, M. (2005). A study of using search engine page hits as a proxy for n-gram frequencies. In Proceedings of the international conference on recent advances in natural language processing (RANLP), Bulgaria.

  • O’Neill, E., McClain, P., & Lavoie, B. (2001). A methodology for sampling the world wide web. Journal of Library Administration, 34(3), 279–291.

  • Ravichandran, D., Pantel, P., & Hovy, E. (2005). Randomized algorithms and nlp: Using locality sensitive hash function for high speed noun clustering. In Proceedings of the 43rd annual meeting on association for computational linguistics, Michigan, USA.

  • Renouf, A., Kehoe, A., & Banerjee, J. (2007). Webcorp: An integrated system for web text search. In M. Hundt, N. Nesselhauf, & C. Biewer (Eds.), Corpus linguistics and the web. Amsterdam: Rodopi.

  • Resnik, P., & Smith, N. (2003). The web as a parallel corpus. Computational Linguistics, 29(3), 349–380.

  • Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (Eds.), Wacky! Working papers on the web as corpus. Bologna: GEDIT.

  • Thelwall, M., & Stuart, D. (2006). Web crawling ethics revisited: Cost, privacy and denial of service. Journal of the American Society for Information Science and Technology, 57(13), 1771–1779.

  • Turney, P. (2001). Mining the web for synonyms: Pmi-ir versus lsa on toefl. In Proceedings of the 12th European conference on machine learning (ECML). Freiburg, Germany.

  • Wong, W., Liu, W., & Bennamoun, M. (2007). Tree-traversing ant algorithm for term clustering based on featureless similarities. Data Mining and Knowledge Discovery, 15(3), 349–381.

  • Wong, W., Liu, W., & Bennamoun, M. (2008a). Constructing web corpora through topical web partitioning for term recognition. In Proceedings of the 21st Australasian joint conference on artificial intelligence (AI). Auckland, New Zealand.

  • Wong, W., Liu, W., & Bennamoun, M. (2008b). Determination of unithood and termhood for term recognition. In M. Song & Y. Wu (Eds.), Handbook of research on text and web mining technologies. IGI Global.

  • Wong, W., Liu, W., & Bennamoun, M. (2009). A probabilistic framework for automatic term recognition. Intelligent Data Analysis, 13(4), 499–539.

Acknowledgments

This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, and the UWA Research Development Award 2009 from the University of Western Australia. The authors would like to thank the anonymous reviewers for their invaluable comments.

Author information

Correspondence to Wei Liu.


Cite this article

Wong, W., Liu, W. & Bennamoun, M. Constructing specialised corpora through analysing domain representativeness of websites. Lang Resources & Evaluation 45, 209–241 (2011). https://doi.org/10.1007/s10579-011-9141-4
