A Method for Analysing Large-Scale UGC Data for Tourism: Application to the Case of Catalonia

  • Estela Marine-Roig
  • Salvador Anton Clave
Conference paper


In recent years, many articles have been published on user-generated content (UGC) data in the domains of tourism and hospitality, in particular on quantitative and qualitative content analysis of travel blogs and online travel reviews (OTRs). In general, researchers have worked on samples of tens or hundreds of travel diaries, more or less representative of the population, which are small enough to be processed manually. However, because the volume of such content has grown dramatically, especially in the case of hospitality OTRs, this article proposes a method for semi-automatically downloading, arranging, cleaning, debugging, and analysing large-scale travel blog and OTR data. The main goal is to classify the collected webpages by date and destination and to enable offline content analysis of the written text as provided by the author. The methodology is applied to about 85,000 diaries of tourists who visited Catalonia between 2004 and 2013, and significant results are obtained in terms of content analysis.
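As a rough illustration of the kind of workflow the abstract describes (cleaning downloaded pages, classifying them by destination and date, then analysing the text offline), the following Python sketch may help. It is not the authors' implementation: the sample pages, the `destination`, `date`, and `html` field names, and the helper functions `clean_html`, `classify`, and `word_counts` are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean_html(html):
    """Strip markup and collapse whitespace, keeping only the written text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def classify(pages):
    """Group cleaned texts by (destination, year). Field names are assumed."""
    groups = defaultdict(list)
    for page in pages:
        year = page["date"][:4]  # assumes ISO dates such as "2012-07-14"
        groups[(page["destination"], year)].append(clean_html(page["html"]))
    return groups


def word_counts(texts):
    """Count lowercase alphabetic tokens across a list of texts."""
    counter = Counter()
    for text in texts:
        counter.update(re.findall(r"[a-z]+", text.lower()))
    return counter


# Invented sample data standing in for downloaded travel diaries.
pages = [
    {"destination": "Barcelona", "date": "2012-07-14",
     "html": "<html><body><p>Gaudi and the beach</p>"
             "<script>x=1</script></body></html>"},
    {"destination": "Barcelona", "date": "2012-08-02",
     "html": "<p>The beach was crowded</p>"},
    {"destination": "Girona", "date": "2013-05-20",
     "html": "<p>Old town walls</p>"},
]

groups = classify(pages)
counts = word_counts(groups[("Barcelona", "2012")])
print(counts.most_common(3))
```

Grouping by a (destination, year) key keeps each subset small enough to analyse separately; a real pipeline at the scale of tens of thousands of pages would presumably store the cleaned files on disk rather than in memory.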


Travel blog · Online travel review · Web harvesting · Web data mining · Massive content analysis · Catalonia



This work was supported by the Spanish Ministry of Economy and Competitiveness [Grant id.: GLOBALTUR CSO2011-23004 / GEOG].



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Research Group on Territorial Analysis and Tourism Studies (GRATET), Rovira i Virgili University, Catalonia, Spain
