Topic Crawler for Social Networks Monitoring

  • Andrei V. Yakushev
  • Alexander V. Boukhanovsky
  • Peter M. A. Sloot
Part of the Communications in Computer and Information Science book series (CCIS, volume 394)

Abstract

This paper describes a focused crawler for monitoring social networks, used for information extraction and content analysis. The crawler implements the MapReduce model for distributed computation and is designed for large volumes of text data. It searches for pages classified as relevant to a specified topic. The classifier is built on a knowledge database that defines words, their classes, and the rules for joining words into phrases. The weights of the words and phrases found in a text are combined into a text weight that indicates its relevance to the topic. The system was used to detect the drug community in the Russian segment of the LiveJournal social network. Official and slang drug terminology was used to develop the knowledge database. Different aspects of the knowledge database and the classifier are studied. A non-homogeneous Poisson process was used to model blog updates, since it permits building a monitoring policy that accounts for both blog update frequency and time-of-day effects. Evaluation on real data shows a 25% increase in the detection of new posts.
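To make the idea of a text weight concrete, the following is a minimal Python sketch of a knowledge-base classifier of the kind the abstract outlines. It is not the authors' implementation: the dictionary layout, the placeholder words and classes, the additive scoring, the adjacency-based phrase rule, and the threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical knowledge base: word -> (class, weight).
# Words, classes, and weights are placeholders, not taken from the paper.
WORD_WEIGHTS = {
    "alpha": ("substance", 2.0),
    "beta": ("action", 1.0),
    "gamma": ("substance", 1.5),
}

# Hypothetical phrase rules: an ordered pair of word classes that forms a
# phrase when the two words are adjacent, mapped to a bonus weight.
PHRASE_RULES = {("action", "substance"): 3.0}


def text_weight(text):
    """Sum word weights plus phrase bonuses for adjacent known words."""
    tokens = re.findall(r"\w+", text.lower())
    weight = 0.0
    prev_class = None  # class of the previous token, if it was a known word
    for token in tokens:
        entry = WORD_WEIGHTS.get(token)
        if entry is None:
            prev_class = None  # an unknown word breaks a candidate phrase
            continue
        word_class, word_weight = entry
        weight += word_weight
        if prev_class is not None:
            weight += PHRASE_RULES.get((prev_class, word_class), 0.0)
        prev_class = word_class
    return weight


def is_relevant(text, threshold=4.0):
    """Classify a text as on-topic when its weight reaches the assumed threshold."""
    return text_weight(text) >= threshold
```

In this sketch a phrase scores more than its words taken separately, so "beta alpha" (an action followed by a substance) outweighs "alpha beta", which matches no rule. A real knowledge database would presumably hold many more classes and rules, and the way weights are combined in the paper may differ.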

Keywords

crawling; social networks; knowledge base; document classification; monitoring; Poisson process

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Andrei V. Yakushev (1)
  • Alexander V. Boukhanovsky (1)
  • Peter M. A. Sloot (1, 2)
  1. Saint-Petersburg National University of Information Technologies, Mechanics and Optics, Saint-Petersburg, Russia
  2. School of Computer Engineering (SCE), Nanyang Technological University (NTU), Singapore
