Unified domain-specific language for collecting and processing data of social media

Journal of Intelligent Information Systems

Abstract

Data from social media are becoming an increasingly important source of analysis material for social scientists, market analysts, and other stakeholders. This diversity of interests has led to a variety of crawling techniques and programming solutions. Nevertheless, these solutions lack the flexibility to satisfy the requirements of different users and individual crawling scenarios, which can range from a simple query to a complex multi-step workflow that requires data to be collected from several networks. To address this problem, our paper proposes an approach based on a purpose-built domain-specific language (DSL) and the architecture of a distributed crawling system. The DSL is declarative: the user only describes the data needed. It is based on an ontological model of social networks and on the essential crawling techniques. Thus, the crawling system can collect data from different online social networks within complex workflows and exploit various crawling methods implemented in a distributed computing environment.

References

  • Arnaboldi, V., Conti, M., Passarella, A., Pezzoni, F. (2013). Ego networks in twitter: an experimental analysis. In INFOCOM, 2013 Proceedings IEEE (pp. 3459–3464): IEEE.

  • Avrachenkov, K.E., Mazalov, V.V., Tsynguev, B.T. (2015). Beta Current Flow Centrality for Weighted Networks. In Computational Social Networks (pp. 216–227): Springer International Publishing.

  • Bansal, N., & Koudas, N. (2007). Blogscope: spatio-temporal analysis of the blogosphere. In Proceedings of the 16th international conference on World Wide Web (pp. 1269–1270): ACM.

  • Boanjak, M., Oliveira, E., Martins, J., Mendes Rodrigues, E., Sarmento, L. (2012). TwitterEcho: a distributed focused crawler to support open research with twitter data. In Proceedings of the 21st international conference companion on World Wide Web (pp. 1233–1240): ACM.

  • Buccafurri, F., Lax, G., Nocera, A., Ursino, D. (2012). Crawling social internetworking systems. In 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (pp. 506–510): IEEE. (BFS, Random Walk and others.)

  • Buccafurri, F., Lax, G., Nocera, A., Ursino, D. (2015). A system for extracting structural information from Social Network accounts. Software: Practice and Experience, 45(9), 1251–1275.

  • Buccafurri, F., Lax, G., Nicolazzo, S., Nocera, A. (2016). A model to support design and development of multiple-social-network applications. Information Sciences, 331, 99–119.

  • Buraya, K., Farseev, A., Filchenkov, A., Chua, T.S. (2017). Towards User Personality Profiling from Multiple Social Networks. In AAAI (pp. 4909–4910).

  • Butakov, N., Chuprova, Y., Knyazkov, K., Shindyapina, N., Boukhanovsky, A. (2015). Evolutionary-based Framework for Optimizing the Spread of Information on Twitter. Procedia Computer Science, 66, 287–296.

  • Dunbar, R.I.M., Arnaboldi, V., Conti, M., Passarella, A. (2015). The structure of online social networks mirrors those in the offline world. Social Networks, 43, 39–47.

  • Duvanova, D., Nikolaev, A., Nikolsko-Rzhevskyy, A., Semenov, A. (2015). Violent conflict and online segregation: An analysis of social network communication across Ukraine’s regions. Journal of Comparative Economics.

  • Farseev, A., Nie, L., Akbari, M., Chua, T.S. (2015). Harvesting multiple sources for user profile learning: a big data study. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval (pp. 235–242): ACM.

  • Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A. (2010). Walking in Facebook: A case study of unbiased sampling of OSNs. In IEEE (pp. 1–9).

  • Hicks, A., & BE, D.F. (2015). Mining Twitter as a First Step toward Assessing the Adequacy of Gender Identification Terms on Intake Forms.

  • Kahanda, I., & Neville, J. (2009). Using Transactional Information to Predict Link Strength in Online Social Networks. ICWSM, 9, 74–81.

  • Knyazkov, K.V., Kovalchuk, S.V., Tchurov, T.N., Maryin, S.V., Boukhanovsky, A.V. (2012). CLAVIRE: e-Science infrastructure for data-driven computing. Journal of Computational Science, 3(6), 504–510.

  • Kwak, H., Lee, C., Park, H., Moon, S. (2010). What is Twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web (pp. 591–600): ACM.

  • Li, R., Lei, K.H., Khadiwala, R., Chang, K.C.C. (2012). Tedas: A twitter-based event detection and analysis system. In 2012 IEEE 28th International Conference on Data Engineering (ICDE) (pp. 1273–1276): IEEE.

  • Marcus, A., Bernstein, M.S., Badar, O., Karger, D.R., Madden, S., Miller, R.C. (2012). Processing and visualizing the data in tweets. ACM SIGMOD Record, 40(4), 21–27.

  • Mathioudakis, M., & Koudas, N. (2010). Twittermonitor: trend detection over the twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 1155–1158): ACM.

  • METRA, I. (2014). Influence based exploration of twitter social network.

  • Papadakis, G., Tserpes, K., Sardis, E., Kardara, M., Papaoikonomou, A., Aisopos, F. (2012). Social media meta-API: leveraging the content of social networks. In Proceedings of the 21st international conference companion on World Wide Web (pp. 271–274): ACM.

  • Psallidas, F., Ntoulas, A., Delis, A. (2013). Soc web: Efficient monitoring of social network activities. In Web Information Systems Engineering–WISE 2013 (pp. 118–136): Springer Berlin Heidelberg.

  • Serrano, D., Stroulia, E., Barbosa, D., Guana, V. (2012). Sociql: A query language for the socialweb, Springer Berlin Heidelberg.

  • Shuai, H.H., Yang, D.N., Shen, C.Y., Yu, P.S., Chen, M.S. (2015). QMSampler: Joint Sampling of Multiple Networks with Quality Guarantee. arXiv:1502.07439.

  • Teng, S.Y., Yeh, M.Y., Chuang, K.T. (2015). Toward Understanding the Mobile Social Properties: An Analysis on Instagram Photo-Sharing Network. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015 (pp. 266–269): ACM.

  • Toshniwal, A., Taneja, S., Shukla, A., Ramasamy, K., Patel, J. M., Kulkarni, S., Bhagat, N. (2014). Storm@ twitter. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data (pp. 147–156): ACM.

  • Valkanas, G., & Gunopulos, D. (2013). How the live web feels about events. In Proceedings of the 22nd ACM international conference on information & knowledge management (pp. 639–648): ACM.

  • Valkanas, G., Saravanou, A., Gunopulos, D. (2014). A faceted crawler for the twitter service. In Web Information Systems Engineering–WISE 2014 (pp. 178–188): Springer International Publishing.

  • Wang, X., Tokarchuk, L., Cuadrado, F., Poslad, S. (2013). Exploiting hashtags for adaptive microblog crawling. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (pp. 311–315): ACM.

  • Wachowicz, M., Arteaga, M.D., Cha, S., Bourgeois, Y. (2015). Developing a streaming data processing workflow for querying space–time activities from geotagged tweets. Computers, Environment and Urban Systems.

  • Xiong, F., Liu, Y., Zhang, Z. J., Zhu, J., Zhang, Y. (2012). An information diffusion model based on retweeting mechanism for online social media. Physics Letters A, 376(30), 2103–2108.

  • Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Stoica, I. (2012a). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (pp. 2–2): USENIX Association.

  • Zaharia, M., Das, T., Li, H., Shenker, S., Stoica, I. (2012b). Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In Presented as part of the.

  • Zou, J., Fekri, F., McLaughlin, S. W. (2015). Mining Streaming Tweets for Real-Time Event Credibility Prediction in Twitter. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015 (pp. 1586–1589): ACM.

Acknowledgments

This research was financially supported by the Ministry of Education and Science of the Russian Federation, Agreement #14.578.21.0196 (03.10.2016). Unique Identification RFMEFI57816X0196.

Author information

Corresponding author

Correspondence to Nikolay Butakov.

Appendix

Listing 1. Data collecting scenario similar to Dunbar et al. (2015) (see Table 1), implemented with the spring-social library for Twitter

Listing 2. Data collecting scenario similar to Dunbar et al. (2015) (see Table 1), implemented with the spring-social library for Facebook

About this article

Cite this article

Butakov, N., Petrov, M., Mukhina, K. et al. Unified domain-specific language for collecting and processing data of social media. J Intell Inf Syst 51, 389–414 (2018). https://doi.org/10.1007/s10844-018-0508-5

  • DOI: https://doi.org/10.1007/s10844-018-0508-5
