Big Data Preprocessing: An Application on Online Social Networks

Sapountzi, Androniki; Psannis, Kostas E.

doi:10.1007/978-3-030-43981-1_4


Part of the book series: Transactions on Computational Science and Computational Intelligence (TRACOSCI)


Abstract

The mass adoption of social network services has turned online social networks into a big data source. The results of machine learning and statistical analysis depend heavily on data preprocessing tasks. The purpose of data preprocessing is to convert the data into a format suitable for analysis and to ensure high data quality. However, the management of unstructured and semi-structured data remains largely unexplored, and big data calls for new preprocessing techniques. This chapter investigates the data preprocessing stages for big data sources, with an emphasis on online social networks. Special attention is paid to practical questions regarding low-quality data, including incomplete, imbalanced, and noisy data. Furthermore, the challenges and potential solutions of statistical and rule-based analysis for data cleansing are reviewed, and the contribution of natural language processing, feature engineering, and machine learning methods is explored. Online social networks are examined in terms of (i) context, (ii) analysis practices, (iii) low-quality data, and, most importantly, (iv) the techniques and frameworks that address such data. Finally, preprocessing in the broader field of distributed infrastructures is briefly reviewed.
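
To make the three kinds of low-quality data named in the abstract more concrete, the following Python sketch shows one simple way each might be handled on a small batch of social media posts. It is not taken from the chapter: the record fields (`text`, `followers`, `label`), the function names, and the chosen techniques (regex-based text normalization with duplicate removal for noise, median imputation for missing values, inverse-frequency class weights for imbalance) are all illustrative assumptions.

```python
import re
from collections import Counter
from statistics import median

# Hypothetical patterns for common noise in social media text.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # three or more repeats of one character


def normalize_text(text):
    """Reduce noise: lowercase, mask URLs and mentions, squeeze elongated words."""
    text = URL_RE.sub("<url>", text.lower())
    text = MENTION_RE.sub("<user>", text)
    text = REPEAT_RE.sub(r"\1\1", text)
    return " ".join(text.split())


def preprocess(posts):
    """posts: list of dicts with keys 'text', 'followers' (may be None), 'label'."""
    # Noisy data: normalize text and drop exact duplicates after normalization.
    seen, cleaned = set(), []
    for post in posts:
        text = normalize_text(post["text"])
        if text and text not in seen:
            seen.add(text)
            cleaned.append({**post, "text": text})

    # Incomplete data: impute missing follower counts with the observed median.
    observed = [p["followers"] for p in cleaned if p["followers"] is not None]
    fill = median(observed) if observed else 0
    for p in cleaned:
        if p["followers"] is None:
            p["followers"] = fill

    # Imbalanced data: inverse-frequency class weights for a downstream learner.
    counts = Counter(p["label"] for p in cleaned)
    weights = {label: len(cleaned) / (len(counts) * n) for label, n in counts.items()}
    return cleaned, weights
```

Calling `preprocess(posts)` returns the cleaned records together with per-class weights that a classifier could use to compensate for a skewed label distribution. Median imputation and reweighting are only baseline choices; the chapter itself surveys more elaborate statistical, rule-based, and learning-based alternatives.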



Author information

Authors and Affiliations

Department of Behavioral and Movement Sciences, Vrije Universiteit, Amsterdam, The Netherlands
Androniki Sapountzi
Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece
Kostas E. Psannis


Corresponding author

Correspondence to Androniki Sapountzi.

Editor information

Editors and Affiliations

University of Georgia, Athens, GA, USA
Hamid R. Arabnia
University of Detroit Mercy, Detroit, MI, USA
Kevin Daimi
University of Hamburg, Hamburg, Hamburg, Germany
Robert Stahlbock
Features Analytics, Nivelles, Belgium
Cristina Soviany
University of Hamburg, Hamburg, Hamburg, Germany
Leonard Heilig
University of Hamburg, Hamburg, Hamburg, Germany
Kai Brüssau


About this chapter

Cite this chapter

Sapountzi, A., Psannis, K.E. (2020). Big Data Preprocessing: An Application on Online Social Networks. In: Arabnia, H.R., Daimi, K., Stahlbock, R., Soviany, C., Heilig, L., Brüssau, K. (eds) Principles of Data Science. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-43981-1_4


DOI: https://doi.org/10.1007/978-3-030-43981-1_4
Published: 09 July 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43980-4
Online ISBN: 978-3-030-43981-1
eBook Packages: Engineering, Engineering (R0)
