Abstract
The mass adoption of social network services has turned online social networks into a big data source. The results of machine learning and statistical analysis depend heavily on data preprocessing tasks. The purpose of data preprocessing is to convert the data into a format suitable for analysis and to ensure high data quality. However, the management of unstructured and semi-structured data remains largely unexplored, and new preprocessing techniques are required to address big data. This chapter investigates the data preprocessing stages for big data sources, with an emphasis on online social networks. Special attention is paid to practical questions regarding low-quality data, including incomplete, imbalanced, and noisy data. Furthermore, the challenges and potential solutions of statistical and rule-based analysis for data cleansing are reviewed. The contribution of natural language processing, feature engineering, and machine learning methods is explored. Online social networks are investigated in terms of (i) context, (ii) analysis practices, (iii) low-quality data, and, most importantly, (iv) how the latter are being addressed by techniques and frameworks. Finally, preprocessing in the broader field of distributed infrastructures is briefly reviewed.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Sapountzi, A., Psannis, K.E. (2020). Big Data Preprocessing: An Application on Online Social Networks. In: Arabnia, H.R., Daimi, K., Stahlbock, R., Soviany, C., Heilig, L., Brüssau, K. (eds) Principles of Data Science. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-43981-1_4
DOI: https://doi.org/10.1007/978-3-030-43981-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43980-4
Online ISBN: 978-3-030-43981-1
eBook Packages: Engineering, Engineering (R0)