Big Data Preprocessing: An Application on Online Social Networks

Chapter in Principles of Data Science

Abstract

The mass adoption of social network services has turned online social networks into a big data source. The results of machine learning and statistical analysis depend heavily on data preprocessing. The purpose of data preprocessing is to convert the data into a format suitable for analysis and to ensure high data quality. However, the management of unstructured and semi-structured data remains largely unexplored, and big data calls for new preprocessing techniques. This chapter investigates the data preprocessing stages for big data sources, with emphasis on online social networks. Special attention is paid to practical questions regarding low-quality data, including incomplete, imbalanced, and noisy data. Furthermore, the challenges and potential solutions of statistical and rule-based analysis for data cleansing are reviewed. The contribution of natural language processing, feature engineering, and machine learning methods is explored. Online social networks are examined in terms of (i) context, (ii) analysis practices, (iii) low-quality data, and, most importantly, (iv) the techniques and frameworks that address such data. Finally, preprocessing in the broader field of distributed infrastructures is briefly reviewed.

Author information

Corresponding author

Correspondence to Androniki Sapountzi.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sapountzi, A., Psannis, K.E. (2020). Big Data Preprocessing: An Application on Online Social Networks. In: Arabnia, H.R., Daimi, K., Stahlbock, R., Soviany, C., Heilig, L., Brüssau, K. (eds) Principles of Data Science. Transactions on Computational Science and Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-030-43981-1_4

  • DOI: https://doi.org/10.1007/978-3-030-43981-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-43980-4

  • Online ISBN: 978-3-030-43981-1

  • eBook Packages: Engineering, Engineering (R0)
