CrowdCorrect: A Curation Pipeline for Social Data Cleansing and Curation

  • Amin Beheshti
  • Kushal Vaghani
  • Boualem Benatallah
  • Alireza Tabebordbar
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 317)


Process and data are equally important for business process management. Data-driven approaches in process analytics aim to value decisions that can be backed up with verifiable private and open data. Over the last few years, data-driven analysis of how knowledge workers and customers interact in social contexts, often with data obtained from social networking services such as Twitter and Facebook, has become a vital asset for organizations. For example, governments have started to extract knowledge and derive insights from rapidly growing open data to improve their services. A key challenge in analyzing social data is to understand the raw data generated by social actors and prepare it for analytic tasks. In this context, it is important to transform the raw data into contextualized data and knowledge. This task, known as data curation, involves identifying relevant data sources, extracting data and knowledge, and cleansing, maintaining, merging, enriching and linking that data and knowledge. In this paper we present CrowdCorrect, a data curation pipeline that enables analysts to cleanse and curate social data and prepare it for reliable business data analytics. The first step offers automatic feature extraction, correction and enrichment. Next, we design micro-tasks and use the knowledge of the crowd to identify and correct information items that could not be corrected in the first step. Finally, we offer a domain-model mediated method that uses the knowledge of domain experts to identify and correct items that could not be corrected in the previous steps. We adopt a typical scenario, analyzing Urban Social Issues from Twitter as they relate to the Government Budget, to highlight how CrowdCorrect significantly improves the quality of extracted knowledge compared with a classical curation pipeline that lacks the knowledge of the crowd and domain experts.
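The three-step escalation described in the abstract (automatic correction, then the crowd, then domain experts) can be illustrated with a minimal sketch. This is not the CrowdCorrect implementation: the names `CuratedItem`, `run_pipeline`, the three stage functions, and the lookup tables are all hypothetical, and the crowd and expert steps are simulated with small invented dictionaries rather than real micro-tasks or a domain model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

# A corrector returns the corrected text, or None when it cannot resolve the item.
Corrector = Callable[[str], Optional[str]]


@dataclass
class CuratedItem:
    raw: str
    corrected: Optional[str] = None
    resolved_by: Optional[str] = None  # name of the stage that resolved the item


def run_pipeline(items: Iterable[str],
                 stages: List[Tuple[str, Corrector]]) -> List[CuratedItem]:
    """Try each stage in order; an item escalates only if earlier stages fail."""
    results = []
    for raw in items:
        item = CuratedItem(raw=raw)
        for name, corrector in stages:
            fixed = corrector(raw)
            if fixed is not None:
                item.corrected = fixed
                item.resolved_by = name
                break
        results.append(item)
    return results


# Toy data invented for illustration only (assumptions, not from the paper).
VOCABULARY = {"health", "budget"}
KNOWN_FIXES = {"helth": "health"}
CROWD_ANSWERS = {"gp": "general practitioner"}
EXPERT_ANSWERS = {"pbs": "Pharmaceutical Benefits Scheme"}


def automatic(token: str) -> Optional[str]:
    # Step 1: accept known words, fix known misspellings, otherwise give up.
    if token in VOCABULARY:
        return token
    return KNOWN_FIXES.get(token)


def crowd(token: str) -> Optional[str]:
    # Step 2: stand-in for answers collected through crowd micro-tasks.
    return CROWD_ANSWERS.get(token)


def expert(token: str) -> Optional[str]:
    # Step 3: stand-in for corrections supplied by domain experts.
    return EXPERT_ANSWERS.get(token)


STAGES = [("automatic", automatic), ("crowd", crowd), ("expert", expert)]

if __name__ == "__main__":
    for item in run_pipeline(["helth", "budget", "gp", "pbs", "zzz"], STAGES):
        print(item.raw, "->", item.corrected, f"({item.resolved_by})")
```

Ordering the stages from cheapest to most expensive means human effort is spent only on items the automatic step leaves unresolved; an item that no stage can correct keeps `corrected` and `resolved_by` as `None`.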



We acknowledge the Data to Decisions CRC (D2D CRC) and the Cooperative Research Centres Program for funding this research.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Amin Beheshti (1, 2)
  • Kushal Vaghani (1)
  • Boualem Benatallah (1)
  • Alireza Tabebordbar (1)

  1. University of New South Wales, Sydney, Australia
  2. Macquarie University, Sydney, Australia
