Building Data Curation Processes with Crowd Intelligence

  • Tianwa Chen
  • Lei Han
  • Gianluca Demartini
  • Marta Indulska
  • Shazia Sadiq
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 386)


Data curation processes comprise a number of activities, such as transforming, filtering or de-duplicating data. These processes consume an excessive amount of time in data science projects because datasets are often external, re-purposed and generally not ready for analytics. Data curation processes are also difficult to automate and require human input, which leads to a lack of repeatability and to potential errors propagating into analytical results. In this paper, we explore a crowd intelligence-based approach to building robust data curation processes. We study how data workers engage with data curation activities, specifically those related to data quality detection, and how a robust and effective data curation process can be built by learning from the wisdom of the crowd. Using a purpose-designed data curation platform based on IPython Notebook, we conducted a lab experiment with data workers and collected a multi-modal dataset that includes measures of task performance and behaviour data. Our findings identify avenues by which effective data curation processes can be built through crowd intelligence.


Keywords: Data curation, Data quality, Crowd intelligence



This work is partly supported by ARC Discovery Project DP190102141 on Building Crowd Sourced Data Curation Processes.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Tianwa Chen (1, corresponding author)
  • Lei Han (1)
  • Gianluca Demartini (1)
  • Marta Indulska (2)
  • Shazia Sadiq (1)

  1. School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Australia
  2. Business School, The University of Queensland, Brisbane, Australia
