A Hybrid Machine-Crowdsourcing Approach for Web Table Matching and Cleaning

  • Chunhua Li
  • Pengpeng Zhao
  • Victor S. Sheng
  • Zhixu Li
  • Guanfeng Liu
  • Jian Wu
  • Zhiming Cui
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9659)


Table matching and data cleaning are two crucial activities in integrating data from different web tables, yet they have traditionally been treated as separate activities. We show that data cleaning can effectively help discover table matches, and vice versa. In this paper, we study a hybrid machine-crowdsourcing approach that handles the two activities together with the help of a well-developed knowledge base. Understanding the semantics of tables is fundamental to both matching and cleaning. We select the most valuable columns for crowdsourced validation and infer the rest by consolidating crowdsourced and machine-generated results. When resolving inconsistencies between data and semantics, relative trust is taken into account to validate either the data or the semantics via the crowd. Experimental results on real-life datasets show the effectiveness of the proposed approach for matching and cleaning web tables.
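
The column-selection idea can be illustrated with a minimal sketch. The abstract does not say how the "value" of a column is measured or how results are consolidated, so everything below is an assumption made for illustration: value is approximated by the Shannon entropy of the machine-generated concept candidates, a crowd answer simply overrides the machine's guess, and the names select_columns_for_crowd, consolidate, and the sample data are hypothetical rather than taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_columns_for_crowd(candidates, budget):
    """Pick the `budget` columns whose machine-generated concept
    distributions are most uncertain (assumed proxy for 'most valuable')."""
    ranked = sorted(
        candidates,
        key=lambda col: entropy(candidates[col].values()),
        reverse=True,
    )
    return ranked[:budget]


def consolidate(candidates, crowd_answers):
    """Use the crowd's answer where one exists; otherwise fall back to
    the machine's top-scoring concept (a deliberately simple rule)."""
    labels = {}
    for col, dist in candidates.items():
        if col in crowd_answers:
            labels[col] = crowd_answers[col]
        else:
            labels[col] = max(dist, key=dist.get)
    return labels


# Hypothetical machine output: column -> {candidate concept: probability}
candidates = {
    "col_A": {"City": 0.95, "Country": 0.05},
    "col_B": {"Person": 0.5, "Author": 0.5},
    "col_C": {"Film": 0.7, "Book": 0.3},
}

to_ask = select_columns_for_crowd(candidates, budget=1)
labels = consolidate(candidates, {"col_B": "Author"})
print(to_ask)
print(labels)
```

With a budget of one question, the sketch sends the most ambiguous column to the crowd and keeps the machine's best guess for the others. A fuller treatment would also let a crowd answer on one column update the estimates for related columns, which this simplification does not attempt.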


Crowdsourcing, Table matching, Data cleaning



This work was partially supported by Chinese NSFC (61170020, 61402311, 61440053), Jiangsu Province Colleges and Universities Natural Science Research project (13KJB520021), Jiangsu Province Postgraduate Cultivation and Innovation project (CXZZ13_0813), and the US National Science Foundation (IIS-1115417).



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Chunhua Li (1)
  • Pengpeng Zhao (1)
  • Victor S. Sheng (2)
  • Zhixu Li (1)
  • Guanfeng Liu (1)
  • Jian Wu (1)
  • Zhiming Cui (1)

  1. School of Computer Science and Technology, Soochow University, Suzhou, China
  2. Department of Computer Science, University of Central Arkansas, Conway, USA
