The VLDB Journal, Volume 27, Issue 6, pp 745–770

A partial-order-based framework for cost-effective crowdsourced entity resolution

  • Chengliang Chai
  • Guoliang Li
  • Jian Li
  • Dong Deng
  • Jianhua Feng
Regular Paper

Abstract

Crowdsourced entity resolution has recently attracted significant attention because it can harness the wisdom of the crowd to improve the quality of entity resolution. However, existing techniques either cannot achieve high quality or incur huge monetary costs. To address these problems, we propose a cost-effective crowdsourced entity resolution framework that significantly reduces the monetary cost while maintaining high quality. We first define a partial order on the pairs of records. Then, we select a pair as a question and ask the crowd to check whether the two records in the pair refer to the same entity. After obtaining the answer for this pair, we infer the answers of other pairs based on the partial order. Next, we iteratively select unanswered pairs to ask until all pairs are answered. We devise effective algorithms to judiciously select the pairs to ask, so as to minimize the number of asked pairs. To further reduce the cost, we propose a grouping technique that groups the pairs, so that only one pair, rather than all pairs, in each group needs to be asked. We develop error-tolerant techniques to tolerate the errors introduced by the partial order and the crowd. We also study budget-aware entity resolution, which, given a budget, finds the maximum number of matching pairs within the budget, and we propose effective optimization techniques for it. Experimental results show that our method reduces the cost to 1.25% of that of existing approaches (i.e., existing approaches incur \(80\times\) the monetary cost of our method) without sacrificing quality.
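
To make the inference step concrete, below is a minimal Python sketch of the partial-order propagation the abstract describes. It assumes that each record pair carries a vector of similarity scores and that one pair precedes another in the partial order when its scores are dominated component-wise; the helper names (`dominates`, `resolve`, `ask_crowd`) and the greedy question order are illustrative placeholders, not the paper's actual selection algorithm.

```python
def dominates(p, q):
    """Return True if similarity vector p is >= q in every component."""
    return all(a >= b for a, b in zip(p, q))


def resolve(pairs, ask_crowd):
    """Label every pair, asking the crowd only when inference cannot decide.

    pairs:     dict mapping pair_id -> tuple of similarity scores
    ask_crowd: callback pair_id -> bool (True means "same entity")
    Returns a dict pair_id -> bool and the number of questions asked.
    """
    labels, asked = {}, 0
    # Simple heuristic order (highest total similarity first); the paper
    # instead devises selection algorithms that minimize the asked pairs.
    for pid in sorted(pairs, key=lambda p: sum(pairs[p]), reverse=True):
        if pid in labels:
            continue
        answer = ask_crowd(pid)  # one paid crowd question
        asked += 1
        labels[pid] = answer
        # Propagate along the partial order: a "match" implies every
        # dominating pair also matches; a "non-match" implies every
        # dominated pair cannot match.
        for qid, vec in pairs.items():
            if qid in labels:
                continue
            if answer and dominates(vec, pairs[pid]):
                labels[qid] = True
            elif not answer and dominates(pairs[pid], vec):
                labels[qid] = False
    return labels, asked
```

For example, with three pairs ordered by similarity, a "non-match" answer on the middle pair lets the sketch label the weakest pair without a third question:

```python
pairs = {"p1": (0.9, 0.8), "p2": (0.7, 0.6), "p3": (0.2, 0.1)}
labels, asked = resolve(pairs, ask_crowd=lambda pid: pid == "p1")
# asked == 2: the "non-match" answer on p2 is propagated down to p3.
```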

Keywords

Crowdsourced entity resolution · Partial order · Quality · Cost · Latency

Acknowledgements

This work was supported by the 973 Program of China (2015CB358700), NSF of China (61632016, 61472198, 61521002, 61661166012), and TAL education.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science and Technology, Tsinghua University, Beijing, China
